Web Crawling
🕸️

Crawl4AI

Blazing-fast, AI-ready web crawler and scraper designed specifically for LLMs, AI agents, and data pipelines

Beginner to Intermediate · open-source · self-hosted · scraping · LLM · RAG

Alternative To

  • Firecrawl
  • Apify
  • Scrapy

Difficulty Level

Beginner to Intermediate

Suitable for users with basic Python experience. Getting started is a simple pip install, while advanced features such as deep crawling, LLM-based extraction, and the Docker API server require additional configuration.

Overview

Crawl4AI is a powerful open-source web crawler and scraper specifically designed for AI applications. It delivers blazing-fast, LLM-friendly data extraction with features like deep crawling, memory-adaptive dispatching, and automatic HTML-to-Markdown conversion. One of the most popular open-source scraping projects on GitHub, Crawl4AI empowers developers to efficiently extract and process web content for large language models, AI agents, and data pipelines.

System Requirements

  • CPU: 2+ cores
  • RAM: 4GB+ (8GB+ recommended for large-scale crawling)
  • GPU: Not required
  • Storage: Depends on the amount of data you plan to crawl
  • Python: 3.9+

Installation Guide

Option 1: Python Package Installation

The simplest way to install Crawl4AI is via pip:

pip install crawl4ai

To install with additional features:

# For LLM integration
pip install 'crawl4ai[llm]'

# For FastAPI server
pip install 'crawl4ai[server]'

# For full installation with all dependencies
pip install 'crawl4ai[all]'
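
After installation, recent Crawl4AI releases also ship a post-install command that downloads the Playwright browser binaries the crawler depends on. If your installed version provides it, run:

crawl4ai-setup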

Option 2: Docker Installation

Crawl4AI provides Docker images for easy deployment with a REST API server:

  1. Create a docker-compose.yml file:

    version: "3"
    services:
      crawl4ai:
        image: unclecode/crawl4ai:latest
        ports:
          - "11235:11235"
        environment:
          - MAX_CONCURRENT_TASKS=4
        volumes:
          - ./crawl4ai_cache:/app/.crawl4ai
    
  2. Start the service:

    docker-compose up -d
    
  3. The API will be available at http://localhost:11235
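
Once the container is up, you can smoke-test the API from Python. Endpoint names and payload schemas vary between image versions, so treat this as an illustrative sketch and check the documentation matching your image tag:

import requests

# Health check (endpoint name assumed; adjust if your image version differs)
print(requests.get("http://localhost:11235/health").json())

# Submit a simple crawl request; the expected payload schema depends on the
# server version, so consult the docs for your image before relying on this.
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
)
print(response.status_code)
print(response.json())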

Option 3: From Source

  1. Clone the repository:

    git clone https://github.com/unclecode/crawl4ai.git
    
  2. Navigate to the project directory:

    cd crawl4ai
    
  3. Install the package in development mode:

    pip install -e .
    

Note: For detailed installation instructions specific to your operating system and environment, please refer to the official documentation.

Practical Exercise: Getting Started with Crawl4AI

Let's create a simple web crawling application to extract content from a website and convert it to Markdown for use with LLMs.

Step 1: Basic Crawling

First, let's create a simple script that uses Crawl4AI to crawl a single webpage:

import asyncio
from crawl4ai import AsyncWebCrawler

async def simple_crawl():
    # Create an AsyncWebCrawler instance
    async with AsyncWebCrawler() as crawler:
        # Crawl a webpage
        result = await crawler.arun("https://example.com")

        # Print the extracted Markdown
        print(f"Status: {'Success' if result.success else 'Failed'}")
        if result.success:
            print(f"Title: {result.title}")
            print(f"Markdown length: {len(result.markdown)}")
            print("\nFirst 500 characters of Markdown:")
            print(result.markdown[:500])

if __name__ == "__main__":
    asyncio.run(simple_crawl())
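
Because the goal is LLM-ready content, a common next step is to persist the extracted Markdown to disk so it can be chunked and embedded later. A minimal sketch building on the same result fields used above:

import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def crawl_to_file(url: str, out_path: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url)
        if result.success:
            # Write the extracted Markdown for downstream RAG/LLM pipelines
            Path(out_path).write_text(str(result.markdown), encoding="utf-8")
            print(f"Saved {len(str(result.markdown))} characters to {out_path}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(crawl_to_file("https://example.com", "example.md"))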

Step 2: Crawling Multiple Pages

Now, let's expand our example to crawl multiple pages concurrently:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def multi_page_crawl():
    # List of URLs to crawl
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://en.wikipedia.org/wiki/Web_crawler",
    ]

    # Configure the crawler
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,  # Use cached results if available
        stream=True  # Stream results as they complete
    )

    # Create an AsyncWebCrawler instance
    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_config):
            if result.success:
                print(f"βœ… Successfully crawled: {result.url}")
                print(f"   Title: {result.title}")
                print(f"   Markdown length: {len(result.markdown)}")
            else:
                print(f"❌ Failed to crawl: {result.url}")
                print(f"   Error: {result.error_message}")
            print()

if __name__ == "__main__":
    asyncio.run(multi_page_crawl())
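
If you would rather collect everything and process it in one batch, leave stream at its default of False; arun_many then returns a list of results instead of an async generator. A minimal sketch:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def batch_crawl():
    urls = [
        "https://example.com",
        "https://en.wikipedia.org/wiki/Web_crawler",
    ]
    # stream=False (the default): wait for all pages, then get a list back
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, stream=False)

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)
        for result in results:
            status = "ok" if result.success else f"failed: {result.error_message}"
            print(f"{result.url}: {status}")

if __name__ == "__main__":
    asyncio.run(batch_crawl())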

Step 3: Deep Crawling

For a more advanced example, let's use deep crawling to explore multiple pages of a site. Recent Crawl4AI releases configure this by attaching a deep-crawl strategy (here, breadth-first search) to the run config; older releases expose a different API, so check the documentation for your installed version:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl():
    # Configure a breadth-first deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,             # Limit crawl depth to 2 levels
            max_pages=10,            # Limit to 10 pages
            include_external=False,  # Stay on the same domain
        ),
    )

    # Create an AsyncWebCrawler instance
    async with AsyncWebCrawler() as crawler:
        # Start the deep crawl; in batch mode this returns a list of results
        results = await crawler.arun(
            "https://docs.python.org/3/tutorial/",
            config=config,
        )

        # Process the results
        print(f"Crawled {len(results)} pages")
        for i, result in enumerate(results, 1):
            if result.success:
                print(f"{i}. {result.url} - {result.metadata.get('title')}")
            else:
                print(f"{i}. {result.url} - Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(deep_crawl())
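
For long crawls you may not want to wait for the whole batch. Assuming the same stream flag used in Step 2 also applies to deep crawls (as in recent documentation), a streaming sketch looks like this:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl_streaming():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, max_pages=10),
        stream=True,  # Yield each result as its page finishes
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(
            "https://docs.python.org/3/tutorial/", config=config
        ):
            print("done:" if result.success else "failed:", result.url)

if __name__ == "__main__":
    asyncio.run(deep_crawl_streaming())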

Step 4: Exploring Advanced Features

Once you're comfortable with the basics, try exploring some more advanced features:

  • Use different extraction strategies (CSS, XPath, or LLM-based); a CSS-based sketch follows this list
  • Implement custom content filters
  • Configure browser parameters for JavaScript-heavy websites
  • Set up proxy rotation for large-scale crawling
  • Integrate with LLMs for structured data extraction
  • Use the HTTP API server for microservice architecture
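
As a concrete taste of the first bullet, here is a minimal sketch of CSS-based structured extraction with JsonCssExtractionStrategy. The schema below (the article, h2, and a selectors, and the example blog URL) is a placeholder you would adapt to the target site:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_articles():
    # Placeholder schema: adjust baseSelector and field selectors to the real page
    schema = {
        "name": "Articles",
        "baseSelector": "article",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
        ],
    }

    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/blog", config=config)
        if result.success and result.extracted_content:
            # extracted_content is a JSON string shaped by the schema
            print(json.dumps(json.loads(result.extracted_content), indent=2))

if __name__ == "__main__":
    asyncio.run(extract_articles())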

Resources

Official Documentation

The official documentation provides comprehensive guides, API references, and examples:

Crawl4AI Documentation: https://docs.crawl4ai.com/

GitHub Repository

The GitHub repository contains the source code, issues, and contribution guidelines:

Crawl4AI GitHub Repository: https://github.com/unclecode/crawl4ai

Roadmap

Check out the project roadmap to see upcoming features and development plans:

Crawl4AI Roadmap

Community Support

Get help and connect with other Crawl4AI users through the project's GitHub repository (issues and discussions).

Tutorials and Guides

Learn more about web crawling for AI applications through the examples and guides in the official documentation and repository.

Suggested Projects

You might also be interested in these similar projects:

  • 🗄️ Chroma: the AI-native open-source embedding database for storing and searching vector embeddings (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)
  • ⛓️ Langflow: a powerful low-code tool for building and deploying AI-powered agents and workflows (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)
  • Model Context Protocol (MCP): an open protocol that connects AI models to data sources and tools with a standardized interface (Difficulty: Intermediate; Updated: Mar 23, 2025)