Web Crawling
🕸️

Crawl4AI

Blazing-fast, AI-ready web crawler and scraper designed specifically for LLMs, AI agents, and data pipelines

Beginner to Intermediate · open-source · self-hosted · scraping · LLM · RAG

Alternative To

  • Firecrawl
  • Apify
  • Scrapy

Difficulty Level

Beginner to Intermediate

Suitable for users with basic Python experience. Getting started is a simple pip install, while advanced features such as deep crawling, LLM-based extraction, and the Docker API server require additional configuration.

Overview

Crawl4AI is a powerful open-source web crawler and scraper specifically designed for AI applications. It delivers blazing-fast, LLM-friendly data extraction with features like deep crawling, memory-adaptive dispatching, and automatic HTML-to-Markdown conversion. One of the most popular open-source scraping projects on GitHub, Crawl4AI empowers developers to efficiently extract and process web content for large language models, AI agents, and data pipelines.

System Requirements

  • CPU: 2+ cores
  • RAM: 4GB+ (8GB+ recommended for large-scale crawling)
  • GPU: Not required
  • Storage: Depends on the amount of data you plan to crawl
  • Python: 3.9+

Installation Guide

Option 1: Python Package Installation

The simplest way to install Crawl4AI is via pip:

pip install crawl4ai

To install with additional features:

# For LLM integration
pip install 'crawl4ai[llm]'

# For FastAPI server
pip install 'crawl4ai[server]'

# For full installation with all dependencies
pip install 'crawl4ai[all]'
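
After installation, recent Crawl4AI releases also ship a post-install command that downloads the Playwright browser binaries the crawler depends on. If your installed version provides it, run:

crawl4ai-setup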

Option 2: Docker Installation

Crawl4AI provides Docker images for easy deployment with a REST API server:

  1. Create a docker-compose.yml file:

    version: "3"
    services:
      crawl4ai:
        image: unclecode/crawl4ai:latest
        ports:
          - "11235:11235"
        environment:
          - MAX_CONCURRENT_TASKS=4
        volumes:
          - ./crawl4ai_cache:/app/.crawl4ai
    
  2. Start the service:

    docker-compose up -d
    
  3. The API will be available at http://localhost:11235
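
Once the container is up, you can smoke-test the API from Python. Endpoint names and payload schemas vary between image versions, so treat this as an illustrative sketch and check the documentation matching your image tag:

import requests

# Health check (endpoint name assumed; adjust if your image version differs)
print(requests.get("http://localhost:11235/health").json())

# Submit a simple crawl request; the expected payload schema depends on the
# server version, so consult the docs for your image before relying on this.
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
)
print(response.status_code)
print(response.json())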

Option 3: From Source

  1. Clone the repository:

    git clone https://github.com/unclecode/crawl4ai.git
    
  2. Navigate to the project directory:

    cd crawl4ai
    
  3. Install the package in development mode:

    pip install -e .
    

Note: For detailed installation instructions specific to your operating system and environment, please refer to the official documentation.

Practical Exercise: Getting Started with Crawl4AI

Let's create a simple web crawling application to extract content from a website and convert it to Markdown for use with LLMs.

Step 1: Basic Crawling

First, let's create a simple script that uses Crawl4AI to crawl a single webpage:

import asyncio
from crawl4ai import AsyncWebCrawler

async def simple_crawl():
    # Create an AsyncWebCrawler instance
    async with AsyncWebCrawler() as crawler:
        # Crawl a webpage
        result = await crawler.arun("https://example.com")

        # Print the extracted Markdown
        print(f"Status: {'Success' if result.success else 'Failed'}")
        if result.success:
            print(f"Title: {result.title}")
            print(f"Markdown length: {len(result.markdown)}")
            print("\nFirst 500 characters of Markdown:")
            print(result.markdown[:500])

if __name__ == "__main__":
    asyncio.run(simple_crawl())
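
Because the goal is LLM-ready content, a common next step is to persist the extracted Markdown to disk so it can be chunked and embedded later. A minimal sketch building on the same result fields used above:

import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler

async def crawl_to_file(url: str, out_path: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url)
        if result.success:
            # Write the extracted Markdown for downstream RAG/LLM pipelines
            Path(out_path).write_text(str(result.markdown), encoding="utf-8")
            print(f"Saved {len(str(result.markdown))} characters to {out_path}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(crawl_to_file("https://example.com", "example.md"))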

Step 2: Crawling Multiple Pages

Now, let's expand our example to crawl multiple pages concurrently:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def multi_page_crawl():
    # List of URLs to crawl
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://en.wikipedia.org/wiki/Web_crawler",
    ]

    # Configure the crawler
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,  # Use cached results if available
        stream=True  # Stream results as they complete
    )

    # Create an AsyncWebCrawler instance
    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_config):
            if result.success:
                print(f"βœ… Successfully crawled: {result.url}")
                print(f"   Title: {result.title}")
                print(f"   Markdown length: {len(result.markdown)}")
            else:
                print(f"❌ Failed to crawl: {result.url}")
                print(f"   Error: {result.error_message}")
            print()

if __name__ == "__main__":
    asyncio.run(multi_page_crawl())
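
If you would rather collect everything and process it in one batch, leave stream at its default of False; arun_many then returns a list of results instead of an async generator. A minimal sketch:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def batch_crawl():
    urls = [
        "https://example.com",
        "https://en.wikipedia.org/wiki/Web_crawler",
    ]
    # stream=False (the default): wait for all pages, then get a list back
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, stream=False)

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)
        for result in results:
            status = "ok" if result.success else f"failed: {result.error_message}"
            print(f"{result.url}: {status}")

if __name__ == "__main__":
    asyncio.run(batch_crawl())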

Step 3: Deep Crawling

For a more advanced example, let's use deep crawling to explore multiple pages of a site. Recent Crawl4AI releases configure this by attaching a deep-crawl strategy (here, breadth-first search) to the run config; older releases expose a different API, so check the documentation for your installed version:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl():
    # Configure a breadth-first deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,             # Limit crawl depth to 2 levels
            max_pages=10,            # Limit to 10 pages
            include_external=False,  # Stay on the same domain
        ),
    )

    # Create an AsyncWebCrawler instance
    async with AsyncWebCrawler() as crawler:
        # Start the deep crawl; in batch mode this returns a list of results
        results = await crawler.arun(
            "https://docs.python.org/3/tutorial/",
            config=config,
        )

        # Process the results
        print(f"Crawled {len(results)} pages")
        for i, result in enumerate(results, 1):
            if result.success:
                print(f"{i}. {result.url} - {result.metadata.get('title')}")
            else:
                print(f"{i}. {result.url} - Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(deep_crawl())
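
For long crawls you may not want to wait for the whole batch. Assuming the same stream flag used in Step 2 also applies to deep crawls (as in recent documentation), a streaming sketch looks like this:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl_streaming():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, max_pages=10),
        stream=True,  # Yield each result as its page finishes
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(
            "https://docs.python.org/3/tutorial/", config=config
        ):
            print("done:" if result.success else "failed:", result.url)

if __name__ == "__main__":
    asyncio.run(deep_crawl_streaming())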

Step 4: Exploring Advanced Features

Once you're comfortable with the basics, try exploring some more advanced features:

  • Use different extraction strategies (CSS, XPath, or LLM-based); a CSS-based sketch follows this list
  • Implement custom content filters
  • Configure browser parameters for JavaScript-heavy websites
  • Set up proxy rotation for large-scale crawling
  • Integrate with LLMs for structured data extraction
  • Use the HTTP API server for microservice architecture
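
As a concrete taste of the first bullet, here is a minimal sketch of CSS-based structured extraction with JsonCssExtractionStrategy. The schema below (the article, h2, and a selectors, and the example blog URL) is a placeholder you would adapt to the target site:

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_articles():
    # Placeholder schema: adjust baseSelector and field selectors to the real page
    schema = {
        "name": "Articles",
        "baseSelector": "article",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
        ],
    }

    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/blog", config=config)
        if result.success and result.extracted_content:
            # extracted_content is a JSON string shaped by the schema
            print(json.dumps(json.loads(result.extracted_content), indent=2))

if __name__ == "__main__":
    asyncio.run(extract_articles())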

Resources

Official Documentation

The official documentation provides comprehensive guides, API references, and examples:

Crawl4AI Documentation: https://docs.crawl4ai.com/

GitHub Repository

The GitHub repository contains the source code, issues, and contribution guidelines:

Crawl4AI GitHub Repository: https://github.com/unclecode/crawl4ai

Roadmap

Check out the project roadmap to see upcoming features and development plans:

Crawl4AI Roadmap

Community Support

Get help and connect with other Crawl4AI users through the project's GitHub repository (issues and discussions).

Tutorials and Guides

Learn more about web crawling for AI applications through the examples and guides in the official documentation and repository.

Suggested Projects

You might also be interested in these similar projects:

  • 🗄️ Chroma: the AI-native open-source embedding database for storing and searching vector embeddings (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)
  • ⛓️ Langflow: a powerful low-code tool for building and deploying AI-powered agents and workflows (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)
  • Model Context Protocol (MCP): an open protocol that connects AI models to data sources and tools with a standardized interface (Difficulty: Intermediate; Updated: Mar 23, 2025)