Crawl4AI
Blazing-fast, AI-ready web crawler and scraper designed specifically for LLMs, AI agents, and data pipelines
Alternative To
- Firecrawl
- Apify
- Scrapy
Difficulty Level
For experienced users. Complex setup and configuration required.
Overview
Crawl4AI is a powerful open-source web crawler and scraper specifically designed for AI applications. It delivers blazing-fast, LLM-friendly data extraction with features like deep crawling, memory-adaptive dispatching, and automatic HTML-to-Markdown conversion. As the #1 trending GitHub repository in its category, Crawl4AI empowers developers to efficiently extract and process web content for large language models, AI agents, and data pipelines.
System Requirements
- CPU: 2+ cores
- RAM: 4GB+ (8GB+ recommended for large-scale crawling)
- GPU: Not required
- Storage: Depends on the amount of data you plan to crawl
- Python: 3.9+
Installation Guide
Option 1: Python Package Installation
The simplest way to install Crawl4AI is via pip:
pip install crawl4ai
To install with additional features:
# For LLM integration
pip install 'crawl4ai[llm]'
# For FastAPI server
pip install 'crawl4ai[server]'
# For full installation with all dependencies
pip install 'crawl4ai[all]'
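Recent releases also ship post-install helpers that download the Playwright browser binaries the crawler drives and verify the environment; if your installed version provides them, run:
# Install the Playwright browsers used by the crawler
crawl4ai-setup
# Verify the installation
crawl4ai-doctor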
Option 2: Docker Installation
Crawl4AI provides Docker images for easy deployment with a REST API server:
Create a docker-compose.yml file:

version: "3"
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    ports:
      - "11235:11235"
    environment:
      - MAX_CONCURRENT_TASKS=4
    volumes:
      - ./crawl4ai_cache:/app/.crawl4ai

Start the service:

docker-compose up -d

The API will be available at http://localhost:11235
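As a quick smoke test, you can call the server from Python. The sketch below assumes the container exposes a POST /crawl endpoint that accepts a JSON body with a list of URLs; the exact request and response schema varies between image versions, so check the official Docker deployment docs for the tag you are running.

import requests

# Hypothetical request body; confirm the exact schema for your image version
payload = {"urls": ["https://example.com"]}

response = requests.post("http://localhost:11235/crawl", json=payload, timeout=120)
response.raise_for_status()

# Depending on the server version, this contains results directly or a task id to poll
print(response.json())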
Option 3: From Source
Clone the repository:
git clone https://github.com/unclecode/crawl4ai.git
Navigate to the project directory:
cd crawl4ai
Install the package in development mode:
pip install -e .
Note: For detailed installation instructions specific to your operating system and environment, please refer to the official documentation.
Practical Exercise: Getting Started with Crawl4AI
Let’s create a simple web crawling application to extract content from a website and convert it to Markdown for use with LLMs.
Step 1: Basic Crawling
First, let’s create a simple script that uses Crawl4AI to crawl a single webpage:
import asyncio
from crawl4ai import AsyncWebCrawler

async def simple_crawl():
    # Create an AsyncWebCrawler instance
    async with AsyncWebCrawler() as crawler:
        # Crawl a webpage
        result = await crawler.arun("https://example.com")

        # Print the extracted Markdown
        print(f"Status: {'Success' if result.success else 'Failed'}")
        if result.success:
            print(f"Title: {result.metadata.get('title')}")
            print(f"Markdown length: {len(result.markdown)}")
            print("\nFirst 500 characters of Markdown:")
            print(result.markdown[:500])

if __name__ == "__main__":
    asyncio.run(simple_crawl())
Step 2: Crawling Multiple Pages
Now, let’s expand our example to crawl multiple pages concurrently:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def multi_page_crawl():
    # List of URLs to crawl
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://en.wikipedia.org/wiki/Web_crawler",
    ]

    # Configure the crawler
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,  # Use cached results if available
        stream=True                    # Stream results as they complete
    )

    # Create an AsyncWebCrawler instance
    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_config):
            if result.success:
                print(f"✅ Successfully crawled: {result.url}")
                print(f"   Title: {result.metadata.get('title')}")
                print(f"   Markdown length: {len(result.markdown)}")
            else:
                print(f"❌ Failed to crawl: {result.url}")
                print(f"   Error: {result.error_message}")
            print()

if __name__ == "__main__":
    asyncio.run(multi_page_crawl())
Step 3: Deep Crawling
For a more advanced example, let’s implement deep crawling to explore multiple pages on a website:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl():
    # Configure a breadth-first deep crawl that stays on the same domain
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,            # Limit crawl depth to 2 levels
            max_pages=10,           # Limit to 10 pages
            include_external=False  # Stay on the same domain
        )
    )

    # Create an AsyncWebCrawler instance
    async with AsyncWebCrawler() as crawler:
        # Start the deep crawl from the seed URL
        results = await crawler.arun(
            "https://docs.python.org/3/tutorial/",
            config=config
        )

        # Process the results
        print(f"Crawled {len(results)} pages")
        for i, result in enumerate(results, 1):
            if result.success:
                print(f"{i}. {result.url} - {result.metadata.get('title')}")
            else:
                print(f"{i}. {result.url} - Failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(deep_crawl())
Step 4: Exploring Advanced Features
Once you’re comfortable with the basics, try exploring some more advanced features:
- Use different extraction strategies (CSS, XPath, or LLM-based); a CSS-based sketch follows this list
- Implement custom content filters
- Configure browser parameters for JavaScript-heavy websites
- Set up proxy rotation for large-scale crawling
- Integrate with LLMs for structured data extraction
- Use the HTTP API server for microservice architecture
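As an example of the first item, here is a minimal sketch of schema-based CSS extraction. The schema (selectors and field names) is illustrative only; adapt it to the markup of the pages you crawl.

import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Illustrative schema: pull title/link pairs out of a hypothetical article listing
schema = {
    "name": "Articles",
    "baseSelector": "article",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def extract_structured():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success and result.extracted_content:
            # extracted_content is a JSON string of the matched records
            print(json.dumps(json.loads(result.extracted_content), indent=2))

if __name__ == "__main__":
    asyncio.run(extract_structured())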
Resources
Official Documentation
The official documentation provides comprehensive guides, API references, and examples.
GitHub Repository
The GitHub repository (https://github.com/unclecode/crawl4ai) contains the source code, issues, and contribution guidelines.
Roadmap
Check out the project roadmap to see upcoming features and development plans.
Community Support
Get help and connect with other Crawl4AI users.
Tutorials and Guides
Learn more about web crawling for AI applications:
- Crawl4AI Tutorial: Docker Deployment
- Changelog - Stay updated with the latest features and improvements
Suggested Projects
You might also be interested in these similar projects:
- Chroma is the AI-native open-source embedding database for storing and searching vector embeddings
- A powerful low-code tool for building and deploying AI-powered agents and workflows
- An open protocol that connects AI models to data sources and tools with a standardized interface