Docling
Open-source document processing library that simplifies document handling for generative AI applications
Alternative To
- Unstructured
- Azure Document Intelligence
- Amazon Textract
Difficulty Level
For experienced users. Complex setup and configuration required.
Overview
Docling is a powerful open-source document processing library designed to simplify document handling for generative AI applications. It provides a unified approach to parse diverse document formats—including advanced PDF understanding—making it easier to prepare documents for AI workflows. Originally developed by IBM Research Zurich, Docling is now hosted by the LF AI & Data Foundation and has gained significant traction in the AI community with over 28K GitHub stars.
System Requirements
- CPU: Standard multi-core processor
- RAM: 4GB+ (8GB+ recommended for processing larger documents)
- GPU: Not required for basic functionality
- Storage: 500MB+ for installation and dependencies
- OS: Cross-platform (Windows, macOS, Linux)
- Python: 3.9+ (check the project's pyproject.toml for the current supported range)
Installation Guide
Prerequisites
- Python 3.9 or higher installed on your system
- Pip package manager
- Basic familiarity with Python and command-line tools
Option 1: Python Installation (Recommended)
Install Docling with pip:
pip install docling
For the latest development version:
pip install git+https://github.com/docling-project/docling.git
Option 2: Docker Installation
For containerized usage:
docker pull doclingproject/docling:latest
docker run -it --rm doclingproject/docling
Option 3: From Source
For contributors or customization:
git clone https://github.com/docling-project/docling.git
cd docling
pip install -e .
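Before installing, it can save time to confirm your interpreter meets the minimum version. The sketch below assumes a 3.9 floor; the authoritative requirement is in the project's pyproject.toml.

```python
# Pre-flight check: is this interpreter new enough for Docling?
# MIN_VERSION = (3, 9) is an assumption -- verify against the project's
# pyproject.toml before relying on it.
import sys

MIN_VERSION = (3, 9)
ok = sys.version_info[:2] >= MIN_VERSION
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
      f"{'OK' if ok else 'upgrade required'}")
```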
Practical Exercise: Getting Started with Docling
Let’s walk through a simple exercise to help you understand Docling’s capabilities for AI document processing.
Step 1: Basic Document Loading
First, let’s load and process a document:
from docling.document_converter import DocumentConverter
# Convert a document (supports PDF, DOCX, XLSX, PPTX, HTML, images, and more)
converter = DocumentConverter()
result = converter.convert("example.pdf")
doc = result.document
# Print basic document information
print(f"Document has {len(doc.pages)} pages")
print(f"Document name: {doc.name}")
Step 2: Extract and Process Content
Now, let’s export and inspect the content:
# Export the full document to Markdown
markdown = doc.export_to_markdown()
print(f"Document content (first 500 chars): {markdown[:500]}...")
# Inspect any tables detected in the document
for i, table in enumerate(doc.tables):
    print(f"Table {i+1}: {table.data.num_rows} rows x {table.data.num_cols} columns")
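Once tables are extracted, downstream processing often starts with a plain CSV dump. A minimal sketch, using a hypothetical list-of-lists in place of Docling's table output, with only the standard library:

```python
# Post-processing sketch: serialize extracted table rows to CSV.
# `rows` is placeholder data standing in for a table pulled from a document.
import csv
import io

rows = [["name", "score"], ["alpha", "0.91"], ["beta", "0.87"]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```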
Step 3: AI Integration with LangChain
Let’s integrate with LangChain via the langchain-docling integration package (pip install langchain-docling):
from langchain_docling import DoclingLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
# Load the document with Docling's LangChain loader
loader = DoclingLoader(file_path="example.pdf")
documents = loader.load()
# Split text into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
# Create a vector store
embeddings = OpenAIEmbeddings()  # Requires an OpenAI API key
vectorstore = Chroma.from_documents(chunks, embeddings)
# Simple semantic search
query = "What are the main conclusions of this document?"
results = vectorstore.similarity_search(query, k=3)
print(f"Top 3 relevant chunks for query: {query}")
for i, chunk in enumerate(results):
    print(f"Result {i+1}: {chunk.page_content[:100]}...")
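To make the `chunk_size`/`chunk_overlap` parameters concrete, here is a naive fixed-window splitter. It is only an illustration of the windowing idea: the real RecursiveCharacterTextSplitter is smarter and prefers to break at paragraph and sentence boundaries.

```python
# Naive sketch of overlapping-window chunking: each chunk starts
# (chunk_size - chunk_overlap) characters after the previous one, so
# consecutive chunks share chunk_overlap characters of context.
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    # Stop before emitting a window that would be pure overlap.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=200)
print(len(chunks), [len(c) for c in chunks])  # 3 [1000, 1000, 900]
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which helps retrieval quality at the cost of some index redundancy.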
Step 4: Using Docling CLI
Docling also offers a command-line interface for quick document processing:
# Convert a PDF to Markdown (written to the current directory by default)
docling example.pdf --to md
# Export to JSON, which preserves table and layout structure
docling example.pdf --to json --output out/
# Force OCR over the full content of a scanned document
docling example.pdf --force-ocr
Step 5: Advanced Features
For scanned or more complex documents, configure the conversion pipeline explicitly:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import SectionHeaderItem
# Enable OCR for scanned PDFs via pipeline options
pipeline_options = PdfPipelineOptions(do_ocr=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
doc = converter.convert("scanned.pdf").document
# Walk the document structure and list section headings with their levels
for item, level in doc.iterate_items():
    if isinstance(item, SectionHeaderItem):
        print(f"Section: {item.text}, Level: {item.level}")
Resources
Related Projects
- LangChain - Framework for building applications with LLMs
- LlamaIndex - Data framework for LLM applications
- Unstructured - Similar document processing library