Categories: Document Processing · AI Tools · LLM Integrations
📄

Docling

Open-source document processing library that simplifies document handling for generative AI applications

Tags: Beginner-Friendly · open-source · self-hosted · document-parsing · pdf-processing

Alternative To

  • Unstructured
  • Azure Document Intelligence
  • Amazon Textract

Difficulty Level

Beginner-Friendly

Suitable for beginners. Installation is a single pip command, and the basic API takes only a few lines of Python.

Overview

Docling is a powerful open-source document processing library designed to simplify document handling for generative AI applications. It provides a unified approach to parse diverse document formats—including advanced PDF understanding—making it easier to prepare documents for AI workflows. Originally developed by IBM Research Zurich, Docling is now hosted by the LF AI & Data Foundation and has gained significant traction in the AI community with over 28K GitHub stars.

System Requirements

  • CPU: Standard multi-core processor
  • RAM: 4GB+ (8GB+ recommended for processing larger documents)
  • GPU: Not required for basic functionality
  • Storage: 500MB+ for installation and dependencies
  • OS: Cross-platform (Windows, macOS, Linux)
  • Python: 3.8+

Installation Guide

Prerequisites

  • Python 3.8 or higher installed on your system
  • Pip package manager
  • Basic familiarity with Python and command-line tools

Option 1: Install with pip

Install Docling from PyPI:

pip install docling

For the latest development version:

pip install git+https://github.com/docling-project/docling.git

Option 2: Docker Installation

For containerized usage:

docker pull doclingproject/docling:latest
docker run -it --rm doclingproject/docling

Option 3: From Source

For contributors or customization:

git clone https://github.com/docling-project/docling.git
cd docling
pip install -e .

Practical Exercise: Getting Started with Docling

Let’s walk through a simple exercise to help you understand Docling’s capabilities for AI document processing.

Step 1: Basic Document Loading

First, let’s load and process a document:

from docling.document_converter import DocumentConverter

# Convert a document (supports PDF, DOCX, XLSX, HTML, images).
# API shown for recent Docling releases; check the docs if signatures differ.
converter = DocumentConverter()
result = converter.convert("example.pdf")
doc = result.document

# Print basic document information
print(f"Document has {doc.num_pages()} pages")
print(f"Document name: {doc.name}")

Step 2: Extract and Process Content

Now, let’s extract and view the content:

# Export the full document to Markdown
text_content = doc.export_to_markdown()
print(f"Document content (first 500 chars): {text_content[:500]}...")

# Inspect tables if present (export_to_dataframe requires pandas)
for i, table in enumerate(doc.tables):
    df = table.export_to_dataframe()
    print(f"Table {i+1}: {len(df)} rows x {len(df.columns)} columns")
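Extracted tables are usually persisted for downstream use. Here is a minimal, Docling-independent sketch of writing table data to CSV with the standard library; the `rows` list is a hypothetical stand-in for rows pulled out of a parsed table:

```python
import csv
import io

# Hypothetical rows as they might come out of a parsed table
rows = [
    ["Quarter", "Revenue", "Growth"],
    ["Q1", "1.2M", "5%"],
    ["Q2", "1.4M", "17%"],
]

# Serialize to CSV; swap io.StringIO for open("table.csv", "w", newline="")
# to write an actual file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
print(buffer.getvalue().strip())
```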

Step 3: AI Integration with LangChain

Let’s integrate with LangChain for AI processing:

# Recent LangChain versions split these into separate packages:
# pip install langchain-docling langchain-openai langchain-chroma langchain-text-splitters
from langchain_docling import DoclingLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Load document with DoclingLoader
loader = DoclingLoader("example.pdf")
documents = loader.load()

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

# Create vector store
embeddings = OpenAIEmbeddings()  # Requires an OpenAI API key (OPENAI_API_KEY)
vectorstore = Chroma.from_documents(chunks, embeddings)

# Simple semantic search
query = "What are the main conclusions of this document?"
results = vectorstore.similarity_search(query, k=3)
print(f"Top 3 relevant chunks for query: {query}")
for i, chunk in enumerate(results):
    print(f"Result {i+1}: {chunk.page_content[:100]}...")
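The `chunk_size` and `chunk_overlap` parameters control a sliding window over the text. In simplified form (the real RecursiveCharacterTextSplitter also tries to break on paragraph and sentence boundaries first), the splitting can be sketched as:

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size chunking with overlap; a simplified model of LangChain's splitter."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    # Each new chunk starts chunk_size - chunk_overlap characters after the last
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step) if text[i : i + chunk_size]]

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap ensures that a sentence cut at a chunk boundary still appears whole in one of the neighboring chunks, which improves retrieval quality.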

Step 4: Using Docling CLI

Docling also offers a command-line interface for quick document processing:

# Convert a PDF to Markdown (run `docling --help` for the full option list)
docling example.pdf --to md --output out/

# Export to other formats, e.g. JSON or HTML
docling example.pdf --to json --output out/

# Force OCR for scanned documents
docling example.pdf --ocr --to md --output out/

Step 5: Advanced Features

For working with more complex documents:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import SectionHeaderItem

# OCR support for scanned documents: enable it via PDF pipeline options,
# along with table-structure recognition for complex layouts
pipeline_options = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
doc = converter.convert("scanned.pdf").document

# Walk the document structure and list headings with their levels
for item, level in doc.iterate_items():
    if isinstance(item, SectionHeaderItem):
        print(f"Section: {item.text}, Level: {item.level}")
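The (title, level) pairs produced by walking the structure can be rendered as an indented table of contents. A small sketch, using hypothetical section data in place of real parser output:

```python
def format_outline(sections: list[tuple[str, int]]) -> str:
    """Render (title, level) pairs as an indented outline, two spaces per level."""
    return "\n".join("  " * (level - 1) + title for title, level in sections)

# Hypothetical sections as they might come from a parsed document
sections = [("Introduction", 1), ("Background", 2), ("Methods", 1), ("Evaluation", 2)]
print(format_outline(sections))
```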

Resources

Community and Support

  • GitHub repository: https://github.com/docling-project/docling

Related Tools and Frameworks

  • LangChain - Framework for building applications with LLMs
  • LlamaIndex - Data framework for LLM applications
  • Unstructured - Similar document processing library

Suggested Projects

You might also be interested in these similar projects:

🧠

Ollama

Self-host the latest AI models including Llama 3.3, DeepSeek-R1, Phi-4, and Gemma 3

Difficulty: Beginner-Friendly
Updated: Mar 23, 2025

An optimized Stable Diffusion WebUI with improved performance, reduced VRAM usage, and advanced features

Difficulty: Beginner
Updated: Mar 23, 2025
🗄️

Chroma

Chroma is the AI-native open-source embedding database for storing and searching vector embeddings

Difficulty: Beginner to Intermediate
Updated: Mar 23, 2025