Data Prep Kit
Open-source toolkit for accelerating unstructured data preparation for Large Language Model applications
Alternative To
- Commercial ETL tools
- Custom data pipelines
- Manual data cleaning
Difficulty Level
For experienced users. Complex setup and configuration required.
Overview
Data Prep Kit is an open-source toolkit designed to accelerate unstructured data preparation for Large Language Model (LLM) application developers. It helps developers cleanse, transform, and enrich use case-specific unstructured data for pre-training, fine-tuning, instruct-tuning, and Retrieval Augmented Generation (RAG) applications.
Developed by IBM Research and hosted by the LF AI & Data Foundation, Data Prep Kit provides tools for handling the critical but often time-consuming data preparation phase of LLM development. It’s built to scale from laptop to data center environments, making it suitable for both individual developers and enterprise teams.
System Requirements
- CPU: 2+ cores (4+ recommended for larger datasets)
- RAM: 4GB+ (8GB+ recommended for better performance)
- GPU: Not required
- Storage: Varies based on dataset size
- Operating System: Linux, macOS, or Windows with Python 3.10-3.12
Installation Guide
Prerequisites
- Python 3.10-3.12 installed
- pip (Python package manager)
- Basic knowledge of Python and command line interfaces
Standard Installation
The simplest way to install Data Prep Kit is via pip:
pip install 'data-prep-toolkit-transforms[all]'
This installs the core package with all optional dependencies. For a minimal installation without extra features:
pip install data-prep-toolkit-transforms
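To confirm the package installed correctly, you can inspect it with pip:
pip show data-prep-toolkit-transforms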
Development Installation
For contributing or customizing:
Clone the repository:
git clone https://github.com/data-prep-kit/data-prep-kit.git
cd data-prep-kit
Install in development mode:
pip install -e '.[dev]'
Using Data Prep Kit
Data Prep Kit provides 30+ data transformation modules for handling different aspects of data preparation. Here’s how to get started with some common transformations:
Basic Data Ingestion
from data_prep_toolkit import ingestion
# Convert various formats to Parquet for efficient processing
converter = ingestion.HtmlToParquetConverter()
converter.convert("input_html_files/", "output_data/")
# Convert code repositories to processable format
code_converter = ingestion.CodeToParquetConverter()
code_converter.convert("input_repo/", "output_data/")
Data Cleaning and Filtering
from data_prep_toolkit import cleaning, filtering
# Remove duplicate content
deduplicator = cleaning.Deduplicator(method="exact")
deduplicator.process("input_data/", "deduplicated_data/")
# Filter content based on quality metrics
quality_filter = filtering.QualityFilter(min_score=0.7)
quality_filter.filter("input_data/", "filtered_data/")
# Remove PII information
pii_redactor = cleaning.PiiRedactor()
pii_redactor.process("input_data/", "redacted_data/")
Language Processing
from data_prep_toolkit import language
# Identify the language of text
lang_identifier = language.LanguageIdentifier()
lang_identifier.process("input_data/", "language_identified_data/")
# Tokenize text for LLM processing
tokenizer = language.Tokenizer(model="gpt2")
tokenizer.process("input_data/", "tokenized_data/")
Practical Exercise: Building a RAG Pipeline with Data Prep Kit
Let’s create a simple RAG data preparation pipeline using Data Prep Kit:
Step 1: Install Required Packages
pip install 'data-prep-toolkit-transforms[all]' langchain chromadb openai
Step 2: Prepare Your Data with Data Prep Kit
import os
from data_prep_toolkit import pipeline
from data_prep_toolkit.transforms import (
HtmlToParquetConverter,
TextCleaner,
PiiRedactor,
QualityFilter,
TextChunker
)
# Define a preprocessing pipeline
prep_pipeline = pipeline.Pipeline([
HtmlToParquetConverter(),
TextCleaner(remove_urls=True, fix_unicode=True),
PiiRedactor(),
QualityFilter(min_length=100),
TextChunker(chunk_size=1000, overlap=100)
])
# Process your data
prep_pipeline.run(
input_path="raw_website_data/",
output_path="processed_data/"
)
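Before building the vector store, it is worth spot-checking the chunked output. This sketch assumes the pipeline wrote its chunks to processed_data/chunks.parquet, the same file Step 3 reads:
import pandas as pd

# Preview the chunked output before indexing
df = pd.read_parquet("processed_data/chunks.parquet")
print(df.shape)                       # number of chunks and columns
print(df[["text", "source"]].head())  # sample chunk text and source metadata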
Step 3: Create a Vector Database from Processed Data
import pandas as pd
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-api-key"
# Load processed data
df = pd.read_parquet("processed_data/chunks.parquet")
# Initialize embeddings model
embeddings = OpenAIEmbeddings()
# Create a vector database
documents = [Document(page_content=text, metadata={"source": src})
             for text, src in zip(df["text"], df["source"])]
vectorstore = Chroma.from_documents(
documents=documents,
embedding=embeddings,
persist_directory="./chroma_db"
)
vectorstore.persist()
Step 4: Create a Simple Query System
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
# Load the vector store
vectorstore = Chroma(
persist_directory="./chroma_db",
embedding_function=embeddings
)
# Create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
llm=OpenAI(temperature=0),
chain_type="stuff",
retriever=vectorstore.as_retriever()
)
# Query the system
query = "What are the main topics covered in this dataset?"
answer = qa_chain.run(query)
print(answer)
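If you also want to see which chunks supported an answer, RetrievalQA can return the retrieved source documents; a small variation on the chain above:
# Build the chain so it also returns the retrieved chunks
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)
result = qa_chain({"query": query})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"))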
Key Features
Data Transformation Modules
Data Prep Kit provides modules for various transformation tasks:
- Ingestion: Convert HTML, code repositories, and web content to Parquet
- Cleaning: Text cleaning, PII redaction, deduplication
- Filtering: Quality assessments, content filtering
- Language Processing: Language identification, tokenization
- Analysis: Code quality assessment, content profiling
- Safety: Hate speech detection, toxic content filtering (a sketch follows the list)
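The categories above mirror the examples shown earlier, and the safety modules follow the same call pattern. A minimal sketch, assuming a hypothetical safety.ToxicityFilter transform named in the style of the cleaning and filtering examples (the actual module and parameter names may differ):
from data_prep_toolkit import safety  # hypothetical module, following the pattern of the earlier examples

# Drop documents whose toxicity score exceeds an illustrative threshold
toxicity_filter = safety.ToxicityFilter(max_score=0.2)
toxicity_filter.process("input_data/", "safe_data/")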
Framework Integration
Data Prep Kit integrates with popular frameworks for scaling (an illustrative sketch follows the list):
- Python: Native Python processing for small datasets
- Ray: Distributed computing for medium datasets
- Spark: Big data processing for large datasets
- Kubeflow: Pipeline automation and orchestration
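As a rough illustration of what switching backends could look like, here is a sketch that reuses the Pipeline API from the RAG example above; the runtime argument is an assumption made for illustration, not a documented interface:
from data_prep_toolkit import pipeline
from data_prep_toolkit.transforms import TextCleaner, QualityFilter

# Same transforms, different execution engine: "python" for local runs,
# "ray" or "spark" for the distributed backends listed above.
prep_pipeline = pipeline.Pipeline(
    [TextCleaner(remove_urls=True), QualityFilter(min_length=100)],
    runtime="ray"  # hypothetical parameter shown only to illustrate backend selection
)
prep_pipeline.run(input_path="raw_data/", output_path="clean_data/")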
Extensibility
The toolkit is designed to be extensible (a custom-transform sketch follows the list):
- Custom transform development
- Pluggable architecture for adding new capabilities
- Integration with existing data workflows
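A sketch of what a custom transform might look like, assuming a hypothetical AbstractTransform base class (the actual extension interface may differ):
from data_prep_toolkit.transforms import AbstractTransform  # hypothetical base class

class LowercaseTransform(AbstractTransform):
    """Illustrative custom transform that lowercases a text column."""

    def transform(self, table):
        # `table` stands in for the tabular batch the framework hands to a transform;
        # here we assume a pandas-like DataFrame with a "text" column.
        table["text"] = table["text"].str.lower()
        return table

Once defined, a custom transform like this could be dropped into a pipeline alongside the built-in ones, as in the RAG example above.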
Resources
Official Documentation
Community Support
Tutorials and Examples