Data Processing
πŸ”§

Data Prep Kit

Open-source toolkit for accelerating unstructured data preparation for Large Language Model applications

Beginner to Intermediate · data-preparation · LLM · RAG · python

Alternative To

  • Commercial ETL tools
  • Custom data pipelines
  • Manual data cleaning

Difficulty Level

Beginner to Intermediate

Suitable for developers with basic Python experience; more involved setup is only needed for distributed deployments (for example on Ray or Spark).

Overview

Data Prep Kit is an open-source toolkit designed to accelerate unstructured data preparation for Large Language Model (LLM) application developers. It helps developers cleanse, transform, and enrich use case-specific unstructured data for pre-training, fine-tuning, instruct-tuning, and Retrieval Augmented Generation (RAG) applications.

Developed by IBM Research and hosted by the LF AI & Data Foundation, Data Prep Kit provides tools for handling the critical but often time-consuming data preparation phase of LLM development. It’s built to scale from laptop to data center environments, making it suitable for both individual developers and enterprise teams.

System Requirements

  • CPU: 2+ cores (4+ recommended for larger datasets)
  • RAM: 4 GB+ (8 GB+ recommended for better performance)
  • GPU: Not required
  • Storage: Varies with dataset size
  • Operating System: Linux, macOS, or Windows, with Python 3.10-3.12

Installation Guide

Prerequisites

  • Python 3.10-3.12 installed
  • pip (Python package manager)
  • Basic knowledge of Python and command line interfaces
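
You can confirm the Python and pip versions from the command line before installing:

python --version
pip --version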

Standard Installation

The simplest way to install Data Prep Kit is via pip:

pip install 'data-prep-toolkit-transforms[all]'

This installs the core package with all optional dependencies. For a minimal installation without extra features:

pip install data-prep-toolkit-transforms
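
To confirm the package installed correctly, ask pip for its metadata:

pip show data-prep-toolkit-transforms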

Development Installation

For contributing or customizing:

  1. Clone the repository:

    git clone https://github.com/data-prep-kit/data-prep-kit.git
    cd data-prep-kit
    
  2. Install in development mode:

    pip install -e '.[dev]'
    

Using Data Prep Kit

Data Prep Kit provides 30+ data transformation modules for handling different aspects of data preparation. Here’s how to get started with some common transformations:

Basic Data Ingestion

from data_prep_toolkit import ingestion

# Convert various formats to Parquet for efficient processing
converter = ingestion.HtmlToParquetConverter()
converter.convert("input_html_files/", "output_data/")

# Convert code repositories to processable format
code_converter = ingestion.CodeToParquetConverter()
code_converter.convert("input_repo/", "output_data/")
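
Parquet is a columnar format that downstream transforms can read efficiently. Assuming the converter writes one or more .parquet files into output_data/ (the exact output layout may vary), you can inspect a result with pandas:

import pandas as pd
from pathlib import Path

# Grab the first Parquet file the converter produced (the glob pattern is an assumption)
first_file = next(Path("output_data/").glob("*.parquet"))

# Load it and take a quick look at the schema and a few rows
df = pd.read_parquet(first_file)
print(df.columns.tolist())
print(df.head())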

Data Cleaning and Filtering

from data_prep_toolkit import cleaning, filtering

# Remove duplicate content
deduplicator = cleaning.Deduplicator(method="exact")
deduplicator.process("input_data/", "deduplicated_data/")

# Filter content based on quality metrics
quality_filter = filtering.QualityFilter(min_score=0.7)
quality_filter.filter("input_data/", "filtered_data/")

# Remove PII information
pii_redactor = cleaning.PiiRedactor()
pii_redactor.process("input_data/", "redacted_data/")
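
Because each transform reads from one directory and writes to another, a multi-step cleanup can be expressed by chaining directories. The sketch below simply reuses the illustrative classes shown above:

from data_prep_toolkit import cleaning, filtering

# Each step consumes the previous step's output directory
cleaning.Deduplicator(method="exact").process("input_data/", "step1_dedup/")
filtering.QualityFilter(min_score=0.7).filter("step1_dedup/", "step2_filtered/")
cleaning.PiiRedactor().process("step2_filtered/", "step3_redacted/")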

Language Processing

from data_prep_toolkit import language

# Identify the language of text
lang_identifier = language.LanguageIdentifier()
lang_identifier.process("input_data/", "language_identified_data/")

# Tokenize text for LLM processing
tokenizer = language.Tokenizer(model="gpt2")
tokenizer.process("input_data/", "tokenized_data/")

Practical Exercise: Building a RAG Pipeline with Data Prep Kit

Let’s create a simple RAG data preparation pipeline using Data Prep Kit:

Step 1: Install Required Packages

pip install 'data-prep-toolkit-transforms[all]' langchain chromadb openai

Step 2: Prepare Your Data with Data Prep Kit

from data_prep_toolkit import pipeline
from data_prep_toolkit.transforms import (
    HtmlToParquetConverter,
    TextCleaner,
    PiiRedactor,
    QualityFilter,
    TextChunker
)

# Define a preprocessing pipeline
prep_pipeline = pipeline.Pipeline([
    HtmlToParquetConverter(),
    TextCleaner(remove_urls=True, fix_unicode=True),
    PiiRedactor(),
    QualityFilter(min_length=100),
    TextChunker(chunk_size=1000, overlap=100)
])

# Process your data
prep_pipeline.run(
    input_path="raw_website_data/",
    output_path="processed_data/"
)

Step 3: Create a Vector Database from Processed Data

import os

import pandas as pd
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-api-key"

# Load processed data
df = pd.read_parquet("processed_data/chunks.parquet")

# Initialize embeddings model
embeddings = OpenAIEmbeddings()

# Wrap each chunk in a LangChain Document, then build the vector database
documents = [Document(page_content=text, metadata={"source": src})
             for text, src in zip(df["text"], df["source"])]

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
vectorstore.persist()
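
Before wiring up a full QA chain, it is worth sanity-checking the store with a plain similarity search (LangChain vector stores expose similarity_search for this):

# Retrieve the three chunks most similar to a test query
hits = vectorstore.similarity_search("data preparation", k=3)
for doc in hits:
    print(doc.metadata["source"], doc.page_content[:80])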

Step 4: Create a Simple Query System

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Load the vector store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

# Create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Query the system
query = "What are the main topics covered in this dataset?"
answer = qa_chain.run(query)
print(answer)

Key Features

Data Transformation Modules

Data Prep Kit provides modules for various transformation tasks:

  • Ingestion: Convert HTML, Code, Web content to Parquet
  • Cleaning: Text cleaning, PII redaction, deduplication
  • Filtering: Quality assessments, content filtering
  • Language Processing: Language identification, tokenization
  • Analysis: Code quality assessment, content profiling
  • Safety: Hate speech detection, toxic content filtering
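
For example, a safety transform could be applied in the same style as the earlier examples. The module and class names below are illustrative sketches in keeping with this guide, not the exact package API:

from data_prep_toolkit import safety

# Flag or drop documents containing hate speech or toxic content
# (illustrative names; check the transform catalog for the real module)
toxicity_filter = safety.ToxicContentFilter(threshold=0.8)
toxicity_filter.process("input_data/", "safe_data/")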

Framework Integration

Data Prep Kit integrates with popular frameworks for scaling:

  • Python: Native Python processing for small datasets
  • Ray: Distributed computing for medium datasets
  • Spark: Big data processing for large datasets
  • Kubeflow: Pipeline automation and orchestration
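
In practice this means the same transform can be dispatched to different runtimes as a dataset grows. The sketch below uses a hypothetical runtime parameter purely to illustrate the idea; the real launcher configuration differs per framework:

from data_prep_toolkit import cleaning

# Small dataset: run locally in plain Python
cleaning.Deduplicator(method="exact").process("small_data/", "deduped_small/")

# Larger corpus: hypothetically dispatch the same transform to a Ray cluster
cleaning.Deduplicator(method="exact", runtime="ray", num_workers=8).process(
    "large_data/", "deduped_large/")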

Extensibility

The toolkit is designed to be extensible:

  • Custom transform development
  • Pluggable architecture for adding new capabilities
  • Integration with existing data workflows
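
A custom transform only needs to follow the same read-a-directory, write-a-directory convention used throughout this guide. The class below is a hypothetical example of that convention, not the toolkit's actual plug-in interface; consult the official documentation for the real base classes:

from pathlib import Path

import pandas as pd

class LowercaseTransform:
    """Hypothetical custom transform: lowercases a text column in every
    Parquet file found in input_path and writes the results to output_path."""

    def __init__(self, column: str = "text"):
        self.column = column

    def process(self, input_path: str, output_path: str) -> None:
        Path(output_path).mkdir(parents=True, exist_ok=True)
        for file in Path(input_path).glob("*.parquet"):
            df = pd.read_parquet(file)
            df[self.column] = df[self.column].str.lower()
            df.to_parquet(Path(output_path) / file.name)

# Use it like any other transform in this guide
LowercaseTransform().process("filtered_data/", "lowercased_data/")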

Resources

  • Official Documentation: project README and docs in the GitHub repository (https://github.com/data-prep-kit/data-prep-kit)
  • Community Support: GitHub issues on the project repository
  • Tutorials and Examples: example code shipped in the repository

Suggested Projects

You might also be interested in these similar projects:

  • πŸ•ΈοΈ Crawl4AI: Blazing-fast, AI-ready web crawler and scraper designed specifically for LLMs, AI agents, and data pipelines (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)
  • πŸ—„οΈ Chroma: AI-native open-source embedding database for storing and searching vector embeddings (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)
  • ⛓️ Langflow: A powerful low-code tool for building and deploying AI-powered agents and workflows (Difficulty: Beginner to Intermediate; Updated: Mar 23, 2025)