gjunjie/simple_rag_framework

PDF Chunking System with Table and Image Processing

A comprehensive PDF processing system that extracts content from PDFs, removes headers and footers, converts tables to structured JSON, captions images with the BLIP vision model, and generates vector embeddings for Retrieval-Augmented Generation (RAG) pipelines.

Features

  • Header/Footer Removal: Automatically detects and removes headers, footers, and duplicate page numbers
  • Table Extraction: Extracts tables and converts them to structured JSON format with headers and rows
  • Image Captioning: Uses BLIP vision model to generate concise captions for images
  • Text Chunking: Splits text content into manageable chunks with overlap
  • Vector Embeddings: Generates embeddings for text chunks, table summaries, and image captions
  • Elasticsearch Indexing: Indexes chunks with embeddings into Elasticsearch for efficient retrieval
  • Hybrid Search: Combines BM25 (keyword) and vector (semantic) search for better retrieval results
  • JSON Output: Exports all chunks (text, tables, images) with embeddings to a structured JSON file

Requirements

  • Python 3.8 or higher
  • pip

Installation

Option 1: Using setup script (Recommended)

# Using shell script (macOS/Linux)
chmod +x setup_env.sh
./setup_env.sh

# Or using Python script (Cross-platform)
python3 setup_env.py

Option 2: Manual setup

# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip

# Install dependencies
pip install -r requirements.txt

Usage

Basic Usage

from pdf_chunker import PDFChunker

# Initialize chunker
chunker = PDFChunker("path/to/your/document.pdf", output_dir="output")

# Process PDF
chunker.process_pdf(chunk_size=1024, chunk_overlap=100)

# Save to JSON
chunker.save_to_json("output_chunks.json")

# Close document
chunker.close()

Command Line Usage

# Activate virtual environment first
source .venv/bin/activate

# Run the script (processes test_document files)
python pdf_chunker.py

Processing Test Documents

The script includes a main function that processes files in the test_document directory:

python pdf_chunker.py

This will process:

  • test_document/image_extraction_example.pdf
  • test_document/table_extraction_example.pdf

Generating Embeddings

After processing PDFs and generating chunks, you can generate vector embeddings for all chunks:

# Generate embeddings for a specific JSON file
python embedding_generator.py output/image_extraction_example_chunks.json

# Generate embeddings with custom model and output file
python embedding_generator.py output/table_extraction_example_chunks.json \
    -o output/table_extraction_example_chunks_with_embeddings.json \
    -m all-mpnet-base-v2 \
    -b 64

# Generate embeddings for all JSON files in output directory
python generate_embeddings.py

Embedding Generator Options

  • -o, --output: Output file path (default: overwrites input)
  • -m, --model: Sentence transformer model name (default: all-MiniLM-L6-v2)
    • all-MiniLM-L6-v2: Fast, lightweight, 384 dimensions (recommended for most use cases)
    • all-mpnet-base-v2: Better quality, 768 dimensions (slower but more accurate)
  • -b, --batch-size: Batch size for embedding generation (default: 32)

Indexing to Elasticsearch

After generating embeddings, you can index the chunks into Elasticsearch for retrieval:

# Index chunks from JSON file into Elasticsearch
python elasticsearch_indexer.py output/image_extraction_example_chunks_with_embeddings.json -i rag_chunks

# Index with custom Elasticsearch connection
python elasticsearch_indexer.py output/table_extraction_example_chunks_with_embeddings.json \
    -i rag_chunks \
    --host localhost \
    --port 9200 \
    --scheme http

# Delete existing index and recreate
python elasticsearch_indexer.py output/image_extraction_example_chunks_with_embeddings.json \
    -i rag_chunks \
    --delete-existing

Elasticsearch Indexer Options

  • -i, --index: Elasticsearch index name (default: rag_chunks)
  • --host: Elasticsearch host (default: localhost)
  • --port: Elasticsearch port (default: 9200)
  • --scheme: Connection scheme - http or https (default: http)
  • --username: Elasticsearch username (optional)
  • --password: Elasticsearch password (optional)
  • --delete-existing: Delete existing index if it exists
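The index created by elasticsearch_indexer.py needs a dense_vector field for the embeddings alongside the text fields. A mapping along these lines would support both BM25 and vector search (the field names here are illustrative, taken from the chunk schema below; the script's actual mapping may differ):

```json
{
  "mappings": {
    "properties": {
      "content":   { "type": "text" },
      "type":      { "type": "keyword" },
      "page":      { "type": "integer" },
      "embedding": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
```

The dims value must match the embedding model (384 for all-MiniLM-L6-v2, 768 for all-mpnet-base-v2).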

Hybrid Search (BM25 + Vector)

The system supports hybrid search that combines BM25 (keyword-based) and vector (semantic) search for better retrieval results.

Using the Retriever

# Hybrid search (BM25 + vector) - recommended
python retriever.py "your search query" -i rag_chunks -k 10

# BM25 only (keyword search)
python retriever.py "your search query" -i rag_chunks -t bm25

# Vector only (semantic search)
python retriever.py "your search query" -i rag_chunks -t vector

# Custom weights for hybrid search
python retriever.py "your search query" -i rag_chunks \
    --bm25-weight 0.7 \
    --vector-weight 0.3

# Disable RRF (use manual score combination)
python retriever.py "your search query" -i rag_chunks --no-rrf

# Enable RRF reranking (post-processing)
python retriever.py "your search query" -i rag_chunks --rerank rrf

# Enable neural reranking (post-processing)
python retriever.py "your search query" -i rag_chunks --rerank neural

# Neural reranking with custom model
python retriever.py "your search query" -i rag_chunks \
    --rerank neural \
    --rerank-model cross-encoder/ms-marco-MiniLM-L-12-v2

Retriever Options

  • query: Search query text (required)
  • -i, --index: Elasticsearch index name (default: rag_chunks)
  • -k, --top-k: Number of results to return (default: 10)
  • -t, --search-type: Type of search - hybrid, bm25, or vector (default: hybrid)
  • --bm25-weight: Weight for BM25 scores in hybrid search (0.0-1.0, default: 0.5)
  • --vector-weight: Weight for vector scores in hybrid search (0.0-1.0, default: 0.5)
  • --no-rrf: Disable Reciprocal Rank Fusion in Elasticsearch (use manual score combination)
  • --rerank: Enable reranking: rrf or neural (default: disabled)
  • --rerank-model: Cross-encoder model name for neural reranking (default: cross-encoder/ms-marco-MiniLM-L-6-v2)
  • --rrf-k: RRF rank constant k (default: 60)
  • --rerank-top-k: Number of candidates for reranking (default: top_k * 2)
  • --embedding-model: Embedding model name (default: all-MiniLM-L6-v2)
  • --es-host: Elasticsearch host (default: localhost)
  • --es-port: Elasticsearch port (default: 9200)
  • --no-scores: Hide scores in output

Using in Python

from retriever import Retriever
from elasticsearch_indexer import ElasticsearchIndexer
from embedding_generator import EmbeddingGenerator

# Initialize retriever
retriever = Retriever(
    es_host="localhost",
    es_port=9200,
    embedding_model="all-MiniLM-L6-v2"
)

# Hybrid search (BM25 + vector)
results = retriever.retrieve(
    query="your search query",
    index_name="rag_chunks",
    top_k=10,
    search_type="hybrid",
    bm25_weight=0.5,
    vector_weight=0.5,
    use_rrf=True
)

# Hybrid search with neural reranking
retriever = Retriever(
    es_host="localhost",
    es_port=9200,
    embedding_model="all-MiniLM-L6-v2",
    rerank_method="neural",
    rerank_model="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
results = retriever.retrieve(
    query="your search query",
    index_name="rag_chunks",
    top_k=10,
    search_type="hybrid"
)

# Display results
output = retriever.format_results(results)
print(output)

Hybrid Search Methods

  1. Reciprocal Rank Fusion (RRF) (default, recommended)

    • Combines results from BM25 and vector search without score normalization
    • More robust and performs better in most cases
    • Automatically handles score differences between the two search methods
  2. Manual Score Combination

    • Runs both searches separately and combines scores with weights
    • Normalizes scores before combining
    • Provides more control but requires tuning
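The manual combination method can be sketched in a few lines. This is a simplified version, assuming min-max normalization; the project's actual normalization and tie-breaking may differ:

```python
def minmax(scores):
    """Scale a list of raw scores into [0, 1]; a constant list maps to all 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(bm25, vector, bm25_weight=0.5, vector_weight=0.5):
    """Weighted sum of normalized BM25 and vector scores per document.

    `bm25` and `vector` map doc_id -> raw score; a document missing from
    one result set contributes 0 from that side.
    Returns (doc_id, fused_score) pairs sorted best-first.
    """
    b = dict(zip(bm25, minmax(list(bm25.values())))) if bm25 else {}
    v = dict(zip(vector, minmax(list(vector.values())))) if vector else {}
    doc_ids = set(bm25) | set(vector)
    fused = {d: bm25_weight * b.get(d, 0.0) + vector_weight * v.get(d, 0.0)
             for d in doc_ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

Because scores are normalized per result set before weighting, the weights express relative trust in each method rather than raw-score magnitudes.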

Reranking

The system supports reranking as a post-processing step applied after initial retrieval to improve the final ordering of search results.

Reranking Methods

  1. Reciprocal Rank Fusion (RRF) (Post-processing)

    • Combines multiple result lists (e.g., BM25 and vector results) using RRF
    • Fast and effective for combining heterogeneous search results
    • No model required, works with any result lists
    • Formula: score = sum(1 / (k + rank)) for each result list
  2. Neural Reranking (Cross-Encoder)

    • Uses a cross-encoder model to score query-document pairs
    • More accurate than RRF but slower
    • Requires a pre-trained cross-encoder model
    • Better for final ordering when you have compute resources
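The RRF formula above fits in a few lines of Python. This is a minimal standalone sketch, not the project's reranker.py implementation:

```python
def rrf_fuse(result_lists, k=60, top_k=10):
    """Reciprocal Rank Fusion: score(d) = sum of 1 / (k + rank) over
    every result list in which document d appears (rank starts at 1).

    `result_lists` is a sequence of ranked doc-id lists.
    Returns the fused top_k as (doc_id, score) pairs, best-first.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Note how a document ranked second in both lists beats one ranked first in only one list, which is exactly why RRF rewards agreement between BM25 and vector results.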

Using Reranking

# Rerank results using RRF
python retriever.py "your search query" -i rag_chunks --rerank rrf

# Rerank results using neural model
python retriever.py "your search query" -i rag_chunks --rerank neural

# Neural reranking with custom model
python retriever.py "your search query" -i rag_chunks \
    --rerank neural \
    --rerank-model cross-encoder/ms-marco-MiniLM-L-12-v2 \
    --rerank-top-k 50

Using Reranker Standalone

# Rerank results from a JSON file
python reranker.py "your query" results.json --method neural -k 10 -o reranked_results.json

Using in Python

from reranker import Reranker

# Initialize RRF reranker
rrf_reranker = Reranker(method="rrf", rrf_rank_constant=60)

# Initialize neural reranker
neural_reranker = Reranker(
    method="neural",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

# Rerank single result list
reranked = neural_reranker.rerank(
    query="your search query",
    results=initial_results,
    top_k=10
)

# Rerank hybrid results (combine BM25 and vector results)
reranked = rrf_reranker.rerank_hybrid_results(
    bm25_results=bm25_results,
    vector_results=vector_results,
    query="your search query",
    top_k=10
)

Neural Reranker Models

Available cross-encoder models for neural reranking:

  • cross-encoder/ms-marco-MiniLM-L-6-v2 (default; fast)
  • cross-encoder/ms-marco-MiniLM-L-12-v2 (better quality, somewhat slower)
  • cross-encoder/ms-marco-electra-base (high quality, slowest)

Note: Neural reranking is slower than RRF but provides better accuracy, especially for final ordering.

Output Format

The system generates a JSON file with the following structure:

{
  "document_info": {
    "source_file": "path/to/document.pdf",
    "total_pages": 10,
    "total_chunks": 25,
    "chunk_types": {
      "text": 20,
      "table": 3,
      "image": 2
    }
  },
  "chunks": [
    {
      "chunk_id": 0,
      "type": "text",
      "page": 1,
      "content": "Text content here...",
      "metadata": {
        "chunk_size": 500,
        "start_char": 0,
        "end_char": 500
      }
    },
    {
      "chunk_id": 1,
      "type": "table",
      "page": 2,
      "headers": ["Column 1", "Column 2"],
      "rows": [
        ["Value 1", "Value 2"],
        ["Value 3", "Value 4"]
      ],
      "metadata": {
        "num_rows": 2,
        "num_cols": 2
      }
    },
    {
      "chunk_id": 2,
      "type": "image",
      "page": 3,
      "caption": "A diagram showing the process flow",
      "image_path": "output/extracted_images/img_page_3_idx_0.png",
      "metadata": {
        "image_index": 0,
        "format": "png"
      }
    }
  ]
}

After generating embeddings, each chunk will include an embedding field:

{
  "chunk_id": 0,
  "type": "text",
  "page": 1,
  "content": "Text content here...",
  "embedding": [0.123, -0.456, 0.789, ...],
  "embedding_model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "metadata": {
    "chunk_size": 500,
    "start_char": 0,
    "end_char": 500
  }
}

The document_info will also include embedding metadata:

{
  "document_info": {
    "source_file": "path/to/document.pdf",
    "embedding_model": "all-MiniLM-L6-v2",
    "embedding_dim": 384,
    ...
  }
}
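A quick sanity check against this schema can catch model/index mismatches early (for example, indexing 768-dimensional embeddings into a 384-dimensional mapping). This is a hedged sketch whose field names follow the examples above:

```python
import json


def validate_chunks(data):
    """Check that every chunk's embedding matches the declared dimension.

    `data` is the parsed output JSON. Returns the number of chunks seen;
    raises ValueError on the first dimension mismatch.
    """
    dim = data["document_info"].get("embedding_dim")
    for chunk in data["chunks"]:
        emb = chunk.get("embedding")
        if emb is not None and dim is not None and len(emb) != dim:
            raise ValueError(
                f"chunk {chunk['chunk_id']}: expected {dim} dims, got {len(emb)}"
            )
    return len(data["chunks"])


# Example usage on a saved file:
# with open("output/image_extraction_example_chunks_with_embeddings.json") as f:
#     validate_chunks(json.load(f))
```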

Configuration

Chunking Parameters

  • chunk_size: Maximum size of text chunks in characters (default: 1024)
  • chunk_overlap: Overlap between chunks in characters (default: 100)
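The sliding-window behaviour these two parameters describe can be sketched as follows. This is a simplified character-level version; the project's chunker may additionally respect sentence or word boundaries:

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=100):
    """Split text into chunks of at most chunk_size characters, where each
    chunk repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
        start += chunk_size - chunk_overlap
    return chunks
```

The overlap ensures that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated characters.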

Header/Footer Detection

The system automatically detects headers and footers by:

  • Analyzing text position (top/bottom 15% of page)
  • Identifying repeated patterns across pages
  • Detecting page numbers
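The position-based part of this heuristic can be sketched as below, with y-coordinates measured from the top of the page and the 15% bands taken from the description above (the function name and signature are illustrative, not the project's API):

```python
def in_header_footer_band(block_y0, block_y1, page_height, band=0.15):
    """Return True if a text block lies entirely in the top or bottom
    `band` fraction of the page, i.e. a candidate header/footer region.

    block_y0/block_y1 are the block's top and bottom y-coordinates.
    """
    top_limit = page_height * band
    bottom_limit = page_height * (1.0 - band)
    return block_y1 <= top_limit or block_y0 >= bottom_limit
```

Position alone produces false positives (e.g. a heading near the top of a page), which is why the repetition-across-pages check is also needed before a block is dropped.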

BLIP Model

The system uses the Salesforce/blip-image-captioning-base model for image captioning. The model is automatically downloaded on first use.

Embedding Models

The embedding generator uses sentence-transformers models. The default model all-MiniLM-L6-v2 provides a good balance between speed and quality:

  • Embedding Dimension: 384
  • Speed: Fast inference
  • Quality: Good semantic understanding

For better quality (but slower), use all-mpnet-base-v2 (768 dimensions).

Embeddings are generated for:

  • Text chunks: Direct embedding of text content
  • Tables: Summary text embedding (headers + sample rows)
  • Images: Caption embedding (generated by BLIP model)
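A table-summary string of the kind described (headers plus a few sample rows) could be built like this; the exact wording and row limit the project uses are assumptions:

```python
def table_summary(headers, rows, max_rows=3):
    """Flatten a table into a short text suitable for embedding:
    the column names followed by up to max_rows sample rows."""
    lines = ["Table with columns: " + ", ".join(headers)]
    for row in rows[:max_rows]:
        lines.append("Row: " + ", ".join(str(cell) for cell in row))
    return "\n".join(lines)
```

Embedding a compact summary like this keeps large tables within the model's input limit while still capturing what the table is about.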

Project Structure

RAG_system/
├── pdf_chunker.py          # Main PDF processing script
├── embedding_generator.py  # Embedding generation script
├── generate_embeddings.py  # Convenience script to generate embeddings for all JSON files
├── elasticsearch_indexer.py # Elasticsearch indexing and hybrid search
├── retriever.py           # Retrieval interface with hybrid search support
├── reranker.py            # Reranking module (RRF and neural reranking)
├── check_index.py         # Utility script to check Elasticsearch index contents
├── requirements.txt       # Python dependencies
├── setup_env.sh           # Shell script for environment setup
├── setup_env.py           # Python script for environment setup
├── README.md              # This file
├── .venv/                 # Virtual environment (created after setup)
├── output/                # Output directory (created automatically)
│   ├── extracted_images/  # Extracted images
│   └── *.json            # Output JSON files (with embeddings)
└── test_document/         # Test PDF files
    ├── image_extraction_example.pdf
    └── table_extraction_example.pdf

Dependencies

  • PyMuPDF (fitz): PDF processing and text extraction
  • transformers: BLIP model for image captioning
  • torch: Machine learning framework
  • Pillow: Image processing
  • pandas: Data manipulation (for table handling)
  • numpy: Numerical operations
  • sentence-transformers: Vector embedding generation
  • elasticsearch: Elasticsearch client for indexing and hybrid search (BM25 + vector)

Troubleshooting

BLIP Model Loading Issues

If you encounter issues loading the BLIP model:

  • Ensure you have sufficient disk space (model is ~1GB)
  • Check your internet connection for model download
  • The system will continue without image captioning if the model fails to load

Memory Issues

For large PDFs:

  • Process files in batches
  • Reduce chunk size if memory is limited
  • Consider using GPU if available (CUDA)

Table Extraction Limitations

The current table extraction uses basic heuristics. For complex tables, consider:

  • Using specialized libraries like camelot-py or tabula-py
  • Manual table extraction for critical documents

License

This project is provided as-is for educational and research purposes.
