A comprehensive PDF processing system that extracts content from PDFs, removes headers/footers, converts tables to JSON, processes images with BLIP vision model for captions, and generates vector embeddings for RAG (Retrieval-Augmented Generation) systems.
- Header/Footer Removal: Automatically detects and removes headers, footers, and duplicate page numbers
- Table Extraction: Extracts tables and converts them to structured JSON format with headers and rows
- Image Captioning: Uses BLIP vision model to generate concise captions for images
- Text Chunking: Splits text content into manageable chunks with overlap
- Vector Embeddings: Generates embeddings for text chunks, table summaries, and image captions
- Elasticsearch Indexing: Indexes chunks with embeddings into Elasticsearch for efficient retrieval
- Hybrid Search: Combines BM25 (keyword) and vector (semantic) search for better retrieval results
- JSON Output: Exports all chunks (text, tables, images) with embeddings to a structured JSON file
- Python 3.8 or higher
- pip
```bash
# Using the shell script (macOS/Linux)
chmod +x setup_env.sh
./setup_env.sh

# Or using the Python script (cross-platform)
python3 setup_env.py
```

Manual setup:

```bash
# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip

# Install dependencies
pip install -r requirements.txt
```

Basic usage from Python:

```python
from pdf_chunker import PDFChunker

# Initialize chunker
chunker = PDFChunker("path/to/your/document.pdf", output_dir="output")

# Process PDF
chunker.process_pdf(chunk_size=1024, chunk_overlap=100)

# Save to JSON
chunker.save_to_json("output_chunks.json")

# Close document
chunker.close()
```

Or run the script directly; its main function processes the files in the test_document directory:

```bash
# Activate virtual environment first
source .venv/bin/activate

# Run the script
python pdf_chunker.py
```

This will process:
- test_document/image_extraction_example.pdf
- test_document/table_extraction_example.pdf
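After a run, the output file can be sanity-checked with a few lines of standard-library Python. `summarize_chunks` is an illustrative helper (not part of the package), and it assumes the chunk schema shown later in this README:

```python
import json
import os
from collections import Counter

def summarize_chunks(data):
    """Count chunks by type in a loaded output file."""
    return Counter(chunk["type"] for chunk in data.get("chunks", []))

if __name__ == "__main__":
    path = "output/image_extraction_example_chunks.json"  # illustrative path
    if os.path.exists(path):
        with open(path) as f:
            print(summarize_chunks(json.load(f)))
```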
After processing PDFs and generating chunks, you can generate vector embeddings for all chunks:
```bash
# Generate embeddings for a specific JSON file
python embedding_generator.py output/image_extraction_example_chunks.json

# Generate embeddings with a custom model and output file
python embedding_generator.py output/table_extraction_example_chunks.json \
    -o output/table_extraction_example_chunks_with_embeddings.json \
    -m all-mpnet-base-v2 \
    -b 64

# Generate embeddings for all JSON files in the output directory
python generate_embeddings.py
```

Options:
- `-o, --output`: Output file path (default: overwrites input)
- `-m, --model`: Sentence transformer model name (default: `all-MiniLM-L6-v2`)
  - `all-MiniLM-L6-v2`: Fast, lightweight, 384 dimensions (recommended for most use cases)
  - `all-mpnet-base-v2`: Better quality, 768 dimensions (slower but more accurate)
- `-b, --batch-size`: Batch size for embedding generation (default: 32)
After generating embeddings, you can index the chunks into Elasticsearch for retrieval:
```bash
# Index chunks from a JSON file into Elasticsearch
python elasticsearch_indexer.py output/image_extraction_example_chunks_with_embeddings.json -i rag_chunks

# Index with a custom Elasticsearch connection
python elasticsearch_indexer.py output/table_extraction_example_chunks_with_embeddings.json \
    -i rag_chunks \
    --host localhost \
    --port 9200 \
    --scheme http

# Delete the existing index and recreate it
python elasticsearch_indexer.py output/image_extraction_example_chunks_with_embeddings.json \
    -i rag_chunks \
    --delete-existing
```

Options:
- `-i, --index`: Elasticsearch index name (default: `rag_chunks`)
- `--host`: Elasticsearch host (default: `localhost`)
- `--port`: Elasticsearch port (default: `9200`)
- `--scheme`: Connection scheme, `http` or `https` (default: `http`)
- `--username`: Elasticsearch username (optional)
- `--password`: Elasticsearch password (optional)
- `--delete-existing`: Delete the existing index if it exists
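Under the hood, indexing for hybrid search presumably pairs a BM25-searchable text field with a `dense_vector` field. The sketch below uses the official `elasticsearch` Python client; the mapping, field names, and `build_actions` helper are illustrative assumptions, not the actual indexer code (the 384 dimensions match the default embedding model described in this README):

```python
def chunk_mapping(dims=384):
    """A plausible index mapping: BM25 over `content`, kNN over `embedding`."""
    return {
        "properties": {
            "content": {"type": "text"},
            "type": {"type": "keyword"},
            "page": {"type": "integer"},
            "embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
        }
    }

def build_actions(chunks, index_name):
    """Turn chunk dicts into elasticsearch.helpers.bulk actions."""
    for chunk in chunks:
        yield {"_index": index_name, "_id": chunk["chunk_id"], "_source": chunk}

if __name__ == "__main__":
    try:  # requires a running Elasticsearch and `pip install elasticsearch`
        from elasticsearch import Elasticsearch, helpers
        es = Elasticsearch("http://localhost:9200")
        es.indices.create(index="rag_chunks", mappings=chunk_mapping())
        chunks = [{"chunk_id": 0, "type": "text", "page": 1,
                   "content": "example", "embedding": [0.0] * 384}]
        helpers.bulk(es, build_actions(chunks, "rag_chunks"))
    except Exception as exc:
        print(f"skipping live indexing demo: {exc}")
```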
The system supports hybrid search that combines BM25 (keyword-based) and vector (semantic) search for better retrieval results.
```bash
# Hybrid search (BM25 + vector) - recommended
python retriever.py "your search query" -i rag_chunks -k 10

# BM25 only (keyword search)
python retriever.py "your search query" -i rag_chunks -t bm25

# Vector only (semantic search)
python retriever.py "your search query" -i rag_chunks -t vector

# Custom weights for hybrid search
python retriever.py "your search query" -i rag_chunks \
    --bm25-weight 0.7 \
    --vector-weight 0.3

# Disable RRF (use manual score combination)
python retriever.py "your search query" -i rag_chunks --no-rrf

# Enable RRF reranking (post-processing)
python retriever.py "your search query" -i rag_chunks --rerank rrf

# Enable neural reranking (post-processing)
python retriever.py "your search query" -i rag_chunks --rerank neural

# Neural reranking with a custom model
python retriever.py "your search query" -i rag_chunks \
    --rerank neural \
    --rerank-model cross-encoder/ms-marco-MiniLM-L-12-v2
```

Arguments:
- `query`: Search query text (required)
- `-i, --index`: Elasticsearch index name (default: `rag_chunks`)
- `-k, --top-k`: Number of results to return (default: `10`)
- `-t, --search-type`: Type of search, `hybrid`, `bm25`, or `vector` (default: `hybrid`)
- `--bm25-weight`: Weight for BM25 scores in hybrid search (0.0-1.0, default: `0.5`)
- `--vector-weight`: Weight for vector scores in hybrid search (0.0-1.0, default: `0.5`)
- `--no-rrf`: Disable Reciprocal Rank Fusion in Elasticsearch (use manual score combination)
- `--rerank`: Enable reranking, `rrf` or `neural` (default: disabled)
- `--rerank-model`: Cross-encoder model name for neural reranking (default: `cross-encoder/ms-marco-MiniLM-L-6-v2`)
- `--rrf-k`: RRF rank constant k (default: `60`)
- `--rerank-top-k`: Number of candidates for reranking (default: `top_k * 2`)
- `--embedding-model`: Embedding model name (default: `all-MiniLM-L6-v2`)
- `--es-host`: Elasticsearch host (default: `localhost`)
- `--es-port`: Elasticsearch port (default: `9200`)
- `--no-scores`: Hide scores in output
Python API usage:

```python
from retriever import Retriever

# Initialize retriever
retriever = Retriever(
    es_host="localhost",
    es_port=9200,
    embedding_model="all-MiniLM-L6-v2"
)

# Hybrid search (BM25 + vector)
results = retriever.retrieve(
    query="your search query",
    index_name="rag_chunks",
    top_k=10,
    search_type="hybrid",
    bm25_weight=0.5,
    vector_weight=0.5,
    use_rrf=True
)

# Hybrid search with neural reranking
retriever = Retriever(
    es_host="localhost",
    es_port=9200,
    embedding_model="all-MiniLM-L6-v2",
    rerank_method="neural",
    rerank_model="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
results = retriever.retrieve(
    query="your search query",
    index_name="rag_chunks",
    top_k=10,
    search_type="hybrid"
)

# Display results
output = retriever.format_results(results)
print(output)
```

Two strategies are available for combining BM25 and vector results:

1. **Reciprocal Rank Fusion (RRF)** (default, recommended)
   - Combines results from BM25 and vector search without score normalization
   - More robust and performs better in most cases
   - Automatically handles score differences between the two search methods
2. **Manual Score Combination**
   - Runs both searches separately and combines scores with weights
   - Normalizes scores before combining
   - Provides more control but requires tuning
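The manual combination path (normalize, then weight) can be sketched in a few lines; `minmax` and `combine_scores` are illustrative helpers, not the retriever's internals:

```python
def minmax(scores):
    """Normalize a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combine_scores(bm25_hits, vector_hits, bm25_weight=0.5, vector_weight=0.5):
    """Weighted combination of two {doc_id: raw_score} maps after
    normalizing each score distribution independently."""
    def normalized(hits):
        ids = list(hits)
        return dict(zip(ids, minmax([hits[i] for i in ids]))) if hits else {}
    b, v = normalized(bm25_hits), normalized(vector_hits)
    combined = {
        doc: bm25_weight * b.get(doc, 0.0) + vector_weight * v.get(doc, 0.0)
        for doc in set(b) | set(v)
    }
    # Highest combined score first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

Because BM25 scores are unbounded while cosine similarities are not, the per-method normalization step is what makes the weights comparable; this is also why RRF, which ignores raw scores entirely, tends to be more robust.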
The system supports reranking as a post-processing step to improve final ordering of search results. Reranking can be applied after initial retrieval to get better ranking quality.
1. **Reciprocal Rank Fusion (RRF)** (post-processing)
   - Combines multiple result lists (e.g., BM25 and vector results) using RRF
   - Fast and effective for combining heterogeneous search results
   - No model required; works with any result lists
   - Formula: `score = sum(1 / (k + rank))` over each result list
2. **Neural Reranking (Cross-Encoder)**
   - Uses a cross-encoder model to score query-document pairs
   - More accurate than RRF but slower
   - Requires a pre-trained cross-encoder model
   - Better for final ordering when you have compute resources
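The RRF formula above takes only a few lines to implement; `rrf_fuse` is an illustrative sketch, not the module's actual code:

```python
def rrf_fuse(result_lists, k=60, top_k=10):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
    with rank starting at 1. `result_lists` holds ranked lists of doc ids."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in several lists accumulate the largest scores
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Note that only ranks matter: a document that is first in one list and second in the other beats one that appears in a single list, regardless of the raw BM25 or cosine scores.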
```bash
# Rerank results using RRF
python retriever.py "your search query" -i rag_chunks --rerank rrf

# Rerank results using a neural model
python retriever.py "your search query" -i rag_chunks --rerank neural

# Neural reranking with a custom model
python retriever.py "your search query" -i rag_chunks \
    --rerank neural \
    --rerank-model cross-encoder/ms-marco-MiniLM-L-12-v2 \
    --rerank-top-k 50

# Rerank results from a JSON file
python reranker.py "your query" results.json --method neural -k 10 -o reranked_results.json
```

Python API usage:

```python
from reranker import Reranker

# Initialize RRF reranker
rrf_reranker = Reranker(method="rrf", rrf_rank_constant=60)

# Initialize neural reranker
neural_reranker = Reranker(
    method="neural",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

# Rerank a single result list
reranked = neural_reranker.rerank(
    query="your search query",
    results=initial_results,
    top_k=10
)

# Rerank hybrid results (combine BM25 and vector results)
reranked = rrf_reranker.rerank_hybrid_results(
    bm25_results=bm25_results,
    vector_results=vector_results,
    query="your search query",
    top_k=10
)
```

Available cross-encoder models for neural reranking:
- `cross-encoder/ms-marco-MiniLM-L-6-v2` (default, fast)
- `cross-encoder/ms-marco-MiniLM-L-12-v2` (better quality, slower)
- `cross-encoder/ms-marco-electra-base` (high quality, slowest)
Note: Neural reranking is slower than RRF but provides better accuracy, especially for final ordering.
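As a sketch of the neural path, the `sentence-transformers` `CrossEncoder` class scores (query, document) pairs directly; `order_by_scores` is an illustrative helper, and the documents shown are placeholders:

```python
def order_by_scores(results, scores, top_k=10):
    """Sort results by their cross-encoder scores, highest first."""
    paired = sorted(zip(results, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in paired][:top_k]

if __name__ == "__main__":
    try:  # model is downloaded on first use
        from sentence_transformers import CrossEncoder
        model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        docs = ["first candidate passage", "second candidate passage"]
        scores = model.predict([("your search query", d) for d in docs])
        print(order_by_scores(docs, list(scores)))
    except Exception as exc:
        print(f"skipping live reranking demo: {exc}")
```

Unlike a bi-encoder, the cross-encoder sees query and document together, which is why it ranks better but cannot be precomputed into an index.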
The system generates a JSON file with the following structure:
```json
{
  "document_info": {
    "source_file": "path/to/document.pdf",
    "total_pages": 10,
    "total_chunks": 25,
    "chunk_types": {
      "text": 20,
      "table": 3,
      "image": 2
    }
  },
  "chunks": [
    {
      "chunk_id": 0,
      "type": "text",
      "page": 1,
      "content": "Text content here...",
      "metadata": {
        "chunk_size": 500,
        "start_char": 0,
        "end_char": 500
      }
    },
    {
      "chunk_id": 1,
      "type": "table",
      "page": 2,
      "headers": ["Column 1", "Column 2"],
      "rows": [
        ["Value 1", "Value 2"],
        ["Value 3", "Value 4"]
      ],
      "metadata": {
        "num_rows": 2,
        "num_cols": 2
      }
    },
    {
      "chunk_id": 2,
      "type": "image",
      "page": 3,
      "caption": "A diagram showing the process flow",
      "image_path": "output/extracted_images/img_page_3_idx_0.png",
      "metadata": {
        "image_index": 0,
        "format": "png"
      }
    }
  ]
}
```

After generating embeddings, each chunk will include an `embedding` field:

```json
{
  "chunk_id": 0,
  "type": "text",
  "page": 1,
  "content": "Text content here...",
  "embedding": [0.123, -0.456, 0.789, ...],
  "embedding_model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "metadata": {
    "chunk_size": 500,
    "start_char": 0,
    "end_char": 500
  }
}
```

The `document_info` will also include embedding metadata:

```json
{
  "document_info": {
    "source_file": "path/to/document.pdf",
    "embedding_model": "all-MiniLM-L6-v2",
    "embedding_dim": 384,
    ...
  }
}
```

Configuration options:
- `chunk_size`: Maximum size of text chunks in characters (default: 1024)
- `chunk_overlap`: Overlap between chunks in characters (default: 100)
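The chunking these parameters describe can be sketched as a sliding character window; `chunk_text` is illustrative and not necessarily how `pdf_chunker.py` implements it:

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=100):
    """Split text into fixed-size character windows; consecutive windows
    share `chunk_overlap` characters so sentences cut at a boundary
    still appear whole in one of the two chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({"content": piece, "start_char": start,
                       "end_char": start + len(piece)})
        if start + chunk_size >= len(text):
            break
    return chunks
```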
The system automatically detects headers and footers by:
- Analyzing text position (top/bottom 15% of page)
- Identifying repeated patterns across pages
- Detecting page numbers
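The repeated-pattern part of this detection can be sketched as follows; `repeated_lines` is an illustrative helper (the position-based top/bottom 15% band test is omitted), with digits masked so that varying page numbers still match across pages:

```python
import re
from collections import Counter

def repeated_lines(pages, min_ratio=0.5):
    """Lines appearing on at least `min_ratio` of pages are likely
    headers/footers. `pages` is a list of per-page sets of line texts."""
    masked = [{re.sub(r"\d+", "#", ln) for ln in page} for page in pages]
    counts = Counter(ln for page in masked for ln in page)
    threshold = max(2, int(len(pages) * min_ratio))
    return {ln for ln, n in counts.items() if n >= threshold}
```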
The system uses the `Salesforce/blip-image-captioning-base` model for image captioning. The model is automatically downloaded on first use.
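For reference, captioning with BLIP through the `transformers` library might look like the following sketch; the image path is illustrative, and `caption_image` is not the package's actual helper:

```python
def caption_image(image, processor, model, max_new_tokens=30):
    """Generate a caption for a PIL image with a BLIP processor/model pair."""
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)

if __name__ == "__main__":
    try:  # model is ~1 GB, downloaded on first use
        from PIL import Image
        from transformers import BlipProcessor, BlipForConditionalGeneration
        name = "Salesforce/blip-image-captioning-base"
        processor = BlipProcessor.from_pretrained(name)
        model = BlipForConditionalGeneration.from_pretrained(name)
        img = Image.open(
            "output/extracted_images/img_page_3_idx_0.png").convert("RGB")
        print(caption_image(img, processor, model))
    except Exception as exc:
        print(f"skipping live captioning demo: {exc}")
```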
The embedding generator uses sentence-transformers models. The default model `all-MiniLM-L6-v2` provides a good balance between speed and quality:
- Embedding Dimension: 384
- Speed: Fast inference
- Quality: Good semantic understanding
For better quality (but slower), use `all-mpnet-base-v2` (768 dimensions).
Embeddings are generated for:
- Text chunks: Direct embedding of text content
- Tables: Summary text embedding (headers + sample rows)
- Images: Caption embedding (generated by BLIP model)
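A sketch of how these embeddings might be produced with `sentence-transformers`; `table_summary` is an illustrative stand-in for the "headers + sample rows" strategy mentioned above, not the package's actual code:

```python
def table_summary(headers, rows, max_rows=3):
    """Flatten a table into a short text for embedding: the header row
    plus a few sample rows, pipe-separated."""
    lines = [" | ".join(headers)]
    lines += [" | ".join(map(str, r)) for r in rows[:max_rows]]
    return "\n".join(lines)

if __name__ == "__main__":
    try:  # model is downloaded on first use
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        texts = ["a plain text chunk",
                 table_summary(["Column 1", "Column 2"], [["a", "b"]])]
        embeddings = model.encode(texts, batch_size=32)
        print(embeddings.shape)  # (2, 384) for this model
    except Exception as exc:
        print(f"skipping live embedding demo: {exc}")
```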
```
RAG_system/
├── pdf_chunker.py             # Main PDF processing script
├── embedding_generator.py     # Embedding generation script
├── generate_embeddings.py     # Convenience script to embed all JSON files
├── elasticsearch_indexer.py   # Elasticsearch indexing and hybrid search
├── retriever.py               # Retrieval interface with hybrid search support
├── reranker.py                # Reranking module (RRF and neural reranking)
├── check_index.py             # Utility to check Elasticsearch index contents
├── requirements.txt           # Python dependencies
├── setup_env.sh               # Shell script for environment setup
├── setup_env.py               # Python script for environment setup
├── README.md                  # This file
├── .venv/                     # Virtual environment (created after setup)
├── output/                    # Output directory (created automatically)
│   ├── extracted_images/      # Extracted images
│   └── *.json                 # Output JSON files (with embeddings)
└── test_document/             # Test PDF files
    ├── image_extraction_example.pdf
    └── table_extraction_example.pdf
```
- PyMuPDF (fitz): PDF processing and text extraction
- transformers: BLIP model for image captioning
- torch: Machine learning framework
- Pillow: Image processing
- pandas: Data manipulation (for table handling)
- numpy: Numerical operations
- sentence-transformers: Vector embedding generation
- elasticsearch: Elasticsearch client for indexing and hybrid search (BM25 + vector)
If you encounter issues loading the BLIP model:
- Ensure you have sufficient disk space (model is ~1GB)
- Check your internet connection for model download
- The system will continue without image captioning if the model fails to load
For large PDFs:
- Process files in batches
- Reduce chunk size if memory is limited
- Consider using GPU if available (CUDA)
The current table extraction uses basic heuristics. For complex tables, consider:
- Using specialized libraries like `camelot-py` or `tabula-py`
- Manual table extraction for critical documents
This project is provided as-is for educational and research purposes.