A comprehensive PDF processing system that extracts content from PDFs, removes headers/footers, converts tables to JSON, processes images with BLIP vision model for captions, and generates vector embeddings for RAG (Retrieval-Augmented Generation) systems.
- Header/Footer Removal: Automatically detects and removes headers, footers, and duplicate page numbers
- Table Extraction: Extracts tables and converts them to structured JSON format with headers and rows
- Image Captioning: Uses BLIP vision model to generate concise captions for images
- Text Chunking: Splits text content into manageable chunks with overlap
- Vector Embeddings: Generates embeddings for text chunks, table summaries, and image captions
- Elasticsearch Indexing: Indexes chunks with embeddings into Elasticsearch for efficient retrieval
- Hybrid Search: Combines BM25 (keyword) and vector (semantic) search for better retrieval results
- JSON Output: Exports all chunks (text, tables, images) with embeddings to a structured JSON file
- Python 3.8 or higher
- pip
```bash
# Using the shell script (macOS/Linux)
chmod +x setup_env.sh
./setup_env.sh

# Or using the Python script (cross-platform)
python3 setup_env.py
```

Manual setup:

```bash
# Create virtual environment
python3 -m venv .venv

# Activate virtual environment
# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip

# Install dependencies
pip install -r requirements.txt
```

Basic usage from Python:

```python
from pdf_chunker import PDFChunker

# Initialize chunker
chunker = PDFChunker("path/to/your/document.pdf", output_dir="output")

# Process PDF
chunker.process_pdf(chunk_size=1024, chunk_overlap=100)

# Save to JSON
chunker.save_to_json("output_chunks.json")

# Close document
chunker.close()
```

Or run the script directly; its main function processes the files in the test_document directory:

```bash
# Activate virtual environment first
source .venv/bin/activate

# Run the script
python pdf_chunker.py
```

This will process:
- test_document/image_extraction_example.pdf
- test_document/table_extraction_example.pdf
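After a run, the output file can be sanity-checked with a few lines of standard-library Python. `summarize_chunks` is an illustrative helper (not part of the package), and it assumes the chunk schema shown later in this README:

```python
import json
import os
from collections import Counter

def summarize_chunks(data):
    """Count chunks by type in a loaded output file."""
    return Counter(chunk["type"] for chunk in data.get("chunks", []))

if __name__ == "__main__":
    path = "output/image_extraction_example_chunks.json"  # illustrative path
    if os.path.exists(path):
        with open(path) as f:
            print(summarize_chunks(json.load(f)))
```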
After processing PDFs and generating chunks, you can generate vector embeddings for all chunks:
```bash
# Generate embeddings for a specific JSON file
python embedding_generator.py output/image_extraction_example_chunks.json

# Generate embeddings with a custom model and output file
python embedding_generator.py output/table_extraction_example_chunks.json \
    -o output/table_extraction_example_chunks_with_embeddings.json \
    -m all-mpnet-base-v2 \
    -b 64

# Generate embeddings for all JSON files in the output directory
python generate_embeddings.py
```

Options:
- `-o, --output`: Output file path (default: overwrites input)
- `-m, --model`: Sentence transformer model name (default: `all-MiniLM-L6-v2`)
  - `all-MiniLM-L6-v2`: Fast, lightweight, 384 dimensions (recommended for most use cases)
  - `all-mpnet-base-v2`: Better quality, 768 dimensions (slower but more accurate)
- `-b, --batch-size`: Batch size for embedding generation (default: 32)
After generating embeddings, you can index the chunks into Elasticsearch for retrieval:
```bash
# Index chunks from a JSON file into Elasticsearch
python elasticsearch_indexer.py output/image_extraction_example_chunks_with_embeddings.json -i rag_chunks

# Index with a custom Elasticsearch connection
python elasticsearch_indexer.py output/table_extraction_example_chunks_with_embeddings.json \
    -i rag_chunks \
    --host localhost \
    --port 9200 \
    --scheme http

# Delete the existing index and recreate it
python elasticsearch_indexer.py output/image_extraction_example_chunks_with_embeddings.json \
    -i rag_chunks \
    --delete-existing
```

Options:
- `-i, --index`: Elasticsearch index name (default: `rag_chunks`)
- `--host`: Elasticsearch host (default: `localhost`)
- `--port`: Elasticsearch port (default: `9200`)
- `--scheme`: Connection scheme, `http` or `https` (default: `http`)
- `--username`: Elasticsearch username (optional)
- `--password`: Elasticsearch password (optional)
- `--delete-existing`: Delete the existing index if it exists
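Under the hood, indexing for hybrid search presumably pairs a BM25-searchable text field with a `dense_vector` field. The sketch below uses the official `elasticsearch` Python client; the mapping, field names, and `build_actions` helper are illustrative assumptions, not the actual indexer code (the 384 dimensions match the default embedding model described in this README):

```python
def chunk_mapping(dims=384):
    """A plausible index mapping: BM25 over `content`, kNN over `embedding`."""
    return {
        "properties": {
            "content": {"type": "text"},
            "type": {"type": "keyword"},
            "page": {"type": "integer"},
            "embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",
            },
        }
    }

def build_actions(chunks, index_name):
    """Turn chunk dicts into elasticsearch.helpers.bulk actions."""
    for chunk in chunks:
        yield {"_index": index_name, "_id": chunk["chunk_id"], "_source": chunk}

if __name__ == "__main__":
    try:  # requires a running Elasticsearch and `pip install elasticsearch`
        from elasticsearch import Elasticsearch, helpers
        es = Elasticsearch("http://localhost:9200")
        es.indices.create(index="rag_chunks", mappings=chunk_mapping())
        chunks = [{"chunk_id": 0, "type": "text", "page": 1,
                   "content": "example", "embedding": [0.0] * 384}]
        helpers.bulk(es, build_actions(chunks, "rag_chunks"))
    except Exception as exc:
        print(f"skipping live indexing demo: {exc}")
```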
The system supports hybrid search that combines BM25 (keyword-based) and vector (semantic) search for better retrieval results.
```bash
# Hybrid search (BM25 + vector) - recommended
python retriever.py "your search query" -i rag_chunks -k 10

# BM25 only (keyword search)
python retriever.py "your search query" -i rag_chunks -t bm25

# Vector only (semantic search)
python retriever.py "your search query" -i rag_chunks -t vector

# Custom weights for hybrid search
python retriever.py "your search query" -i rag_chunks \
    --bm25-weight 0.7 \
    --vector-weight 0.3

# Disable RRF (use manual score combination)
python retriever.py "your search query" -i rag_chunks --no-rrf

# Enable RRF reranking (post-processing)
python retriever.py "your search query" -i rag_chunks --rerank rrf

# Enable neural reranking (post-processing)
python retriever.py "your search query" -i rag_chunks --rerank neural

# Neural reranking with a custom model
python retriever.py "your search query" -i rag_chunks \
    --rerank neural \
    --rerank-model cross-encoder/ms-marco-MiniLM-L-12-v2
```

Arguments:
- `query`: Search query text (required)
- `-i, --index`: Elasticsearch index name (default: `rag_chunks`)
- `-k, --top-k`: Number of results to return (default: `10`)
- `-t, --search-type`: Type of search, `hybrid`, `bm25`, or `vector` (default: `hybrid`)
- `--bm25-weight`: Weight for BM25 scores in hybrid search (0.0-1.0, default: `0.5`)
- `--vector-weight`: Weight for vector scores in hybrid search (0.0-1.0, default: `0.5`)
- `--no-rrf`: Disable Reciprocal Rank Fusion in Elasticsearch (use manual score combination)
- `--rerank`: Enable reranking, `rrf` or `neural` (default: disabled)
- `--rerank-model`: Cross-encoder model name for neural reranking (default: `cross-encoder/ms-marco-MiniLM-L-6-v2`)
- `--rrf-k`: RRF rank constant k (default: `60`)
- `--rerank-top-k`: Number of candidates for reranking (default: `top_k * 2`)
- `--embedding-model`: Embedding model name (default: `all-MiniLM-L6-v2`)
- `--es-host`: Elasticsearch host (default: `localhost`)
- `--es-port`: Elasticsearch port (default: `9200`)
- `--no-scores`: Hide scores in output
Python API usage:

```python
from retriever import Retriever

# Initialize retriever
retriever = Retriever(
    es_host="localhost",
    es_port=9200,
    embedding_model="all-MiniLM-L6-v2"
)

# Hybrid search (BM25 + vector)
results = retriever.retrieve(
    query="your search query",
    index_name="rag_chunks",
    top_k=10,
    search_type="hybrid",
    bm25_weight=0.5,
    vector_weight=0.5,
    use_rrf=True
)

# Hybrid search with neural reranking
retriever = Retriever(
    es_host="localhost",
    es_port=9200,
    embedding_model="all-MiniLM-L6-v2",
    rerank_method="neural",
    rerank_model="cross-encoder/ms-marco-MiniLM-L-6-v2"
)
results = retriever.retrieve(
    query="your search query",
    index_name="rag_chunks",
    top_k=10,
    search_type="hybrid"
)

# Display results
output = retriever.format_results(results)
print(output)
```

Two strategies are available for combining BM25 and vector results:

1. **Reciprocal Rank Fusion (RRF)** (default, recommended)
   - Combines results from BM25 and vector search without score normalization
   - More robust and performs better in most cases
   - Automatically handles score differences between the two search methods
2. **Manual Score Combination**
   - Runs both searches separately and combines scores with weights
   - Normalizes scores before combining
   - Provides more control but requires tuning
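The manual combination path (normalize, then weight) can be sketched in a few lines; `minmax` and `combine_scores` are illustrative helpers, not the retriever's internals:

```python
def minmax(scores):
    """Normalize a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combine_scores(bm25_hits, vector_hits, bm25_weight=0.5, vector_weight=0.5):
    """Weighted combination of two {doc_id: raw_score} maps after
    normalizing each score distribution independently."""
    def normalized(hits):
        ids = list(hits)
        return dict(zip(ids, minmax([hits[i] for i in ids]))) if hits else {}
    b, v = normalized(bm25_hits), normalized(vector_hits)
    combined = {
        doc: bm25_weight * b.get(doc, 0.0) + vector_weight * v.get(doc, 0.0)
        for doc in set(b) | set(v)
    }
    # Highest combined score first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

Because BM25 scores are unbounded while cosine similarities are not, the per-method normalization step is what makes the weights comparable; this is also why RRF, which ignores raw scores entirely, tends to be more robust.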
The system supports reranking as a post-processing step to improve final ordering of search results. Reranking can be applied after initial retrieval to get better ranking quality.
1. **Reciprocal Rank Fusion (RRF)** (post-processing)
   - Combines multiple result lists (e.g., BM25 and vector results) using RRF
   - Fast and effective for combining heterogeneous search results
   - No model required; works with any result lists
   - Formula: `score = sum(1 / (k + rank))` over each result list
2. **Neural Reranking (Cross-Encoder)**
   - Uses a cross-encoder model to score query-document pairs
   - More accurate than RRF but slower
   - Requires a pre-trained cross-encoder model
   - Better for final ordering when you have compute resources
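The RRF formula above takes only a few lines to implement; `rrf_fuse` is an illustrative sketch, not the module's actual code:

```python
def rrf_fuse(result_lists, k=60, top_k=10):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
    with rank starting at 1. `result_lists` holds ranked lists of doc ids."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in several lists accumulate the largest scores
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Note that only ranks matter: a document that is first in one list and second in the other beats one that appears in a single list, regardless of the raw BM25 or cosine scores.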
```bash
# Rerank results using RRF
python retriever.py "your search query" -i rag_chunks --rerank rrf

# Rerank results using a neural model
python retriever.py "your search query" -i rag_chunks --rerank neural

# Neural reranking with a custom model
python retriever.py "your search query" -i rag_chunks \
    --rerank neural \
    --rerank-model cross-encoder/ms-marco-MiniLM-L-12-v2 \
    --rerank-top-k 50

# Rerank results from a JSON file
python reranker.py "your query" results.json --method neural -k 10 -o reranked_results.json
```

Python API usage:

```python
from reranker import Reranker

# Initialize RRF reranker
rrf_reranker = Reranker(method="rrf", rrf_rank_constant=60)

# Initialize neural reranker
neural_reranker = Reranker(
    method="neural",
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

# Rerank a single result list
reranked = neural_reranker.rerank(
    query="your search query",
    results=initial_results,
    top_k=10
)

# Rerank hybrid results (combine BM25 and vector results)
reranked = rrf_reranker.rerank_hybrid_results(
    bm25_results=bm25_results,
    vector_results=vector_results,
    query="your search query",
    top_k=10
)
```

Available cross-encoder models for neural reranking:
- `cross-encoder/ms-marco-MiniLM-L-6-v2` (default, fast)
- `cross-encoder/ms-marco-MiniLM-L-12-v2` (better quality, slower)
- `cross-encoder/ms-marco-electra-base` (high quality, slowest)
Note: Neural reranking is slower than RRF but provides better accuracy, especially for final ordering.
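As a sketch of the neural path, the `sentence-transformers` `CrossEncoder` class scores (query, document) pairs directly; `order_by_scores` is an illustrative helper, and the documents shown are placeholders:

```python
def order_by_scores(results, scores, top_k=10):
    """Sort results by their cross-encoder scores, highest first."""
    paired = sorted(zip(results, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in paired][:top_k]

if __name__ == "__main__":
    try:  # model is downloaded on first use
        from sentence_transformers import CrossEncoder
        model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        docs = ["first candidate passage", "second candidate passage"]
        scores = model.predict([("your search query", d) for d in docs])
        print(order_by_scores(docs, list(scores)))
    except Exception as exc:
        print(f"skipping live reranking demo: {exc}")
```

Unlike a bi-encoder, the cross-encoder sees query and document together, which is why it ranks better but cannot be precomputed into an index.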
The system generates a JSON file with the following structure:
```json
{
  "document_info": {
    "source_file": "path/to/document.pdf",
    "total_pages": 10,
    "total_chunks": 25,
    "chunk_types": {
      "text": 20,
      "table": 3,
      "image": 2
    }
  },
  "chunks": [
    {
      "chunk_id": 0,
      "type": "text",
      "page": 1,
      "content": "Text content here...",
      "metadata": {
        "chunk_size": 500,
        "start_char": 0,
        "end_char": 500
      }
    },
    {
      "chunk_id": 1,
      "type": "table",
      "page": 2,
      "headers": ["Column 1", "Column 2"],
      "rows": [
        ["Value 1", "Value 2"],
        ["Value 3", "Value 4"]
      ],
      "metadata": {
        "num_rows": 2,
        "num_cols": 2
      }
    },
    {
      "chunk_id": 2,
      "type": "image",
      "page": 3,
      "caption": "A diagram showing the process flow",
      "image_path": "output/extracted_images/img_page_3_idx_0.png",
      "metadata": {
        "image_index": 0,
        "format": "png"
      }
    }
  ]
}
```

After generating embeddings, each chunk will include an `embedding` field:

```json
{
  "chunk_id": 0,
  "type": "text",
  "page": 1,
  "content": "Text content here...",
  "embedding": [0.123, -0.456, 0.789, ...],
  "embedding_model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "metadata": {
    "chunk_size": 500,
    "start_char": 0,
    "end_char": 500
  }
}
```

The `document_info` will also include embedding metadata:

```json
{
  "document_info": {
    "source_file": "path/to/document.pdf",
    "embedding_model": "all-MiniLM-L6-v2",
    "embedding_dim": 384,
    ...
  }
}
```

Configuration options:
- `chunk_size`: Maximum size of text chunks in characters (default: 1024)
- `chunk_overlap`: Overlap between chunks in characters (default: 100)
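The chunking these parameters describe can be sketched as a sliding character window; `chunk_text` is illustrative and not necessarily how `pdf_chunker.py` implements it:

```python
def chunk_text(text, chunk_size=1024, chunk_overlap=100):
    """Split text into fixed-size character windows; consecutive windows
    share `chunk_overlap` characters so sentences cut at a boundary
    still appear whole in one of the two chunks."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({"content": piece, "start_char": start,
                       "end_char": start + len(piece)})
        if start + chunk_size >= len(text):
            break
    return chunks
```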
The system automatically detects headers and footers by:
- Analyzing text position (top/bottom 15% of page)
- Identifying repeated patterns across pages
- Detecting page numbers
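The repeated-pattern part of this detection can be sketched as follows; `repeated_lines` is an illustrative helper (the position-based top/bottom 15% band test is omitted), with digits masked so that varying page numbers still match across pages:

```python
import re
from collections import Counter

def repeated_lines(pages, min_ratio=0.5):
    """Lines appearing on at least `min_ratio` of pages are likely
    headers/footers. `pages` is a list of per-page sets of line texts."""
    masked = [{re.sub(r"\d+", "#", ln) for ln in page} for page in pages]
    counts = Counter(ln for page in masked for ln in page)
    threshold = max(2, int(len(pages) * min_ratio))
    return {ln for ln, n in counts.items() if n >= threshold}
```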
The system uses the `Salesforce/blip-image-captioning-base` model for image captioning. The model is automatically downloaded on first use.
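For reference, captioning with BLIP through the `transformers` library might look like the following sketch; the image path is illustrative, and `caption_image` is not the package's actual helper:

```python
def caption_image(image, processor, model, max_new_tokens=30):
    """Generate a caption for a PIL image with a BLIP processor/model pair."""
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(out[0], skip_special_tokens=True)

if __name__ == "__main__":
    try:  # model is ~1 GB, downloaded on first use
        from PIL import Image
        from transformers import BlipProcessor, BlipForConditionalGeneration
        name = "Salesforce/blip-image-captioning-base"
        processor = BlipProcessor.from_pretrained(name)
        model = BlipForConditionalGeneration.from_pretrained(name)
        img = Image.open(
            "output/extracted_images/img_page_3_idx_0.png").convert("RGB")
        print(caption_image(img, processor, model))
    except Exception as exc:
        print(f"skipping live captioning demo: {exc}")
```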
The embedding generator uses sentence-transformers models. The default model `all-MiniLM-L6-v2` provides a good balance between speed and quality:
- Embedding Dimension: 384
- Speed: Fast inference
- Quality: Good semantic understanding
For better quality (but slower), use `all-mpnet-base-v2` (768 dimensions).
Embeddings are generated for:
- Text chunks: Direct embedding of text content
- Tables: Summary text embedding (headers + sample rows)
- Images: Caption embedding (generated by BLIP model)
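A sketch of how these embeddings might be produced with `sentence-transformers`; `table_summary` is an illustrative stand-in for the "headers + sample rows" strategy mentioned above, not the package's actual code:

```python
def table_summary(headers, rows, max_rows=3):
    """Flatten a table into a short text for embedding: the header row
    plus a few sample rows, pipe-separated."""
    lines = [" | ".join(headers)]
    lines += [" | ".join(map(str, r)) for r in rows[:max_rows]]
    return "\n".join(lines)

if __name__ == "__main__":
    try:  # model is downloaded on first use
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        texts = ["a plain text chunk",
                 table_summary(["Column 1", "Column 2"], [["a", "b"]])]
        embeddings = model.encode(texts, batch_size=32)
        print(embeddings.shape)  # (2, 384) for this model
    except Exception as exc:
        print(f"skipping live embedding demo: {exc}")
```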
```
RAG_system/
├── pdf_chunker.py             # Main PDF processing script
├── embedding_generator.py     # Embedding generation script
├── generate_embeddings.py     # Convenience script to embed all JSON files
├── elasticsearch_indexer.py   # Elasticsearch indexing and hybrid search
├── retriever.py               # Retrieval interface with hybrid search support
├── reranker.py                # Reranking module (RRF and neural reranking)
├── check_index.py             # Utility to check Elasticsearch index contents
├── requirements.txt           # Python dependencies
├── setup_env.sh               # Shell script for environment setup
├── setup_env.py               # Python script for environment setup
├── README.md                  # This file
├── .venv/                     # Virtual environment (created after setup)
├── output/                    # Output directory (created automatically)
│   ├── extracted_images/      # Extracted images
│   └── *.json                 # Output JSON files (with embeddings)
└── test_document/             # Test PDF files
    ├── image_extraction_example.pdf
    └── table_extraction_example.pdf
```
- PyMuPDF (fitz): PDF processing and text extraction
- transformers: BLIP model for image captioning
- torch: Machine learning framework
- Pillow: Image processing
- pandas: Data manipulation (for table handling)
- numpy: Numerical operations
- sentence-transformers: Vector embedding generation
- elasticsearch: Elasticsearch client for indexing and hybrid search (BM25 + vector)
If you encounter issues loading the BLIP model:
- Ensure you have sufficient disk space (model is ~1GB)
- Check your internet connection for model download
- The system will continue without image captioning if the model fails to load
For large PDFs:
- Process files in batches
- Reduce chunk size if memory is limited
- Consider using GPU if available (CUDA)
The current table extraction uses basic heuristics. For complex tables, consider:
- Using specialized libraries like `camelot-py` or `tabula-py`
- Manual table extraction for critical documents
This project is provided as-is for educational and research purposes.