Retrieval-first document search system for factory and operations workflows, designed for GST invoices, bills of materials, e-way bills, and financial PDFs.
This project focuses on the business problem that matters most in practice: given an invoice number, supplier, amount, line item, material code, or phrase, return the correct document, page, and supporting evidence quickly.
The system ingests PDF documents, extracts text and lightweight metadata, creates multiple evidence units, and serves search through:
- a Python CLI
- a FastAPI backend
- a Streamlit interface
- a Docker Compose deployment stack
It is intentionally retrieval-first rather than parser-first. Structured extraction is used as enrichment, but the primary product is evidence-backed document retrieval.
- PDF ingestion from files or folders
- Shared inbox ingestion workflow through API and Streamlit
- Multi-store retrieval architecture with Postgres, OpenSearch, Qdrant, and Redis
- Evidence chunking across text blocks, lines, table rows, and header fields
- Query routing for exact identifier search, lexical lookup, and hybrid retrieval
- Evidence-first responses with file path, page, snippet, and confidence
- Optional Gemini-based answer synthesis on top of retrieved evidence
- Streamlit surfaces generated answers, route selection, and answer backend for fuzzy semantic queries
- FastAPI search endpoints for integration
- Streamlit UI for business users
- Retrieval and extraction evaluation workflows
This repository is strongest as a retrieval-first search system for business documents.
Best-fit use cases:
- invoice number lookup
- line-item and amount lookup
- BOM and material code search
- e-way bill lookup
- evidence-backed internal document search
flowchart LR
U[User / CLI / API / UI] --> QS[Query Service]
QS --> RT[Query Router]
RT --> EX[Exact Match]
RT --> LX[Lexical Retrieval]
RT --> HY[Hybrid Retrieval]
EX --> PG[(Postgres)]
LX --> PG
LX --> OS[(OpenSearch)]
HY --> PG
HY --> OS
HY --> QD[(Qdrant)]
QS --> RR[Rerank and Merge]
RR --> AS[Answer Service]
AS --> GM[Gemini Optional]
RR --> OUT[Evidence-First Response]
flowchart TD
A[PDF File or Folder] --> B[Text Extraction]
B --> C[Document Classification]
C --> D[Metadata Extraction]
D --> E[Chunk Builder]
E --> F1[Text Chunks]
E --> F2[Line Chunks]
E --> F3[Table Row Chunks]
E --> F4[Header Field Chunks]
D --> G[(Postgres Documents)]
F1 --> H[(Postgres Chunks)]
F2 --> H
F3 --> H
F4 --> H
H --> I[Embedding Service]
I --> J[(Qdrant)]
H --> K[(OpenSearch)]
A --> L[(Local Storage)]
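The chunk builder step in the ingestion flow can be pictured with a minimal sketch. The `Chunk` dataclass, the `build_chunks` function, and the two regex heuristics below are hypothetical names and rules chosen for illustration; the repository's own chunking logic may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    kind: str  # "text", "line", "table_row", or "header_field"
    page: int
    text: str


# Assumed heuristics (not taken from the repository):
# "Label: value" lines are header fields; lines with three or more
# columns separated by wide gaps are table rows.
HEADER_RE = re.compile(r"^[A-Za-z][A-Za-z ./]{1,30}:\s*\S")
COLUMN_GAP_RE = re.compile(r"\s{2,}")


def build_chunks(page: int, page_text: str) -> list[Chunk]:
    """Split one page of extracted text into several kinds of evidence unit."""
    chunks: list[Chunk] = []
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    if lines:
        # One coarse chunk covering the whole page text.
        chunks.append(Chunk("text", page, "\n".join(lines)))
    for line in lines:
        # Every non-empty line is also indexed on its own.
        chunks.append(Chunk("line", page, line))
        if HEADER_RE.match(line):
            chunks.append(Chunk("header_field", page, line))
        elif len(COLUMN_GAP_RE.split(line)) >= 3:
            chunks.append(Chunk("table_row", page, line))
    return chunks
```

Emitting the same line as several chunk kinds is deliberate in this sketch: it lets an identifier query hit a short header-field chunk while a fuzzy query can still match the page-level text chunk.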
- Embedding backend: sentence-transformer
- Embedding model: BAAI/bge-small-en-v1.5
- LLM backend: gemini
- LLM model: gemini-2.5-flash
- Fallback embeddings: local hash embeddings
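To illustrate what a local hash embedding fallback can look like, here is a minimal sketch of the hashing trick. The function names, the 64-dimension default, and the use of SHA-256 are assumptions made for this example, not the project's actual implementation.

```python
import hashlib
import math


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Deterministic bag-of-tokens embedding built with the hashing trick."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        # First four bytes pick the bucket, the fifth picks the sign.
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Return a unit vector, or the zero vector when nothing was accumulated.
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))
```

A fallback like this needs no model download and is fully deterministic, but it only captures token overlap, so it is a weaker substitute for a learned embedding model on fuzzy queries.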
Why this setup works well for this project:
- identifiers, codes, totals, and rates depend heavily on lexical retrieval and normalization
- semantic embeddings help more on fuzzy text queries than exact business lookup
- Gemini is used as a summarization layer after retrieval, not as the primary search engine
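The split between exact, lexical, and hybrid retrieval can be sketched as a small rule-based router. The regular expression, the routing rules, and the function names here are illustrative assumptions; the project's router may use different signals.

```python
import re

# Assumed pattern: an identifier is a single token containing at least one
# digit, with segments joined by separators, e.g. "TF/2026-27/001".
IDENTIFIER_RE = re.compile(r"(?=.*\d)[A-Za-z0-9]+(?:[/\-_.][A-Za-z0-9]+)+")


def normalize_identifier(value: str) -> str:
    """Uppercase and drop separators so formatting variants compare equal."""
    return re.sub(r"[^A-Za-z0-9]", "", value).upper()


def route_query(query: str) -> str:
    """Return 'exact', 'lexical', or 'hybrid' for a user query."""
    q = query.strip()
    if IDENTIFIER_RE.fullmatch(q):
        return "exact"
    # Queries that mention numbers (rates, totals) favour lexical lookup.
    if any(ch.isdigit() for ch in q):
        return "lexical"
    # Everything else is treated as a fuzzy text query.
    return "hybrid"
```

With these rules, `route_query("TF/2026-27/001")` takes the exact path, a question about a rate of 85000.00 goes to lexical retrieval, and "annual cloud hosting invoice" falls through to hybrid retrieval.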
src/factory_rag/
  core/        runtime and configuration
  processing/  extraction, chunking, metadata, routing, embeddings helpers
  services/    ingestion, retrieval, query execution, answer generation
  stores/      Postgres, OpenSearch, Qdrant, Redis, local storage
apps/
  streamlit_app.py
eval/
  datasets/
  runners/
  checks/
tests/
  unit tests for routing and parsing behavior
scripts/
  data generation and helper scripts
pip install -e .
docker compose up -d --build
This starts:
- FastAPI on http://localhost:8000
- Streamlit on http://localhost:8501
- Postgres on localhost:5432
- Qdrant on localhost:6333
- OpenSearch on localhost:9200
- Redis on localhost:6379
python main.py bootstrap
python main.py ingest data
To rebuild documents after chunking or indexing changes:
python main.py ingest golden_set --force
For the Streamlit and Docker deployment flow, the application also supports a configured inbox directory for bulk ingestion. Users can place PDFs in the inbox folder and trigger ingestion from the UI without entering filesystem paths manually.
Verbose retrieval output:
python main.py query "TF/2026-27/001"
python main.py query "which item has the rate of 85000.00?"
python main.py query "find me material code for Seat Foam Cushion"
Business-facing concise output:
python main.py find "TF/2026-27/001"
python main.py find "annual cloud hosting invoice"
python main.py find "Seat Foam Cushion"
Health check:
python main.py health
Run the API locally:
python main.py serve --host 0.0.0.0 --port 8000
Primary endpoints:
- GET /
- GET /health
- GET /metrics
- POST /documents/ingest
- GET /documents/inbox
- POST /documents/ingest/inbox
- GET /documents/{id}
- POST /query
- POST /find
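As an illustration of calling the search API from Python, the sketch below builds a `POST /find` request with the standard library. The JSON body shape `{"query": ...}` and the helper names are assumptions; check the FastAPI schema of the running service for the real request model.

```python
import json
import urllib.request


def build_find_request(
    query: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a POST /find request. The {"query": ...} body is an assumed schema."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/find",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def find(
    query: str, base_url: str = "http://localhost:8000", timeout: float = 30.0
) -> dict:
    """Send the request and decode the JSON response (needs a running API)."""
    request = build_find_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the client easy to test without a live server.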
Run the business-facing UI:
streamlit run apps/streamlit_app.py
The UI surfaces the top match first and shows:
- generated answer when available
- answer backend and retrieval route
- document number
- file name
- page
- storage path
- amount and supplier when available
- evidence snippet or nearby context
It also includes a simple document intake workflow:
- users drop PDFs into the configured inbox folder
- Streamlit triggers bulk ingestion through the API
- the UI displays processed, duplicate, partial, and failed counts
For broad or fuzzy queries, the UI now shows the generated answer returned by /find, including whether the backend answer came from deterministic formatting, extractive logic, or Gemini summarization.
Two evaluation modes are included:
python main.py evaluate-retrieval eval/datasets/golden_set_retrieval_golden.json
python main.py evaluate-retrieval golden_set/final_metrics/final_metrics_retrieval_golden.json
Primary metrics:
- Recall@1
- Recall@3
- Recall@5
- Snippet@1
- Snippet@3
- Snippet@5
- MRR
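For readers unfamiliar with these retrieval metrics, here is a minimal sketch of Recall@k and MRR under the common single-relevant-document definition. The function names and input shapes are hypothetical and are not the evaluation runner's API. Snippet@k is presumably the analogous check at snippet level, but that is an assumption.

```python
def recall_at_k(ranked: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose expected document appears in the top k results."""
    hits = sum(1 for docs, gold in zip(ranked, relevant) if gold in docs[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(ranked: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the expected document; 0 when it is not returned."""
    total = 0.0
    for docs, gold in zip(ranked, relevant):
        if gold in docs:
            total += 1.0 / (docs.index(gold) + 1)
    return total / len(relevant)


# Three toy queries: hit at rank 1, hit at rank 2, and a miss.
ranked_results = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
expected_docs = ["a", "y", "missing"]
print(recall_at_k(ranked_results, expected_docs, 1))        # 1 of 3 queries
print(recall_at_k(ranked_results, expected_docs, 3))        # 2 of 3 queries
print(mean_reciprocal_rank(ranked_results, expected_docs))  # (1 + 0.5 + 0) / 3
```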
Latest holdout retrieval snapshot:
- Dataset: golden_set/final_metrics/final_metrics_retrieval_golden.json
- Queries: 36
- Recall@1: 0.8611
- Recall@3: 0.9722
- Recall@5: 1.0
- Snippet@1: 0.5556
- Snippet@3: 0.6944
- Snippet@5: 0.7778
- MRR: 0.9222
python main.py evaluate eval/datasets/golden_set_extraction_eval.json
python main.py evaluate golden_set/final_metrics/final_metrics_extraction_eval.json
Primary metrics:
- document pass rate
- field accuracy
- line item accuracy
Latest holdout extraction snapshot:
- Dataset: golden_set/final_metrics/final_metrics_extraction_eval.json
- Documents passed: 0/18
- Field accuracy: 0.5591
- Line item accuracy: 0.7222
Unit tests cover routing, metadata normalization, and BOM parsing behavior.
python -m unittest discover -s tests -v
RAG_POSTGRES_DSN=postgresql://rag:ragpass@localhost:5432/ragdb
RAG_QDRANT_URL=http://localhost:6333
RAG_OPENSEARCH_URL=http://localhost:9200
RAG_REDIS_URL=redis://localhost:6379/0
RAG_EMBEDDING_BACKEND=sentence-transformer
RAG_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
RAG_LLM_BACKEND=gemini
RAG_LLM_MODEL=gemini-2.5-flash
GEMINI_API_KEY=your_key_here
RAG_ENABLE_SUMMARY=1
RAG_STORAGE_DIR=storage
RAG_DATA_DIR=data
RAG_INGEST_INBOX=data/incoming
RAG_POSTGRES_CONNECT_TIMEOUT=5
RAG_POSTGRES_CONNECT_RETRIES=8
RAG_POSTGRES_CONNECT_RETRY_DELAY=2
RAG_FIND_TIMEOUT_SECONDS=300
Optional alternatives:
RAG_EMBEDDING_BACKEND=gemini
RAG_GEMINI_EMBEDDING_MODEL=gemini-embedding-001
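A minimal sketch of how such environment variables might be read into a typed settings object follows. The `Settings` class, the `load_settings` function, and the choice to use the example values above as defaults are assumptions for illustration, not the project's configuration module.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    postgres_dsn: str
    qdrant_url: str
    embedding_backend: str
    enable_summary: bool
    find_timeout_seconds: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read RAG_* variables, falling back to assumed local defaults."""
    env = os.environ if env is None else env
    return Settings(
        postgres_dsn=env.get(
            "RAG_POSTGRES_DSN", "postgresql://rag:ragpass@localhost:5432/ragdb"
        ),
        qdrant_url=env.get("RAG_QDRANT_URL", "http://localhost:6333"),
        embedding_backend=env.get("RAG_EMBEDDING_BACKEND", "sentence-transformer"),
        # Treat "1" as enabled; any other value disables summaries.
        enable_summary=env.get("RAG_ENABLE_SUMMARY", "1") == "1",
        find_timeout_seconds=int(env.get("RAG_FIND_TIMEOUT_SECONDS", "300")),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the loader easy to exercise in unit tests.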
To clear ingested state and start from scratch:
docker compose down -v
Remove-Item -Recurse -Force .\storage\*
docker compose up -d
python main.py bootstrap
Then reingest:
python main.py ingest golden_set --force
The current repository supports two ingestion styles:
- direct CLI or API ingestion by path
- inbox-based ingestion for non-technical users
The inbox workflow is the better operational fit for a small business team:
- users drop PDFs into a shared incoming folder
- the UI or API triggers bulk ingestion
- duplicates are skipped using checksum-based deduplication
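Checksum-based deduplication can be sketched as follows. SHA-256 and the in-memory `seen` set are assumptions made for the example; the source only states that duplicates are detected by checksum, and a real deployment would persist the checksums (for instance in Postgres).

```python
import hashlib
from pathlib import Path
from typing import Iterable


def file_checksum(path: Path, block_size: int = 1 << 20) -> str:
    """Stream the file in blocks so large PDFs are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def select_new_files(paths: Iterable[Path], seen: set[str]) -> list[Path]:
    """Return files whose content checksum has not been seen; update `seen`."""
    new_files: list[Path] = []
    for path in paths:
        checksum = file_checksum(path)
        if checksum in seen:
            continue  # identical content already ingested, skip it
        seen.add(checksum)
        new_files.append(path)
    return new_files
```

Because the check is on file content rather than file name, a PDF that is renamed and dropped into the inbox again is still skipped.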
For a more production-style setup, the recommended future design is:
- incoming/ for newly received files
- processed/ for successfully ingested files
- failed/ for files that need manual review
- scheduled ingestion every few minutes or at a fixed daily time
- persistent checksum and ingestion-state tracking rather than relying on folder dates
Generate fresh sample PDFs:
python scripts/generate_demo_pdfs.py
Example workflow:
python main.py ingest data/generated_batch
python main.py find "seat foam cushion"
python main.py evaluate eval/datasets/generated_batch_eval.json
- Retrieval-first rather than parser-first
- Evidence-first responses rather than unsupported chat answers
- Small canonical schema with flexible metadata
- Lexical retrieval for identifiers, codes, totals, and rates
- Hybrid retrieval for fuzzy text queries
- LLMs as optional summarization helpers rather than source of truth
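One common way to merge lexical and vector result lists in hybrid retrieval is reciprocal rank fusion (RRF). The sketch below is offered only as an illustration of the idea; the source does not state which merge strategy the rerank-and-merge step uses, and the function name and `k = 60` constant are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


lexical_hits = ["c1", "c2", "c3"]  # e.g. from a keyword index
vector_hits = ["c2", "c4", "c1"]   # e.g. from a vector index
fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
print([chunk_id for chunk_id, _ in fused])  # ['c2', 'c1', 'c4', 'c3']
```

RRF uses only rank positions, so it avoids having to calibrate lexical scores against vector similarity scores.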
- strongest on native-text PDFs
- OCR and scanned-document support is limited
- extraction remains somewhat weak on unseen layouts
- financial statement retrieval is weaker than invoice retrieval
- some BOM component questions still depend on better row-level evidence selection
- automated scheduled ingestion from a shared inbox folder
- incoming / processed / failed folder lifecycle for cleaner operations
- background ingestion jobs instead of long synchronous UI requests
- stronger startup healthchecks and service readiness coordination across API and Postgres
- ingestion manifests and richer operational reporting
- OCR support for scanned documents
- stronger financial statement retrieval and answer-bearing snippet selection
Apache 2.0
