# Universal PDF RAG Chatbot

Transform your PDFs into an intelligent, conversational knowledge base powered by cutting-edge AI.
A production-ready Retrieval-Augmented Generation (RAG) system that enables natural language conversations with your PDF documents. Built with enterprise-grade technologies including LangChain, FAISS vector search, and high-speed LLM inference via Groq.
## Table of Contents

- Features
- Live Demo
- Tech Stack
- UI Overview
- Quick Start
- Installation
- Configuration
- Usage Guide
- Architecture
- Troubleshooting
- Contributing
- License
- Contact
## Features

- 📚 Multi-PDF Upload - Process single or multiple PDF documents simultaneously
- 🔍 Semantic Search - FAISS-powered vector similarity search for accurate retrieval
- 🤖 Dual LLM Support - Groq (ultra-fast) with OpenAI fallback
- 📑 Source Citations - Every answer includes document references with page numbers
- 💬 Chat History - Persistent conversation tracking with download capability
- 💾 Smart Caching - Persistent FAISS index for instant subsequent queries
- 🎨 Modern UI - Glassmorphic design with gradient animations
- 📋 Real-time Logs - Interactive log viewer with filtering
- 🔧 Debug Mode - View retrieved context chunks for transparency
- 📥 Export Options - Download chat history as text files
- ⚡ OCR Fallback - Automatic image-based PDF text extraction
- 🎯 Adaptive Retrieval - Configurable Top-K and chunk parameters
## Live Demo

🚀 Try it now:

- [Streamlit Profile](https://share.streamlit.io/user/ratnesh-181998)
- [Project Demo](https://universal-pdf-rag-chatbot-mhsi4ygebe6hmq3ij6d665.streamlit.app/)
Topics: `rag` `langchain` `streamlit` `python` `llm` `chatbot` `faiss` `generative-ai` `groq` `llama-3` `pdf-parser` `vector-search`
Upload a PDF and ask questions in seconds!
Experience the full flow:
## Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Streamlit 1.51+ | Interactive web interface |
| LLM | Groq (Llama 3.3 70B) | Lightning-fast inference (2-5s response) |
| Fallback LLM | OpenAI GPT-3.5 | Backup for high availability |
| Embeddings | HuggingFace Transformers | Sentence embeddings (all-mpnet-base-v2) |
| Vector Store | FAISS | High-performance similarity search |
| Orchestration | LangChain 0.2.16 | RAG pipeline management |
| PDF Parsing | PyMuPDF + Unstructured | Text extraction with OCR fallback |
| Language | Python 3.9+ | Core application logic |
Core dependencies (from `requirements.txt`):

```text
langchain==0.2.16
langchain-groq>=0.0.1
faiss-cpu>=1.7.4
sentence-transformers>=2.2.2
streamlit>=1.28.0
```
## UI Overview

Components:
- Header - Gradient title with author credit
- Quick Guide - Visual workflow (Upload → Process → Query → Answer)
- File Uploader - Drag-and-drop PDF upload zone
- Tabbed Navigation - Chat, Info, Logs, Notes
Features:
- Text Area Input - Dark-themed query box
- Enter Query Button - Submit questions
- Clear Conversation - Reset chat history
- Message Bubbles - User (purple gradient) vs AI (dark glass)
- Source Expanders - Collapsible citation details
- Debug Context - View retrieved document chunks
Controls:
- Configuration Display - Embedding model, chunk size, Top-K
- Groq Toggle - Prefer Groq LLM checkbox
- Rebuild Index - Force FAISS re-indexing
- API Status - Real-time Groq/OpenAI availability
- Download Chat - Export conversation history
## Quick Start

Prerequisites:

- Python 3.9 or higher
- Groq API Key (free at console.groq.com)
- OR an OpenAI API Key
Windows (automated setup):

```bash
# Clone repository
git clone https://github.com/Ratnesh-181998/Universal-PDF-RAG-Chatbot.git
cd Universal-PDF-RAG-Chatbot

# Run setup script
setup.bat
```

Linux/Mac (quick start):

```bash
# Clone repository
git clone https://github.com/Ratnesh-181998/Universal-PDF-RAG-Chatbot.git
cd Universal-PDF-RAG-Chatbot

# Install and run
pip install -r requirements.txt
export GROQ_API_KEY="your_key_here"
streamlit run app.py
```

## Installation

Clone the repository:

```bash
git clone https://github.com/Ratnesh-181998/Universal-PDF-RAG-Chatbot.git
cd Universal-PDF-RAG-Chatbot
```

Create and activate a virtual environment:

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# Linux/Mac
python3 -m venv venv
source venv/bin/activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Option A: Environment Variables (Recommended)
```bash
# Windows PowerShell
$env:GROQ_API_KEY="gsk_your_groq_api_key_here"

# Windows CMD
set GROQ_API_KEY=gsk_your_groq_api_key_here

# Linux/Mac
export GROQ_API_KEY="gsk_your_groq_api_key_here"
```

Option B: Streamlit Secrets (For Deployment)
Create `.streamlit/secrets.toml`:

```toml
GROQ_API_KEY = "gsk_your_groq_api_key_here"
OPENAI_API_KEY = "sk_your_openai_key_here"  # Optional fallback
```

Run the app:

```bash
streamlit run app.py
```

The app will open at http://localhost:8501.
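Since keys can come from either source, the lookup order matters: environment variables first, then Streamlit secrets. A minimal, stdlib-only sketch of that lookup (the function name `resolve_api_key` is illustrative, not from the repo; the `st.secrets` fallback is shown as a comment):

```python
import os

def resolve_api_key(name="GROQ_API_KEY"):
    """Return the API key from the environment, if set.

    In the deployed Streamlit app, st.secrets would be consulted
    as a second source when the environment variable is absent.
    """
    key = os.environ.get(name)
    if key:
        return key
    # Illustrative fallback (kept as a comment so this sketch stays stdlib-only):
    # key = st.secrets.get(name)
    return None
```

This pattern lets local runs use `export GROQ_API_KEY=...` while Streamlit Cloud deployments rely on `secrets.toml` without code changes.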
## Configuration

```python
# Vector Store
INDEX_DIR = "faiss_index_storage"  # FAISS index save location

# Embeddings
EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"

# Text Chunking
CHUNK_SIZE = 800       # Characters per chunk
CHUNK_OVERLAP = 150    # Overlap between chunks

# Retrieval
TOP_K = 10             # Number of chunks to retrieve

# Available Groq Models
GROQ_MODELS = [
    "llama-3.3-70b-versatile",  # Latest, most powerful
    "llama-3.1-70b-versatile",  # Stable alternative
    "llama-3.1-8b-instant",     # Fastest
]
```

Recommended settings by use case:

| Use Case | CHUNK_SIZE | TOP_K | Model |
|---|---|---|---|
| Precise Answers | 500 | 15 | llama-3.3-70b |
| Balanced | 800 | 10 | llama-3.1-70b |
| Speed Priority | 1000 | 5 | llama-3.1-8b |
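To make the CHUNK_SIZE / CHUNK_OVERLAP trade-off concrete, here is a minimal, framework-agnostic sketch of character-based chunking (the app itself uses LangChain's text splitter; `chunk_text` below is illustrative only):

```python
def chunk_text(text, chunk_size=800, chunk_overlap=150):
    """Split text into fixed-size character chunks with overlap.

    Consecutive chunks share `chunk_overlap` characters, so a sentence
    cut at one chunk boundary still appears intact in the next chunk.
    """
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```

Smaller chunks with a higher Top-K favor precise, pinpoint answers; larger chunks with a lower Top-K reduce the number of retrievals and speed up responses.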
## Usage Guide

1. **Upload PDFs**
   - Click the file uploader
   - Select one or more PDF files (max 200MB each)
   - Wait 20-30 seconds for processing

2. **Ask Questions**
   - Type your question in the text area
   - Click "Enter Query"
   - Wait 20-30 seconds for the AI response

3. **Review Answers**
   - Read the AI-generated response
   - Expand "View Sources" to see citations
   - Check "Debug Context" to verify retrieved chunks

4. **Manage Conversation**
   - Click "Clear Conversation" to reset
   - Download chat history via the sidebar button
Example questions:

- ✅ "What are the main conclusions of this research paper?"
- ✅ "Summarize the methodology section"
- ✅ "What does the author say about climate change?"
- ✅ "List all recommendations from the report"
- ✅ "Compare the results in Table 3 and Table 5"
No Answer Found?

- Click "🔧 Rebuild Index" in the sidebar
- Re-upload your PDFs
- Check "Debug Context" to see what was retrieved

Slow Performance?

- The first run builds embeddings (30-60s)
- Subsequent queries use the cached index (5-10s)
- Switch to `llama-3.1-8b-instant` for speed
## Architecture

```text
┌──────────────────────┐
│      PDF Upload      │
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│   PyMuPDF Loader     │──► OCR Fallback (if needed)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│   Text Splitter      │  (Chunk: 800, Overlap: 150)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│ HuggingFace Embedder │  (all-mpnet-base-v2)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│  FAISS Vector Store  │  (Persistent Index)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│      User Query      │
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│  Similarity Search   │  (Top-K Retrieval)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│  LLM (Groq/OpenAI)   │  (Context + Query)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│  Answer + Citations  │
└──────────────────────┘
```
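The Similarity Search step ranks every stored chunk vector against the query vector and keeps the Top-K. A pure-Python sketch of that ranking by cosine similarity (FAISS performs the same ranking with optimized index structures; `top_k_cosine` is illustrative, not the app's code):

```python
import math

def top_k_cosine(query_vec, chunk_vecs, k=10):
    """Return indices of the k chunk vectors most similar to the query."""
    def cosine(a, b):
        # cosine(a, b) = dot(a, b) / (|a| * |b|)
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    # Score every chunk, then keep the k highest-scoring indices.
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

The retrieved chunks are then concatenated with the user's question into the LLM prompt, which is why TOP_K directly controls how much context the model sees.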
```text
Universal-PDF-RAG-Chatbot/
├── app.py                   # Main Streamlit application
├── requirements.txt         # Python dependencies
├── LICENSE                  # MIT License
├── README.md                # This file
├── QUICKSTART.md            # Fast setup guide
├── .gitignore               # Git exclusions
├── .streamlit/
│   └── secrets.toml         # API keys (not in git)
├── faiss_index_storage/     # FAISS index (auto-created)
│   ├── index.faiss
│   └── index.pkl
└── app.log                  # Application logs
```
## Troubleshooting

Cause: Missing API keys

Solution:

```bash
# Set environment variable
export GROQ_API_KEY="your_key_here"
```

or add to `.streamlit/secrets.toml`:

```toml
GROQ_API_KEY = "your_key_here"
```

Cause: Transformers library incompatibility

Solution:

```bash
pip install tf-keras
```

Cause: Corrupted or image-only PDFs
Solution:
- Ensure PDF has extractable text
- App will auto-fallback to OCR for image PDFs
- Try a different PDF to verify
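A common heuristic for routing image-only pages to OCR is to check how much selectable text extraction yields; a minimal sketch (the threshold and the name `needs_ocr` are illustrative, not from the repo):

```python
def needs_ocr(page_text, min_chars=20):
    """Heuristic: a page yielding almost no extractable text is likely
    a scanned image and should fall back to OCR instead."""
    return len(page_text.strip()) < min_chars
```

If a document fails even with OCR, opening it in a PDF viewer and trying to select text is a quick way to confirm whether any text layer exists at all.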
Cause: Building embeddings for first time
Solution:
- Normal behavior (30-60s)
- Subsequent queries use cached index (5-10s)
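The caching behavior described above boils down to a load-or-build pattern around the index directory. A sketch with `pickle` standing in for the FAISS save/load calls (names like `load_or_build_index` are illustrative, not the app's API):

```python
import pickle
from pathlib import Path

def load_or_build_index(index_dir, build_fn):
    """Load a previously saved index if present; otherwise build it
    once with build_fn() and persist it for subsequent runs."""
    path = Path(index_dir) / "index.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())   # fast path: cached index
    index = build_fn()                           # slow first-time step
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(index))        # persist for next run
    return index
```

Deleting the index directory (or clicking "Rebuild Index") forces the slow path again, which is why rebuilding is the first remedy when retrieval looks stale.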
Solution:

```bash
pip install --upgrade pip
pip install -r requirements.txt --force-reinstall
```

## Contributing

Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
```bash
# Clone your fork
git clone https://github.com/YOUR_USERNAME/Universal-PDF-RAG-Chatbot.git

# Install dev dependencies
pip install -r requirements.txt

# Run tests (if available)
pytest tests/
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ❌ Liability
- ❌ Warranty
## Contact

**RATNESH SINGH**
Data Scientist | AI/ML Engineer

- 📧 Email: rattudacsit2021gate@gmail.com
- 💼 LinkedIn: linkedin.com/in/ratneshkumar1998
- 🐙 GitHub: github.com/Ratnesh-181998
- 📱 Phone: +91-947XXXXX46
- 🌐 Live Demo: Streamlit Cloud
- 📖 Documentation: GitHub Wiki
- 🐛 Issue Tracker: GitHub Issues
- ⭐ Star this repo if you find it useful!
This project leverages amazing open-source technologies:
- LangChain - RAG orchestration framework
- Groq - Ultra-fast LLM inference
- FAISS - Efficient vector search by Meta AI
- Streamlit - Rapid web app development
- HuggingFace - Transformer models and embeddings
- PyMuPDF - PDF text extraction
Planned enhancements:

- Support for DOCX, TXT, CSV files
- Multi-language document support
- Conversation memory across sessions
- Voice input/output integration
- Docker containerization
- Cloud deployment guides (AWS, GCP, Azure)
- Batch processing mode
- Custom prompt templates UI
- Advanced analytics dashboard
| Metric | Value |
|---|---|
| Average Query Time | 5-10 seconds |
| First Upload Processing | 30-60 seconds |
| Supported PDF Size | Up to 200MB |
| Concurrent Users | 10+ (Streamlit Cloud) |
| Accuracy (F1 Score) | ~0.85 on test set |
Security and privacy:

- ✅ API keys stored in `.streamlit/secrets.toml` (gitignored)
- ✅ No data persistence beyond session (unless explicitly saved)
- ✅ FAISS index stored locally (not cloud-synced)
- ⚠️ Uploaded PDFs processed in-memory only
- ⚠️ For sensitive documents, deploy on private infrastructure
Licensed under the MIT License - Feel free to fork and build upon this innovation! π
