# Universal PDF RAG Chatbot

Transform your PDFs into an intelligent, conversational knowledge base powered by cutting-edge AI.
A production-ready Retrieval-Augmented Generation (RAG) system that enables natural language conversations with your PDF documents. Built with enterprise-grade technologies including LangChain, FAISS vector search, and high-speed LLM inference via Groq.
## Table of Contents

- Features
- Live Demo
- Tech Stack
- UI Overview
- Quick Start
- Installation
- Configuration
- Usage Guide
- Architecture
- Troubleshooting
- Contributing
- License
- Contact
## Features

- 📚 Multi-PDF Upload - Process single or multiple PDF documents simultaneously
- 🔍 Semantic Search - FAISS-powered vector similarity search for accurate retrieval
- 🤖 Dual LLM Support - Groq (ultra-fast) with OpenAI fallback
- 📑 Source Citations - Every answer includes document references with page numbers
- 💬 Chat History - Persistent conversation tracking with download capability
- 💾 Smart Caching - Persistent FAISS index for instant subsequent queries
- 🎨 Modern UI - Glassmorphic design with gradient animations
- 📋 Real-time Logs - Interactive log viewer with filtering
- 🔧 Debug Mode - View retrieved context chunks for transparency
- 📥 Export Options - Download chat history as text files
- ⚡ OCR Fallback - Automatic image-based PDF text extraction
- 🎯 Adaptive Retrieval - Configurable Top-K and chunk parameters
## Live Demo

🚀 Try it now:

- [Streamlit Profile](https://share.streamlit.io/user/ratnesh-181998)
- [Project Demo](https://universal-pdf-rag-chatbot-mhsi4ygebe6hmq3ij6d665.streamlit.app/)
Topics: `rag` `langchain` `streamlit` `python` `llm` `chatbot` `faiss` `generative-ai` `groq` `llama-3` `pdf-parser` `vector-search`
Upload a PDF and ask questions in seconds!
Experience the full flow:
## Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Streamlit 1.51+ | Interactive web interface |
| LLM | Groq (Llama 3.3 70B) | Lightning-fast inference (2-5s response) |
| Fallback LLM | OpenAI GPT-3.5 | Backup for high availability |
| Embeddings | HuggingFace Transformers | Sentence embeddings (all-mpnet-base-v2) |
| Vector Store | FAISS | High-performance similarity search |
| Orchestration | LangChain 0.2.16 | RAG pipeline management |
| PDF Parsing | PyMuPDF + Unstructured | Text extraction with OCR fallback |
| Language | Python 3.9+ | Core application logic |
Core dependencies (from `requirements.txt`):

```text
langchain==0.2.16
langchain-groq>=0.0.1
faiss-cpu>=1.7.4
sentence-transformers>=2.2.2
streamlit>=1.28.0
```
## UI Overview

Components:
- Header - Gradient title with author credit
- Quick Guide - Visual workflow (Upload → Process → Query → Answer)
- File Uploader - Drag-and-drop PDF upload zone
- Tabbed Navigation - Chat, Info, Logs, Notes
Features:
- Text Area Input - Dark-themed query box
- Enter Query Button - Submit questions
- Clear Conversation - Reset chat history
- Message Bubbles - User (purple gradient) vs AI (dark glass)
- Source Expanders - Collapsible citation details
- Debug Context - View retrieved document chunks
Controls:
- Configuration Display - Embedding model, chunk size, Top-K
- Groq Toggle - Prefer Groq LLM checkbox
- Rebuild Index - Force FAISS re-indexing
- API Status - Real-time Groq/OpenAI availability
- Download Chat - Export conversation history
## Quick Start

Prerequisites:

- Python 3.9 or higher
- Groq API Key (free at console.groq.com)
- OR an OpenAI API Key
Windows (automated setup):

```bash
# Clone repository
git clone https://github.com/Ratnesh-181998/Universal-PDF-RAG-Chatbot.git
cd Universal-PDF-RAG-Chatbot

# Run setup script
setup.bat
```

Linux/Mac (quick start):

```bash
# Clone repository
git clone https://github.com/Ratnesh-181998/Universal-PDF-RAG-Chatbot.git
cd Universal-PDF-RAG-Chatbot

# Install and run
pip install -r requirements.txt
export GROQ_API_KEY="your_key_here"
streamlit run app.py
```

## Installation

Clone the repository:

```bash
git clone https://github.com/Ratnesh-181998/Universal-PDF-RAG-Chatbot.git
cd Universal-PDF-RAG-Chatbot
```

Create and activate a virtual environment:

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# Linux/Mac
python3 -m venv venv
source venv/bin/activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

Option A: Environment Variables (Recommended)
```bash
# Windows PowerShell
$env:GROQ_API_KEY="gsk_your_groq_api_key_here"

# Windows CMD
set GROQ_API_KEY=gsk_your_groq_api_key_here

# Linux/Mac
export GROQ_API_KEY="gsk_your_groq_api_key_here"
```

Option B: Streamlit Secrets (For Deployment)
Create `.streamlit/secrets.toml`:

```toml
GROQ_API_KEY = "gsk_your_groq_api_key_here"
OPENAI_API_KEY = "sk_your_openai_key_here"  # Optional fallback
```

Run the app:

```bash
streamlit run app.py
```

The app will open at http://localhost:8501.
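Since keys can come from either source, the lookup order matters: environment variables first, then Streamlit secrets. A minimal, stdlib-only sketch of that lookup (the function name `resolve_api_key` is illustrative, not from the repo; the `st.secrets` fallback is shown as a comment):

```python
import os

def resolve_api_key(name="GROQ_API_KEY"):
    """Return the API key from the environment, if set.

    In the deployed Streamlit app, st.secrets would be consulted
    as a second source when the environment variable is absent.
    """
    key = os.environ.get(name)
    if key:
        return key
    # Illustrative fallback (kept as a comment so this sketch stays stdlib-only):
    # key = st.secrets.get(name)
    return None
```

This pattern lets local runs use `export GROQ_API_KEY=...` while Streamlit Cloud deployments rely on `secrets.toml` without code changes.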
## Configuration

```python
# Vector Store
INDEX_DIR = "faiss_index_storage"  # FAISS index save location

# Embeddings
EMBEDDING_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"

# Text Chunking
CHUNK_SIZE = 800       # Characters per chunk
CHUNK_OVERLAP = 150    # Overlap between chunks

# Retrieval
TOP_K = 10             # Number of chunks to retrieve

# Available Groq Models
GROQ_MODELS = [
    "llama-3.3-70b-versatile",  # Latest, most powerful
    "llama-3.1-70b-versatile",  # Stable alternative
    "llama-3.1-8b-instant",     # Fastest
]
```

Recommended settings by use case:

| Use Case | CHUNK_SIZE | TOP_K | Model |
|---|---|---|---|
| Precise Answers | 500 | 15 | llama-3.3-70b |
| Balanced | 800 | 10 | llama-3.1-70b |
| Speed Priority | 1000 | 5 | llama-3.1-8b |
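To make the CHUNK_SIZE / CHUNK_OVERLAP trade-off concrete, here is a minimal, framework-agnostic sketch of character-based chunking (the app itself uses LangChain's text splitter; `chunk_text` below is illustrative only):

```python
def chunk_text(text, chunk_size=800, chunk_overlap=150):
    """Split text into fixed-size character chunks with overlap.

    Consecutive chunks share `chunk_overlap` characters, so a sentence
    cut at one chunk boundary still appears intact in the next chunk.
    """
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```

Smaller chunks with a higher Top-K favor precise, pinpoint answers; larger chunks with a lower Top-K reduce the number of retrievals and speed up responses.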
## Usage Guide

1. **Upload PDFs**
   - Click the file uploader
   - Select one or more PDF files (max 200MB each)
   - Wait 20-30 seconds for processing

2. **Ask Questions**
   - Type your question in the text area
   - Click "Enter Query"
   - Wait 20-30 seconds for the AI response

3. **Review Answers**
   - Read the AI-generated response
   - Expand "View Sources" to see citations
   - Check "Debug Context" to verify retrieved chunks

4. **Manage Conversation**
   - Click "Clear Conversation" to reset
   - Download chat history via the sidebar button
Example questions:

- ✅ "What are the main conclusions of this research paper?"
- ✅ "Summarize the methodology section"
- ✅ "What does the author say about climate change?"
- ✅ "List all recommendations from the report"
- ✅ "Compare the results in Table 3 and Table 5"
No Answer Found?

- Click "🔧 Rebuild Index" in the sidebar
- Re-upload your PDFs
- Check "Debug Context" to see what was retrieved

Slow Performance?

- The first run builds embeddings (30-60s)
- Subsequent queries use the cached index (5-10s)
- Switch to `llama-3.1-8b-instant` for speed
## Architecture

```text
┌──────────────────────┐
│      PDF Upload      │
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│   PyMuPDF Loader     │──► OCR Fallback (if needed)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│   Text Splitter      │  (Chunk: 800, Overlap: 150)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│ HuggingFace Embedder │  (all-mpnet-base-v2)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│  FAISS Vector Store  │  (Persistent Index)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│      User Query      │
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│  Similarity Search   │  (Top-K Retrieval)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│  LLM (Groq/OpenAI)   │  (Context + Query)
└─────────┬────────────┘
          ▼
┌──────────────────────┐
│  Answer + Citations  │
└──────────────────────┘
```
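The Similarity Search step ranks every stored chunk vector against the query vector and keeps the Top-K. A pure-Python sketch of that ranking by cosine similarity (FAISS performs the same ranking with optimized index structures; `top_k_cosine` is illustrative, not the app's code):

```python
import math

def top_k_cosine(query_vec, chunk_vecs, k=10):
    """Return indices of the k chunk vectors most similar to the query."""
    def cosine(a, b):
        # cosine(a, b) = dot(a, b) / (|a| * |b|)
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    # Score every chunk, then keep the k highest-scoring indices.
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

The retrieved chunks are then concatenated with the user's question into the LLM prompt, which is why TOP_K directly controls how much context the model sees.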
```text
Universal-PDF-RAG-Chatbot/
├── app.py                   # Main Streamlit application
├── requirements.txt         # Python dependencies
├── LICENSE                  # MIT License
├── README.md                # This file
├── QUICKSTART.md            # Fast setup guide
├── .gitignore               # Git exclusions
├── .streamlit/
│   └── secrets.toml         # API keys (not in git)
├── faiss_index_storage/     # FAISS index (auto-created)
│   ├── index.faiss
│   └── index.pkl
└── app.log                  # Application logs
```
## Troubleshooting

Cause: Missing API keys

Solution:

```bash
# Set environment variable
export GROQ_API_KEY="your_key_here"
```

or add to `.streamlit/secrets.toml`:

```toml
GROQ_API_KEY = "your_key_here"
```

Cause: Transformers library incompatibility

Solution:

```bash
pip install tf-keras
```

Cause: Corrupted or image-only PDFs
Solution:
- Ensure PDF has extractable text
- App will auto-fallback to OCR for image PDFs
- Try a different PDF to verify
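A common heuristic for routing image-only pages to OCR is to check how much selectable text extraction yields; a minimal sketch (the threshold and the name `needs_ocr` are illustrative, not from the repo):

```python
def needs_ocr(page_text, min_chars=20):
    """Heuristic: a page yielding almost no extractable text is likely
    a scanned image and should fall back to OCR instead."""
    return len(page_text.strip()) < min_chars
```

If a document fails even with OCR, opening it in a PDF viewer and trying to select text is a quick way to confirm whether any text layer exists at all.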
Cause: Building embeddings for first time
Solution:
- Normal behavior (30-60s)
- Subsequent queries use cached index (5-10s)
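The caching behavior described above boils down to a load-or-build pattern around the index directory. A sketch with `pickle` standing in for the FAISS save/load calls (names like `load_or_build_index` are illustrative, not the app's API):

```python
import pickle
from pathlib import Path

def load_or_build_index(index_dir, build_fn):
    """Load a previously saved index if present; otherwise build it
    once with build_fn() and persist it for subsequent runs."""
    path = Path(index_dir) / "index.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())   # fast path: cached index
    index = build_fn()                           # slow first-time step
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(index))        # persist for next run
    return index
```

Deleting the index directory (or clicking "Rebuild Index") forces the slow path again, which is why rebuilding is the first remedy when retrieval looks stale.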
Solution:

```bash
pip install --upgrade pip
pip install -r requirements.txt --force-reinstall
```

## Contributing

Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
```bash
# Clone your fork
git clone https://github.com/YOUR_USERNAME/Universal-PDF-RAG-Chatbot.git

# Install dev dependencies
pip install -r requirements.txt

# Run tests (if available)
pytest tests/
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
- ✅ Commercial use
- ✅ Modification
- ✅ Distribution
- ✅ Private use
- ❌ Liability
- ❌ Warranty
## Contact

**RATNESH SINGH**
Data Scientist | AI/ML Engineer

- 📧 Email: rattudacsit2021gate@gmail.com
- 💼 LinkedIn: linkedin.com/in/ratneshkumar1998
- 🐙 GitHub: github.com/Ratnesh-181998
- 📱 Phone: +91-947XXXXX46
- 🌐 Live Demo: Streamlit Cloud
- 📖 Documentation: GitHub Wiki
- 🐛 Issue Tracker: GitHub Issues
- ⭐ Star this repo if you find it useful!
This project leverages amazing open-source technologies:
- LangChain - RAG orchestration framework
- Groq - Ultra-fast LLM inference
- FAISS - Efficient vector search by Meta AI
- Streamlit - Rapid web app development
- HuggingFace - Transformer models and embeddings
- PyMuPDF - PDF text extraction
Planned enhancements:

- Support for DOCX, TXT, CSV files
- Multi-language document support
- Conversation memory across sessions
- Voice input/output integration
- Docker containerization
- Cloud deployment guides (AWS, GCP, Azure)
- Batch processing mode
- Custom prompt templates UI
- Advanced analytics dashboard
| Metric | Value |
|---|---|
| Average Query Time | 5-10 seconds |
| First Upload Processing | 30-60 seconds |
| Supported PDF Size | Up to 200MB |
| Concurrent Users | 10+ (Streamlit Cloud) |
| Accuracy (F1 Score) | ~0.85 on test set |
Security and privacy:

- ✅ API keys stored in `.streamlit/secrets.toml` (gitignored)
- ✅ No data persistence beyond session (unless explicitly saved)
- ✅ FAISS index stored locally (not cloud-synced)
- ⚠️ Uploaded PDFs processed in-memory only
- ⚠️ For sensitive documents, deploy on private infrastructure
Licensed under the MIT License - Feel free to fork and build upon this innovation! π
