📚 RAG Project - Complete Documentation Index

Quick Navigation

🚀 Getting Started

📖 Documentation

💻 Code Structure

src/
├── config/          Configuration management
├── rag/             RAG pipeline components  
└── utils/           Helper utilities

Project Overview

RAG Document Assistant is a production-ready system that combines:

  • Document processing (PDF, DOCX, TXT)
  • AI-powered semantic search (Google Gemini embeddings)
  • Vector database storage (Pinecone)
  • Conversational AI (Google Gemini LLM)
  • Web interface (Streamlit)

🎯 Core Features

File Upload - Support for .txt, .pdf, .docx
Smart Chunking - Automatic text splitting with overlap
Vector Embeddings - Google Gemini semantic vectors
Semantic Search - Pinecone vector similarity
AI Responses - Context-grounded answers
Source Attribution - All answers cited
Chat Interface - Streamlit web app
CLI Tools - Command-line utilities


📂 Project Files

Root Level Files

  • app.py - Streamlit web interface
  • main.py - CLI entry point
  • setup_project.py - Project setup script
  • requirements.txt - Python dependencies
  • .env.template - Configuration template
  • Makefile - Convenient commands

Documentation

  • README.md - User guide
  • QUICKSTART.md - 5-minute setup
  • DOCUMENTATION.md - Technical details
  • PROJECT_SUMMARY.md - Implementation status
  • INDEX.md - This file

Source Code (src/)

src/
├── __init__.py
├── config/
│   ├── __init__.py
│   └── config.py              Configuration class
├── rag/
│   ├── __init__.py
│   ├── pinecone_manager.py     Pinecone operations
│   ├── embedding_service.py    Google Gemini embeddings
│   ├── document_processor.py   Document pipeline
│   └── rag_chain.py            RAG chain & LLM
└── utils/
    ├── __init__.py
    ├── helpers.py              Logging & utilities
    ├── chunking.py             Text splitting
    └── text_processor.py       File extraction

🚀 Quick Start

1. Install & Setup

pip install -r requirements.txt
python setup_project.py

2. Configure

# Edit .env with your API keys
GOOGLE_API_KEY=your_google_api_key
PINECONE_API_KEY=...
PINECONE_ENVIRONMENT=us-east-1

3. Initialize

python main.py init

4. Run

streamlit run app.py

💬 Usage Examples

Web Interface

streamlit run app.py
# Then:
# 1. Upload documents
# 2. Process them
# 3. Ask questions

Command Line

# Initialize index
python main.py init

# Process documents
python main.py process documents/

# Process single file
python main.py process document.txt

# With namespace
python main.py process docs/ --namespace my-project

Python Code

from src.rag import RAGChain

chain = RAGChain()
result = chain.query("What is the main topic?")
print(result["answer"])

🔒 Safety Features

Context Verification - Answers based only on uploaded documents
Hallucination Prevention - Refuses out-of-scope questions
Source Attribution - All answers include source references
Error Handling - Comprehensive logging and error messages
Config Validation - Required API keys are checked
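The grounding behavior described above is typically enforced at the prompt level: the LLM is instructed to answer only from retrieved chunks. A minimal sketch (the template wording and function name are illustrative, not the project's actual prompt):

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Build a prompt that restricts the LLM to the retrieved context."""
    # Number each chunk so the model can cite sources in its answer
    context = "\n\n".join(f"[Source {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "What is RAG?",
    ["RAG combines retrieval with generation."],
)
```

The numbered `[Source N]` labels are what makes per-answer source attribution possible downstream.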


📊 System Architecture

┌─────────────────────────────────┐
│   Streamlit Web Interface       │
│   - File Upload                 │
│   - Chat Interface              │
│   - Source Display              │
└────────────┬────────────────────┘
             │
┌────────────▼────────────────────┐
│   RAG Processing Pipeline       │
├─────────────────────────────────┤
│ 1. Document Processing          │
│    - Extract text from files    │
│    - Split into chunks          │
│                                 │
│ 2. Embedding Service            │
│    - Generate vectors           │
│    - Dimension: 768             │
│                                 │
│ 3. Vector Storage               │
│    - Upsert to Pinecone         │
│    - Store metadata             │
│                                 │
│ 4. RAG Chain                    │
│    - Retrieve documents         │
│    - Generate responses         │
│    - Cite sources               │
└─────────────────────────────────┘
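The four stages above can be sketched end to end with stubbed components. This is a shape-only illustration: the real embedding and storage calls are replaced by placeholders, and all names are hypothetical rather than the project's actual API.

```python
def extract_text(raw: str) -> str:
    # Stage 1a: the real pipeline reads .txt/.pdf/.docx files here
    return raw.strip()

def split_into_chunks(text: str, size: int = 20) -> list[str]:
    # Stage 1b: naive fixed-size split; the project uses overlap-aware chunking
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk: str) -> list[float]:
    # Stage 2: placeholder for a 768-dimensional Gemini embedding
    return [float(len(chunk))]

def store(vectors: list[tuple[str, list[float]]]) -> int:
    # Stage 3: placeholder for a Pinecone upsert; returns the vector count
    return len(vectors)

def run_pipeline(raw: str) -> int:
    chunks = split_into_chunks(extract_text(raw))
    vectors = [(c, embed(c)) for c in chunks]
    return store(vectors)

count = run_pipeline("A small document used to exercise the pipeline stages.")
```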

⚙️ Configuration

Edit .env to customize behavior:

# API Keys (required)
GOOGLE_API_KEY=your_key
PINECONE_API_KEY=your_key
PINECONE_ENVIRONMENT=us-east-1

# Processing
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Retrieval
RETRIEVAL_TOP_K=5

# Model
GOOGLE_MODEL_NAME=gemini-2.5-flash
EMBEDDING_MODEL=models/embedding-001

# Logging
LOG_LEVEL=INFO
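CHUNK_SIZE and CHUNK_OVERLAP control the text splitter: consecutive chunks share an overlapping tail so that sentences cut at a boundary still appear whole in at least one chunk. A minimal sketch of overlap-aware splitting (the function is illustrative; the project's own splitter lives in src/utils/chunking.py):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks
    share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` chars after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij" * 30, chunk_size=100, overlap=20)
```

With these numbers, each chunk starts 80 characters after the previous one, so the last 20 characters of one chunk equal the first 20 of the next.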

🛠️ Available Commands

Via Make (if installed)

make help       # Show all commands
make install    # Install dependencies
make setup      # Setup project
make init       # Initialize Pinecone
make run        # Start Streamlit
make clean      # Clean up

Direct Python

python main.py init                               # Initialize
python main.py process <path>                     # Process documents
python main.py process <path> --namespace <name>  # With namespace

Streamlit

streamlit run app.py                        # Start web UI

📚 Supported File Formats

Format   Extension   Notes
Text     .txt        Plain text files
PDF      .pdf        Using PyPDF2
Word     .docx       Using python-docx
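Extraction can be dispatched on the file extension. A sketch assuming PyPDF2 and python-docx, imported lazily so the .txt path works even when those packages are absent (the function name is illustrative; the project's version lives in src/utils/text_processor.py):

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Return the plain text of a .txt, .pdf, or .docx file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".pdf":
        import PyPDF2  # lazy import: only needed for PDFs
        reader = PyPDF2.PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        import docx  # python-docx
        document = docx.Document(path)
        return "\n".join(p.text for p in document.paragraphs)
    raise ValueError(f"Unsupported format: {suffix}")
```

Raising on unknown extensions keeps format errors at the pipeline entrance instead of surfacing later as empty chunks.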

🔧 API Services

Google Gemini API

  • Embeddings: models/embedding-001
    • Dimension: 768
    • Semantic understanding
  • LLM: gemini-2.5-flash
    • Chat capabilities
    • Context-aware responses
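Retrieval quality rests on comparing these 768-dimensional embedding vectors. The comparison metric, cosine similarity, can be computed locally as an illustration (no API call involved):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors score 1.0; orthogonal vectors score 0.0
score = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In production the scoring happens inside Pinecone; this helper only shows what the index computes.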

Pinecone

  • Vector Storage: Scalable cloud database
  • Similarity Search: Cosine similarity
  • Metadata Support: Store source info
  • Namespaces: Organize documents
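Upserts to Pinecone are typically batched rather than sent one vector at a time. A sketch using the pinecone client: the index name, namespace, and API key below are placeholders, and the live calls are guarded behind a flag so only the pure batching helper runs here.

```python
def batched(items: list, size: int = 100) -> list[list]:
    """Split a list of vectors into upsert-sized batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

RUN_LIVE = False  # set True with real credentials to execute against Pinecone

if RUN_LIVE:
    from pinecone import Pinecone  # requires the pinecone package

    pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # placeholder key
    index = pc.Index("rag-documents")               # placeholder index name

    # Each vector is (id, values, metadata) - metadata carries the source info
    vectors = [("doc1-chunk0", [0.1] * 768, {"source": "doc1.txt"})]
    for batch in batched(vectors):
        index.upsert(vectors=batch, namespace="my-project")

    results = index.query(vector=[0.1] * 768, top_k=5,
                          include_metadata=True, namespace="my-project")
```

The namespace argument mirrors the CLI's `--namespace` flag: vectors upserted under one namespace are only returned by queries against that namespace.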

🐛 Troubleshooting

Configuration Issues

  • Ensure .env exists with API keys
  • Run python setup_project.py

Connection Issues

  • Check internet connection
  • Verify API key validity
  • Test API access

Processing Issues

  • Check file format (.txt, .pdf, .docx)
  • Verify chunk size settings
  • Review error logs

Query Issues

  • Ensure documents are processed
  • Check index statistics
  • Verify query format

📈 Performance Tips

  1. Faster Processing: Increase CHUNK_SIZE
  2. Better Context: Increase CHUNK_OVERLAP
  3. More Results: Increase RETRIEVAL_TOP_K
  4. Organized Storage: Use different namespaces

🧪 Testing

Verify Installation

python -c "from src.rag import *; print('✓ OK')"

Test with Sample

# Create sample file
echo "Sample content for testing" > sample.txt

# Process it
python main.py process sample.txt

# Query via Streamlit
streamlit run app.py

📝 Documentation References

In This Repository

  • README.md - User guide
  • QUICKSTART.md - 5-minute setup
  • DOCUMENTATION.md - Technical details
  • PROJECT_SUMMARY.md - Implementation status

📊 Project Statistics

Metric                  Value
Total Files             15+
Lines of Code           1200+
Classes                 7
Functions               40+
Documentation Files     5
Configuration Options   15+

✅ Completion Status

Status: ✅ PRODUCTION READY

All requirements from the project specification have been implemented with:

  • Production-grade code quality
  • Comprehensive documentation
  • Full feature set
  • Error handling
  • Security considerations

🎯 Next Steps

For First-Time Users

  1. Read QUICKSTART.md
  2. Follow the 5-minute setup
  3. Upload sample documents
  4. Test queries

For Developers

  1. Review DOCUMENTATION.md
  2. Explore source code in src/
  3. Check PROJECT_SUMMARY.md
  4. Extend functionality as needed

For Operations

  1. Configure .env with production keys
  2. Run python main.py init
  3. Set up document pipeline
  4. Monitor via logs

📞 Support

For issues:

  1. Check troubleshooting in README.md
  2. Review logs for error details
  3. Verify API key configuration
  4. Check documentation for guidance

Last Updated: December 2024
Version: 1.0.0
Status: Production Ready ✅


Start with QUICKSTART.md for the fastest way to get running!