
RAG Project - Complete Documentation

📚 Project Overview

This is a production-ready Retrieval-Augmented Generation (RAG) system built with enterprise-grade components:

  • Framework: LangChain (RAG orchestration)
  • Embeddings: Google Gemini API (semantic understanding)
  • Vector DB: Pinecone (scalable vector storage)
  • LLM: Google Gemini Pro (generation)
  • UI: Streamlit (web interface)

πŸ—οΈ Architecture

```
┌──────────────────────────────────────────────────────────┐
│                    User Interface                        │
│                   (Streamlit Web App)                    │
│  - File Upload - Chat Interface - Display Results        │
└────────────────────┬─────────────────────────────────────┘
                     │
┌────────────────────▼─────────────────────────────────────┐
│              RAG Processing Pipeline                     │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  1. Document Processor                                   │
│     - Text Extraction (.txt, .pdf, .docx)                │
│     - Text Chunking (with overlap)                       │
│                                                          │
│  2. Embedding Service                                    │
│     - Generate vectors via Google Gemini                 │
│     - Dimension: 768                                     │
│                                                          │
│  3. Vector Storage                                       │
│     - Pinecone index management                          │
│     - Metadata storage                                   │
│                                                          │
│  4. RAG Chain                                            │
│     - Retrieval (semantic search)                        │
│     - Generation (LLM response)                          │
│     - Prompt engineering                                 │
│                                                          │
└──────────────────────────────────────────────────────────┘
```

📂 File Structure Explained

```
rag-project/
│
├── src/                          # Main source code
│   ├── __init__.py              # Package initialization
│   │
│   ├── config/                  # Configuration module
│   │   ├── __init__.py
│   │   └── config.py            # Centralized config (reads .env)
│   │
│   ├── rag/                     # Core RAG implementation
│   │   ├── __init__.py
│   │   ├── pinecone_manager.py  # Pinecone CRUD operations
│   │   ├── embedding_service.py # Google Gemini embeddings
│   │   ├── document_processor.py # Document pipeline
│   │   └── rag_chain.py         # LangChain RAG chain
│   │
│   └── utils/                   # Utility modules
│       ├── __init__.py
│       ├── helpers.py           # Logging, formatting
│       ├── chunking.py          # Text splitting logic
│       └── text_processor.py    # File parsing
│
├── app.py                       # Streamlit web interface
├── main.py                      # CLI entry point
├── setup_project.py             # Setup automation
├── requirements.txt             # Python dependencies
│
├── .env.template                # Config template
├── .env                         # Config (create from template)
│
└── README.md                    # User documentation
```

🔄 Workflow Diagram

Processing Documents

User Upload File
       ↓
    Extract Text (text_processor.py)
       ↓
    Split into Chunks (chunking.py)
       ↓
    Generate Embeddings (embedding_service.py)
       ↓
    Upsert to Pinecone (pinecone_manager.py)
       ↓
    Document Ready for Queries
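
The five processing steps above can be composed as a small pipeline sketch. All four callables (`extract_text`, `split_into_chunks`, `embed`, `upsert`) are hypothetical stand-ins for the modules named in parentheses, injected here so the sketch runs without API keys:

```python
def process_document(path, extract_text, split_into_chunks, embed, upsert):
    """Compose the document-processing workflow shown above."""
    text = extract_text(path)                      # text_processor.py
    chunks = split_into_chunks(text)               # chunking.py
    vectors = [embed(chunk) for chunk in chunks]   # embedding_service.py
    upsert(vectors)                                # pinecone_manager.py
    return len(chunks)

# Stand-in callables; the real implementations live in src/
result = process_document(
    "doc.txt",
    extract_text=lambda path: "hello world text",
    split_into_chunks=lambda text: [text[:5], text[5:]],
    embed=lambda chunk: [float(len(chunk))],
    upsert=lambda vectors: None,
)
# result is the chunk count, matching what process_file() reports
```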

Answering Questions

User Question (Chat Interface)
       ↓
    Generate Question Embedding (embedding_service.py)
       ↓
    Search Pinecone for Similar Chunks (pinecone_manager.py)
       ↓
    Retrieve Top-K Results (default: 5)
       ↓
    Format Context from Retrieved Chunks
       ↓
    Send to Gemini with Custom Prompt (rag_chain.py)
       ↓
    Stream Response to User
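
The "Format Context" step above can be sketched in plain Python. `format_context` is a hypothetical helper (not part of the repository's documented API) showing how retrieved chunks might be assembled into the prompt context:

```python
def format_context(chunks: list[dict]) -> str:
    """Join retrieved chunks into a single context string,
    labeling each chunk with its source document."""
    parts = []
    for chunk in chunks:
        source = chunk["metadata"]["source"]
        text = chunk["metadata"]["text"]
        parts.append(f"[{source}]\n{text}")
    return "\n\n".join(parts)

# Shaped like matches from a Pinecone query() response
retrieved = [
    {"metadata": {"source": "a.txt", "text": "Alpha facts."}},
    {"metadata": {"source": "b.txt", "text": "Beta facts."}},
]
context = format_context(retrieved)
```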

🔌 API Integrations

Google Gemini API

  • Models Used:
    • models/embedding-001 - Text embeddings
    • gemini-2.5-flash - Text generation
  • Key Operations:
    • embed_content() - Generate embeddings
    • ChatGoogleGenerativeAI() - LLM interface

Pinecone API

  • Index: rag-documents-index (configurable)
  • Key Operations:
    • create_index() - Initialize vector database
    • upsert() - Store vectors with metadata
    • query() - Semantic search

💾 Data Schema

Pinecone Vector Format

```jsonc
{
  "id": "filename_0_a1b2c3d4",
  "values": [0.23, 0.45, ...],  // 768-dimensional embedding
  "metadata": {
    "chunk_index": 0,
    "source": "document.txt",
    "text": "First 500 characters of chunk..."
  }
}
```

Document Metadata

  • chunk_index: Position in source document
  • source: Original filename
  • text: Content preview (first 500 chars)
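
A record matching this schema could be built as follows. `build_vector` is a hypothetical helper (the repository's actual code may differ); the hash suffix on the ID is an assumption about how the `a1b2c3d4` part is derived:

```python
import hashlib

def build_vector(filename: str, chunk_index: int, text: str,
                 embedding: list[float]) -> dict:
    """Build a Pinecone-style record matching the schema above."""
    # Short content hash keeps IDs unique per chunk (assumed scheme)
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()[:8]
    return {
        "id": f"{filename}_{chunk_index}_{digest}",
        "values": embedding,
        "metadata": {
            "chunk_index": chunk_index,
            "source": filename,
            "text": text[:500],  # preview only: first 500 characters
        },
    }

vec = build_vector("document.txt", 0, "First chunk text", [0.0] * 768)
```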

⚙️ Configuration Parameters

| Parameter | Default | Purpose |
|---|---|---|
| `CHUNK_SIZE` | 1000 | Characters per chunk |
| `CHUNK_OVERLAP` | 200 | Character overlap between chunks |
| `RETRIEVAL_TOP_K` | 5 | Number of results to retrieve |
| `EMBEDDING_DIMENSION` | 768 | Embedding vector dimension |
| `LANGCHAIN_VERBOSE` | False | Enable verbose logging |
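
A minimal sketch of how these parameters might be read from the environment with the documented defaults (the repository's actual `Config` class in `config.py` may differ):

```python
import os

# Read each documented parameter, falling back to its default
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
RETRIEVAL_TOP_K = int(os.getenv("RETRIEVAL_TOP_K", "5"))
EMBEDDING_DIMENSION = int(os.getenv("EMBEDDING_DIMENSION", "768"))
LANGCHAIN_VERBOSE = os.getenv("LANGCHAIN_VERBOSE", "False").lower() == "true"
```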

🔐 Security & Safety

Built-in Safeguards

  1. Hallucination Prevention

    • Custom prompt instructs model to refuse out-of-scope questions
    • "I don't have information in the uploaded documents to answer that."
  2. Context Verification

    • Only uses retrieved documents as context
    • No external data sources
  3. Source Attribution

    • Links answers back to source documents
    • Shows document excerpts
  4. Logging

    • All operations logged for audit trail
    • Configurable log levels
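
A refusal-instructing prompt in the style described under Hallucination Prevention might look like this (illustrative only; the actual template lives in `rag_chain.py`):

```python
# Sketch of a grounding prompt; the real template may differ.
QA_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply exactly:
"I don't have information in the uploaded documents to answer that."

Context:
{context}

Question: {question}
Answer:"""

prompt = QA_TEMPLATE.format(
    context="The sky is blue.",
    question="What color is the sky?",
)
```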

🚀 Performance Optimization

Chunking Strategy

  • Recursive character splitting
  • Respects semantic boundaries (paragraphs, sentences)
  • Configurable size and overlap
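
The size/overlap idea can be illustrated with a naive sliding-window splitter. The project itself uses recursive character splitting via LangChain, which additionally prefers to break at paragraph and sentence boundaries; this sketch only shows how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact:

```python
def split_with_overlap(text: str, chunk_size: int = 1000,
                       overlap: int = 200) -> list[str]:
    """Fixed-size splitter with overlap. Each chunk starts
    (chunk_size - overlap) characters after the previous one,
    so consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text, chunk_size=1000, overlap=200)
# Consecutive chunks share their 200-character overlap region
```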

Embedding Strategy

  • Batch processing for multiple texts
  • Caching ready (can be added)
  • Async support ready

Retrieval Strategy

  • Vector similarity search (cosine distance)
  • Top-K filtering
  • Metadata filtering support
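
The core of the retrieval strategy, cosine-similarity ranking with top-K filtering, can be sketched in pure Python. Pinecone performs this search server-side via `query(top_k=...)`; this sketch just makes the ranking explicit:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], vectors: list[dict], k: int = 5) -> list[dict]:
    """Rank stored vectors by similarity to the query, keep the k best."""
    scored = [(cosine_similarity(query_vec, v["values"]), v) for v in vectors]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [v for _, v in scored[:k]]

docs = [
    {"id": "a", "values": [1.0, 0.0]},
    {"id": "b", "values": [0.0, 1.0]},
    {"id": "c", "values": [0.9, 0.1]},
]
best = top_k([1.0, 0.0], docs, k=2)  # "a" is an exact match, "c" is close
```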

🛠️ Development Guide

Adding New Features

  1. New File Formats

     ```python
     # Add to text_processor.py
     elif file_ext == ".new_format":
         return TextProcessor._extract_from_new_format(file_path)
     ```

  2. Custom Prompt Templates

     ```python
     # Modify in rag_chain.py _create_qa_chain()
     CUSTOM_PROMPT = PromptTemplate(
         template="Your custom template...",
         input_variables=["context", "question"]
     )
     ```

  3. New Retrieval Strategies

     ```python
     # Create in rag_chain.py
     def retrieve_with_reranking(self, question: str):
         ...  # custom retrieval logic goes here
     ```

📊 Monitoring & Debugging

Enable Verbose Logging

```bash
# In .env file
LANGCHAIN_VERBOSE=True
LOG_LEVEL=DEBUG
```

Check Index Stats

```python
from src.rag import PineconeManager

pm = PineconeManager()
stats = pm.get_index_stats()
print(stats)  # Shows vector counts, dimensions
```

🧪 Testing

Test Document Processing

```bash
# Create test file
echo "Test content" > test.txt

# Process it
python main.py process test.txt
```

Test RAG Chain

```python
from src.rag import RAGChain

chain = RAGChain()
result = chain.query("Test question?")
print(result["answer"])
```

📝 Code Examples

Process Documents Programmatically

```python
from src.rag import DocumentProcessor

processor = DocumentProcessor()
chunks = processor.process_file("document.txt", "document.txt")
print(f"Created {chunks} chunks")
```

Query Documents

```python
from src.rag import RAGChain

chain = RAGChain()
result = chain.query("What is the main topic?")
print(result["answer"])
for doc in result["source_documents"]:
    print(f"Source: {doc.metadata['source']}")
```

🔄 Update & Maintenance

Updating Documents

  1. Process new documents (appends to index)
  2. Use --namespace flag for isolation
  3. Clear index if needed: Update PINECONE_INDEX_NAME in .env

Clearing Data

  1. Streamlit UI: "Clear All Data" button
  2. CLI: Create new index with different name

📈 Scalability

Suitable For

  • Small to medium document repositories (scales to millions of vectors)
  • Real-time query performance needs
  • Multi-tenant support (via namespaces)
  • Cost-effective vector storage

When to Scale

  • Consider vector database partitioning
  • Implement caching layer
  • Add async batch processing
  • Monitor Pinecone index size
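
A caching layer for embeddings can start as simple memoization on the input text. In this sketch, `cached_embed` returns a placeholder vector standing in for the real Gemini call, which is the expensive operation worth caching:

```python
from functools import lru_cache

call_count = 0  # tracks how many "API calls" actually happen

@lru_cache(maxsize=4096)
def cached_embed(text: str) -> tuple:
    """Memoized embedding lookup. The body is a placeholder for
    the real (and expensive) Gemini embedding request."""
    global call_count
    call_count += 1
    return tuple(float(ord(c)) for c in text[:4])  # fake embedding

cached_embed("hello")
cached_embed("hello")  # repeat query served from cache, no second call
```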

πŸ› Common Issues & Solutions

Issue Cause Solution
No embeddings generated Invalid API key Check GOOGLE_API_KEY
Connection refused API timeout Check internet, retry
Hallucinated answers Prompt design Adjust prompt template
Slow queries Large TOP_K Reduce RETRIEVAL_TOP_K
Memory issues Large documents Reduce CHUNK_SIZE

📚 Module Reference

config.py

  • Config class with all settings
  • validate() method for config checks

embedding_service.py

  • EmbeddingService for generating vectors
  • embed_text() - single text
  • embed_texts() - batch processing

pinecone_manager.py

  • PineconeManager for index operations
  • create_index() - setup
  • upsert_vectors() - store
  • query_vectors() - retrieve

document_processor.py

  • DocumentProcessor for full pipeline
  • process_file() - single file
  • process_multiple_files() - batch

rag_chain.py

  • RAGChain for Q&A
  • query() - get answers
  • is_relevant_to_documents() - check relevance

🎓 Learning Path

  1. Beginner: Use Streamlit UI only
  2. Intermediate: Explore CLI commands
  3. Advanced: Modify code and add features
  4. Expert: Integrate into production systems

📞 Support Resources


Version: 1.0.0
Last Updated: December 2024
Status: Production Ready