This README is a practical guide for building your intuition and skills in Retrieval-Augmented Generation (RAG). It helps you understand foundational concepts and connect them into a working RAG pipeline.
If you're new, start with the basics:
- Machine learning fundamentals: vectors, matrices, embeddings, similarity metrics
- Python fundamentals: data structures, functions, basic file I/O
- Deep learning basics: neural networks, activation functions, loss functions, optimizers
You can run example code without mastering these, but understanding them helps you debug and improve RAG systems.
- A relational database stores data in rows and columns.
- A vector database stores data as high-dimensional points (vectors).
An embedding is a numerical representation of an object (text, image, audio, etc.) that captures meaning. Similar items end up closer together in vector space, which enables similarity search.
- Relational DB: a 2D table of values
- Vector DB: a 3D (or higher-dimensional) space of points
When you store vectors, you’re storing points, not rows/columns. For example, an image of a dog could be represented by latent dimensions like “four-legged”, “mammal”, and “domestic”—each dimension contributes to where it lands in vector space.
- Compare meaning, not just exact keywords
- Enable similarity search (nearest neighbors)
- Power retrieval for RAG
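To make similarity search concrete, here is a toy check with NumPy. The 3-dimensional vectors and their "latent dimensions" are invented for illustration; real embedding models output hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1.0 = same direction, ~0.0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up "embeddings" with dimensions like four-legged / mammal / domestic
dog    = np.array([0.9, 0.8, 0.9])
wolf   = np.array([0.9, 0.8, 0.1])
banana = np.array([0.0, 0.1, 0.2])

print(cosine_similarity(dog, wolf))    # high: semantically close
print(cosine_similarity(dog, banana))  # low: semantically distant
```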
A typical RAG system has two phases:
- Indexing (offline / batch): prepare your knowledge base for retrieval
- Retrieval + Generation (online): retrieve relevant context and generate an answer
- Document Corpus (Knowledge Base): your source of truth (web pages, PDFs, internal docs, tickets, wikis, etc.)
- Document Loader (Ingestion): converts raw sources into text + metadata (title, URL, page number, date, etc.)
- Text Splitter (Chunking): splits text into smaller units (“chunks”) that are easier to embed and retrieve.
- Embedding Model: converts chunks (and queries) into vectors.
- Vector Store / Vector Database: stores vectors + metadata and supports nearest-neighbor search.
- Retriever: given a query, returns the top-k relevant chunks; may also apply filters, reranking, or query expansion.
- Prompt Template: formats the question + retrieved context into a structured input for the LLM.
- LLM (Generator): produces the final answer using the retrieved context (ideally with citations).
- Orchestrator (Pipeline/Chain): coordinates the steps: load → split → embed → store → retrieve → prompt → generate. This is where LangChain (or similar) fits.
Note: In LangChain, many of these are modular components that you connect together.
Goal: Turn messy sources into clean text + metadata.
Common loaders:
- PDF: PyMuPDF, pdfplumber, PDFMiner
- HTML/Web: BeautifulSoup, lxml, newspaper3k (optional), custom scrapers
- CSV: pandas, Python `csv`
- JSON: Python `json`, jq-like transforms, custom parsing
Tip: Always capture metadata (source URL, doc title, page number, timestamp).
Metadata becomes crucial for filtering and citations later.
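As a sketch of this step, the following uses pdfplumber (one of the loaders above) to produce per-page text plus metadata. The file name and the dict layout are assumptions for the demo.

```python
import pdfplumber

docs = []
with pdfplumber.open("handbook.pdf") as pdf:  # hypothetical source file
    for page_number, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ""  # extract_text() can return None
        docs.append({
            "text": text,
            "metadata": {"source": "handbook.pdf", "page": page_number},
        })
```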
Typical steps:
- Remove headers/footers (especially PDFs)
- Fix weird whitespace / hyphenation / broken lines
- Convert tables carefully (tables can break retrieval if flattened poorly)
- Normalize encoding, remove boilerplate
Goal: Retrieve just enough relevant context without losing meaning.
Common splitting strategies:
- By characters (fixed size): simplest, but risks breaking sentences
- By paragraphs/sentences: better readability and context integrity
- Recursive splitting: split by sections → paragraphs → sentences → characters (best general approach)
Important chunking parameters:
- Chunk size: too small = lacks context; too large = retrieval becomes noisy
- Chunk overlap: helps preserve continuity between chunks
- Structure-aware splitting: preserve headings and sections when possible
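A minimal recursive-splitting sketch with LangChain’s RecursiveCharacterTextSplitter (import paths vary slightly across LangChain versions). The size, overlap, and separator values are illustrative starting points, not recommendations:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = open("handbook.txt", encoding="utf-8").read()  # any cleaned text

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # max characters per chunk
    chunk_overlap=100,    # shared characters between adjacent chunks
    separators=["\n\n", "\n", ". ", " "],  # sections → paragraphs → sentences → words
)
chunks = splitter.split_text(long_document_text)
```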
Before retrieval, many systems do some combination of:
- Query rewriting: make the user question clearer / more searchable
- Decomposition: split a complex question into sub-questions
- Intent classification: decide if retrieval is needed or not
Goal: Fetch top-k chunks that are most relevant.
Retrieval types:
- Dense retrieval (vector search): semantic similarity using embeddings
- Sparse retrieval (keyword/BM25): exact term matching
- Hybrid retrieval: combine dense + sparse for better coverage
Common improvements:
- Metadata filters (e.g., `source=handbook`, `date>2024`, `product=payments`)
- Reranking: a second model re-sorts candidates for relevance
- MMR (diversity): reduces redundant chunks and increases coverage
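A hedged sketch of hybrid retrieval using LangChain’s BM25Retriever and EnsembleRetriever (BM25Retriever additionally needs the rank_bm25 package; the toy texts, k values, and 0.6/0.4 weights are arbitrary):

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = ["Refunds are processed within 5 days.", "Reset your password via email."]

# Dense side: semantic similarity over embeddings
dense = FAISS.from_texts(texts, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 4})

# Sparse side: BM25 keyword matching
sparse = BM25Retriever.from_texts(texts)
sparse.k = 4

hybrid = EnsembleRetriever(retrievers=[dense, sparse], weights=[0.6, 0.4])
results = hybrid.invoke("How long do refunds take?")
```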
Goal: Provide the LLM with reliable context.
Good prompt patterns:
- Provide retrieved chunks under a clear “Context” section
- Instruct the model to answer only using context
- Ask it to say “I don’t know” when context is insufficient
- Include a citation format like `[source: doc.pdf p.3]`
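Putting these patterns together, a minimal template might look like this; the wording and citation convention are just one possibility:

```python
PROMPT = """Answer the question using ONLY the context below.
If the context is insufficient, say "I don't know."
Cite sources like [source: doc.pdf p.3].

Context:
{context}

Question: {question}
Answer:"""

prompt = PROMPT.format(context="...retrieved chunks...", question="How do refunds work?")
```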
The LLM generates the answer using:
- The user question
- The retrieved context
- The instruction/prompt template
Common generation controls:
- Short factual answers vs. long explanations
- Citation requirements
- Structured output (JSON, bullets, steps)
- Fast / lightweight: good for prototypes; may lose nuance
- Balanced: good tradeoffs for most apps
- High-quality / expensive: best performance, higher cost and latency
What matters most in practice:
- Retrieval quality on your actual data
- Latency constraints
- Cost at scale
- Language/domain coverage
- Flat (brute force): simplest, accurate; slow at large scale
- ANN (Approximate Nearest Neighbor): fast and scalable
Common approaches/tools: FAISS, HNSW, Annoy, ScaNN
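The contrast is easy to see in FAISS, which offers both index types; the vector count and dimensionality below are arbitrary demo values:

```python
import faiss
import numpy as np

dim = 128
vectors = np.random.rand(10_000, dim).astype("float32")

flat = faiss.IndexFlatL2(dim)        # brute force: exact, scans every vector
flat.add(vectors)

hnsw = faiss.IndexHNSWFlat(dim, 32)  # ANN: graph-based, fast at large scale
hnsw.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = hnsw.search(query, 5)  # approximate top-5 neighbors
```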
RAG is usually improved through iteration. Common failure modes:
- Bad chunking: retrieved chunks miss the necessary context
- Poor retrieval: embeddings don’t match your domain well
- Context overload: too many chunks → model gets confused
- Hallucination: model answers beyond provided context
- Source mismatch: wrong documents retrieved because metadata wasn’t used
Basic evaluation checklist:
- Are retrieved chunks actually relevant to the question?
- Are the chunks complete enough to answer correctly?
- Does the prompt force the model to use context?
- Are citations pointing to the correct sources?
LangChain helps implement RAG by providing modular building blocks.
- Document Loaders: ingest PDFs/HTML/CSV/JSON into `Document` objects
- Text Splitters: create chunks while maintaining context
- Embeddings: plug in an embedding model
- Vector Stores: store/query vectors (FAISS, Chroma, etc.)
- Retrievers: unify retrieval logic (top-k, filters, hybrid patterns)
- Chains: define a sequence of steps for your pipeline
- Agents: decide actions/tools dynamically (useful when retrieval is conditional)
- Memory: store conversation context (helpful for chat-style RAG)
- Prompts: templates that guide consistent and safe LLM output
Reminder: LangChain simplifies wiring—but you still need to understand chunking, retrieval quality, and evaluation.
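As a minimal end-to-end sketch, the following wires these components into one LCEL chain. The model name, in-memory Chroma store, and one-sentence corpus are assumptions for the demo:

```python
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

store = Chroma.from_texts(
    ["Refunds are processed within 5 business days."],  # toy corpus
    OpenAIEmbeddings(),
)
retriever = store.as_retriever(search_kwargs={"k": 3})

def format_docs(docs):
    # Join retrieved chunks into a single context string
    return "\n\n".join(d.page_content for d in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer only from this context:\n{context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini")  # assumed model name
    | StrOutputParser()
)
print(chain.invoke("How fast are refunds?"))
```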
As you move beyond basic RAG, the focus shifts from “making retrieval work” to making it reliable, precise, and scalable. Advanced RAG techniques improve retrieval quality, reduce hallucinations, handle complex questions, and support specialized data types (like images, tables, or multiple document granularities).
- Hybrid Retrieval (Dense + Sparse): combines vector search (semantic similarity) with keyword search (BM25/lexical). Helpful when queries include exact tokens like IDs, error codes, filenames, or proper nouns.
- Multi-Step Retrieval (Coarse-to-Fine): uses stages to improve scalability and relevance: retrieve broadly first, then refine. Common pattern: document/section retrieval → chunk retrieval → reranking.
- Reranking (Second-Stage Retrieval): after retrieving candidates (e.g., the top-20), rerank them using stronger relevance signals. Improves the precision of the final top-k chunks used in the prompt.
- Filtering (Noise Reduction + Control): exclude irrelevant or low-quality content before/after retrieval. Often uses content rules (duplicates/boilerplate), metadata constraints (date/source), or user scope.
- Hierarchical Retrieval: retrieve at multiple granularities (document → section → chunk). Reduces noise in long documents and improves context targeting.
- Multi-Vector Retrieval: store multiple embeddings per unit (chunk/doc) to represent different views: content, title, summary, entities, or Q&A pairs. Improves recall across different query styles (see the sketch after this list).
- Contextual / Query-Aware Retrieval: rewrite or expand the query using synonyms, related concepts, or intent-aware rewriting to improve retrieval matches.
- Knowledge-Enhanced Retrieval: use structured knowledge such as entity linking, taxonomies, or knowledge graphs to connect related concepts and reduce ambiguity.
- Cross-Modal Retrieval: retrieve across modalities (text ↔ image ↔ audio). Useful for diagrams, screenshots, tables-as-images, and mixed-media knowledge bases.
- Active Learning (User Feedback Loop): use user feedback (clicks, ratings, corrections) to improve chunking, retrieval, reranking, and prompting over time.
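As one concrete illustration of multi-vector retrieval, LangChain’s MultiVectorRetriever can search over summary embeddings while returning the full parent documents. The summary text and ID scheme below are invented for the demo:

```python
from uuid import uuid4
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

full_docs = [Document(page_content="...long refund policy text...")]
summaries = ["Refund policy: 5-day processing, eligibility rules."]
ids = [str(uuid4()) for _ in full_docs]

# Embed the summaries (one "view"), keyed back to their parent documents
summary_docs = [
    Document(page_content=s, metadata={"doc_id": ids[i]})
    for i, s in enumerate(summaries)
]

retriever = MultiVectorRetriever(
    vectorstore=Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),  # holds the full parent documents
    id_key="doc_id",
)
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(ids, full_docs)))
```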
Chunking strongly affects retrieval quality—if the chunk boundaries are wrong, even the best embedding model may retrieve the wrong context.
- Importance of Text Splitting: splitting controls what becomes “retrievable.” If key facts are split apart, retrieval may miss them or return incomplete context.
- Context Preservation: use overlap, structure-aware splitting (headings/sections), and sentence/paragraph boundaries to keep meaning intact.
- Impact on Retrieval Accuracy: good chunking improves both recall (the ability to find relevant information) and precision (fewer irrelevant results in the top-k).
Chunking is a tradeoff between speed, cost, and answer quality.
- Chunk Size:
  - Smaller chunks → faster retrieval and more precise matches, but may lose necessary context.
  - Larger chunks → better context and coherence, but may include irrelevant text and waste tokens.
- Semantic Coherence: aim to make each chunk represent a complete idea. A good chunk usually contains enough context to answer a question without relying on neighboring chunks.
- Synonym Expansion: expand queries with synonyms or alternate terms. Example: “refund” → “reimbursement”, “chargeback”, “return”.
- Concept Expansion: add related concepts that often appear with the same intent. Example: “reset password” → “account recovery”, “2FA”, “email verification”.
- Query Reformulation (Query Rewriting): rewrite the user query to be more searchable and retrieval-friendly (see the sketch below). Example: “Why can’t I log in?” → “login error troubleshooting steps and common causes”.
Tip: Many production systems use a combination: hybrid retrieval + reranking + good chunking before trying more complex solutions.
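A common implementation of query reformulation is simply asking an LLM to rewrite the question; the prompt wording and model name below are assumptions:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # assumed model name

def rewrite_query(user_query: str) -> str:
    # Ask the LLM for a retrieval-friendly version of the question
    msg = llm.invoke(
        "Rewrite this question as a concise, keyword-rich search query:\n"
        + user_query
    )
    return msg.content

print(rewrite_query("Why can't I log in?"))
# e.g. "login error troubleshooting steps and common causes"
```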
- Reranking: reorders retrieved results using additional signals to improve relevance, often applied after retrieving the top-k or top-n candidates. Common reranking approaches:
  - Contextual Reranking: uses the query + candidate chunk together to judge relevance more accurately.
  - Model-Based Reranking: uses ML models (e.g., cross-encoders, LLM rerankers) to score and reorder results; see the sketch after this list.
  - Feedback-Based Ranking: adapts ranking based on user feedback (clicks, upvotes/downvotes, corrections).
- Filtering: excludes irrelevant, low-quality, or out-of-scope data before or after retrieval to reduce noise and improve answer quality. Common filtering approaches:
  - Content-Based Filtering: remove boilerplate, short/empty chunks, duplicates, or low-information text.
  - Metadata Filtering: filter by source, date, department, doc type, language, access level, etc.
  - User-Preference Filtering: personalize results based on role, history, or user-selected scope.
- Coarse-to-Fine Retrieval Strategies: an initial broad search (coarse) is followed by refined retrieval (fine) to handle large datasets efficiently. How it works:
  - Coarse stage: retrieve high-level candidates (documents/sections) quickly.
  - Fine stage: search within those candidates using smaller chunks, better reranking, or stricter filters.
  Advantages:
  - Scalability: reduces compute by narrowing the search space early.
  - Accuracy: later stages apply deeper analysis (reranking, better chunk matching) to improve relevance.
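The sketch below combines two of the ideas above: a coarse candidate set is refined (the “fine” stage) by a sentence-transformers cross-encoder doing model-based reranking. The checkpoint name is a commonly used public model, and the candidates are toy data:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How long do refunds take?"
candidates = [  # imagine these came from a fast first-stage retrieval
    "Refunds are processed within 5 business days.",
    "Password resets are sent by email.",
    "Chargebacks are handled by the payments team.",
]

# Score each (query, chunk) pair jointly, then keep the best two
scores = reranker.predict([(query, c) for c in candidates])
top_k = [c for _, c in sorted(zip(scores, candidates), reverse=True)][:2]
```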
The `5_caching_responses_rag` folder now follows a step-by-step learning structure:
- `5_caching_responses_rag/a_in_memory/with_openai.py`: in-memory query cache (single-process demo)
- `5_caching_responses_rag/b_postgres/with_openai.py`: persistent query cache in PostgreSQL
- `5_caching_responses_rag/c_redis/with_openai.py`: fast query cache in Redis (optional TTL)
Module guide: `5_caching_responses_rag/README.md`
Why this matters:
- Reduces repeated LLM cost for identical query + retrieved context
- Improves response latency for repeated questions
- Shows tradeoffs between local memory, durable SQL cache, and low-latency key-value cache
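For intuition, the in-memory variant can be as small as this sketch. The key scheme (hashing query + retrieved context together) is one reasonable convention, and `generate` stands in for the real LLM call:

```python
import hashlib

_cache: dict[str, str] = {}

def cache_key(query: str, context: str) -> str:
    # Identical query + context → identical key → cache hit
    return hashlib.sha256(f"{query}||{context}".encode()).hexdigest()

def answer_with_cache(query: str, context: str, generate) -> str:
    key = cache_key(query, context)
    if key not in _cache:            # call the LLM only on a cache miss
        _cache[key] = generate(query, context)
    return _cache[key]
```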
Possible future additions:
- A “Mini Projects” section (e.g., PDF Q&A, website FAQ bot, internal docs assistant)
- A “Metrics” section (precision@k, recall@k, answer faithfulness, citation accuracy)
- A “RAG patterns” section (multi-query, reranking, hybrid search, graph RAG)
SQL Agent:
- https://docs.langchain.com/oss/python/langchain/sql-agent
- https://docs.langchain.com/oss/python/langchain/rag

Caching Response:
- https://aws.amazon.com/blogs/database/optimize-llm-response-costs-and-latency-with-effective-caching/
- https://colab.research.google.com/github/sugarforever/LangChain-Tutorials/blob/main/LangChain_Caching.ipynb