This project is a practical implementation of Retrieval-Augmented Generation (RAG). The goal is simple: upload or use a PDF as the knowledge source, then let users chat with an AI assistant grounded in that PDF.
A base LLM is powerful, but it does not automatically know your private PDF content. If the user asks specific questions (for example, "what is in case #32?"), the model may hallucinate or answer vaguely without data grounding.
flowchart LR
D[Large set of PDF pages] --> LLM[LLM]
U[User asks a specific question] --> LLM
LLM --> R[Response]
N[No built-in context about your private docs] -.-> LLM
A naive approach is: extract all text and send everything into the prompt every time.
Problems:
- High token cost
- Context window limits
- Slower responses
- Irrelevant text in the prompt
flowchart LR
P[All PDF text] --> S[System Prompt + Full Text]
Q[User Query] --> S
S --> LLM[LLM]
LLM --> A[Answer]
C1[High cost] -.-> S
C2[Context window limit] -.-> S
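To make the cost argument concrete, here is a rough back-of-envelope sketch. It assumes about 4 characters per token (a common rule of thumb for English text, not an exact tokenizer count). The page count, characters per page, and chunk sizes are hypothetical numbers chosen only for illustration.

```python
# Rough estimate only: ~4 characters per token is a rule of thumb, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(num_chars: int) -> int:
    """Approximate the token count for a given number of characters."""
    return num_chars // CHARS_PER_TOKEN


# Hypothetical document: 300 pages at about 2,000 characters per page.
full_text_chars = 300 * 2_000

# Hypothetical retrieval setup: the 4 most relevant chunks of 1,000 characters each.
retrieved_chars = 4 * 1_000

full_prompt_tokens = estimate_tokens(full_text_chars)
rag_prompt_tokens = estimate_tokens(retrieved_chars)

print(f"Full-text prompt: ~{full_prompt_tokens:,} tokens per question")
print(f"Top-k chunks:     ~{rag_prompt_tokens:,} tokens per question")
print(f"Reduction:        ~{full_prompt_tokens // rag_prompt_tokens}x")
```

Under these assumed numbers, every question would carry the whole document in its prompt, while a retrieval-based prompt carries only a small fraction of it.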
RAG solves this by splitting work into two phases:
- Indexing phase
- Retrieval + generation phase
flowchart LR
subgraph Indexing
A[PDF] --> B[Chunking]
B --> C[Embedding Model]
C --> D[(Vector DB / Qdrant)]
end
subgraph Query Time
Q[User Query] --> QE[Query Embedding]
QE --> D
D --> R[Top-K Relevant Chunks]
R --> G[LLM Answer Generation]
Q --> G
end
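The two phases can be sketched in plain Python. This is a toy illustration, not the project's code: a bag-of-words counter stands in for the embedding model, an in-memory list stands in for Qdrant, and the sample chunks and function names (`embed`, `cosine`, `retrieve`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Indexing phase: embed every chunk once and keep (vector, text) pairs in memory.
# The chunk texts are made-up sample data.
chunks = [
    "Case 32 concerns a delayed shipment and the refund that followed.",
    "The onboarding guide explains how to request a laptop.",
    "Refund requests must be filed within 30 days of delivery.",
]
index = [(embed(chunk), chunk) for chunk in chunks]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Query phase: embed the query and return the k most similar chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


top_chunks = retrieve("What is in case 32?", k=1)
print(top_chunks)
```

The indexing work happens once, while `retrieve` runs per question and touches only the stored vectors. Only the returned chunks would then be passed to the LLM.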
In this project:
- Load PDF (PyPDFLoader)
- Split text into chunks (RecursiveCharacterTextSplitter)
- Convert chunks to vectors (OpenAIEmbeddings)
- Store vectors + metadata in Qdrant (QdrantVectorStore)
Metadata (like page number and source file) is preserved so answers can point users to where information came from.
flowchart LR
PDF[PDF Document] --> SPLIT[Chunking]
SPLIT --> CH1[Chunk A + metadata]
SPLIT --> CH2[Chunk B + metadata]
SPLIT --> CH3[Chunk C + metadata]
CH1 --> EMB[Embedding Model]
CH2 --> EMB
CH3 --> EMB
EMB --> VDB[(Qdrant Vector DB)]
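To show how metadata can travel with each chunk, here is a simplified sketch. It uses a fixed-size sliding window with overlap, which is not the algorithm RecursiveCharacterTextSplitter implements. The `Chunk` class, `split_pages` function, sizes, and file name are hypothetical, and pages are numbered from 1 purely as a choice of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def split_pages(pages: list[str], source: str, chunk_size: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split each page into overlapping fixed-size chunks, tagging each with its origin."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), step):
            piece = page_text[start:start + chunk_size]
            chunks.append(Chunk(piece, {"source": source, "page": page_number, "start": start}))
            if start + chunk_size >= len(page_text):
                break  # this window already reached the end of the page
    return chunks


# Hypothetical two-page document standing in for extracted PDF text.
pages = ["A" * 450, "B" * 100]
chunks = split_pages(pages, source="example.pdf")

for chunk in chunks:
    print(chunk.metadata, len(chunk.text))
```

Because each chunk keeps its `source` and `page`, whatever is retrieved later can be traced back to a location in the original document.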
At query time:
- User asks a question
- Query is embedded
- Similar chunks are retrieved from Qdrant
- Retrieved context is injected into prompt
- LLM answers based on that context
flowchart LR
U[User Query] --> E[Query Embedding]
E --> VDB[(Qdrant Vector DB)]
VDB --> K[Top Relevant Chunks]
K --> P[Prompt with Context]
U --> P
P --> LLM[GPT Model]
LLM --> OUT[Grounded Answer + Page References]
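The context-injection step above might look like the following minimal sketch. The `build_prompt` function, the prompt wording, and the sample retrieval results are assumptions made for illustration. In the project the retrieved items would come from Qdrant rather than a hard-coded list.

```python
def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Assemble a prompt that contains only the retrieved chunks, each labeled with its page."""
    context_blocks = [
        f"[page {item['page']}, {item['source']}]\n{item['text']}" for item in retrieved
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite the page number for each fact. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieval results with made-up text, pages, and file name.
retrieved = [
    {"text": "Case 32 concerns a delayed shipment.", "page": 14, "source": "cases.pdf"},
    {"text": "Refunds are issued within 30 days.", "page": 27, "source": "cases.pdf"},
]

prompt = build_prompt("What is in case 32?", retrieved)
print(prompt)
```

Labeling each chunk with its page inside the prompt is one simple way to let the model include page references in its answer.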
- index.py: indexing pipeline (PDF -> chunks -> embeddings -> Qdrant)
- chat.py: retrieval + answer generation loop
- constants.py: shared constants (Qdrant URL, collection name, embedding model)
- .env.example: required environment variables template
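As a rough idea of what a shared constants module and an environment check could look like, here is a sketch. Every value is a placeholder: the collection name and embedding model are assumptions, the `missing_env_vars` helper is hypothetical, and the project's real constants.py may differ. It also assumes the variables are already in the process environment, since a .env file is not read automatically.

```python
import os

# Placeholder values for illustration only; the project's real constants.py may differ.
QDRANT_URL = "http://localhost:6333"        # 6333 is Qdrant's default HTTP port
COLLECTION_NAME = "pdf_chunks"              # hypothetical collection name
EMBEDDING_MODEL = "text-embedding-3-small"  # assumed OpenAI embedding model


def missing_env_vars(required: tuple[str, ...] = ("OPENAI_API_KEY",)) -> list[str]:
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Environment looks complete.")
```

Keeping these values in one module means the indexing and chat scripts cannot drift apart on the collection name or embedding model.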
- Create a .env file based on .env.example:
  OPENAI_API_KEY=your_openai_api_key_here
- Start Qdrant:
  docker compose up -d
- Build the vector index:
  python3 index.py
- Start chat:
  python3 chat.py

Key takeaways:
- An LLM alone is not enough for private knowledge grounding.
- Sending full documents in every prompt is expensive and limited.
- RAG gives a scalable pattern: index once, retrieve relevant chunks, generate grounded answers.
- Good chunking and metadata quality directly improve answer quality.