Retrieval-Augmented Document Question Answering (RAG) System

1. Problem Statement

Large Language Models (LLMs) generate responses based on patterns learned during training. While effective for general knowledge, they fail in scenarios involving private, domain-specific, or up-to-date documents. When asked questions outside their training data, LLMs often produce hallucinated or incorrect answers without indicating uncertainty.

This project addresses the problem of reliable question answering over custom documents, where:

The information is not publicly available to the model.
Answers must be grounded in provided documents.
The system should clearly indicate when an answer is not supported by the data.

2. Solution Overview

This system implements Retrieval-Augmented Generation (RAG) to combine document retrieval with language generation.

Instead of asking the LLM to answer directly, the system:

Retrieves relevant document excerpts based on the user query.
Provides only this retrieved context to the LLM.
Constrains the model to generate answers strictly from the provided context.

This approach reduces hallucinations and enables traceable, document-grounded answers.

3. System Architecture

End-to-end flow:

User Query
- A question is sent to the API.
Embedding Generation
- The query is converted into a dense vector using a sentence embedding model.
Vector Retrieval
- FAISS performs similarity search over document embeddings.
- Maximal Marginal Relevance (MMR) is used to select relevant and diverse chunks.
Context Construction
- Retrieved document chunks are combined into a single context block.
Prompt Construction
- A constrained prompt instructs the LLM to answer using only the provided context.
LLM Generation
- The LLM generates a concise answer or explicitly states uncertainty.
Response Assembly
- The final response includes the answer and the source document filenames.

4. Dataset & Preprocessing

Documents

Plain text (.txt) documents stored locally.
Each file represents an independent knowledge source.

Chunking Strategy

Chunk size: 350 characters
Chunk overlap: 50 characters

Rationale:

Ensures each chunk fits within LLM context limits.
Overlap prevents loss of information at chunk boundaries.
Values were selected empirically to balance retrieval granularity and semantic coherence.

5. Models & Design Choices

Embedding Model

Model: sentence-transformers/all-MiniLM-L6-v2
Why: Lightweight, fast, and well-suited for semantic similarity tasks.

Language Model

Model: google/flan-t5-base
Why: Instruction-tuned and capable of following constrained prompts.
Trade-offs: High memory usage for local inference; slower startup times.

Vector Store

FAISS
Why: Efficient in-memory similarity search, suitable for small-to-medium datasets.

Retrieval Strategy

MMR (Maximal Marginal Relevance)
Why: Reduces redundant chunks and improves coverage of distinct relevant information, which helps reduce hallucinations.

6. API Design

Endpoint

POST /ask

Request Example

{
  "question": "What role does FAISS play in document retrieval?"
}

Response Example

{
  "question": "What role does FAISS play in document retrieval?",
  "answer": "FAISS is used to perform efficient similarity search over document embeddings to retrieve relevant context for answering questions.",
  "sources": ["vector_databases.txt"]
}

Source Attribution

Sources are extracted from the metadata of retrieved document chunks.
Filenames are returned to provide transparency into where the answer originated.

7. Deployment

Dockerization

The application is containerized using Docker.
All dependencies are installed at build time for reproducibility.

AWS Deployment

Deployed on AWS EC2 (Ubuntu 22.04 LTS).
Container exposed via port 8000.
Swagger UI available for interactive testing.

Environment Configuration

CPU-only inference.
Instance memory requirements dictated local LLM usage.

8. Challenges & Learnings

Memory Constraints

Local inference with flan-t5-base caused Out-Of-Memory (OOM) crashes on low-memory instances.

Debugging Process

Inspected Docker logs to identify premature container exits.
Used Linux kernel logs (dmesg) to confirm OOM kills.
Adjusted instance memory configuration to stabilize the service.

Key Learnings

Local LLM inference has non-trivial infrastructure requirements.
Production deployment requires careful resource planning.
System-level debugging is critical for ML applications in production.

9. Limitations

Scalability: FAISS is in-memory and not suitable for very large datasets.
Accuracy: Retrieval quality directly impacts answer quality.
Multi-document reasoning: The system does not perform advanced reasoning across multiple documents.
Cold start latency: Model loading increases startup time.
Evaluation: No automated metrics are implemented for answer quality.

10. Future Improvements

Add a reranking stage to improve retrieval precision.
Experiment with adaptive or semantic chunking strategies.
Replace FAISS with a production-grade vector database.
Introduce monitoring, logging, and evaluation pipelines.
Support document updates without full re-indexing.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
test_outputs		test_outputs
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
app.py		app.py
requirements.txt		requirements.txt
sample_questions.txt		sample_questions.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Retrieval-Augmented Document Question Answering (RAG) System

1. Problem Statement

2. Solution Overview

3. System Architecture

4. Dataset & Preprocessing

Documents

Chunking Strategy

5. Models & Design Choices

Embedding Model

Language Model

Vector Store

Retrieval Strategy

6. API Design

Endpoint

Request Example

Response Example

Source Attribution

7. Deployment

Dockerization

AWS Deployment

Environment Configuration

8. Challenges & Learnings

Memory Constraints

Debugging Process

Key Learnings

9. Limitations

10. Future Improvements

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Retrieval-Augmented Document Question Answering (RAG) System

1. Problem Statement

2. Solution Overview

3. System Architecture

4. Dataset & Preprocessing

Documents

Chunking Strategy

5. Models & Design Choices

Embedding Model

Language Model

Vector Store

Retrieval Strategy

6. API Design

Endpoint

Request Example

Response Example

Source Attribution

7. Deployment

Dockerization

AWS Deployment

Environment Configuration

8. Challenges & Learnings

Memory Constraints

Debugging Process

Key Learnings

9. Limitations

10. Future Improvements

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages