A simple, educational demo that compares two approaches to document-based question answering:
- CLaRa ("latent doc compression"): Documents are compressed into latent space without prompt stuffing
- Normal RAG: Retrieve top-K document chunks β stuff them into the prompt β LLM generates answer
```
clara-vs-rag-demo/
├── src/                  # Source code modules
│   ├── __init__.py       # Package initialization
│   ├── models.py         # Model loading and management
│   ├── rag.py            # Normal RAG pipeline implementation
│   ├── clara.py          # CLaRa pipeline implementation
│   ├── utils.py          # Utility functions (parsing, formatting)
│   └── app.py            # Gradio web application
├── tests/                # Test files
│   ├── __init__.py
│   └── test_frontend.py  # Frontend/utility function tests
├── examples/             # Example documents
│   ├── README.md
│   └── sample_docs.txt   # Sample documents for testing
├── app.py                # Main entry point
├── requirements.txt      # Python dependencies
├── README.md             # This file
├── LICENSE               # MIT License
└── .gitignore            # Git ignore rules
```
- `src/models.py`: Handles loading of all models (CLaRa, RAG LLM, embeddings)
- `src/rag.py`: Implements the Normal RAG pipeline (embed → retrieve → prompt → generate)
- `src/clara.py`: Implements the CLaRa pipeline (latent compression)
- `src/utils.py`: Helper functions for document parsing and output formatting
- `src/app.py`: Gradio UI setup and main application logic
- `app.py`: Simple entry point that launches the application
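For orientation, here is a minimal sketch of how the entry point might wire these modules together; `build_interface` is an assumed name, since the actual function names inside `src/` are not shown in this README:

```python
# app.py -- illustrative entry point; build_interface is an ASSUMED name
# for whatever src/app.py exposes to assemble the Gradio UI.
from src.app import build_interface

if __name__ == "__main__":
    demo = build_interface()   # constructs the Gradio interface object
    demo.launch()              # serves at http://localhost:7860 by default
```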
This demo helps you understand the key difference between these two approaches:
- Normal RAG = Prompt Stuffing: We retrieve the most relevant documents using embeddings, then paste them directly into the LLM's prompt. This is simple but can hit token limits and may not scale well.
- CLaRa = Latent Compression: Documents are passed directly to CLaRa, which compresses them into a latent representation. No prompt stuffing occurs; the model handles documents internally through its compression mechanism.
- Python 3.8+
- CUDA-capable GPU recommended (but CPU will work, just slower)
- ~15-20GB disk space for model downloads
```bash
pip install -r requirements.txt
python app.py
```

The app will:
- Download and load models (first run may take 10-15 minutes)
- Start a Gradio web interface at http://localhost:7860
- Paste Documents: Enter your documents in the textarea, separated by `---`:

  ```
  Document 1 text here...
  ---
  Document 2 text here...
  ---
  Document 3 text here...
  ```

- Enter Question: Type your question in the question box
- Choose Mode:
  - Normal RAG: Uses embedding-based retrieval + prompt stuffing
  - CLaRa: Uses latent document compression
- Adjust Top-K (RAG mode only): Select how many documents to retrieve (1-5)
- Click Run: See the answer, evidence, and timing information
Documents:

```
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower from 1887 to 1889.
---
The Statue of Liberty is a neoclassical sculpture on Liberty Island in New York Harbor. It was a gift from France to the United States and was dedicated in 1886.
---
The Great Wall of China is a series of fortifications made of stone, brick, and other materials, generally built along an east-to-west line across the historical northern borders of China.
```

Question: "Which monument was a gift from France?"
Normal RAG Output:
- Retrieves top-K documents based on similarity
- Shows which docs were selected and their similarity scores
- Shows prompt length (characters)
- Shows retrieval + generation timing
CLaRa Output:
- Shows number of documents passed to the model
- Shows generation timing
- No prompt stuffing; documents are compressed latently
- CLaRa: `apple/CLaRa-7B-Instruct` with the `compression-16` checkpoint
- RAG LLM: `Qwen/Qwen2.5-3B-Instruct` (or fallback to `Phi-3-mini-4k-instruct`)
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2`
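A hedged sketch of what the loading in `src/models.py` could look like with `transformers` and `sentence-transformers`; the CLaRa load is deliberately omitted because, as the Notes below say, its exact API should be checked against Apple's repository:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Embedding model used for RAG retrieval
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)

# Generator LLM for the RAG answer (Qwen first, Phi-3 as a fallback)
rag_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
rag_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype="auto",
    device_map="auto",  # requires the accelerate package
)
# CLaRa loading omitted: see Notes below about checking Apple's CLaRa repo
# for the exact loading call and the compression-16 checkpoint.
```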
- Embed all documents and the question using sentence transformers
- Compute cosine similarity (dot product of normalized embeddings)
- Select top-K documents
- Build prompt: System instruction + context docs + question
- Generate answer using the RAG LLM
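These steps map onto a few lines of code. A minimal sketch, assuming the embedding model listed above (the helper names are illustrative, not the repo's exact API):

```python
from __future__ import annotations
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve_top_k(docs: list[str], question: str, k: int = 3) -> list[tuple[str, float]]:
    # With normalized embeddings, the dot product equals cosine similarity.
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    top = np.argsort(scores)[::-1][:k]
    return [(docs[i], float(scores[i])) for i in top]

def build_prompt(retrieved: list[tuple[str, float]], question: str) -> str:
    # Prompt stuffing: retrieved documents are pasted verbatim into the prompt.
    context = "\n\n".join(doc for doc, _ in retrieved)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```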
- Pass documents directly to CLaRa model
- CLaRa compresses documents into latent space
- Generate answer using compressed representation
- No prompt stuffing occurs
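By contrast, the CLaRa path needs no retrieval or prompt assembly. The method names below are hypothetical placeholders that only illustrate the flow; the real interface must be taken from Apple's CLaRa repository (see Notes):

```python
# HYPOTHETICAL interface -- compress() and generate_from_latents() are
# placeholder names illustrating the latent-compression flow, not Apple's API.
def clara_answer(clara_model, docs, question):
    latents = clara_model.compress(docs)  # documents -> latent representation
    return clara_model.generate_from_latents(latents, question)  # no docs in the prompt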
Run the test suite to verify everything works:
```bash
# Install pytest if not already installed
pip install pytest

# Run tests
pytest tests/ -v
```

The tests cover:
- Document parsing functionality
- Output formatting for both RAG and CLaRa modes
- Edge cases (empty docs, single doc, etc.)
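For a flavor of what lives in `tests/test_frontend.py`, a sketch along these lines (it assumes the hypothetical `parse_documents` helper from the parsing example earlier in this README):

```python
# Sketch of utility tests; assumes src/utils.py exposes parse_documents
# with the shape hypothesized above.
from src.utils import parse_documents

def test_parse_multiple_documents():
    raw = "Doc one.\n---\nDoc two.\n---\nDoc three."
    assert parse_documents(raw) == ["Doc one.", "Doc two.", "Doc three."]

def test_parse_single_document():
    assert parse_documents("Only one doc.") == ["Only one doc."]

def test_parse_empty_input():
    assert parse_documents("") == []
```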
If you don't have a local GPU, you can run this on Google Colab:
- Upload the files to Colab
- Install dependencies: `!pip install -r requirements.txt`
- Run: `!python app.py`
- Use the public URL provided by Gradio
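Gradio produces that public URL when the app is launched with `share=True`; if the launch call in `src/app.py` does not already pass it (an assumption about the code), this one-line change enables it:

```python
demo.launch(share=True)  # creates a temporary public *.gradio.live URL for Colab
```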
This repository is structured to be educational and easy to understand:
- Modular Design: Each component (models, RAG, CLaRa) is in its own file
- Clear Separation: Business logic separated from UI code
- Testable: Utility functions can be easily tested
- Extensible: Easy to add new features or modify existing ones
- See how RAG works: embedding → retrieval → prompt construction → generation
- Understand CLaRa's approach: direct document compression without prompt stuffing
- Compare the two approaches side-by-side with timing and evidence
Feel free to:
- Add more example documents
- Improve the UI/UX
- Add more comprehensive tests
- Optimize model loading
- Add support for more models
- First run will download models (~10-15GB total)
- GPU recommended for reasonable performance
- CLaRa model loading may require checking the official Apple CLaRa repository for exact API usage
- This is an educational demo, not production-grade code
This demo uses publicly available models. Please check individual model licenses:
- CLaRa: Check Apple's license
- Qwen: Apache 2.0
- Sentence Transformers: Apache 2.0