SHWETA0920/CodeAtlas

CodeAtlas 🧠

AI-powered codebase intelligence. Drop in any GitHub repo or upload project files, then ask questions, debug, and understand the entire codebase like a senior developer who has read every file.


🤔 What is CodeAtlas?

Every developer knows the pain of:

  • Joining a new project and spending days just understanding the codebase
  • Searching Ctrl+F across 50 files to find where a feature is implemented
  • Debugging an issue that spans 5 different files across 3 modules
  • Trying to understand why a function exists and what it connects to

CodeAtlas solves all of this.

You give it a GitHub repo URL or upload your project files. It reads, parses, and indexes the entire codebase using RAG (Retrieval-Augmented Generation). Then you can ask it anything in plain English, and it answers with full context, file references, and code snippets.

You:  "Where is user authentication handled and how does JWT work here?"

CodeAtlas: Authentication is handled across 3 files:
           → auth/middleware.py (lines 12–67): JWT verification logic
           → routes/auth_routes.py (lines 23–89): login/logout endpoints
           → models/user.py (lines 5–34): User model with password hashing

           The JWT token is generated in login_user() using PyJWT with a
           24-hour expiry. The middleware validates it on every protected route...

⚙️ How It Works

CodeAtlas uses a RAG (Retrieval-Augmented Generation) pipeline with code-aware chunking:

User Input (GitHub URL / Files)
         ↓
   Code Parsing
   (Tree-sitter splits by functions, classes, modules)
         ↓
   Embedding Generation
   (HuggingFace all-MiniLM-L6-v2 → vectors)
         ↓
   Vector Storage
   (FAISS local index with metadata)
         ↓
   User Query
         ↓
   Multi-Query Expansion
   ("auth" → ["authentication", "login", "JWT", "middleware"])
         ↓
   Similarity Search
   (Find top-K most relevant code chunks)
         ↓
   Context Stitching
   (Merge chunks from multiple files)
         ↓
   LLM Response
   (Groq/OpenAI generates grounded answer)
         ↓
   Answer + Source References
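
The Similarity Search step in the diagram can be pictured as a brute-force nearest-neighbour lookup over the stored vectors. The sketch below is illustrative only: it uses plain Python lists and cosine similarity, whereas the project delegates this work to FAISS, and the names `cosine_similarity` and `top_k` are hypothetical rather than taken from the codebase.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          k: int = 5) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; all-MiniLM-L6-v2 vectors have 384 dimensions.
chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
results = top_k([1.0, 0.0, 0.0], chunks, k=2)
print([index for index, _ in results])
```

An exhaustive scan like this is fine for a toy example; a dedicated vector index does the same job far more efficiently on real codebases.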

Why Code-Aware Chunking Matters

Naive RAG pipelines split text by character count, which can cut functions in half and lose the surrounding context.

CodeAtlas uses Tree-sitter to split by actual code structure:

# ❌ Normal chunking (bad)
"...def login_user(username, pas"   ← chunk boundary mid-function
"sword):\n    user = db.query..."

# ✅ CodeAtlas chunking (good)
"def login_user(username, password):    ← entire function = one chunk
    user = db.query(User).filter(...)
    if not verify_password(...):
        raise AuthError(...)
    return generate_jwt(user.id)"

This means every chunk is a complete, meaningful unit of code, which tends to improve retrieval accuracy.
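
To make the idea concrete, here is a minimal sketch of structure-aware chunking for Python files only. It swaps in the standard library `ast` module as a stand-in for Tree-sitter (which the project uses to cover many languages), and the function name `chunk_python_source` and the metadata keys are assumptions modelled on the source fields shown in the API responses.

```python
import ast
from typing import Dict, List


def chunk_python_source(source: str, file_path: str = "<memory>") -> List[Dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "file_path": file_path,
                "chunk_type": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                "start_line": node.lineno,      # 1-based line of the def/class keyword
                "end_line": node.end_lineno,    # last line of the body (Python 3.8+)
                "text": ast.get_source_segment(source, node),
            })
    return chunks


SAMPLE = '''import os

def login_user(username, password):
    token = username + ":" + password
    return token

class User:
    def check(self):
        return True
'''

for chunk in chunk_python_source(SAMPLE, "auth/example.py"):
    print(chunk["chunk_type"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk carries its own line range, so an answer can cite the exact location of the code it was built from.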


✨ Features

Core Features

  • 🔍 Smart Code Search: Finds code by meaning, not just keywords. "Where is rate limiting?" finds it even if the variable is named throttle_requests
  • 🧠 Architecture Explainer: Ask "explain the project structure" and get a full breakdown of modules, dependencies, and data flow
  • 🐛 Debugging Assistant: "Why is login failing?" traces through auth middleware, database queries, and error handling across all relevant files
  • ⚡ Code Optimizer: Ask to optimize a function and get suggestions with time complexity analysis
  • 📁 Multi-File Context: Answers span multiple files simultaneously, showing exactly how code connects

Advanced RAG Features

  • Multi-Query Expansion: One query becomes 4 variants to maximize retrieval coverage
  • Context Stitching: Intelligently merges relevant chunks from different files without duplication
  • Metadata Filtering: Filter search to specific languages (python) or modules (auth)
  • Cross-Encoder Reranker: Optional second-pass reranking for higher precision (enable with USE_RERANKER=true)
  • Streaming Responses: Answers stream token-by-token like ChatGPT for instant feedback
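
Multi-query expansion and context stitching work together: each query variant returns its own hits, and the hits then have to be merged without duplicates. The sketch below shows one plausible way to do that. The function name `stitch_results`, the choice of (file, start line, end line) as the dedup key, and the keep-the-highest-score rule are all assumptions, not the project's confirmed logic.

```python
from typing import Any, Dict, Iterable, List, Tuple

Chunk = Dict[str, Any]


def stitch_results(result_lists: Iterable[List[Chunk]], top_k: int = 5) -> List[Chunk]:
    """Merge hits from several query variants, keeping the best score per chunk."""
    best: Dict[Tuple[str, int, int], Chunk] = {}
    for results in result_lists:
        for chunk in results:
            key = (chunk["file_path"], chunk["start_line"], chunk["end_line"])
            if key not in best or chunk["score"] > best[key]["score"]:
                best[key] = chunk
    ranked = sorted(best.values(), key=lambda c: c["score"], reverse=True)
    return ranked[:top_k]


# Hits for two hypothetical variants of the same question.
auth_hits = [
    {"file_path": "auth/middleware.py", "start_line": 12, "end_line": 67, "score": 0.81},
]
login_hits = [
    {"file_path": "auth/middleware.py", "start_line": 12, "end_line": 67, "score": 0.92},
    {"file_path": "routes/auth_routes.py", "start_line": 23, "end_line": 89, "score": 0.77},
]

merged = stitch_results([auth_hits, login_hits])
for chunk in merged:
    print(chunk["file_path"], chunk["score"])
```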

Developer Experience

  • Background Indexing: Ingestion runs in a background thread with real-time progress polling
  • Project Caching: Indexed projects are cached; re-asking questions is instant
  • Source References: Every answer shows exactly which files and line numbers were used
  • Language Detection: Automatically detects Python, JS, TS, Go, Rust, Java, C++, and more
  • Smart File Filtering: Automatically ignores node_modules, build, .git, binary files
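
Language detection and file filtering can both be driven by simple lookups on the file path. The following is a minimal sketch under stated assumptions: the directory names come from the list above, but the binary extension set, the extension-to-language map, and the function names `detect_language` and `should_index` are illustrative guesses rather than the project's actual tables.

```python
from pathlib import PurePosixPath
from typing import Optional

# Directories named in the feature list; the project may ignore more.
IGNORED_DIRS = {"node_modules", "build", ".git"}

# Assumed examples of binary extensions; not an exhaustive or confirmed list.
BINARY_EXTENSIONS = {".png", ".jpg", ".gif", ".pdf", ".zip", ".exe", ".so", ".pyc"}

# Assumed mapping for the languages the feature list mentions.
EXTENSION_TO_LANGUAGE = {
    ".py": "python", ".js": "javascript", ".ts": "typescript", ".go": "go",
    ".rs": "rust", ".java": "java", ".cpp": "cpp",
}


def detect_language(path: str) -> Optional[str]:
    """Map a file extension to a language name, or None if it is unknown."""
    return EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix.lower())


def should_index(path: str) -> bool:
    """Skip files inside ignored directories and files with binary extensions."""
    p = PurePosixPath(path)
    if any(part in IGNORED_DIRS for part in p.parts):
        return False
    return p.suffix.lower() not in BINARY_EXTENSIONS


for candidate in ["src/app.py", "node_modules/react/index.js", "assets/logo.png"]:
    print(candidate, should_index(candidate), detect_language(candidate))
```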

🛠️ Tech Stack

Backend

Technology | Version | Purpose
Python | 3.11+ | Core runtime
Flask | 3.0.3 | REST API + SSE streaming
LangChain | 0.2.6 | RAG orchestration
Tree-sitter | 0.21.3 | Code-aware AST parsing
GitPython | 3.1.43 | Clone GitHub repos

Embeddings & Vector Store

Technology | Purpose
sentence-transformers/all-MiniLM-L6-v2 | Free local embeddings (80MB, 384-dim)
microsoft/codebert-base | Alternative code-specific embeddings
text-embedding-3-small (OpenAI) | Cloud embeddings (paid)
FAISS | Local vector index (MVP)
pgvector | Production vector DB (PostgreSQL)

LLM Providers

Provider | Models | Cost
Groq | llama3-70b, llama3-8b, mixtral-8x7b | Free tier
OpenAI | gpt-4o, gpt-3.5-turbo | Paid
Google Gemini | gemini-1.5-flash | Free tier

Frontend

Technology | Version | Purpose
Next.js | 14.2.35 | React framework
TypeScript | 5+ | Type safety
Tailwind CSS | 3.4 | Styling
react-markdown | 9.0 | Render markdown answers
lucide-react | 0.400 | Icons

📁 Project Structure

codeatlas/
│
├── 📄 README.md
├── 📄 docker-compose.yml          # Full stack: backend + frontend + postgres
├── 📄 .gitignore
│
├── 🐍 backend/
│   ├── app.py                     # Flask app factory, registers blueprints
│   ├── config.py                  # All config loaded from .env
│   ├── requirements.txt           # Python dependencies
│   ├── Dockerfile
│   ├── .env.example               # Template: copy to .env
│   │
│   ├── ingestion/                 # Step 1-3: Input → Parse → Chunk
│   │   ├── github_loader.py       # Clone repo, walk files, detect language
│   │   ├── file_loader.py         # Handle zip uploads, extract safely
│   │   ├── file_filter.py         # Ignore node_modules, binaries, etc.
│   │   └── chunker.py             # Tree-sitter AST chunking + fallback
│   │
│   ├── embeddings/                # Step 4: Text → Vectors
│   │   ├── __init__.py            # Unified interface (picks provider)
│   │   ├── openai_embedder.py     # OpenAI text-embedding-3-small
│   │   └── hf_embedder.py         # HuggingFace sentence-transformers
│   │
│   ├── vectorstore/               # Step 5: Store & Search Vectors
│   │   ├── __init__.py            # Unified interface (faiss or pgvector)
│   │   ├── faiss_store.py         # Local FAISS flat index + JSON metadata
│   │   └── pgvector_store.py      # PostgreSQL pgvector (production)
│   │
│   ├── retrieval/                 # Step 6: Core RAG Logic
│   │   ├── retriever.py           # Main pipeline: expand→search→rerank→stitch
│   │   ├── multi_query.py         # Expand 1 query into N variants via LLM
│   │   ├── context_stitcher.py    # Deduplicate & merge chunks into context
│   │   └── reranker.py            # Optional cross-encoder reranker
│   │
│   ├── llm/                       # Step 7: Generate Response
│   │   ├── prompt_builder.py      # System prompt + context template
│   │   └── llm_chain.py           # OpenAI/Groq, streaming + non-streaming
│   │
│   ├── api/                       # Flask route handlers
│   │   ├── ingest_routes.py       # POST /api/ingest, GET /api/ingest/status
│   │   └── query_routes.py        # POST /api/query (SSE streaming)
│   │
│   └── utils/
│       ├── language_detector.py   # Map file extensions → language names
│       └── metadata_extractor.py  # Extract file path, module, function name
│
└── ⚛️ frontend/
    ├── package.json
    ├── next.config.js             # Rewrites /api/backend/* → Flask
    ├── tailwind.config.ts
    ├── tsconfig.json
    ├── Dockerfile
    ├── .env.local.example
    │
    ├── app/
    │   ├── layout.tsx             # Root layout, fonts, metadata
    │   ├── globals.css            # Tailwind base + custom styles
    │   ├── page.tsx               # Landing page: GitHub URL + file upload
    │   └── chat/
    │       └── page.tsx           # Chat interface page
    │
    ├── components/
    │   ├── ChatInterface.tsx      # Full streaming chat UI with state mgmt
    │   └── SourceFiles.tsx        # Referenced files panel with scores
    │
    └── lib/
        └── api.ts                 # Typed fetch helpers for all API calls
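
The tree notes that file_loader.py extracts zip uploads "safely". One common meaning of that phrase is guarding against archive entries that try to escape the destination directory. The sketch below illustrates that check; the function name `safe_extract` and the decision to reject the whole archive on the first bad entry are assumptions, since the actual loader is not shown.

```python
import zipfile
from pathlib import Path
from typing import List


def safe_extract(zip_path: str, dest_dir: str) -> List[str]:
    """Extract a zip archive, refusing any entry that would land outside dest_dir."""
    dest = Path(dest_dir).resolve()
    with zipfile.ZipFile(zip_path) as archive:
        # Validate every entry before writing anything to disk.
        for member in archive.infolist():
            target = (dest / member.filename).resolve()
            # Reject entries such as "../evil.txt" that resolve outside dest.
            if target != dest and dest not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.filename}")
        archive.extractall(dest)
        return archive.namelist()
```

Validating all entries first means a malicious archive leaves no partial output behind. Hypothetical usage would be `safe_extract("uploads/myproject.zip", "uploads/myproject")`.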

📦 Prerequisites

Before you start, install these on your machine:

Required

Python 3.11+

python --version   # should show 3.11 or higher

Download: https://python.org/downloads

⚠️ Windows: check "Add Python to PATH" during installation

Node.js 20+

node --version     # should show v20 or higher

Download: https://nodejs.org (choose LTS)

Git

git --version

Download: https://git-scm.com/downloads

API Keys (at least one LLM provider)

Provider | Free? | Get Key
Groq (✅ Recommended) | Yes, free tier | https://console.groq.com
OpenAI | Paid | https://platform.openai.com/api-keys
Google Gemini | Free tier | https://aistudio.google.com/app/apikey

HuggingFace embeddings are completely free: no key needed, and the model downloads automatically.

VS Code Extensions (Recommended)

  • Python (Microsoft)
  • Pylance
  • ES7+ React/Redux Snippets
  • Tailwind CSS IntelliSense
  • TypeScript (built-in)

🚀 Installation & Setup

Option A: Local Setup (No Docker)

Best for development. Fastest to get started.

Step 1: Get the project

# If you have the zip:
unzip codeatlas.zip
cd codeatlas

# Or clone from GitHub:
git clone https://github.com/your-username/codeatlas.git
cd codeatlas

Step 2: Backend setup

cd backend

# Create virtual environment
python -m venv venv

# Activate it
source venv/bin/activate        # macOS / Linux
venv\Scripts\activate           # Windows (Command Prompt)
.\venv\Scripts\Activate.ps1     # Windows (PowerShell)

# Install dependencies (takes 3-5 minutes)
pip install -r requirements.txt

Step 3: Configure backend

cp .env.example .env

Open backend/.env and configure:

# ── Minimum config to get started ────────────────────────

# Groq (free): get key at https://console.groq.com
LLM_PROVIDER=groq
LLM_MODEL=llama3-8b-8192
GROQ_API_KEY=your_groq_api_key_here

# HuggingFace embeddings: FREE, no key needed
EMBEDDING_PROVIDER=huggingface
HF_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
SENTENCE_TRANSFORMERS_HOME=./model_cache

# FAISS: local vector store, no setup needed
VECTOR_STORE=faiss
FAISS_INDEX_DIR=./faiss_indexes

# Performance
TOP_K=5
MULTI_QUERY_COUNT=0
USE_RERANKER=false

# App
FLASK_DEBUG=true
SECRET_KEY=your_secret_key_here
UPLOAD_DIR=./uploads

Step 4: Frontend setup

Open a new terminal (keep backend terminal open):

cd frontend
npm install
cp .env.local.example .env.local

frontend/.env.local should contain:

BACKEND_URL=http://localhost:5000

Step 5: Run both servers

Terminal 1 (Backend):

cd backend
venv\Scripts\activate           # Windows
# source venv/bin/activate      # macOS/Linux
python app.py

Expected output:

* Running on http://0.0.0.0:5000
* Debug mode: on

Terminal 2 (Frontend):

cd frontend
npm run dev

Expected output:

▲ Next.js 14.2.35
- Local: http://localhost:3000
✓ Ready in 2.1s

Step 6: Open the app

Go to http://localhost:3000 in your browser.

To verify backend is healthy: http://127.0.0.1:5000/health

{ "status": "ok", "service": "devbrain-ai" }

Option B: Docker Setup

Best for production. Runs everything in one command including PostgreSQL.

Step 1: Install Docker Desktop

Download: https://docker.com/products/docker-desktop

Step 2: Configure environment

cd codeatlas
cp backend/.env.example .env

Edit .env:

OPENAI_API_KEY=sk-...          # or use Groq:
GROQ_API_KEY=gsk_...
LLM_PROVIDER=groq
LLM_MODEL=llama3-8b-8192
EMBEDDING_PROVIDER=huggingface
VECTOR_STORE=pgvector           # use pgvector with Docker

Step 3: Start everything

docker compose up --build

First run downloads images and builds containers (~5 minutes). Subsequent runs start in ~30 seconds.

Step 4: Stop

docker compose down             # stop containers
docker compose down -v          # stop + wipe database

⚙️ Configuration

All backend config lives in backend/.env. Full reference:

LLM Settings

Variable | Default | Options | Description
LLM_PROVIDER | openai | openai, groq | Which LLM to use
LLM_MODEL | gpt-4o | Any model name | Specific model
OPENAI_API_KEY | - | sk-... | Required for OpenAI
GROQ_API_KEY | - | gsk_... | Required for Groq

Groq model options:

  • llama3-8b-8192: fastest, good quality
  • llama3-70b-8192: best quality, slower
  • mixtral-8x7b-32768: best for long codebases (32k context)

Embedding Settings

Variable | Default | Options | Description
EMBEDDING_PROVIDER | openai | openai, huggingface | Embedding source
OPENAI_EMBED_MODEL | text-embedding-3-small | - | OpenAI model
HF_EMBED_MODEL | microsoft/codebert-base | Any HF model | Local model
SENTENCE_TRANSFORMERS_HOME | - | ./model_cache | Cache directory

Recommended free HuggingFace models:

  • sentence-transformers/all-MiniLM-L6-v2: 80MB, fastest, great quality
  • microsoft/codebert-base: 500MB, code-specific

Vector Store Settings

Variable | Default | Options | Description
VECTOR_STORE | faiss | faiss, pgvector | Storage backend
FAISS_INDEX_DIR | ./faiss_indexes | Any path | FAISS save location
PG_HOST | localhost | - | PostgreSQL host
PG_PORT | 5432 | - | PostgreSQL port
PG_USER | devbrain | - | PostgreSQL user
PG_PASSWORD | devbrain | - | PostgreSQL password
PG_DB | devbrain | - | Database name

RAG Settings

Variable | Default | Description
TOP_K | 8 | Number of chunks retrieved per query
MULTI_QUERY_COUNT | 4 | Query expansion variants (0 = disabled)
USE_RERANKER | false | Enable cross-encoder reranking
RERANKER_MODEL | cross-encoder/ms-marco-MiniLM-L-6-v2 | Reranker model
MAX_CHUNK_TOKENS | 400 | Max tokens per code chunk
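
For readers wondering how these variables might be consumed, here is a minimal sketch of reading the RAG settings from the environment with the documented defaults. Only the variable names and default values come from the table above; the `RagSettings` dataclass, the `load_rag_settings` function, and the accepted boolean spellings are hypothetical and may differ from the real config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class RagSettings:
    top_k: int
    multi_query_count: int
    use_reranker: bool
    reranker_model: str
    max_chunk_tokens: int


def load_rag_settings(env: Mapping[str, str] = os.environ) -> RagSettings:
    """Build settings from an environment mapping, falling back to documented defaults."""
    return RagSettings(
        top_k=int(env.get("TOP_K", "8")),
        multi_query_count=int(env.get("MULTI_QUERY_COUNT", "4")),
        use_reranker=_as_bool(env.get("USE_RERANKER", "false")),
        reranker_model=env.get("RERANKER_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2"),
        max_chunk_tokens=int(env.get("MAX_CHUNK_TOKENS", "400")),
    )


# Passing a plain dict makes the loader easy to exercise without touching os.environ.
settings = load_rag_settings({"TOP_K": "5", "MULTI_QUERY_COUNT": "0"})
print(settings)
```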

📡 API Reference

POST /api/ingest

Index a GitHub repository or uploaded files.

GitHub URL:

curl -X POST http://localhost:5000/api/ingest \
  -H "Content-Type: application/json" \
  -d '{"github_url": "https://github.com/user/repo"}'

File upload:

curl -X POST http://localhost:5000/api/ingest \
  -F "files[]=@myproject.zip"

Response:

{
  "project_id": "user_repo",
  "status": "processing"
}

GET /api/ingest/status/:project_id

Poll ingestion progress.

curl http://localhost:5000/api/ingest/status/user_repo

Response (processing):

{
  "project_id": "user_repo",
  "status": "processing",
  "message": "Embedding 342 chunks...",
  "files": 47,
  "chunks": 342
}

Response (ready):

{
  "project_id": "user_repo",
  "status": "ready",
  "message": "Indexing complete",
  "files": 47,
  "chunks": 342
}
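
A client typically polls this endpoint until the status leaves "processing". The sketch below shows one way to do that with the standard library. The helper names `fetch_status_http` and `wait_until_ready`, the 2-second interval, and the attempt limit are assumptions; the only status values confirmed by this document are "processing" and "ready", so the loop simply stops on anything other than "processing".

```python
import json
import time
from typing import Callable, Dict
from urllib.request import urlopen


def fetch_status_http(project_id: str, base_url: str = "http://localhost:5000") -> Dict:
    """GET /api/ingest/status/<project_id> and decode the JSON body."""
    with urlopen(f"{base_url}/api/ingest/status/{project_id}") as response:
        return json.load(response)


def wait_until_ready(fetch_status: Callable[[], Dict],
                     interval_seconds: float = 2.0,
                     max_attempts: int = 150,
                     sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll until the reported status is no longer "processing", then return it."""
    for _ in range(max_attempts):
        status = fetch_status()
        if status.get("status") != "processing":
            return status
        sleep(interval_seconds)
    raise TimeoutError("Ingestion did not finish within the polling budget")
```

With the backend running, hypothetical usage would be `wait_until_ready(lambda: fetch_status_http("user_repo"))`. Passing the fetcher in as a callable keeps the polling loop testable without a live server.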

POST /api/query

Query the indexed codebase.

curl -X POST http://localhost:5000/api/query \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "user_repo",
    "query": "Where is authentication handled?",
    "stream": false,
    "filter_language": "python",
    "filter_module": "auth"
  }'

Parameters:

Field | Type | Required | Description
project_id | string | ✅ | From ingest response
query | string | ✅ | Natural language question
stream | boolean | ❌ | Enable SSE streaming (default: false)
filter_language | string | ❌ | Only search this language
filter_module | string | ❌ | Only search files containing this string
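
The two optional filters can be read as simple predicates over chunk metadata. The sketch below is one interpretation, assuming `filter_language` is an exact match on the chunk's language and `filter_module` is a substring match on its file path; the function name `apply_filters` is hypothetical and the backend's real matching rules may differ.

```python
from typing import Dict, List, Optional


def apply_filters(chunks: List[Dict],
                  filter_language: Optional[str] = None,
                  filter_module: Optional[str] = None) -> List[Dict]:
    """Keep chunks that satisfy every filter that was provided."""
    selected = []
    for chunk in chunks:
        if filter_language and chunk["language"] != filter_language:
            continue
        if filter_module and filter_module not in chunk["file_path"]:
            continue
        selected.append(chunk)
    return selected


SAMPLE_CHUNKS = [
    {"file_path": "auth/middleware.py", "language": "python"},
    {"file_path": "frontend/lib/api.ts", "language": "typescript"},
    {"file_path": "models/user.py", "language": "python"},
]

for chunk in apply_filters(SAMPLE_CHUNKS, filter_language="python", filter_module="auth"):
    print(chunk["file_path"])
```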

Non-streaming response:

{
  "answer": "Authentication is handled in auth/middleware.py...",
  "sources": [
    {
      "file_path": "auth/middleware.py",
      "language": "python",
      "chunk_type": "function",
      "name": "verify_jwt_token",
      "start_line": 23,
      "end_line": 67,
      "score": 0.9241
    }
  ]
}

Streaming response (SSE):

data: {"type": "sources", "sources": [...]}

data: {"type": "token", "content": "Authentication"}
data: {"type": "token", "content": " is"}
data: {"type": "token", "content": " handled"}
...
data: [DONE]
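
On the client side, the stream shown above can be reassembled by reading each `data:` line in turn. This is a minimal sketch that works on any iterable of text lines; the function name `read_sse_answer` is hypothetical, and it assumes the event shapes are exactly the "sources" and "token" objects shown in the sample.

```python
import json
from typing import Dict, Iterable, List, Tuple


def read_sse_answer(lines: Iterable[str]) -> Tuple[str, List[Dict]]:
    """Collect streamed tokens into one answer string and capture the sources event."""
    tokens: List[str] = []
    sources: List[Dict] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if event.get("type") == "sources":
            sources = event.get("sources", [])
        elif event.get("type") == "token":
            tokens.append(event.get("content", ""))
    return "".join(tokens), sources


stream = [
    'data: {"type": "sources", "sources": [{"file_path": "auth/middleware.py"}]}',
    "",
    'data: {"type": "token", "content": "Authentication"}',
    'data: {"type": "token", "content": " is"}',
    'data: {"type": "token", "content": " handled"}',
    "data: [DONE]",
]
answer, sources = read_sse_answer(stream)
print(answer)
```

Whitespace inside each token survives because it lives in the JSON string, not in the outer line that gets stripped.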

GET /health

Check if backend is running.

curl http://localhost:5000/health
{ "status": "ok", "service": "devbrain-ai" }

🆓 Free API Providers

You can run CodeAtlas at zero cost with these free providers:

Groq (LLM): Recommended

  • Free tier: 14,400 requests/day, 30 req/min
  • Get key: https://console.groq.com
  • Best models:
    • llama3-8b-8192: fast everyday use
    • llama3-70b-8192: complex analysis
    • mixtral-8x7b-32768: large codebases

HuggingFace (Embeddings): Completely Free

  • No API key needed
  • Runs entirely on your machine
  • Downloads once, cached forever
  • Best model: sentence-transformers/all-MiniLM-L6-v2

Google Gemini (Alternative LLM)

  • Free tier available
  • Get key: https://aistudio.google.com/app/apikey
  • Model: gemini-1.5-flash

FAISS (Vector Store): Completely Free

  • Runs locally, no server needed
  • Persists to disk between restarts
  • Handles codebases up to ~100k chunks easily

📖 Usage Guide

1. Index a Repository

On the homepage, paste any public GitHub URL:

https://github.com/pallets/flask
https://github.com/tiangolo/fastapi
https://github.com/vercel/next.js

Or click "Upload project files" to upload a .zip of your local project.

Watch the progress bar. Indexing takes roughly:

  • Small repo (< 50 files): ~30 seconds
  • Medium repo (50–200 files): 1–3 minutes
  • Large repo (200+ files): 3–10 minutes

2. Ask Questions

Once indexed, you can ask anything in natural language:

Architecture questions:

Explain the overall project architecture
What are the main modules and how do they connect?
Draw the data flow from HTTP request to database

Finding code:

Where is authentication handled?
Where is the database connection configured?
How is error handling implemented?
Where are environment variables loaded?

Debugging:

Why would the login endpoint return a 401 error?
What could cause a database connection timeout?
Why is the API rate limiter not working?

Code understanding:

Explain what the middleware.py file does
How does the caching layer work?
What design patterns are used in this project?

Optimization:

How can I optimize the database queries in user_service.py?
What's the time complexity of the search function?

3. Use Filters

In the top-right filter menu, you can narrow search:

  • Language filter: python, javascript, typescript, etc.
  • Module filter: auth, api, models, etc.

4. Read Source References

Every answer shows which files were used, with:

  • File path
  • Function/class name
  • Line numbers
  • Confidence score (0–100%)

🔧 Troubleshooting

Backend Issues

ModuleNotFoundError on startup:

# Make sure venv is activated
venv\Scripts\activate           # Windows
source venv/bin/activate        # macOS/Linux

# Reinstall dependencies
pip install -r requirements.txt

tree-sitter build error on Windows:

pip install tree-sitter --no-build-isolation

Also install Visual C++ Build Tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/

faiss-cpu installation fails:

pip install faiss-cpu --no-cache-dir

HuggingFace model download stuck:

  • First download is ~80MB, can take 2–5 minutes on slow internet
  • Check internet connection
  • Set SENTENCE_TRANSFORMERS_HOME=./model_cache to cache locally

Groq rate limit error:

Error: 429 Too Many Requests

Set MULTI_QUERY_COUNT=0 in .env to reduce API calls.

Flask not starting on port 5000 (macOS): macOS Monterey+ uses port 5000 for AirPlay. Change Flask port:

# In app.py, change last line to:
app.run(debug=DEBUG, host="0.0.0.0", port=5001)

And update frontend/.env.local:

BACKEND_URL=http://localhost:5001

Frontend Issues

next.config.js TypeScript syntax error: Replace next.config.js content with:

/** @type {import('next').NextConfig} */
const nextConfig = {
  async rewrites() {
    return [{
      source: '/api/backend/:path*',
      destination: `${process.env.BACKEND_URL || 'http://localhost:5000'}/api/:path*`,
    }]
  },
}
module.exports = nextConfig

npm install fails:

node --version   # must be v20+ (see Prerequisites)
npm --version    # must be 9+

Frontend can't reach backend (CORS error):

  • Make sure Flask is running: http://127.0.0.1:5000/health
  • Check frontend/.env.local has BACKEND_URL=http://localhost:5000
  • Restart both servers after changing .env

"Project not found" error: The project hasn't finished indexing or errored. Check:

curl http://localhost:5000/api/ingest/status/your_project_id

Re-index a Project

FAISS indexes are cached. To force re-index:

# Delete the index files
rm backend/faiss_indexes/project_name.*

Then ingest the project again from the UI.


⚑ Performance Tips

Fastest Free Setup (Recommended for Dev)

LLM_PROVIDER=groq
LLM_MODEL=llama3-8b-8192        # 8b is 3-4x faster than 70b
GROQ_API_KEY=gsk_...

EMBEDDING_PROVIDER=huggingface
HF_EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
SENTENCE_TRANSFORMERS_HOME=./model_cache

VECTOR_STORE=faiss
TOP_K=5                          # fewer chunks = faster
MULTI_QUERY_COUNT=0              # disable for speed
USE_RERANKER=false

Speed Benchmarks (approximate)

Operation | Fast config | Default config
First run (model download) | 1–2 min | 1–2 min
Subsequent starts | 3–5s | 3–5s
Index small repo (30 files) | ~20s | ~45s
Index medium repo (100 files) | ~1 min | ~3 min
Query response | 2–4s | 5–10s

For Large Repos

If indexing a very large repo (500+ files), raise the per-chunk token limit (default 400) and set TOP_K explicitly:

MAX_CHUNK_TOKENS=600
TOP_K=6

Use mixtral-8x7b-32768 for its larger context window:

LLM_MODEL=mixtral-8x7b-32768

🗺️ Roadmap

  • GitHub OAuth: index private repos
  • Conversation memory: multi-turn chat that remembers context
  • Code diff analysis: "What changed between these two commits?"
  • Dependency graph: visual map of how files connect
  • VS Code extension: query CodeAtlas directly from your editor
  • Multiple repo support: ask questions across multiple projects
  • Export answers: save conversations as markdown
  • Webhook indexing: auto re-index on git push

📄 License

MIT License: free to use, modify, and distribute.


🙏 Acknowledgements

Built with:


Made with ❤️ by developers, for developers
CodeAtlas: because reading code shouldn't take weeks

About

A full-stack AI system leveraging RAG, vector search, and code-aware chunking to analyze large repositories. Supports multi-file context retrieval, execution flow tracing, and LLM-powered explanations for debugging and code understanding.
