CODEWALK

AI-powered codebase intelligence tool
Point it at any repo → understand the entire codebase in hours, not weeks

Features • Demo • Frontend • Local Setup • Cloud • MCP • API • Architecture • Contributing

What is Codewalk?

Codewalk analyzes any codebase and gives you:

Module detection — groups files into logical modules automatically
Dependency graph — extracts every import/require → builds the full dependency map
Blast radius — "if I change this file, what breaks?"
Reading order — optimal file reading sequence (dependencies first)
Execution flow — entry points, module-to-module and file-to-file dependency flow
AI chat — ask anything about the code, powered by RAG + tool-calling agent
Code review — review git diffs for bugs, security issues, and style (context-enriched, OWASP-focused)
Incremental reindex — re-embed only changed files using content hash comparison
Graph intelligence — DuckDB + igraph: symbol-level call graph, cycle detection, centrality analysis, import chain tracing
Corrective RAG — distance-based chunk filtering + LLM answer grading + query rewriting for higher quality answers
Voice interface — talk to your codebase hands-free: mic → transcribe → Copilot routes → speak answer

Three ways to use it locally, plus optional cloud indexing:

Interface	Best for
Web UI (Next.js)	Visual exploration — Knowledge Graph UI, diagrams, module browser, blast radius viewer
MCP Server	VS Code Copilot, Claude Code, Cursor — AI agents use tools directly
REST API	Scripts, CI/CD, custom integrations

Cloud (optional): Push to GitHub → Codewalk Cloud indexes on the server → MCP downloads the index and queries locally. See Cloud Deployment.

🎙️ Voice is available via both MCP (codewalk_voice_ask + codewalk_speak) and REST API (POST /voice/ask) — ask questions by speaking, hear answers read aloud.

Why Codewalk?

Scenario	How Codewalk helps
New dev joins the team	Point Codewalk at the repo → get an overview, module map, and reading order. Self-onboard in hours instead of weeks of "hey, can you explain this?"
LLM token costs are high	Without RAG, the LLM needs your entire codebase in context — slow and expensive. Codewalk embeds code into a vector DB and retrieves only the relevant chunks per query. Faster answers, fraction of the tokens.
Senior dev switches modules	You know the auth module but now need to work on payments. Get module info, blast radius, and execution flow without bugging the payments team.
Before a refactor	Check blast radius before touching shared code. "If I change `base_model.py`, what breaks?" — get the answer before you break prod.
PR reviews	Run `codewalk_run_review` (MCP) or `POST /review` (API) — automated multi-stage review with OWASP security checks, blast radius warnings, and team guidelines matching. MCP mode returns enriched context so the calling model (Claude/GPT) performs the review directly — no separate LLM needed.
Documentation is outdated	Codewalk analyzes the actual code, not stale wiki pages. Always up to date.

✨ Features

Feature	Description
🔍 Module Detection	Auto-groups files into packages/modules by directory structure
🕸️ Dependency Graph	Parses imports across 15+ languages via tree-sitter
💥 Blast Radius	BFS on reversed dependency graph → shows transitive impact of any change
📖 Reading Order	Topological sort → "read config.py before embedder.py because embedder imports config"
🔄 Execution Flow	Entry points, module/file dependency chains, Mermaid diagrams
🤖 AI Chat	LangGraph agent with 7 tools, multi-turn conversation with memory
🔎 Semantic Search	ChromaDB vector search on embedded code chunks (RAG)
🔬 Code Review	Multi-stage review pipeline: test coverage, blast radius, guidelines RAG, context-enriched deep analysis
🔄 Incremental Reindex	Content hash comparison — only re-embeds changed files, skips unchanged
🧩 MCP Server	38 MCP tools for VS Code Copilot / Claude Code / Cursor / Codex
🎙️ Voice Interface	Talk to your codebase — mic recording, local STT (faster-whisper), agent-driven routing (MCP + API), TTS response
🔬 Graph Intelligence	DuckDB persistent graph + igraph C-speed traversal: cycle detection, centrality, import chain tracing
🧬 Corrective RAG	Distance-based chunk filtering (free) + LLM answer grading + query rewriting for reliable answers
📦 Parent-Child Chunking	Full functions stored as parents, sub-chunks searched — retrieve complete context on match
⚡ Parallel Embedding	Producer-consumer pipeline — CPU chunking overlaps with GPU embedding
🏗️ Multi-Provider LLM	Ollama (local), OpenAI, Anthropic, Groq, Gemini, OpenRouter, DeepSeek
📚 Doc Indexing	Index team docs (.md, .pdf, .txt) — search and ask questions with source citations
🔄 Reflection	Actor→Critic→Improve loop used by deep research to refine cross-cutting reports
🧑‍💻 Human-in-the-Loop	Approval gate before any code/file modification — LangGraph checkpoint + interrupt
🔬 Deep Research	Fan-out parallel search → merge → synthesize → reflect for complex cross-cutting questions
🏗️ Architecture Health	Graph stats, bottleneck files (betweenness centrality), PageRank, cycle detection with fix suggestions
🌐 15+ Languages	Python, JS, TS, Java, Go, Rust, Ruby, PHP, C#, C++, C, Dart, Kotlin, Swift, YAML
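The Blast Radius row above describes a breadth-first search over the reversed dependency graph. The following is a minimal sketch of that idea, not Codewalk's actual implementation: the `blast_radius` function and the `IMPORTS` map are hypothetical names used only for illustration.

```python
from collections import defaultdict, deque


def blast_radius(imports: dict[str, list[str]], changed: str) -> dict[str, int]:
    """Return every file that transitively depends on `changed`, with its hop distance."""
    # Reverse the edges: for each file, record which files import it.
    dependents: dict[str, set[str]] = defaultdict(set)
    for importer, targets in imports.items():
        for target in targets:
            dependents[target].add(importer)

    distance: dict[str, int] = {}
    seen = {changed}
    queue = deque([(changed, 0)])
    while queue:
        current, depth = queue.popleft()
        # sorted() keeps the traversal order deterministic.
        for importer in sorted(dependents[current]):
            if importer not in seen:
                seen.add(importer)
                distance[importer] = depth + 1
                queue.append((importer, depth + 1))
    return distance


# Hypothetical import map: file -> files it imports.
IMPORTS = {
    "api.py": ["service.py"],
    "service.py": ["base_model.py", "config.py"],
    "cli.py": ["config.py"],
    "base_model.py": [],
    "config.py": [],
}

print(blast_radius(IMPORTS, "base_model.py"))  # prints {'service.py': 1, 'api.py': 2}
```

The hop distance is a rough proxy for risk: direct importers (distance 1) are the most likely to break.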

Supported Languages

Language	Extensions	Tree-sitter Parsing	Import Extraction
Python	`.py`	✅	✅
JavaScript	`.js`, `.jsx`	✅	✅
TypeScript	`.ts`, `.tsx`	✅	✅
Java	`.java`	✅	✅
Go	`.go`	✅	✅
Rust	`.rs`	✅	✅
Ruby	`.rb`	✅	✅
PHP	`.php`	✅	✅
C#	`.cs`	✅	✅
C++	`.cpp`	✅	✅
C	`.c`	✅	✅
Kotlin	`.kt`	✅	✅
Swift	`.swift`	✅	✅
Dart	`.dart`	✅ (optional install)	✅
YAML	`.yaml`, `.yml`	—	—
JSON	`.json`	—	—
TOML	`.toml`	—	—
Markdown	`.md`	—	—

Tree-sitter parsing = extracts functions, classes, and methods for accurate chunking and function explanations.
Import extraction = builds the dependency graph, blast radius, and reading order.
Languages without tree-sitter support still get indexed via text splitting — they work with semantic search and AI chat, just without function-level granularity.
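Once imports are extracted, a dependencies-first reading order is a topological sort of the import graph. The sketch below uses Python's standard `graphlib` module as one way to do this; Codewalk's own implementation may differ, and the `reading_order` function and `IMPORTS` map are illustrative assumptions.

```python
from graphlib import CycleError, TopologicalSorter


def reading_order(imports: dict[str, list[str]]) -> list[str]:
    """Order files so that each file appears after everything it imports."""
    # TopologicalSorter expects node -> predecessors, which matches
    # "file -> files it imports": imported files must come first.
    return list(TopologicalSorter(imports).static_order())


# Hypothetical import map: file -> files it imports.
IMPORTS = {
    "embedder.py": ["config.py"],
    "api.py": ["embedder.py", "config.py"],
    "config.py": [],
}

order = reading_order(IMPORTS)
print(order)  # prints ['config.py', 'embedder.py', 'api.py']

# A circular import has no valid reading order; graphlib reports the cycle.
try:
    reading_order({"a.py": ["b.py"], "b.py": ["a.py"]})
except CycleError as err:
    print("cycle detected:", err.args[1])
```

A circular import has no valid ordering, which is why cycle detection matters for this feature.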

🆚 Codewalk vs. alternatives

Codewalk is not another AI autocomplete. It is a codebase intelligence layer: it builds a persistent dependency graph, embeds your code, indexes your docs, and exposes that intelligence through a UI, an MCP server, and an API.

If you need deep cross-file reasoning, blast-radius analysis, or AI review inside your existing IDE agent, Codewalk fits where general-purpose assistants stop.

Use case	Typical approach	What Codewalk does differently
Explain this codebase	Ask a generic chat model and paste files	Builds a live graph + RAG so answers are grounded and cite real files
PR review	Lint + human review	LLM review with blast-radius, architecture, and custom guidelines
Refactor shared code	Grep for imports	Dependency graph + blast radius showing transitive impact
Onboard a new developer	Read wiki pages	Reading order + module map generated from actual code
Team knowledge	Search Confluence/Notion	Index docs alongside code and ask with citations
AI agent tooling	Write custom scripts or prompts	38 MCP tools the agent can call directly

🎬 Demo

Web UI

codewalk-demo-frontend.mp4

MCP with VS Code Copilot

codewalk-demo-mcp.mp4

REST API

🎥 [Video coming soon]

Voice Interface

codewalk_demo_voice.mp4

🖥️ Frontend — Knowledge Graph UI

Codewalk ships with a Next.js frontend for visual codebase exploration.

What you can do

Structural view — explore the repo as a layered dependency graph: modules, files, classes, and functions laid out as an interactive path flow.
Knowledge view — semantic graph of entities and relationships surfaced by the AI analysis.
Path Finder — pick a source and target node and discover import/dependency paths between them.
Search — fuzzy + semantic search across files, symbols, and concepts.
Blast Radius / Diff mode — visually highlight changed and affected nodes.
Themes — switch between presets (Dark Gold, Dark Ocean, Dark Forest, Dark Rose, Light Minimal), accent colors, and heading fonts; your choice is saved locally.
Info Panel — unified node details, metrics, source preview, and project overview.
Sidebar tab lock — index-dependent tabs stay locked until GET /index-status reports indexed: true.
Cloud Admin — visit /admin to register repos, list repos, trigger indexing, copy tokens, and check server health/version.

Run it

cd frontend
npm install
npm run dev

If you change frontend code and see stale chunk 404s or client-side exceptions, restart with a clean build cache:

npm run dev:clean      # clears .next and starts fresh
npm run restart        # kills port 3000 and restarts dev
# or
./scripts/restart-frontend.sh

Set NEXT_PUBLIC_API_URL to point at the backend (e.g. http://localhost:8000 or https://api.codewalk.xyz).

Then open http://localhost:3000, analyze a repo, and click Knowledge Graph.

⚙️ Setup (local)

Production cloud server? See FULL_SETUP_GUIDE.md — step-by-step: Hetzner, api.codewalk.xyz, GitHub App, webhooks, MCP download.

Prerequisites

Tool	Version	Check
Python	3.10+	`python3 --version`
Node.js	18+	`node --version`
Git	Any	`git --version`
Ollama (optional)	Latest	`ollama --version`

1. Clone the codewalk repo

git clone https://github.com/gupta29470/codewalk.git
cd codewalk

2. Backend setup in codewalk

# Create virtual environment
python3 -m venv .codewalk-env
source .codewalk-env/bin/activate    # macOS / Linux
# .codewalk-env\Scripts\activate     # Windows

# Install Python dependencies
pip install -r requirements.txt

⚠️ VPN / Corporate Network / Private Network Issues

If you're behind a VPN, corporate proxy, or private network, package installations and model downloads may fail due to blocked connections or SSL certificate errors.

Recommended: Use a normal (non-VPN) network for first-time setup.

Codewalk's setup downloads packages from PyPI, npm, and HuggingFace. These are one-time downloads — once installed, everything runs locally. If possible:

Disconnect from VPN temporarily
Run the setup steps (pip install, npm install, start the backend once to download the embedding model)
Reconnect to VPN — everything is cached locally, no more downloads needed

After the first run, Codewalk works fully offline (with Ollama). The VPN/corporate network won't cause any issues.

Optional: Dart/Flutter support (tree-sitter-dart)

# If you get an SSH error, run this first:
git config --global url."https://github.com/".insteadOf "git@github.com:"

# Then install:
pip install "tree-sitter-dart @ git+https://github.com/UserNobody14/tree-sitter-dart.git"

Without this, Codewalk still works — Dart files just won't get tree-sitter parsing (falls back to text splitting).

3. Frontend setup in codewalk

cd frontend
npm install
cd ..

4. Configure environment in codewalk

Copy the template: cp env.local.example.txt .env then edit:

# ─── LLM Configuration ──────────────────────────────────────
# Provider: ollama | openai | anthropic | gemini | groq | openrouter
LLM_PROVIDER=ollama
LLM_MODEL=qwen2.5-coder:7b

# ─── Embeddings ──────────────────────────────────────────────
EMBEDDING_MODEL=jinaai/jina-code-embeddings-1.5b

# ─── API Keys (only fill the one you're using) ──────────────
# GROQ_API_KEY=gsk_...
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# GOOGLE_API_KEY=AI...
# OPENROUTER_API_KEY=sk-or-...

5. Pull an Ollama model (if using local LLM)

ollama pull qwen2.5-coder:7b

Recommended models by size

Model	Size	Tool Calling	Best For
`qwen2.5-coder:7b`	4.7 GB	✅	Code-focused, fast
`qwen3.5:latest` (8B)	6.6 GB	✅	General + code
`qwen3.5:27b`	17 GB	✅	Best accuracy

🚀 Usage

Option 1: Web UI

Open two terminals in codewalk:

Terminal 1 — Backend API

source .codewalk-env/bin/activate
uvicorn src.codewalk.api.main:app --reload --port 8000

Terminal 2 — Frontend

cd frontend
npm run dev

If the frontend throws stale chunk 404s after pulling or editing code, restart it cleanly:

npm run dev:clean
# or from the project root
./scripts/restart-frontend.sh

Open http://localhost:3000 → click Analyze Codebase (the repo is discovered from the working directory via codewalk.yaml).

Then explore:

Knowledge Graph — interactive structural + knowledge graph, layer/module legend, node-category filters, detail-level toggle, persona selector, Path Finder, export menu, code viewer, file explorer, tour/onboarding, mobile layout, edge styling, and diff overlay
Overview — tech stack, modules, dependency diagram, riskiest files
Modules — browse all modules, click one for file list + dependencies
Blast Radius — which files break if you change each file
Reading Order — optimal file reading sequence with risk levels
Execution Flow — Mermaid diagram of module/file dependencies
Chat — ask any question ("explain the authentication flow", "what does scanner.py do?")
Code Review — review git diffs, review single files, load team guidelines
Voice — click the mic, ask a question by speaking, hear the answer read aloud
Smart Reindex — incremental re-embed with stats (skipped, changed, deleted)
Cloud Admin — /admin page for repo registration, token management, and server health
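Smart Reindex reports skipped, changed, and deleted files based on content hash comparison. A minimal sketch of that comparison follows. It assumes SHA-256 and a simple path-to-hash manifest; the hash algorithm, manifest layout, and function names are illustrative and are not taken from Codewalk's source.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash file contents (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def plan_reindex(manifest: dict[str, str], current: dict[str, bytes]) -> dict[str, list[str]]:
    """Compare stored hashes against current contents and classify each file."""
    plan: dict[str, list[str]] = {"skipped": [], "changed": [], "deleted": []}
    for path, data in sorted(current.items()):
        if manifest.get(path) == content_hash(data):
            plan["skipped"].append(path)   # unchanged: keep existing embeddings
        else:
            plan["changed"].append(path)   # new or modified: re-embed
    # Files in the old manifest that no longer exist on disk.
    plan["deleted"] = sorted(set(manifest) - set(current))
    return plan


# Hypothetical previous manifest and current file contents.
manifest = {
    "a.py": content_hash(b"x = 1\n"),
    "b.py": content_hash(b"y = 2\n"),
    "old.py": content_hash(b""),
}
current = {"a.py": b"x = 1\n", "b.py": b"y = 3\n", "new.py": b"z = 0\n"}

print(plan_reindex(manifest, current))
# prints {'skipped': ['a.py'], 'changed': ['b.py', 'new.py'], 'deleted': ['old.py']}
```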

Option 2: MCP Server (VS Code Copilot / Claude Code / Cursor)

See MCP Integration below.

Option 3: REST API

# Start the backend
source .codewalk-env/bin/activate
uvicorn src.codewalk.api.main:app --reload --port 8000

Step 1 — Analyze a codebase:

# Run from inside the repo you want to analyze (repo is discovered from cwd via codewalk.yaml)
curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{"index_mode": "auto"}'

Step 2 — Check index status and explore the results:

# Check whether the current workspace is indexed
# Optional: ?repo_path=/path/to/repo (defaults to cwd discovery)
curl "http://localhost:8000/index-status" | python3 -m json.tool

# Project overview (tech stack, modules, riskiest files)
curl http://localhost:8000/overview | python3 -m json.tool

# List all modules
curl http://localhost:8000/modules | python3 -m json.tool

# Dive into a specific module
curl http://localhost:8000/modules/auth | python3 -m json.tool

# What breaks if I change files in the auth module?
curl http://localhost:8000/blast-radius/auth | python3 -m json.tool

# Optimal reading order
curl http://localhost:8000/reading-order | python3 -m json.tool

# Execution flow (entry points, dependency chains)
curl http://localhost:8000/execution-flow | python3 -m json.tool

Step 3 — Chat with the agent:

# Ask a question
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain this project", "thread_id": "thread-1"}'

# Follow-up (same thread_id = conversation memory)
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What does the auth module do?", "thread_id": "thread-1"}'

# After code changes — refresh analysis without re-embedding
curl -X POST http://localhost:8000/refresh

# Incremental reindex — only re-embed changed files
curl -X POST http://localhost:8000/incremental-reindex

# Review current git diff for bugs, security, style
curl -X POST http://localhost:8000/review \
  -H "Content-Type: application/json" \
  -d '{"staged": false, "target_branch": "master"}'

See API Reference for full request/response details on every endpoint.

🔌 MCP Integration

Codewalk runs as an MCP (Model Context Protocol) server, so any AI agent that speaks MCP can use it.

Cloud MCP (index on server, query locally)

Cloud server indexes your repo on git push (GitHub App webhook)
Clone codewalk locally (MCP server code)
Open target repo in Cursor/VS Code (${workspaceFolder})
Configure MCP with cloud URL + repo token:

{
  "servers": {
    "codewalk": {
      "command": "/path/to/codewalk/.codewalk-env/bin/python",
      "args": [
        "-c",
        "import os, sys; sys.path.insert(0, os.environ['CODEWALK_PATH']); from src.codewalk.mcp.server import mcp; mcp.run(transport='stdio')"
      ],
      "cwd": "${workspaceFolder}",
      "env": {
        "CODEWALK_PATH": "/path/to/codewalk",
        "CODEWALK_SERVER_URL": "https://api.codewalk.xyz",
        "CODEWALK_REPO_NAME": "owner/repo",
        "CODEWALK_REPO_TOKEN": "cw_repo_xxxxxxxx"
      }
    }
  }
}

cwd should be the target repo (where codewalk.yaml lives). CODEWALK_PATH tells Python where to find the Codewalk source package. Open the target repo in your editor so the server starts from that workspace.

Get repo_token after first index (on server):

docker compose exec postgres psql -U codewalk -d codewalk -c \
  "SELECT repo_token FROM repos WHERE full_name='owner/repo';"

Run codewalk_connect_repo in Cursor or let analyze auto-download the index. Cloud sync tools include codewalk_pull_index, codewalk_connect_repo, codewalk_index_status, codewalk_check_version, and codewalk_show_knowledge_graph.

Every MCP tool is wrapped with a workspace-change guard (_refresh_state_if_moved) that re-discovers the current working directory and resets state if the workspace changes.
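A guard like this is typically a decorator that compares the current working directory with the one recorded in memory. The sketch below illustrates the pattern only. `STATE`, `workspace_guard`, and `get_overview` are hypothetical stand-ins and do not reproduce the real `_refresh_state_if_moved` in `src/codewalk/mcp/server.py`.

```python
import functools
import os
from typing import Any, Callable

# Hypothetical in-memory state; a real server would also hold the vector store and graph.
STATE: dict[str, Any] = {"repo_path": None, "index_loaded": False}


def workspace_guard(get_cwd: Callable[[], str] = os.getcwd):
    """Build a decorator that resets STATE when the working directory changes."""

    def decorator(tool: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(tool)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            cwd = os.path.abspath(get_cwd())
            if STATE["repo_path"] != cwd:
                # Workspace moved: drop everything tied to the previous repo.
                STATE.update(repo_path=cwd, index_loaded=False)
            return tool(*args, **kwargs)

        return wrapper

    return decorator


@workspace_guard()
def get_overview() -> str:
    """Example tool: by the time it runs, STATE points at the current workspace."""
    return f"overview for {STATE['repo_path']}"
```

Passing `get_cwd` as a parameter is only a testing convenience in this sketch; it lets a test simulate a workspace switch without calling `os.chdir`.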

⚠️ One repo per MCP server process. Codewalk keeps runtime state (vector store, graph, repo path) in memory. Pointing the same running MCP server at multiple repos — or rapidly switching workspaces in the same process — can overwrite or corrupt that state. Use one editor window / one MCP connection per repo. The stdio transport is safe because each connection spawns a separate process, but do not route commands for different repos into the same server instance.

Local-only MCP (index on your machine)

No cloud: indexing runs locally via codewalk_analyze_codebase. After rebuild_analysis_cache, the MCP server embeds with index_from_paths_parallel. It uses the same pipeline helpers as the API, but instead of calling full_index_parallel directly it scans files with scan_repo_files and applies the codewalk.yaml excludes.

Surface	Local embed entrypoint	Notes
MCP `codewalk_analyze_codebase`	`index_from_paths_parallel`	`rebuild_analysis_cache` → parallel chunk/embed → `write_manifest`
API `POST /analyze` (+ `/analyze/stream`)	`full_index_parallel`	Same Chroma output under `{repo}/.codewalk/`
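Both entrypoints run a parallel chunk/embed step, described in the features table as a producer-consumer pipeline. The following sketch shows the general pattern with a bounded queue and one producer thread. `chunk_file`, `fake_embed`, and `index_parallel` are illustrative stand-ins, not Codewalk functions, and the embedding call is a placeholder.

```python
import queue
import threading

SENTINEL = None  # marks the end of the chunk stream


def chunk_file(path: str, text: str, size: int = 40) -> list[tuple[str, str]]:
    """CPU-side work: split a file into fixed-size chunks (placeholder strategy)."""
    return [(path, text[i:i + size]) for i in range(0, len(text), size)]


def fake_embed(batch: list[tuple[str, str]]) -> list[list[float]]:
    """Placeholder for the embedding model call; returns one vector per chunk."""
    return [[float(len(chunk))] for _, chunk in batch]


def index_parallel(files: dict[str, str], batch_size: int = 4) -> int:
    """Chunk in a producer thread while the consumer embeds batches."""
    work: queue.Queue = queue.Queue(maxsize=8)  # bounded: producer waits if embedding lags

    def producer() -> None:
        for path, text in files.items():
            for chunk in chunk_file(path, text):
                work.put(chunk)
        work.put(SENTINEL)

    thread = threading.Thread(target=producer)
    thread.start()

    embedded = 0
    batch: list[tuple[str, str]] = []
    while True:
        item = work.get()
        if item is SENTINEL:
            break
        batch.append(item)
        if len(batch) == batch_size:
            embedded += len(fake_embed(batch))
            batch = []
    if batch:  # flush the final partial batch
        embedded += len(fake_embed(batch))
    thread.join()
    return embedded


print(index_parallel({"a.py": "x" * 100, "b.py": "y" * 10}))  # prints 4
```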

Review & approve fixes (agent + MCP)

You talk to your IDE agent; the agent calls Codewalk MCP tools. Codewalk does not render UI — each host has its own approve/reject experience (Cursor approval cards, Copilot chat, Claude Code prompts, etc.). The agent must present each fix and wait for your approval through that host UI (or yes/no in chat).

Agent runs codewalk_run_review (returns enriched context for the host LLM to review) or codewalk_review_file (runs the full pipeline on one file)
For each finding: codewalk_finding_verdict records whether you accept or reject it
Apply accepted fixes: codewalk_apply_accepted applies every accepted finding with recommended_code in one call; or use codewalk_approve_action → codewalk_apply_fix(..., approval_token=<token>) for a single fix
After edits: codewalk_verify_fix → codewalk_incremental_reindex
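The single-fix path above (approve first, then apply with the returned token) can be pictured as a one-time token gate. This sketch is an assumption about the general pattern, not the real tool code. `approve_action` and `apply_fix` here are simplified stand-ins, the token is assumed to be single-use, and an in-memory dict stands in for the filesystem.

```python
import secrets

# token -> description of the action the user approved (in-memory, per process)
_PENDING: dict[str, str] = {}


def approve_action(description: str) -> str:
    """Record an approved action and return a one-time approval token."""
    token = secrets.token_hex(16)
    _PENDING[token] = description
    return token


def apply_fix(path: str, new_code: str, approval_token: str, files: dict[str, str]) -> str:
    """Apply a fix only if the token was issued and has not been used yet."""
    if _PENDING.pop(approval_token, None) is None:
        raise PermissionError("missing or already-used approval token")
    files[path] = new_code  # stand-in for writing the file to disk
    return f"applied fix to {path}"


# Usage: the agent asks for approval, then applies exactly one fix.
workspace: dict[str, str] = {"auth.py": "password == given"}
token = approve_action("Use constant-time comparison in auth.py")
print(apply_fix("auth.py", "hmac.compare_digest(password, given)", token, workspace))
```

A second `apply_fix` call with the same token raises `PermissionError`, so under this assumption each edit needs its own approval.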

Full agent rules: src/codewalk/mcp/server.py FastMCP instructions (sent on MCP connect).

Example: @codewalk review my changes, then fix each issue only after I approve

Cloud re-download: codewalk_pull_index / codewalk_connect_repo / auto-download on analyze all replace local .codewalk/ (delete then extract). Force refresh: rm -rf .codewalk then pull.

MCP tools — index requirements

Tool	Index required?	Notes
`codewalk_analyze_codebase`	Builds/loads	Cloud download or local embed
`codewalk_generate_config`	No	Creates starter `codewalk.yaml`
Query tools (search, overview, modules, symbols, …)	Yes	`_require_index()` auto-loads disk
`codewalk_find_circular_dependencies`	Yes	Uses graph data
`codewalk_get_architecture_health`	Yes	Graph stats + cycles
`codewalk_incremental_reindex`, `codewalk_refresh_analysis`	Yes	Re-embed only changed files / re-scan without re-embedding
`codewalk_run_review`, `codewalk_review_file`, `codewalk_get_stack_info`	Soft / Yes	Better with index; `run_review` returns context for the host LLM
`codewalk_get_review_details`, `codewalk_finding_verdict`	Yes	Reads persisted session
`codewalk_approve_action` / `codewalk_apply_fix`	No / edits files	Token required for `apply_fix`
`codewalk_apply_accepted`	Yes	Applies all accepted findings from a session
`codewalk_verify_fix`	No	Runs static analysis + tests
`codewalk_run_static_analysis`	No	ruff/mypy/eslint/etc.
`codewalk_run_tests`	No	pytest/npm test/etc.
`codewalk_pull_index`, `codewalk_connect_repo`, `codewalk_index_status`	Cloud config	Replace `.codewalk/` on download
Docs / guidelines / voice / `check_version` / `show_knowledge_graph`	Varies	See MCP server `instructions`

Starting the MCP Server in VS Code

Open VS Code in the codewalk project
Press Cmd+Shift+P (macOS) or Ctrl+Shift+P (Windows/Linux)
Type MCP: List Servers and select it
You'll see codewalk in the list
Click Start Server next to codewalk
The server starts in the background (stdio transport)
Open Copilot Chat → type @codewalk → all Codewalk MCP tools are available

VS Code Copilot

Add to .vscode/mcp.json in your desired project:

⚠️ Replace /path/to/codewalk with the actual absolute path where you cloned codewalk. cwd (${workspaceFolder}) should be the target repo so the server discovers codewalk.yaml. CODEWALK_PATH must point at the cloned Codewalk repo so src.codewalk.mcp.server resolves.

{
  "servers": {
    "codewalk": {
      "command": "/path/to/codewalk/.codewalk-env/bin/python",
      "args": [
        "-c",
        "import os, sys; sys.path.insert(0, os.environ['CODEWALK_PATH']); from src.codewalk.mcp.server import mcp; mcp.run(transport='stdio')"
      ],
      "cwd": "${workspaceFolder}",
      "env": {
        "CODEWALK_PATH": "/path/to/codewalk"
      }
    }
  }
}

Team config (codewalk.yaml): Put repo-specific settings in the repo root:

guidelines_path: contributing-docs
docs_path: team-docs
indexing:
  exclude:
    - tests/**
    - docs/**
    - scripts/legacy/**
    - "*.generated.*"
  include:
    - docs/architecture/**

indexing.exclude is a list of paths/patterns skipped during scanning. indexing.include overrides exclusions (and the core safety net) for specific paths. These are checked at scan time. Generate a starter config with stack-specific excludes via python -m src.codewalk.cli generate-config or @codewalk Run codewalk_generate_config.

Customizing file filters: Codewalk uses a deterministic core safety net (src/codewalk/ingestion/file_filter.py) — no LLM involved. It always skips universally bad content (.git, node_modules, dependency/build/cache dirs, binaries, media, secrets, lock files, generated suffixes). Repo- or framework-specific exclusions (e.g., tools/, scripts/, cdk/, migrations/, story files) belong in codewalk.yaml (often generated by generate-config). If a folder or file is not being indexed that you need, you have three options:

codewalk.yaml indexing.include — override exclusions for specific paths. Example: ["docs/architecture/**", "src/migrations/schema.py"].

codewalk.yaml indexing.exclude — repo-specific dirs/patterns. Example: ["tests/**", "docs/**", "*.generated.*"].

.codewalkignore file — gitignore-style patterns in the repo root (see below).

You generally do not need to duplicate node_modules, .git, build dirs, etc. in codewalk.yaml; those are handled by the core safety net.
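The precedence described above (include first, then the core safety net, then exclude) can be sketched as follows. This is an approximation, not the logic in `file_filter.py`. It uses `fnmatchcase`, whose `*` also matches `/`, so a pattern like `tests/**` matches anything under `tests/`; `CORE_SKIP_DIRS` is a deliberately tiny stand-in for the real safety net.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Tiny stand-in for the core safety net (the real list is much longer).
CORE_SKIP_DIRS = {".git", "node_modules", "__pycache__"}


def matches_any(path: str, patterns: list[str]) -> bool:
    """True if any pattern matches the full path or just the filename."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(path, p) or fnmatchcase(name, p) for p in patterns)


def should_index(path: str, exclude: list[str], include: list[str]) -> bool:
    """Apply include overrides, then the safety net, then repo-specific excludes."""
    if matches_any(path, include):
        return True  # include wins over both the safety net and exclude
    if CORE_SKIP_DIRS & set(PurePosixPath(path).parts):
        return False
    return not matches_any(path, exclude)


# Patterns copied from the codewalk.yaml example in this README.
EXCLUDE = ["tests/**", "docs/**", "scripts/legacy/**", "*.generated.*"]
INCLUDE = ["docs/architecture/**"]

for candidate in ["src/app.py", "docs/guide.md", "docs/architecture/overview.md"]:
    print(candidate, should_index(candidate, EXCLUDE, INCLUDE))
```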

.codewalkignore: Create a .codewalkignore file in the root of the repo you're analyzing to skip specific files/directories:

# Skip test files
tests/
*_test.py

# Skip specific directories
data/
wiki/
blogs/

# Skip specific file patterns
*.config.js
setup.py

Syntax (gitignore-like):

folder/ — skip any path containing this directory

*.pattern — glob match against full path or filename

filename — matches exact filename or path segment

# comment — ignored

blank lines — ignored

Patterns are cached in _codewalkignore_patterns (loaded once per session). If you change the repo being analyzed, reset_codewalkignore() clears the cache so the next repo's .codewalkignore gets loaded.
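To make the syntax rules concrete, here is a small matcher that follows the descriptions above literally. It is a sketch under that reading of the rules, not the actual loader; `parse_ignore` and `is_ignored` are hypothetical names, and edge cases such as negation are not covered because the README does not describe them.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_ignore(text: str) -> list[str]:
    """Keep non-empty lines that are not comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Check a repo-relative POSIX path against the parsed patterns."""
    parts = PurePosixPath(path).parts
    name = parts[-1]
    for pattern in patterns:
        if pattern.endswith("/"):
            # folder/ : skip any path that contains this directory
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif any(ch in pattern for ch in "*?["):
            # glob : match against the full path or just the filename
            if fnmatchcase(path, pattern) or fnmatchcase(name, pattern):
                return True
        elif pattern in parts:
            # plain name : exact filename or path segment
            return True
    return False


SAMPLE = """
# Skip test files
tests/
*_test.py

# Skip specific file patterns
*.config.js
setup.py
"""

PATTERNS = parse_ignore(SAMPLE)
print(PATTERNS)  # prints ['tests/', '*_test.py', '*.config.js', 'setup.py']
print(is_ignored("tests/unit/test_x.py", PATTERNS), is_ignored("src/app.py", PATTERNS))  # prints True False
```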

Then in Copilot Chat: @codewalk → it will call codewalk_analyze_codebase automatically.

Note: After adding or modifying .vscode/mcp.json, reload the VS Code window: Cmd+Shift+P → Developer: Reload Window.

Claude Code

Add to ~/.claude/mcp.json:

{
  "mcpServers": {
    "codewalk": {
      "command": "/path/to/codewalk/.codewalk-env/bin/python",
      "args": [
        "-c",
        "import os, sys; sys.path.insert(0, os.environ['CODEWALK_PATH']); from src.codewalk.mcp.server import mcp; mcp.run(transport='stdio')"
      ],
      "cwd": "${workspaceFolder}",
      "env": {
        "CODEWALK_PATH": "/path/to/codewalk"
      }
    }
  }
}

Cursor

Settings → MCP Servers → Add:

{
  "codewalk": {
    "command": "/path/to/codewalk/.codewalk-env/bin/python",
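    "args": [
      "-c",
      "import os, sys; sys.path.insert(0, os.environ['CODEWALK_PATH']); from src.codewalk.mcp.server import mcp; mcp.run(transport='stdio')"
    ],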
    "cwd": "${workspaceFolder}",
    "env": {
      "CODEWALK_PATH": "/path/to/codewalk"
    }
  }
}

Exclusions now live in codewalk.yaml (indexing.exclude) or .codewalkignore, not in the EXCLUDE_PATHS env var.

OpenAI Codex CLI

Add to ~/.codex/mcp.json:

{
  "mcpServers": {
    "codewalk": {
      "command": "/path/to/codewalk/.codewalk-env/bin/python",
      "args": [
        "-c",
        "import os, sys; sys.path.insert(0, os.environ['CODEWALK_PATH']); from src.codewalk.mcp.server import mcp; mcp.run(transport='stdio')"
      ],
      "cwd": "${workspaceFolder}",
      "env": {
        "CODEWALK_PATH": "/path/to/codewalk"
      }
    }
  }
}

How It Works (First-Time Setup)

The first time you use Codewalk on a new codebase, it needs to index the files.
You just tell the AI to analyze — the AI handles the rest automatically.

Tool Calling Sequence

┌─────────────────────────────────────────────────────────────────────┐
│                    SETUP WORKFLOW (run once)                        │
│                                                                     │
│  Step 1 (only step)                                                 │
│  codewalk_analyze_codebase                                          │
│       │  scans files, builds dependency graph, detects modules,     │
│       │  filters with file_filter.py, chunks, embeds — all in one   │
│       ▼                                                             │
│  ✅ READY — all query tools unlocked                                │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                   QUERY TOOLS (use after setup)                     │
│                                                                     │
│  codewalk_get_overview          → project summary + dependency flow │
│  codewalk_search_codebase       → semantic code search              │
│  codewalk_lookup_symbol         → find symbols by name across repo  │
│  codewalk_get_module_info       → inspect a specific module         │
│  codewalk_explain_function      → AI-powered function explanation   │
│  codewalk_explain_class         → AI-powered class explanation      │
│  codewalk_get_blast_radius_map  → change risk analysis              │
│  codewalk_find_circular_dependencies → detect import cycles         │
│  codewalk_get_reading_order     → optimal file reading sequence     │
│  codewalk_get_execution_flow    → module/file dependency flow       │
│  codewalk_get_architecture_health → bottlenecks, cycles, key files  │
│  codewalk_call_chain(source, target) → trace import path between    │
│  codewalk_show_knowledge_graph  → export graph for visualization    │
│  codewalk_index_docs(docs_path) → index .md/.pdf/.txt docs          │
│  codewalk_search_docs(query)    → search indexed documents          │
│  codewalk_ask_docs(question)    → RAG answer grounded in docs       │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                   REVIEW & HITL TOOLS                               │
│                                                                     │
│  codewalk_run_review            → gather review context for host LLM│
│  codewalk_review_file           → full pipeline review of one file  │
│  codewalk_get_stack_info        → deterministic stack signals       │
│  codewalk_get_review_details    → retrieve a persisted review       │
│  codewalk_load_guidelines       → load team coding standards        │
│  codewalk_finding_verdict       → accept/reject a finding           │
│  codewalk_apply_accepted        → apply all accepted fixes          │
│  codewalk_approve_action(text)  → HITL gate (returns approval_token)│
│  codewalk_apply_fix(..., token) → apply one fix after approval      │
│  codewalk_verify_fix            → static analysis + tests           │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                 MAINTENANCE (after code changes)                    │
│                                                                     │
│  codewalk_generate_config       → starter codewalk.yaml             │
│  codewalk_incremental_reindex   → re-embed only changed files       │
│  codewalk_refresh_analysis      → re-scan without re-embedding      │
│  codewalk_run_static_analysis   → ruff/mypy/eslint/etc. on files    │
│  codewalk_run_tests             → pytest/npm test/etc. on files     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    CLOUD (when configured)                          │
│                                                                     │
│  codewalk_pull_index            → download latest server index      │
│  codewalk_connect_repo          → one-step cloud setup              │
│  codewalk_index_status          → local vs cloud version            │
│  codewalk_check_version         → server health/version             │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                      VOICE (hands-free)                             │
│                                                                     │
│  MCP:  codewalk_voice_ask  → mic → transcribe                       │
│        Copilot picks tool  → calls it → codewalk_speak(summary)     │
│                                                                     │
│  API:  POST /voice/ask     → mic → transcribe → agent invokes tool  │
│        agent answer        → format_voice_response() → MP3          │
└─────────────────────────────────────────────────────────────────────┘

💡 Before indexing: Close unnecessary applications (browsers, Slack, Docker, etc.). Indexing loads the embedding model into memory and processes all files at once — freeing up RAM helps it run faster and avoids slowdowns.

You type this in Copilot Chat:

@codewalk analyze this codebase [auto(default) | reindex(update index) | full(delete existing index and generate new index)]
or
@codewalk_analyze_codebase [auto(default) | reindex(update index) | full(delete existing index and generate new index)]

What happens behind the scenes (you don't need to do anything):

The AI calls codewalk_analyze_codebase → scans all files, filters with file_filter.py, detects modules, builds the dependency graph, chunks and embeds everything in one call

You'll see progress like:

✓ Codebase analyzed and indexed successfully
  Files found: 142
  Files indexed: 121
  Chunks embedded: 380
  Modules found: api, analysis, embeddings, ingestion, rag

✅ Ready to answer questions — use query tools directly.

Note: After indexing, the AI agent should move on to the query tools automatically. If it doesn't, invoke one manually, for example codewalk_get_overview or codewalk_search_codebase.

Note: This only happens once. Next time you say @codewalk analyze this codebase, it detects the existing index and skips straight to "ready."
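The three modes can be read as a small decision table. The sketch below encodes the behavior described above. What happens when `auto` or `reindex` is requested and no index exists yet is an assumption (a fresh build), and `plan_analysis` and its return labels are hypothetical names.

```python
VALID_MODES = ("auto", "reindex", "full")


def plan_analysis(index_mode: str, index_exists: bool) -> str:
    """Decide what an analyze call should do for a given mode and index state."""
    if index_mode not in VALID_MODES:
        raise ValueError(f"unknown index_mode: {index_mode!r}")
    if index_mode == "full":
        return "delete_and_rebuild"   # always start from scratch
    if not index_exists:
        return "build"                # assumption: nothing to load or update yet
    if index_mode == "reindex":
        return "incremental_update"   # refresh the existing index
    return "load_existing"            # auto: reuse the index and skip to ready


for mode in VALID_MODES:
    print(mode, plan_analysis(mode, index_exists=True))
```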

⚠️ If the AI Stops Mid-Workflow

The setup is now a single call — codewalk_analyze_codebase does everything. If the AI stops after that, just call any query tool yourself:

AI stopped after...	You call next
`codewalk_analyze_codebase`	Any query tool — `codewalk_get_overview`, `codewalk_search_codebase`, etc.

Tip: Look for the ⏩ NEXT STEP line at the bottom of each tool's output — it tells you exactly what to do.

MCP Tools — What You Can Ask

After indexing is done, here's every tool you can use.
You don't need to remember tool names — just ask naturally and the AI picks the right tool.

"Give me the big picture"

Tool: codewalk_get_overview — no parameters needed

You just joined a new team. You have no idea what this project does. Start here.

@codewalk give me an overview of this project
or
@codewalk_get_overview

When to use: Day 1 on a new project. You want to know what you're dealing with.

"What's in this module?"

Tool: codewalk_get_module_info(module_name) — pass the module name

You saw "auth" in the overview and want to dig into it.

@codewalk tell me about the auth module
or
@codewalk_get_module_info auth

When to use: You need to work on a specific module and want to see all its files, classes, and functions at a glance.

"Explain this function to me"

Tool: codewalk_explain_function(function_name) — pass the function or class name

Your tech lead mentioned verify_request in a PR review. You have no idea what it does.

@codewalk explain the verify_request function
or
@codewalk_explain_function verify_request

When to use: You see a function name in code/PR/docs and want to understand exactly what it does without reading the whole file yourself.

"Search for something in the codebase"

Tool: codewalk_search_codebase(query) — pass any natural language question

You need to find where database connections are handled but don't know which file.

@codewalk how does this project handle database connections?
or 
@codewalk_search_codebase how does this project handle database connections?

When to use: You have a question about a concept ("error handling", "file upload", "caching") and don't know which files to look at.

"What breaks if I change this?"

Tool: codewalk_get_blast_radius_map(target) — pass a module name, file name, or leave empty

You're about to refactor models/base.py. Before you touch it, you want to know the damage.

@codewalk what's the blast radius of base.py?
@codewalk what's the blast radius of the auth module?
or
@codewalk_get_blast_radius_map base.py
@codewalk_get_blast_radius_map auth

When to use: Before refactoring or making changes. "Is it safe to change this, or will half the project break?"
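
The architecture diagram later in this README describes blast radius as a BFS over the reversed dependency graph. A minimal sketch of that idea follows. The `blast_radius` function and the `imports` dict (file → files it imports) are hypothetical names used for illustration, not Codewalk's internal API.

```python
from collections import deque


def blast_radius(imports: dict[str, list[str]], target: str) -> dict[str, list[str]]:
    """Return the files that directly or transitively depend on `target`.

    `imports` maps each file to the files it imports (assumed data shape).
    """
    # Reverse the graph: for each file, collect the files that import it.
    dependents: dict[str, set[str]] = {}
    for src, deps in imports.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(src)

    direct = sorted(dependents.get(target, set()))
    seen = {target, *direct}
    queue = deque(direct)
    transitive: list[str] = []

    # Breadth-first walk over "who imports me" edges.
    while queue:
        current = queue.popleft()
        for parent in sorted(dependents.get(current, set())):
            if parent not in seen:
                seen.add(parent)
                transitive.append(parent)
                queue.append(parent)

    return {"direct": direct, "transitive": sorted(transitive)}


# Toy dependency map (illustrative file names).
imports = {
    "api/routes.py": ["models/base.py"],
    "auth/models.py": ["models/base.py"],
    "auth/middleware.py": ["auth/models.py"],
    "api/views.py": ["api/routes.py"],
}
result = blast_radius(imports, "models/base.py")
```

Here `result["direct"]` holds the files that import `models/base.py` themselves, and `result["transitive"]` holds the files reached only through those importers.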

"Where should I start reading?"

Tool: codewalk_get_reading_order(module_name) — pass a module name or leave empty for entire repo

You want to understand the agent module but don't know which file to read first.

@codewalk what order should I read the agent module?
or 
@codewalk_get_reading_order agent

When to use: You want to understand code without constantly jumping between files wondering "wait, what's this import?"
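
Reading order is described in this README as a topological sort with dependencies first. The sketch below shows that ordering with Python's standard `graphlib`. The `reading_order` helper and the file names are assumptions for illustration; Codewalk's own implementation may rank or annotate files differently.

```python
from graphlib import TopologicalSorter


def reading_order(imports: dict[str, list[str]]) -> list[str]:
    """Order files so that every file appears after the files it imports.

    `imports` maps each file to its internal dependencies (assumed shape).
    Raises graphlib.CycleError if the imports contain a cycle.
    """
    return list(TopologicalSorter(imports).static_order())


# Toy example: pipeline.py needs embedder.py and chain.py, which need config.py.
imports = {
    "pipeline.py": ["embedder.py", "chain.py"],
    "embedder.py": ["config.py"],
    "chain.py": ["config.py"],
    "config.py": [],
}
order = reading_order(imports)
```

In this toy graph `config.py` comes first because it has no internal dependencies, and `pipeline.py` comes last because it depends on everything else.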

"How does the code flow?"

Tool: codewalk_get_execution_flow(module_name) — pass a module name or leave empty for module-level view

You want to understand how modules connect to each other.

@codewalk show me the execution flow
or 
@codewalk_get_execution_flow

When to use: You want to understand "what calls what" — the big picture of how code connects.

"I changed some code, refresh the analysis"

Tool: codewalk_refresh_analysis — no parameters needed

You added 3 new files and refactored a module. The analysis is now stale.

@codewalk refresh the analysis
or 
@codewalk_refresh_analysis

When to use: After you commit code changes and want updated blast radius / reading order / execution flow results.

"Some files changed, update the embeddings"

Tool: codewalk_incremental_reindex — no parameters needed

You changed a few files but don't want to re-embed the entire codebase.

@codewalk reindex changed files
or
@codewalk_incremental_reindex

When to use: After code changes when you want the vector search to reflect the latest code without a full re-index. Uses content hashes — only re-embeds what actually changed.
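
The content-hash comparison can be pictured as a diff between the hashes stored at the last index and the hashes of the files on disk now. This is a sketch under assumptions: SHA-256 as the hash and the `diff_by_hash` helper are illustrative choices, not confirmed details of Codewalk.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash file contents (SHA-256 is an assumption, not Codewalk's confirmed choice)."""
    return hashlib.sha256(data).hexdigest()


def diff_by_hash(previous: dict[str, str], current_files: dict[str, bytes]) -> dict[str, list[str]]:
    """Classify files as new, changed, deleted, or unchanged by comparing hashes."""
    current = {path: content_hash(data) for path, data in current_files.items()}
    return {
        "new": sorted(p for p in current if p not in previous),
        "changed": sorted(p for p in current if p in previous and previous[p] != current[p]),
        "deleted": sorted(p for p in previous if p not in current),
        "unchanged": sorted(p for p in current if previous.get(p) == current[p]),
    }


# Hashes recorded at the previous index run (toy data).
previous = {
    "a.py": content_hash(b"print(1)"),
    "b.py": content_hash(b"x = 1"),
    "old.py": content_hash(b"gone"),
}
# File contents on disk now.
current_files = {"a.py": b"print(1)", "b.py": b"x = 2", "new.py": b"y = 3"}
plan = diff_by_hash(previous, current_files)
```

Only the `new` and `changed` files would need re-embedding, and chunks for `deleted` files would be dropped.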

"Review my changes for bugs"

Tool: codewalk_run_review — optional: staged=true, target_branch="master"

You're about to push a PR and want an automated code review.

@codewalk review my changes
or
@codewalk_run_review
@codewalk_run_review staged=true target_branch="master"

When to use: Before pushing a PR. codewalk_run_review gathers the full diff, neighborhood context, blast radius, and stack signals, then returns them to Copilot so it can perform the review directly using enriched context — no local LLM overhead, instant results.

"Review this specific file"

Tool: codewalk_review_file(file_path) — pass the file path

You want to check if a file follows the project's conventions.

@codewalk review src/codewalk/pipeline.py
or
@codewalk_review_file src/codewalk/pipeline.py

When to use: When you want to review any file — no git diff needed. Reads the file directly, enriches it with caller context (who imports it), security patterns from the vector store, similar code elsewhere in the codebase, and team guidelines. Copilot performs the review natively — no local LLM, instant results.

"Load our team's coding guidelines"

Tool: codewalk_load_guidelines(docs_path) — pass path to guidelines directory

Your team has coding standards in markdown files.

@codewalk load guidelines from docs/standards
or
@codewalk_load_guidelines docs/standards

When to use: Once per project. After loading, codewalk_run_review and codewalk_review_file automatically include your team's standards in their context.

"Talk to the codebase hands-free"

Tools: codewalk_voice_ask + codewalk_speak — no parameters needed

You want to ask a question by speaking instead of typing.

@codewalk_voice_ask

What happens:

🔔 Beep — signals "start talking"
🎙️ Records your voice (up to 30s, stops after 5s of silence)
📝 Transcribes locally via faster-whisper
🧠 Copilot reads the transcript and picks the right codewalk tool
⚙️ Copilot calls the tool and gets the result
🔊 Copilot calls codewalk_speak(summary) — speaks a 2-4 sentence summary aloud

When to use: Hands-free coding. You're reading code and want to ask "what does this function do?" without switching to the keyboard.

Note: Routing is done by Copilot (full LLM), not a separate model — no Ollama required for MCP voice. The REST API (POST /voice/ask) sends the transcript directly to the chat agent, which picks the right tool natively.

"Is the architecture healthy?"

Tool: codewalk_get_architecture_health — no parameters needed

You want a health check: bottleneck files, circular dependencies, and the most important files.

@codewalk check the architecture health
or
@codewalk_get_architecture_health

Returns: Graph stats, bottleneck files (betweenness centrality), most important files (PageRank), circular dependencies with suggested fixes.

When to use: Before a refactor, code review, or whenever you suspect architectural issues.

"How does file A reach file B?"

Tool: codewalk_call_chain(source, target) — two file names

You want to trace the import chain between two files — "how does a change in config.py eventually affect server.py?"

@codewalk trace the import chain from config.py to server.py
or
@codewalk_call_chain config.py server.py

Returns: Shortest import path with hop count and full file paths.

When to use: Understanding how changes propagate, debugging import issues, or tracing dependency chains.
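
A shortest import path with a hop count can be found with a breadth-first search. The sketch below is pure Python for illustration (this README says the graph layer uses igraph). The `shortest_chain` helper and the edge direction ("file → files that import it") are assumptions.

```python
from collections import deque
from typing import Optional


def shortest_chain(edges: dict[str, list[str]], source: str, target: str) -> Optional[list[str]]:
    """Return the shortest path from source to target, or None if unreachable.

    `edges` maps a file to the files that import it (assumed direction),
    so a path shows how a change in `source` can propagate to `target`.
    """
    if source == target:
        return [source]
    parents: dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt in parents:
                continue
            parents[nxt] = node
            if nxt == target:
                # Walk parent links back to the source, then reverse.
                path = [nxt]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


edges = {
    "config.py": ["db.py", "settings.py"],
    "db.py": ["routes.py"],
    "routes.py": ["server.py"],
    "settings.py": ["server.py"],
}
chain = shortest_chain(edges, "config.py", "server.py")
hops = len(chain) - 1
```

BFS explores paths in order of length, so the first time it reaches `server.py` the path found is a shortest one.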

"Find circular dependencies"

Tool: codewalk_find_circular_dependencies() — no parameters

Detect import cycles that can cause brittle architecture or load-order bugs.

@codewalk find circular dependencies
or
@codewalk_find_circular_dependencies

When to use: Before a refactor or when investigating why two modules feel tightly coupled.
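
For intuition, an import cycle can be detected with a depth-first search that tracks the files on the current path. This is a pure-Python sketch, not Codewalk's igraph-based implementation. `find_cycle` and the `imports` shape (file → files it imports) are hypothetical.

```python
from typing import Optional


def find_cycle(imports: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one import cycle as a list of files (first == last), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: dict[str, int] = {}
    stack: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GRAY
        stack.append(node)
        for dep in imports.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path: we closed a loop.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in imports:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


cyclic = {"a.py": ["b.py"], "b.py": ["c.py"], "c.py": ["a.py"], "d.py": ["a.py"]}
acyclic = {"a.py": ["b.py"], "b.py": ["c.py"], "c.py": []}
cycle = find_cycle(cyclic)
```

The recursive form keeps the sketch short; a very deep import graph would call for an iterative version to stay within Python's recursion limit.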

"Look up a symbol"

Tool: codewalk_lookup_symbol(symbol_name) — pass a function, class, or method name

Find every definition and key references of a named symbol across the repo.

@codewalk lookup symbol authenticate_user
or
@codewalk_lookup_symbol authenticate_user

When to use: You know a name and want its exact file, line, and callers without doing a semantic search.

"Generate a starter config"

Tool: codewalk_generate_config() — no parameters

Create a stack-specific codewalk.yaml with sensible excludes for your repo.

@codewalk generate a codewalk.yaml for this repo
or
@codewalk_generate_config

When to use: First-time setup, before the first analyze, to avoid indexing build artifacts and tests.

"Run static analysis"

Tool: codewalk_run_static_analysis(file_paths) — pass one or more files

Run language-appropriate linters/type-checkers (ruff, mypy, eslint, etc.) on the given files.

@codewalk run static analysis on src/auth.py
or
@codewalk_run_static_analysis src/auth.py

When to use: After applying a fix or editing files to catch style/type issues quickly.

"Run tests"

Tool: codewalk_run_tests(file_paths) — pass one or more files

Auto-detect and run the relevant test command (pytest, npm test, go test, cargo test, etc.).

@codewalk run tests for src/auth.py
or
@codewalk_run_tests src/auth.py

When to use: After a fix or refactor to confirm nothing broke.

Quick Reference — What To Ask

You want to...	Just say...
First-time setup	`@codewalk analyze this codebase` or `@codewalk_analyze_codebase`
Big picture overview	`@codewalk give me an overview` or `@codewalk_get_overview`
Understand a module	`@codewalk tell me about the auth module` or `@codewalk_get_module_info auth`
Understand a function	`@codewalk explain the verify_request function` or `@codewalk_explain_function verify_request`
Find code by concept	`@codewalk how does error handling work?` or `@codewalk_search_codebase how does error handling work?`
Check change risk	`@codewalk what's the blast radius of config.py?` or `@codewalk_get_blast_radius_map config.py`
Find riskiest files	`@codewalk show me the riskiest files`
Best reading order	`@codewalk what order should I read the agent module?` or `@codewalk_get_reading_order agent`
See dependency flow	`@codewalk show me the execution flow` or `@codewalk_get_execution_flow`
After code changes	`@codewalk refresh the analysis` or `@codewalk_refresh_analysis`
Update embeddings	`@codewalk reindex changed files` or `@codewalk_incremental_reindex`
Review git diff	`@codewalk review my changes` or `@codewalk_run_review`
Review a file	`@codewalk review src/auth.py` or `@codewalk_review_file src/auth.py`
Get stack signals	`@codewalk what stack is this?` or `@codewalk_get_stack_info`
Load guidelines	`@codewalk load guidelines from docs/` or `@codewalk_load_guidelines docs/`
Architecture health	`@codewalk check architecture health` or `@codewalk_get_architecture_health`
Trace import chain	`@codewalk trace chain from config.py to server.py` or `@codewalk_call_chain config.py server.py`
Find circular dependencies	`@codewalk find circular dependencies` or `@codewalk_find_circular_dependencies`
Lookup a symbol	`@codewalk lookup symbol authenticate_user` or `@codewalk_lookup_symbol authenticate_user`
Run static analysis	`@codewalk run static analysis on src/auth.py` or `@codewalk_run_static_analysis src/auth.py`
Run tests	`@codewalk run tests for src/auth.py` or `@codewalk_run_tests src/auth.py`
Generate repo config	`@codewalk generate a codewalk.yaml` or `@codewalk_generate_config`
Search team docs	`@codewalk search docs for deployment` or `@codewalk_search_docs deployment`
Ask docs a question	`@codewalk how do we deploy?` or `@codewalk_ask_docs how do we deploy`
Deep research	`@codewalk research how error handling works across the codebase`
Accept/reject findings	`@codewalk accept finding 3` → `@codewalk_finding_verdict`
Apply accepted fixes	`@codewalk apply accepted fixes` or `@codewalk_apply_accepted`
Approve then apply one fix	`@codewalk approve apply fix to auth.py` → `@codewalk_approve_action` → `@codewalk_apply_fix`
Ask by speaking (hands-free)	`@codewalk_voice_ask` → Copilot calls tool → `@codewalk_speak`

📡 API Reference

Base URL: http://localhost:8000

Start the server:

source .codewalk-env/bin/activate
uvicorn src.codewalk.api.main:app --reload --port 8000

Analysis Endpoints

`POST /analyze` — Index a codebase

curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "",
    "index_mode": "auto"
  }'

Response:

{
  "status": "complete",
  "repo_path": "/Users/you/projects/my-app",
  "files_scanned": 142,
  "chunks_created": 380,
  "modules": ["api", "auth", "models", "utils", "frontend"]
}

The repo is discovered from the current working directory via codewalk.yaml (auto-created if missing). repo_path is no longer accepted in the request body.
index_mode: "auto" (skip if indexed), "reindex" (smart update), "full" (wipe & rebuild)
collection_name: leave empty — reads manifest.collection_name if present, else repo folder name
auto + index on disk → load only (load_scoped_analysis), no re-embed — same idea as MCP codewalk_analyze_codebase
No index → full_index_parallel with codewalk.yaml excludes (local embed on API server)
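
The `index_mode` rules above can be summarized as a small decision function. This is a sketch of the documented behavior only. `plan_index_action` and its return labels are hypothetical, and treating `reindex` with no index on disk as a full index is an assumption drawn from the "No index" rule.

```python
def plan_index_action(index_mode: str, index_on_disk: bool) -> str:
    """Map (index_mode, index present?) to the action described in the notes above."""
    if index_mode not in {"auto", "reindex", "full"}:
        raise ValueError(f"unknown index_mode: {index_mode!r}")
    # "full" always wipes and rebuilds; a missing index also forces a full build
    # (assumed to hold for "reindex" as well).
    if index_mode == "full" or not index_on_disk:
        return "full_index"
    if index_mode == "reindex":
        return "incremental_reindex"
    # "auto" with an index on disk: load only, no re-embedding.
    return "load_only"


decisions = {
    (mode, present): plan_index_action(mode, present)
    for mode in ("auto", "reindex", "full")
    for present in (True, False)
}
```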

`POST /analyze/stream` — Index with live progress (SSE)

curl -N -X POST http://localhost:8000/analyze/stream \
  -H "Content-Type: application/json" \
  -d '{"index_mode": "auto"}'

Response (Server-Sent Events) — step values from analyze_stream() in main.py:

`step`	When
`init`	Always first — checking existing index
`skip`	`index_mode: auto` + `.codewalk/` on disk (load only), or non-full/reindex skip
`scan`	Full index or reindex — file scan (`codewalk.yaml` excludes on full)
`chunk`	Full index — parallel chunk + embed
`embed`	Full index — embed count
`store`	Full index — Chroma persist + manifest
`reindex`	`index_mode: reindex` — new/changed/deleted counts
`analyze`	Dependency graph + module detection
`agent`	`state.initialize` (DuckDB, docs, guidelines, agent)
`done`	Success (`result` object on final event when complete)
`error`	Exception message

index_mode: auto + existing .codewalk/ (fast path):

data: {"step": "init", "message": "Checking existing index..."}
data: {"step": "skip", "message": "Loaded existing index (380 chunks)"}
data: {"step": "done", "message": "Analysis complete!", "result": {...}}

index_mode: full (or empty index):

data: {"step": "init", "message": "Checking existing index..."}
data: {"step": "scan", "message": "Scanning directory..."}
data: {"step": "scan", "message": "Scanned 142 files (codewalk.yaml excludes applied)"}
data: {"step": "chunk", "message": "Chunking + embedding in parallel..."}
data: {"step": "chunk", "message": "Created 380 chunks"}
data: {"step": "embed", "message": "Embedded 380 chunks"}
data: {"step": "store", "message": "Storing in vector database..."}
data: {"step": "store", "message": "Stored 380 chunks in ChromaDB"}
data: {"step": "analyze", "message": "Building dependency graph..."}
data: {"step": "analyze", "message": "Detected 5 modules"}
data: {"step": "agent", "message": "Creating AI agent..."}
data: {"step": "done", "message": "Analysis complete!", "result": {...}}
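
A client can consume this stream by reading `data:` lines and decoding each payload as JSON. The sketch below assumes one JSON object per `data:` line, as in the samples above. `parse_sse_events` is a hypothetical helper, and the `result` payload in the example is a placeholder.

```python
import json
from typing import Iterable, Iterator


def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of every `data:` line in an SSE stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


# Lines as they might arrive on the fast path (placeholder result payload).
sample = [
    'data: {"step": "init", "message": "Checking existing index..."}',
    "",
    'data: {"step": "skip", "message": "Loaded existing index (380 chunks)"}',
    "",
    'data: {"step": "done", "message": "Analysis complete!", "result": {"status": "complete"}}',
]
events = list(parse_sse_events(sample))
steps = [event["step"] for event in events]
final = events[-1]
```

In a real client the `lines` iterable would come from the streaming HTTP response; stop reading once a `done` or `error` step arrives.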

`GET /index-status` — Check whether the current workspace is indexed

# Check cwd-discovered repo
curl http://localhost:8000/index-status | python3 -m json.tool

# Optional: check a specific repo path
curl "http://localhost:8000/index-status?repo_path=/Users/you/projects/my-app" | python3 -m json.tool

Response:

{
  "indexed": true,
  "repo_path": "/Users/you/projects/my-app"
}

The frontend sidebar uses this endpoint to lock index-dependent tabs until indexed: true.

API endpoints — index requirements (parity with MCP)

All query endpoints call state.require_index() — auto-loads .codewalk/ from disk after server restart (same as MCP _require_index()).

Endpoint	Index required?	MCP equivalent	Notes
`POST /analyze`	Builds or loads	`codewalk_analyze_codebase`	API: `full_index_parallel`; MCP local: `index_from_paths_parallel`
`POST /analyze/stream`	Builds or loads	same (SSE progress)	Steps: `init`→`skip`/`scan`→`chunk`→`embed`→`store`→`analyze`→`agent`→`done`
`POST /chat`, `/chat/stream`	Yes	agent + tools	API HITL via `POST /chat/approve`
`GET /overview`	Yes	`codewalk_get_overview`
`GET /modules`, `/modules/{name}`	Yes	`codewalk_get_module_info`
`GET /blast-radius`, `/blast-radius/{m}`	Yes	`codewalk_get_blast_radius_map`
`GET /reading-order`	Yes	`codewalk_get_reading_order`
`GET /execution-flow`	Yes	`codewalk_get_execution_flow`
`GET /architecture`, `/cycles`	Yes	`codewalk_get_architecture_health`
`POST /semantic-search`	Yes	`codewalk_search_codebase`	Chroma semantic search endpoint
`POST /rag/expand-query`	Yes	—	LLM query expansion for RAG
`POST /rag/rerank`	Yes	—	LLM chunk reranking
`POST /rag/symbol-lookup`	Yes	`codewalk_lookup_symbol`	DuckDB symbol lookup
`POST /tools/static-analysis`	No	`codewalk_run_static_analysis`	ruff/mypy/eslint/etc. on files
`POST /tools/run-tests`	No	`codewalk_run_tests`	pytest/npm test/etc. on files
`GET /version`	No	`codewalk_check_version`	Codewalk version + commit info
`GET /staleness`	Yes	—	Local vs cloud index staleness
`POST /refresh`	Yes	`codewalk_refresh_analysis`	No re-embed
`POST /incremental-reindex`	Yes	`codewalk_incremental_reindex`	`team_config` + manifest collection
`POST /review`	Soft (better with index)	`codewalk_run_review`	Works with partial context
`POST /review/stream`	Soft (better with index)	—	SSE progress events
`POST /review/cancel`	Yes	—	Cancel a running review
`POST /review/file`	Yes	`codewalk_review_file`
`POST /review/guidelines`	No	`codewalk_load_guidelines`	Guidelines Chroma only
`POST /review/verdict`	Yes	`codewalk_finding_verdict`	Accept/reject a finding
`POST /review/apply-accepted`	Yes	`codewalk_apply_accepted`	Apply all accepted fixes
`POST /review/apply`	No (repo path only)	`codewalk_apply_fix`	Caller approves in UI; no token gate
`POST /docs/index`, `/docs/search`, `/docs/ask`	Doc index only	`codewalk_index_docs` etc.
`POST /chat/approve`	Yes	—	Resume/reject interrupted agent
`POST /voice/ask`	Yes	`codewalk_voice_ask`
`POST /research`	Yes	deep research
`GET /health`	No	—
Cloud `GET /indexes/...`	Download only	`codewalk_pull_index`	Blocked when cloud-only mode

Cloud API server (DATABASE_URL set): query endpoints above return 400 — indexes are built on server, queried locally via MCP download.

`POST /refresh` — Re-scan without re-embedding

curl -X POST http://localhost:8000/refresh

Response:

{
  "status": "refreshed",
  "repo_path": "/Users/you/projects/my-app",
  "files": 142,
  "modules": ["api", "auth", "models", "utils", "frontend"]
}

Chat Endpoint

`POST /chat` — Ask the AI agent a question

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain how authentication works in this project", "thread_id": "thread-1"}'

Response:

{
  "answer": "The authentication flow starts in auth/middleware.py which checks JWT tokens on every request. The token validation logic is in auth/jwt.py which uses the python-jose library...",
  "thread_id": "thread-1"
}

Multi-turn conversation — use the same thread_id:

# Follow-up question
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What happens if the token expires?", "thread_id": "thread-1"}'
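
The same multi-turn exchange can be scripted from Python with the standard library. This is a sketch that mirrors the curl calls above. The helper names `build_chat_request` and `ask` are hypothetical, and `ask` only works against a running local server with an indexed repo.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_chat_request(message: str, thread_id: str) -> urllib.request.Request:
    """Build the POST /chat request; reuse thread_id to continue a conversation."""
    body = json.dumps({"message": message, "thread_id": thread_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(message: str, thread_id: str) -> str:
    """Send the question and return the `answer` field of the JSON response."""
    with urllib.request.urlopen(build_chat_request(message, thread_id)) as resp:
        return json.load(resp)["answer"]


# Both requests carry the same thread_id, so the second is a follow-up.
first = build_chat_request("Explain how authentication works in this project", "thread-1")
follow_up = build_chat_request("What happens if the token expires?", "thread-1")
```

Calling `ask(...)` twice with `"thread-1"` reproduces the two curl commands above.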

View Endpoints

`GET /overview` — Project overview

curl http://localhost:8000/overview

Response:

{
  "tech_stack": ["Python", "FastAPI", "React"],
  "total_files": 142,
  "total_modules": 5,
  "modules": [
    {"name": "api", "file_count": 12, "depends_on": ["auth", "models"]},
    {"name": "auth", "file_count": 5, "depends_on": ["models"]}
  ],
  "diagram": "graph TD\n    api --> auth\n    api --> models\n    auth --> models",
  "overview_text": "## Project Overview\nTech stack: Python, FastAPI...",
  "riskiest_files": [
    {"file": "models/base.py", "risk_level": "high", "affected_files": 23}
  ]
}

`GET /modules` — List all modules

curl http://localhost:8000/modules

Response:

{
  "modules": [
    {"name": "api", "file_count": 12, "languages": ["python"]},
    {"name": "auth", "file_count": 5, "languages": ["python"]},
    {"name": "frontend", "file_count": 34, "languages": ["typescript", "css"]}
  ],
  "total": 5
}

`GET /modules/{name}` — Module details

curl http://localhost:8000/modules/auth

Response:

{
  "name": "auth",
  "file_count": 5,
  "files": ["auth/middleware.py", "auth/jwt.py", "auth/permissions.py", "auth/models.py", "auth/__init__.py"],
  "languages": {"python": 5},
  "depends_on": ["models"],
  "depended_by": ["api"],
  "blast_radius": [
    {"file": "auth/middleware.py", "risk_level": "moderate", "affected_files": 8}
  ],
  "module_risk": "moderate"
}

`GET /blast-radius` — Top 15 riskiest files

curl http://localhost:8000/blast-radius

Response:

{
  "module": null,
  "module_risk": "high",
  "total_files": 15,
  "files": [
    {
      "file": "models/base.py",
      "risk_level": "high",
      "affected_files": 23,
      "direct": ["api/routes.py", "auth/models.py"],
      "transitive": ["api/views.py", "auth/middleware.py"]
    }
  ]
}

`GET /blast-radius/{module}` — Blast radius for a module

curl http://localhost:8000/blast-radius/auth

`GET /reading-order` — Recommended reading order

curl http://localhost:8000/reading-order

Response:

{
  "order": [
    {
      "file": "config.py",
      "position": 1,
      "why": "No internal dependencies",
      "risk_level": "moderate",
      "affected_files": 12,
      "direct": ["embedder.py", "chain.py"],
      "transitive": ["pipeline.py"]
    },
    {
      "file": "models/base.py",
      "position": 2,
      "why": "No internal dependencies | Used by: routes.py, views.py",
      "risk_level": "high",
      "affected_files": 23
    }
  ]
}

`GET /execution-flow` — Execution flow diagram

curl http://localhost:8000/execution-flow

Response:

{
  "flow": "## Execution Flow — Module Level\nEntry modules: api, cli\nTotal modules: 5\n\n### Module Dependencies\n  api (12 files) → depends on: auth, models\n  auth (5 files) → depends on: models\n  models (8 files) → (standalone)\n  utils (6 files) → (standalone)\n  frontend (34 files) → (standalone)"
}

Maintenance Endpoints

`POST /incremental-reindex` — Re-embed only changed files

curl -X POST http://localhost:8000/incremental-reindex

Response:

{
  "repo_path": "/Users/you/projects/my-app",
  "files_on_disk": 142,
  "files_skipped": 138,
  "files_reindexed": 3,
  "files_deleted": 1,
  "chunks_embedded": 12,
  "total_time": "2.3s"
}

How it works: Incremental reindex first performs a Chroma incremental update (only changed files are embedded/deleted). It then fully rebuilds DuckDB and knowledge-graph.json from all Chroma chunks, and re-indexes docs/guidelines. The manifest ({repo}/.codewalk/manifest.json) is updated on every write with an incremented index_version and a chunk_count reflecting total Chroma chunks.

Review Endpoints

`POST /review` — Review git diff

curl -X POST http://localhost:8000/review \
  -H "Content-Type: application/json" \
  -d '{"staged": false, "target_branch": "master"}'

Response:

{
  "verdict": "request_changes",
  "verdict_reason": "Critical security issue found that must be fixed before merge.",
  "issues": [
    {
      "severity": "critical",
      "category": "security",
      "file_path": "src/auth/jwt.py",
      "line_number": 42,
      "title": "JWT secret hardcoded",
      "explanation": "The JWT signing secret is hardcoded in the source file.",
      "suggestion": "Move the secret to an environment variable.",
      "code_snippet": "SECRET = 'my-secret-key'",
      "blocking": true,
      "confidence": "high"
    }
  ],
  "summary": "Found 1 critical issue in 3 files (+45 / -12 lines)",
  "narrative_summary": "",
  "files_reviewed": 3,
  "lines_added": 45,
  "lines_removed": 12,
  "session_id": "25-June-2026-143052-feature-x-to-main",
  "architecture_flags": {},
  "schema_version": "2.0",
  "merge_blockers": ["JWT secret hardcoded"],
  "clusters": [],
  "fixed_count": 0,
  "new_count": 1,
  "still_present_count": 0
}

staged: If true, review only staged changes (--staged). Default: false.
target_branch: Diff against a branch (e.g. "master" for full PR review). Default: null (unstaged changes).
incremental: Carry forward previous findings when true. Default: false.
narrative_summary: Set true for an LLM-written narrative summary (slower). Default: false.
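
In a CI script, the review response can be reduced to a pass/fail decision. The sketch below relies only on fields shown in the sample response (`verdict`, `issues[].blocking`, `issues[].title`). The helpers `blocking_titles` and `ci_exit_code` are hypothetical, as is the policy of failing on any blocking issue.

```python
def blocking_titles(review: dict) -> list[str]:
    """Titles of issues marked blocking; issues without the flag count as non-blocking."""
    return [issue["title"] for issue in review.get("issues", []) if issue.get("blocking")]


def ci_exit_code(review: dict) -> int:
    """Return 1 when the review requests changes or has blocking issues, else 0."""
    if review.get("verdict") == "request_changes" or blocking_titles(review):
        return 1
    return 0


# Trimmed versions of the sample responses in this section.
failing = {
    "verdict": "request_changes",
    "issues": [{"severity": "critical", "title": "JWT secret hardcoded", "blocking": True}],
}
passing = {
    "verdict": "approve_with_nits",
    "issues": [{"severity": "suggestion", "title": "Consider extracting helper function"}],
}
```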

`POST /review/file` — Review a single file

curl -X POST http://localhost:8000/review/file \
  -H "Content-Type: application/json" \
  -d '{"file_path": "src/codewalk/pipeline.py"}'

Response:

{
  "verdict": "approve_with_nits",
  "verdict_reason": "Non-critical issues found. Fix recommended but not blocking.",
  "issues": [
    {
      "severity": "suggestion",
      "category": "style",
      "file_path": "src/codewalk/pipeline.py",
      "line_number": 120,
      "title": "Consider extracting helper function",
      "explanation": "The inline loop is repeated in two places.",
      "suggestion": "Move the loop body into a private helper.",
      "code_snippet": "for chunk in chunks:"
    }
  ],
  "summary": "Clean change with one minor style suggestion.",
  "file_path": "src/codewalk/pipeline.py"
}

`POST /review/guidelines` — Load coding guidelines

curl -X POST http://localhost:8000/review/guidelines \
  -H "Content-Type: application/json" \
  -d '{"docs_path": "/path/to/guidelines"}'

Response:

{
  "status": "loaded",
  "chunks": 24,
  "path": "/path/to/guidelines"
}

Voice Endpoint

`POST /voice/ask` — Voice-in, voice-out Q&A

Upload an audio file (webm/mp3/wav from browser mic). Codewalk transcribes it, sends it to the chat agent (which picks the right tool natively), and returns both the text answer and a spoken MP3 response.

curl -X POST http://localhost:8000/voice/ask \
  -F "audio=@question.webm" \
  -F "thread_id=voice"

Response:

{
  "question": "what does the auth module do?",
  "answer": "The auth module contains 5 files handling JWT validation...",
  "speech": "The auth module handles JWT validation and permissions.",
  "audio_base64": "SUQzBAAAAAAAI1RTU0UAAAA..."
}

audio (required): Audio file upload (webm, mp3, wav)
thread_id (optional): Conversation thread ID. Default: "voice"
audio_base64: Base64-encoded MP3 of the spoken answer — decode and play in the browser

Pipeline: audio upload → faster-whisper STT → chat agent (picks tool natively) → summarize → edge-tts → MP3 response
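
On the client side, the `audio_base64` field has to be decoded before it can be played or saved. A minimal sketch follows, assuming the response has already been parsed into a dict. `save_spoken_answer` is a hypothetical helper, and the demo payload is placeholder bytes rather than real MP3 audio.

```python
import base64
import tempfile
from pathlib import Path


def save_spoken_answer(response: dict, out_path: Path) -> int:
    """Decode `audio_base64` from a /voice/ask response and write it to out_path.

    Returns the number of bytes written.
    """
    audio = base64.b64decode(response["audio_base64"])
    Path(out_path).write_bytes(audio)
    return len(audio)


# Demo with placeholder bytes standing in for the MP3 payload.
fake_response = {
    "answer": "placeholder answer",
    "audio_base64": base64.b64encode(b"ID3 demo bytes").decode("ascii"),
}
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "answer.mp3"
    written = save_spoken_answer(fake_response, out)
    roundtrip = out.read_bytes()
```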

`GET /health` — Health check

curl http://localhost:8000/health

Response:

{
  "status": "ok"
}

☁️ Cloud Deployment

Production API: https://api.codewalk.xyz (indexing + webhooks + index download).
Marketing site (optional): https://codewalk.xyz

Architecture

git push → GitHub App webhook → api.codewalk.xyz
              ↓
         build in .incoming.{commit}/ → atomic_swap → active index
              ↓
Local MCP → GET /indexes/{owner}/{repo} → query locally

Indexing is server-side only — the cloud API does not serve /analyze or /chat for indexed repos. MCP downloads the index tarball and queries locally.

Cloud indexing lifecycle

Event	What happens
First `git push`	Auto-registers repo, incremental index, `index_status: ready`
Later pushes	Incremental re-index; cloud Postgres `index_version` bumps and is written to the downloaded `.codewalk/manifest.json`
Push during indexing	Older run superseded; newest commit wins
Deploy / API restart	Orphan jobs cancelled; catch-up re-indexes stale/pending repos (~15s)
Stale `codewalk_version`	Catch-up full re-index after semver deploy
Crash mid-write	Atomic swap — active index unchanged until publish succeeds

On your laptop, after the server index updates: run codewalk_pull_index (not codewalk_analyze_codebase) when cloud is configured.

Staleness banners (MCP): [Cloud] → pull index / wait for server catch-up; [Local] → codewalk_analyze_codebase. See deploy/SERVER_OPS.md §6.

Cloud Admin UI: The frontend includes an /admin page to register repos, list repos, trigger indexing, copy per-repo tokens, and check server health/version. Production API base is configured via NEXT_PUBLIC_API_URL.

Local-ahead safety: codewalk_pull_index and codewalk_connect_repo warn and require force=True when the local .codewalk/manifest.json index_version is ahead of the cloud Postgres row.
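
The local-ahead check amounts to comparing two version numbers before overwriting the local index. The sketch below illustrates that rule only. `check_pull_allowed`, its return shape, and the handling of a missing local index are assumptions, not the actual MCP tool code.

```python
from typing import Optional


def check_pull_allowed(local_version: Optional[int], cloud_version: int, force: bool = False) -> tuple[bool, str]:
    """Decide whether a cloud index may replace the local one.

    local_version is the index_version from .codewalk/manifest.json,
    or None when no local index exists (assumed to be always safe to pull).
    """
    if local_version is not None and local_version > cloud_version and not force:
        return False, (
            f"local index_version {local_version} is ahead of cloud "
            f"{cloud_version}; pass force=True to overwrite"
        )
    return True, ""


ahead = check_pull_allowed(local_version=7, cloud_version=5)
forced = check_pull_allowed(local_version=7, cloud_version=5, force=True)
behind = check_pull_allowed(local_version=4, cloud_version=5)
fresh = check_pull_allowed(local_version=None, cloud_version=5)
```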

Architecture (components)

Component	Where	Role
Cloud server	Hetzner + Docker	Index on push only
GitHub Actions	GitHub	Build image + deploy server
GitHub App	GitHub	Send `push` webhooks (must install app on repos)
Local MCP	Your laptop	Download index, run queries

Quick start

Follow FULL_SETUP_GUIDE.md (complete step-by-step)
Server .env: cp env.server.example.txt → /opt/codewalk/.env
GitHub App webhook: https://api.codewalk.xyz/webhooks/github
Install App on each repo to index (creating the app is not enough)
git push to that repo — this registers it and starts indexing (install alone is not enough)
Verify: POST /admin/repos with X-Admin-Key → index_status: ready
Get repo_token from DB → set CODEWALK_REPO_TOKEN in MCP config

Day-to-day server ops: deploy/SERVER_OPS.md — health, SQL, logs, reset-repo.sh (prepare/reset/delete, --dry-run).

Indexing a repo (checklist)

Step	Required for indexing?
Server running + cloud `.env` (App ID, PEM, webhook secret)	Yes
GitHub Actions secrets (`HETZNER_*`)	No — deploy only
Install GitHub App on the repo	Yes
`git push` to the repo	Yes — triggers register + index
`repo_token` in local MCP	Yes — for downloading the index

GitHub Actions deploys the server. Indexing is triggered by GitHub App push webhooks, not Actions.

Per-repo team config (`codewalk.yaml`)

Each indexed repo can have a codewalk.yaml at its root:

indexing:
  branches:          # only these branches trigger indexing (fnmatch)
    - master
    - release/**
  exclude:
    # Repo-specific dirs/files (the core safety net already skips
    # node_modules, build artifacts, binaries, secrets, lock files, etc.)
    - frontend/**
    - docs/**
  include:
    # Override an exclusion for a specific path
    - docs/architecture/**

Generate a starter config with stack-specific excludes:

python -m src.codewalk.cli generate-config

Or via MCP: @codewalk Run codewalk_generate_config.

Cloud reads codewalk.yaml on every index. Pushes to other branches are ignored. See FULL_SETUP_GUIDE.md § Phase 7.
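
To see how the patterns in the sample config behave, here is a sketch using Python's `fnmatch`. The config comment says branch patterns use fnmatch; applying the same matching to `exclude`/`include`, and letting `include` win over `exclude`, are assumptions based on the comments in the sample. Both helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def branch_triggers_indexing(branch: str, patterns: list[str]) -> bool:
    """True when the pushed branch matches any configured branch pattern."""
    return any(fnmatchcase(branch, pattern) for pattern in patterns)


def is_path_indexed(path: str, exclude: list[str], include: list[str]) -> bool:
    """Apply exclude patterns, letting include patterns override them (assumed precedence)."""
    if any(fnmatchcase(path, pattern) for pattern in include):
        return True
    return not any(fnmatchcase(path, pattern) for pattern in exclude)


# Values taken from the sample codewalk.yaml above.
branches = ["master", "release/**"]
exclude = ["frontend/**", "docs/**"]
include = ["docs/architecture/**"]

checks = {
    "master": branch_triggers_indexing("master", branches),
    "release/1.2": branch_triggers_indexing("release/1.2", branches),
    "feature/x": branch_triggers_indexing("feature/x", branches),
}
```

With these patterns, `docs/architecture/overview.md` stays in the index even though `docs/**` is excluded, while other files under `docs/` and `frontend/` are skipped.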

Config templates

File	Use
FULL_SETUP_GUIDE.md	Complete A→Z setup
deploy/DEPLOY.md	Deployment guide
deploy/SERVER_OPS.md	Server ops — health, indexing, SQL, reset-repo.sh
env.server.example.txt	Hetzner `/opt/codewalk/.env`
env.local.example.txt	Local dev `.env`
mcp.json.example	MCP config → `.vscode/mcp.json`
env.example.txt	All env vars index

CI/CD (GitHub Actions)

Push to master → build image → GHCR → deploy to Hetzner (deploy-server.sh syncs compose + Caddyfile).

Secrets: HETZNER_HOST, HETZNER_USER, HETZNER_SSH_KEY


🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│                      INTERFACES                         │
│                                                         │
│   Next.js Web UI (:3000)    MCP Server    REST API      │
│   ├── Overview              (stdio)       (:8000)       │
│   ├── Modules                  │             │          │
│   ├── Blast Radius             │             │          │
│   ├── Reading Order        Voice Interface   │          │
│   ├── Execution Flow       (mic → speak)     │          │
│   ├── Code Review              │             │          │
│   ├── Smart Reindex            │             │          │
│   └── Chat ──────────────────┐ │             │          │
│                              ▼ ▼             ▼          │
├──────────────────────────────────────────────────────────┤
│                     AGENT LAYER                          │
│                                                          │
│   LangGraph StateGraph ─── LLM (bind_tools) ───┐        │
│          │                                      │        │
│          ▼                                      ▼        │
│   ┌─ 11 Agent Tools ─────────────────────────────┐       │
│   │ search_codebase     get_overview             │       │
│   │ get_module_info     get_blast_radius_map     │       │
│   │ explain_function    get_reading_order        │       │
│   │ get_execution_flow  get_architecture_health  │       │
│   │ load_guidelines     apply_fix                │       │
│   │ verify_fix                                   │       │
│   └──────────────────────────────────────────────┘       │
├──────────────────────────────────────────────────────────┤
│                   INGESTION LAYER                         │
│                                                          │
│   scanner.py ──► file_filter.py ──► tech_detect.py       │
│   (file enum         (skip rules       (language/        │
│    & hashing)         & safety net)     framework id)     │
├──────────────────────────────────────────────────────────┤
│                    ANALYSIS LAYER                         │
│                                                          │
│   code_parser.py ──► dependency_graph.py ──► module_     │
│   (tree-sitter       (import extraction    detector      │
│    15+ langs)         → graph)                            │
│                         │                                │
│                         ▼                                │
│   blast_radius.py   reading_order.py                     │
│   (BFS reverse       (topological                         │
│    graph)             sort)                                │
├──────────────────────────────────────────────────────────┤
│                    GRAPH LAYER                           │
│                                                          │
│   graph/store.py ──► graph/runtime.py                    │
│   (DuckDB 10-table    (igraph C-speed                    │
│    persistent          traversal: cycles,                 │
│    graph)              centrality, paths)                 │
│                                                          │
│   .codewalk/graph.duckdb  ◄── files, imports, symbols,   │
│                               symbol_calls, chunks,       │
│                               modules, module_deps        │
├──────────────────────────────────────────────────────────┤
│                    REVIEW LAYER                          │
│                                                          │
│   diff_parser.py → reviewers/ → pipeline/ → engine.py    │
│   (git diff         (pluggable         (cluster/rank/    │
│    parsing)          reviewers)         verify/summary)   │
│                                                          │
│   report.py ────────► fix_applier.py                     │
│   (Finding dataclasses)  (approved fix application)       │
├──────────────────────────────────────────────────────────┤
│                   EMBEDDING LAYER                        │
│                                                          │
│   chunker.py ──► embedder.py ──► vector_store.py         │
│   (smart code     (Jina 1.5B     (ChromaDB               │
│    chunks)         MPS/CUDA)      persistent)             │
├──────────────────────────────────────────────────────────┤
│                     VOICE LAYER                          │
│                                                          │
│   ┌── mic ──► stt.py ──► router.py ──► tool exec ──┐    │
│   │  sounddevice   faster-whisper   get_llm()          │    │
│   │  (record)      (transcribe)     (route to tool) │    │
│   │                                                 │    │
│   │            ┌─ content tool? ─┐                   │    │
│   │            │  YES            │ NO (admin)        │    │
│   │            ▼                 ▼                   │    │
│   │   main LLM (get_llm())   return text only       │    │
│   │   raw result → speech    (no TTS)               │    │
│   │         │                                       │    │
│   │   tts.py ◄── speech                             │    │
│   │   edge-tts (speak answer)                       │    │
│   └─────────────────────────────────────────────────┘    │
│                                                          │
│   Voice Flow:                                            │
│   🔔 beep → 🎙️ record (30s max, 5s silence stop)        │
│   → 📝 transcribe (faster-whisper, local)                │
│   → 🧠 route (configured LLM picks the right tool + args) │
│   → ⚙️ execute tool                                      │
│   → 🔇 admin tool? → text result only (silent)           │
│   → 🔊 content tool? → main LLM → speech → edge-tts     │
├──────────────────────────────────────────────────────────┤
│                     CORE LAYER (v2.4–v2.7)               │
│                                                          │
│   core/reflect.py   → Actor→Critic→Improve loop         │
│   core/hitl.py      → LangGraph interrupts + checkpoint │
│   core/fanout.py    → Parallel fan-out/fan-in graphs    │
│                                                          │
│   Used by: agent (hitl), research (reflect + fanout)    │
│   — generic, composable, zero duplication                │
├──────────────────────────────────────────────────────────┤
│                     LLM LAYER                            │
│                                                          │
│   config.py ──► get_llm() factory                        │
│   Ollama │ OpenAI │ Anthropic │ Gemini │ Groq │ ...      │
└──────────────────────────────────────────────────────────┘
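The Analysis Layer in the diagram names two graph techniques: a BFS over the reversed dependency graph for blast radius, and a topological sort for reading order. A minimal sketch of both follows, assuming the dependency graph is a plain dict mapping each file to the files it imports. The `IMPORTS` data and both function signatures are made up for illustration and are not taken from `blast_radius.py` or `reading_order.py`.

```python
from collections import defaultdict, deque

# Hypothetical input: file -> files it imports
IMPORTS = {
    "api.py": ["service.py"],
    "service.py": ["db.py", "utils.py"],
    "db.py": ["utils.py"],
    "utils.py": [],
}


def blast_radius(imports, changed):
    """Files that depend on `changed`, directly or transitively (BFS over reversed edges)."""
    reverse = defaultdict(list)
    for src, deps in imports.items():
        for dep in deps:
            reverse[dep].append(src)
    seen, queue = set(), deque([changed])
    while queue:
        for dependent in reverse[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def reading_order(imports):
    """Dependencies-first order via Kahn's topological sort; raises on import cycles."""
    nodes = set(imports) | {d for deps in imports.values() for d in deps}
    remaining = {n: len(set(imports.get(n, []))) for n in nodes}
    dependents = defaultdict(list)
    for src in nodes:
        for dep in set(imports.get(src, [])):
            dependents[dep].append(src)
    ready = deque(sorted(n for n, count in remaining.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for dependent in sorted(dependents[node]):
            remaining[dependent] -= 1
            if remaining[dependent] == 0:
                ready.append(dependent)
    if len(order) != len(nodes):
        raise ValueError("import cycle detected; no full ordering exists")
    return order
```

With the sample data, changing `utils.py` affects every other file, and the reading order starts at `utils.py` and ends at `api.py`. A real repo with circular imports would need the cycles broken or collapsed first, which is where the Graph Layer's cycle detection would come in.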

Directory Structure

codewalk/
├── src/codewalk/
│   ├── config.py                  # Settings + LLM provider factory
│   ├── pipeline.py                # Orchestration (parallel embed)
│   ├── ingestion/                 # File scanning & tech detection
│   │   ├── scanner.py             #   File enumeration
│   │   ├── file_filter.py         #   Skip rules (node_modules, etc.)
│   │   └── tech_detect.py         #   Language/framework detection
│   ├── analysis/                  # Code parsing & dependency analysis
│   │   ├── code_parser.py         #   Tree-sitter (15+ languages)
│   │   ├── dependency_graph.py    #   Import extraction → graph
│   │   ├── module_detector.py     #   Auto-grouping into modules
│   │   ├── blast_radius.py        #   Change impact (BFS)
│   │   └── reading_order.py       #   Topological sort
│   ├── graph/                     # Graph intelligence layer
│   │   ├── graph_store.py         #   DuckDB 10-table schema + stable hash IDs
│   │   └── graph_runtime.py       #   igraph: cycles, centrality, shortest path
│   ├── embeddings/                # Vectorization
│   │   ├── chunker.py             #   Code → chunks
│   │   ├── embedder.py            #   Chunks → vectors
│   │   └── vector_store.py        #   ChromaDB storage
│   ├── agent/                     # LangGraph chat agent
│   │   ├── graph.py               #   StateGraph + fallback parser
│   │   ├── tools.py               #   11 tool functions
│   │   └── prompts.py             #   System prompt
│   ├── rag/                       # RAG pipeline
│   │   ├── chain.py               #   ask() + ask_corrective() (corrective RAG)
│   │   ├── retrieval_quality.py   #   Distance-based chunk filtering (free)
│   │   ├── answer_grader.py       #   LLM answer quality grading
│   │   └── query_rewriter.py      #   LLM query reformulation
│   ├── review/                    # Code review pipeline
│   │   ├── engine.py              #   Main review orchestrator (run_review)
│   │   ├── report.py              #   Finding, ReviewReport, Severity, Category
│   │   ├── diff_parser.py         #   git diff → parsed DiffFile objects
│   │   ├── fix_applier.py         #   Apply approved fixes safely
│   │   ├── finding_store.py       #   Persist review findings
│   │   ├── session_store.py       #   Persist review sessions
│   │   ├── reviewers/             #   Pluggable reviewers (generic, security, …)
│   │   ├── pipeline/              #   Post-processing (cluster/rank/verify/summary)
│   │   └── renderers/             #   Output formatters (markdown/cli/api)
│   ├── api/                       # FastAPI REST
│   │   ├── main.py                #   35+ endpoints
│   │   ├── models.py              #   Pydantic schemas
│   │   ├── state.py               #   Singleton app state + restart resilience
│   │   └── cloud.py               #   Cloud mode (GitHub App + webhooks)
│   ├── voice/                     # Voice interface
│   │   ├── stt.py                 #   Mic recording + faster-whisper transcription
│   │   ├── tts.py                 #   edge-tts speech synthesis (thread-safe)
│   │   ├── router.py              #   LLM-based tool routing (via get_llm)
│   │   ├── backends.py            #   Tool execution bridge
│   │   └── companion.py           #   Standalone voice loop
│   ├── core/                      # Reusable LangGraph patterns
│   │   ├── reflect.py             #   Actor→Critic→Improve loop
│   │   ├── hitl.py                #   Human-in-the-loop interrupts
│   │   └── fanout.py              #   Parallel fan-out/fan-in graphs
│   ├── research/                  # Deep research mode
│   │   ├── deep_research.py       #   End-to-end deep research entry point
│   │   ├── researcher.py          #   Parallel search + synthesis
│   │   ├── planner.py             #   Decompose question into sub-questions
│   │   └── synthesizer.py         #   Merge parallel findings into report
│   ├── generation/                # Explanation / diagram generation
│   │   ├── overview_generator.py  #   Project overview text
│   │   ├── module_explainer.py    #   Module-level explanations
│   │   ├── diagram_generator.py   #   Mermaid diagrams
│   │   └── flow_generator.py      #   Execution flow diagrams
│   ├── doc_knowledge/             # Docs & guidelines indexing
│   │   ├── doc_parser.py          #   Parse .md/.pdf/.txt/.rst
│   │   └── doc_store.py           #   ChromaDB doc collection wrapper
│   ├── services/                  # Deterministic service wrappers
│   │   ├── search_service.py      #   retrieval pipeline wrapper
│   │   └── symbol_service.py      #   symbol lookup wrapper
│   ├── tools/                     # Agent / MCP tool implementations
│   │   ├── static_analysis.py     #   Static analysis runner
│   │   ├── test_runner.py         #   Test execution runner
│   │   └── tool_runner.py         #   Generic tool dispatch
│   ├── worker/                    # Background cloud indexing worker
│   │   ├── indexer.py             #   Poll Postgres jobs, build indexes
│   │   ├── github_app.py          #   GitHub App token retrieval
│   │   └── atomic_store.py        #   Atomic index directory swap
│   ├── eval/                      # Evaluation & benchmarking
│   │   ├── evaluator.py           #   RAGAS RAG evaluation
│   │   ├── experiments.py         #   A/B parameter sweeps
│   │   └── generate_multilang_review_fixtures.py # Review eval fixtures
│   ├── debug/                     # Development/debug utilities
│   │   └── fanout_agent.py        #   Fan-out graph experiments
│   ├── cli.py                     # Command-line interface
│   └── mcp/                       # Model Context Protocol
│       └── server.py              #   38 MCP tools (stdio)
│
├── frontend/                      # Next.js 14 web UI
│   └── src/app/
│       ├── page.tsx               #   Home (analyze form)
│       ├── chat/page.tsx          #   AI chat interface
│       ├── overview/page.tsx      #   Project overview
│       ├── modules/page.tsx       #   Module browser
│       ├── modules/[name]/page.tsx#   Single module detail
│       ├── blast-radius/page.tsx  #   Change impact viewer
│       ├── reading-order/page.tsx #   Reading order viewer
│       ├── execution-flow/page.tsx#   Flow diagram viewer
│       ├── knowledge-graph/page.tsx#  Interactive graph explorer
│       ├── review/page.tsx        #   Code review (diff/file/guidelines)
│       ├── voice/page.tsx         #   Voice assistant (mic → transcribe → speak)
│       ├── admin/page.tsx         #   Admin dashboard
│       ├── architecture/page.tsx  #   Architecture health viewer
│       ├── docs/page.tsx          #   Team docs search & ask
│       ├── research/page.tsx      #   Deep research interface
│       └── incremental-reindex/   #   Smart reindex page
│           └── page.tsx
│
├── <target-repo>/.codewalk/
│   ├── chroma/                    # ChromaDB persistent storage (per repo)
│   ├── graph.duckdb               # DuckDB graph database (relationships)
│   ├── knowledge-graph.json       # Serialized knowledge graph entities/relationships
│   └── manifest.json              # Version tracking + index metadata (index_version, chunk_count)
│
├── deploy/                        # Production deployment
│   ├── Dockerfile                 #   Multi-stage Python 3.11 build
│   ├── docker-compose.yml         #   Postgres + API + Caddy orchestration
│   ├── Caddyfile                  #   Reverse proxy config (IP or domain mode)
│   ├── hetzner-setup.sh           #   One-click Hetzner VPS provisioning
│   ├── DEPLOY.md                  #   Full deployment guide
│   └── SERVER_OPS.md              #   Health, indexing, logs, SQL commands
│
├── .github/workflows/             # CI/CD
│   └── deploy.yml                 #   Build Docker image → GHCR → deploy
│
├── requirements.txt               # Python dependencies
├── codewalk.yaml                  # Per-repo config (branches, excludes)
├── env.example.txt                # Environment variable index
├── env.server.example.txt         # Hetzner server .env template
├── env.local.example.txt          # Local dev .env template
├── FULL_SETUP_GUIDE.md            # Complete cloud + MCP setup (step by step)
├── .env                           # Configuration (gitignored)
└── .vscode/mcp.json               # MCP server config

🔧 Environment Variables

Variable	Default	Description
`LLM_PROVIDER`	`ollama`	LLM backend: `ollama`, `openai`, `anthropic`, `gemini`, `groq`, `openrouter`, `deepseek`
`LLM_MODEL`	`qwen3.5:27b`	Model name (must match provider)
`EMBEDDING_MODEL`	`jinaai/jina-code-embeddings-1.5b`	Sentence-transformer model for code embeddings
`GROQ_API_KEY`	—	Groq API key
`OPENAI_API_KEY`	—	OpenAI API key
`ANTHROPIC_API_KEY`	—	Anthropic API key
`GOOGLE_API_KEY`	—	Google Gemini API key
`OPENROUTER_API_KEY`	—	OpenRouter API key
`DEEPSEEK_API_KEY`	—	DeepSeek API key
`POSTGRES_PASSWORD`	—	Postgres password (for Docker/server deployment)
`CORS_ORIGINS`	`*`	Comma-separated allowed origins (e.g. `https://yourdomain.com`)
`RATE_LIMIT_REQUESTS`	`60`	Max requests per IP per window
`RATE_LIMIT_WINDOW`	`60`	Rate limit window in seconds
`INDEX_STORAGE_PATH`	`/var/codewalk`	Path for ChromaDB/DuckDB data (Docker default)
`GITHUB_APP_ID`	—	GitHub App ID (server cloud mode)
`GITHUB_APP_PRIVATE_KEY_PATH`	—	PEM path inside container, e.g. `/var/codewalk/secrets/key.pem`
`GITHUB_WEBHOOK_SECRET`	—	Must match GitHub App webhook secret
`ADMIN_API_KEY`	—	`X-Admin-Key` header for `/admin/*` routes
`CODEWALK_SERVER_URL`	—	MCP: cloud API URL, e.g. `https://api.codewalk.xyz`
`CODEWALK_REPO_NAME`	—	MCP: `owner/repo` slug
`CODEWALK_REPO_TOKEN`	—	MCP: per-repo download token (`cw_repo_...`)

🤖 Supported LLM Providers

Provider	Set `LLM_PROVIDER=`	API Key	Notes
Ollama	`ollama`	None	Fully local, no internet. Run `ollama serve` first
OpenAI	`openai`	`OPENAI_API_KEY`	GPT models
Anthropic	`anthropic`	`ANTHROPIC_API_KEY`	Claude models
Google Gemini	`gemini`	`GOOGLE_API_KEY`	Gemini models
Groq	`groq`	`GROQ_API_KEY`	Models hosted on Groq
OpenRouter	`openrouter`	`OPENROUTER_API_KEY`	Access to 100+ models
DeepSeek	`deepseek`	`DEEPSEEK_API_KEY`	DeepSeek V3, R1 models
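The provider table can be read as a lookup from `LLM_PROVIDER` to the API key it needs. The sketch below validates that pairing before any LLM client is built. `resolve_llm_settings` is a hypothetical helper written for this example; it is not the `get_llm()` factory in `config.py`, whose internals are not shown in this README.

```python
import os

# Provider -> required API key variable, copied from the table above (None = no key)
REQUIRED_KEY = {
    "ollama": None,
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def resolve_llm_settings(env=None):
    """Validate LLM_PROVIDER / LLM_MODEL and the matching API key (hypothetical helper)."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "ollama").strip().lower()
    if provider not in REQUIRED_KEY:
        raise ValueError(
            f"unknown LLM_PROVIDER {provider!r}; expected one of {sorted(REQUIRED_KEY)}"
        )
    key_name = REQUIRED_KEY[provider]
    if key_name and not env.get(key_name):
        raise ValueError(f"LLM_PROVIDER={provider} requires {key_name} to be set")
    return {
        "provider": provider,
        "model": env.get("LLM_MODEL", "qwen3.5:27b"),
        "api_key_var": key_name,
    }
```

Failing fast here gives a clearer error than a provider SDK raising an authentication error on the first chat request.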

🧹 Clearing the Index (Reset ChromaDB)

To wipe the embedded index and start fresh, delete the .codewalk/chroma/ directory inside the target repo. The DuckDB graph (.codewalk/graph.duckdb) and the other files in .codewalk/ are left in place:

# From the target repo root:
rm -rf .codewalk/chroma/

This removes all embedded chunks and collections. Next time you run codewalk_analyze_codebase (MCP) or POST /analyze (API), it will re-index from scratch.

When to do this:

You switched to a different repo and want a clean index
Embeddings seem stale or corrupted
You changed the embedding model and need to re-embed everything
You want to use index_mode: "full" but it's still picking up old data
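For scripts or platforms without `rm`, the same reset can be done from Python. This is a small sketch, not a Codewalk command: `clear_vector_index` is a hypothetical helper that removes only the `.codewalk/chroma/` directory described above.

```python
import shutil
from pathlib import Path


def clear_vector_index(repo_root: str) -> bool:
    """Delete <repo_root>/.codewalk/chroma/ if present; return True if it was removed."""
    chroma_dir = Path(repo_root) / ".codewalk" / "chroma"
    if not chroma_dir.is_dir():
        return False
    # Only the ChromaDB directory is removed; graph.duckdb and manifest.json stay
    shutil.rmtree(chroma_dir)
    return True
```

Stop any running Codewalk server or MCP process for that repo first, so nothing holds the ChromaDB files open while they are deleted.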

Adding `.codewalk/` to `.gitignore`

Codewalk stores its index data inside each target repo at .codewalk/ (ChromaDB embeddings, DuckDB graph, version metadata). This directory should not be committed to version control.

Add this to each target repo's .gitignore:

# Codewalk index (auto-generated)
.codewalk/

This is only needed in the target repo you're analyzing, not in the codewalk repo itself.

🛠️ Tech Stack

Layer	Technology
Backend	Python 3.10+, FastAPI, Uvicorn
Agent	LangGraph, LangChain
Vector DB	ChromaDB (persistent, per-repo at `.codewalk/chroma/`)
Graph DB	DuckDB (persistent, per-repo at `.codewalk/graph.duckdb`) — Why DuckDB over SQLite?
Graph Runtime	igraph (C-speed traversal, in-memory from DuckDB)
Voice STT	faster-whisper (local, small model, int8)
Voice TTS	edge-tts (free, en-US-AriaNeural)
Voice Router	User's configured LLM (via get_llm())
Embeddings	Jina Code Embeddings 1.5B (768-dim, Ollama/MPS)
Code Parsing	Tree-sitter (15+ language grammars)
Frontend	Next.js 14, React 18, TypeScript 5
Styling	Tailwind CSS, shadcn/ui
Diagrams	Mermaid.js
MCP	Model Context Protocol (stdio transport)

⚠️ Known Limitations

Single-repo state (no concurrent multi-repo)

Codewalk holds one repo's state in memory at a time (vector store, dependency graph, module map, repo path). Using more than one repo in the same process can overwrite and corrupt that state: index files, cached graph handles, and the active repo path can end up pointing at the wrong workspace.

Interface	Multi-repo behavior
MCP (stdio)	✅ *Safe per connection.* Each MCP connection spawns a separate Python process, so two repos in two editor windows are isolated. However, do not point the same running MCP server at multiple repos or switch workspaces rapidly within the same process — that will corrupt in-memory state.
FastAPI (REST)	⚠️ Not safe. Two concurrent `/analyze` calls for different repos will race — whoever finishes last overwrites the shared globals. Only one repo at a time.
Web UI	⚠️ Same as REST. The browser hits the FastAPI backend. Analyze one repo, explore it, then analyze another. Don't run two analyses in parallel from different browser tabs.

This is by design, not a bug. Codewalk is optimized for the common case: one developer, one repo at a time. If you need concurrent multi-repo support on the API side, it would require a dict[repo_path, SessionState] architecture — contributions welcome.
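As a starting point for contributors, the `dict[repo_path, SessionState]` idea could look roughly like the sketch below. Everything here is an assumption: the README names only the mapping, so the `SessionState` fields and the `SessionRegistry` class are hypothetical and chosen to mirror the state listed above.

```python
import threading
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SessionState:
    """Hypothetical per-repo state; the real fields are not specified in this README."""
    repo_path: str
    modules: dict = field(default_factory=dict)
    dependency_graph: dict = field(default_factory=dict)
    # Serializes analysis runs for this repo only; other repos are not blocked
    lock: threading.Lock = field(default_factory=threading.Lock)


class SessionRegistry:
    """Maps a normalized repo path to its own SessionState instead of shared globals."""

    def __init__(self):
        self._sessions: dict[str, SessionState] = {}
        self._guard = threading.Lock()

    def get(self, repo_path: str) -> SessionState:
        key = str(Path(repo_path).resolve())
        with self._guard:
            if key not in self._sessions:
                self._sessions[key] = SessionState(repo_path=key)
            return self._sessions[key]
```

Each request handler would then look up its session by repo path and hold `session.lock` while analyzing, so two `/analyze` calls for different repos no longer race on the same objects.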

MCP users: You're already safe as long as each repo gets its own MCP server process (one VS Code window / Claude Code session / Cursor instance per repo). Do not route commands for different repos into the same stdio connection or reuse one server process across workspaces.

🤝 Contributing

Fork this repo
Clone your fork: git clone https://github.com/<your-username>/codewalk.git
Create a branch: git checkout -b feat/my-feature
Make your changes and test them
Commit: git commit -m "feat: add my feature"
Push: git push origin feat/my-feature
Open a Pull Request against master

All contributions welcome — bug fixes, new language support, UI improvements, docs, anything.

Found a bug? Open an issue with screenshots, error logs, or references — it helps us fix it faster.

📜 License

MIT

⭐ If you find Codewalk useful, give it a star — it helps others discover it!

Built by gupta29470
LinkedIn · Twitter/X
