Merged
2 changes: 1 addition & 1 deletion .github/workflows/ci-tests.yml
@@ -43,7 +43,7 @@ jobs:
enable-cache: true

- name: Install
-run: uv pip install --system -e ".[dev,server,ui,openai]"
+run: uv pip install --system -e ".[dev,server,ui,openai,mcp]"

- name: Lint (ruff)
run: ruff check .
2 changes: 1 addition & 1 deletion AGENTS.md
@@ -8,7 +8,7 @@
- `coderag/store/`: `sqlite_store.py` (source of truth + FTS5) and `vector_index.py` (FAISS Flat/IVF cache).
- `coderag/retrieval/`: Hybrid dense + BM25 search fused with RRF.
- `coderag/indexer.py`, `coderag/watch.py`: Incremental indexing and the debounced watcher.
-- `coderag/surfaces/`: `cli.py`, `http_api.py` (FastAPI), `webui.py` — thin adapters over the facade.
+- `coderag/surfaces/`: `cli.py`, `http_api.py` (FastAPI), `webui.py`, `mcp_server.py` (MCP, for AI agents) — thin adapters over the facade.
- `tests/`: pytest suite (offline by default via the `fake` provider; real model behind `-m integration`).
- `example.env` → copy to `.env`; CI lives in `.github/`.

9 changes: 7 additions & 2 deletions DEVELOPMENT.md
@@ -30,7 +30,7 @@ coderag/
│ ├── sqlite_store.py # files/chunks/vectors + FTS5 lexical search
│ └── vector_index.py # FaissVectorIndex: Flat (exact) / IVF (scale)
├── retrieval/ # Hybrid search: dense + BM25, fused with RRF
-└── surfaces/ # cli.py · http_api.py (FastAPI) · webui.py
+└── surfaces/ # cli.py · http_api.py (FastAPI) · webui.py · mcp_server.py (MCP)
```

### Design invariants (don't break these)
@@ -41,9 +41,14 @@ coderag/
- **`chunks.id` is the FAISS id and is `AUTOINCREMENT`** — ids are never reused, which keeps
a stale cache from resurrecting deleted content.
- **Delete-before-add.** A changed file's old chunks are removed from both SQLite and FAISS
-  before new ones are added (`Indexer._index_file`). This is the bug the old `monitor.py` had.
+  before new ones are added (`Indexer._write`). This is the bug the old `monitor.py` had.
- **The embedding dimension comes from the provider**, never a hard-coded constant. A model
change is detected via `meta.embed_dim` and triggers a clean rebuild.
- **Writes serialize; reads don't block.** All indexing/deletion goes through one lock on the
`CodeRAG` facade (`_index_lock`), and `FaissVectorIndex` guards its own add/remove/search/
rebuild — so the MCP server's background index and live watcher run safely alongside
concurrent agent searches. Indexing may parallelize chunk+embed across `index_workers`
threads, but the SQLite/FAISS writes stay single-writer (`Indexer._write`).
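
The write-serialization and delete-before-add invariants can be pictured with a toy model. This is a minimal sketch, not CodeRAG's code: `TinyIndex` and its in-memory dict are hypothetical stand-ins for the SQLite store and the FAISS index, and only the names `_index_lock` and `_write` are borrowed from the notes above. It assumes the preparation step (chunk + embed in the real engine) is pure, so it can run on a thread pool while every mutation stays in one single-writer section.

```python
from __future__ import annotations

import threading
from concurrent.futures import ThreadPoolExecutor


class TinyIndex:
    """Toy stand-in for the store + vector index pair (hypothetical)."""

    def __init__(self, workers: int = 4) -> None:
        self._index_lock = threading.Lock()  # one indexing run at a time
        self._workers = workers
        self._next_id = 1  # ids only grow, so a deleted id is never reused
        self.chunks: dict[int, tuple[str, str]] = {}  # id -> (path, text)

    @staticmethod
    def _prepare(item: tuple[str, str]) -> tuple[str, list[str]]:
        # Stand-in for chunk + embed: pure work, safe to run in parallel.
        path, source = item
        return path, [line for line in source.splitlines() if line.strip()]

    def _write(self, path: str, pieces: list[str]) -> None:
        # Delete-before-add: drop the file's old chunks, then add the new ones.
        stale = [cid for cid, (p, _) in self.chunks.items() if p == path]
        for cid in stale:
            del self.chunks[cid]
        for text in pieces:
            self.chunks[self._next_id] = (path, text)
            self._next_id += 1

    def index(self, files: dict[str, str]) -> None:
        with self._index_lock:
            with ThreadPoolExecutor(max_workers=self._workers) as pool:
                prepared = list(pool.map(self._prepare, files.items()))
            for path, pieces in prepared:  # single-writer section
                self._write(path, pieces)


idx = TinyIndex(workers=2)
idx.index({"a.py": "x = 1\ny = 2\n"})
idx.index({"a.py": "x = 3\n"})  # re-index the changed file
print(idx.chunks)  # {3: ('a.py', 'x = 3')}
```

After the second run only id 3 remains: ids 1 and 2 were deleted before the new chunk was added, and neither id was handed out again.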

## Quality gate

101 changes: 95 additions & 6 deletions README.md
@@ -21,12 +21,40 @@ files that matter — ranked by meaning, not just string match.
It runs **entirely on your machine with no API key** (a local ONNX embedding model is the
default), keeps its index **up to date as you edit**, and is built to stay fast on **large
codebases**. Use it from the **CLI**, embed it as a **Python library**, self-host it as an
-**HTTP service**, or browse with the **web UI**.
+**HTTP service**, browse with the **web UI**, or plug it into an **AI coding agent over MCP**
+so it searches a warm index instead of grepping.

> Built for the cases off-the-shelf IDE assistants don't cover well: a codebase that's too
> big, too private, or too custom — or a search/RAG capability you want to own and embed in
> your own tools.

## ⚡ Find the right code in one call — not a grep loop

Coding agents like Claude Code and Codex locate code by *running searches* — grep, glob, read,
repeat — which burns tokens and round-trips and reduces to literal keyword matching. CodeRAG
turns the workspace into a **warm, pre-indexed** engine: a single query returns the right
functions and files ranked by **meaning *and* keyword**, with exact `path:line` citations. The
embedding model loads once, so each query is one in-process lookup (FAISS + BM25 + fusion), not
a multi-round shell loop — and over MCP (`coderag mcp`, below) it becomes the agent's search tool.

**Proof from the eval harness** — this repo's 24 natural-language → file queries (90 files /
553 chunks), local `bge-small`, one warm query each (reproduce with `coderag eval --compare
--dataset coderag/eval/datasets/coderag_self.jsonl`):

| retrieval | MRR | R@1 | R@5 | Hit@10 |
| --- | :---: | :---: | :---: | :---: |
| BM25 — ranked keyword search (already stronger than raw grep) | 0.751 | 0.604 | 0.854 | 1.000 |
| dense — semantic only | 0.784 | 0.604 | 0.938 | 1.000 |
| **hybrid — CodeRAG's default** | **0.822** | **0.688** | **1.000** | **1.000** |

Hybrid puts a relevant file in the **top-5 for every query** and ranks it **#1 ≈69%** of the
time — beating the ranked-keyword search a grep-based agent leans on (raw grep is weaker still:
unranked literal match) by adding semantic understanding on top. To measure the *latency* and
*token-cost* gap against an actual grep loop on your own repo, run
[`scripts/bench_vs_grep.py`](scripts/bench_vs_grep.py). The fuller story — symbol-level
localization, the reranker (**+55% R@1** where there's headroom), multi-repo generalization,
and the honest caveats — is in [`docs/eval.md`](docs/eval.md).
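
For readers unfamiliar with the table's columns, here is a minimal sketch of the usual textbook definitions of MRR, recall@k, and hit@k. It is an illustration only: the function names and the two made-up queries are hypothetical, and the eval harness's own implementation may differ in details such as tie handling.

```python
from __future__ import annotations


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, path in enumerate(ranked, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant files that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant file is in the top k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0


# Two made-up queries: (ranked results, set of relevant files).
runs = [
    (["a.py", "b.py", "c.py"], {"a.py"}),
    (["x.py", "y.py", "z.py"], {"y.py", "z.py"}),
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
r_at_1 = sum(recall_at_k(r, rel, 1) for r, rel in runs) / len(runs)
hit_at_3 = sum(hit_at_k(r, rel, 3) for r, rel in runs) / len(runs)
print(mrr, r_at_1, hit_at_3)  # 0.75 0.5 1.0
```

Under these definitions, recall@k is fractional when a query has several relevant files, while hit@k only asks whether any of them made the cut.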

---

## ✨ Highlights
@@ -35,10 +63,11 @@ codebases**. Use it from the **CLI**, embed it as a **Python library**, self-hos
- **Bring your own model platform.** Built for self-hosted and local models first (any OpenAI-compatible server — Ollama, vLLM, LM Studio, LocalAI), with first-class **OpenAI API** and **Anthropic API** support when you want it.
- **Symbol-aware chunking.** Indexes *functions, classes, and methods* (Python via `ast`; JS/TS/Go/Rust/Java via tree-sitter), not crude fixed-size blocks — so results point at real code units with `file:line` citations.
- **Hybrid retrieval, with optional reranking.** Dense vector search **+** BM25 keyword search, fused with Reciprocal Rank Fusion — great at both "what does this *mean*" and exact-identifier lookups. Add an optional local **cross-encoder reranker** (two-stage retrieve-then-rerank, `CODERAG_RERANK=1`, no API key) to sharpen the top results.
- **Drop-in for AI coding agents.** An **MCP server** (`coderag mcp`) lets Claude Code, Codex, and Cursor search a warm, pre-indexed workspace instead of slow grep/glob/read loops — ranked `path:line` results from a single call, with the index kept live as you edit. Works on a plain file directory too, not just code.
- **Measured, not guessed.** A built-in **evaluation harness** (`coderag eval`) scores retrieval quality — recall@k, MRR, nDCG@k at file *or* symbol level — and can mine a benchmark straight from your git history. Every default (1:1 hybrid, reranker opt-in, adaptive fusion off) is the choice the harness validated, including across an external repo.
- **Incremental & live.** Content-hashed indexing only re-embeds files that changed; a debounced watcher keeps the index current as you code. No duplicate or stale vectors.
- **Built to scale.** Exact `Flat` search for small repos, automatic switch to approximate `IVF` past a threshold so it stays fast at 100k+ chunks.
-- **Four surfaces, one engine.** CLI · Python library · HTTP/REST · web UI — all thin wrappers over the same `CodeRAG` object.
+- **Five surfaces, one engine.** CLI · Python library · HTTP/REST · web UI · MCP server — all thin wrappers over the same `CodeRAG` object.
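
The hybrid-retrieval bullet above names Reciprocal Rank Fusion. As a rough sketch of that technique with equal (1:1) weights: each ranked list contributes `1 / (k + rank)` to a document's score, and the fused order sorts by the summed score. The function name `rrf_fuse` and the sample lists are hypothetical, and `k = 60` is the constant conventionally used for RRF, not a value confirmed for CodeRAG.

```python
from __future__ import annotations


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists with Reciprocal Rank Fusion (equal weights)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by name for a stable order.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


dense = ["a.py", "b.py", "c.py"]  # semantic ranking (made up)
bm25 = ["b.py", "d.py", "a.py"]   # keyword ranking (made up)
print(rrf_fuse([dense, bm25]))    # ['b.py', 'a.py', 'd.py', 'c.py']
```

`b.py` wins because it ranks well in both lists, even though neither list agrees on what comes first. Because RRF uses only ranks, the dense and BM25 scores never have to be put on a common scale.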

## 🚀 Quick start

@@ -47,6 +76,7 @@ pip install -e . # core engine (local embeddings included)
# optional extras:
pip install -e ".[server]" # HTTP/REST API
pip install -e ".[ui]" # built-in web UI (FastAPI + Jinja + Pygments)
pip install -e ".[mcp]" # MCP server for AI coding agents (Claude Code, Codex, Cursor)
pip install -e ".[openai]" # OpenAI (or self-hosted OpenAI-compatible) embeddings / answers
pip install -e ".[anthropic]" # Anthropic (Claude) LLM answers
pip install -e ".[all]" # everything above
@@ -69,7 +99,7 @@ coderag search "where are duplicate vectors removed on file change" --watched-di
By default the index lives in `./.coderag/`. Set `CODERAG_WATCHED_DIR` / `CODERAG_STORE_DIR`
(or copy `example.env` to `.env`) to avoid repeating flags.

-## 🧑‍💻 The four surfaces
+## 🧑‍💻 The surfaces

### CLI

@@ -79,6 +109,7 @@ coderag search "QUERY" [-k 8] # hybrid search; add --json or --answer
coderag watch # index, then keep it live as files change
coderag serve --port 8000 # run the HTTP API (needs [server])
coderag ui # launch the web UI (needs [ui])
coderag mcp # MCP server for AI agents (needs [mcp]); --all-text for any dir
coderag status # index stats (files, chunks, model, index type)
coderag eval --dataset d.jsonl --compare # retrieval quality: dense vs BM25 vs hybrid
```
@@ -131,6 +162,57 @@ browser**, index status, a one-click **Reindex**, and an optional streamed LLM a
enhanced — every page works with JavaScript disabled, and there's no CDN/runtime network
dependency, so it stays local-first.

### MCP — let an AI coding agent search instead of grepping (`coderag mcp`)

Tools like Claude Code and Codex locate code with iterative `grep`/`glob`/read loops. CodeRAG
exposes the same workspace as a **Model Context Protocol** server, so an agent gets fast,
ranked `path:line` results from a single call against a **warm, pre-indexed** workspace — the
embedding model loads once and every query is then one in-process lookup (FAISS + BM25 +
fusion), not a multi-round shell search.

```bash
pip install -e ".[mcp]"
coderag mcp # index the current dir, keep it live, serve over stdio
coderag mcp --all-text # index ALL text files (docs/notes/config), not just code
```

It auto-indexes the working directory on startup (in the **background**, so it's responsive
immediately) and keeps the index live with the watcher — zero manual steps. Tools exposed:
**`search_code`** (hybrid search, compact snippets + `path:line`), **`get_file`** (read a
precise range of an indexed file), **`index_status`** (coverage/freshness), and **`reindex`**.

Wire it into an agent (the server defaults to the directory it's launched in):

```bash
# Claude Code
claude mcp add coderag -- coderag mcp
```

```jsonc
// Cursor: .cursor/mcp.json — or Claude Code: .mcp.json (at the repo root)
{ "mcpServers": { "coderag": { "command": "coderag", "args": ["mcp"] } } }
```

```toml
# Codex: ~/.codex/config.toml
[mcp_servers.coderag]
command = "coderag"
args = ["mcp"]
```

> If `coderag` isn't on the launcher's PATH, use an absolute path (or `python -m coderag.surfaces.cli mcp`).
> To index a directory other than where the client launches, add `"--watched-dir", "/abs/path"` to `args`.
> Fast by default (local `bge-small`, no reranker); set `CODERAG_RERANK=1` to trade ~30 ms/query for sharper top results.

**Why bother? Measure it.** [`scripts/bench_vs_grep.py`](scripts/bench_vs_grep.py) scores
indexed search against a raw grep baseline on the same eval dataset — accuracy
(recall@k / nDCG@k / MRR via the eval harness), latency per query, and approximate context
tokens (compact chunks vs reading whole files):

```bash
python scripts/bench_vs_grep.py --watched-dir . --dataset coderag/eval/datasets/coderag_self.jsonl
```

## 🐳 Docker (beta)

Prebuilt **multi-arch** images (`linux/amd64` + `linux/arm64`) are published to GHCR on
@@ -228,18 +310,25 @@ Everything is configurable via `CODERAG_*` environment variables or a `.env` fil
| `CODERAG_ANTHROPIC_MODEL` | `claude-opus-4-8` | Anthropic chat model for answers |
| `CODERAG_API_KEY` | – | If set, the HTTP API **requires** it (`Authorization: Bearer <key>` or `X-API-Key`). Set whenever the server is reachable beyond localhost. |
| `CODERAG_CORS_ORIGINS` | – | Comma-separated CORS allowlist for the HTTP API (never `*`). Empty ⇒ no cross-origin browser access. |
| `CODERAG_WORKERS` | `4` | Worker threads for chunking + embedding during indexing (`1` = serial). |
| `CODERAG_INDEX_ALL_TEXT` | `false` | Index any UTF-8 text file (docs/config/extensionless), not just code — turns a plain directory into a searchable workspace. Binary files are always skipped. |
| `CODERAG_MCP_AUTO_INDEX` | `true` | MCP server indexes the watched dir on startup (in the background). |
| `CODERAG_MCP_WATCH` | `true` | MCP server keeps the index live via the filesystem watcher. |
| `CODERAG_MCP_SNIPPET_LINES` | `12` | Lines of a chunk returned in a `search_code` snippet by default. |
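
As an illustration of how boolean and integer settings like the ones in this table could be read, here is a minimal sketch. The helpers `env_bool` and `env_int` are hypothetical (the project has its own `_env_bool` / `_env_int`, whose accepted spellings may differ). The only value taken from this README is that `1` turns a flag on; the other spellings and the fall-back-to-default behavior are assumptions.

```python
from __future__ import annotations

import os

_TRUE = {"1", "true", "yes", "on"}    # assumed truthy spellings
_FALSE = {"0", "false", "no", "off"}  # assumed falsy spellings


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean env var; unset or unrecognized values keep the default."""
    raw = os.getenv(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    return default


def env_int(name: str, default: int) -> int:
    """Read an integer env var; unset or malformed values keep the default."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        return int(raw.strip())
    except ValueError:
        return default


os.environ["CODERAG_MCP_WATCH"] = "0"
os.environ["CODERAG_MCP_SNIPPET_LINES"] = "20"
print(env_bool("CODERAG_MCP_WATCH", True))       # False
print(env_int("CODERAG_MCP_SNIPPET_LINES", 12))  # 20
```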

## 🧩 Supported languages

Symbol-aware (function/class/method level): **Python, JavaScript, TypeScript/TSX, Go, Rust,
-Java**. Many other languages and docs (C/C++, Ruby, PHP, Markdown, YAML, …) are indexed with
-a line-window fallback, so they remain searchable.
+Java**. Many other languages and docs (C/C++, Ruby, PHP, Markdown, YAML, HTML/CSS, …) are
+indexed with a line-window fallback, so they remain searchable. Set `CODERAG_INDEX_ALL_TEXT=1`
+(or `coderag mcp --all-text`) to index **any** UTF-8 text file — including extensionless ones
+like `Dockerfile` — so a plain document/notes directory becomes searchable too, not just code.
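
Indexing "any UTF-8 text file" relies on rejecting binary content, which the configuration notes describe as a NUL-byte sniff. A minimal sketch of that idea follows. The function names and the 8 KiB sniff window are assumptions for illustration, not the indexer's actual code.

```python
from __future__ import annotations


def looks_binary(raw: bytes, sniff_bytes: int = 8192) -> bool:
    """Treat content as binary if a NUL byte appears near the start."""
    return b"\x00" in raw[:sniff_bytes]


def decode_text(raw: bytes) -> str | None:
    """Return the UTF-8 text, or None when the content should be skipped."""
    if looks_binary(raw):
        return None
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(decode_text(b"FROM python:3.12\n") is not None)        # True
print(decode_text(b"\x89PNG\r\n\x1a\n\x00\x00IHDR") is None)  # True
```

The decode step catches files that contain no NUL byte but are still not valid UTF-8, so both checks are needed before a file is chunked as plain text.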

## 🛠️ Development

```bash
python -m venv venv && source venv/bin/activate
-pip install -e ".[dev,server,openai]"
+pip install -e ".[dev,server,ui,mcp,openai]"

pytest -m "not integration" # fast, offline (uses a deterministic fake embedder)
pytest -m integration # exercises the real local model (downloads once)
17 changes: 12 additions & 5 deletions coderag/api.py
@@ -8,6 +8,7 @@
from __future__ import annotations

import logging
import threading
from pathlib import Path
from typing import TYPE_CHECKING, List, Optional, Union

@@ -38,6 +39,10 @@ def __init__(self, config: Optional[Config] = None) -> None:
# Set when the store's embedding model/dim changed and the FAISS cache must
# be rebuilt from scratch (consumed when the vector index is first opened).
self._rebuild_required: bool = False
# Serializes all indexing/deletion so concurrent writers (the CLI, the HTTP
# surface, the MCP server's background index, and the live watcher) can't
# interleave a file's delete-before-add sequence. Reads (search) are unaffected.
self._index_lock = threading.Lock()

# --- lazily constructed collaborators ---

@@ -116,7 +121,8 @@ def index(
force a clean rebuild.
"""
target = Path(path).expanduser() if path else self.config.watched_dir
-return self.indexer.index(target, full=full)
+with self._index_lock:
+    return self.indexer.index(target, full=full)

def search(self, query: str, top_k: Optional[int] = None) -> List[SearchHit]:
"""Hybrid (dense + lexical) search over the indexed codebase."""
@@ -161,10 +167,11 @@ def delete_path(self, path: Union[str, Path]) -> int:
rel = Path(path).resolve().relative_to(root).as_posix()
except ValueError:
return 0
-removed = self.store.delete_file(rel)
-if removed:
-    self.vectors.remove(removed)
-    self.vectors.save()
+with self._index_lock:
+    removed = self.store.delete_file(rel)
+    if removed:
+        self.vectors.remove(removed)
+        self.vectors.save()
return len(removed)

def status(self) -> dict:
43 changes: 40 additions & 3 deletions coderag/chunking/languages.py
@@ -49,12 +49,49 @@
".json": "json",
".cfg": "ini",
".ini": "ini",
# Common markup/web/config text — searchable in most repos, line-window chunked.
".xml": "xml",
".html": "html",
".htm": "html",
".css": "css",
".scss": "scss",
".less": "less",
".vue": "vue",
".svelte": "svelte",
".properties": "properties",
".gradle": "gradle",
}

# Well-known text files that have no (or an unconventional) extension. Matched on the
# lowercased file *name* when the extension lookup misses.
FILENAME_TO_LANGUAGE = {
"dockerfile": "dockerfile",
"makefile": "make",
"license": "text",
"notice": "text",
"readme": "text",
"codeowners": "text",
".env": "text",
".gitignore": "text",
".dockerignore": "text",
}


+def detect_language(path: str | Path, *, all_text: bool = False) -> Optional[str]:
+    """Return the language for ``path``, or ``None`` if it should not be indexed.

-def detect_language(path: str | Path) -> Optional[str]:
-    """Return the language for ``path``, or ``None`` if it should not be indexed."""
-    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
+    With ``all_text=True`` any unrecognized file is treated as plain ``"text"`` so a whole
+    directory (docs, notes, config) becomes searchable, not just code. Binary files are
+    still rejected later by the indexer's NUL-byte sniff, so this stays safe.
+    """
+    p = Path(path)
+    lang = EXTENSION_TO_LANGUAGE.get(p.suffix.lower())
+    if lang:
+        return lang
+    lang = FILENAME_TO_LANGUAGE.get(p.name.lower())
+    if lang:
+        return lang
+    return "text" if all_text else None


def extensions_for(languages: Iterable[str]) -> List[str]:
24 changes: 24 additions & 0 deletions coderag/config.py
@@ -117,6 +117,12 @@ class Config:
# --- What to index ---
languages: Tuple[str, ...] = DEFAULT_LANGUAGES
ignore_globs: Tuple[str, ...] = DEFAULT_IGNORE_GLOBS
# Index any UTF-8-decodable file as plain text, even with an unknown/absent extension
# (Dockerfile, Makefile, LICENSE, .log, ...). Off by default so code repos aren't
# polluted; turn on (CODERAG_INDEX_ALL_TEXT / `coderag mcp --all-text`) to make
# CodeRAG a general document/file-directory search engine. Binary files are still
# skipped (NUL-byte sniff in the indexer).
index_all_text: bool = False
max_file_bytes: int = 1_000_000 # skip files larger than this
max_chunk_lines: int = 200 # split oversized symbols into windows above this
window_lines: int = 60 # fallback line-window size
@@ -182,6 +188,18 @@ class Config:
# wildcard lets any website the user visits exfiltrate them via the browser.
cors_origins: Tuple[str, ...] = ()

# --- MCP server surface (optional [mcp] surface) ---
# `coderag mcp` runs a persistent, warm process so the embedding model loads once and
# every query is fast — the win over an agent's cold, repeated grep/read loop. By
# default it indexes the watched dir on startup (in the background, so the server is
# responsive immediately) and keeps it live via the filesystem watcher, so an agent
# gets fresh results with zero manual steps.
mcp_auto_index: bool = True
mcp_watch: bool = True
# Lines of a chunk returned in a search_code snippet by default (the agent can request
# the full text, or fetch a precise range via get_file) — keeps responses token-cheap.
mcp_snippet_lines: int = 12

# --- Demo mode (public, untrusted UI) ---
# When on, the Streamlit UI shows a notice, hides the Reindex button, and limits
# LLM answers per browser session. The per-session limit is soft (session-state
@@ -252,6 +270,12 @@ def from_env(cls, **overrides: object) -> "Config":
),
api_key=os.getenv("CODERAG_API_KEY"),
cors_origins=_env_tuple("CODERAG_CORS_ORIGINS", cls.cors_origins),
index_all_text=_env_bool("CODERAG_INDEX_ALL_TEXT", cls.index_all_text),
mcp_auto_index=_env_bool("CODERAG_MCP_AUTO_INDEX", cls.mcp_auto_index),
mcp_watch=_env_bool("CODERAG_MCP_WATCH", cls.mcp_watch),
mcp_snippet_lines=_env_int(
"CODERAG_MCP_SNIPPET_LINES", cls.mcp_snippet_lines
),
demo_mode=_env_bool("CODERAG_DEMO_MODE", cls.demo_mode),
demo_max_answers=_env_int("CODERAG_DEMO_MAX_ANSWERS", cls.demo_max_answers),
demo_cooldown_seconds=_env_int(