diff --git a/.github/workflows/ci-tests.yml b/.github/workflows/ci-tests.yml index 97bca1d..ddd65ba 100644 --- a/.github/workflows/ci-tests.yml +++ b/.github/workflows/ci-tests.yml @@ -43,7 +43,7 @@ jobs: enable-cache: true - name: Install - run: uv pip install --system -e ".[dev,server,ui,openai]" + run: uv pip install --system -e ".[dev,server,ui,openai,mcp]" - name: Lint (ruff) run: ruff check . diff --git a/AGENTS.md b/AGENTS.md index e5c5fef..bf45334 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -8,7 +8,7 @@ - `coderag/store/`: `sqlite_store.py` (source of truth + FTS5) and `vector_index.py` (FAISS Flat/IVF cache). - `coderag/retrieval/`: Hybrid dense + BM25 search fused with RRF. - `coderag/indexer.py`, `coderag/watch.py`: Incremental indexing and the debounced watcher. -- `coderag/surfaces/`: `cli.py`, `http_api.py` (FastAPI), `webui.py` — thin adapters over the facade. +- `coderag/surfaces/`: `cli.py`, `http_api.py` (FastAPI), `webui.py`, `mcp_server.py` (MCP, for AI agents) — thin adapters over the facade. - `tests/`: pytest suite (offline by default via the `fake` provider; real model behind `-m integration`). - `example.env` → copy to `.env`; CI lives in `.github/`. diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 99cb751..3be3ecc 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -30,7 +30,7 @@ coderag/ │ ├── sqlite_store.py # files/chunks/vectors + FTS5 lexical search │ └── vector_index.py # FaissVectorIndex: Flat (exact) / IVF (scale) ├── retrieval/ # Hybrid search: dense + BM25, fused with RRF -└── surfaces/ # cli.py · http_api.py (FastAPI) · webui.py +└── surfaces/ # cli.py · http_api.py (FastAPI) · webui.py · mcp_server.py (MCP) ``` ### Design invariants (don't break these) @@ -41,9 +41,14 @@ coderag/ - **`chunks.id` is the FAISS id and is `AUTOINCREMENT`** — ids are never reused, which keeps a stale cache from resurrecting deleted content. 
- **Delete-before-add.** A changed file's old chunks are removed from both SQLite and FAISS - before new ones are added (`Indexer._index_file`). This is the bug the old `monitor.py` had. + before new ones are added (`Indexer._write`). This is the bug the old `monitor.py` had. - **The embedding dimension comes from the provider**, never a hard-coded constant. A model change is detected via `meta.embed_dim` and triggers a clean rebuild. +- **Writes serialize; reads don't block.** All indexing/deletion goes through one lock on the + `CodeRAG` facade (`_index_lock`), and `FaissVectorIndex` guards its own add/remove/search/ + rebuild — so the MCP server's background index and live watcher run safely alongside + concurrent agent searches. Indexing may parallelize chunk+embed across `index_workers` + threads, but the SQLite/FAISS writes stay single-writer (`Indexer._write`). ## Quality gate diff --git a/README.md b/README.md index 8056174..ec8e815 100644 --- a/README.md +++ b/README.md @@ -21,12 +21,40 @@ files that matter — ranked by meaning, not just string match. It runs **entirely on your machine with no API key** (a local ONNX embedding model is the default), keeps its index **up to date as you edit**, and is built to stay fast on **large codebases**. Use it from the **CLI**, embed it as a **Python library**, self-host it as an -**HTTP service**, or browse with the **web UI**. +**HTTP service**, browse with the **web UI**, or plug it into an **AI coding agent over MCP** +so it searches a warm index instead of grepping. > Built for the cases off-the-shelf IDE assistants don't cover well: a codebase that's too > big, too private, or too custom — or a search/RAG capability you want to own and embed in > your own tools. +## ⚡ Find the right code in one call — not a grep loop + +Coding agents like Claude Code and Codex locate code by *running searches* — grep, glob, read, +repeat — which burns tokens and round-trips and reduces to literal keyword matching. 
CodeRAG +turns the workspace into a **warm, pre-indexed** engine: a single query returns the right +functions and files ranked by **meaning *and* keyword**, with exact `path:line` citations. The +embedding model loads once, so each query is one in-process lookup (FAISS + BM25 + fusion), not +a multi-round shell loop — and over MCP (`coderag mcp`, below) it becomes the agent's search tool. + +**Proof from the eval harness** — this repo's 24 natural-language → file queries (90 files / +553 chunks), local `bge-small`, one warm query each (reproduce with `coderag eval --compare +--dataset coderag/eval/datasets/coderag_self.jsonl`): + +| retrieval | MRR | R@1 | R@5 | Hit@10 | +| --- | :---: | :---: | :---: | :---: | +| BM25 — ranked keyword search (already stronger than raw grep) | 0.751 | 0.604 | 0.854 | 1.000 | +| dense — semantic only | 0.784 | 0.604 | 0.938 | 1.000 | +| **hybrid — CodeRAG's default** | **0.822** | **0.688** | **1.000** | **1.000** | + +Hybrid puts a relevant file in the **top-5 for every query** and ranks it **#1 ≈69%** of the +time — beating the ranked-keyword search a grep-based agent leans on (raw grep is weaker still: +unranked literal match) by adding semantic understanding on top. To measure the *latency* and +*token-cost* gap against an actual grep loop on your own repo, run +[`scripts/bench_vs_grep.py`](scripts/bench_vs_grep.py). The fuller story — symbol-level +localization, the reranker (**+55% R@1** where there's headroom), multi-repo generalization, +and the honest caveats — is in [`docs/eval.md`](docs/eval.md). + --- ## ✨ Highlights @@ -35,10 +63,11 @@ codebases**. Use it from the **CLI**, embed it as a **Python library**, self-hos - **Bring your own model platform.** Built for self-hosted and local models first (any OpenAI-compatible server — Ollama, vLLM, LM Studio, LocalAI), with first-class **OpenAI API** and **Anthropic API** support when you want it. 
- **Symbol-aware chunking.** Indexes *functions, classes, and methods* (Python via `ast`; JS/TS/Go/Rust/Java via tree-sitter), not crude fixed-size blocks — so results point at real code units with `file:line` citations. - **Hybrid retrieval, with optional reranking.** Dense vector search **+** BM25 keyword search, fused with Reciprocal Rank Fusion — great at both "what does this *mean*" and exact-identifier lookups. Add an optional local **cross-encoder reranker** (two-stage retrieve-then-rerank, `CODERAG_RERANK=1`, no API key) to sharpen the top results. +- **Drop-in for AI coding agents.** An **MCP server** (`coderag mcp`) lets Claude Code, Codex, and Cursor search a warm, pre-indexed workspace instead of slow grep/glob/read loops — ranked `path:line` results from a single call, with the index kept live as you edit. Works on a plain file directory too, not just code. - **Measured, not guessed.** A built-in **evaluation harness** (`coderag eval`) scores retrieval quality — recall@k, MRR, nDCG@k at file *or* symbol level — and can mine a benchmark straight from your git history. Every default (1:1 hybrid, reranker opt-in, adaptive fusion off) is the choice the harness validated, including across an external repo. - **Incremental & live.** Content-hashed indexing only re-embeds files that changed; a debounced watcher keeps the index current as you code. No duplicate or stale vectors. - **Built to scale.** Exact `Flat` search for small repos, automatic switch to approximate `IVF` past a threshold so it stays fast at 100k+ chunks. -- **Four surfaces, one engine.** CLI · Python library · HTTP/REST · web UI — all thin wrappers over the same `CodeRAG` object. +- **Five surfaces, one engine.** CLI · Python library · HTTP/REST · web UI · MCP server — all thin wrappers over the same `CodeRAG` object. ## 🚀 Quick start @@ -47,6 +76,7 @@ pip install -e . 
# core engine (local embeddings included) # optional extras: pip install -e ".[server]" # HTTP/REST API pip install -e ".[ui]" # built-in web UI (FastAPI + Jinja + Pygments) +pip install -e ".[mcp]" # MCP server for AI coding agents (Claude Code, Codex, Cursor) pip install -e ".[openai]" # OpenAI (or self-hosted OpenAI-compatible) embeddings / answers pip install -e ".[anthropic]" # Anthropic (Claude) LLM answers pip install -e ".[all]" # everything above @@ -69,7 +99,7 @@ coderag search "where are duplicate vectors removed on file change" --watched-di By default the index lives in `./.coderag/`. Set `CODERAG_WATCHED_DIR` / `CODERAG_STORE_DIR` (or copy `example.env` to `.env`) to avoid repeating flags. -## 🧑‍💻 The four surfaces +## 🧑‍💻 The surfaces ### CLI @@ -79,6 +109,7 @@ coderag search "QUERY" [-k 8] # hybrid search; add --json or --answer coderag watch # index, then keep it live as files change coderag serve --port 8000 # run the HTTP API (needs [server]) coderag ui # launch the web UI (needs [ui]) +coderag mcp # MCP server for AI agents (needs [mcp]); --all-text for any dir coderag status # index stats (files, chunks, model, index type) coderag eval --dataset d.jsonl --compare # retrieval quality: dense vs BM25 vs hybrid ``` @@ -131,6 +162,57 @@ browser**, index status, a one-click **Reindex**, and an optional streamed LLM a enhanced — every page works with JavaScript disabled, and there's no CDN/runtime network dependency, so it stays local-first. +### MCP — let an AI coding agent search instead of grepping (`coderag mcp`) + +Tools like Claude Code and Codex locate code with iterative `grep`/`glob`/read loops. CodeRAG +exposes the same workspace as a **Model Context Protocol** server, so an agent gets fast, +ranked `path:line` results from a single call against a **warm, pre-indexed** workspace — the +embedding model loads once and every query is then one in-process lookup (FAISS + BM25 + +fusion), not a multi-round shell search. 
+ +```bash +pip install -e ".[mcp]" +coderag mcp # index the current dir, keep it live, serve over stdio +coderag mcp --all-text # index ALL text files (docs/notes/config), not just code +``` + +It auto-indexes the working directory on startup (in the **background**, so it's responsive +immediately) and keeps the index live with the watcher — zero manual steps. Tools exposed: +**`search_code`** (hybrid search, compact snippets + `path:line`), **`get_file`** (read a +precise range of an indexed file), **`index_status`** (coverage/freshness), and **`reindex`**. + +Wire it into an agent (the server defaults to the directory it's launched in): + +```bash +# Claude Code +claude mcp add coderag -- coderag mcp +``` + +```jsonc +// Cursor: .cursor/mcp.json — or Claude Code: .mcp.json (at the repo root) +{ "mcpServers": { "coderag": { "command": "coderag", "args": ["mcp"] } } } +``` + +```toml +# Codex: ~/.codex/config.toml +[mcp_servers.coderag] +command = "coderag" +args = ["mcp"] +``` + +> If `coderag` isn't on the launcher's PATH, use an absolute path (or `python -m coderag.surfaces.cli mcp`). +> To index a directory other than where the client launches, add `"--watched-dir", "/abs/path"` to `args`. +> Fast by default (local `bge-small`, no reranker); set `CODERAG_RERANK=1` to trade ~30 ms/query for sharper top results. + +**Why bother? Measure it.** [`scripts/bench_vs_grep.py`](scripts/bench_vs_grep.py) scores +indexed search against a raw grep baseline on the same eval dataset — accuracy +(recall@k / nDCG@k / MRR via the eval harness), latency per query, and approximate context +tokens (compact chunks vs reading whole files): + +```bash +python scripts/bench_vs_grep.py --watched-dir . 
--dataset coderag/eval/datasets/coderag_self.jsonl +``` + ## 🐳 Docker (beta) Prebuilt **multi-arch** images (`linux/amd64` + `linux/arm64`) are published to GHCR on @@ -228,18 +310,25 @@ Everything is configurable via `CODERAG_*` environment variables or a `.env` fil | `CODERAG_ANTHROPIC_MODEL` | `claude-opus-4-8` | Anthropic chat model for answers | | `CODERAG_API_KEY` | – | If set, the HTTP API **requires** it (`Authorization: Bearer ` or `X-API-Key`). Set whenever the server is reachable beyond localhost. | | `CODERAG_CORS_ORIGINS` | – | Comma-separated CORS allowlist for the HTTP API (never `*`). Empty ⇒ no cross-origin browser access. | +| `CODERAG_WORKERS` | `4` | Worker threads for chunking + embedding during indexing (`1` = serial). | +| `CODERAG_INDEX_ALL_TEXT` | `false` | Index any UTF-8 text file (docs/config/extensionless), not just code — turns a plain directory into a searchable workspace. Binary files are always skipped. | +| `CODERAG_MCP_AUTO_INDEX` | `true` | MCP server indexes the watched dir on startup (in the background). | +| `CODERAG_MCP_WATCH` | `true` | MCP server keeps the index live via the filesystem watcher. | +| `CODERAG_MCP_SNIPPET_LINES` | `12` | Lines of a chunk returned in a `search_code` snippet by default. | ## 🧩 Supported languages Symbol-aware (function/class/method level): **Python, JavaScript, TypeScript/TSX, Go, Rust, -Java**. Many other languages and docs (C/C++, Ruby, PHP, Markdown, YAML, …) are indexed with -a line-window fallback, so they remain searchable. +Java**. Many other languages and docs (C/C++, Ruby, PHP, Markdown, YAML, HTML/CSS, …) are +indexed with a line-window fallback, so they remain searchable. Set `CODERAG_INDEX_ALL_TEXT=1` +(or `coderag mcp --all-text`) to index **any** UTF-8 text file — including extensionless ones +like `Dockerfile` — so a plain document/notes directory becomes searchable too, not just code. 
## 🛠️ Development ```bash python -m venv venv && source venv/bin/activate -pip install -e ".[dev,server,openai]" +pip install -e ".[dev,server,ui,mcp,openai]" pytest -m "not integration" # fast, offline (uses a deterministic fake embedder) pytest -m integration # exercises the real local model (downloads once) diff --git a/coderag/api.py b/coderag/api.py index a144ff8..497993d 100644 --- a/coderag/api.py +++ b/coderag/api.py @@ -8,6 +8,7 @@ from __future__ import annotations import logging +import threading from pathlib import Path from typing import TYPE_CHECKING, List, Optional, Union @@ -38,6 +39,10 @@ def __init__(self, config: Optional[Config] = None) -> None: # Set when the store's embedding model/dim changed and the FAISS cache must # be rebuilt from scratch (consumed when the vector index is first opened). self._rebuild_required: bool = False + # Serializes all indexing/deletion so concurrent writers (the CLI, the HTTP + # surface, the MCP server's background index, and the live watcher) can't + # interleave a file's delete-before-add sequence. Reads (search) are unaffected. + self._index_lock = threading.Lock() # --- lazily constructed collaborators --- @@ -116,7 +121,8 @@ def index( force a clean rebuild. 
""" target = Path(path).expanduser() if path else self.config.watched_dir - return self.indexer.index(target, full=full) + with self._index_lock: + return self.indexer.index(target, full=full) def search(self, query: str, top_k: Optional[int] = None) -> List[SearchHit]: """Hybrid (dense + lexical) search over the indexed codebase.""" @@ -161,10 +167,11 @@ def delete_path(self, path: Union[str, Path]) -> int: rel = Path(path).resolve().relative_to(root).as_posix() except ValueError: return 0 - removed = self.store.delete_file(rel) - if removed: - self.vectors.remove(removed) - self.vectors.save() + with self._index_lock: + removed = self.store.delete_file(rel) + if removed: + self.vectors.remove(removed) + self.vectors.save() return len(removed) def status(self) -> dict: diff --git a/coderag/chunking/languages.py b/coderag/chunking/languages.py index c2cc5a4..84e0e8f 100644 --- a/coderag/chunking/languages.py +++ b/coderag/chunking/languages.py @@ -49,12 +49,49 @@ ".json": "json", ".cfg": "ini", ".ini": "ini", + # Common markup/web/config text — searchable in most repos, line-window chunked. + ".xml": "xml", + ".html": "html", + ".htm": "html", + ".css": "css", + ".scss": "scss", + ".less": "less", + ".vue": "vue", + ".svelte": "svelte", + ".properties": "properties", + ".gradle": "gradle", } +# Well-known text files that have no (or an unconventional) extension. Matched on the +# lowercased file *name* when the extension lookup misses. +FILENAME_TO_LANGUAGE = { + "dockerfile": "dockerfile", + "makefile": "make", + "license": "text", + "notice": "text", + "readme": "text", + "codeowners": "text", + ".env": "text", + ".gitignore": "text", + ".dockerignore": "text", +} + + +def detect_language(path: str | Path, *, all_text: bool = False) -> Optional[str]: + """Return the language for ``path``, or ``None`` if it should not be indexed. 
-def detect_language(path: str | Path) -> Optional[str]: - """Return the language for ``path``, or ``None`` if it should not be indexed.""" - return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower()) + With ``all_text=True`` any unrecognized file is treated as plain ``"text"`` so a whole + directory (docs, notes, config) becomes searchable, not just code. Binary files are + still rejected later by the indexer's NUL-byte sniff, so this stays safe. + """ + p = Path(path) + lang = EXTENSION_TO_LANGUAGE.get(p.suffix.lower()) + if lang: + return lang + lang = FILENAME_TO_LANGUAGE.get(p.name.lower()) + if lang: + return lang + return "text" if all_text else None def extensions_for(languages: Iterable[str]) -> List[str]: diff --git a/coderag/config.py b/coderag/config.py index 4c76423..54b9f29 100644 --- a/coderag/config.py +++ b/coderag/config.py @@ -117,6 +117,12 @@ class Config: # --- What to index --- languages: Tuple[str, ...] = DEFAULT_LANGUAGES ignore_globs: Tuple[str, ...] = DEFAULT_IGNORE_GLOBS + # Index any UTF-8-decodable file as plain text, even with an unknown/absent extension + # (Dockerfile, Makefile, LICENSE, .log, ...). Off by default so code repos aren't + # polluted; turn on (CODERAG_INDEX_ALL_TEXT / `coderag mcp --all-text`) to make + # CodeRAG a general document/file-directory search engine. Binary files are still + # skipped (NUL-byte sniff in the indexer). + index_all_text: bool = False max_file_bytes: int = 1_000_000 # skip files larger than this max_chunk_lines: int = 200 # split oversized symbols into windows above this window_lines: int = 60 # fallback line-window size @@ -182,6 +188,18 @@ class Config: # wildcard lets any website the user visits exfiltrate them via the browser. cors_origins: Tuple[str, ...] = () + # --- MCP server surface (optional [mcp] surface) --- + # `coderag mcp` runs a persistent, warm process so the embedding model loads once and + # every query is fast — the win over an agent's cold, repeated grep/read loop. 
By + # default it indexes the watched dir on startup (in the background, so the server is + # responsive immediately) and keeps it live via the filesystem watcher, so an agent + # gets fresh results with zero manual steps. + mcp_auto_index: bool = True + mcp_watch: bool = True + # Lines of a chunk returned in a search_code snippet by default (the agent can request + # the full text, or fetch a precise range via get_file) — keeps responses token-cheap. + mcp_snippet_lines: int = 12 + # --- Demo mode (public, untrusted UI) --- # When on, the Streamlit UI shows a notice, hides the Reindex button, and limits # LLM answers per browser session. The per-session limit is soft (session-state @@ -252,6 +270,12 @@ def from_env(cls, **overrides: object) -> "Config": ), api_key=os.getenv("CODERAG_API_KEY"), cors_origins=_env_tuple("CODERAG_CORS_ORIGINS", cls.cors_origins), + index_all_text=_env_bool("CODERAG_INDEX_ALL_TEXT", cls.index_all_text), + mcp_auto_index=_env_bool("CODERAG_MCP_AUTO_INDEX", cls.mcp_auto_index), + mcp_watch=_env_bool("CODERAG_MCP_WATCH", cls.mcp_watch), + mcp_snippet_lines=_env_int( + "CODERAG_MCP_SNIPPET_LINES", cls.mcp_snippet_lines + ), demo_mode=_env_bool("CODERAG_DEMO_MODE", cls.demo_mode), demo_max_answers=_env_int("CODERAG_DEMO_MAX_ANSWERS", cls.demo_max_answers), demo_cooldown_seconds=_env_int( diff --git a/coderag/indexer.py b/coderag/indexer.py index e3d4fd3..0970e40 100644 --- a/coderag/indexer.py +++ b/coderag/indexer.py @@ -24,7 +24,7 @@ from coderag.embeddings import EmbeddingProvider from coderag.store.sqlite_store import SQLiteStore from coderag.store.vector_index import FaissVectorIndex -from coderag.types import IndexStats +from coderag.types import Chunk, IndexStats logger = logging.getLogger(__name__) @@ -84,17 +84,11 @@ def index( else: work.append(item) - # 2. (Re)index changed files: remove old chunks, embed, add new ones. 
- iterator: Iterator[_Work] = iter(work) - if progress and work: - try: - from tqdm import tqdm - - iterator = tqdm(work, desc="Indexing", unit="file") - except Exception: # pragma: no cover - pass - for item in iterator: - added, removed = self._index_file(item) + # 2. (Re)index changed files. Chunking + embedding (the CPU/network cost) may run + # in parallel across files (config.index_workers); the SQLite + FAISS writes + # stay on this single thread to preserve the delete-before-add invariant and + # the single-connection store. + for added, removed in self._embed_and_write(work, progress=progress): stats.chunks_added += added stats.chunks_removed += removed stats.files_indexed += 1 @@ -131,6 +125,8 @@ def _maybe_work(self, abs_path: Path, rel: str, language: str) -> Optional[_Work return None if len(data) > self.config.max_file_bytes or not data.strip(): return None + if b"\x00" in data[:8192]: + return None # binary file (NUL byte in the head) — never index as text content_hash = hashlib.sha256(data).hexdigest() existing = self.store.get_file(rel) if existing is not None and existing["content_hash"] == content_hash: @@ -138,7 +134,66 @@ def _maybe_work(self, abs_path: Path, rel: str, language: str) -> Optional[_Work text = data.decode("utf-8", errors="replace") return _Work(rel, language, text, content_hash, abs_path.stat().st_mtime) - def _index_file(self, item: _Work) -> Tuple[int, int]: + def _embed_and_write( + self, work: List[_Work], *, progress: bool + ) -> Iterator[Tuple[int, int]]: + """Chunk+embed each file (optionally across worker threads) and apply the writes. + + Embedding is the expensive, parallelizable step and touches no shared mutable + state, so it runs in a thread pool when ``index_workers > 1``. The store/FAISS + writes are drained here on the single calling thread, so the no-duplicate + (delete-before-add) invariant and the single-writer store are preserved. 
+ """ + if not work: + return + workers = max(1, self.config.index_workers) + bar = self._progress_bar(len(work), progress) + try: + if workers > 1 and len(work) > 1: + from concurrent.futures import ThreadPoolExecutor, as_completed + + with ThreadPoolExecutor(max_workers=workers) as pool: + futures = {pool.submit(self._prepare, item): item for item in work} + for fut in as_completed(futures): + chunks, vectors = fut.result() + yield self._write(futures[fut], chunks, vectors) + if bar is not None: + bar.update(1) + else: + for item in work: + chunks, vectors = self._prepare(item) + yield self._write(item, chunks, vectors) + if bar is not None: + bar.update(1) + finally: + if bar is not None: + bar.close() + + @staticmethod + def _progress_bar(total: int, progress: bool): # type: ignore[no-untyped-def] + if not progress: + return None + try: + from tqdm import tqdm + + return tqdm(total=total, desc="Indexing", unit="file") + except Exception: # pragma: no cover + return None + + def _prepare(self, item: _Work) -> Tuple[List[Chunk], Optional[np.ndarray]]: + """Chunk and embed a file. Pure with respect to the store/FAISS, so it is safe to + run in a worker thread; the resulting writes are applied by :meth:`_write`.""" + chunks = chunk_file(item.text, item.language, self.config) + if not chunks: + return [], None + vectors = self.provider.embed_documents([c.text for c in chunks]) + return chunks, vectors + + def _write( + self, item: _Work, chunks: List[Chunk], vectors: Optional[np.ndarray] + ) -> Tuple[int, int]: + """Apply a prepared file: remove its old chunks (store + FAISS) before adding the + new ones. 
Must run single-threaded — it is the only writer.""" removed = 0 existing = self.store.get_file(item.rel) if existing is not None: @@ -150,11 +205,9 @@ def _index_file(self, item: _Work) -> Tuple[int, int]: item.rel, item.language, item.content_hash, item.mtime ) - chunks = chunk_file(item.text, item.language, self.config) - if not chunks: + if not chunks or vectors is None: return 0, removed - vectors = self.provider.embed_documents([c.text for c in chunks]) new_ids = self.store.add_chunks( file_id, chunks, vectors, self.provider.model_id ) @@ -164,7 +217,7 @@ def _index_file(self, item: _Work) -> Tuple[int, int]: def _walk(self, target: Path, root: Path) -> Iterator[Tuple[Path, str, str]]: if target.is_file(): rel = self._rel(target, root) - language = detect_language(target) + language = detect_language(target, all_text=self.config.index_all_text) if rel and language and not self._ignored(rel): yield target, rel, language return @@ -177,7 +230,7 @@ def _walk(self, target: Path, root: Path) -> Iterator[Tuple[Path, str, str]]: rel = self._rel(abs_path, root) if not rel or self._ignored(rel): continue - language = detect_language(name) + language = detect_language(name, all_text=self.config.index_all_text) if language: yield abs_path, rel, language diff --git a/coderag/store/vector_index.py b/coderag/store/vector_index.py index 5fa919e..5ae86a3 100644 --- a/coderag/store/vector_index.py +++ b/coderag/store/vector_index.py @@ -14,6 +14,7 @@ import logging import math +import threading from pathlib import Path from typing import TYPE_CHECKING, Iterable, Tuple @@ -49,6 +50,11 @@ def __init__(self, index: faiss.Index, kind: str, config: Config, dim: int) -> N self.kind = kind self.config = config self.dim = dim + # A FAISS index is not safe for a write (add/remove/rebuild) concurrent with a + # read (search). 
The MCP server is the first surface to run the watcher (which + # writes) alongside live agent queries (which read), so serialize index access on + # a reentrant lock. Reads are fast, so contention is negligible. + self._lock = threading.RLock() # --- construction / persistence --- @@ -76,14 +82,16 @@ def open(cls, config: Config, dim: int) -> "FaissVectorIndex": def save(self) -> None: path = self.config.faiss_path path.parent.mkdir(parents=True, exist_ok=True) - faiss.write_index(self._index, str(path)) - Path(str(path) + ".kind").write_text(self.kind) + with self._lock: + faiss.write_index(self._index, str(path)) + Path(str(path) + ".kind").write_text(self.kind) # --- properties --- @property def ntotal(self) -> int: - return int(self._index.ntotal) + with self._lock: + return int(self._index.ntotal) # --- mutations --- @@ -92,22 +100,25 @@ def add(self, ids: np.ndarray, vectors: np.ndarray) -> None: return vecs = _normalized(vectors) id_arr = np.ascontiguousarray(ids, dtype="int64") - self._index.add_with_ids(vecs, id_arr) + with self._lock: + self._index.add_with_ids(vecs, id_arr) def remove(self, ids: Iterable[int]) -> int: ids = list(ids) if not ids: return 0 selector = faiss.IDSelectorBatch(np.asarray(ids, dtype="int64")) - return int(self._index.remove_ids(selector)) + with self._lock: + return int(self._index.remove_ids(selector)) def search(self, query: np.ndarray, k: int) -> Tuple[np.ndarray, np.ndarray]: """Return ``(ids, scores)`` for the top-k, with FAISS ``-1`` padding stripped.""" - if self.ntotal == 0: - return np.empty(0, dtype="int64"), np.empty(0, dtype="float32") - q = _normalized(np.asarray(query, dtype="float32").reshape(1, -1)) - k = min(k, self.ntotal) - scores, ids = self._index.search(q, k) + with self._lock: + if self.ntotal == 0: + return np.empty(0, dtype="int64"), np.empty(0, dtype="float32") + q = _normalized(np.asarray(query, dtype="float32").reshape(1, -1)) + k = min(k, self.ntotal) + scores, ids = self._index.search(q, k) 
ids_row, scores_row = ids[0], scores[0] mask = ids_row != -1 return ids_row[mask].astype("int64"), scores_row[mask].astype("float32") @@ -136,44 +147,50 @@ def _build_ivf(self, ids: np.ndarray, vecs: np.ndarray) -> faiss.Index: return index def rebuild_from_store(self, store: "SQLiteStore") -> None: - """Discard the current index and rebuild it from the SQLite vectors.""" - n = store.total_chunks() - kind = self._choose_kind(n) - if n == 0: - self._index = self._empty_flat(self.dim) - self.kind = "flat" - self.save() - return + """Discard the current index and rebuild it from the SQLite vectors. - if kind == "ivf": - # IVF needs all training vectors up front. - all_ids, all_vecs = [], [] - for ids, vecs in store.iter_vectors(): - all_ids.append(ids) - all_vecs.append(_normalized(vecs)) - ids = np.concatenate(all_ids) - vecs = np.vstack(all_vecs) - try: - self._index = self._build_ivf(ids, vecs) - self.kind = "ivf" - except Exception as exc: - # Degenerate corpora (too few or many duplicate vectors) can make IVF - # training fail; fall back to exact flat rather than aborting indexing. - logger.warning( - "IVF training failed (%s); falling back to flat index.", exc - ) + Holds the index lock for the whole swap so a concurrent search never observes a + half-built index. This is rare (model change, or the one-time flat->ivf upgrade), + so briefly stalling reads is an acceptable price for correctness. + """ + with self._lock: + n = store.total_chunks() + kind = self._choose_kind(n) + if n == 0: + self._index = self._empty_flat(self.dim) + self.kind = "flat" + self.save() + return + + if kind == "ivf": + # IVF needs all training vectors up front. 
+ all_ids, all_vecs = [], [] + for ids, vecs in store.iter_vectors(): + all_ids.append(ids) + all_vecs.append(_normalized(vecs)) + ids = np.concatenate(all_ids) + vecs = np.vstack(all_vecs) + try: + self._index = self._build_ivf(ids, vecs) + self.kind = "ivf" + except Exception as exc: + # Degenerate corpora (too few or many duplicate vectors) can make IVF + # training fail; fall back to exact flat rather than aborting indexing. + logger.warning( + "IVF training failed (%s); falling back to flat index.", exc + ) + index = self._empty_flat(self.dim) + index.add_with_ids(vecs, np.ascontiguousarray(ids)) + self._index = index + self.kind = "flat" + else: index = self._empty_flat(self.dim) - index.add_with_ids(vecs, np.ascontiguousarray(ids)) + for ids, vecs in store.iter_vectors(): + index.add_with_ids(_normalized(vecs), np.ascontiguousarray(ids)) self._index = index self.kind = "flat" - else: - index = self._empty_flat(self.dim) - for ids, vecs in store.iter_vectors(): - index.add_with_ids(_normalized(vecs), np.ascontiguousarray(ids)) - self._index = index - self.kind = "flat" - logger.info("Built flat index: %d vectors", n) - self.save() + logger.info("Built flat index: %d vectors", n) + self.save() def ensure_consistent(self, store: "SQLiteStore") -> None: """Rebuild from SQLite if the cached index disagrees with the store. diff --git a/coderag/surfaces/cli.py b/coderag/surfaces/cli.py index aa36730..7c5456b 100644 --- a/coderag/surfaces/cli.py +++ b/coderag/surfaces/cli.py @@ -193,6 +193,27 @@ def cmd_serve(args: argparse.Namespace) -> int: return 0 +def cmd_mcp(args: argparse.Namespace) -> int: + try: + from coderag.surfaces.mcp_server import run_mcp + except ImportError: + print( + "The MCP server needs extra deps. 
Install with: pip install 'coderag[mcp]'" + ) + return 1 + cfg = _build_config(args) + if args.all_text: + cfg = cfg.with_overrides(index_all_text=True) + cr = CodeRAG(cfg) + run_mcp( + cr, + transport=args.transport, + auto_index=not args.no_index, + watch=not args.no_watch, + ) + return 0 + + def cmd_ui(args: argparse.Namespace) -> int: try: from coderag.surfaces.webui import run_ui @@ -326,6 +347,36 @@ def build_parser() -> argparse.ArgumentParser: _add_common(p_serve) p_serve.set_defaults(func=cmd_serve) + p_mcp = sub.add_parser( + "mcp", + help="Run the MCP server so AI agents (Claude Code/Codex/Cursor) can search " + "this workspace instead of grepping.", + ) + p_mcp.add_argument( + "--transport", + choices=("stdio", "sse", "streamable-http"), + default="stdio", + help="MCP transport (default stdio — how editors/agents launch servers).", + ) + p_mcp.add_argument( + "--no-index", + action="store_true", + help="Don't index the workspace on startup; use the existing index as-is.", + ) + p_mcp.add_argument( + "--no-watch", + action="store_true", + help="Don't keep the index live with the filesystem watcher.", + ) + p_mcp.add_argument( + "--all-text", + action="store_true", + help="Index any text file, not just code (docs/notes/config) — for a plain " + "file directory, not only a code repo.", + ) + _add_common(p_mcp) + p_mcp.set_defaults(func=cmd_mcp) + p_ui = sub.add_parser("ui", help="Launch the built-in web UI.") p_ui.add_argument( "--host", diff --git a/coderag/surfaces/mcp_server.py b/coderag/surfaces/mcp_server.py new file mode 100644 index 0000000..5ae10cd --- /dev/null +++ b/coderag/surfaces/mcp_server.py @@ -0,0 +1,272 @@ +"""MCP server surface (optional ``[mcp]`` extra). + +Exposes CodeRAG to AI coding agents (Claude Code, Codex, Cursor, …) over the Model +Context Protocol so they can query a **warm, pre-indexed workspace** instead of running +slow, repeated grep/glob/read loops. 
The server is a persistent process: the embedding +model loads once at startup and every query is then a single fast in-process call +(FAISS ANN + BM25 + fusion), so retrieval is cheaper and faster than an agent's +multi-round shell search — the whole point of this surface. + +Design: like the other surfaces (``cli``/``http_api``/``webui``), this is a thin adapter +over the :class:`coderag.api.CodeRAG` facade. Heavy imports (the ``mcp`` SDK) live inside +the functions so importing this module stays cheap and the ``[mcp]`` extra is only needed +to actually run it. The four tools route entirely through existing facade methods. + +Note: this module intentionally does NOT use ``from __future__ import annotations`` — the +MCP SDK introspects the tools' real type hints to generate their input/output schemas. +""" + +import logging +import threading +from typing import TYPE_CHECKING, List, Literal, Optional + +if TYPE_CHECKING: + from mcp.server.fastmcp import FastMCP + + from coderag.api import CodeRAG + from coderag.types import SearchHit + +logger = logging.getLogger(__name__) + +_INSTRUCTIONS = ( + "CodeRAG indexes this workspace for fast semantic + keyword search. Prefer the " + "search_code tool over grep/glob/read loops to find code or text by meaning or by " + "identifier — it returns ranked results with exact path:line locations in one call. " + "Then use get_file to read a precise range. Call index_status to check freshness." 
+) + + +class _State: + """Mutable server state shared between the tools and the background threads.""" + + def __init__(self) -> None: + self.indexing = False # True while the initial/manual index runs + self.stop = threading.Event() # set on shutdown to stop the watcher thread + + +def _truncate(text: str, max_lines: int) -> "tuple[str, bool]": + lines = text.splitlines() + if max_lines <= 0 or len(lines) <= max_lines: + return text, False + return "\n".join(lines[:max_lines]) + "\n…", True + + +def _format_hit(hit: "SearchHit", snippet_lines: int, full_text: bool) -> dict: + """Compact, token-cheap projection of a SearchHit for an agent. + + Collapses the line range into a ``path:start-end`` location and truncates the snippet + by default — the agent reads the location and calls ``get_file`` only for the chunk it + actually wants, which is what makes this cheaper than dumping whole files. + """ + if full_text: + snippet, truncated = hit.text, False + else: + snippet, truncated = _truncate(hit.text, snippet_lines) + return { + "location": f"{hit.path}:{hit.start_line}-{hit.end_line}", + "symbol": hit.symbol, + "kind": hit.kind, + "language": hit.language, + "score": round(hit.score, 4), + "similarity": round(hit.similarity, 4), + "snippet": snippet, + "truncated": truncated, + } + + +def _filter_hits( + hits: "List[SearchHit]", + *, + language: Optional[str], + path_prefix: Optional[str], + kind: Optional[str], +) -> "List[SearchHit]": + """Best-effort post-filter (the searcher itself has no filters).""" + out = hits + if language: + lang = language.lower() + out = [h for h in out if h.language.lower() == lang] + if kind: + want = kind.lower() + out = [h for h in out if h.kind.lower() == want] + if path_prefix: + out = [h for h in out if h.path.startswith(path_prefix)] + return out + + +def _status_word(state: _State) -> str: + return "in_progress" if state.indexing else "ready" + + +def build_mcp(cr: "CodeRAG", *, state: Optional[_State] = None) -> "FastMCP": + 
"""Build the FastMCP server with CodeRAG's tools wired to the facade. + + Pure construction (no indexing, no transport), so tests can drive the tools in-memory. + """ + from mcp.server.fastmcp import FastMCP + + state = state or _State() + snippet_lines = cr.config.mcp_snippet_lines + mcp = FastMCP("coderag", instructions=_INSTRUCTIONS) + + @mcp.tool() + def search_code( + query: str, + top_k: int = 8, + language: Optional[str] = None, + path_prefix: Optional[str] = None, + kind: Optional[str] = None, + full_text: bool = False, + ) -> dict: + """Search the indexed workspace by meaning AND keyword (hybrid retrieval). + + Use this INSTEAD of grep/glob/read loops to locate code or text: one fast call + returns the most relevant chunks with exact ``path:start-end`` locations. Works for + conceptual questions ("where is retry/backoff handled?") and exact identifiers + alike. Snippets are truncated by default to stay token-cheap — pass + ``full_text=true`` for the whole chunk, or call ``get_file`` for a precise range. + + Args: + query: Natural-language question, or a code snippet/identifier to find. + top_k: Maximum number of results to return (default 8). + language: Restrict to one language tag (e.g. "python", "typescript"). + path_prefix: Restrict to paths starting with this prefix (e.g. "src/"). + kind: Restrict to a chunk kind ("function", "class", "method", "window"). + full_text: Return each chunk's full text instead of a truncated snippet. + """ + if language or path_prefix or kind: + # The searcher can't filter, so pull a deeper pool and filter post-hoc. 
+ pool = max(top_k * 5, cr.config.fetch_k) + hits = _filter_hits( + cr.search(query, top_k=pool), + language=language, + path_prefix=path_prefix, + kind=kind, + )[:top_k] + else: + hits = cr.search(query, top_k=top_k) + return { + "query": query, + "count": len(hits), + "indexing": _status_word(state), + "results": [_format_hit(h, snippet_lines, full_text) for h in hits], + } + + @mcp.tool() + def get_file( + path: str, + start_line: Optional[int] = None, + end_line: Optional[int] = None, + ) -> dict: + """Return the exact contents of an INDEXED file, optionally a 1-based line range. + + Pair with search_code: take a result's path and line range to read precise context. + Only files that are in the index can be read (so this can't fetch arbitrary files + like .env). Returns ``{"error": ...}`` if the path isn't indexed or escapes the + workspace root, rather than failing the call. + """ + try: + content = cr.get_file(path, start_line, end_line) + except (ValueError, FileNotFoundError) as exc: + return {"error": str(exc), "path": path} + return { + "path": path, + "start_line": start_line, + "end_line": end_line, + "content": content, + } + + @mcp.tool() + def index_status() -> dict: + """Report index coverage, freshness, and the active retrieval configuration. + + Includes total_files / total_chunks, the embedding model, whether the reranker is + enabled, and ``"indexing": "ready" | "in_progress"`` so you can tell whether the + initial background index has finished. If results look thin, the index may still be + warming up — check here. + """ + status = cr.status() + status["indexing"] = _status_word(state) + return status + + @mcp.tool() + def reindex(path: Optional[str] = None, full: bool = False) -> dict: + """Re-index the workspace now (incremental by default). + + Rarely needed — the watcher keeps the index live automatically — but useful right + after a large checkout or branch switch. Pass ``full=true`` for a clean rebuild. 
+ Returns the index stats, or ``{"error": ...}`` if an index run is already going. + """ + if state.indexing: + return {"error": "An index operation is already in progress"} + state.indexing = True + try: + stats = cr.index(path, full=full) + finally: + state.indexing = False + return stats.as_dict() + + return mcp + + +def _warm_up(cr: "CodeRAG") -> None: + """Load the engine + embedding model once at startup, not on the first query.""" + try: + cr.status() # builds provider/store/vectors + cr.provider.embed_query("warm up") # loads the model and JITs the query path + except Exception: # pragma: no cover - warm-up is best-effort + logger.exception("MCP warm-up failed (continuing).") + + +def run_mcp( + cr: "CodeRAG", + *, + transport: Literal["stdio", "sse", "streamable-http"] = "stdio", + auto_index: Optional[bool] = None, + watch: Optional[bool] = None, +) -> None: + """Run the MCP server: warm up, (background) index, watch, then serve. + + ``transport`` defaults to ``stdio`` — how Claude Code / Codex / Cursor launch servers. + ``auto_index`` / ``watch`` default to the config (``mcp_auto_index`` / ``mcp_watch``). + """ + from coderag.watch import watch as watch_loop + + auto_index = cr.config.mcp_auto_index if auto_index is None else auto_index + do_watch = cr.config.mcp_watch if watch is None else watch + + state = _State() + mcp = build_mcp(cr, state=state) + + _warm_up(cr) + + if auto_index: + # Index on a background thread so stdio is responsive immediately; search_code + # works against whatever is already indexed while this runs. 
+ state.indexing = True + + def _initial_index() -> None: + try: + cr.index() + except Exception: # pragma: no cover - defensive + logger.exception("Initial MCP index failed.") + finally: + state.indexing = False + + threading.Thread( + target=_initial_index, name="coderag-mcp-index", daemon=True + ).start() + + if do_watch: + threading.Thread( + target=watch_loop, + args=(cr,), + kwargs={"stop_event": state.stop}, + name="coderag-mcp-watch", + daemon=True, + ).start() + + try: + mcp.run(transport=transport) + finally: + state.stop.set() diff --git a/coderag/watch.py b/coderag/watch.py index 931e5fd..ee744ba 100644 --- a/coderag/watch.py +++ b/coderag/watch.py @@ -11,7 +11,7 @@ import threading import time from pathlib import Path -from typing import TYPE_CHECKING, Set +from typing import TYPE_CHECKING, Optional, Set from watchdog.events import FileSystemEvent, FileSystemEventHandler from watchdog.observers import Observer @@ -25,12 +25,15 @@ class _Handler(FileSystemEventHandler): - def __init__(self, pending: Set[str], lock: threading.Lock) -> None: + def __init__( + self, pending: Set[str], lock: threading.Lock, all_text: bool = False + ) -> None: self._pending = pending self._lock = lock + self._all_text = all_text def _note(self, path: str) -> None: - if path and detect_language(path): + if path and detect_language(path, all_text=self._all_text): with self._lock: self._pending.add(path) @@ -52,19 +55,27 @@ def on_moved(self, event: FileSystemEvent) -> None: self._note(str(getattr(event, "dest_path", ""))) -def watch(cr: "CodeRAG", debounce: float = 0.5) -> None: - """Block, keeping ``cr``'s index in sync with its watched directory until Ctrl-C.""" +def watch( + cr: "CodeRAG", + debounce: float = 0.5, + stop_event: Optional[threading.Event] = None, +) -> None: + """Keep ``cr``'s index in sync with its watched directory. + + Blocks until Ctrl-C, or until ``stop_event`` is set — which lets the watcher run on a + background thread (e.g. 
inside the MCP server) and be shut down cleanly. + """ root = cr.config.watched_dir pending: Set[str] = set() lock = threading.Lock() - handler = _Handler(pending, lock) + handler = _Handler(pending, lock, all_text=cr.config.index_all_text) observer = Observer() observer.schedule(handler, str(root), recursive=True) observer.start() logger.info("Watching %s for changes (Ctrl-C to stop)...", root) try: - while True: + while stop_event is None or not stop_event.is_set(): time.sleep(debounce) with lock: batch = set(pending) diff --git a/example.env b/example.env index 8f829b1..2accbfa 100644 --- a/example.env +++ b/example.env @@ -24,6 +24,26 @@ CODERAG_WATCHED_DIR=/path/to/your/codebase # --- Retrieval --- # CODERAG_TOP_K=8 +# --- Indexing throughput --- +# Number of worker threads for chunking + embedding during indexing. >1 parallelizes +# the embed step (a big win for the OpenAI/remote providers; for the local fastembed +# default ONNX already uses multiple cores per call, so the extra lever there is the +# batch size below). Set to 1 to force fully serial indexing. +# CODERAG_WORKERS=4 +# CODERAG_EMBED_BATCH=64 + +# --- MCP server surface (`coderag mcp`, install: pip install 'coderag[mcp]') --- +# Lets AI coding agents (Claude Code, Codex, Cursor) query this workspace instead of +# grepping. By default it indexes the watched dir on startup (in the background) and +# keeps it live via the watcher. +# CODERAG_MCP_AUTO_INDEX=true +# CODERAG_MCP_WATCH=true +# Lines of a chunk returned in a search_code snippet by default (full text on request). +# CODERAG_MCP_SNIPPET_LINES=12 +# Index any UTF-8 text file, not just code (docs/notes/config, extensionless files) so a +# plain file directory becomes searchable. Binary files are always skipped. +# CODERAG_INDEX_ALL_TEXT=false + # --- Optional: AI model platforms (only needed for `--provider openai` or LLM answers) --- # CodeRAG runs fully local by default. 
Configure one of the platforms below only if you # want OpenAI embeddings, an LLM-generated answer (`coderag search ... --answer`), or a diff --git a/pyproject.toml b/pyproject.toml index fc5268b..ccad06a 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -39,6 +39,11 @@ ui = [ "jinja2>=3.1.6,<4", "pygments>=2.20.0,<3", ] +# MCP server surface: lets AI coding agents (Claude Code, Codex, Cursor) use CodeRAG's +# index as their retrieval tool instead of slow grep/glob/read loops. +mcp = [ + "mcp>=1.9.0,<2", +] openai = [ "openai>=2.41.1,<3", ] @@ -50,6 +55,7 @@ all = [ "uvicorn[standard]>=0.49.0,<1", "jinja2>=3.1.6,<4", "pygments>=2.20.0,<3", + "mcp>=1.9.0,<2", "openai>=2.41.1,<3", "anthropic>=0.109.2,<1", ] diff --git a/scripts/bench_vs_grep.py b/scripts/bench_vs_grep.py new file mode 100644 index 0000000..20c64c0 --- /dev/null +++ b/scripts/bench_vs_grep.py @@ -0,0 +1,276 @@ +#!/usr/bin/env python +"""Benchmark CodeRAG's indexed search against a raw grep baseline. + +This makes the headline claim measurable: a warm, pre-indexed workspace answers a query +faster (one in-process call vs many ripgrep invocations) and more accurately on conceptual +/ natural-language queries than the agentic grep loop that tools like Claude Code and Codex +fall back to. It reuses the eval harness so the accuracy numbers are directly comparable to +``coderag eval``. + + python scripts/bench_vs_grep.py \ + --watched-dir . \ + --dataset coderag/eval/datasets/coderag_self.jsonl + +What it reports, both for CodeRAG (hybrid retrieval) and for grep: + * accuracy — recall@k / nDCG@k / MRR at the file level (via coderag.eval.evaluate) + * latency — mean / p50 / p95 wall-clock per query + * context — approximate tokens needed to surface the top-k context (CodeRAG returns + compact chunks; the grep baseline must read whole matched files) + +The grep baseline models the agent's behaviour: extract salient terms from the query, run +ripgrep for each, rank files by match frequency — i.e. 
the floor that semantic search is +meant to beat. As the project's own strategy notes, grep wins on exact identifiers and +persistent edit tasks; CodeRAG's edge is conceptual queries on larger repos plus BM25 for +identifiers, with no code leaving the machine. +""" + +from __future__ import annotations + +import argparse +import os +import re +import subprocess # nosec B404 — benchmarking against the ripgrep CLI is the whole point +import time +from collections import Counter +from pathlib import Path +from statistics import mean +from typing import Callable, List, Sequence + +from coderag.api import CodeRAG +from coderag.config import Config +from coderag.eval import EvalCase, evaluate, load_dataset +from coderag.eval.harness import EvalResult, format_table +from coderag.types import SearchHit + +# Tiny stopword set so grep searches for content terms, not glue words. +_STOP = { + "the", + "and", + "for", + "with", + "that", + "this", + "from", + "into", + "are", + "was", + "use", + "add", + "fix", + "when", + "where", + "what", + "how", + "does", + "should", + "make", + "now", +} +_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}") + + +def _query_terms(query: str, limit: int = 8) -> List[str]: + """Salient search terms from a natural-language query (what an agent would grep for).""" + seen: List[str] = [] + for tok in _TOKEN.findall(query): + low = tok.lower() + if low in _STOP or low in {t.lower() for t in seen}: + continue + seen.append(tok) + if len(seen) >= limit: + break + return seen + + +def make_grep_search(root: Path) -> Callable[[str, int], List[SearchHit]]: + """A grep-backed retriever with the harness's ``(query, k) -> hits`` signature.""" + + def search(query: str, k: int) -> List[SearchHit]: + terms = _query_terms(query) + if not terms: + return [] + counts: Counter = Counter() + for term in terms: + try: + proc = subprocess.run( # nosec B603,B607 — fixed argv, no shell + [ + "rg", + "--count-matches", + "--no-messages", + "-i", + "-e", + term, + 
str(root), + ], + capture_output=True, + text=True, + timeout=30, + ) + except (FileNotFoundError, subprocess.TimeoutExpired): + continue + for line in proc.stdout.splitlines(): + path, _, num = line.rpartition(":") + if not path: + continue + try: + counts[path] += int(num) + except ValueError: + # Not a "path:count" line (e.g. a path containing ':', or rg's + # summary output) — skip it rather than fail the whole query. + continue + hits: List[SearchHit] = [] + for abs_path, score in counts.most_common(k): + rel = os.path.relpath(abs_path, root) + hits.append( + SearchHit( + chunk_id=0, + path=Path(rel).as_posix(), + symbol=None, + kind="window", + language="", + start_line=1, + end_line=1, + text="", + score=float(score), + similarity=0.0, + ) + ) + return hits + + return search + + +def _timed( + fn: Callable[[str, int], List[SearchHit]], sink: List[float] +) -> Callable[[str, int], List[SearchHit]]: + def wrapped(query: str, k: int) -> List[SearchHit]: + start = time.perf_counter() + try: + return fn(query, k) + finally: + sink.append(time.perf_counter() - start) + + return wrapped + + +def _percentile(values: Sequence[float], pct: float) -> float: + if not values: + return 0.0 + ordered = sorted(values) + idx = min(len(ordered) - 1, int(round((pct / 100.0) * (len(ordered) - 1)))) + return ordered[idx] + + +def _fmt_ms(values: Sequence[float]) -> str: + if not values: + return "n/a" + return ( + f"mean {mean(values) * 1000:7.1f} " + f"p50 {_percentile(values, 50) * 1000:7.1f} " + f"p95 {_percentile(values, 95) * 1000:7.1f} (ms)" + ) + + +def _file_chars(path: Path) -> int: + try: + return len(path.read_text(encoding="utf-8", errors="replace")) + except OSError: + return 0 + + +def _context_tokens( + cr: CodeRAG, + grep_search: Callable[[str, int], List[SearchHit]], + cases: Sequence[EvalCase], + root: Path, + top_k: int, +) -> tuple[int, int]: + """Approximate tokens (~chars/4) to surface top-k context for each retriever. 
+ + CodeRAG returns the matched chunks; the grep baseline must read whole matched files — + which is the token-cost argument in favour of indexed retrieval. + """ + cr_tokens = grep_tokens = 0 + for case in cases: + cr_tokens += sum(len(h.text) for h in cr.search(case.query, top_k)) // 4 + grep_tokens += ( + sum(_file_chars(root / h.path) for h in grep_search(case.query, top_k)) // 4 + ) + return cr_tokens, grep_tokens + + +def main() -> int: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--watched-dir", default=".", help="Repo/directory to search.") + ap.add_argument( + "--dataset", + default="coderag/eval/datasets/coderag_self.jsonl", + help="JSONL query -> relevant_files dataset (file level).", + ) + ap.add_argument( + "--store-dir", default=None, help="Index location (default ./.coderag)." + ) + ap.add_argument("--model", default="BAAI/bge-small-en-v1.5") + ap.add_argument("--ks", default="1,5,10") + ap.add_argument( + "--top-k", type=int, default=10, help="k for latency/token sampling." 
+ ) + ap.add_argument("--no-index", action="store_true", help="Reuse the existing index.") + args = ap.parse_args() + + root = Path(args.watched_dir).expanduser().resolve() + ks = tuple(int(k) for k in args.ks.split(",")) + cases = load_dataset(args.dataset) + if not cases: + raise SystemExit(f"No eval cases in {args.dataset}.") + + cfg = Config.from_env( + provider="fastembed", + model=args.model, + watched_dir=root, + store_dir=Path(args.store_dir).expanduser() + if args.store_dir + else root / ".coderag", + ) + cr = CodeRAG(cfg) + if not args.no_index: + stats = cr.index() + print(f"Indexed {stats.total_files} files / {stats.total_chunks} chunks.\n") + + grep_search = make_grep_search(root) + + cr_times: List[float] = [] + grep_times: List[float] = [] + results: List[EvalResult] = [ + evaluate( + _timed(cr.search, cr_times), + cases, + label="coderag (hybrid)", + ks=ks, + level="file", + ), + evaluate( + _timed(grep_search, grep_times), cases, label="grep", ks=ks, level="file" + ), + ] + + cr_tok, grep_tok = _context_tokens(cr, grep_search, cases, root, args.top_k) + + print(f"Accuracy ({len(cases)} cases, file level)\n") + print(format_table(results)) + print("\nLatency per query") + print(f" coderag (1 warm call) : {_fmt_ms(cr_times)}") + print( + f" grep ({len(_query_terms(cases[0].query)) or 'n'} rg calls/query) : {_fmt_ms(grep_times)}" + ) + print(f"\nApprox context tokens for top-{args.top_k} (sum over cases)") + print(f" coderag (compact chunks): {cr_tok:>9,}") + print(f" grep (read whole files) : {grep_tok:>9,}") + if cr_tok: + print(f" -> grep needs ~{grep_tok / max(cr_tok, 1):.1f}x the context tokens") + cr.close() + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/test_mcp.py b/tests/test_mcp.py new file mode 100644 index 0000000..daace4a --- /dev/null +++ b/tests/test_mcp.py @@ -0,0 +1,264 @@ +"""Tests for the MCP server surface (all offline, with the fake provider). 
+ +Drives the FastMCP tools in-memory via ``call_tool`` (no subprocess), mirroring the +HTTP-surface tests. Also covers the two things the MCP server newly stresses: parallel +indexing correctness and search staying safe while the index is being written. +""" + +from __future__ import annotations + +import asyncio +import json +import re +import threading + +import pytest + +pytest.importorskip("mcp") # skip the whole module if the [mcp] extra isn't installed + +from coderag.api import CodeRAG # noqa: E402 +from coderag.config import Config # noqa: E402 +from coderag.surfaces.mcp_server import _State, _warm_up, build_mcp # noqa: E402 +from tests.conftest import write # noqa: E402 + +DEMO = { + "auth.py": ( + "def authenticate(token):\n" + " '''verify a token with retry/backoff'''\n" + " return token == 'ok'\n" + ), + "math.ts": "export function add(a: number, b: number) {\n return a + b;\n}\n", +} + + +def _make(tmp_path, files, **cfg): + """Build an indexed CodeRAG + MCP server over ``files`` with the fake provider.""" + repo = tmp_path / "repo" + store = tmp_path / "store" + for name, body in files.items(): + write(repo / name, body) + cr = CodeRAG(Config(provider="fake", watched_dir=repo, store_dir=store, **cfg)) + cr.index() + state = _State() + return cr, build_mcp(cr, state=state), state, repo + + +def _call(mcp, name, args): + """Invoke a tool and parse its JSON text content into a dict.""" + res = asyncio.run(mcp.call_tool(name, args)) + content = res[0] if isinstance(res, tuple) else res + return json.loads(content[0].text) + + +# --- tool surface --- + + +def test_tools_are_registered(tmp_path): + cr, mcp, _, _ = _make(tmp_path, DEMO) + names = {t.name for t in asyncio.run(mcp.list_tools())} + assert names == {"search_code", "get_file", "index_status", "reindex"} + cr.close() + + +def test_search_code_returns_compact_locations(tmp_path): + cr, mcp, _, _ = _make(tmp_path, DEMO) + r = _call(mcp, "search_code", {"query": "authenticate token", "top_k": 5}) + 
assert r["count"] >= 1 + assert r["indexing"] == "ready" + hit = r["results"][0] + # Compact shape: path:start-end location, a snippet, and no heavy full-text field. + assert re.match(r".+:\d+-\d+$", hit["location"]) + assert "snippet" in hit and "text" not in hit + assert {"symbol", "kind", "language", "score", "similarity"} <= hit.keys() + cr.close() + + +def test_snippet_truncated_unless_full_text(tmp_path): + body = ( + "def big_function():\n" + + "".join(f" step_{i} = {i}\n" for i in range(40)) + + " return 'x39 done'\n" + ) + cr, mcp, _, _ = _make(tmp_path, {"big.py": body}) + q = {"query": "big_function step_39 x39", "top_k": 5} + hit = next( + h for h in _call(mcp, "search_code", q)["results"] if "big.py" in h["location"] + ) + assert hit["truncated"] is True and "…" in hit["snippet"] + + full = next( + h + for h in _call(mcp, "search_code", {**q, "full_text": True})["results"] + if "big.py" in h["location"] + ) + assert full["truncated"] is False and "step_39" in full["snippet"] + cr.close() + + +def test_search_code_filters(tmp_path): + cr, mcp, _, _ = _make(tmp_path, DEMO) + + r = _call( + mcp, "search_code", {"query": "add", "top_k": 10, "language": "typescript"} + ) + assert r["results"] and all(h["language"] == "typescript" for h in r["results"]) + + r = _call( + mcp, "search_code", {"query": "function", "top_k": 10, "path_prefix": "math"} + ) + assert all(h["location"].startswith("math.ts") for h in r["results"]) + + r = _call( + mcp, "search_code", {"query": "anything", "top_k": 10, "language": "rust"} + ) + assert r["results"] == [] # no rust files indexed + cr.close() + + +def test_get_file_range_and_structured_errors(tmp_path): + cr, mcp, _, _ = _make(tmp_path, DEMO) + + r = _call(mcp, "get_file", {"path": "auth.py", "start_line": 1, "end_line": 1}) + assert r["content"] == "def authenticate(token):" + + # Errors are returned as content, not raised — so the agent gets a usable message. 
+ assert "error" in _call(mcp, "get_file", {"path": "../../etc/passwd"}) + assert "error" in _call(mcp, "get_file", {"path": "not_indexed.py"}) + cr.close() + + +def test_index_status_reports_totals_and_flag(tmp_path): + cr, mcp, state, _ = _make(tmp_path, DEMO) + r = _call(mcp, "index_status", {}) + assert r["total_files"] == 2 + assert r["total_chunks"] == cr.vectors.ntotal + assert r["indexing"] == "ready" + + state.indexing = True + assert _call(mcp, "index_status", {})["indexing"] == "in_progress" + cr.close() + + +def test_reindex_picks_up_new_file_and_guards_concurrency(tmp_path): + cr, mcp, state, repo = _make(tmp_path, DEMO) + write(repo / "extra.py", "def extra():\n return 1\n") + r = _call(mcp, "reindex", {}) + assert r["total_files"] == 3 + assert cr.store.total_chunks() == cr.vectors.ntotal + + state.indexing = True # a run already in progress -> guarded + assert "error" in _call(mcp, "reindex", {}) + cr.close() + + +def test_warm_up_is_safe(tmp_path): + cr, _, _, _ = _make(tmp_path, DEMO) + _warm_up(cr) # must not raise with any provider + cr.close() + + +# --- all-text (general file-directory) indexing --- + + +def test_all_text_indexes_text_and_skips_binary(tmp_path): + files = { + "notes.log": "deployment runbook: restart the scheduler service\n", + "Dockerfile": "FROM python:3.11\nRUN pip install coderag\n", + "data.bin": "head\x00\x01\x02tail binary blob\n", # NUL byte -> binary + } + + # Default (code-oriented): unknown .log is skipped; Dockerfile is a known text name. + cr, _, _, _ = _make(tmp_path / "a", files) + paths = set(cr.store.all_file_paths()) + assert "notes.log" not in paths + assert "Dockerfile" in paths + cr.close() + + # all_text: arbitrary text becomes searchable; binary is still rejected. 
+ cr, _, _, _ = _make(tmp_path / "b", files, index_all_text=True) + paths = set(cr.store.all_file_paths()) + assert "notes.log" in paths + assert "data.bin" not in paths + cr.close() + + +# --- parallel indexing correctness & concurrency safety --- + + +def test_parallel_indexing_matches_serial(tmp_path): + files = { + f"m{i}.py": ( + f"def f{i}(x):\n return x + {i}\n\n" + f"class C{i}:\n def m(self):\n return {i}\n" + ) + for i in range(8) + } + + def build(workers, sub): + cr = CodeRAG( + Config( + provider="fake", + watched_dir=tmp_path / "repo", + store_dir=tmp_path / sub, + index_workers=workers, + ) + ) + # write the same repo once (shared watched_dir) + for name, body in files.items(): + write(tmp_path / "repo" / name, body) + stats = cr.index() + out = ( + stats.total_chunks, + cr.store.total_chunks(), + cr.vectors.ntotal, + sorted(cr.store.all_file_paths()), + ) + cr.close() + return out + + serial = build(1, "store_serial") + parallel = build(4, "store_parallel") + assert serial[1] == parallel[1] > 0 # identical chunk count + assert serial[1] == serial[2] and parallel[1] == parallel[2] # store == FAISS + assert serial[3] == parallel[3] # identical file set + + +def test_search_is_safe_during_concurrent_indexing(tmp_path): + repo = tmp_path / "repo" + for i in range(25): + write(repo / f"f{i}.py", "def g():\n return 'token retry backoff'\n") + cr = CodeRAG( + Config(provider="fake", watched_dir=repo, store_dir=tmp_path / "store") + ) + cr.index() + + errors: list = [] + stop = threading.Event() + + def hammer_search(): + try: + while not stop.is_set(): + cr.search("token retry backoff", top_k=5) + except Exception as exc: # pragma: no cover - failure path + errors.append(exc) + + t = threading.Thread(target=hammer_search) + t.start() + try: + # Re-index (FAISS add/remove) while searches (FAISS reads) run concurrently. 
+ for _ in range(3): + for i in range(25, 45): + write(repo / f"f{i}.py", "def g():\n return 'more tokens here'\n") + cr.index() + for i in range(25, 45): + (repo / f"f{i}.py").unlink() + cr.index() + finally: + stop.set() + t.join(timeout=5) + + assert not errors, errors + assert ( + cr.store.total_chunks() == cr.vectors.ntotal + ) # invariant holds after the race + cr.close()
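
The concurrency test above asserts one invariant: after searches and re-index runs interleave, the store's chunk count still equals the vector index's. A minimal sketch of a single-writer pattern under which that kind of invariant holds is given below. `TinyIndex` and `hammer` are hypothetical names for illustration only; they are an in-memory stand-in and not CodeRAG's `SQLiteStore`, `FaissVectorIndex`, or `_index_lock` implementation.

```python
import threading
from typing import Dict, List


class TinyIndex:
    """Hypothetical stand-in for a store + vector cache pair (not CodeRAG's classes)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # one lock serializes every mutation
        self._store: Dict[int, str] = {}  # plays the role of the source of truth
        self._vectors: Dict[int, int] = {}  # plays the role of the derived cache
        self._next_id = 0  # ids only grow, so a deleted id is never reused

    def replace(self, path: str, chunks: List[str]) -> None:
        """Delete-before-add: drop a path's old chunks, then add the new ones."""
        with self._lock:
            stale = [i for i, p in self._store.items() if p == path]
            for i in stale:
                del self._store[i]
                del self._vectors[i]
            for text in chunks:
                self._store[self._next_id] = path
                self._vectors[self._next_id] = len(text)
                self._next_id += 1

    def search(self, min_len: int) -> List[int]:
        """Read a consistent snapshot; the lock is held only while copying."""
        with self._lock:
            snapshot = dict(self._vectors)
        return [i for i, n in snapshot.items() if n >= min_len]

    def consistent(self) -> bool:
        """True when both sides hold exactly the same ids."""
        with self._lock:
            return self._store.keys() == self._vectors.keys()


def hammer(index: TinyIndex, rounds: int = 200) -> List[Exception]:
    """Run searches on a thread while the calling thread rewrites the index."""
    errors: List[Exception] = []
    stop = threading.Event()

    def reader() -> None:
        try:
            while not stop.is_set():
                index.search(1)
        except Exception as exc:  # collected so the caller can assert on it
            errors.append(exc)

    t = threading.Thread(target=reader)
    t.start()
    try:
        for r in range(rounds):
            index.replace(f"f{r % 5}.py", ["def g():", "    return 1"])
    finally:
        stop.set()
        t.join(timeout=5)
    return errors
```

Because both dictionaries are only ever mutated together under the same lock, a reader can never observe an id present in one and missing from the other. Copying the snapshot inside the lock keeps the read-side critical section short; whether CodeRAG's own guards are shaped this way is not shown in this diff.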