FalkorDB · DvirDukhan · Mar 14, 2026 · Apr 11, 2026 · Apr 12, 2026 · May 7, 2026
diff --git a/.env.template b/.env.template
@@ -27,3 +27,33 @@ GEMINI_API_KEY=<YOUR_GEMINI_API_KEY>
 # Optional Uvicorn bind settings used by start.sh / make run-*
 HOST=0.0.0.0
 PORT=5000
+
+# ---------------------------------------------------------------------------
+# Benchmark runner (bench/runners/mini_runner.py) credentials.
+# Picked up automatically from .env at repo root.
+#
+# Pick ONE of these provider configs based on your model choice:
+#
+# 1) Anthropic API (direct):
+#       ANTHROPIC_API_KEY=sk-ant-...
+#       # ANTHROPIC_API_BASE is unset → uses api.anthropic.com
+#
+# 2) Azure AI Foundry's Anthropic-compatible passthrough
+#    (path /anthropic/v1/messages, x-api-key auth):
+#       ANTHROPIC_API_KEY=<your-azure-key>
+#       ANTHROPIC_API_BASE=https://<resource>.services.ai.azure.com/anthropic
+#    Then: --model anthropic/claude-sonnet-4-5
+#
+# 3) GitHub Models (free tier, 8K-16K context cap on personal plans):
+#       GITHUB_API_KEY=$(gh auth token)
+#       GITHUB_API_BASE=https://models.github.ai/inference
+#    Then: --model github/openai/gpt-4o-mini
+#
+# 4) Azure OpenAI (chat completions API, not Anthropic):
+#       AZURE_API_KEY=<your-azure-openai-key>
+#       AZURE_API_BASE=https://<resource>.openai.azure.com
+#       AZURE_API_VERSION=2024-10-21
+#    Then: --model azure/<your-deployment-name>
+# ---------------------------------------------------------------------------
+# ANTHROPIC_API_KEY=
+# ANTHROPIC_API_BASE=
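Review note on `.env.template`: the four provider configs in the comment block above reduce to a mapping from the `--model` prefix to the variables that must be set. The sketch below is only an illustration of that mapping. The names `REQUIRED` and `missing_vars` are hypothetical and are not taken from `bench/runners/mini_runner.py`; the prefixes and variable names come from the comment block itself.

```python
import os

# Hypothetical table: --model prefix -> variables the comment block lists.
# ANTHROPIC_API_BASE is optional (unset means api.anthropic.com), so it is
# not treated as required for either Anthropic-style config.
REQUIRED = {
    "anthropic/": ["ANTHROPIC_API_KEY"],
    "github/": ["GITHUB_API_KEY", "GITHUB_API_BASE"],
    "azure/": ["AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"],
}


def missing_vars(model, env=None):
    """Return the required variables that are unset or empty for `model`."""
    env = os.environ if env is None else env
    for prefix, names in REQUIRED.items():
        if model.startswith(prefix):
            return [name for name in names if not env.get(name)]
    raise ValueError(f"unrecognized model prefix: {model!r}")
```

A runner could call `missing_vars(args.model)` before starting a session and fail fast with the list of unset variables, instead of surfacing an authentication error several steps into a run.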
diff --git a/.github/workflows/mcp-tests.yml b/.github/workflows/mcp-tests.yml
@@ -0,0 +1,76 @@
+name: MCP tests
+
+on:
+  push:
+    branches: ["main", "staging", "mcp/**"]
+    paths:
+      - "api/mcp/**"
+      - "tests/mcp/**"
+      - "api/llm.py"
+      - "api/graph.py"
+      - "pyproject.toml"
+      - "uv.lock"
+      - ".github/workflows/mcp-tests.yml"
+  pull_request:
+    paths:
+      - "api/mcp/**"
+      - "tests/mcp/**"
+      - "api/llm.py"
+      - "api/graph.py"
+      - "pyproject.toml"
+      - "uv.lock"
+      - ".github/workflows/mcp-tests.yml"
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  mcp-tests:
+    runs-on: ubuntu-latest
+
+    services:
+      falkordb:
+        image: falkordb/falkordb:latest
+        ports:
+          - 6379:6379
+        options: >-
+          --health-cmd "redis-cli ping"
+          --health-interval 5s
+          --health-timeout 3s
+          --health-retries 12
+
+    env:
+      FALKORDB_HOST: localhost
+      FALKORDB_PORT: "6379"
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
+
+      - name: Setup Python
+        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6
+        with:
+          python-version: "3.12"
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57 # v8.0.0
+        with:
+          version: "latest"
+          enable-cache: true
+          cache-dependency-glob: "uv.lock"
+
+      - name: Install backend dependencies
+        run: uv sync --all-extras
+
+      - name: Verify FalkorDB reachable
+        run: |
+          sudo apt-get update -qq && sudo apt-get install -y redis-tools
+          redis-cli -h localhost -p 6379 ping
+
+      - name: Run MCP test suite
+        run: uv run pytest tests/mcp/ -v
diff --git a/.gitignore b/.gitignore
@@ -58,3 +58,8 @@ htmlcov/
 pytest_cache/
 *.log
 repositories/
+
+# bench: SWE-bench harness output (regrade reports + per-instance logs)
+bench/cache/verify/
+logs/run_evaluation/
+code-graph-bench.*.json
diff --git a/AGENTS.md b/AGENTS.md
@@ -154,3 +154,26 @@ cgraph info [--repo <name>]              # Repo stats + metadata
 ```
 
 `--repo` defaults to the current directory name. Claude Code skill in `skills/code-graph/`.
+
+## MCP server (for agents)
+
+`cgraph-mcp` exposes the code graph over MCP stdio. Eight tools:
+`index_repo`, `search_code`, `get_callers`, `get_callees`,
+`get_dependencies`, `impact_analysis`, `find_path`, `ask`.
+
+Drop the canonical agent guidance into any repo:
+
+```bash
+cgraph init-agent             # writes CLAUDE.md + .cursorrules
+cgraph init-agent --force     # overwrite existing files
+```
+
+See `api/mcp/templates/claude_mcp_section.md` for the full tool table
+and rules of thumb (start with `search_code`; prefer structural tools
+over `ask`; run `impact_analysis` before refactoring).
+
+Environment:
+
+- `CODE_GRAPH_AUTO_INDEX=true` — auto-index CWD on MCP startup.
+- `CGRAPH_MODE=mcp` — run `cgraph-mcp` instead of the FastAPI web
+  server when using the Docker image.
diff --git a/CONTEXT.md b/CONTEXT.md
@@ -0,0 +1,136 @@
+# Benchmark glossary (CONTEXT.md)
+
+Scope: glossary for the **benchmark workstream** (`bench/` and related
+changes). Not a project-wide glossary for code-graph.
+
+## Terms
+
+### Agent
+The autonomous loop that reads a task, calls tools, edits code, and
+submits a result. We adopt **mini-swe-agent** (the SWE-agent project's
+recommended minimal harness) as the agent. The original SWE-agent
+is **not** used: upstream now points users at mini-swe-agent, whose
+bash-only tool surface is a much smaller, more transparent
+integration. The agent loop is fixed across all configs.
+
+### Config
+One of `baseline`, `lsp`, `code-graph`. A config is **fully defined by
+its `system_preamble.md` plus the `PATH` it exposes to the agent's
+bash**. Same model, same scaffolding template, same step/cost limits
+across all three. (mini-swe-agent reads no per-config `tools.yaml`
+because bash is its only tool; the per-config `tools.yaml` files in
+the repo are kept as design documentation.)
+
+### baseline (config)
+mini-swe-agent's stock bash environment: `cat`, `grep`, `find`,
+`sed`, `git`, and the agent's own implicit submit protocol. **Not
+"zero tools"**: an LLM with no filesystem access is not a useful
+comparison.
+
+### lsp (config)
+`baseline` + an `lsp` command on PATH that wraps multilspy/jedi
+(`goto-definition`, `find-references`, `hover`, `document-symbols`),
+each shaped by the LSP response shim (see below). The plan originally
+specified pyright + `workspace_symbols`; we run **jedi-language-server**
+(what the pinned multilspy fork ships) and drop `workspace_symbols`
+(the fork doesn't implement `request_workspace_symbol`). The shim
+normalizes responses so jedi-vs-pyright does not affect the validity of
+the comparison; the agent falls back to bash+grep for workspace-wide
+symbol search.
+
+### code-graph (config)
+`baseline` + a `cg` command on PATH that talks to the code-graph
+HTTP service: `graph-entities`, `get-neighbors`, `find-paths`,
+`auto-complete`, `find-symbol`, plus `note-edit`. The GraphRAG `chat`
+endpoint is **excluded** to avoid nested-agent token double-counting.
+
+### Accuracy
+The SWE-bench end-to-end metric: did the agent's patch pass the repo's
+test suite? This is the only accuracy number reported. We considered an
+intrinsic retrieval diagnostic and dropped it.
+
+### Token cost
+LLM input + LLM output tokens summed across one agent session for one
+task. Always reported as median, p90, and **Δ vs baseline**.
+
+### Indexing cost
+Wall-clock seconds and any LLM tokens spent to build the FalkorDB graph
+for a `<repo>@<commit>` pair. Reported **separately** and **never
+combined** with per-task token cost. The writeup states the amortization
+break-even task count rather than blending the two costs.
+
+### Task
+One instance from SWE-bench Verified — `(repo, base-commit, issue,
+gold-patch, tests)`.
+
+### Run
+One execution of (config × task). We report **pass@1 at temperature 0**.
+Failed runs are retried 2 more times to filter stochastic failures;
+a retry can turn a fail into a pass, never a pass into a fail.
+
+### Indexed pair
+A `<repo>@<commit>` for which a FalkorDB graph has been built. The pair
+is the cache key; there is no incremental indexing across commits.
+
+### Tool service architecture
+mini-swe-agent runs each step as `subprocess.run` in a configured cwd
+(the prepared repo working tree). **Tools live on the host** (local
+process model): multilspy/jedi runs in-process via the `lsp` CLI
+wrapper; code-graph is reached via an HTTP client (`cg` CLI wrapper)
+to the FastAPI + FalkorDB service. The runner sets `PATH` so the
+agent sees `bench/cli/` only for configs that include those tools
+(baseline gets the unmodified host `PATH`). code-graph's graph is
+built once per `<repo>@<commit>` and would otherwise go stale on agent
+edits, so the code-graph bundle includes a `cg note-edit PATH` tool
+that triggers a **single-file incremental re-index** of the touched
+file. This keeps the comparison fair with the LSP, which is live by default.
+
+### LSP response shim
+Raw LSP responses are too verbose for a fair token-cost comparison
+(`find_references` can return hundreds of locations; `hover` can return
+multi-paragraph markdown). The `lsp` config wraps every LSP tool in
+a thin adapter (`bench/tools/lsp/adapter.py`) that:
+
+- Caps result lists at **50** entries. Further results are available
+  via a `page` arg on the same tool.
+- Strips `hover` markdown to the **first signature line + first
+  docstring sentence**. Full hover is available via an opt-in
+  `hover_full`.
+- Returns locations as `{path, line, col}`, not the raw LSP `Range`.
+
+The shim is identical across all LSP runs. We do **not** run a
+raw-LSP comparison.
+
+### Preambles
+Each config gets a single-paragraph **symmetric preamble** introducing
+its toolkit. The preambles are committed to
+`bench/tools/<config>/system_preamble.md` and reviewed as artifacts.
+Before headline runs we sanity-check phrasing sensitivity by re-running
+one config with 2-3 alternative phrasings; if the pass rate moves by >5%,
+the preambles are dominating the signal and we revisit them.
+
+### Rollout
+Three-stage:
+
+1. **Smoke** — 3 hand-picked tasks × 3 configs × 1 run = 9 sessions.
+   Verify harness, token accounting, indexing path, tool plumbing.
+2. **Calibration** — 10 random Verified tasks × 3 configs × 1 run = 30
+   sessions. Verify preamble phrasing sensitivity, shim behavior.
+3. **Headline** — remaining 40 of the 50-task sample × 3 configs ×
+   pass@1 with retry-2x-on-fail.
+
+### Dataset
+**50-task random sample from SWE-bench Verified** (500-task split).
+Random seed committed to `bench/configs/default.yaml`. If the headline
+Δ between configs is <10 percentage points, we expand to 150 tasks
+before publishing: at n=50 the worst-case standard error of a pass rate
+is roughly ±7 pp (about ±14 pp at 95% confidence).
+
+## Conventions
+
+- `bench/` is the top-level directory for the workstream.
+- Results are JSONL, one row per `(task_id, config, run_idx)`, with
+  token counts pulled from the mini-swe-agent trajectory JSON
+  (`agent.serialize()` — `messages[*].extra.response.usage`).
+- The opencode track and RepoBench track are **not** part of this
+  workstream (dropped during the grill).
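Review note on `CONTEXT.md`: the three rules listed under "LSP response shim" (cap at 50 with paging, trimmed hover, flat locations) can be sketched as below. This is a minimal illustration, not the contents of `bench/tools/lsp/adapter.py`. The function names, zero-based `page` numbering, the plain `file://` prefix stripping, and keeping LSP's zero-based line and column values are all assumptions.

```python
import re

PAGE_SIZE = 50      # cap stated in CONTEXT.md
FENCE = "`" * 3     # markdown code-fence marker found in hover payloads


def shape_location(loc):
    """Flatten an LSP Location ({uri, range}) to {path, line, col}.

    Assumption: values stay zero-based, as in the LSP spec, and the URI is
    a plain file:// URI with no percent-encoding to undo.
    """
    start = loc["range"]["start"]
    path = loc["uri"].removeprefix("file://")
    return {"path": path, "line": start["line"], "col": start["character"]}


def page_of(items, page=0):
    """Return one page of at most PAGE_SIZE items (page is zero-based here)."""
    begin = page * PAGE_SIZE
    return items[begin:begin + PAGE_SIZE]


def shape_hover(markdown):
    """Keep the first signature line plus the first docstring sentence."""
    lines = [ln.strip() for ln in markdown.splitlines()]
    # Drop blank lines and code-fence marker lines.
    lines = [ln for ln in lines if ln and not ln.startswith(FENCE)]
    if not lines:
        return ""
    signature, doc = lines[0], " ".join(lines[1:])
    # Split once after the first sentence-ending punctuation mark.
    first_sentence = re.split(r"(?<=[.!?])\s", doc, maxsplit=1)[0]
    return signature if not first_sentence else f"{signature}\n{first_sentence}"
```

With these helpers, a `find-references` response would be shaped as `page_of([shape_location(l) for l in raw], page)`, so the token cost per call is bounded by the page size whichever language server sits behind the shim.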
diff --git a/Dockerfile b/Dockerfile
@@ -21,7 +21,9 @@ COPY --from=node-base /usr/local/bin/node /usr/local/bin/node
 COPY --from=node-base /usr/local/lib/node_modules /usr/local/lib/node_modules
 
 # Install netcat for wait loop in start.sh and system build tools
-RUN apt-get update && apt-get install -y --no-install-recommends \
+RUN apt-get update \
+    && apt-get install -y -f \
+    && apt-get install -y --no-install-recommends \
     netcat-openbsd \
     git \
     build-essential \

diff --git a/README.md b/README.md
@@ -222,6 +222,48 @@ npx skills add FalkorDB/code-graph
 
 Then ask Claude things like *"what functions call analyze_sources?"* or *"find the dependency chain between parse_config and send_request"* — it will handle the indexing and querying automatically.
 
+### MCP server (`cgraph-mcp`)
+
+For agents that speak the [Model Context Protocol](https://modelcontextprotocol.io)
+(Claude Code, Cursor, Cline, …), code-graph ships a stdio MCP server
+that exposes the knowledge graph as 8 first-class tools: `index_repo`,
+`search_code`, `get_callers`, `get_callees`, `get_dependencies`,
+`impact_analysis`, `find_path`, and `ask` (NL→Cypher via GraphRAG).
+
+Quickstart — Claude Code:
+
+```bash
+# 1. Install (in any venv with the cgraph package on PATH)
+pip install code-graph         # or: uv pip install code-graph
+
+# 2. Register with Claude Code
+claude mcp add-json code-graph '{
+  "command": "cgraph-mcp",
+  "env": {
+    "FALKORDB_HOST": "localhost",
+    "FALKORDB_PORT": "6379",
+    "CODE_GRAPH_AUTO_INDEX": "true"
+  }
+}'
+
+# 3. Drop agent guidance into your repo
+cd /path/to/your/repo
+cgraph init-agent              # writes CLAUDE.md and .cursorrules
+```
+
+Quickstart — Docker Compose:
+
+```bash
+docker compose up -d falkordb                       # start the DB
+docker compose --profile mcp run --rm -i code-graph-mcp   # attach via stdio
+```
+
+The MCP server auto-bootstraps FalkorDB if it's missing on localhost
+(via `cgraph ensure-db`). When `CODE_GRAPH_AUTO_INDEX=true` is set,
+the current working directory is indexed automatically on start.
+
+**Transport:** Phase 1 is stdio only. HTTP/SSE is deferred.
+
 ## Running with Docker
 
 ### Using Docker Compose
@@ -232,18 +274,34 @@ docker compose up --build
 
 This starts FalkorDB and the CodeGraph app together. The checked-in compose file sets `CODE_GRAPH_PUBLIC=1` for the app service.
 
+To run the **MCP stdio server** instead of the web app from the same
+image, set `CGRAPH_MODE=mcp` and use the `mcp` profile:
+
+```bash
+docker compose --profile mcp run --rm -i code-graph-mcp
+```
+
 ### Using Docker directly
 
 ```bash
 docker build -t code-graph .
 
+# Web mode (default)
 docker run -p 5000:5000 \
   -e FALKORDB_HOST=host.docker.internal \
   -e FALKORDB_PORT=6379 \
   -e MODEL_NAME=gemini/gemini-flash-lite-latest \
   -e GEMINI_API_KEY=<YOUR_GEMINI_API_KEY> \
   -e SECRET_TOKEN=<YOUR_SECRET_TOKEN> \
   code-graph
+
+# MCP stdio mode (same image)
+docker run --rm -i \
+  -e CGRAPH_MODE=mcp \
+  -e FALKORDB_HOST=host.docker.internal \
+  -e FALKORDB_PORT=6379 \
+  -e MODEL_NAME=gemini/gemini-flash-lite-latest \
+  code-graph
 ```
 
 ## Creating a Code Graph