Draft
Commits
66 commits
4e3c11c
Merge pull request #618 from FalkorDB/staging
gkorland Mar 14, 2026
92d6805
feat(mcp): scaffold api/mcp module with FastMCP server and cgraph-mcp…
DvirDukhan Apr 11, 2026
408efde
fix(mcp): address Copilot + CodeRabbit review comments on T1 PR
DvirDukhan Apr 12, 2026
6dafb58
chore(mcp): tighten T1 scaffold per review
galshubeli May 7, 2026
0c7e3db
chore(mcp): regenerate uv.lock for anyio test extra
galshubeli May 7, 2026
4a05363
Revert "chore(mcp): regenerate uv.lock for anyio test extra"
galshubeli May 10, 2026
1838b36
fix(docker): repair held-back deps before installing build tools
galshubeli May 10, 2026
a6cec64
fix(e2e): seed from installed graphrag-sdk 0.8.2 instead of cloning HEAD
galshubeli May 10, 2026
b70fd97
fix(e2e): copy SDK out of site-packages and synthesize missing nodes
galshubeli May 10, 2026
db71080
fix(e2e): pass repo URL and synthesize test_* search terms
galshubeli May 10, 2026
9f00b8b
Add benchmark workstream scaffold + CONTEXT.md
Copilot May 26, 2026
1e0048f
Consolidate grill decisions (Q1-Q11)
Copilot May 26, 2026
84c56ac
Upgrade graphrag-sdk 0.8 -> 1.1.1
DvirDukhan May 26, 2026
aed0c56
Add benchmark metrics + report aggregation modules
DvirDukhan May 26, 2026
c52f6e6
Add code-graph HTTP adapter for the benchmark agent
DvirDukhan May 26, 2026
20ba1ed
Add indexing-cache registry for benchmark runner
DvirDukhan May 26, 2026
453dcfa
Add LSP adapter (multilspy + jedi) with end-to-end shim tests
DvirDukhan May 26, 2026
c6ef736
Add mini-swe-agent benchmark runner with dry-run mode
DvirDukhan May 26, 2026
57a406d
Add --real-run mode with synthetic smoke task and outcome verification
DvirDukhan May 26, 2026
020cf64
Add SWE-bench Verified dataset loader and --swe-bench runner mode
DvirDukhan May 26, 2026
7088e23
Add report CLI and SWE-bench Docker verify adapter
DvirDukhan May 26, 2026
dcf4ac0
Load .env in mini_runner; document Anthropic / Azure providers
DvirDukhan May 27, 2026
a6b1b48
bench: wire cg/lsp shims, pre-index code-graph track, sharpen preambles
DvirDukhan May 27, 2026
03c7a73
bench: add --limit flag for quick single-instance runs
DvirDukhan May 27, 2026
13ac345
bench: force tool usage in per-config instance template
DvirDukhan May 27, 2026
b0c9ce9
bench: fix find_symbol exact-match against nested properties.name
DvirDukhan May 27, 2026
921dccd
feat(graph): per-branch graph identity (T17 #651)
DvirDukhan May 27, 2026
bba43e0
ci(mcp): add MCP-tests workflow with FalkorDB service (T2 #649)
DvirDukhan May 27, 2026
a598a74
test(mcp): sample-project fixture + assertion contract (T3 #650)
DvirDukhan May 27, 2026
18d3cc7
feat(mcp): index_repo tool (T4 #652)
DvirDukhan May 27, 2026
58e35b3
feat(mcp): query tools — get_callers/callees/deps, find_path, search_…
DvirDukhan May 27, 2026
c23a206
feat(mcp): impact_analysis tool — variable-depth Cypher (T6 #654)
DvirDukhan May 27, 2026
a3b3206
feat(mcp): GraphRAG ask tool — init module + prompt seam + tool (T9/T…
DvirDukhan May 27, 2026
5e376e6
feat(mcp): auto-init — ensure FalkorDB + opt-in auto-index (T12 #660)
DvirDukhan May 27, 2026
2a71e7f
MCP-T13 + T14: cgraph init-agent + Docker mode switch
DvirDukhan May 27, 2026
a18854b
MCP smoke harness + template fixes from end-to-end run
DvirDukhan May 27, 2026
60e2bd1
bench: add MCP-transport sibling of the code_graph track
DvirDukhan May 27, 2026
f17d437
bench: wire code_graph_mcp into mini_runner dispatch
DvirDukhan May 27, 2026
b14432b
bench: fail loudly on indexing errors + bump analyze_folder timeout
DvirDukhan May 27, 2026
532d849
fix(analyzer): resolve LSP CALLS edges on repos without a venv
May 27, 2026
476bc73
bench: add resume support + ignore sympy rubi rules
May 27, 2026
d23ef79
fix(analyzer): defensive skip when second_pass references untracked file
DvirDukhan May 28, 2026
4841701
refactor(analyzers): extract TreeSitterAnalyzer base class (T15 #663)
DvirDukhan May 28, 2026
3e8935f
feat(analyzers): tree-sitter Python symbol resolver (T18 #689)
DvirDukhan May 28, 2026
612b04f
perf(analyzers): memoise compiled tree-sitter queries
DvirDukhan May 28, 2026
ec7fac6
bench: add start-api.sh helper enabling tree-sitter fast resolver
DvirDukhan May 28, 2026
f9e8156
Merge dvirdukhan/mcp-smoke-combined into bench-combined for calibration
DvirDukhan May 28, 2026
b264700
Merge bench harness with T18 + query-cache stack
DvirDukhan May 28, 2026
5e6c63c
fix(mcp): lazy-import KnowledgeGraph so server starts on graphrag 1.x
DvirDukhan May 28, 2026
38d2411
fix(bench): silence cgraph-mcp stderr so DEBUG logs don't bloat agent…
DvirDukhan May 28, 2026
bbb5d95
fix(bench): bump default cgraph-mcp timeout 60s → 300s for large repos
DvirDukhan May 28, 2026
aa850d6
feat(bench): tool-availability precheck + per-trajectory tool-usage rate
DvirDukhan May 28, 2026
4a6956e
fix(bench): defensive stdin redirect + anti-fallback preamble rules
DvirDukhan May 28, 2026
4daad7e
fix(bench): grade via official swebench Docker harness; deprecate bro…
DvirDukhan May 28, 2026
bfdf60d
chore(bench): gitignore swebench harness output (regrade reports + logs)
DvirDukhan May 28, 2026
7ab59f4
bench: add fallback_rate metric (passive grep/find tracking)
DvirDukhan May 28, 2026
e5f5631
bench: add median wall-clock column to report
DvirDukhan May 28, 2026
8dd3055
bench: capture one-time indexing wall-clock per task
DvirDukhan May 28, 2026
e82c05c
bench: robust indexing precheck (GRAPH.LIST + bounded timeout)
DvirDukhan May 28, 2026
6508e3e
fix(bench): add /api/_health probe + harness sanity check for tree-si…
DvirDukhan May 28, 2026
4c46736
fix(bench): MCP adapter defaults CODE_GRAPH_PY_RESOLVER=tree_sitter +…
DvirDukhan May 28, 2026
dc8534e
perf(bench): compact cg/cg-mcp output + trim system preambles
DvirDukhan May 28, 2026
805d0ad
fix(bench): restore submission sentinel in cg/cg-mcp preambles
DvirDukhan May 28, 2026
4758ea1
fix(bench): SIGKILL whole pgid on timeout to stop orphan leak
DvirDukhan May 29, 2026
3125946
fix(bench/mcp): collect ALL TextContent chunks, prefer structuredContent
DvirDukhan May 29, 2026
403f958
perf(bench/mcp): cap impact_analysis at --limit 50, strip worktree pa…
DvirDukhan May 29, 2026
30 changes: 30 additions & 0 deletions .env.template
@@ -27,3 +27,33 @@ GEMINI_API_KEY=<YOUR_GEMINI_API_KEY>
# Optional Uvicorn bind settings used by start.sh / make run-*
HOST=0.0.0.0
PORT=5000

# ---------------------------------------------------------------------------
# Benchmark runner (bench/runners/mini_runner.py) credentials.
# Picked up automatically from .env at repo root.
#
# Pick ONE of these provider configs based on your model choice:
#
# 1) Anthropic API (direct):
# ANTHROPIC_API_KEY=sk-ant-...
# # ANTHROPIC_API_BASE is unset → uses api.anthropic.com
#
# 2) Azure AI Foundry's Anthropic-compatible passthrough
# (path /anthropic/v1/messages, x-api-key auth):
# ANTHROPIC_API_KEY=<your-azure-key>
# ANTHROPIC_API_BASE=https://<resource>.services.ai.azure.com/anthropic
# Then: --model anthropic/claude-sonnet-4-5
#
# 3) GitHub Models (free tier, 8K-16K context cap on personal plans):
# GITHUB_API_KEY=$(gh auth token)
# GITHUB_API_BASE=https://models.github.ai/inference
# Then: --model github/openai/gpt-4o-mini
#
# 4) Azure OpenAI (chat completions API, not Anthropic):
# AZURE_API_KEY=<your-azure-openai-key>
# AZURE_API_BASE=https://<resource>.openai.azure.com
# AZURE_API_VERSION=2024-10-21
# Then: --model azure/<your-deployment-name>
# ---------------------------------------------------------------------------
# ANTHROPIC_API_KEY=
# ANTHROPIC_API_BASE=
76 changes: 76 additions & 0 deletions .github/workflows/mcp-tests.yml
@@ -0,0 +1,76 @@
name: MCP tests

on:
  push:
    branches: ["main", "staging", "mcp/**"]
    paths:
      - "api/mcp/**"
      - "tests/mcp/**"
      - "api/llm.py"
      - "api/graph.py"
      - "pyproject.toml"
      - "uv.lock"
      - ".github/workflows/mcp-tests.yml"
  pull_request:
    paths:
      - "api/mcp/**"
      - "tests/mcp/**"
      - "api/llm.py"
      - "api/graph.py"
      - "pyproject.toml"
      - "uv.lock"
      - ".github/workflows/mcp-tests.yml"
  workflow_dispatch:

permissions:
  contents: read

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  mcp-tests:
    runs-on: ubuntu-latest

    services:
      falkordb:
        image: falkordb/falkordb:latest
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 5s
          --health-timeout 3s
          --health-retries 12

    env:
      FALKORDB_HOST: localhost
      FALKORDB_PORT: "6379"

    steps:
      - name: Checkout
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

      - name: Setup Python
        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6
        with:
          python-version: "3.12"

      - name: Install uv
        uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57 # v8.0.0
        with:
          version: "latest"
          enable-cache: true
          cache-dependency-glob: "uv.lock"

      - name: Install backend dependencies
        run: uv sync --all-extras

      - name: Verify FalkorDB reachable
        run: |
          sudo apt-get update -qq && sudo apt-get install -y redis-tools
          redis-cli -h localhost -p 6379 ping

      - name: Run MCP test suite
        run: uv run pytest tests/mcp/ -v
5 changes: 5 additions & 0 deletions .gitignore
@@ -58,3 +58,8 @@ htmlcov/
pytest_cache/
*.log
repositories/

# bench: SWE-bench harness output (regrade reports + per-instance logs)
bench/cache/verify/
logs/run_evaluation/
code-graph-bench.*.json
23 changes: 23 additions & 0 deletions AGENTS.md
@@ -154,3 +154,26 @@ cgraph info [--repo <name>] # Repo stats + metadata
```

`--repo` defaults to the current directory name. Claude Code skill in `skills/code-graph/`.

## MCP server (for agents)

`cgraph-mcp` exposes the code graph over MCP stdio. Eight tools:
`index_repo`, `search_code`, `get_callers`, `get_callees`,
`get_dependencies`, `impact_analysis`, `find_path`, `ask`.

Drop the canonical agent guidance into any repo:

```bash
cgraph init-agent # writes CLAUDE.md + .cursorrules
cgraph init-agent --force # overwrite existing files
```

See `api/mcp/templates/claude_mcp_section.md` for the full tool table
and rules of thumb (start with `search_code`; prefer structural tools
over `ask`; run `impact_analysis` before refactoring).

Environment:

- `CODE_GRAPH_AUTO_INDEX=true` — auto-index CWD on MCP startup.
- `CGRAPH_MODE=mcp` — run `cgraph-mcp` instead of the FastAPI web
server when using the Docker image.
136 changes: 136 additions & 0 deletions CONTEXT.md
@@ -0,0 +1,136 @@
# Benchmark glossary (CONTEXT.md)

Scope: glossary for the **benchmark workstream** (`bench/` and related
changes). Not a project-wide glossary for code-graph.

## Terms

### Agent
The autonomous loop that reads a task, calls tools, edits code, and
submits a result. We adopt **mini-swe-agent** (the SWE-agent project's
recommended minimal harness) as the agent. The original SWE-agent
is **not** used: upstream now points users at mini-swe-agent instead,
and mini-swe-agent's bash-only tool surface is a much smaller, more
transparent integration. The agent loop is fixed across all configs.

### Config
One of `baseline`, `lsp`, `code-graph`. A config is **fully defined by
its `system_preamble.md` plus the `PATH` it exposes to the agent's
bash**. Same model, same scaffolding template, same step/cost limits
across all three. (mini-swe-agent has no per-config `tools.yaml`
because bash is the only tool; the per-config `tools.yaml` files in
the repo are kept as design documentation.)

### baseline (config)
mini-swe-agent's stock bash environment: `cat`, `grep`, `find`,
`sed`, `git`, and the agent's own implicit submit protocol. **Not
"zero tools"**: an LLM with no filesystem access is not a useful
comparison.

### lsp (config)
`baseline` + an `lsp` command on PATH that wraps multilspy/jedi
(`goto-definition`, `find-references`, `hover`, `document-symbols`),
each shaped by the LSP response shim (see below). The plan originally
specified pyright + `workspace_symbols`; we run **jedi-language-server**
(what the pinned multilspy fork ships) and drop `workspace_symbols`
(the fork doesn't implement `request_workspace_symbol`). The shim
normalizes responses so that the jedi-vs-pyright difference does not
affect the validity of the comparison; the agent falls back to
bash+grep for workspace-wide symbol search.

### code-graph (config)
`baseline` + a `cg` command on PATH that talks to the code-graph
HTTP service: `graph-entities`, `get-neighbors`, `find-paths`,
`auto-complete`, `find-symbol`, plus `note-edit`. The GraphRAG `chat`
endpoint is **excluded** to avoid nested-agent token double-counting.

### Accuracy
The SWE-bench end-to-end metric: did the agent's patch pass the repo's
test suite? This is the only accuracy number reported. We considered an
intrinsic retrieval diagnostic and dropped it.

### Token cost
LLM input + LLM output tokens summed across one agent session for one
task. Always reported as median, p90, and **Δ vs baseline**.
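
A minimal sketch of how the three reported figures could be computed from per-task session totals. The function names, the nearest-rank percentile definition, and the reading of Δ as a difference of medians are assumptions for illustration; the report module may define them differently.

```python
import statistics


def p90(values):
    """Nearest-rank 90th percentile (assumed definition; others exist)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) in integer arithmetic
    return ordered[rank - 1]


def summarize(tokens_by_config, baseline="baseline"):
    """Map each config to its median, p90, and median delta vs the baseline config."""
    base_median = statistics.median(tokens_by_config[baseline])
    summary = {}
    for config, totals in tokens_by_config.items():
        median = statistics.median(totals)
        summary[config] = {
            "median": median,
            "p90": p90(totals),
            "delta_vs_baseline": median - base_median,  # assumed: difference of medians
        }
    return summary
```

Each list in `tokens_by_config` holds one input-plus-output total per task for that config.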

### Indexing cost
Wall-clock seconds and any LLM tokens spent to build the FalkorDB graph
for a `<repo>@<commit>` pair. Reported **separately** and **never
combined** with per-task token cost. The writeup states the amortization
break-even task count and does not fold indexing cost into the per-task
figures.

### Task
One instance from SWE-bench Verified — `(repo, base-commit, issue,
gold-patch, tests)`.

### Run
One execution of (config × task). We report **pass@1 at temperature 0**.
Failed runs are retried twice more to filter stochastic failures;
retries can turn a fail into a pass but never a pass into a fail.
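
The retry rule can be sketched as follows. `run_once` is a hypothetical callable that executes one (config × task) session and returns `True` on a pass; it is not a name taken from the runner.

```python
def run_with_retries(run_once, max_retries=2):
    """Return (passed, attempts); only failed attempts are retried."""
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        if run_once():
            # A pass ends the loop, so a pass can never become a fail.
            return True, attempts
    return False, attempts
```

Because the loop stops at the first pass, a passing run is never re-executed.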

### Indexed pair
A `<repo>@<commit>` for which a FalkorDB graph has been built. Cache
key. No incremental indexing across commits.
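
As an illustration of the cache-key idea only, here is a small in-memory registry. `IndexRegistry` and `build_graph` are hypothetical names; the real indexing-cache registry may persist its state and look quite different.

```python
class IndexRegistry:
    """Builds a graph at most once per `<repo>@<commit>` pair (in-memory sketch)."""

    def __init__(self, build_graph):
        self._build_graph = build_graph  # hypothetical indexing callable
        self._graphs = {}

    @staticmethod
    def key(repo, commit):
        return f"{repo}@{commit}"

    def get_or_build(self, repo, commit):
        cache_key = self.key(repo, commit)
        if cache_key not in self._graphs:
            # No incremental reuse: a new commit always means a fresh build.
            self._graphs[cache_key] = self._build_graph(repo, commit)
        return self._graphs[cache_key]
```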

### Tool service architecture
mini-swe-agent runs each step as `subprocess.run` in a configured cwd
(the prepared repo working tree). **Tools live on the host** (local
process model): multilspy/jedi runs in-process via the `lsp` CLI
wrapper; code-graph is reached via an HTTP client (`cg` CLI wrapper)
to the FastAPI + FalkorDB service. The runner sets `PATH` so the
agent sees `bench/cli/` only for configs that include those tools
(baseline gets the unmodified host `PATH`). code-graph's graph is
built once per `<repo>@<commit>` and would otherwise go stale on agent
edits, so the code-graph bundle includes a `cg note-edit PATH` tool
that triggers a **single-file incremental re-index** of the touched
file. This keeps fairness with the live-by-default LSP.
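
A rough sketch of the per-config `PATH` rule described above. `build_env` and `TOOL_CONFIGS` are hypothetical names, and prepending `bench/cli` (rather than appending it) is an assumption.

```python
import os

# Configs whose agents should see the CLI wrappers (assumed set).
TOOL_CONFIGS = {"lsp", "code-graph"}


def build_env(config, host_env, cli_dir="bench/cli"):
    """Return the environment for the agent's bash subprocesses."""
    env = dict(host_env)  # copy so the host environment is never mutated
    if config in TOOL_CONFIGS:
        env["PATH"] = cli_dir + os.pathsep + env.get("PATH", "")
    return env  # baseline keeps the unmodified host PATH
```

The resulting mapping would be passed as `env=` to `subprocess.run` for each agent step.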

### LSP response shim
Raw LSP responses are too verbose for a fair token-cost comparison
(`find_references` can return hundreds of locations; `hover` can return
multi-paragraph markdown). The `lsp` config wraps every LSP tool in
a thin adapter (`bench/tools/lsp/adapter.py`) that:

- Caps result lists at **50** entries. Further results are available
behind a `page` argument on the same tool.
- Strips `hover` markdown to the **first signature line + first
docstring sentence**. Full hover available via an opt-in
`hover_full`.
- Returns locations as `{path, line, col}`, not the raw LSP `Range`.

The shim is identical across all LSP runs. We do **not** run a
raw-LSP comparison.
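
The three shim rules could look roughly like this. This is a sketch under assumptions, not the contents of `bench/tools/lsp/adapter.py`: the function names are hypothetical, hover text is assumed to put the signature on its first non-fence line, and LSP positions are passed through without re-basing.

```python
MAX_RESULTS = 50


def paginate(items, page=0, page_size=MAX_RESULTS):
    """Return one capped page of results; later pages are opt-in."""
    start = page * page_size
    return items[start:start + page_size]


def shape_location(path, lsp_range):
    """Flatten an LSP Range to {path, line, col} using its start position."""
    start = lsp_range["start"]
    return {"path": path, "line": start["line"], "col": start["character"]}


def trim_hover(markdown):
    """Keep the first signature line plus the first docstring sentence."""
    lines = [ln.strip() for ln in markdown.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("```")]
    if not lines:
        return ""
    signature, doc = lines[0], " ".join(lines[1:])
    if not doc:
        return signature
    end = doc.find(". ")  # naive sentence boundary
    return signature + "\n" + (doc if end == -1 else doc[:end + 1])
```

The sentence split is deliberately naive; abbreviations such as "e.g. " would cut the docstring early.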

### Preambles
Each config gets a single-paragraph **symmetric preamble** introducing
its toolkit. The preambles are committed to
`bench/tools/<config>/system_preamble.md` and reviewed as artifacts.
Before headline runs we sanity-check phrasing sensitivity by re-running
one config with 2-3 alternative phrasings; if pass rate moves by >5%,
the preambles are dominating signal and we revisit.

### Rollout
Three-stage:

1. **Smoke** — 3 hand-picked tasks × 3 configs × 1 run = 9 sessions.
Verify harness, token accounting, indexing path, tool plumbing.
2. **Calibration** — 10 random Verified tasks × 3 configs × 1 run = 30
sessions. Verify preamble phrasing sensitivity, shim behavior.
3. **Headline** — remaining 40 of the 50-task sample × 3 configs ×
pass@1 with retry-2x-on-fail.

### Dataset
**50-task random sample from SWE-bench Verified** (500-task split).
Random seed committed to `bench/configs/default.yaml`. If the headline
Δ between configs is <10 percentage points, we expand to 150 tasks
before publishing, because the standard error of a single config's pass
rate on 50 tasks is roughly 7 pp (a 95% interval is closer to ±14 pp).
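
For reference, the sampling noise behind this threshold can be checked with the binomial standard error. The sketch assumes independent tasks and the worst-case pass rate of 50%; under those assumptions a figure of about 7 pp at 50 tasks corresponds to one standard error.

```python
import math


def pass_rate_standard_error(p, n):
    """Standard error of a pass rate p measured on n independent tasks."""
    return math.sqrt(p * (1 - p) / n)


def half_width_95(p, n):
    """Half-width of the normal-approximation 95% interval."""
    return 1.96 * pass_rate_standard_error(p, n)


for n in (50, 150):
    se = pass_rate_standard_error(0.5, n)
    print(f"n={n}: SE={se:.3f}, 95% half-width={half_width_95(0.5, n):.3f}")
```

Expanding from 50 to 150 tasks shrinks both figures by a factor of √3.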

## Conventions

- `bench/` is the top-level directory for the workstream.
- Results are JSONL, one row per `(task_id, config, run_idx)`, with
token counts pulled from the mini-swe-agent trajectory JSON
(`agent.serialize()` — `messages[*].extra.response.usage`).
- The opencode track and RepoBench track are **not** part of this
workstream (dropped during the grill).
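
A sketch of pulling one session's token total out of a trajectory, following the `messages[*].extra.response.usage` path above. The `prompt_tokens` and `completion_tokens` key names are an assumption (OpenAI-style usage objects), and `session_tokens` is a hypothetical helper.

```python
def session_tokens(trajectory):
    """Sum input and output tokens across every message that carries usage data."""
    total_in = 0
    total_out = 0
    for message in trajectory.get("messages", []):
        extra = message.get("extra") or {}
        response = extra.get("response") or {}
        usage = response.get("usage") or {}
        # Assumed key names; messages without usage contribute zero.
        total_in += usage.get("prompt_tokens", 0)
        total_out += usage.get("completion_tokens", 0)
    return {"input": total_in, "output": total_out, "total": total_in + total_out}
```

The returned `total` is the per-task figure that would go into one JSONL row.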
4 changes: 3 additions & 1 deletion Dockerfile
@@ -21,7 +21,9 @@ COPY --from=node-base /usr/local/bin/node /usr/local/bin/node
COPY --from=node-base /usr/local/lib/node_modules /usr/local/lib/node_modules

# Install netcat for wait loop in start.sh and system build tools
-RUN apt-get update && apt-get install -y --no-install-recommends \
+RUN apt-get update \
+    && apt-get install -y -f \
+    && apt-get install -y --no-install-recommends \
netcat-openbsd \
git \
build-essential \
58 changes: 58 additions & 0 deletions README.md
@@ -222,6 +222,48 @@ npx skills add FalkorDB/code-graph

Then ask Claude things like *"what functions call analyze_sources?"* or *"find the dependency chain between parse_config and send_request"* — it will handle the indexing and querying automatically.

### MCP server (`cgraph-mcp`)

For agents that speak the [Model Context Protocol](https://modelcontextprotocol.io)
(Claude Code, Cursor, Cline, …), code-graph ships a stdio MCP server
that exposes the knowledge graph as 8 first-class tools: `index_repo`,
`search_code`, `get_callers`, `get_callees`, `get_dependencies`,
`impact_analysis`, `find_path`, and `ask` (NL→Cypher via GraphRAG).

Quickstart — Claude Code:

```bash
# 1. Install (in any venv with the cgraph package on PATH)
pip install code-graph # or: uv pip install code-graph

# 2. Register with Claude Code
claude mcp add-json code-graph '{
  "command": "cgraph-mcp",
  "env": {
    "FALKORDB_HOST": "localhost",
    "FALKORDB_PORT": "6379",
    "CODE_GRAPH_AUTO_INDEX": "true"
  }
}'

# 3. Drop agent guidance into your repo
cd /path/to/your/repo
cgraph init-agent # writes CLAUDE.md and .cursorrules
```

Quickstart — Docker Compose:

```bash
docker compose up -d falkordb # start the DB
docker compose --profile mcp run --rm -i code-graph-mcp # attach via stdio
```

The MCP server auto-bootstraps FalkorDB if it's missing on localhost
(via `cgraph ensure-db`). When `CODE_GRAPH_AUTO_INDEX=true` is set,
the current working directory is indexed automatically on start.

**Transport:** Phase 1 is stdio only. HTTP/SSE is deferred.

## Running with Docker

### Using Docker Compose
@@ -232,18 +274,34 @@ docker compose up --build

This starts FalkorDB and the CodeGraph app together. The checked-in compose file sets `CODE_GRAPH_PUBLIC=1` for the app service.

To run the **MCP stdio server** instead of the web app from the same
image, set `CGRAPH_MODE=mcp` and use the `mcp` profile:

```bash
docker compose --profile mcp run --rm -i code-graph-mcp
```

### Using Docker directly

```bash
docker build -t code-graph .

# Web mode (default)
docker run -p 5000:5000 \
  -e FALKORDB_HOST=host.docker.internal \
  -e FALKORDB_PORT=6379 \
  -e MODEL_NAME=gemini/gemini-flash-lite-latest \
  -e GEMINI_API_KEY=<YOUR_GEMINI_API_KEY> \
  -e SECRET_TOKEN=<YOUR_SECRET_TOKEN> \
  code-graph

# MCP stdio mode (same image)
docker run --rm -i \
  -e CGRAPH_MODE=mcp \
  -e FALKORDB_HOST=host.docker.internal \
  -e FALKORDB_PORT=6379 \
  -e MODEL_NAME=gemini/gemini-flash-lite-latest \
  code-graph
```

## Creating a Code Graph