bench: 3-config SWE-bench harness (baseline/lsp/code_graph_mcp) by DvirDukhan · Pull Request #693 · FalkorDB/code-graph

DvirDukhan · 2026-05-28T09:33:12Z

Prerequisites (merge order)

None — branches directly off staging; can merge independently.

Summary

End-to-end benchmark harness for evaluating code-graph against baseline and LSP on SWE-bench Verified. Three configurations:

baseline — bash only
lsp — bash + multilspy/jedi
code_graph_mcp — bash + cg-mcp JSON-RPC stdio CLI against cgraph-mcp

Note: an earlier HTTP arm (code_graph — cg HTTP CLI against the FastAPI service) was dropped from this PR. The benchmark now evaluates code-graph only over the cgraph-mcp stdio transport, which is the real-world agent integration (Claude Code, Cursor). The HTTP↔MCP parity PRs (#695, #696) were closed alongside this. The HTTP arm's measured numbers are retained in the results table below for historical comparison.

Includes resume support, per-instance timeouts, tree-sitter fast resolver (T15 + T18), MCP auto-init (T12-T14), a tool-usage rate metric to detect silent fallback to bash, and the official swebench.harness.run_evaluation Docker-backed verifier with retroactive regrade CLI.

Verified results (Sonnet 4.5, n=10, step-75, official SWE-bench Docker harness)

config	resolved	resolve rate	median tokens	Δ vs baseline	tool-usage
baseline	9/10	90%	1,137,823	—	—
lsp	10/10	100%	885,624	−22.2%	27%
code_graph (removed HTTP arm)	9/10	90%	881,397	−22.5%	12%
code_graph_mcp	9/10	90%	790,482	−30.5%	10%

All resolves checked via the official harness (per-instance Docker images, real FAIL_TO_PASS + PASS_TO_PASS selection). sympy-19040 is the one task that baseline, code_graph, and code_graph_mcp all fail; only lsp solves it. The code_graph row reflects the now-removed HTTP arm and is kept only for reference.

Token efficiency at a glance

code_graph_mcp saves 30.5% median tokens vs baseline while matching baseline accuracy
All tool tracks beat baseline tokens by 22-30%
Resolve rates are within 1 task of each other across configs
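The Δ-vs-baseline column can be reproduced from the median token counts in the results table. The sketch below is only a sanity check of that arithmetic; the dictionary and function names are illustrative and not taken from bench/report.

```python
# Median tokens per config, copied from the results table above.
MEDIAN_TOKENS = {
    "baseline": 1_137_823,
    "lsp": 885_624,
    "code_graph": 881_397,      # removed HTTP arm, kept for reference
    "code_graph_mcp": 790_482,
}


def delta_vs_baseline(tokens: int, baseline: int) -> float:
    """Relative change in percent; negative means fewer tokens than baseline."""
    return (tokens - baseline) / baseline * 100.0


baseline = MEDIAN_TOKENS["baseline"]
deltas = {
    cfg: round(delta_vs_baseline(tokens, baseline), 1)
    for cfg, tokens in MEDIAN_TOKENS.items()
}

for cfg, delta in deltas.items():
    print(f"{cfg:>15}: {delta:+.1f}%")
```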

Engineering hardening shipped in this branch

38d2411 silence cgraph-mcp stderr (was bloating agent context 9×)
bbb5d95 bump default cgraph-mcp timeout 60s → 300s for sympy/django
aa850d6 tool-availability precheck + tool-usage rate metric (caught the silent-fallback regression that almost shipped)
4a6956e defensive stdin redirect on cg-mcp/lsp shims + anti-fallback preamble rules
4daad7e rewrite verifier to use the official swebench Docker harness (the previous one ran modern pytest 8 against legacy worktrees and graded every trajectory failed)
bfdf60d gitignore harness output
drop HTTP code_graph arm; standardize on baseline/lsp/code_graph_mcp
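The tool-usage rate introduced in aa850d6 is not defined on this page. One plausible reading, assumed here, is the share of an agent's bash calls that go to the config's own tool rather than to grep/sed/cat. The function name and the per-bucket input shape are hypothetical.

```python
def tool_usage_rate(call_counts: dict[str, int], tool_names: set[str]) -> float:
    """Fraction of all bash calls that hit one of the config's tools.

    call_counts maps the first token of each bash command to how often it ran.
    This definition is an assumption, not the harness's documented metric.
    """
    total = sum(call_counts.values())
    if total == 0:
        return 0.0
    tool_calls = sum(n for name, n in call_counts.items() if name in tool_names)
    return tool_calls / total


# A trajectory that silently fell back to bash shows up as a rate of 0.0.
fallback = tool_usage_rate({"grep": 12, "sed": 5, "cat": 3}, {"cg-mcp"})
healthy = tool_usage_rate({"grep": 6, "cg-mcp": 2, "sed": 2}, {"cg-mcp"})
```

A rate of zero on a tool track is the signal worth alerting on, since it means the token numbers for that trajectory measure plain bash rather than the tool.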

Out of scope for this PR

Headline n=40 with Opus 4.5 — verifier is unblocked, just needs compute budget
pyright LSP adapter (currently jedi via multilspy) for the production-realistic LSP track

Draft for review of the harness mechanics + early numbers. Not for merge until headline n=40 lands.

Summary by CodeRabbit

Release Notes

New Features
- Added health-check endpoint for server status and runtime configuration monitoring.
Improvements
- Refactored Q&A system for code graphs to use direct query execution for improved performance.
Dependencies
- Updated graphrag-sdk from 0.8.x to 1.1.x.

staging-->main

Scaffold for the code-graph vs LSP vs baseline benchmark. No runners yet — just the directory layout, locked-in tool bundles per config, default run config, and the glossary in CONTEXT.md. Both originally-planned pre-reqs (graphrag-sdk 0.8 -> 1.1.1 upgrade, MCP-T15 tree-sitter base class refactor) are deferred as non-blockers for this workstream; rationale in the session plan. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Updates from the round-2 grill:
- Outcome accuracy only; drop intrinsic suite (Q1)
- code-graph tools = primitives only; no GraphRAG chat (Q2)
- Tools in-container; single-file re-index on edit via note_edit (Q3)
- Token cost and indexing cost reported separately, never combined (Q4)
- LSP responses shimmed (cap 50, trim hover); spec in shim.yaml (Q5)
- Pass@1 + retry failures 2x (Q6)
- Symmetric one-paragraph preambles per config (Q7)
- Drop RepoBench (Q8)
- Drop opencode qualitative track (Q9)
- Three-stage rollout: smoke / calibration / headline (Q10)
- 50-task random sample from SWE-bench Verified, seed committed (Q11)
graphrag-sdk upgrade kept in scope per explicit user override.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The v1 SDK is a ground-up rewrite around document ingestion: the v0 KnowledgeGraph class (which we wrapped around an already-populated FalkorDB graph for /api/chat text-to-Cypher) is gone, and the new GraphRAG facade expects to own the graph via its ingestion pipeline with embeddings. There is no public primitive for 'wrap an existing graph and chat over it'. code-graph builds graphs through dedicated language analyzers, not ingestion, so we now keep the text-to-Cypher pipeline in-house in api/llm.py: generate Cypher from question + ontology, execute via the existing FalkorDB async client, synthesize an answer. We still use graphrag-sdk's LiteLLM provider as a thin LiteLLM wrapper to keep retry logic. Ontology is now a plain string in the prompt instead of the old Ontology/Entity/Relation object tree (which is also gone in v1). The /api/chat endpoint surface (ask(repo_name, question) -> str) is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

bench/metrics/ parses SWE-agent trajectory JSON into per-task TaskMetrics rows: input/output tokens, tool-call counts (with per-tool breakdown), patch, outcome. Defensive about trajectory-shape drift between SWE-agent versions (history vs trajectory vs steps; openai-style tool_calls vs SWE-agent action.command). bench/report/ aggregates those rows into a per-config table with median + p90 tokens and Δ-vs-baseline. The summary picks the best run per task (resolved > failed) so retries don't double-count. 10 unit tests cover token extraction, both tool-call shapes, the retry-merge rule, and the markdown delta column. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
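The retry-merge rule and the median/p90 aggregation described above might look roughly like this. `Row` is a hypothetical stand-in for TaskMetrics, and the nearest-rank p90 is an assumption, since the percentile method used by bench/report is not stated.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass
class Row:  # hypothetical stand-in for a TaskMetrics row
    task: str
    config: str
    tokens: int
    outcome: str  # "resolved" or "failed"


def best_per_task(rows):
    """Keep one row per (config, task), preferring resolved over failed."""
    best = {}
    for row in rows:
        key = (row.config, row.task)
        kept = best.get(key)
        if kept is None or (kept.outcome != "resolved" and row.outcome == "resolved"):
            best[key] = row
    return list(best.values())


def p90(values):
    """Nearest-rank 90th percentile (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.9 * len(ordered)))
    return ordered[rank - 1]


def summarize(rows):
    """Aggregate the merged rows into one summary dict per config."""
    by_config = {}
    for row in best_per_task(rows):
        by_config.setdefault(row.config, []).append(row)
    return {
        cfg: {
            "n": len(items),
            "resolved": sum(r.outcome == "resolved" for r in items),
            "median_tokens": median([r.tokens for r in items]),
            "p90_tokens": p90([r.tokens for r in items]),
        }
        for cfg, items in by_config.items()
    }
```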

bench/agents/code_graph_adapter.py exposes the seven tools the code-graph SWE-agent config gets:
- graph_entities, get_neighbors, find_paths, auto_complete: thin wrappers over the existing FastAPI surface.
- find_symbol: exact-name lookup, built client-side on top of auto_complete so we don't grow the server surface.
- note_edit: incremental re-index hook the agent must call after every write_file/edit. Currently routes through analyze_folder on the dirname; degrades gracefully if the call fails.
Crucially, GraphRAG is NOT exposed (Q2 grill decision: nested-agent double-counting). Both class-style (CodeGraphClient context manager) and function-style (graph_entities(...) etc.) are provided; the function form is what SWE-agent's tool registry needs.
9 unit tests using httpx.MockTransport cover all seven methods, the bearer-token auth header, 4xx propagation, and note_edit's non-fatal failure path.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

bench/runners/index_cache.py tracks which <repo>@<commit> pairs code-graph has already analyzed, so re-running the benchmark doesn't pay the indexing cost twice. Backed by a single JSON file under bench/cache/. Atomic via tmp-file replace. This module doesn't run analysis itself — that's done via code-graph's existing /api/analyze_folder endpoint. This is just the bookkeeping the runner consults before deciding to re-index. 6 unit tests cover record/lookup, cross-instance persistence, forget, and overwrite semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
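A minimal sketch of such a bookkeeping cache, assuming a flat JSON object keyed by `<repo>@<commit>`. The class and method names are illustrative; only the tmp-file-then-replace write pattern is taken from the description above.

```python
import json
import os
import tempfile
from pathlib import Path


class IndexCache:
    """Remembers which <repo>@<commit> pairs were already indexed."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self) -> dict:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def _save(self, data: dict) -> None:
        # Write to a temp file in the same directory, then swap it in.
        # os.replace is atomic on the same filesystem, so readers never
        # observe a half-written cache file.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(data, fh)
        os.replace(tmp, self.path)

    def record(self, repo: str, commit: str, info=None) -> None:
        data = self._load()
        data[f"{repo}@{commit}"] = info or {}
        self._save(data)

    def lookup(self, repo: str, commit: str):
        return self._load().get(f"{repo}@{commit}")

    def forget(self, repo: str, commit: str) -> None:
        data = self._load()
        data.pop(f"{repo}@{commit}", None)
        self._save(data)
```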

bench/agents/lsp_adapter.py wraps multilspy's SyncLanguageServer behind the same response shim spec'd in bench/tools/lsp/shim.yaml: cap results at 50, trim hover to 1 signature line + 1 docstring sentence, locations as {path, line, col}.
Tools exposed: goto_definition, find_references, hover, document_symbols
Notes on the LSP backend choice:
- The plan originally specified pyright; multilspy >= 0.0.15 is required for that, but the pinned multilspy fork (AviAvni/multilspy@python-init-params, used by api/analyzers) is older. Using jedi-language-server matches the rest of the repo and avoids a divergent dep tree. Shim normalizes responses so jedi-vs-pyright doesn't affect the validity comparison.
- workspace_symbols is dropped: the multilspy fork doesn't implement request_workspace_symbol. Agent falls back to bash+grep, which is the realistic LSP-world fallback too.
- MultilspyConfig must be built via from_dict for this fork (constructor doesn't set all fields JediServer expects).
Register pytest 'slow' marker in pyproject.toml; the 3 jedi roundtrip tests are slow but currently complete in <4s on a warm cache. Run them with -m slow or default; skip with -m 'not slow'.
CONTEXT.md and bench/tools/lsp/tools.yaml updated to match.
10 tests pass: 7 shim units + 3 real jedi roundtrips (goto_definition, hover, document_symbols).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
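The response shim could be sketched as below. The input shape follows the standard LSP `Location` object (`uri` plus `range.start`), and the hover text is assumed to put the signature on its first line with the docstring after it. Both are assumptions about what the adapter receives, and the function names are illustrative.

```python
MAX_RESULTS = 50  # cap from the shim spec


def shim_locations(raw_locations):
    """Normalize LSP Location dicts to {path, line, col}, capped at MAX_RESULTS."""
    out = []
    for loc in raw_locations[:MAX_RESULTS]:
        start = loc["range"]["start"]
        out.append({
            "path": loc["uri"].removeprefix("file://"),
            "line": start["line"],        # kept zero-based, as LSP reports it
            "col": start["character"],
        })
    return out


def trim_hover(text: str) -> str:
    """Keep one signature line plus the first docstring sentence."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ""
    signature, doc = lines[0], " ".join(lines[1:])
    if not doc:
        return signature
    first_sentence = doc.split(". ", 1)[0].rstrip(".") + "."
    return f"{signature}\n{first_sentence}"
```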

Pivots the harness from SWE-agent to mini-swe-agent. Upstream now recommends mini-, and its bash-only tool surface is a simpler integration: each config is a PATH prefix plus a system_preamble.md, not a per-config tools.yaml.
What this adds:
- bench/runners/mini_runner.py: wraps DefaultAgent + LocalEnvironment, per-config env wiring (PATH for lsp/code_graph, baseline untouched), trajectory + diff capture, JSONL append via bench.metrics. Includes a stub LLM model that exercises the entire loop without any network calls so the harness is testable today.
- bench/cli/cg.py, bench/cli/lsp.py: bash-callable CLIs wrapping the existing CodeGraphClient and LSP adapter. These are what the agent invokes via bash.
- bench/tools/{baseline,lsp,code_graph}/system_preamble.md: symmetric one-page preambles per the locked-in grill decision.
- bench/metrics: extended to also parse mini-swe-agent trajectory shape (messages[*].extra.response.usage and extra.actions[*].command). Buckets bash commands by first token; the COMPLETE_TASK submit protocol is bucketed as 'submit'.
- tests/test_bench_runner.py: 10 tests, all run offline (no LLM): smoke, env wiring, persistence, CLI argparse smoke.
- CONTEXT.md + plan.md: reflect mini-swe-agent + jedi pivots.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
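The first-token bucketing of bash commands can be illustrated as follows. How the harness detects the submit protocol is not shown here, so the substring check on `COMPLETE_TASK` is an assumption, as are the function names.

```python
from collections import Counter

SUBMIT_MARKER = "COMPLETE_TASK"  # assumed detection rule for the submit protocol


def bucket(command: str) -> str:
    """Map one bash command to its bucket: 'submit' or the first token."""
    if SUBMIT_MARKER in command:
        return "submit"
    tokens = command.split()
    return tokens[0] if tokens else "<empty>"


def tool_call_counts(commands) -> Counter:
    """Per-tool breakdown of how many bash commands fell into each bucket."""
    return Counter(bucket(cmd) for cmd in commands)


counts = tool_call_counts([
    "grep -rn getmodpath src/",
    "cg find-symbol --name getmodpath",
    "grep -n def src/_pytest/python.py",
    "echo COMPLETE_TASK",
])
```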

Adds --real-run as a mutually exclusive sibling of --dry-run. Real-run prepares a fresh repo per config (no cross-contamination), runs the agent against a synthetic buggy math_utils.py + pytest, then runs pytest to set metrics.outcome to resolved/failed. JSONL append in run_batch can now be deferred via defer_jsonl=True so the smoke loop can write the row once outcome is known. Validated end-to-end against GitHub Models (gpt-4o-mini) using GITHUB_API_KEY=$(gh auth token). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Loads princeton-nlp/SWE-bench_Verified via 'datasets', samples deterministically by seed (20260526) into smoke/calibration/headline stages (3/10/37), and prepares per-instance worktrees by cloning the upstream repo, checking out base_commit, and applying test_patch so FAIL_TO_PASS tests are present. Adds 'datasets' to the bench optional dep group. Adds 'swe_bench' mode to mini_runner alongside dry_run / real_run (mutually exclusive). Verification uses pytest with the FAIL_TO_PASS + PASS_TO_PASS test ids from the dataset row -- best effort because the official harness needs per-repo conda envs, which we don't build yet. 6 new unit tests cover the non-network parts of the loader (field parsing, sampling determinism, n override, pool clamping, path hygiene, task mapping). Worktree prep was validated end-to-end against pytest-dev/pytest-6202. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
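Deterministic staging by seed might be done along these lines. Whether the loader draws one 50-task sample and slices it into stages, as assumed here, or samples each stage separately is not stated; the function name is hypothetical.

```python
import random

SEED = 20260526
STAGE_SIZES = {"smoke": 3, "calibration": 10, "headline": 37}


def sample_stages(instance_ids, seed=SEED, sizes=STAGE_SIZES):
    """Draw one seeded sample and slice it into disjoint stages (assumed scheme)."""
    pool = sorted(instance_ids)  # sort so input order cannot change the sample
    total = min(sum(sizes.values()), len(pool))  # clamp to the pool size
    picked = random.Random(seed).sample(pool, total)
    stages, start = {}, 0
    for name, size in sizes.items():
        stages[name] = picked[start:start + size]
        start += size
    return stages
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps the sample stable even if other code consumes random numbers first.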

bench/report/__main__.py: `uv run python -m bench.report` renders results.jsonl as a per-config summary table with token-delta vs baseline. Validated against the existing real-run smoke results. bench/runners/swebench_verify.py: exports per-config predictions JSONL files in the SWE-bench harness format, optionally invokes `python -m swebench.harness.run_evaluation` (Docker-based), then parses the resulting report.json and patches outcomes back into results.jsonl. 4 new unit tests cover the non-Docker parts. Adds `swebench>=4.0` to the bench optional dep group. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
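For reference, a prediction line in the SWE-bench harness format carries `instance_id`, `model_name_or_path`, and `model_patch`. The sketch below builds such lines for one config; the input row keys (`config`, `instance_id`, `patch`) and the function name are assumptions about the shape of results.jsonl, not the actual swebench_verify.py code.

```python
import json


def to_prediction_lines(rows, config):
    """Serialize one config's rows as SWE-bench prediction JSONL lines."""
    lines = []
    for row in rows:
        if row["config"] != config:
            continue
        lines.append(json.dumps({
            "instance_id": row["instance_id"],
            "model_name_or_path": config,   # the config name doubles as the model label here
            "model_patch": row["patch"],
        }))
    return lines


rows = [
    {"config": "lsp", "instance_id": "pytest-dev__pytest-6202", "patch": "diff --git a/x b/x\n"},
    {"config": "baseline", "instance_id": "pytest-dev__pytest-6202", "patch": ""},
]
lsp_lines = to_prediction_lines(rows, "lsp")
```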

mini_runner.main() now calls dotenv.load_dotenv(.env) at the repo root if present, so users don't have to export ANTHROPIC_API_KEY / ANTHROPIC_API_BASE / GITHUB_API_KEY by hand each shell session. .env.template gains a documented block for the four supported provider configs we've actually tested or have credentials for: direct Anthropic, Azure AI Foundry's Anthropic-passthrough endpoint (/anthropic/v1/messages, x-api-key), GitHub Models, and Azure OpenAI. Most relevant for our setup: Azure AI Foundry → litellm's anthropic/ provider with a custom ANTHROPIC_API_BASE. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Smoke run showed the agent invoked cg exactly once and lsp zero times across all three SWE-bench instances, because the bash shims didn't exist (the agent's `which cg` returned 'cg not found'). The differential between configs was therefore noise.
Fixes:
- Add executable bash shims bench/cli/{cg,lsp} that exec "$BENCH_PYTHON" -m bench.cli.{cg,lsp}. Runner exports BENCH_PYTHON = sys.executable so the venv (with httpx/multilspy) is used.
- Export REPO_NAME for the code_graph config (worktree dirname). The preamble references it; nothing was setting it.
- _ensure_indexed(): POST /api/analyze_folder for each code_graph worktree before running the task, so cg find-symbol returns real results. Skips re-indexing via /api/list_repos precheck.
- Rewrite system preambles to instruct "use cg/lsp BEFORE grep" with an explicit typical-loop, not just a list of subcommands.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Smoke #2 confirmed that even with cg/lsp shims on PATH, indexed repos, REPO_NAME set, and explicit "use cg/lsp first" framing in the system preamble, Claude Opus 4.5 ignored the differentiating tools and fell straight back to grep/sed/cat. The 3-way comparison was real but uninformative: tool choice was identical across configs. This commit adds two new instance templates (INSTANCE_TEMPLATE_LSP and INSTANCE_TEMPLATE_CODE_GRAPH) that embed a 'Required workflow.' block directly in the task description — the first thing the model sees each turn. Selection via load_instance_template(config); baseline keeps the original template. Smoke #3 result: lsp track now invokes 'lsp' 3x, code_graph track invokes 'cg' 5x (including cg auto-complete returning the exact buggy function with line numbers + docstring). The structured-navigation tools are finally exercised, so token deltas measured against baseline are now meaningful signal rather than noise. n=1 finding: both lsp (+128%) and code_graph (+85%) use MORE tokens than baseline on this instance. Bigger preambles + verbose JSON tool replies + occasional retries (cg find-symbol exact-match bug) outweigh any savings. Headline run should scale n or pivot to a function-calling harness. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Smoke #3 revealed cg find-symbol --name <exact> returned [] for symbols the graph clearly contained (cg auto-complete --prefix found the same symbol with full file:line+docstring).
Root cause: the filter compared item['name'] to the requested name, but the /api/auto_complete payload nests the symbol name under item['properties']['name'] (FalkorDB node properties), so the top-level lookup always returned None and nothing matched.
Fix: prefer item['properties']['name'], fall back to item['name'] for flatter shapes the unit tests pass in. Added a regression test that uses the real payload structure.
Verified end-to-end against the live FastAPI service:
    cg find-symbol --repo pytest-dev__pytest-6202__code_graph \
        --name getmodpath
    # -> [{id:2714, labels:[Function], properties:{name,path,doc,...}}]
This was the bug that made the smoke #3 code_graph agent burn 3 of 5 cg calls retrying exact-name lookups before falling back to auto-complete. With this fix, an agent doing the natural workflow (find-symbol -> get-neighbors -> note-edit) should land far fewer wasted calls.
Also: norecursedirs in [tool.pytest.ini_options] to keep pytest from walking into per-instance bench worktrees that ship their own pytest sources (was breaking host pytest's AST rewriter on import).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
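The fixed lookup order (nested `properties.name` first, top-level `name` as fallback) can be sketched like this. The helper names are illustrative, and the sample payload only mirrors the shape quoted above.

```python
def symbol_name(item: dict):
    """Prefer the FalkorDB node property, fall back to a flat 'name' key."""
    props = item.get("properties") or {}
    return props.get("name", item.get("name"))


def find_symbol(items, name: str):
    """Exact-name filter over auto_complete results."""
    return [item for item in items if symbol_name(item) == name]


payload = [
    {"id": 2714, "labels": ["Function"], "properties": {"name": "getmodpath"}},
    {"id": 2715, "labels": ["Function"], "properties": {"name": "getmodpath_old"}},
    {"name": "getmodpath"},  # flatter shape, as used by older unit tests
]
matches = find_symbol(payload, "getmodpath")
```

Comparing only `item['name']` against the first two entries would yield `None` both times, which is exactly the empty-result bug described above.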

Adds the only MCP tool with no pre-existing AsyncGraphQuery backing, inlining a [:CALLS*1..depth] traversal with DISTINCT for cycle safety.
api/mcp/tools/structural.py
- impact_analysis(symbol_id, project, branch, direction='IN', depth=3)
- direction='IN' returns transitive upstream callers (what breaks if you change this symbol); direction='OUT' returns transitive callees.
- depth is clamped to [1, IMPACT_MAX_DEPTH=10]; values above 10 are silently clamped, not rejected, so agents can ask for "very deep" without worrying about the cap.
- _clamp_depth helper accepts int / stringified int, rejects bool / None / non-numeric strings.
- Direction must be 'IN' or 'OUT'; raises before issuing the query.
tests/mcp/test_impact_analysis.py (9 tests)
- Unit: _clamp_depth normalization + garbage rejection; direction validation; tool registered on the MCP app.
- Integration vs the sample-project fixture: upstream(db) and downstream(entrypoint) match the expected.yaml contract; depth=1 from db excludes the 3-hop entrypoint.
- Cycle safety: throwaway A-B + C-A graph; DISTINCT ensures no duplicates and the query terminates.
- JSON serialisability.
tests/mcp/fixtures/expected.yaml
- New impact section with upstream_includes_any_of for db and downstream_includes_any_of for entrypoint.
Out of scope (per ticket): cross-rel traversal (CALLS only for v1), ranking, cross-branch / cross-repo impact.
Closes #654.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
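The depth and direction validation described above might be implemented roughly as follows. Raising `ValueError` on rejected input is an assumption (the commit only says such values are rejected), and the function names are illustrative rather than the actual helpers in structural.py.

```python
IMPACT_MAX_DEPTH = 10


def clamp_depth(value) -> int:
    """Accept an int or stringified int and clamp it to [1, IMPACT_MAX_DEPTH]."""
    # bool is a subclass of int in Python, so it must be rejected explicitly.
    if isinstance(value, bool) or value is None:
        raise ValueError(f"invalid depth: {value!r}")
    if isinstance(value, str):
        try:
            value = int(value.strip())
        except ValueError:
            raise ValueError(f"invalid depth: {value!r}") from None
    if not isinstance(value, int):
        raise ValueError(f"invalid depth: {value!r}")
    return max(1, min(value, IMPACT_MAX_DEPTH))


def validate_direction(direction: str) -> str:
    """Reject anything other than 'IN' or 'OUT' before a query is built."""
    if direction not in ("IN", "OUT"):
        raise ValueError(f"direction must be 'IN' or 'OUT', got {direction!r}")
    return direction
```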

…10/T11)
Bundles three tightly-coupled tickets: T9 builds the per-(project,branch) KnowledgeGraph cache, T10 adds the prompt-override seam, T11 wires both together into the `ask` MCP tool that gives agents natural-language access to the graph.
T9 (#657): api/mcp/graphrag_init.py
- get_or_create_kg(project, branch): process-wide cache keyed by (project, branch). Identity-stable: same key returns the same KG.
- reset_cache() for tests.
- Reuses the hand-coded ontology from api/llm.define_ontology (200+ lines of File/Class/Function descriptions the LLM relies on for Cypher quality). Do NOT replace with auto-extraction.
- Graph name uses the T17 convention `code:{project}:{branch}` so it matches what index_repo writes.
T9: api/llm.py rename
- _define_ontology → define_ontology (drop underscore so it's importable). Internal callers updated. No other call sites in the repo.
T10 (#658): api/mcp/code_prompts.py
- Thin re-export of api.prompts (CYPHER_GEN_SYSTEM/PROMPT, GRAPH_QA_SYSTEM/PROMPT). The value is the seam: when the MCP ask tool needs agent-flavoured prompts (vs human-chat framing), the divergence happens here without touching api/prompts.py.
T11 (#659): api/mcp/tools/ask.py
- ask(question, project, branch=None) MCP tool.
- Uses get_or_create_kg + chat_session().send_message() in an executor so the MCP event loop stays responsive.
- Returns the design-doc-mandated {answer, cypher_query, context_nodes} shape. cypher_query is the transparency requirement so agents can verify the executed query and learn the schema.
- _normalize_response tolerates the graphrag-sdk response shape variance ({response/answer, cypher/query, context/results}).
- Errors are surfaced as a structured {error: ...} payload, never as a transport exception; the agent always sees a valid tool result.
Tests (14 new, all pass with mocked LiteModel; no network in CI):
- tests/mcp/test_code_prompts.py (3): re-exports match originals, __all__ shape, snapshot hash stability.
- tests/mcp/test_graphrag_init.py (5): per-branch graph name, cache identity, distinct keys yield distinct instances, ontology reuse, define_ontology is public.
- tests/mcp/test_ask.py (6): tool registered, normalised payload, alternate response keys, plain-string response, errors surfaced as payload, JSON serialisable.
Full MCP suite still green (48 passed in 27.5s).
Out of scope per tickets: real-LLM E2E (Phase 1.5 with API-key secrets), streaming, multi-turn memory, prompt iteration.
Closes #657, #658, #659.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
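A response normalizer tolerant of the key variance listed above could be sketched as below. The default values for missing keys are assumptions, and the function is an illustration rather than the actual _normalize_response.

```python
def normalize_response(raw) -> dict:
    """Coerce an SDK response into {answer, cypher_query, context_nodes}."""
    if isinstance(raw, str):
        # Plain-string responses carry only the answer.
        return {"answer": raw, "cypher_query": None, "context_nodes": []}

    def first(*keys, default=None):
        """Return the first present, non-None value among the candidate keys."""
        for key in keys:
            if raw.get(key) is not None:
                return raw[key]
        return default

    return {
        "answer": first("response", "answer", default=""),
        "cypher_query": first("cypher", "query"),
        "context_nodes": first("context", "results", default=[]),
    }
```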

Zero-config startup so a fresh user doesn't need to run `cgraph ensure-db` and `index_repo` manually.
api/mcp/auto_init.py
- ensure_falkordb(): on server boot, ping FalkorDB; if unreachable on a localhost host, shell out to `cgraph ensure-db` (reuses the existing CLI's Docker bootstrap rather than duplicating it). Subprocess (not in-process call) so the CLI's stdout JSON doesn't pollute the MCP server's stdio transport. Never raises: server start continues even on bootstrap failure so individual tools can surface their own errors.
- maybe_auto_index(cwd=None, project=None, branch=None): opt-in via CODE_GRAPH_AUTO_INDEX env var (off by default, since indexing a large repo can take minutes and surprising the user on first call is bad UX). Detects current branch via `git rev-parse`, falls back to `_default`. Per-(project, branch) idempotency via a module-level set; second call for the same key is a no-op.
- _truthy helper accepts 1/true/yes/on (case insensitive).
api/mcp/server.py
- main() now runs ensure_falkordb() and maybe_auto_index() before app.run(). Module-level import behaviour unchanged (tests that `import api.mcp.server` don't trigger any I/O).
tests/mcp/test_auto_init.py (9 tests)
- ensure_falkordb: no-op when reachable, runs cgraph when not, skips Docker for remote hosts, handles missing CLI binary.
- maybe_auto_index: skipped when env unset, indexes when opt-in, idempotent across calls for same key, distinct branches each get one auto-index, _truthy semantics.
All mocks (no Docker, no real FalkorDB writes), so the tests run in <2s without external dependencies.
Out of scope per ticket: watch mode / re-indexing on FS change, auto-pulling Docker image (cgraph ensure-db handles that), cross-session state.
Closes #660.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
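The opt-in flag and the per-(project, branch) idempotency can be pictured with a simplified sketch. `auto_index_once` is a hypothetical reduction of maybe_auto_index: it takes the environment and the indexing callable as parameters so it can run without git or FalkorDB.

```python
def truthy(value) -> bool:
    """Accept 1/true/yes/on, case insensitive; anything else is off."""
    return value is not None and value.lower() in {"1", "true", "yes", "on"}


_indexed: set = set()  # module-level record of keys already auto-indexed


def auto_index_once(project: str, branch: str, env: dict, index_fn) -> bool:
    """Index at most once per (project, branch), and only when opted in."""
    if not truthy(env.get("CODE_GRAPH_AUTO_INDEX")):
        return False
    key = (project, branch)
    if key in _indexed:
        return False  # second call for the same key is a no-op
    index_fn(project, branch)
    _indexed.add(key)
    return True
```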

T13, agent guidance bundle:
- Add `api/mcp/templates/` shipped as package data (claude_mcp_section.md, cursorrules.template) with the canonical 8-tool MCP guidance.
- Add `cgraph init-agent [--force]` Typer command that drops CLAUDE.md and .cursorrules into CWD.
- Update AGENTS.md with the MCP tool table and env-var reference.
T14, packaging / image dual-mode:
- `start.sh` dispatches on `CGRAPH_MODE` env var (web|mcp). Default (web) preserves the existing FastAPI behaviour; mcp execs cgraph-mcp.
- `docker-compose.yml` gains an opt-in `code-graph-mcp` service under the `mcp` profile for stdio attach.
- README quickstart section for both `claude mcp add-json` registration and Docker compose profile usage.
Tests: 4 new (CliRunner against tmp_path); MCP suite green at 61 passed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Added scripts/mcp_smoke.py, which drives cgraph-mcp over real stdio (mcp SDK ClientSession + StdioServerParameters) and exercises tool listing, index_repo, search_code, get_callers, impact_analysis.
Findings folded back into claude_mcp_section.md:
- index_repo takes path_or_url (not path) and derives project name from the folder/URL; agents must read it back from the response.
- Collection-returning tools land their array in structuredContent.result, not under a {results: [...]} wrapper.
Smoke result on api/ subgraph: 8 tools listed, 6324 nodes / 6228 edges indexed, all calls returned expected payloads.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds a second bench track that exercises code-graph through the exact transport real-world agents use (Claude Code, Cursor, …), namely JSON-RPC over stdio to a spawned `cgraph-mcp` server, instead of HTTP to the FastAPI service.
Files:
- bench/agents/code_graph_mcp_adapter.py: sync Python adapter that spawns cgraph-mcp per call via the official MCP Python SDK. Knows the 8-tool MCP surface (search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path, index_repo, ask).
- bench/cli/cg-mcp + cg_mcp.py: bash-callable CLI shim mirroring the existing `cg` shim. mini-swe-agent only does bash, so each "tool" is one CLI invocation.
- bench/tools/code_graph_mcp/{tools.yaml,system_preamble.md}: agent config for the MCP track. Mirrors code_graph; same Q2 decision to exclude `ask` (no nested LLM in the benchmarked tool set).
- tests/bench/test_cg_mcp_adapter.py: 5 unit + 1 e2e test (FalkorDB-gated AND MCP-server-gated so it skips cleanly until the MCP stack lands on staging).
Heavy e2e validated against the api/ subgraph (~6.3k nodes) over real stdio: search_code -> get_callers -> impact_analysis returned expected payloads.
Depends on the MCP stack (PRs #675–#683) for cgraph-mcp itself. Lands cleanly once that stack merges.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The adapter and shim from the previous commit were inert from the runner's perspective: VALID_CONFIGS only knew baseline/lsp/code_graph. This commit makes `--config code_graph_mcp` a first-class track.
Changes in bench/runners/mini_runner.py:
- VALID_CONFIGS gains "code_graph_mcp" (passes argparse + help string).
- New INSTANCE_TEMPLATE_CODE_GRAPH_MCP: mirrors the HTTP code_graph template but tells the agent to call `cg-mcp` with $PROJECT_NAME + $BRANCH, and to use impact_analysis before non-trivial edits.
- load_instance_template dispatches the new template.
- config_env("code_graph_mcp", ...) prepends venv bin to PATH (so cgraph-mcp is callable from the agent's bash), passes FALKORDB_* through to the spawned MCP server, and exports PROJECT_NAME + BRANCH which the preamble references.
- New _ensure_indexed_mcp() mirrors _ensure_indexed but goes through the bench MCP adapter instead of HTTP. Skip-if-present probe hits FalkorDB's GRAPH.LIST directly (one trip, no MCP spawn).
- Per-instance loop now dispatches to _ensure_indexed_mcp for the new config.
Smoke-verified that:
- VALID_CONFIGS == ('baseline','lsp','code_graph','code_graph_mcp')
- load_instance_template('code_graph_mcp') contains 'cg-mcp'
- config_env populates PROJECT_NAME/BRANCH/FALKORDB_HOST
Unit tests for the adapter still pass (5 passed, 1 skipped; heavy e2e double-gated on FalkorDB + api.mcp.server availability).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The mini_runner previously printed a [index] WARN line on analyze_folder errors and continued. This meant SWE-bench instances whose path falls outside ALLOWED_ANALYSIS_DIR (e.g. when the API server is started from a sibling worktree) would silently run the agent against a missing code-graph project. The agent's first cg call returns 400 'Missing project ...', the agent falls back to grep/sed, and we get a token count that looks bad for the code_graph track but actually reflects 'tool unavailable'.
Two changes:
* analyze_folder errors and httpx exceptions now raise RuntimeError with the offending path. This stops the run and surfaces the ALLOWED_ANALYSIS_DIR misconfiguration immediately.
* analyze_folder timeout bumped 600s -> 1800s. The 600s default was tight for sympy (~5 MB of Python, ~5000 functions) and caused a timeout during indexing.
This was discovered while running the first real 3-way SWE-bench smoke. With the fix and a corrected ALLOWED_ANALYSIS_DIR, the code_graph track produces sensible numbers (-11% input vs baseline across the smoke sample vs the prior bogus +4.7%).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The Python analyzer hardcoded `environment_path={path}/venv` when starting jedi-language-server via multilspy. When the repo had no venv (the common case for cloned codebases like sphinx, sympy, anything from SWE-bench), jedi raised `InvalidPythonEnvironment` on every `request_definition()` call. analyzer.resolve() then swallowed the exception silently and the indexer produced a graph with DEFINES edges only: zero CALLS, zero EXTENDS. Benchmark validation showed sphinx (5K functions) and sympy (41K functions) had no resolved cross-references at all.
Fix:
- source_analyzer.py: prefer {repo}/venv, then {repo}/.venv, then fall back to the host interpreter's environment (sys.executable's prefix) so jedi always has a valid Python to introspect.
- analyzer.py: log resolve() failures at WARN with file/line context instead of swallowing them silently, so the next regression is loud.
Verified: re-indexed sphinx-doc/sphinx-9230 with the fix: DEFINES: 5640, CALLS: 4931, EXTENDS: 484 (was DEFINES-only).
Fixes #685.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
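The environment fallback order from the fix can be sketched as follows. The function name is hypothetical, and `sys.prefix` is used here as the host interpreter's environment root, which is an interpretation of "sys.executable's prefix".

```python
import sys
from pathlib import Path


def pick_environment_path(repo_root) -> str:
    """Prefer {repo}/venv, then {repo}/.venv, else the host interpreter's prefix."""
    root = Path(repo_root)
    for candidate in (root / "venv", root / ".venv"):
        if candidate.is_dir():
            return str(candidate)
    # No project venv: hand jedi the environment this process runs in,
    # so it always has a valid Python to introspect.
    return sys.prefix
```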

Two production-quality fixes from the calibration run that crashed at 14/30 trajectories:
1. Resume support: skip (instance, cfg) pairs whose trajectory file already exists. Lets us recover from crashes/kills without re-running completed work (avoids ~$3 of wasted compute on this run).
2. Ignore pathological files at index time: sympy/integrals/rubi/rules contains auto-generated 3000-line files with hundreds of unresolvable symbols per line. jedi spends hours and never makes progress. Adding it to the default ignore list unblocks sympy-19040 (and other sympy instances) without affecting graph quality.
Also expanded default ignore set: __pycache__, build, dist, .tox, .eggs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
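Resume support of this kind reduces to a file-existence check before each run. In the sketch below the on-disk layout (`<out_dir>/<config>/<instance_id>.traj.json`) and both function names are hypothetical; only the skip-if-trajectory-exists rule comes from the commit.

```python
from pathlib import Path


def trajectory_path(out_dir, instance_id: str, config: str) -> Path:
    """Hypothetical layout for where a finished trajectory is written."""
    return Path(out_dir) / config / f"{instance_id}.traj.json"


def pending_pairs(out_dir, instance_ids, configs):
    """Return the (instance, config) pairs that still need to run."""
    return [
        (instance_id, config)
        for instance_id in instance_ids
        for config in configs
        if not trajectory_path(out_dir, instance_id, config).exists()
    ]
```

With 10 instances and 3 configs, a run that crashed after 14 trajectories would resume with the remaining 16 pairs.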

In source_analyzer.second_pass, the list of files we iterate can include paths that first_pass did not add to self.files (e.g. parse errors, LSP-induced timeouts, or rare edge cases where a candidate file is present in the input list but never makes it into the files map). Previously this raised KeyError and aborted the entire index. Hit on sympy/polys/distributedmodules.py during bench calibration of sympy-12481. Skip with a WARN log instead so a single bad file no longer takes down the whole index.
Also bump mini_runner httpx timeout 1800s -> 7200s: the sympy-12481 index was observed taking >30 min in the field, so the API server finished indexing successfully but the runner had already given up.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace jedi-based resolution with a pure tree-sitter static resolver behind CODE_GRAPH_PY_RESOLVER=tree_sitter. Default remains jedi for backwards compatibility.
Benchmark on pytest-dev/pytest-6202 (204 files):
- jedi: 247.1s wall, CALLS=1976, EXTENDS=71
- tree-sitter: 6.9s wall, CALLS=4833, EXTENDS=83
~36x speedup, broader call recall (jedi returns None ~80% of the time).
Mechanism:
- TreeSitterPythonResolver builds a project-wide symbol table (top-level funcs/classes/assigns, class methods, import maps) keyed by id(files) for lazy construction.
- Resolution: head lookup (local module -> import map -> cross-project bare-name fallback) + tail walk through attributes and class methods.
- Handles relative imports, aliased imports, import-of-package, Optional[T]/generic_type subscript unwrapping.
- AbstractAnalyzer.needs_lsp() hook + PythonAnalyzer override let source_analyzer skip LSP startup and venv setup entirely when the static resolver is active. This is where the wall-time win actually lives (jedi warm-up was ~240s of the 247s baseline).
Closes #689.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
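The head-lookup order (local module, then import map, then cross-project bare-name fallback) can be illustrated with plain dictionaries. This is a toy model, not TreeSitterPythonResolver: the table shapes are invented for the example, and accepting the bare-name fallback only when it is unambiguous is an assumption.

```python
def resolve_head(name, module, local_symbols, import_maps, global_index):
    """Resolve the first segment of a dotted reference to a qualified symbol."""
    # 1. Symbols defined in the referencing module itself.
    local = local_symbols.get(module, {})
    if name in local:
        return local[name]
    # 2. Names the module imported (aliases already applied in the map).
    imported = import_maps.get(module, {})
    if name in imported:
        return imported[name]
    # 3. Project-wide bare-name fallback, taken only when unambiguous (assumed).
    candidates = global_index.get(name, [])
    return candidates[0] if len(candidates) == 1 else None


local_symbols = {"pkg.a": {"helper": "pkg.a.helper"}}
import_maps = {"pkg.a": {"load": "pkg.io.load"}}
global_index = {"parse": ["pkg.parser.parse"], "run": ["pkg.x.run", "pkg.y.run"]}
```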

AbstractAnalyzer._captures was recompiling its query string on every call. cProfile on pytest-dev/pytest-6202 (204 files) showed tree_sitter.Language.query consuming 3.03s of the 6.36s first_pass: ~48% of analyzer time spent rebuilding queries that never change.

Cache them on the analyzer instance, keyed by pattern string. Also switches from the deprecated language.query() to the Query(language, pattern) constructor.

Wall-time on pytest-6202 (CODE_GRAPH_PY_RESOLVER=tree_sitter):
- before: 6.9s
- after: 3.7s

Benefits every tree-sitter analyzer (Python, JavaScript, Kotlin), not just the new static resolver.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
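The caching idea above, sketched without a tree-sitter dependency: the compile step is injected as a callable, so the example runs anywhere. `QueryCache` is a hypothetical name. In the real analyzer the callable would be something like `lambda p: Query(language, p)`.

```python
from typing import Callable, Generic, TypeVar

Q = TypeVar("Q")


class QueryCache(Generic[Q]):
    """Compile each pattern string once and reuse the compiled object."""

    def __init__(self, compile_query: Callable[[str], Q]) -> None:
        self._compile = compile_query
        self._cache: dict[str, Q] = {}

    def get(self, pattern: str) -> Q:
        if pattern not in self._cache:
            # First use of this pattern: pay the compile cost exactly once.
            self._cache[pattern] = self._compile(pattern)
        return self._cache[pattern]
```

Keying by the pattern string is safe here because the patterns are constants of the analyzer; a cache keyed on mutable inputs would need invalidation.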

…tter resolver

When the API server is launched without CODE_GRAPH_PY_RESOLVER=tree_sitter the PythonAnalyzer silently falls back to the jedi/multilspy path. On real-world repos (sphinx-doc/sphinx-8035, sympy, …) that path calls `python3 -m venv venv && pip install poetry && poetry install` per repo then runs jedi over the full transitive dep tree; we observed it wedge the server at 100% CPU + 3.5 GB RSS for 3+ hours with no progress.

bench/scripts/start-api.sh already exports CODE_GRAPH_PY_RESOLVER, but a human-launched `uvicorn api.index:app …` won't pick it up and the bench silently degrades to the slow path.

This commit makes the failure mode loud:

1. `GET /api/_health` returns {status, py_resolver, falkordb_host, falkordb_port, public}. Cheap (no DB call), unauth'd.
2. `_ensure_indexed` in the mini_runner calls /api/_health before any indexing and raises a clear RuntimeError when py_resolver != 'tree_sitter', pointing the operator at bench/scripts/start-api.sh.

Verified: sphinx-doc__sphinx-8035 indexes in ~68s end-to-end with the new server (vs hours unbounded before).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
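A sketch of the fail-fast check described above, operating on an already-parsed `/api/_health` response so that no HTTP client is needed. The function name and the wording of the error are assumptions; only the `py_resolver` field and the pointer to `bench/scripts/start-api.sh` come from the commit message.

```python
from typing import Any, Mapping


def require_tree_sitter(health: Mapping[str, Any]) -> None:
    """Raise before indexing if the server would use the slow resolver."""
    resolver = health.get("py_resolver")
    if resolver != "tree_sitter":
        raise RuntimeError(
            f"API server reports py_resolver={resolver!r}; expected 'tree_sitter'. "
            "Restart it via bench/scripts/start-api.sh."
        )
```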

… raises timeout

The MCP adapter spawns a fresh cgraph-mcp stdio server per call. When the caller shell did not export CODE_GRAPH_PY_RESOLVER, the spawned server fell back to the legacy jedi/multilspy resolver, which runs 'python -m venv && pip install poetry && poetry install' per repo and then analyzes the full transitive dep tree. On full SWE-bench worktrees this wedges for >15 min; we observed it timing out indexing sympy__sympy-20154 and sympy__sympy-19040 during a fresh Opus calibration run.

Mirror the start-api.sh policy: default CODE_GRAPH_PY_RESOLVER to tree_sitter in _env_for_mcp() so the MCP track is symmetric with HTTP regardless of caller env.

Also bump the per-call timeout default 300s -> 900s in both the adapter (CGRAPH_MCP_TIMEOUT_SEC) and the cg-mcp CLI for headroom on cold MCP spawns over big repos.

Validated: sympy-20154 (591 .py files, ~49k nodes, ~344k edges) indexes end-to-end via MCP in 220 s with the new default, vs >900 s timeout before. HTTP path on the same repo: 95 s; ~2.3x slower over the stdio spawn is expected and well within the new timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
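The env-defaulting policy above amounts to a `setdefault` on a copy of the caller's environment. The sketch below is illustrative: the function names and the optional `base` parameter are assumptions, while the variable names and the 900 s default come from the commit message.

```python
import os
from typing import Mapping, Optional


def env_for_mcp(base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy the caller's environment, defaulting the resolver to tree_sitter."""
    env = dict(os.environ if base is None else base)
    # setdefault keeps an explicit caller choice and only fills the gap.
    env.setdefault("CODE_GRAPH_PY_RESOLVER", "tree_sitter")
    return env


def call_timeout_sec(env: Mapping[str, str]) -> float:
    """Per-call timeout, overridable via CGRAPH_MCP_TIMEOUT_SEC."""
    return float(env.get("CGRAPH_MCP_TIMEOUT_SEC", "900"))
```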

Opus n=10 calibration showed the code_graph track spending +14.5% input tokens vs baseline, and code_graph_mcp +19.5%, driven by three things that compound over a 70-80-turn trajectory:

1. /api/get_neighbors returns a verbose vars(node)/vars(edge) dump that includes empty 'properties: {}' and empty 'alias: ""' on every edge, plus per-node 'doc' blocks. Every byte we hand back is re-fed to the LLM on every subsequent turn, so a single 20 KB neighbors call ends up billed ~50x.
2. JSON was pretty-printed (indent=2). Whitespace is free for humans, not for token counts.
3. System preambles were 2.8-3.6 KB of duplicated mini-swe-agent submission boilerplate + repeated rules-of-thumb.

This is a bench-layer-only change (the React frontend and the core API contract are untouched):

- bench/cli/cg.py:
  - _compact_neighbors strips empty properties/alias and projects nodes to {id, label, name, file, line}.
  - _compact_symbols same for find-symbol/auto-complete.
  - New --limit flag on get-neighbors (default 50; 0 = unlimited).
  - JSON now emits with separators=(',', ':').
- bench/cli/cg_mcp.py: same compact JSON formatting.
- system_preamble.md (code_graph, code_graph_mcp): rewritten as ~1.1 KB instead of 2.9-3.6 KB, keeping the workflow and sub-command listing but dropping mini-swe-agent's own submission boilerplate.

Validated against the live indexed graphs:

- pytest-6202 cold neighbor call: 400 -> 148 bytes (-63%)
- sympy-19040 hot neighbor (id=23432, 2039 outgoing edges):
  - raw: 747 KB
  - compacted unlim: 252 KB (-66%)
  - compacted lim=50: 6.2 KB (-99.2%)
- sympy-19040 auto-complete prefix='solve': 11.4 KB -> 0.6 KB (-94.6%)

Expected effect on the next Opus run: cg* input-token cost drops below baseline rather than above it, while the agent still gets the same structural information.

All 24 per-branch-graph tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
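The compaction steps above can be sketched as follows. The `{"nodes": [...], "edges": [...]}` payload shape and the helper names are assumptions for illustration; the projected node fields, the empty properties/alias rule, the limit semantics (0 = unlimited) and the `separators` setting are taken from the commit message.

```python
import json

NODE_FIELDS = ("id", "label", "name", "file", "line")


def compact_node(node: dict) -> dict:
    # Project to the fields the agent acts on; 'doc' and the rest are dropped.
    return {key: node[key] for key in NODE_FIELDS if key in node}


def compact_edge(edge: dict) -> dict:
    # Drop 'properties' and 'alias' only when they are empty.
    return {
        key: value
        for key, value in edge.items()
        if not (key in ("properties", "alias") and not value)
    }


def compact_neighbors(payload: dict, limit: int = 50) -> str:
    nodes = [compact_node(n) for n in payload.get("nodes", [])]
    edges = [compact_edge(e) for e in payload.get("edges", [])]
    if limit > 0:  # 0 means unlimited
        nodes, edges = nodes[:limit], edges[:limit]
    # No spaces after ',' or ':' keeps the serialized form as short as possible.
    return json.dumps({"nodes": nodes, "edges": edges}, separators=(",", ":"))
```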

The compaction pass in dc8534e accidentally dropped the COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT submission instruction (which baseline + lsp still had). Without it Opus has no way to signal 'done' and just loops, re-emitting the final diff every turn.

Visible in opus-smoke n=1 (pytest-6202) after dc8534e:

config	msgs	bash_outs	input_tok	vs baseline
baseline	60	28	243k	-
lsp	52	24	209k	-14%
code_graph	235	82	1,043k	+329% <-- looped
code_graph_mcp	hung > 16min on Azure round-trip

Restore the autonomous-agent framing intro + submission section in both preambles. Keep the trimmed workflow / sub-command list. Also add an explicit anti-loop rule ('do not call the same cg query twice for the same symbol').

Net: ~30% smaller than original (was 60% with the broken trim), and now has the must-have completion contract.

Compaction of tool *output* (in cg.py / cg_mcp.py) is unchanged and still valid; this only touches the preambles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

LocalEnvironment.execute uses subprocess.run(shell=True, timeout=N). On timeout Python kills only the immediate shell PID; grandchildren (typically the python -c snippets the agent uses to reproduce bugs) get reparented to init and run forever.

The Opus n=10 run leaked 4 such pythons from sympy-19040, each at ~100% CPU for 3-4 hours after the trajectory ended (~13 CPU-hours). Confirmed locally with a reproducer that left PPID=1 python alive under upstream LocalEnvironment.

SafeLocalEnvironment spawns each command via Popen(start_new_session=True) so it gets its own process group, then on timeout os.killpg(pgid, SIGKILL) takes the whole subtree down. Output / returncode shape is otherwise identical to upstream so trajectories remain comparable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
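A POSIX-only sketch of the process-group technique described above. It is not the `SafeLocalEnvironment` implementation: the function name and the returned dict shape are assumptions. It relies on the fact that a child started with `start_new_session=True` becomes a session leader, so its process-group id equals its pid.

```python
import os
import signal
import subprocess


def run_with_group_kill(command: str, timeout: float) -> dict:
    """Run a shell command; on timeout, SIGKILL its whole process group."""
    proc = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # new session => new process group (pgid == pid)
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
        timed_out = False
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # shell and all descendants
        except (ProcessLookupError, PermissionError):
            pass  # the group is already gone or holds only zombies
        output, _ = proc.communicate()  # reap the child, collect partial output
        timed_out = True
    return {"output": output, "returncode": proc.returncode, "timed_out": timed_out}
```

Killing the group rather than the shell PID is what keeps grandchildren from being reparented to init and running on after the timeout.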

+            # Best-effort: still try to kill anything we spawned.
+            try:
+                self._kill_process_group(proc.pid)
+            except Exception:


+            if sig is signal.SIGTERM:
+                try:
+                    os.waitpid(pid, os.WNOHANG)
+                except ChildProcessError:


+from __future__ import annotations
+
+import os
+import platform


+from typing import Any
+
+from minisweagent.environments.local import LocalEnvironment, LocalEnvironmentConfig
+from minisweagent.exceptions import Submitted


FastMCP serializes list[dict] tool returns as N separate TextContent chunks (one per item) AND echoes the full list in structuredContent['result']. Our previous _extract returned only the FIRST text chunk, which meant every list-returning tool (search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path) silently truncated to its first element throughout the entire benchmark.

Caught on the n=10 Opus run, sympy-19040 trajectory: agent ran search_code(prefix='factor') hoping to discover dmp_ext_factor, dmp_sqf_part, factor_list, etc. It got back only 'factor_nc' (alphabetically first), retried with --limit 30 (same single result), gave up on the graph entirely and burned 50 python/pytest turns flailing in bash. cg_mcp ran 95 turns vs cg's 73 on the same task, 2.89M vs 1.39M input tokens.

Fix: prefer structuredContent (always carries the full payload), unwrapping the spec's {'result': ...} envelope. Fall back to concatenating + parsing all text chunks so older FastMCP versions still work.

Two new regression tests pin the multi-chunk shape; all 8 existing tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
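A sketch of the extraction order described above, written against duck-typed objects so it runs without the MCP SDK. The function name is hypothetical. The fallback here parses each text chunk separately and collects the results, which is one reading of "concatenating + parsing all text chunks" and should be treated as an assumption.

```python
import json
from typing import Any


def extract(result: Any) -> Any:
    """Return the full tool payload, preferring structuredContent."""
    structured = getattr(result, "structuredContent", None)
    if structured is not None:
        # Unwrap the {'result': ...} envelope when that is the only key.
        if isinstance(structured, dict) and set(structured) == {"result"}:
            return structured["result"]
        return structured

    # Fallback: gather every text chunk, not just the first one.
    items = []
    for chunk in getattr(result, "content", None) or []:
        text = getattr(chunk, "text", None)
        if text is None:
            continue
        try:
            items.append(json.loads(text))
        except json.JSONDecodeError:
            items.append(text)  # keep non-JSON text verbatim
    if not items:
        return None
    return items[0] if len(items) == 1 else items
```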

…th prefix

Iter1 (chunk-fix, 3125946) un-broke list returns from the MCP layer. That correctness fix exposed two unbounded payload sources in cg-mcp output that re-feed every LLM turn:

1. impact_analysis has no --limit. On a large graph (sympy: 142k edges) a depth=3 traversal routinely returns 500+ nodes.
2. Every node's 'file' is an absolute worktree path (/Users/.../worktrees/<project>/sympy/printing/latex.py, ~130 chars). The ~100-char prefix is identical for every entry and contributes nothing actionable.

Real impact, observed on sympy__sympy-12481 iter1:

- single 'impact_analysis --depth 3' returned 82,041 bytes.
- 36 turns × ~82KB re-feed → cost $2.80 (iter0) → $7.28 (iter1).
- Accuracy still 10/10, so this is pure context bloat.

Fix at the CLI layer (bench/cli/cg_mcp.py) so the MCP server tools stay untouched and other clients (Claude, Cursor) keep their full payloads:

- Add '--limit N' (default 50) to impact_analysis subcommand.
- _compact_entry strips '/<project>/' prefix from 'file', drops empty values, preserves all other fields.
- _compact_list applies entry compaction + truncates to limit on every list-returning subcommand (search_code, get_callers, get_callees, get_dependencies, impact_analysis).
- find_path: single entry, just strip path.

Adds 10 regression tests pinning the new contract.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
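The entry and list compaction above might look like the sketch below. The signatures are assumptions (the real helpers are `_compact_entry` and `_compact_list` in bench/cli/cg_mcp.py, whose exact parameters are not shown here). Treating `limit=0` as unlimited is borrowed from the `cg get-neighbors` flag and is likewise an assumption for the MCP CLI.

```python
def compact_entry(entry: dict, project: str) -> dict:
    """Drop empty values and strip the worktree prefix from 'file'."""
    marker = f"/{project}/"
    out = {}
    for key, value in entry.items():
        if value in (None, "", [], {}):
            continue  # empty values carry no information (0 is kept)
        if key == "file" and isinstance(value, str) and marker in value:
            # Keep only the part after the first '/<project>/' segment.
            value = value.split(marker, 1)[1]
        out[key] = value
    return out


def compact_list(entries: list, project: str, limit: int = 50) -> list:
    """Compact every entry, then truncate to the limit (0 = unlimited)."""
    compacted = [compact_entry(e, project) for e in entries]
    return compacted[:limit] if limit > 0 else compacted
```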

Conflict resolution (8 files):

- api/llm.py -> OURS (bench): keep the graphrag-sdk 1.x text-to-Cypher rewrite. Required for consistency with the merged pyproject pin (graphrag-sdk>=1.1.1); staging's llm.py still imports KnowledgeGraph, which 1.x removed, so taking staging would ImportError at load.
- api/analyzers/source_analyzer.py, analyzer.py, python/ts_resolver.py, api/mcp/tools/structural.py -> STAGING: canonical/superset versions (parallel index workers + missing-file guard, refined logging, per-match capture alignment + threading lock, impact_analysis limit).
- tests/analyzers/test_ts_python_resolver.py, tests/mcp/fixtures/expected.yaml, tests/mcp/test_impact_analysis.py -> STAGING: refined fixtures and added regression tests.

Align MCP surface to staging (drop the deliberately-removed `ask` tool):

- Revert api/mcp/tools/__init__.py to structural-only registration.
- Delete api/mcp/tools/ask.py, api/mcp/graphrag_init.py, api/mcp/code_prompts.py and their tests. The ask tool was dropped from staging via #702 and is broken under graphrag-sdk 1.x (needs the 0.8 KnowledgeGraph API and api.llm.define_ontology, both gone in the bench rewrite).

Tests: tests/mcp + tests/analyzers (79) and bench/swebench suites pass (bench_runner needs the mini-swe-agent extra, installed in CI).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds an end-to-end SWE-bench Verified benchmarking harness (baseline/LSP/HTTP code-graph/MCP code-graph) and hardens the MCP server + Docker packaging so the same image can run either the web API or the stdio MCP server.

Changes:

Introduces bench/ runner/metrics/report tooling + extensive unit tests for adapters, shims, and SWE-bench verification/export.
Adds MCP quality-of-life features: auto-start FalkorDB (localhost-only), optional auto-index on startup, and cgraph init-agent to drop bundled agent guidance templates.
Updates packaging/docs/compose/startup to support CGRAPH_MODE=mcp and bumps graphrag-sdk to 1.x with a rewritten in-house text→Cypher pipeline.

Reviewed changes

Copilot reviewed 58 out of 64 changed files in this pull request and generated 11 comments.


File	Description
tests/test_swebench_verify.py	Unit tests for SWE-bench predictions export + report outcome patching + docker-missing behavior.
tests/test_swe_bench_loader.py	Unit tests for SWE-bench dataset loader utilities (parsing, deterministic sampling, worktree paths).
tests/test_bench_runner.py	Dry-run harness tests for runner loop, env wiring, persistence, and CLI shim argparse smoke checks.
tests/test_bench_metrics.py	Unit tests for trajectory parsing (tokens/tools/patch) and report aggregation rendering.
tests/test_bench_lsp_adapter.py	Unit + slow E2E tests for LSP adapter behavior and shim outputs.
tests/test_bench_index_cache.py	Tests for JSON-on-disk index cache registry behavior.
tests/test_bench_code_graph_adapter.py	HTTP adapter tests via `httpx.MockTransport` (payload shaping, auth header, errors).
tests/mcp/test_init_agent.py	Tests for `cgraph init-agent` template install + overwrite behavior.
tests/mcp/test_auto_init.py	Tests for MCP auto-init helpers (ensure-db + auto-index idempotence).
tests/bench/test_cg_mcp_cli_compaction.py	Regression tests ensuring cg-mcp CLI caps/compacts large list payloads and strips worktree prefixes.
tests/bench/test_cg_mcp_adapter.py	Tests for MCP adapter extraction logic + cg-mcp CLI parsing + optional heavy E2E check.
tests/bench/__init__.py	Test package marker.
start.sh	Adds `CGRAPH_MODE` dispatcher to run web server vs MCP server from same image; sends startup logs to stderr.
scripts/mcp_smoke.py	End-to-end MCP smoke script: list tools, index repo, exercise key tools with compact output checking.
README.md	Documents MCP server usage, Claude Code registration, and Docker Compose/Docker invocation for MCP mode.
pyproject.toml	Bumps `graphrag-sdk` to 1.x, adds bench optional deps, pytest markers/norecursedirs, and package-data for MCP templates.
docker-compose.yml	Adds `code-graph-mcp` stdio service under an `mcp` profile, using `CGRAPH_MODE=mcp`.
CONTEXT.md	Glossary/design constraints for benchmark workstream and measurement conventions.
bench/tools/lsp/tools.yaml	LSP tool bundle definition + notes about jedi/multilspy limitations.
bench/tools/lsp/system_preamble.md	System preamble instructing LSP-first navigation and explicit fallback behavior.
bench/tools/lsp/shim.yaml	Shim spec for capping locations and trimming hover payloads.
bench/tools/code_graph/tools.yaml	HTTP code-graph tool bundle definition + explicit exclusion of GraphRAG chat.
bench/tools/code_graph/system_preamble.md	System preamble instructing cg-first navigation and note-edit workflow.
bench/tools/code_graph_mcp/tools.yaml	MCP tool bundle definition for cg-mcp transport track (excluding `ask`).
bench/tools/code_graph_mcp/system_preamble.md	System preamble instructing cg-mcp-first navigation and usage rules.
bench/tools/baseline/tools.yaml	Baseline SWE-agent tool bundle definition (bash/edit/read/write/submit).
bench/tools/baseline/system_preamble.md	Baseline system preamble with submission sentinel protocol.
bench/scripts/start-api.sh	Helper to start API with tree-sitter resolver and public mode for bench runs.
bench/runners/swebench_verify.py	Converts results.jsonl → predictions and runs official Docker SWE-bench harness + patches outcomes.
bench/runners/safe_local_env.py	LocalEnvironment variant that kills full process group on timeouts to prevent orphans.
bench/runners/mini_runner.py	Main mini-swe-agent benchmark runner, config env wiring, indexing prechecks, tool-usage instrumentation, CLI.
bench/runners/index_cache.py	JSON-file-backed registry for indexed `<repo>@<commit>` pairs.
bench/runners/__init__.py	Package marker.
bench/report/__main__.py	CLI to render results.jsonl → markdown report.
bench/report/__init__.py	Aggregation logic producing per-config summaries and markdown rendering.
bench/README.md	Benchmark workstream README (currently outdated vs implemented runner).
bench/metrics/__init__.py	Pure trajectory parsing utilities and TaskMetrics schema + JSONL helpers.
bench/datasets/swe_bench.py	SWE-bench Verified loader/worktree prep + official Docker harness verification wrapper.
bench/datasets/__init__.py	Package marker.
bench/configs/default.yaml	Benchmark config seed/sample sizing and rollout plan configuration artifact.
bench/cli/regrade.py	Retroactive regrading of trajectories via official SWE-bench Docker harness.
bench/cli/lsp.py	Python module implementing the `lsp` CLI wrapper (argparse).
bench/cli/lsp	Bash entrypoint for `lsp` CLI using BENCH_PYTHON.
bench/cli/cg.py	Python module implementing the `cg` HTTP CLI wrapper + payload compaction.
bench/cli/cg-mcp	Bash entrypoint for `cg-mcp` CLI using BENCH_PYTHON.
bench/cli/cg_mcp.py	Python module implementing MCP CLI wrapper + list compaction/path stripping + timeouts.
bench/cli/cg	Bash entrypoint for `cg` CLI (stdin redirected defensively).
bench/cli/__init__.py	Package marker.
bench/agents/lsp_adapter.py	LSP client + shim implementation used by `lsp` CLI and tests.
bench/agents/code_graph_mcp_adapter.py	MCP stdio adapter spawning `cgraph-mcp` per call; extracts structured results.
bench/agents/code_graph_adapter.py	HTTP adapter client + helper tools (`find_symbol`, `note_edit`) and tests.
bench/agents/__init__.py	Package marker.
bench/.gitignore	Ignores bench outputs (cache/jsonl/pycache).
api/mcp/templates/cursorrules.template	Cursor guidance template for MCP tool usage.
api/mcp/templates/claude_mcp_section.md	Claude/Cursor guidance section describing MCP tools and response shapes.
api/mcp/server.py	MCP server entrypoint now runs auto-init helpers before starting stdio transport.
api/mcp/auto_init.py	New: ensure FalkorDB (localhost) and optional auto-indexing of CWD per branch.
api/llm.py	Rewrites text→Cypher chat to avoid removed graphrag-sdk 0.8 KnowledgeGraph API; uses LiteLLM + FalkorDB query + synthesis.
api/index.py	Adds `/api/_health` endpoint for bench fail-fast resolver/DB config detection.
api/cli.py	Adds `cgraph init-agent` command and template bundling path constant.
AGENTS.md	Documents MCP server usage and `init-agent`.
.gitignore	Ignores SWE-bench verifier logs and bench verification output dirs.
.env.template	Adds benchmark runner credential documentation and examples.


 dependencies = [
-    "graphrag-sdk>=0.8.1,<0.9.0",
+    "graphrag-sdk>=1.1.1,<2.0.0",
    "tree-sitter>=0.25.2,<0.26.0",
    "validators>=0.35.0,<0.36.0",
    "falkordb>=1.1.3,<2.0.0",


+async def _run_cypher(repo_name: str, cypher: str) -> list[list[Any]]:
+    if not cypher:
+        return []
+    db = _falkordb()
+    graph = db.select_graph(repo_name)
+    try:


+async def ask(repo_name: str, question: str) -> str:
+    """Answer a natural-language question against the code graph for repo_name."""
+    llm = _build_llm()
+
+    cypher_resp = await llm.ainvoke_messages(
+        [
+            ChatMessage(role="system", content=CYPHER_GEN_SYSTEM.format(ontology=_ONTOLOGY_TEXT)),
+            ChatMessage(role="user", content=CYPHER_GEN_PROMPT.format(question=question)),
+        ]
+    )
+    cypher = _extract_cypher(cypher_resp.content)
+    logger.debug("Generated Cypher: %s", cypher)
+
+    context = await _run_cypher(repo_name, cypher)
+


+    # Silence the server's stderr (analyzer + MCP DEBUG logs). When the
+    # bash shim is invoked by the agent, the server's stderr is merged
+    # into the agent's tool-output buffer, inflating context by ~1.8kB
+    # per call. The agent only needs the JSON-RPC result on stdout.
+    devnull = open(os.devnull, "w")
+    async with stdio_client(params, errlog=devnull) as (read, write):


+import os
+import platform
+import signal
+import subprocess
+from typing import Any


+## Status
+
+**Scaffold only.** Directory layout and contracts exist; runners do not.
+Next steps are tracked in the session's todo list.
+


+When you believe the task is complete, finish your turn with a final
+message that contains a unified diff of your changes inside a fenced
+``` block, then exit. Do not commit; the harness reads the diff via
+`git diff`.
+"""




coderabbitai

Actionable comments posted: 12

♻️ Duplicate comments (3)

bench/agents/code_graph_mcp_adapter.py (1)

120-120: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

File handle is not closed.

The devnull file opened here is never closed, causing a resource leak. This issue has been flagged multiple times in previous reviews.

🔧 Proposed fix using a context manager

-    devnull = open(os.devnull, "w")
-    async with stdio_client(params, errlog=devnull) as (read, write):
-        async with ClientSession(read, write) as session:
-            await asyncio.wait_for(session.initialize(), timeout=timeout)
-            result = await asyncio.wait_for(
-                session.call_tool(name, arguments), timeout=timeout
-            )
-            payload = _extract(result)
-            if getattr(result, "isError", False):
-                return {"error": payload}
-            return payload
+    with open(os.devnull, "w") as devnull:
+        async with stdio_client(params, errlog=devnull) as (read, write):
+            async with ClientSession(read, write) as session:
+                await asyncio.wait_for(session.initialize(), timeout=timeout)
+                result = await asyncio.wait_for(
+                    session.call_tool(name, arguments), timeout=timeout
+                )
+                payload = _extract(result)
+                if getattr(result, "isError", False):
+                    return {"error": payload}
+                return payload

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/agents/code_graph_mcp_adapter.py` at line 120, The code opens
os.devnull into the variable devnull without closing it (devnull =
open(os.devnull, "w")), causing a resource leak; change the usage to either (1)
wrap the open(os.devnull, "w") in a with context so the file is auto-closed
around its use, or (2) replace usages with subprocess.DEVNULL where applicable,
or (3) ensure devnull.close() is called after the last use — locate the devnull
variable in this module (devnull) and update its creation/usage accordingly to
guarantee the file handle is closed.

bench/report/__main__.py (1)

18-18: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unused import: aggregate_to_markdown.

The aggregate_to_markdown function is imported but not used. The script directly calls load_jsonl, summarize, and render_markdown instead.

🧹 Proposed fix

-from bench.report import aggregate_to_markdown, load_jsonl, render_markdown, summarize
+from bench.report import load_jsonl, render_markdown, summarize

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/report/__main__.py` at line 18, Remove the unused import
aggregate_to_markdown from the import list in the module where load_jsonl,
summarize, and render_markdown are used; update the import statement that
currently reads "from bench.report import aggregate_to_markdown, load_jsonl,
render_markdown, summarize" to only import the functions actually used
(load_jsonl, summarize, render_markdown) so there are no unused references to
aggregate_to_markdown.

bench/datasets/swe_bench.py (1)

34-34: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unused import: os.

The os module is imported but never used in this file.

🧹 Proposed fix

 import json
-import os
 import random

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/datasets/swe_bench.py` at line 34, Remove the unused import of the os
module from the top of bench/datasets/swe_bench.py: delete the line "import os"
so there are no unused imports; ensure no other code in this file references os
(e.g., any later functions or top-level code) before removing.

🧹 Nitpick comments (15)

bench/agents/code_graph_mcp_adapter.py (1)

66-67: 💤 Low value

Simplify setdefault calls.

Since env = dict(os.environ) at line 65 already copied all environment variables, the os.environ.get(...) calls in the setdefault arguments are redundant. If the key exists in os.environ, it's already in env; if not, the default should be used directly.

♻️ Proposed simplification

-    env.setdefault("FALKORDB_HOST", os.environ.get("FALKORDB_HOST", "127.0.0.1"))
-    env.setdefault("FALKORDB_PORT", os.environ.get("FALKORDB_PORT", "6379"))
+    env.setdefault("FALKORDB_HOST", "127.0.0.1")
+    env.setdefault("FALKORDB_PORT", "6379")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/agents/code_graph_mcp_adapter.py` around lines 66 - 67, The setdefault
calls on the local dict env are using redundant os.environ.get calls; change the
two lines that set "FALKORDB_HOST" and "FALKORDB_PORT" to call
env.setdefault("FALKORDB_HOST", "127.0.0.1") and env.setdefault("FALKORDB_PORT",
"6379") respectively so defaults are applied directly (locate the env dict
created earlier and the two setdefault calls in code_graph_mcp_adapter.py).

bench/report/__init__.py (1)

35-37: 💤 Low value

Misleading comment about property replacement.

The comment states "not perfectly accurate, replaced below" but the total_tokens property is never replaced or overridden. If this is a known approximation, clarify the comment; otherwise, remove the inaccurate note.

📝 Suggested clarification

     @property
     def total_tokens(self) -> int:
-        return self.median_tokens * self.n_tasks  # not perfectly accurate, replaced below
+        return self.median_tokens * self.n_tasks  # approximation: assumes median represents all tasks

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/report/__init__.py` around lines 35 - 37, The inline comment on the
total_tokens property is misleading because nothing replaces or overrides it
later; update the comment or remove it: in the total_tokens property (method
total_tokens) either delete the "not perfectly accurate, replaced below" note or
replace it with a clear statement that this is an approximation computed as
median_tokens * n_tasks and, if applicable, reference how median_tokens and
n_tasks are derived (median_tokens, n_tasks) so future readers know this is an
estimated total rather than an exact value.

tests/test_swebench_verify.py (2)

74-74: ⚡ Quick win

Ambiguous variable name l.

The single-letter variable l (lowercase L) is easily confused with 1 (digit one) or I (uppercase i). Use a more descriptive name like line or row_json.

📝 Proposed fix

-    rows = [json.loads(l) for l in results.read_text().splitlines()]
+    rows = [json.loads(line) for line in results.read_text().splitlines()]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_swebench_verify.py` at line 74, Rename the ambiguous single-letter
variable `l` in the list comprehension inside tests/test_swebench_verify.py (the
expression creating `rows = [json.loads(l) for l in
results.read_text().splitlines()]`) to a clearer name such as `line` or
`row_json`; update the comprehension to use that name and replace any other uses
of the same variable in the surrounding scope so readability is improved and `l`
isn’t confused with similar characters.

38-38: ⚡ Quick win

Ambiguous variable name l.

The single-letter variable l (lowercase L) is easily confused with 1 (digit one) or I (uppercase i). Use a more descriptive name like line or row_json.

📝 Proposed fix

-    baseline = [json.loads(l) for l in paths["baseline"].read_text().splitlines()]
+    baseline = [json.loads(line) for line in paths["baseline"].read_text().splitlines()]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_swebench_verify.py` at line 38, The list comprehension in the
assignment to baseline uses an ambiguous single-letter variable `l`; change it
to a descriptive name (e.g., `line` or `row_json`) so the expression becomes:
baseline = [json.loads(<new_name>) for <new_name> in
paths["baseline"].read_text().splitlines()] and update any matching usage in
that expression to avoid confusion between `l`, `1`, and `I`.

tests/test_bench_metrics.py (1)

1-166: ⚖️ Poor tradeoff

Consider unittest.TestCase style for consistency.

Based on learnings, the repository's backend test suite under tests/ uses unittest.TestCase style. While these pytest-style tests work well, aligning with the existing convention would improve consistency. Since this would require refactoring multiple test files, it's better suited for a coordinated effort across the test suite.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_bench_metrics.py` around lines 1 - 166, Reviewer asks for
unittest.TestCase style consistency, but this file (tests/test_bench_metrics.py)
intentionally uses pytest-style tests like test_extract_token_usage_sums_history
and test_task_metrics_from_trajectory_round_trip; add a one-line comment at the
top of the file stating that these tests intentionally use pytest style and that
a repository-wide refactor to unittest.TestCase will be done separately, so we
keep the current tests as-is for now to avoid a piecemeal conversion.
Source: Learnings

scripts/mcp_smoke.py (1)

25-36: ⚡ Quick win

Narrow the exception handler to json.JSONDecodeError.

Line 31 catches bare Exception, which is overly broad and could hide unexpected errors (e.g., AttributeError if chunk is malformed). Since the only expected failure here is JSON decode failure, catch json.JSONDecodeError specifically.

🎯 Proposed fix

 def _pretty(result):
     """Pull the first text payload out of a CallToolResult."""
     for chunk in result.content:
         if hasattr(chunk, "text"):
             try:
                 return json.loads(chunk.text)
-            except Exception:
+            except json.JSONDecodeError:
                 return chunk.text
     # No text chunks — show structured content if MCP put the payload there.
     if hasattr(result, "structuredContent") and result.structuredContent is not None:
         return result.structuredContent
     return None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mcp_smoke.py` around lines 25 - 36, The _pretty function is catching
a bare Exception when attempting json.loads(chunk.text); narrow that to
json.JSONDecodeError so unrelated errors (e.g., AttributeError) surface; update
the try/except in _pretty to except json.JSONDecodeError and keep the existing
fallback behavior of returning chunk.text on decode failure (ensure json is
imported where _pretty is defined).
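
To illustrate what the narrower handler changes, here is a minimal, self-contained sketch. `FakeChunk` and `FakeResult` are hypothetical stand-ins for the MCP result objects (they are not the real SDK classes), and `pretty` mirrors the proposed `_pretty` body.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeChunk:
    """Hypothetical stand-in for an MCP text content chunk."""
    text: Any


@dataclass
class FakeResult:
    """Hypothetical stand-in for a CallToolResult."""
    content: list = field(default_factory=list)
    structuredContent: Optional[dict] = None


def pretty(result):
    """Return the first text payload, parsed as JSON when possible."""
    for chunk in result.content:
        if hasattr(chunk, "text"):
            try:
                return json.loads(chunk.text)
            except json.JSONDecodeError:
                # Only malformed JSON falls back to the raw text.
                return chunk.text
    # No text chunks: fall back to structured content if present.
    if getattr(result, "structuredContent", None) is not None:
        return result.structuredContent
    return None
```

With this shape, a chunk whose `text` is not a string (for example `None`) raises `TypeError` from `json.loads` instead of being silently returned, which is the kind of unexpected error the review comment wants surfaced.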

api/mcp/auto_init.py (1)

36-36: 💤 Low value

Consider renaming _AUTO_INDEXED to follow mutable naming convention.

The module-level _AUTO_INDEXED set uses ALL_CAPS naming, which Python conventions typically reserve for immutable constants. Since this is a mutable session cache, consider renaming it to _auto_indexed (lowercase with underscore prefix) to signal its mutability.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/mcp/auto_init.py` at line 36, Rename the module-level mutable cache
symbol from _AUTO_INDEXED to _auto_indexed to reflect mutability; update the
declaration (keep the type annotation set[tuple[str, str]] and the leading
underscore) and replace every reference to _AUTO_INDEXED in this module
(functions, reads/writes) with _auto_indexed so the code remains consistent and
follows Python naming conventions.

api/mcp/server.py (1)

33-34: ⚡ Quick win

Consider logging the auto-init results for better diagnostics.

Both ensure_falkordb() and maybe_auto_index() return status dicts with useful diagnostic information (host, port, action, errors, indexed project/branch), but the results are currently discarded. Logging these results would improve observability and help diagnose bootstrap issues without changing behavior.

📊 Suggested enhancement

 from .auto_init import ensure_falkordb, maybe_auto_index

-    ensure_falkordb()
-    maybe_auto_index()
+    import logging
+    logger = logging.getLogger(__name__)
+    
+    db_status = ensure_falkordb()
+    logger.info("FalkorDB bootstrap: %s", db_status)
+    
+    index_status = maybe_auto_index()
+    logger.info("Auto-index: %s", index_status)
+    
     app.run(transport="stdio")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/mcp/server.py` around lines 33 - 34, Call and capture the returned status
dicts from ensure_falkordb() and maybe_auto_index() (e.g., falkordb_status =
ensure_falkordb(), auto_index_status = maybe_auto_index()) and emit concise
structured logs including those dicts (use existing logger or
logging.info/debug) so host/port/action/errors/indexed project/branch are
recorded; do this without altering the functions' behavior or return values and
ensure the log messages provide clear context like "ensure_falkordb result" and
"maybe_auto_index result".

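The logging suggestion for `api/mcp/server.py` could look roughly like the sketch below. The two bootstrap functions are hypothetical stubs with made-up return values (the real ones live in `api/mcp/auto_init.py` and their exact dict keys are not shown in this review), and the logger name is an assumption.

```python
import logging

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("cgraph_mcp_bootstrap")


def ensure_falkordb() -> dict:
    """Hypothetical stub; returns an illustrative status dict."""
    return {"host": "127.0.0.1", "port": 6379, "action": "reused"}


def maybe_auto_index() -> dict:
    """Hypothetical stub; returns an illustrative status dict."""
    return {"indexed": False, "reason": "disabled"}


def bootstrap() -> tuple:
    """Capture both status dicts and log them before the server starts."""
    db_status = ensure_falkordb()
    logger.info("ensure_falkordb result: %s", db_status)
    index_status = maybe_auto_index()
    logger.info("maybe_auto_index result: %s", index_status)
    return db_status, index_status
```

One design point worth keeping in mind: with a stdio transport, stdout carries the JSON-RPC stream, so log output should go to stderr. `logging.StreamHandler` writes to stderr by default, which fits that constraint.
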
api/cli.py (1)

434-437: ⚡ Quick win

Validate template file existence before reading.

The function attempts to read template files without verifying they exist. If the package is incorrectly installed or files are missing, the error will occur during iteration rather than being caught upfront. Consider adding an early check.

🔍 Proposed validation check

     targets = {
         "CLAUDE.md": _TEMPLATES_DIR / "claude_mcp_section.md",
         ".cursorrules": _TEMPLATES_DIR / "cursorrules.template",
     }
+    
+    # Validate template files exist
+    missing = [str(tpl) for tpl in targets.values() if not tpl.exists()]
+    if missing:
+        _json_error(f"Template files not found (package may be incorrectly installed): {', '.join(missing)}")
 
     cwd = Path.cwd()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/cli.py` around lines 434 - 437, The targets mapping in api/cli.py can
point to missing template files causing runtime errors when they are read;
before iterating/reading the templates referenced by targets (and using
_TEMPLATES_DIR), validate each Path (the values in targets) exists
(Path.exists()/is_file()) and raise a clear exception or log a descriptive error
listing missing templates (include the template key names like "CLAUDE.md" and
".cursorrules") so installation/missing-file problems are caught early rather
than during file I/O.
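
A minimal sketch of the up-front check, run against a temporary directory so it is self-contained. `missing_templates` is a hypothetical helper name, and reporting by target key (rather than by template path, as in the proposed diff) is just one of the two options the comment mentions.

```python
import tempfile
from pathlib import Path


def missing_templates(targets: dict) -> list:
    """Return the target names whose template file is absent."""
    return [name for name, tpl in targets.items() if not tpl.is_file()]


with tempfile.TemporaryDirectory() as tmp:
    templates_dir = Path(tmp)
    # Create only one of the two templates to simulate a broken install.
    (templates_dir / "claude_mcp_section.md").write_text("guidance", encoding="utf-8")
    targets = {
        "CLAUDE.md": templates_dir / "claude_mcp_section.md",
        ".cursorrules": templates_dir / "cursorrules.template",  # deliberately absent
    }
    missing = missing_templates(targets)
```

In the CLI, a non-empty `missing` list would be passed to the existing error helper before any destination file is written, so a partial write cannot happen.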

bench/tools/code_graph/system_preamble.md (1)

44-46: 💤 Low value

Consider adding a language specifier to the fenced code block.

The fenced code block could specify text or bash as the language identifier to satisfy the markdown linter and improve rendering consistency.

📝 Suggested fix

-```
+```text
 COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
 ```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/tools/code_graph/system_preamble.md` around lines 44 - 46, Update the
fenced code block containing the line "COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT" to
include a language specifier (e.g., text or bash) after the opening triple
backticks so the markdown linter passes and rendering is consistent; locate the
triple-backtick fenced block around that string and change the opening fence
from ``` to ```text (or ```bash) accordingly.

Source: Linters/SAST tools

bench/tools/code_graph_mcp/system_preamble.md (1)

46-48: 💤 Low value

Consider adding a language specifier to the fenced code block.

The fenced code block could specify text or bash as the language identifier to satisfy the markdown linter and improve rendering consistency.

📝 Suggested fix

-```
+```text
 COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
 ```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/tools/code_graph_mcp/system_preamble.md` around lines 46 - 48, The
fenced code block containing the token COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
should include a language specifier (e.g., text or bash) to satisfy the markdown
linter and ensure consistent rendering; update the triple-backtick fence that
wraps COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT to use a language identifier like
"text" (or "bash") so the linter and renderers recognize the block type.

Source: Linters/SAST tools

bench/tools/baseline/system_preamble.md (1)

14-16: 💤 Low value

Consider adding a language specifier to the fenced code block.

The fenced code block could specify text or bash as the language identifier to satisfy the markdown linter and improve rendering consistency.

📝 Suggested fix

-```
+```text
 COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
 ```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/tools/baseline/system_preamble.md` around lines 14 - 16, The fenced
code block containing COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT should include a
language identifier (e.g., text or bash) to satisfy the markdown linter and
ensure consistent rendering; update the fenced block from ``` to ```text (or
```bash) in the system_preamble.md file so the block is explicitly typed.

Source: Linters/SAST tools

start.sh (1)

29-39: 💤 Low value

Consider logging unknown CGRAPH_MODE values.

The web|*) pattern silently defaults unknown mode values to web behavior, which could hide typos. While defaulting to web is safe, logging a warning for unrecognized values would improve debuggability.

💡 Optional enhancement

 case "${CGRAPH_MODE}" in
   mcp)
     exec cgraph-mcp
     ;;
-  web|*)
+  web)
     exec uvicorn api.index:app --host "${HOST:-0.0.0.0}" --port "${PORT:-5000}" ${APP_RELOAD:+--reload}
     ;;
+  *)
+    echo "WARNING: Unknown CGRAPH_MODE='${CGRAPH_MODE}', defaulting to web" >&2
+    exec uvicorn api.index:app --host "${HOST:-0.0.0.0}" --port "${PORT:-5000}" ${APP_RELOAD:+--reload}
+    ;;
 esac

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@start.sh` around lines 29 - 39, The case fallback silently treats any value
other than "mcp" as "web", so add a warning log when CGRAPH_MODE is set to an
unrecognized value before falling through to the web branch: detect if
CGRAPH_MODE is neither "mcp" nor "web" and emit a warning (e.g., echo to stderr)
mentioning the actual CGRAPH_MODE, then continue to execute the existing web
branch (the exec uvicorn api.index:app ... line) so behavior is unchanged; refer
to the existing case on CGRAPH_MODE and the mcp and web|* branches to locate
where to add this check.

bench/runners/swebench_verify.py (2)

43-45: ⚡ Quick win

Use a clearer variable name instead of l.

The variable name l in the list comprehension is ambiguous (easily confused with 1 or I). Use line instead.

♻️ Clearer variable name

 def _load_jsonl(path: Path) -> list[dict[str, Any]]:
-    return [json.loads(l) for l in path.read_text().splitlines() if l.strip()]
+    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/runners/swebench_verify.py` around lines 43 - 45, In _load_jsonl,
replace the ambiguous list-comprehension variable name `l` with a clearer name
like `line`; update the comprehension in function _load_jsonl to use `line` in
both the json.loads(...) call and the if check so it reads more clearly (e.g.,
json.loads(line) for line in path.read_text().splitlines() if line.strip()).

114-141: ⚡ Quick win

Split multiple statements onto separate lines.

Lines 136 and 138 have multiple statements separated by semicolons. Split them onto separate lines for better readability and to align with Python style conventions.

♻️ Split statements

         if r["task_id"] in resolved:
-            r["outcome"] = "resolved"; n += 1
+            r["outcome"] = "resolved"
+            n += 1
         elif r["task_id"] in unresolved:
-            r["outcome"] = "failed"; n += 1
+            r["outcome"] = "failed"
+            n += 1

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/runners/swebench_verify.py` around lines 114 - 141, The function
patch_outcomes_from_report contains multiple statements on single lines in the
branches that assign r["outcome"] and increment n; split those into separate
statements for clarity: inside patch_outcomes_from_report, replace the
single-line branches that currently do r["outcome"] = "resolved"; n += 1 and
r["outcome"] = "failed"; n += 1 with two lines each (one assigning r["outcome"]
and a separate line incrementing n) so both the resolved and unresolved branches
are expanded into separate statements.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/cli.py`:
- Around line 422-456: Wrap the body of init_agent in a try/except that catches
Exception and calls _json_error(str(e)) on failure to match other CLI commands;
specifically, surround the logic that builds targets, checks for existing files,
writes templates (dest.write_text and template.read_text), and returns _json_out
with a try block, and in the except block call _json_error(str(e)) so missing
template files or I/O errors are reported consistently. Ensure you still respect
the existing force check and that written remains returned on success.

In `@api/index.py`:
- Around line 116-133: Add the public_or_auth decorator to the _health endpoint:
annotate the async function _health with `@public_or_auth` so the route follows
the read-endpoint guideline (stays public when CODE_GRAPH_PUBLIC=1, otherwise
requires auth); also ensure public_or_auth is imported into api/index.py if it
isn't already so the decorator resolves correctly.

In `@api/llm.py`:
- Around line 87-92: The graph lookup in _run_cypher currently uses only
repo_name which ignores branch/worktree and can return wrong state; modify
_run_cypher (and the related callers, including ask()) to accept and pass a
graph key that includes branch/worktree context (e.g., repo_name +
worktree/branch id or a resolved graph_name), use that key when calling
_falkordb().select_graph(...) instead of raw repo_name, and update any other
functions in the same region (the logic around lines 100-113) to thread the
resolver/graph_name through so all graph selections are per-branch rather than
global to the repo. Ensure function signatures for _run_cypher and ask reflect
the new parameter and update all call sites accordingly.
- Around line 95-97: The _run_cypher() function currently swallows all
exceptions and returns an empty list which makes failures indistinguishable from
"no results" and lets ask() continue to synthesize answers; change _run_cypher()
to raise a specific exception (e.g., CypherExecutionError or a custom subclass)
when any query/DB error occurs (while still logging with logger.exception), and
update ask() to catch that specific exception and short-circuit (propagate the
error or return an explicit failure response) instead of proceeding to the
answer-generation logic; reference _run_cypher and ask so you update both places
consistently.
- Around line 87-97: The _run_cypher function currently executes model-provided
Cypher verbatim; add a strict read-only guard that rejects any Cypher containing
mutation keywords (e.g., CREATE, MERGE, SET, DELETE, REMOVE, DROP, CALL, LOAD
CSV, CREATE INDEX, DROP INDEX, etc.) using a case-insensitive regex or token
check before calling graph.query, return/log an error for disallowed queries,
and/or use a FalkorDB read-only credential/connection from _falkordb() to ensure
no writes are possible; apply the identical validation/read-only-credential
change to the other Cypher execution site that also calls graph.query so all
model-driven executions are protected.
- Around line 73-75: The current api/llm.py must be changed to use the GraphRAG
abstraction for chat (re-import/instantiate GraphRAG from the existing
graphrag_init module instead of directly wiring LiteLLM in _build_llm), update
ask(repo_name: str, question: str, branch: str) to accept a branch argument and
ensure callers (e.g., /api/chat) pass data.branch through, and fix _run_cypher
to (1) select the per-branch graph (use db.select_graph(repo_name, branch) or
equivalent graph identity used elsewhere), (2) enforce read-only Cypher by
validating/rewriting the LLM output before executing (reject/abort if DDL/DML
keywords detected), (3) avoid swallowing DB exceptions—catch and rethrow or
return an explicit error so failures aren’t masked as empty context, and (4)
call graph.query only after these checks; reference functions/classes:
_build_llm, ask, _run_cypher, GraphRAG, graphrag_init, graph.query,
db.select_graph.
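
The read-only guard and the "do not swallow errors" points for `_run_cypher` can be sketched together as follows. This is an illustrative outline only: `CypherExecutionError`, `assert_read_only`, and `run_cypher` are hypothetical names, `graph` is assumed to be any object with a `query` method whose return value exposes a `result_set` attribute, and the keyword scan is deliberately naive (it will also reject keywords that appear inside string literals).

```python
import re


class CypherExecutionError(RuntimeError):
    """Hypothetical error type; not part of FalkorDB or the reviewed code."""


# Keyword list taken from the review comment; a real guard may need more.
_WRITE_KEYWORDS = re.compile(
    r"\b(CREATE|MERGE|SET|DELETE|REMOVE|DROP|CALL|LOAD\s+CSV)\b",
    re.IGNORECASE,
)


def assert_read_only(cypher: str) -> None:
    """Reject queries containing any mutation keyword."""
    match = _WRITE_KEYWORDS.search(cypher)
    if match:
        raise CypherExecutionError(f"disallowed keyword: {match.group(1).upper()}")


def run_cypher(graph, cypher: str) -> list:
    """Run a validated query; raise instead of returning [] on failure."""
    assert_read_only(cypher)
    try:
        return graph.query(cypher).result_set
    except Exception as exc:
        # Chain the original error so callers can tell failure from "no rows".
        raise CypherExecutionError(f"query failed: {exc}") from exc
```

A caller such as `ask()` could then catch `CypherExecutionError` and return an explicit failure instead of synthesizing an answer from empty context. A keyword filter is a defensive layer rather than a guarantee, which is why the comment also suggests a read-only connection where available.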

In `@bench/datasets/swe_bench.py`:
- Around line 198-200: Replace the bare re-raise of a RuntimeError with an
exception-chaining raise so the original subprocess exception is preserved:
modify the raise in the block that currently does raise RuntimeError(f"failed to
apply test_patch for {inst.instance_id}: {exc.stderr}") to use "raise ... from
exc" so the original exception (exc) is chained to the new RuntimeError; keep
the same message and reference to inst.instance_id and exc.stderr.
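
As a small illustration of the chaining pattern, the sketch below runs a child process that always fails and wraps the resulting `CalledProcessError`. `apply_patch` and the command are hypothetical stand-ins for the real patch-application step.

```python
import subprocess
import sys


def apply_patch(instance_id: str) -> None:
    """Hypothetical helper: run a failing command and chain the error."""
    cmd = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('bad hunk'); sys.exit(1)",
    ]
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        # "from exc" keeps the original exception as __cause__ in tracebacks.
        raise RuntimeError(
            f"failed to apply test_patch for {instance_id}: {exc.stderr}"
        ) from exc
```

With `from exc`, the traceback shows the subprocess failure as the direct cause of the `RuntimeError`, and the original exception stays reachable through `__cause__`.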

In `@bench/README.md`:
- Around line 16-17: Update the README status line that currently reads
"Scaffold only. Directory layout and contracts exist; runners do not." to
accurately reflect that benchmark runners, adapters, and tooling are now
included in this PR; replace that sentence with concise, current status text
(e.g., "Benchmarks include runners, adapters, and tooling — see the runners/ and
adapters/ directories for usage and examples") so users aren’t misled, and
ensure any follow-up "todo" pointers in the README reference remaining tasks
only.
- Around line 11-12: The README references a user-local path
'`~/.copilot/session-state/<id>/plan.md`' which is not accessible to
collaborators/CI; update the line to point to a committed, repo-tracked file
(e.g., replace with a path under the repository like a docs or bench subpath
such as a committed "session-plan.md") or remove the reference entirely, and
ensure the updated README points to the new committed filename so others and CI
can access the full design doc.

In `@bench/tools/code_graph/tools.yaml`:
- Line 28: The tools.yaml currently hardcodes service_url to
http://host.docker.internal:5000 which is not portable; update
bench/tools/code_graph/tools.yaml to read the base URL from the CODEGRAPH_URL
environment variable (falling back to the existing literal if unset) so the
code-graph tool uses the same endpoint as the benchmark runner, and add a short
note in .env.template or README that CODEGRAPH_URL must be set on platforms
where host.docker.internal is unavailable (keep SECRET_TOKEN usage unchanged to
match bench/agents/code_graph_adapter.py).

In `@bench/tools/lsp/system_preamble.md`:
- Around line 57-59: The fenced code block containing
COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT is missing a language specifier; update
the block delimiter from ``` to ```text (or ```bash if preferred) so the code
fence becomes ```text COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT ``` to satisfy the
MD040 lint rule and explicitly mark the example output language.

In `@README.md`:
- Line 237: The README install snippet uses the wrong PyPI package name; replace
the occurrence of "pip install code-graph" with "pip install
falkordb-code-graph" (or make it consistent with the earlier "pipx/pip install
falkordb-code-graph" guidance) so the install command matches the package name
declared in pyproject.toml and the existing "uv pip install -e ." instructions.

---

Duplicate comments:
In `@bench/agents/code_graph_mcp_adapter.py`:
- Line 120: The code opens os.devnull into the variable devnull without closing
it (devnull = open(os.devnull, "w")), causing a resource leak; change the usage
to either (1) wrap the open(os.devnull, "w") in a with context so the file is
auto-closed around its use, or (2) replace usages with subprocess.DEVNULL where
applicable, or (3) ensure devnull.close() is called after the last use — locate
the devnull variable in this module (devnull) and update its creation/usage
accordingly to guarantee the file handle is closed.
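
Of the three options, `subprocess.DEVNULL` is usually the least code because no handle is ever opened by the caller. A minimal sketch, using a throwaway child process rather than the adapter's real command:

```python
import subprocess
import sys

# Child prints to stdout and writes noise to stderr; stderr is discarded
# via subprocess.DEVNULL, so there is no file object to close afterwards.
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; print('ready'); sys.stderr.write('noise')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.DEVNULL,
    text=True,
)
out, _ = proc.communicate()
```

This also suits a long-lived child process, where a `with open(os.devnull, "w")` block would have to span the child's whole lifetime.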

In `@bench/datasets/swe_bench.py`:
- Line 34: Remove the unused import of the os module from the top of
bench/datasets/swe_bench.py: delete the line "import os" so there are no unused
imports; ensure no other code in this file references os (e.g., any later
functions or top-level code) before removing.

In `@bench/report/__main__.py`:
- Line 18: Remove the unused import aggregate_to_markdown from the import list
in the module where load_jsonl, summarize, and render_markdown are used; update
the import statement that currently reads "from bench.report import
aggregate_to_markdown, load_jsonl, render_markdown, summarize" to only import
the functions actually used (load_jsonl, summarize, render_markdown) so there
are no unused references to aggregate_to_markdown.

---

Nitpick comments:
In `@api/cli.py`:
- Around line 434-437: The targets mapping in api/cli.py can point to missing
template files causing runtime errors when they are read; before
iterating/reading the templates referenced by targets (and using
_TEMPLATES_DIR), validate each Path (the values in targets) exists
(Path.exists()/is_file()) and raise a clear exception or log a descriptive error
listing missing templates (include the template key names like "CLAUDE.md" and
".cursorrules") so installation/missing-file problems are caught early rather
than during file I/O.

In `@api/mcp/auto_init.py`:
- Line 36: Rename the module-level mutable cache symbol from _AUTO_INDEXED to
_auto_indexed to reflect mutability; update the declaration (keep the type
annotation set[tuple[str, str]] and the leading underscore) and replace every
reference to _AUTO_INDEXED in this module (functions, reads/writes) with
_auto_indexed so the code remains consistent and follows Python naming
conventions.

In `@api/mcp/server.py`:
- Around line 33-34: Call and capture the returned status dicts from
ensure_falkordb() and maybe_auto_index() (e.g., falkordb_status =
ensure_falkordb(), auto_index_status = maybe_auto_index()) and emit concise
structured logs including those dicts (use existing logger or
logging.info/debug) so host/port/action/errors/indexed project/branch are
recorded; do this without altering the functions' behavior or return values and
ensure the log messages provide clear context like "ensure_falkordb result" and
"maybe_auto_index result".

In `@bench/agents/code_graph_mcp_adapter.py`:
- Around line 66-67: The setdefault calls on the local dict env are using
redundant os.environ.get calls; change the two lines that set "FALKORDB_HOST"
and "FALKORDB_PORT" to call env.setdefault("FALKORDB_HOST", "127.0.0.1") and
env.setdefault("FALKORDB_PORT", "6379") respectively so defaults are applied
directly (locate the env dict created earlier and the two setdefault calls in
code_graph_mcp_adapter.py).
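
A short sketch of why the inner `os.environ.get` is redundant, assuming (as the comment implies) that `env` starts as a copy of the process environment. `build_env` is a hypothetical helper used only to make the behaviour testable.

```python
def build_env(base: dict) -> dict:
    """Copy the base environment and fill in FalkorDB defaults."""
    env = dict(base)
    # setdefault only writes when the key is absent, so values already
    # inherited from the base environment are preserved as-is.
    env.setdefault("FALKORDB_HOST", "127.0.0.1")
    env.setdefault("FALKORDB_PORT", "6379")
    return env
```

In the adapter this would be called with `os.environ` as the base, so an operator-supplied `FALKORDB_PORT` still wins over the default.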

In `@bench/report/__init__.py`:
- Around line 35-37: The inline comment on the total_tokens property is
misleading because nothing replaces or overrides it later; update the comment or
remove it: in the total_tokens property (method total_tokens) either delete the
"not perfectly accurate, replaced below" note or replace it with a clear
statement that this is an approximation computed as median_tokens * n_tasks and,
if applicable, reference how median_tokens and n_tasks are derived
(median_tokens, n_tasks) so future readers know this is an estimated total
rather than an exact value.

In `@bench/runners/swebench_verify.py`:
- Around line 43-45: In _load_jsonl, replace the ambiguous list-comprehension
variable name `l` with a clearer name like `line`; update the comprehension in
function _load_jsonl to use `line` in both the json.loads(...) call and the if
check so it reads more clearly (e.g., json.loads(line) for line in
path.read_text().splitlines() if line.strip()).
- Around line 114-141: The function patch_outcomes_from_report contains multiple
statements on single lines in the branches that assign r["outcome"] and
increment n; split those into separate statements for clarity: inside
patch_outcomes_from_report, replace the single-line branches that currently do
r["outcome"] = "resolved"; n += 1 and r["outcome"] = "failed"; n += 1 with two
lines each (one assigning r["outcome"] and a separate line incrementing n) so
both the resolved and unresolved branches are expanded into separate statements.

In `@bench/tools/baseline/system_preamble.md`:
- Around line 14-16: The fenced code block containing
COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT should include a language identifier
(e.g., text or bash) to satisfy the markdown linter and ensure consistent
rendering; update the fenced block from ``` to ```text (or ```bash) in the
system_preamble.md file so the block is explicitly typed.

In `@bench/tools/code_graph_mcp/system_preamble.md`:
- Around line 46-48: The fenced code block containing the token
COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT should include a language specifier (e.g.,
text or bash) to satisfy the markdown linter and ensure consistent rendering;
update the triple-backtick fence that wraps
COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT to use a language identifier like "text"
(or "bash") so the linter and renderers recognize the block type.

In `@bench/tools/code_graph/system_preamble.md`:
- Around line 44-46: Update the fenced code block containing the line
"COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT" to include a language specifier (e.g.,
text or bash) after the opening triple backticks so the markdown linter passes
and rendering is consistent; locate the triple-backtick fenced block around that
string and change the opening fence from ``` to ```text (or ```bash)
accordingly.

In `@scripts/mcp_smoke.py`:
- Around line 25-36: The _pretty function is catching a bare Exception when
attempting json.loads(chunk.text); narrow that to json.JSONDecodeError so
unrelated errors (e.g., AttributeError) surface; update the try/except in
_pretty to except json.JSONDecodeError and keep the existing fallback behavior
of returning chunk.text on decode failure (ensure json is imported where _pretty
is defined).

In `@start.sh`:
- Around line 29-39: The case fallback silently treats any value other than
"mcp" as "web", so add a warning log when CGRAPH_MODE is set to an unrecognized
value before falling through to the web branch: detect if CGRAPH_MODE is neither
"mcp" nor "web" and emit a warning (e.g., echo to stderr) mentioning the actual
CGRAPH_MODE, then continue to execute the existing web branch (the exec uvicorn
api.index:app ... line) so behavior is unchanged; refer to the existing case on
CGRAPH_MODE and the mcp and web|* branches to locate where to add this check.

In `@tests/test_bench_metrics.py`:
- Around line 1-166: Reviewer asks for unittest.TestCase style consistency, but
this file (tests/test_bench_metrics.py) intentionally uses pytest-style tests
like test_extract_token_usage_sums_history and
test_task_metrics_from_trajectory_round_trip; add a one-line comment at the top
of the file stating that these tests intentionally use pytest style and that a
repository-wide refactor to unittest.TestCase will be done separately, so we
keep the current tests as-is for now to avoid a piecemeal conversion.

In `@tests/test_swebench_verify.py`:
- Line 74: Rename the ambiguous single-letter variable `l` in the list
comprehension inside tests/test_swebench_verify.py (the expression creating
`rows = [json.loads(l) for l in results.read_text().splitlines()]`) to a clearer
name such as `line` or `row_json`; update the comprehension to use that name and
replace any other uses of the same variable in the surrounding scope so
readability is improved and `l` isn’t confused with similar characters.
- Line 38: The list comprehension in the assignment to baseline uses an
ambiguous single-letter variable `l`; change it to a descriptive name (e.g.,
`line` or `row_json`) so the expression becomes: baseline =
[json.loads(<new_name>) for <new_name> in
paths["baseline"].read_text().splitlines()] and update any matching usage in
that expression to avoid confusion between `l`, `1`, and `I`.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b754c159-82f0-46b1-99e3-4b2371598c56

📥 Commits

Reviewing files that changed from the base of the PR and between 126c672 and 5db867a.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (63)

.env.template
.gitignore
AGENTS.md
CONTEXT.md
README.md
api/cli.py
api/index.py
api/llm.py
api/mcp/auto_init.py
api/mcp/server.py
api/mcp/templates/claude_mcp_section.md
api/mcp/templates/cursorrules.template
bench/.gitignore
bench/README.md
bench/agents/__init__.py
bench/agents/code_graph_adapter.py
bench/agents/code_graph_mcp_adapter.py
bench/agents/lsp_adapter.py
bench/cli/__init__.py
bench/cli/cg
bench/cli/cg-mcp
bench/cli/cg.py
bench/cli/cg_mcp.py
bench/cli/lsp
bench/cli/lsp.py
bench/cli/regrade.py
bench/configs/default.yaml
bench/datasets/__init__.py
bench/datasets/swe_bench.py
bench/metrics/__init__.py
bench/report/__init__.py
bench/report/__main__.py
bench/runners/__init__.py
bench/runners/index_cache.py
bench/runners/mini_runner.py
bench/runners/safe_local_env.py
bench/runners/swebench_verify.py
bench/scripts/start-api.sh
bench/tools/baseline/system_preamble.md
bench/tools/baseline/tools.yaml
bench/tools/code_graph/system_preamble.md
bench/tools/code_graph/tools.yaml
bench/tools/code_graph_mcp/system_preamble.md
bench/tools/code_graph_mcp/tools.yaml
bench/tools/lsp/shim.yaml
bench/tools/lsp/system_preamble.md
bench/tools/lsp/tools.yaml
docker-compose.yml
pyproject.toml
scripts/mcp_smoke.py
start.sh
tests/bench/__init__.py
tests/bench/test_cg_mcp_adapter.py
tests/bench/test_cg_mcp_cli_compaction.py
tests/mcp/test_auto_init.py
tests/mcp/test_init_agent.py
tests/test_bench_code_graph_adapter.py
tests/test_bench_index_cache.py
tests/test_bench_lsp_adapter.py
tests/test_bench_metrics.py
tests/test_bench_runner.py
tests/test_swe_bench_loader.py
tests/test_swebench_verify.py

coderabbitai · 2026-06-09T08:43:09Z

+@app.command("init-agent")
+def init_agent(
+    force: bool = typer.Option(
+        False, "--force", "-f", help="Overwrite existing CLAUDE.md / .cursorrules."
+    ),
+) -> None:
+    """Drop AI-agent guidance files (CLAUDE.md, .cursorrules) into CWD.
+
+    Copies the canonical code-graph MCP guidance bundled with this
+    package so any repo can announce the tools to Cursor and Claude
+    Code with one command.
+    """
+    targets = {
+        "CLAUDE.md": _TEMPLATES_DIR / "claude_mcp_section.md",
+        ".cursorrules": _TEMPLATES_DIR / "cursorrules.template",
+    }
+
+    cwd = Path.cwd()
+    if not force:
+        existing = [name for name in targets if (cwd / name).exists()]
+        if existing:
+            _json_error(
+                f"Refusing to overwrite existing files: {', '.join(existing)}. "
+                "Re-run with --force to clobber."
+            )
+
+    written: List[str] = []
+    for name, template in targets.items():
+        dest = cwd / name
+        dest.write_text(template.read_text(encoding="utf-8"), encoding="utf-8")
+        written.append(str(dest))
+        _stderr(f"Wrote {dest}")
+
+    _json_out({"status": "ok", "written": written, "force": force})
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add try/except error handling for consistency.

All other CLI commands in this file wrap their operations in try/except blocks and call _json_error(str(e)) on failure (see index, index_repo, list_repos, migrate, etc.). The init_agent command should follow the same pattern to ensure consistent error reporting when template files are missing or file I/O fails.

🛡️ Proposed fix to add error handling

 @app.command("init-agent")
 def init_agent(
     force: bool = typer.Option(
         False, "--force", "-f", help="Overwrite existing CLAUDE.md / .cursorrules."
     ),
 ) -> None:
     """Drop AI-agent guidance files (CLAUDE.md, .cursorrules) into CWD.
 
     Copies the canonical code-graph MCP guidance bundled with this
     package so any repo can announce the tools to Cursor and Claude
     Code with one command.
     """
+    try:
-    targets = {
-        "CLAUDE.md": _TEMPLATES_DIR / "claude_mcp_section.md",
-        ".cursorrules": _TEMPLATES_DIR / "cursorrules.template",
-    }
-
-    cwd = Path.cwd()
-    if not force:
-        existing = [name for name in targets if (cwd / name).exists()]
-        if existing:
-            _json_error(
-                f"Refusing to overwrite existing files: {', '.join(existing)}. "
-                "Re-run with --force to clobber."
-            )
-
-    written: List[str] = []
-    for name, template in targets.items():
-        dest = cwd / name
-        dest.write_text(template.read_text(encoding="utf-8"), encoding="utf-8")
-        written.append(str(dest))
-        _stderr(f"Wrote {dest}")
-
-    _json_out({"status": "ok", "written": written, "force": force})
+        targets = {
+            "CLAUDE.md": _TEMPLATES_DIR / "claude_mcp_section.md",
+            ".cursorrules": _TEMPLATES_DIR / "cursorrules.template",
+        }
+
+        cwd = Path.cwd()
+        if not force:
+            existing = [name for name in targets if (cwd / name).exists()]
+            if existing:
+                _json_error(
+                    f"Refusing to overwrite existing files: {', '.join(existing)}. "
+                    "Re-run with --force to clobber."
+                )
+
+        written: List[str] = []
+        for name, template in targets.items():
+            dest = cwd / name
+            dest.write_text(template.read_text(encoding="utf-8"), encoding="utf-8")
+            written.append(str(dest))
+            _stderr(f"Wrote {dest}")
+
+        _json_out({"status": "ok", "written": written, "force": force})
+    except Exception as e:
+        _json_error(str(e))

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/cli.py` around lines 422 - 456, Wrap the body of init_agent in a try/except that catches Exception and calls _json_error(str(e)) on failure to match other CLI commands; specifically, surround the logic that builds targets, checks for existing files, writes templates (dest.write_text and template.read_text), and returns _json_out with a try block, and in the except block call _json_error(str(e)) so missing template files or I/O errors are reported consistently. Ensure you still respect the existing force check and that written remains returned on success.
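A minimal sketch of the pattern this comment asks for, written against the standard library only so it can run outside the repo. `_json_error` here is a hypothetical stand-in that prints a JSON payload and raises `SystemExit`; the real helper in `api/cli.py` may behave differently, and `copy_templates` is an illustrative name rather than the actual command.

```python
from __future__ import annotations

import json
from pathlib import Path


def _json_error(message: str) -> None:
    """Hypothetical stand-in: print a JSON error and exit non-zero."""
    print(json.dumps({"status": "error", "error": message}))
    raise SystemExit(1)


def copy_templates(targets: dict[str, Path], cwd: Path, force: bool = False) -> list[str]:
    """Copy each template into cwd, reporting any failure as JSON."""
    try:
        if not force:
            existing = [name for name in targets if (cwd / name).exists()]
            if existing:
                # SystemExit derives from BaseException, so the handler
                # below does not intercept this deliberate exit.
                _json_error(
                    f"Refusing to overwrite existing files: {', '.join(existing)}."
                )
        written: list[str] = []
        for name, template in targets.items():
            dest = cwd / name
            dest.write_text(template.read_text(encoding="utf-8"), encoding="utf-8")
            written.append(str(dest))
        return written
    except Exception as e:  # e.g. FileNotFoundError for a missing template
        _json_error(str(e))
```

One thing worth checking before applying the suggestion: if the real `_json_error` raises an ordinary `Exception` subclass rather than `SystemExit`, the "Refusing to overwrite" call inside the `try` would be caught by the new handler and reported a second time.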

coderabbitai · 2026-06-09T08:43:10Z

+@app.get('/api/_health')
+async def _health():
+    """Diagnostic endpoint: reports the running server's resolver + DB config.
+
+    Used by the bench harness to fail-fast when the API server was started
+    without ``CODE_GRAPH_PY_RESOLVER=tree_sitter`` — without that flag the
+    Python indexer falls back to jedi/multilspy, which on real-world repos
+    (sphinx, sympy, …) spawns a per-repo venv + ``pip install poetry`` and
+    can wedge for hours at 100% CPU. Cheap (no DB call); safe to ship.
+    """
+    return {
+        "status": "ok",
+        "py_resolver": os.environ.get("CODE_GRAPH_PY_RESOLVER", "jedi"),
+        "falkordb_host": os.environ.get("FALKORDB_HOST", "localhost"),
+        "falkordb_port": os.environ.get("FALKORDB_PORT", "6379"),
+        "public": os.environ.get("CODE_GRAPH_PUBLIC", "0"),
+    }
+


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add public_or_auth decorator for guideline consistency.

As per coding guidelines, read endpoints in api/index.py should use the public_or_auth decorator. While health checks are often unauthenticated by design, adding this decorator would maintain consistency without breaking functionality—when CODE_GRAPH_PUBLIC=1, the endpoint remains accessible without auth; otherwise it requires a token.
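The gating behavior described here can be sketched in plain Python. This is not the repository's `public_or_auth` implementation: `is_request_allowed` is a hypothetical name, the `Bearer` header scheme is an assumption, and `SECRET_TOKEN` is used only because that variable name appears elsewhere in this review.

```python
from __future__ import annotations

import hmac
import os
from typing import Mapping


def is_request_allowed(
    auth_header: str | None, env: Mapping[str, str] | None = None
) -> bool:
    """Allow anonymous reads in public mode, otherwise require the token."""
    env = os.environ if env is None else env
    if env.get("CODE_GRAPH_PUBLIC", "0") == "1":
        return True  # public mode: no credentials needed
    secret = env.get("SECRET_TOKEN")
    if not secret or not auth_header:
        return False
    expected = f"Bearer {secret}"  # assumed header format
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(auth_header.encode(), expected.encode())
```

With this shape, a health probe from the bench harness would need to send the token whenever the server is not started with `CODE_GRAPH_PUBLIC=1`.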

📋 Proposed fix to add decorator

 @app.get('/api/_health')
-async def _health():
+async def _health(_=Depends(public_or_auth)):
     """Diagnostic endpoint: reports the running server's resolver + DB config.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/index.py` around lines 116 - 133, Add the public_or_auth decorator to the _health endpoint: annotate the async function _health with `@public_or_auth` so the route follows the read-endpoint guideline (stays public when CODE_GRAPH_PUBLIC=1, otherwise requires auth); also ensure public_or_auth is imported into api/index.py if it isn't already so the decorator resolves correctly.

Source: Coding guidelines

coderabbitai · 2026-06-09T08:43:10Z

+def _build_llm() -> LiteLLM:
+    model_name = os.getenv("MODEL_NAME", "gemini/gemini-flash-lite-latest")
+    return LiteLLM(model_name, temperature=0.0)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show file size/line count
wc -l api/llm.py

# Print the whole file if it's small; otherwise key ranges around the reported line numbers.
if [ "$(wc -l < api/llm.py)" -le 260 ]; then
  cat -n api/llm.py
else
  # Head / around the middle / tail
  echo "---- head ----"
  sed -n '1,220p' api/llm.py | nl -ba
  echo "---- around 60-160 ----"
  sed -n '55,170p' api/llm.py | nl -ba
  echo "---- around 90-150 (per original mention) ----"
  sed -n '85,150p' api/llm.py | nl -ba
  echo "---- tail ----"
  tail -n 120 api/llm.py | nl -ba
fi

# Search for GraphRAG-related symbols/usages anywhere in api/
echo "---- rg GraphRAG in api/ ----"
rg -n "GraphRAG|graphrag|graph rag|RAG" api || true

# Identify what reads/writes are performed: look for query execution and exception handling
echo "---- rg for graph.query / query(cypher) / cypher execution ----"
rg -n "graph\.query|query\(|_run_cypher|select_graph|cypher" api/llm.py || true

Repository: FalkorDB/code-graph

Length of output: 6489

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- rg for ask( ... ) usage ----"
rg -n "\bask\(" . --glob '!**/node_modules/**' || true

echo "---- rg for select_graph( ----"
rg -n "select_graph\(" . || true

echo "---- rg for repo_name naming / branch / worktree ----"
rg -n "repo_name|worktree|branch|stack|pr[-_ ]?number|pull[_-]?request|workspace" api . || true

echo "---- List files under api/ that might route LLM calls ----"
ls -la api || true

Repository: FalkorDB/code-graph

Length of output: 50375

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- rg GraphRAG( across repo ----"
rg -n "GraphRAG\(" . || true

echo "---- rg graphrag_init / GraphRAG facade ----"
rg -n "graphrag_init|KnowledgeGraph|chat_session\(" api docs || true

echo "---- inspect api/index.py around the ask call ----"
sed -n '220,280p' api/index.py | nl -ba

echo "---- inspect api/index.py imports / how ask is imported ----"
sed -n '1,80p' api/index.py | nl -ba

echo "---- inspect tests/index.py around ask call ----"
sed -n '140,210p' tests/index.py | nl -ba

echo "---- inspect bench/agents/code_graph_mcp_adapter.py ask wrapper ----"
sed -n '180,260p' bench/agents/code_graph_mcp_adapter.py | nl -ba

Repository: FalkorDB/code-graph

Length of output: 2047

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- cat api/index.py (imports + ask call context) ----"
sed -n '1,120p' api/index.py | cat -n

echo "---- around ask call ----"
sed -n '220,270p' api/index.py | cat -n

echo "---- cat tests/index.py around ask call ----"
sed -n '140,210p' tests/index.py | cat -n

echo "---- rg for KnowledgeGraph / graphrag_init implementation ----"
rg -n "KnowledgeGraph|graphrag_init|define_ontology|chat_session|kg\.ask\(" api docs || true

echo "---- list api/mcp directory ----"
ls -la api/mcp | head

echo "---- if present, show api/mcp/graphrag_init.py ----"
if [ -f api/mcp/graphrag_init.py ]; then
  wc -l api/mcp/graphrag_init.py
  sed -n '1,220p' api/mcp/graphrag_init.py | cat -n
else
  echo "api/mcp/graphrag_init.py not found"
fi

Repository: FalkorDB/code-graph

Length of output: 13811

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- inspect bench/agents/code_graph_mcp_adapter.py ask wrapper ----"
sed -n '150,270p' bench/agents/code_graph_mcp_adapter.py | cat -n

echo "---- inspect api/prompts.py cypher system/prompt ----"
ls -la api | rg "prompts"
sed -n '1,220p' api/prompts.py | cat -n

echo "---- rg for CYPHER_GEN_SYSTEM / GRAPH_QA_SYSTEM usage ----"
rg -n "CYPHER_GEN_SYSTEM|CYPHER_GEN_PROMPT|GRAPH_QA_SYSTEM|GRAPH_QA_PROMPT" api || true

Repository: FalkorDB/code-graph

Length of output: 10294

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "---- rg for '/api/chat' usage in tests/bench ----"
rg -n "\"/api/chat\"|/api/chat" tests bench e2e scripts api || true

echo "---- rg for ChatRequest model construction ----"
rg -n "ChatRequest\(" . || true

echo "---- rg for data\.branch in call to api.llm.ask ----"
rg -n "ask\(data\.repo, data\.msg, branch=" -S . || true
rg -n "from api\.llm import ask" -S . || true

Repository: FalkorDB/code-graph

Length of output: 580

Fix GraphRAG contract + branch handling + Cypher safety in api/llm.py

api/llm.py no longer uses GraphRAG for chat (it only wires LiteLLM with hand-written prompts, and there is no api/mcp/graphrag_init.py), violating the “LLM chat functionality must be implemented via GraphRAG in api/llm.py” rule.

Additional issues in the same flow (these also affect lines 104-127):

/api/chat passes branch=data.branch into api.llm.ask(...), but ask() is defined as ask(repo_name: str, question: str), so this will raise TypeError at runtime.

_run_cypher() executes raw LLM-generated Cypher via graph.query(cypher) with no read-only/write-blocking guard; prompts also do not restrict to read-only Cypher.

_run_cypher() swallows all FalkorDB/query exceptions and returns [], masking real DB/query failures as “no context”.

Graph selection is effectively “by repo_name only” (db.select_graph(repo_name)), so it is not scoped to the per-branch graph identity used elsewhere.
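The signature mismatch in the first bullet can be reproduced in isolation. The sketch below uses stand-in functions, not the real `api.llm.ask`; `ask_with_branch` is a hypothetical shape for the fix, and how the branch should select a graph is deliberately left out because the per-branch graph naming is not shown in this review.

```python
from __future__ import annotations


def ask_current(repo_name: str, question: str) -> str:
    """Mirrors the two-parameter signature described in the review."""
    return f"[{repo_name}] {question}"


def ask_with_branch(repo_name: str, question: str, branch: str | None = None) -> str:
    """Hypothetical variant that accepts the branch the route forwards."""
    scope = repo_name if branch is None else f"{repo_name} (branch {branch})"
    return f"[{scope}] {question}"


def call_like_route(ask_fn) -> str:
    """Call ask the way /api/chat is described as calling it."""
    try:
        return ask_fn("code-graph", "Who calls index_repo?", branch="main")
    except TypeError as exc:
        # An unexpected keyword argument surfaces here at call time.
        return f"TypeError: {exc}"
```

Calling `call_like_route(ask_current)` returns a string starting with `TypeError:`, while `call_like_route(ask_with_branch)` succeeds, which is the runtime difference the comment describes.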

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/llm.py` around lines 73 - 75, The current api/llm.py must be changed to use the GraphRAG abstraction for chat (re-import/instantiate GraphRAG from the existing graphrag_init module instead of directly wiring LiteLLM in _build_llm), update ask(repo_name: str, question: str, branch: str) to accept a branch argument and ensure callers (e.g., /api/chat) pass data.branch through, and fix _run_cypher to (1) select the per-branch graph (use db.select_graph(repo_name, branch) or equivalent graph identity used elsewhere), (2) enforce read-only Cypher by validating/rewriting the LLM output before executing (reject/abort if DDL/DML keywords detected), (3) avoid swallowing DB exceptions—catch and rethrow or return an explicit error so failures aren’t masked as empty context, and (4) call graph.query only after these checks; reference functions/classes: _build_llm, ask, _run_cypher, GraphRAG, graphrag_init, graph.query, db.select_graph.

Source: Coding guidelines

coderabbitai · 2026-06-09T08:43:10Z

+async def _run_cypher(repo_name: str, cypher: str) -> list[list[Any]]:
+    if not cypher:
+        return []
+    db = _falkordb()
+    graph = db.select_graph(repo_name)
+    try:
+        result = await graph.query(cypher)
+        return list(result.result_set or [])
+    except Exception:
+        logger.exception("Cypher execution failed: %s", cypher)
+        return []


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Block non-read Cypher before executing it.

The model output is executed verbatim. A prompt-injected question can steer the LLM into emitting CREATE/MERGE/SET/DELETE, and this path will run it against the live repo graph. Enforce a read-only allowlist here and/or use FalkorDB credentials that cannot mutate data.

🔒 Suggested guard

+_WRITE_CYPHER_RE = re.compile(
+    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|FOREACH)\b",
+    re.IGNORECASE,
+)
+
+
+def _assert_read_only_cypher(cypher: str) -> None:
+    if _WRITE_CYPHER_RE.search(cypher):
+        raise ValueError("Generated Cypher must be read-only")
+
+
 async def _run_cypher(repo_name: str, cypher: str) -> list[list[Any]]:
     if not cypher:
         return []
+    _assert_read_only_cypher(cypher)
     db = _falkordb()
     graph = db.select_graph(repo_name)

Also applies to: 104-113

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/llm.py` around lines 87 - 97, The _run_cypher function currently executes model-provided Cypher verbatim; add a strict read-only guard that rejects any Cypher containing mutation keywords (e.g., CREATE, MERGE, SET, DELETE, REMOVE, DROP, CALL, LOAD CSV, CREATE INDEX, DROP INDEX, etc.) using a case-insensitive regex or token check before calling graph.query, return/log an error for disallowed queries, and/or use a FalkorDB read-only credential/connection from _falkordb() to ensure no writes are possible; apply the identical validation/read-only-credential change to the other Cypher execution site that also calls graph.query so all model-driven executions are protected.
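To make the trade-offs of the suggested guard concrete, here is a standalone version of the same regex check that can be run without FalkorDB. The pattern is copied from the suggestion above; the function name without the leading underscore and the sample queries are illustrative assumptions.

```python
import re

# Same keyword list as the suggested guard.
WRITE_CYPHER_RE = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|FOREACH)\b",
    re.IGNORECASE,
)


def assert_read_only_cypher(cypher: str) -> None:
    """Raise ValueError if the query contains a write keyword."""
    if WRITE_CYPHER_RE.search(cypher):
        raise ValueError("Generated Cypher must be read-only")


def is_rejected(cypher: str) -> bool:
    """Convenience wrapper for trying sample queries."""
    try:
        assert_read_only_cypher(cypher)
    except ValueError:
        return True
    return False


# A plain read passes; writes are rejected regardless of case.
READ_QUERY = "MATCH (n:Function) RETURN n.name LIMIT 5"
WRITE_QUERIES = ["MATCH (n) DETACH DELETE n", "match (n) set n.x = 1"]
# A keyword inside a string literal is also rejected (false positive).
LITERAL_QUERY = "MATCH (n) WHERE n.name = 'set' RETURN n"
```

A keyword match is conservative: it rejects reads that merely mention a write keyword in a string literal, and this particular pattern does not cover `CALL` or `LOAD CSV`, which the agent prompt above also lists. That is one reason the comment pairs it with read-only database credentials.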

coderabbitai · 2026-06-09T08:43:10Z

+`~/.copilot/session-state/<id>/plan.md` for the full design doc and the
+deferred pre-requisites.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Replace the local session-state path with a repo-tracked reference.

~/.copilot/session-state/<id>/plan.md is user-local and not accessible to collaborators/CI docs readers; this should point to a committed file or be removed.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bench/README.md` around lines 11 - 12, The README references a user-local path '`~/.copilot/session-state/<id>/plan.md`' which is not accessible to collaborators/CI; update the line to point to a committed, repo-tracked file (e.g., replace with a path under the repository like a docs or bench subpath such as a committed "session-plan.md") or remove the reference entirely, and ensure the updated README points to the new committed filename so others and CI can access the full design doc.

coderabbitai · 2026-06-09T08:43:10Z

+**Scaffold only.** Directory layout and contracts exist; runners do not.
+Next steps are tracked in the session's todo list.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update status text to match shipped benchmark components.

“Scaffold only… runners do not” appears outdated against this PR’s benchmark runner/adapter/tooling additions, and may mislead users about current capabilities.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bench/README.md` around lines 16 - 17, Update the README status line that currently reads "Scaffold only. Directory layout and contracts exist; runners do not." to accurately reflect that benchmark runners, adapters, and tooling are now included in this PR; replace that sentence with concise, current status text (e.g., "Benchmarks include runners, adapters, and tooling — see the runners/ and adapters/ directories for usage and examples") so users aren’t misled, and ensure any follow-up "todo" pointers in the README reference remaining tasks only.

coderabbitai · 2026-06-09T08:43:10Z

+                          # re-index. Fairness vs live-LSP.
+
+backend:
+  service_url: http://host.docker.internal:5000


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check for Docker host configuration docs and alternatives
rg -n 'host\.docker\.internal|extra_hosts|DOCKER_HOST' docker-compose.yml README.md bench/README.md

Repository: FalkorDB/code-graph

Length of output: 175

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find all occurrences of host.docker.internal and common Linux alternatives
rg -n --hidden --no-ignore-vcs 'host\.docker\.internal|extra_hosts|host-gateway|DOCKER_HOST|FALKORDB_HOST' .

# 2) Find env template documentation
rg -n --hidden --no-ignore-vcs '\.env\.template|env\.template|SECRET_TOKEN' .

# 3) Locate docker-compose files and inspect for extra_hosts/host-gateway if present
fd -t f 'docker-compose*.yml' . || true
fd -t f 'compose*.yml' . || true

Repository: FalkorDB/code-graph

Length of output: 9057

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show relevant sections around host.docker.internal in tools config and README
sed -n '1,120p' bench/tools/code_graph/tools.yaml
echo '---'
sed -n '260,340p' README.md
echo '---'
nl -ba .env.template | sed -n '1,80p'
echo '---'

# Show docker-compose for any host-gateway/extra_hosts alternatives
nl -ba docker-compose.yml | sed -n '1,120p'
echo '---'

# Show any bench scripts/docs mentioning host.docker.internal or Linux alternatives
rg -n 'host\.docker\.internal' -S . || true

Repository: FalkorDB/code-graph

Length of output: 3819

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== bench/tools/code_graph/tools.yaml (around backend) ==="
sed -n '1,120p' bench/tools/code_graph/tools.yaml

echo "=== README.md (around Docker run env vars) ==="
sed -n '260,360p' README.md

echo "=== .env.template ==="
sed -n '1,120p' .env.template

echo "=== docker-compose.yml ==="
sed -n '1,140p' docker-compose.yml

echo "=== All host.docker.internal occurrences ==="
rg -n 'host\.docker\.internal' -S .

echo "=== Any extra_hosts/host-gateway alternatives ==="
rg -n 'extra_hosts|host-gateway' -S .

Repository: FalkorDB/code-graph

Length of output: 8258

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find swe_bench runner and inspect docker setup
fd -t f 'swe_bench*.py' bench || true
rg -n 'swe_bench\.py|network host|--network host|host-gateway|extra_hosts' -S bench || true

# Inspect code_graph_adapter and swe_bench docker setup if found
FILE1="$(fd -t f 'swe_bench*.py' bench | head -n 1 || true)"
echo "SWE file: $FILE1"
if [ -n "${FILE1:-}" ]; then
  sed -n '1,220p' "$FILE1"
fi

FILE2="$(rg -l 'host\.docker\.internal:5000' bench/agents/code_graph_adapter.py || true)"
echo "Adapter file: $FILE2"
if [ -n "${FILE2:-}" ]; then
  sed -n '1,140p' "$FILE2"
  echo '---'
  sed -n '140,260p' "$FILE2"
fi

Repository: FalkorDB/code-graph

Length of output: 14563

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate swe_bench docker setup and check for host-network usage / host-gateway alternatives
fd -t f 'swe_bench*.py' bench || true
rg -n 'swe_bench\.py|network host|--network host|host-gateway|extra_hosts' -S bench || true

# Inspect bench/runners/swe_bench.py (or similarly named) if present
FILE1="$(fd -t f 'swe_bench*.py' bench | head -n 1 || true)"
echo "SWE file: $FILE1"
if [ -n "${FILE1:-}" ]; then
  sed -n '1,260p' "$FILE1"
fi

# Inspect how the adapter uses host.docker.internal
FILE2="$(rg -l 'host\.docker\.internal:5000' bench/agents/code_graph_adapter.py || true)"
echo "Adapter file: $FILE2"
if [ -n "${FILE2:-}" ]; then
  sed -n '1,220p' "$FILE2"
fi

Repository: FalkorDB/code-graph

Length of output: 15988

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== bench/runners tree (files) ==="
find bench/runners -maxdepth 2 -type f -print || true

echo "=== Find any swe_bench references under bench/runners ==="
rg -n 'swe[_-]?bench|SWE_BENCH|Verified' bench/runners || true

echo "=== Search for docker compose/run flags under bench/runners ==="
rg -n --hidden --no-ignore-vcs 'docker (run|compose)|--network|network host|host-gateway|extra_hosts' bench/runners || true

echo "=== Search whole repo for '--network host' occurrences ==="
rg -n --hidden --no-ignore-vcs '--network host|network host' . || true

Repository: FalkorDB/code-graph

Length of output: 2926

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== rg CODEGRAPH_URL in bench/ ==="
rg -n --hidden --no-ignore-vcs 'CODEGRAPH_URL' bench || true

echo "=== Inspect bench/runners/swebench_verify.py ==="
sed -n '1,240p' bench/runners/swebench_verify.py

echo "=== Inspect bench/runners/mini_runner.py (swe-bench parts + docker run) ==="
# Show relevant CLI arg section and swe-bench execution path
sed -n '820,1040p' bench/runners/mini_runner.py

# Search for docker run usage in mini_runner.py
echo "=== rg docker in mini_runner.py ==="
rg -n 'docker (run|compose)|--network|extra_hosts|host-gateway' bench/runners/mini_runner.py || true

Repository: FalkorDB/code-graph

Length of output: 17173

🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '200,340p' bench/runners/mini_runner.py

Repository: FalkorDB/code-graph

Length of output: 7011

🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,120p' bench/cli/cg.py
echo '---'
rg -n 'CODEGRAPH_URL' README.md bench/cli bench/tools bench/agents || true

Repository: FalkorDB/code-graph

Length of output: 4555

Fix host.docker.internal portability by making the code-graph base URL configurable

SECRET_TOKEN is already documented in .env.template, and both bench/tools/code_graph/tools.yaml and bench/agents/code_graph_adapter.py use that same env var name.

bench/tools/code_graph/tools.yaml hardcodes backend.service_url: http://host.docker.internal:5000, but the benchmark runner overrides CODEGRAPH_URL to http://127.0.0.1:5000 for the code_graph config, so the hostname may not affect normal runs.

Update bench/tools/code_graph/tools.yaml to derive from CODEGRAPH_URL (or clearly document that users must set CODEGRAPH_URL for non-runner usage, especially on Linux where host.docker.internal may not resolve).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bench/tools/code_graph/tools.yaml` at line 28, The tools.yaml currently hardcodes service_url to http://host.docker.internal:5000 which is not portable; update bench/tools/code_graph/tools.yaml to read the base URL from the CODEGRAPH_URL environment variable (falling back to the existing literal if unset) so the code-graph tool uses the same endpoint as the benchmark runner, and add a short note in .env.template or README that CODEGRAPH_URL must be set on platforms where host.docker.internal is unavailable (keep SECRET_TOKEN usage unchanged to match bench/agents/code_graph_adapter.py).
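One way to read "derive from CODEGRAPH_URL with a fallback" is sketched below. `resolve_service_url` is a hypothetical helper, not code from the adapter; only the environment variable name and the fallback literal come from the comment above.

```python
from __future__ import annotations

import os
from typing import Mapping

# The literal currently hardcoded in tools.yaml, kept as the fallback.
DEFAULT_SERVICE_URL = "http://host.docker.internal:5000"


def resolve_service_url(env: Mapping[str, str] | None = None) -> str:
    """Prefer CODEGRAPH_URL when set and non-empty, else use the default."""
    env = os.environ if env is None else env
    url = env.get("CODEGRAPH_URL", "").strip()
    # Drop a trailing slash so callers can append paths consistently.
    return (url or DEFAULT_SERVICE_URL).rstrip("/")
```

On Linux hosts where `host.docker.internal` does not resolve, exporting `CODEGRAPH_URL=http://127.0.0.1:5000` (the value the runner is described as using) would then take precedence over the hardcoded default.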

coderabbitai · 2026-06-09T08:43:10Z

+```
+COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language specifier to fenced code block.

The fenced code block is missing a language identifier, which violates MD040 linting rules. Since this shows example output format, specify text or bash.

📝 Proposed fix

-```
+```text
 COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 57-57: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@bench/tools/lsp/system_preamble.md` around lines 57 - 59, The fenced code block containing COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT is missing a language specifier; update the block delimiter from ``` to ```text (or ```bash if preferred) so the code fence becomes ```text COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT ``` to satisfy the MD040 lint rule and explicitly mark the example output language.

Source: Linters/SAST tools

…ombined

# Conflicts:
# AGENTS.md
# README.md
# api/llm.py
# api/mcp/auto_init.py
# api/mcp/server.py
# api/mcp/templates/claude_mcp_section.md
# api/mcp/templates/cursorrules.template
# scripts/mcp_smoke.py
# tests/mcp/test_auto_init.py
# tests/mcp/test_init_agent.py
# uv.lock

+
+from minisweagent.environments.local import LocalEnvironment, LocalEnvironmentConfig
+from minisweagent.exceptions import Submitted
+from minisweagent.utils.serialize import recursive_merge


+from __future__ import annotations
+
+import json
+import os


The MCP server consolidated its per-edge tools: get_callers/get_callees/get_dependencies are now a single get_neighbors(relation, direction), search_code takes `query` (hybrid) instead of `prefix`, find_symbol and get_file_neighbors are new, and ask is no longer exposed. The benchmark MCP arm still targeted the removed names, so the code_graph_mcp track would have called tools that don't exist after merging staging.

Update the adapter, cg-mcp shim, tools.yaml, system_preamble, and the mini_runner inline preamble to the current surface:

- search_code: --prefix -> --query
- get_callers/callees/dependencies -> get_neighbors (--relation, --direction IN|OUT|BOTH); callers = IN+CALLS, callees = OUT+CALLS, deps = OUT + CALLS,IMPORTS,DEFINES
- add find_symbol (--name, --file?) and get_file_neighbors (--file)
- drop ask (intentionally excluded from the benchmark surface anyway)
- get_file_neighbors output: compact the nested neighbors list (strip worktree path prefix, cap to --limit) like the flat list tools

Tests updated accordingly (get_callers -> get_neighbors compaction test, search_code e2e uses --query). 17 passed, 1 skipped (FalkorDB e2e).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
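The renames in this commit message can be summarized as a small translation table. This is an illustrative sketch only: `translate_legacy_call` is a hypothetical helper, and passing relations as a comma-separated string mirrors the `CALLS,IMPORTS,DEFINES` notation in the message rather than any confirmed argument schema of the MCP server.

```python
from __future__ import annotations

from typing import Any

# Legacy tool name -> get_neighbors arguments, per the commit message.
# The comma-separated relation encoding is an assumption.
LEGACY_NEIGHBOR_TOOLS: dict[str, dict[str, str]] = {
    "get_callers": {"direction": "IN", "relation": "CALLS"},
    "get_callees": {"direction": "OUT", "relation": "CALLS"},
    "get_dependencies": {"direction": "OUT", "relation": "CALLS,IMPORTS,DEFINES"},
}


def translate_legacy_call(
    name: str, arguments: dict[str, Any]
) -> tuple[str, dict[str, Any]]:
    """Map a pre-consolidation tool call onto the current tool surface."""
    if name in LEGACY_NEIGHBOR_TOOLS:
        return "get_neighbors", {**arguments, **LEGACY_NEIGHBOR_TOOLS[name]}
    if name == "search_code" and "prefix" in arguments:
        renamed = dict(arguments)
        renamed["query"] = renamed.pop("prefix")  # --prefix -> --query
        return name, renamed
    if name == "ask":
        raise ValueError("ask is no longer exposed by the MCP server")
    return name, dict(arguments)
```

A table like this makes the failure mode in the message easy to see: any adapter still emitting the left-hand names would be calling tools that no longer exist.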

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

bench/agents/code_graph_mcp_adapter.py (1)
136-138: ⚠️ Potential issue | 🟠 Major

Major: The timeout override in cg_mcp.py does not reach call_tool()'s default, because Python binds default argument values at definition time.
In bench/agents/code_graph_mcp_adapter.py, call_tool(..., timeout: float = DEFAULT_TIMEOUT_SEC) captures DEFAULT_TIMEOUT_SEC when the function is defined, so setting cgm.DEFAULT_TIMEOUT_SEC = timeout in bench/cli/cg_mcp.py has no effect on wrapper calls that omit timeout (e.g., index_repo/search_code/...).
Suggested fix
-def call_tool(name: str, arguments: dict[str, Any], *, timeout: float = DEFAULT_TIMEOUT_SEC) -> Any:
+def call_tool(name: str, arguments: dict[str, Any], *, timeout: float | None = None) -> Any:
     """Sync entry point for the bash shim. One spawn per call."""
+    if timeout is None:
+        timeout = DEFAULT_TIMEOUT_SEC
     return asyncio.run(_call_tool_async(name, arguments, timeout))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/agents/code_graph_mcp_adapter.py` around lines 136 - 138, The call_tool
function currently binds DEFAULT_TIMEOUT_SEC at definition time so runtime
overrides to DEFAULT_TIMEOUT_SEC (from cg_mcp.py) won't take effect; change the
signature of call_tool(name: str, arguments: dict[str, Any], *, timeout:
Optional[float] = None) -> Any and inside the function set timeout =
DEFAULT_TIMEOUT_SEC if timeout is None before calling _call_tool_async (and
apply the same pattern for the underlying async wrapper _call_tool_async if it
also uses a default), ensuring runtime updates to DEFAULT_TIMEOUT_SEC are
honored.
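The definition-time binding behind this finding can be shown in a few lines. The function names and the 30/120 second values below are illustrative, not taken from the adapter.

```python
from __future__ import annotations

DEFAULT_TIMEOUT_SEC = 30.0  # illustrative value


def call_early(*, timeout: float = DEFAULT_TIMEOUT_SEC) -> float:
    """Default is evaluated once, when this def statement runs."""
    return timeout


def call_late(*, timeout: float | None = None) -> float:
    """Default is looked up on every call that omits timeout."""
    if timeout is None:
        timeout = DEFAULT_TIMEOUT_SEC
    return timeout


# Simulate the CLI overriding the module-level constant at runtime.
DEFAULT_TIMEOUT_SEC = 120.0
```

After the reassignment, `call_early()` still returns `30.0` while `call_late()` returns `120.0`, which is why the `None` sentinel in the suggested fix lets the CLI override take effect.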

♻️ Duplicate comments (1)

bench/agents/code_graph_mcp_adapter.py (1)

123-129: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Close devnull with a context manager.

On Line 123, open(os.devnull, "w") is not guaranteed to close if initialize() / call_tool() times out or raises, which can leak FDs across repeated CLI invocations.

Suggested fix

-    devnull = open(os.devnull, "w")
-    async with stdio_client(params, errlog=devnull) as (read, write):
-        async with ClientSession(read, write) as session:
-            await asyncio.wait_for(session.initialize(), timeout=timeout)
-            result = await asyncio.wait_for(
-                session.call_tool(name, arguments), timeout=timeout
-            )
-            payload = _extract(result)
-            if getattr(result, "isError", False):
-                return {"error": payload}
-            return payload
+    with open(os.devnull, "w") as devnull:
+        async with stdio_client(params, errlog=devnull) as (read, write):
+            async with ClientSession(read, write) as session:
+                await asyncio.wait_for(session.initialize(), timeout=timeout)
+                result = await asyncio.wait_for(
+                    session.call_tool(name, arguments), timeout=timeout
+                )
+                payload = _extract(result)
+                if getattr(result, "isError", False):
+                    return {"error": payload}
+                return payload

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/agents/code_graph_mcp_adapter.py` around lines 123 - 129, The devnull
file descriptor opened by open(os.devnull, "w") can leak if session.initialize()
or session.call_tool() raises or times out; wrap that open in a context manager
so it is always closed (e.g., use with open(os.devnull, "w") as devnull: around
the async with stdio_client(...) block) so the devnull variable is closed even
on exceptions; ensure the indented block still creates the stdio_client(...) and
enters ClientSession(read, write) and calls session.initialize() /
session.call_tool() as before.

🧹 Nitpick comments (1)

tests/bench/test_cg_mcp_cli_compaction.py (1)
45-169: ⚡ Quick win

Align new backend tests with the suite’s unittest convention.

These new tests are added as pytest-style top-level functions under tests/; please convert them to a unittest.TestCase class to stay consistent with the repository’s backend test style and avoid piecemeal framework drift.

Based on learnings, in this repository’s backend test suite under tests/, new tests should inherit from unittest.TestCase, and pytest migration should be coordinated separately.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/bench/test_cg_mcp_cli_compaction.py` around lines 45 - 169, Convert
these top-level pytest-style functions into a unittest.TestCase subclass (e.g.,
class TestCgMcpCliCompaction(unittest.TestCase)) and move each test_* function
into instance methods (def
test_strip_worktree_prefix_makes_path_repo_relative(self): etc.). Replace bare
assert statements with unittest assertions (self.assertEqual, self.assertTrue,
self.assertNotIn, self.assertIsNone, etc.), keep helpers like _run_cli and
_entry either as module-level helpers or as `@staticmethods` on the TestCase, and
retain patching via unittest.mock.patch (as decorators or context managers)
inside the test methods; also ensure unittest is imported at the top. Ensure
method names and referenced symbols (cg_mcp._strip_worktree_prefix,
cg_mcp._compact_entry, cg_mcp._compact_list, _run_cli, cg_mcp.main,
cg_mcp.cgm.impact_analysis, cg_mcp.cgm.get_neighbors) remain unchanged so tests
locate the same code.
Source: Learnings
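
A minimal sketch of the conversion this nitpick asks for. The helper `strip_worktree_prefix` below is a hypothetical stand-in written for illustration only; the real `cg_mcp._strip_worktree_prefix` is not shown in this thread and may behave differently. The point is the shape: a `unittest.TestCase` subclass, instance methods, and `self.assert*` calls instead of bare `assert`.

```python
import unittest


def strip_worktree_prefix(path: str, worktree: str) -> str:
    """Hypothetical stand-in for cg_mcp._strip_worktree_prefix (assumption)."""
    prefix = worktree.rstrip("/") + "/"
    # Only strip when the path really lives under the worktree.
    return path[len(prefix):] if path.startswith(prefix) else path


class TestCgMcpCliCompaction(unittest.TestCase):
    # Illustrative worktree path following the <instance>__<cfg> naming.
    WORKTREE = "/tmp/worktrees/inst__code_graph_mcp"

    def test_strip_worktree_prefix_makes_path_repo_relative(self):
        got = strip_worktree_prefix(f"{self.WORKTREE}/pkg/mod.py", self.WORKTREE)
        self.assertEqual(got, "pkg/mod.py")

    def test_paths_outside_worktree_are_left_alone(self):
        got = strip_worktree_prefix("/etc/hosts", self.WORKTREE)
        self.assertEqual(got, "/etc/hosts")
```

Run with `python -m unittest <module>`; pytest also collects `unittest.TestCase` classes, so the conversion should not break a pytest-based invocation.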

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@bench/agents/code_graph_mcp_adapter.py`:
- Around line 136-138: The call_tool function currently binds
DEFAULT_TIMEOUT_SEC at definition time so runtime overrides to
DEFAULT_TIMEOUT_SEC (from cg_mcp.py) won't take effect; change the signature of
call_tool(name: str, arguments: dict[str, Any], *, timeout: Optional[float] =
None) -> Any and inside the function set timeout = DEFAULT_TIMEOUT_SEC if
timeout is None before calling _call_tool_async (and apply the same pattern for
the underlying async wrapper _call_tool_async if it also uses a default),
ensuring runtime updates to DEFAULT_TIMEOUT_SEC are honored.

---

Duplicate comments:
In `@bench/agents/code_graph_mcp_adapter.py`:
- Around line 123-129: The devnull file descriptor opened by open(os.devnull,
"w") can leak if session.initialize() or session.call_tool() raises or times
out; wrap that open in a context manager so it is always closed (e.g., use with
open(os.devnull, "w") as devnull: around the async with stdio_client(...) block)
so the devnull variable is closed even on exceptions; ensure the indented block
still creates the stdio_client(...) and enters ClientSession(read, write) and
calls session.initialize() / session.call_tool() as before.

---

Nitpick comments:
In `@tests/bench/test_cg_mcp_cli_compaction.py`:
- Around line 45-169: Convert these top-level pytest-style functions into a
unittest.TestCase subclass (e.g., class
TestCgMcpCliCompaction(unittest.TestCase)) and move each test_* function into
instance methods (def test_strip_worktree_prefix_makes_path_repo_relative(self):
etc.). Replace bare assert statements with unittest assertions
(self.assertEqual, self.assertTrue, self.assertNotIn, self.assertIsNone, etc.),
keep helpers like _run_cli and _entry either as module-level helpers or as
`@staticmethods` on the TestCase, and retain patching via unittest.mock.patch (as
decorators or context managers) inside the test methods; also ensure unittest is
imported at the top. Ensure method names and referenced symbols
(cg_mcp._strip_worktree_prefix, cg_mcp._compact_entry, cg_mcp._compact_list,
_run_cli, cg_mcp.main, cg_mcp.cgm.impact_analysis, cg_mcp.cgm.get_neighbors)
remain unchanged so tests locate the same code.
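
The outside-diff comment on `call_tool` above hinges on when Python evaluates default arguments. The sketch below illustrates the difference with two simplified functions; the names `call_tool_early` and `call_tool_late` and the timeout values are assumptions for illustration, not the adapter's actual code.

```python
from typing import Any, Optional

DEFAULT_TIMEOUT_SEC = 30.0  # assumed value; the real default is not shown here


def call_tool_early(name: str, arguments: dict[str, Any], *,
                    timeout: float = DEFAULT_TIMEOUT_SEC) -> float:
    """The default is captured once, when the def statement executes."""
    return timeout


def call_tool_late(name: str, arguments: dict[str, Any], *,
                   timeout: Optional[float] = None) -> float:
    """The default is looked up on every call, so overrides take effect."""
    if timeout is None:
        timeout = DEFAULT_TIMEOUT_SEC
    return timeout


# Simulate a runtime override of the module-level constant.
DEFAULT_TIMEOUT_SEC = 120.0

print(call_tool_early("get_neighbors", {}))  # 30.0
print(call_tool_late("get_neighbors", {}))   # 120.0
```

An explicit `timeout=` argument still wins in both versions; only the fallback changes.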

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9d0afa5e-25be-4c40-920f-bdaed8daebff

📥 Commits

Reviewing files that changed from the base of the PR and between d7cf6e0 and 0fe92ea.

📒 Files selected for processing (7)

bench/agents/code_graph_mcp_adapter.py
bench/cli/cg_mcp.py
bench/runners/mini_runner.py
bench/tools/code_graph_mcp/system_preamble.md
bench/tools/code_graph_mcp/tools.yaml
tests/bench/test_cg_mcp_adapter.py
tests/bench/test_cg_mcp_cli_compaction.py

✅ Files skipped from review due to trivial changes (1)

bench/tools/code_graph_mcp/system_preamble.md

🚧 Files skipped from review as they are similar to previous changes (3)

bench/tools/code_graph_mcp/tools.yaml
tests/bench/test_cg_mcp_adapter.py
bench/runners/mini_runner.py

galshubeli · 2026-06-09T11:03:41Z

-    KnowledgeGraph,
-    KnowledgeGraphModelConfig
-)
+This module previously relied on `graphrag-sdk` 0.8.x's `KnowledgeGraph` class,


Why do we need this in the code file?

galshubeli · 2026-06-09T11:04:40Z


-    return ontology
+def _build_llm() -> LiteLLM:
+    model_name = os.getenv("MODEL_NAME", "gemini/gemini-flash-lite-latest")


The default should be a model that will work.

…ph_mcp

The benchmark now compares code-graph only over the cgraph-mcp stdio transport (the real agent integration), so the HTTP `code_graph` arm is redundant. Remove it and its plumbing:

- delete bench/tools/code_graph/, bench/agents/code_graph_adapter.py, bench/cli/cg(.py), bench/scripts/start-api.sh, and the HTTP adapter test
- mini_runner: drop "code_graph" from VALID_CONFIGS, remove its instance template, config_env branch, _ensure_indexed (HTTP), and tool prechecks
- update README, metrics comment, and runner tests to the MCP surface

Configs are now baseline / lsp / code_graph_mcp.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

+    # bash shim is invoked by the agent, the server's stderr is merged
+    # into the agent's tool-output buffer, inflating context by ~1.8kB
+    # per call. The agent only needs the JSON-RPC result on stdout.
+    devnull = open(os.devnull, "w")


+        env_path = REPO_ROOT / ".env"
+        if env_path.exists():
+            load_dotenv(env_path)
+    except ImportError:


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

bench/runners/mini_runner.py (2)
258-267: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not continue after MCP indexing fails.

The comment at Lines 798-803 says the graph must exist before the task runs, but _ensure_indexed_mcp() currently turns both Redis/bootstrap failures and index_repo() errors into warnings and proceeds anyway. That produces code_graph_mcp trajectories against an empty/missing graph and records them as tool-track runs instead of surfacing an infra failure.

Also applies to: 798-803
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/runners/mini_runner.py` around lines 258 - 267, The
_ensure_indexed_mcp() flow must not swallow MCP indexing failures; when calling
cgm.index_repo (in _ensure_indexed_mcp) treat any payload that is a dict with an
"error" key or any exception as a hard failure: log an error and raise an
exception (e.g., RuntimeError or re-raise the caught exception) instead of
printing a warning and continuing. Update the index_repo result handling to
check isinstance(payload, dict) and payload.get("error") -> raise
RuntimeError(f"index_repo failed: {payload['error']}"), and in the except
Exception as exc block re-raise exc (or wrap and raise) so the caller surfaces
an infra failure rather than proceeding to create code_graph_mcp trajectories
against a missing graph.
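
One way the hard-failure behaviour described in this prompt could look, as a sketch only. `ensure_indexed` and its `index_repo` parameter are hypothetical names; `index_repo` stands in for `cgm.index_repo`, whose real signature is not shown here.

```python
from typing import Any, Callable


def ensure_indexed(index_repo: Callable[[str], Any], repo_path: str) -> Any:
    """Index repo_path or raise; never continue with a missing graph.

    index_repo is a hypothetical stand-in for cgm.index_repo (assumption).
    """
    try:
        payload = index_repo(repo_path)
    except Exception as exc:
        # Wrap so the caller sees an infra failure with the original cause attached.
        raise RuntimeError(f"index_repo raised for {repo_path}: {exc}") from exc
    # Error payloads are reported as a dict carrying an "error" key.
    if isinstance(payload, dict) and payload.get("error"):
        raise RuntimeError(f"index_repo failed: {payload['error']}")
    return payload
```

The caller can then record the instance as an infrastructure failure instead of a `code_graph_mcp` trajectory.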
231-254: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The MCP cache key can reuse a stale graph on reruns.

This fast-path only keys on repo_path.name and branch. For resumed/retried swe-bench runs, the worktree is recreated at the same <instance>__<cfg> path, so GRAPH.LIST can hit an index from an earlier dirty/crashed attempt and skip rebuilding against the fresh checkout. That makes the agent query relationships for a different repo state than the files on disk.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/runners/mini_runner.py` around lines 231 - 254, The fast-path key for
MCP indexing in _ensure_indexed_mcp uses only repo_path.name and branch which
can match stale indexes; change the expected_graph to include a unique
fingerprint of the current checkout (e.g. the current commit SHA or a repo
checksum) so the skip-fast-path only applies when the on-disk state matches the
index. Concretely: in _ensure_indexed_mcp compute a fingerprint (try running git
rev-parse --verify HEAD in repo_path and fall back to a deterministic hash of
the tree contents if not a git repo), append that fingerprint to expected_graph
(e.g. f"code:{repo_name}:{branch}:{commit_sha}"), and use that new
expected_graph when checking r.execute_command("GRAPH.LIST"); keep existing
Redis calls and fallbacks intact.
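
A sketch of the fingerprinting idea in this prompt, under stated assumptions: the function names `tree_hash`, `checkout_fingerprint`, and `expected_graph_name` are hypothetical, and only the `code:{repo_name}:{branch}:{commit_sha}` key shape comes from the review comment.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def tree_hash(repo_path: Path) -> str:
    """Deterministic SHA-256 over relative paths and file bytes."""
    digest = hashlib.sha256()
    for path in sorted(p for p in repo_path.rglob("*") if p.is_file()):
        digest.update(path.relative_to(repo_path).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def checkout_fingerprint(repo_path: Path) -> str:
    """Prefer the HEAD commit SHA; fall back to a content hash."""
    try:
        proc = subprocess.run(
            ["git", "rev-parse", "--verify", "HEAD"],
            cwd=repo_path, capture_output=True, text=True, check=True,
        )
        return proc.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git missing, or repo_path is not a git checkout.
        return tree_hash(repo_path)


def expected_graph_name(repo_path: Path, branch: str) -> str:
    """Graph key that only matches when the checkout state matches."""
    return f"code:{repo_path.name}:{branch}:{checkout_fingerprint(repo_path)}"


# Usage: build the key for a throwaway directory.
with tempfile.TemporaryDirectory() as td:
    repo = Path(td) / "repo"
    repo.mkdir()
    (repo / "a.py").write_text("x = 1\n")
    print(expected_graph_name(repo, "main"))
```

Note that a HEAD SHA alone does not capture uncommitted edits in the worktree; if dirty checkouts matter, the content hash (or both combined) may be the safer key.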

🧹 Nitpick comments (1)

tests/test_bench_runner.py (1)
84-89: ⚡ Quick win

Assert BRANCH in the MCP env test too.

code_graph_mcp command construction depends on both PROJECT_NAME and BRANCH, but this test only guards the former. A regression dropping BRANCH would still pass here and break every cg-mcp ... --branch "$BRANCH" invocation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_bench_runner.py` around lines 84 - 89, Update the test
test_config_env_code_graph_mcp_prepends_cli_dir_and_sets_project to also assert
that the BRANCH key exists in the environment dict returned by
mini_runner.config_env("code_graph_mcp", tmp_path); specifically, after calling
env = mini_runner.config_env("code_graph_mcp", tmp_path) add an assertion that
"BRANCH" in env (similar to the existing PROJECT_NAME and FALKORDB_* checks) so
the test will fail if mini_runner.config_env or the code_graph_mcp invocation no
longer supplies BRANCH.
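
The extra assertion could look roughly like the sketch below. `config_env` here is a hypothetical stand-in for `mini_runner.config_env` (the real function and the values it sets are not shown in this thread), and the test takes a plain `Path` where the real test receives pytest's `tmp_path` fixture.

```python
from pathlib import Path


def config_env(config: str, repo_path: Path) -> dict[str, str]:
    """Hypothetical stand-in for mini_runner.config_env (assumption)."""
    env: dict[str, str] = {}
    if config == "code_graph_mcp":
        env["PROJECT_NAME"] = repo_path.name
        env["BRANCH"] = "main"  # assumed value, for illustration only
    return env


def test_config_env_code_graph_mcp_sets_project_and_branch(tmp_path: Path) -> None:
    env = config_env("code_graph_mcp", tmp_path)
    assert "PROJECT_NAME" in env
    # Guard BRANCH too: cg-mcp invocations pass --branch "$BRANCH".
    assert "BRANCH" in env
```

Checking key presence rather than a specific value keeps the test stable if the branch naming changes.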

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bench/runners/mini_runner.py`:
- Line 53: The VALID_CONFIGS tuple now includes "code_graph_mcp" but the
bootstrapping only calls _ensure_indexed_mcp() inside the args.swe_bench branch,
so running with --config code_graph_mcp can skip indexing; update the
startup/bootstrap logic to call _ensure_indexed_mcp() whenever args.config ==
"code_graph_mcp" (or when the chosen config requires MCP), not only when
args.swe_bench is true, by adding a conditional that checks args.config (and/or
the config selection path used for synthetic smoke runs) and invokes
_ensure_indexed_mcp() before running; reference VALID_CONFIGS, args.config,
args.swe_bench and _ensure_indexed_mcp() to locate where to add this call so
both real-run and synthetic paths ensure the graph is indexed.

---

Outside diff comments:
In `@bench/runners/mini_runner.py`:
- Around line 258-267: The _ensure_indexed_mcp() flow must not swallow MCP
indexing failures; when calling cgm.index_repo (in _ensure_indexed_mcp) treat
any payload that is a dict with an "error" key or any exception as a hard
failure: log an error and raise an exception (e.g., RuntimeError or re-raise the
caught exception) instead of printing a warning and continuing. Update the
index_repo result handling to check isinstance(payload, dict) and
payload.get("error") -> raise RuntimeError(f"index_repo failed:
{payload['error']}"), and in the except Exception as exc block re-raise exc (or
wrap and raise) so the caller surfaces an infra failure rather than proceeding
to create code_graph_mcp trajectories against a missing graph.
- Around line 231-254: The fast-path key for MCP indexing in _ensure_indexed_mcp
uses only repo_path.name and branch which can match stale indexes; change the
expected_graph to include a unique fingerprint of the current checkout (e.g. the
current commit SHA or a repo checksum) so the skip-fast-path only applies when
the on-disk state matches the index. Concretely: in _ensure_indexed_mcp compute
a fingerprint (try running git rev-parse --verify HEAD in repo_path and fall
back to a deterministic hash of the tree contents if not a git repo), append
that fingerprint to expected_graph (e.g.
f"code:{repo_name}:{branch}:{commit_sha}"), and use that new expected_graph when
checking r.execute_command("GRAPH.LIST"); keep existing Redis calls and
fallbacks intact.

---

Nitpick comments:
In `@tests/test_bench_runner.py`:
- Around line 84-89: Update the test
test_config_env_code_graph_mcp_prepends_cli_dir_and_sets_project to also assert
that the BRANCH key exists in the environment dict returned by
mini_runner.config_env("code_graph_mcp", tmp_path); specifically, after calling
env = mini_runner.config_env("code_graph_mcp", tmp_path) add an assertion that
"BRANCH" in env (similar to the existing PROJECT_NAME and FALKORDB_* checks) so
the test will fail if mini_runner.config_env or the code_graph_mcp invocation no
longer supplies BRANCH.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0621c535-b9de-4f81-9a60-651a130f4f67

📥 Commits

Reviewing files that changed from the base of the PR and between 0fe92ea and 096ac95.

📒 Files selected for processing (5)

bench/README.md
bench/agents/code_graph_mcp_adapter.py
bench/metrics/__init__.py
bench/runners/mini_runner.py
tests/test_bench_runner.py

✅ Files skipped from review due to trivial changes (1)

bench/README.md

🚧 Files skipped from review as they are similar to previous changes (2)

bench/metrics/__init__.py
bench/agents/code_graph_mcp_adapter.py

coderabbitai · 2026-06-09T11:43:58Z

+DEFAULT_CACHE_DIR = BENCH_DIR / "cache"
+DEFAULT_RESULTS = DEFAULT_CACHE_DIR / "results.jsonl"
+
+VALID_CONFIGS = ("baseline", "lsp", "code_graph_mcp")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

code_graph_mcp is selectable outside the only indexing path.

Now that code_graph_mcp is one of the standard configs, --real-run --config code_graph_mcp still uses the MCP prompt/env but never calls _ensure_indexed_mcp(). The only bootstrap hook is inside the args.swe_bench branch, so the synthetic smoke path can exercise an unindexed graph and produce misleading results.

Suggested fix

     for cfg in configs:
         repo_path = Path(td) / f"repo-{cfg}"
         task = task_fn(repo_path)
+        if cfg == "code_graph_mcp" and not dry_run:
+            _ensure_indexed_mcp(repo_path)
         cfg_rows = run_batch(
             [task],
             [cfg],

Also applies to: 798-803

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@bench/runners/mini_runner.py` at line 53, The VALID_CONFIGS tuple now includes
"code_graph_mcp" but the bootstrapping only calls _ensure_indexed_mcp() inside
the args.swe_bench branch, so running with --config code_graph_mcp can skip
indexing; update the startup/bootstrap logic to call _ensure_indexed_mcp()
whenever args.config == "code_graph_mcp" (or when the chosen config requires
MCP), not only when args.swe_bench is true, by adding a conditional that checks
args.config (and/or the config selection path used for synthetic smoke runs) and
invokes _ensure_indexed_mcp() before running; reference VALID_CONFIGS,
args.config, args.swe_bench and _ensure_indexed_mcp() to locate where to add
this call so both real-run and synthetic paths ensure the graph is indexed.

gkorland and others added 30 commits March 14, 2026 19:55

Merge pull request #618 from FalkorDB/staging

4e3c11c

staging-->main

bench: add --limit flag for quick single-instance runs

03c7a73

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-code-quality Bot found potential problems May 28, 2026

Comment thread bench/runners/mini_runner.py Fixed

github-code-quality Bot found potential problems May 28, 2026

Comment thread bench/agents/code_graph_mcp_adapter.py Fixed

DvirDukhan and others added 3 commits May 29, 2026 00:44

github-code-quality Bot found potential problems May 29, 2026

Comment thread bench/agents/code_graph_mcp_adapter.py Fixed

This was referenced May 29, 2026

feat(api): /api/v2/* MCP-parity endpoints #695

Closed

bench(parity): cg HTTP and cg-mcp share the same 8-verb surface #696

Closed

DvirDukhan force-pushed the dvirdukhan/bench-combined branch from 1d5eba8 to 5db867a Compare June 8, 2026 22:34

galshubeli marked this pull request as ready for review June 9, 2026 08:28

DvirDukhan requested a review from Copilot June 9, 2026 08:28

Copilot started reviewing on behalf of DvirDukhan June 9, 2026 08:28

Copilot AI reviewed Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

github-code-quality Bot found potential problems Jun 9, 2026

Comment thread bench/agents/code_graph_mcp_adapter.py Fixed

Comment thread bench/runners/mini_runner.py Fixed

coderabbitai Bot reviewed Jun 9, 2026

galshubeli reviewed Jun 9, 2026

github-code-quality Bot found potential problems Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

DvirDukhan changed the title ~~bench: 4-config SWE-bench harness (baseline/lsp/code_graph/code_graph_mcp)~~ bench: 3-config SWE-bench harness (baseline/lsp/code_graph_mcp) Jun 9, 2026

		`~/.copilot/session-state/<id>/plan.md` for the full design doc and the
		deferred pre-requisites.

		Scaffold only. Directory layout and contracts exist; runners do not.
		Next steps are tracked in the session's todo list.


5 participants

DvirDukhan commented May 28, 2026 (edited)