bench: 4-config SWE-bench harness (baseline/lsp/code_graph/code_graph_mcp)#693
Draft
DvirDukhan wants to merge 63 commits into
Draft
bench: 4-config SWE-bench harness (baseline/lsp/code_graph/code_graph_mcp)#693DvirDukhan wants to merge 63 commits into
DvirDukhan wants to merge 63 commits into
Conversation
staging-->main
… entry point Add the bare MCP server module (api/mcp/) using the official FastMCP SDK, wire the cgraph-mcp console script in pyproject.toml, and include a protocol smoke test that spawns the server over stdio and verifies list_tools returns an empty tool set. Also copies the MCP design docs into docs/. Closes #648 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix stale entry point references in design doc: api.mcp.server:app → :main - Remove contradicting decisions about tree-sitter/incremental indexing scope - Add language tags to fenced code blocks (MD040) - Add anyio.fail_after timeout to stdio smoke test to prevent CI hangs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- server: pass transport="stdio" explicitly to guard against future FastMCP default changes - test: drop STDIO_TIMEOUT to 10s (a stuck handshake should fail fast) - test: pin anyio backend to asyncio via fixture so transitive trio installs cannot silently double-run the test - pyproject: add anyio to test extras since the smoke test imports it directly (was previously available only via mcp's transitives) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without this, uv sync detects pyproject/lockfile drift on CI and silently re-resolves the entire dep tree to newer versions (uvicorn 0.41.0 → 0.46.0 was observed), which broke the e2e playwright suite. Lock now matches pyproject so installs are reproducible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This reverts commit 0c7e3db.
The falkordb/falkordb:latest base image is now Debian Trixie-based and arrives with apt in a state where the t64 ABI deps that git and build-essential require (libcurl3t64-gnutls, libtinfo6, libc6-dev, etc.) are held back. apt itself recommends `apt --fix-broken install`. Running `apt-get install -y -f` between update and the real install clears the broken state so the install can proceed. Verified locally against the exact base image digest CI uses (sha256:aaf67c724bba36b9fb8d43a2671fd57e89c536b971d72b692a63a168c8053ff4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GraphRAG-SDK released v1.0 (April 16) and force-pushed history during the release, dropping the pre-v1.0 API surface that the e2e tests were built against. Cloning HEAD now produces a graph without the merge_with/combine/import_data/add_node/add_edge/ask Function nodes the tests interact with. Switch to analyzing the installed graphrag-sdk package (pinned to 0.8.2 via uv.lock — immutable on PyPI). flask clone stays for autocomplete variety on set/lo/as substrings. ensure_calls_edges keeps acting as a safety net for the two required CALLS edges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups to address the remaining 7 of 31 e2e failures: 1. Copy installed graphrag-sdk to a tempdir before analyzing. When the source path lives under .venv/lib/.../site-packages/, LSP treats it as an installed library and stops resolving call sites between functions (analyzer produced 0 CALLS edges vs 392 on the April 12 baseline). Copying to /tmp lets LSP treat it as a project and restores organic call-graph extraction. 2. Synthesize missing Function nodes in ensure_calls_edges. import_data has no `def` in any graphrag-sdk version (was a phantom from LSP resolution into a transitive dep). MERGE both source and dest Function nodes with minimal properties so the e2e path tests can find them. Adds the Searchable label so autocomplete works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the last 3 of the original 31 e2e failures. 1. Pass url= to Project() so save_repo_info populates Redis. The /api/repo_info endpoint returns 400 if repo_info is None, which broke canvas:167 with TypeError on response.info.node_count. 2. Synthesize test_<module> Function nodes for the search-bar tests. testData.ts parametrizes over searchInput "test", but graphrag-sdk 0.8.2 has zero functions whose names contain "test", so the auto-scroll dropdown isn't scrollable and the auto-complete count is 0. 12 synthesized names give the dropdown enough to scroll. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scaffold for the code-graph vs LSP vs baseline benchmark. No runners yet — just the directory layout, locked-in tool bundles per config, default run config, and the glossary in CONTEXT.md. Both originally-planned pre-reqs (graphrag-sdk 0.8 -> 1.1.1 upgrade, MCP-T15 tree-sitter base class refactor) are deferred as non-blockers for this workstream; rationale in the session plan. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updates from the round-2 grill: - Outcome accuracy only; drop intrinsic suite (Q1) - code-graph tools = primitives only; no GraphRAG chat (Q2) - Tools in-container; single-file re-index on edit via note_edit (Q3) - Token cost and indexing cost reported separately, never combined (Q4) - LSP responses shimmed (cap 50, trim hover); spec in shim.yaml (Q5) - Pass@1 + retry failures 2x (Q6) - Symmetric one-paragraph preambles per config (Q7) - Drop RepoBench (Q8) - Drop opencode qualitative track (Q9) - Three-stage rollout: smoke / calibration / headline (Q10) - 50-task random sample from SWE-bench Verified, seed committed (Q11) graphrag-sdk upgrade kept in scope per explicit user override. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The v1 SDK is a ground-up rewrite around document ingestion: the v0 KnowledgeGraph class (which we wrapped around an already-populated FalkorDB graph for /api/chat text-to-Cypher) is gone, and the new GraphRAG facade expects to own the graph via its ingestion pipeline with embeddings. There is no public primitive for 'wrap an existing graph and chat over it'. code-graph builds graphs through dedicated language analyzers, not ingestion, so we now keep the text-to-Cypher pipeline in-house in api/llm.py: generate Cypher from question + ontology, execute via the existing FalkorDB async client, synthesize an answer. We still use graphrag-sdk's LiteLLM provider as a thin LiteLLM wrapper to keep retry logic. Ontology is now a plain string in the prompt instead of the old Ontology/Entity/Relation object tree (which is also gone in v1). The /api/chat endpoint surface (ask(repo_name, question) -> str) is unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/metrics/ parses SWE-agent trajectory JSON into per-task TaskMetrics rows: input/output tokens, tool-call counts (with per-tool breakdown), patch, outcome. Defensive about trajectory-shape drift between SWE-agent versions (history vs trajectory vs steps; openai-style tool_calls vs SWE-agent action.command). bench/report/ aggregates those rows into a per-config table with median + p90 tokens and Δ-vs-baseline. The summary picks the best run per task (resolved > failed) so retries don't double-count. 10 unit tests cover token extraction, both tool-call shapes, the retry-merge rule, and the markdown delta column. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/agents/code_graph_adapter.py exposes the seven tools the code-graph SWE-agent config gets: - graph_entities, get_neighbors, find_paths, auto_complete: thin wrappers over the existing FastAPI surface. - find_symbol: exact-name lookup, built client-side on top of auto_complete so we don't grow the server surface. - note_edit: incremental re-index hook the agent must call after every write_file/edit. Currently routes through analyze_folder on the dirname; degrades gracefully if the call fails. Crucially, GraphRAG is NOT exposed (Q2 grill decision: nested-agent double-counting). Both class-style (CodeGraphClient context manager) and function-style (graph_entities(...) etc.) are provided — the function form is what SWE-agent's tool registry needs. 9 unit tests using httpx.MockTransport cover all seven methods, the bearer-token auth header, 4xx propagation, and note_edit's non-fatal failure path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/runners/index_cache.py tracks which <repo>@<commit> pairs code-graph has already analyzed, so re-running the benchmark doesn't pay the indexing cost twice. Backed by a single JSON file under bench/cache/. Atomic via tmp-file replace. This module doesn't run analysis itself — that's done via code-graph's existing /api/analyze_folder endpoint. This is just the bookkeeping the runner consults before deciding to re-index. 6 unit tests cover record/lookup, cross-instance persistence, forget, and overwrite semantics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/agents/lsp_adapter.py wraps multilspy's SyncLanguageServer
behind the same response shim spec'd in bench/tools/lsp/shim.yaml:
cap results at 50, trim hover to 1 signature line + 1 docstring
sentence, locations as {path, line, col}. Tools exposed:
goto_definition, find_references, hover, document_symbols
Notes on the LSP backend choice:
- The plan originally specified pyright; multilspy >= 0.0.15 is
required for that, but the pinned multilspy fork
(AviAvni/multilspy@python-init-params, used by api/analyzers)
is older. Using jedi-language-server matches the rest of the
repo and avoids a divergent dep tree. Shim normalizes responses
so jedi-vs-pyright doesn't affect the validity comparison.
- workspace_symbols is dropped: the multilspy fork doesn't
implement request_workspace_symbol. Agent falls back to
bash+grep, which is the realistic LSP-world fallback too.
- MultilspyConfig must be built via from_dict for this fork
(constructor doesn't set all fields JediServer expects).
Register pytest 'slow' marker in pyproject.toml; the 3 jedi
roundtrip tests are slow but currently complete in <4s on a warm
cache. Run them with -m slow or default; skip with -m 'not slow'.
CONTEXT.md and bench/tools/lsp/tools.yaml updated to match.
10 tests pass: 7 shim units + 3 real jedi roundtrips
(goto_definition, hover, document_symbols).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pivots the harness from SWE-agent to mini-swe-agent — upstream now
recommends mini-, and its bash-only tool surface is a simpler
integration: each config is a PATH prefix plus a system_preamble.md,
not a per-config tools.yaml.
What this adds:
- bench/runners/mini_runner.py — wraps DefaultAgent + LocalEnvironment,
per-config env wiring (PATH for lsp/code_graph, baseline untouched),
trajectory + diff capture, JSONL append via bench.metrics.
Includes a stub LLM model that exercises the entire loop without
any network calls so the harness is testable today.
- bench/cli/cg.py, bench/cli/lsp.py — bash-callable CLIs wrapping the
existing CodeGraphClient and LSP adapter. These are what the agent
invokes via bash.
- bench/tools/{baseline,lsp,code_graph}/system_preamble.md — symmetric
one-page preambles per the locked-in grill decision.
- bench/metrics — extended to also parse mini-swe-agent trajectory
shape (messages[*].extra.response.usage and extra.actions[*].command).
Buckets bash commands by first token; the COMPLETE_TASK submit
protocol is bucketed as 'submit'.
- tests/test_bench_runner.py — 10 tests, all run offline (no LLM):
smoke, env wiring, persistence, CLI argparse smoke.
- CONTEXT.md + plan.md — reflect mini-swe-agent + jedi pivots.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds --real-run as a mutually exclusive sibling of --dry-run. Real-run prepares a fresh repo per config (no cross-contamination), runs the agent against a synthetic buggy math_utils.py + pytest, then runs pytest to set metrics.outcome to resolved/failed. JSONL append in run_batch can now be deferred via defer_jsonl=True so the smoke loop can write the row once outcome is known. Validated end-to-end against GitHub Models (gpt-4o-mini) using GITHUB_API_KEY=$(gh auth token). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Loads princeton-nlp/SWE-bench_Verified via 'datasets', samples deterministically by seed (20260526) into smoke/calibration/headline stages (3/10/37), and prepares per-instance worktrees by cloning the upstream repo, checking out base_commit, and applying test_patch so FAIL_TO_PASS tests are present. Adds 'datasets' to the bench optional dep group. Adds 'swe_bench' mode to mini_runner alongside dry_run / real_run (mutually exclusive). Verification uses pytest with the FAIL_TO_PASS + PASS_TO_PASS test ids from the dataset row -- best effort because the official harness needs per-repo conda envs, which we don't build yet. 6 new unit tests cover the non-network parts of the loader (field parsing, sampling determinism, n override, pool clamping, path hygiene, task mapping). Worktree prep was validated end-to-end against pytest-dev/pytest-6202. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bench/report/__main__.py: `uv run python -m bench.report` renders results.jsonl as a per-config summary table with token-delta vs baseline. Validated against the existing real-run smoke results. bench/runners/swebench_verify.py: exports per-config predictions JSONL files in the SWE-bench harness format, optionally invokes `python -m swebench.harness.run_evaluation` (Docker-based), then parses the resulting report.json and patches outcomes back into results.jsonl. 4 new unit tests cover the non-Docker parts. Adds `swebench>=4.0` to the bench optional dep group. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
mini_runner.main() now calls dotenv.load_dotenv(.env) at the repo root if present, so users don't have to export ANTHROPIC_API_KEY / ANTHROPIC_API_BASE / GITHUB_API_KEY by hand each shell session. .env.template gains a documented block for the four supported provider configs we've actually tested or have credentials for: direct Anthropic, Azure AI Foundry's Anthropic-passthrough endpoint (/anthropic/v1/messages, x-api-key), GitHub Models, and Azure OpenAI. Most relevant for our setup: Azure AI Foundry → litellm's anthropic/ provider with a custom ANTHROPIC_API_BASE. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Smoke run showed the agent invoked cg exactly once and lsp zero times
across all three SWE-bench instances — because the bash shims didn't
exist (the agent's `which cg` returned 'cg not found'). The differential
between configs was therefore noise.
Fixes:
- Add executable bash shims bench/cli/{cg,lsp} that exec
"$BENCH_PYTHON" -m bench.cli.{cg,lsp}. Runner exports BENCH_PYTHON =
sys.executable so the venv (with httpx/multilspy) is used.
- Export REPO_NAME for the code_graph config (worktree dirname). The
preamble references it; nothing was setting it.
- _ensure_indexed(): POST /api/analyze_folder for each code_graph
worktree before running the task, so cg find-symbol returns real
results. Skips re-indexing via /api/list_repos precheck.
- Rewrite system preambles to instruct "use cg/lsp BEFORE grep" with
an explicit typical-loop, not just a list of subcommands.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Smoke #2 confirmed that even with cg/lsp shims on PATH, indexed repos, REPO_NAME set, and explicit "use cg/lsp first" framing in the system preamble, Claude Opus 4.5 ignored the differentiating tools and fell straight back to grep/sed/cat. The 3-way comparison was real but uninformative: tool choice was identical across configs. This commit adds two new instance templates (INSTANCE_TEMPLATE_LSP and INSTANCE_TEMPLATE_CODE_GRAPH) that embed a 'Required workflow.' block directly in the task description — the first thing the model sees each turn. Selection via load_instance_template(config); baseline keeps the original template. Smoke #3 result: lsp track now invokes 'lsp' 3x, code_graph track invokes 'cg' 5x (including cg auto-complete returning the exact buggy function with line numbers + docstring). The structured-navigation tools are finally exercised, so token deltas measured against baseline are now meaningful signal rather than noise. n=1 finding: both lsp (+128%) and code_graph (+85%) use MORE tokens than baseline on this instance. Bigger preambles + verbose JSON tool replies + occasional retries (cg find-symbol exact-match bug) outweigh any savings. Headline run should scale n or pivot to a function-calling harness. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Smoke #3 revealed cg find-symbol --name <exact> returned [] for symbols the graph clearly contained (cg auto-complete --prefix found the same symbol with full file:line+docstring). Root cause: the filter compared item['name'] to the requested name, but the /api/auto_complete payload nests the symbol name under item['properties']['name'] (FalkorDB node properties), so the top-level lookup always returned None and nothing matched. Fix: prefer item['properties']['name'], fall back to item['name'] for flatter shapes the unit tests pass in. Added a regression test that uses the real payload structure. Verified end-to-end against the live FastAPI service: cg find-symbol --repo pytest-dev__pytest-6202__code_graph \ --name getmodpath # -> [{id:2714, labels:[Function], properties:{name,path,doc,...}}] This was the bug that made the smoke #3 code_graph agent burn 3 of 5 cg calls retrying exact-name lookups before falling back to auto-complete. With this fix, an agent doing the natural workflow (find-symbol -> get-neighbors -> note-edit) should land far fewer wasted calls. Also: norecursedirs in [tool.pytest.ini_options] to keep pytest from walking into per-instance bench worktrees that ship their own pytest sources (was breaking host pytest's AST rewriter on import). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refactor FalkorDB graph naming so each (project, branch) pair gets
its own graph: 'code:{project}:{branch}'. This lets concurrent agents
working on different branches of the same repo index in parallel
without overwriting each other.
Changes:
- api/graph.py: add DEFAULT_BRANCH, compose_graph_name(),
parse_graph_name(); Graph and AsyncGraphQuery constructors now
accept (name, branch=None); Graph.from_raw_name() classmethod for
internal callers that need to bypass composition (e.g. clone());
get_repos()/async_get_repos() now return {project, branch, graph}
dicts.
- api/info.py: branch-aware Redis hash keys
('{repo}:{branch}_info'); reads fall back to legacy '{repo}_info'
for un-migrated graphs.
- api/git_utils: GitRepoName() and switch_commit() thread branch
through; LegacyGitRepoName() retained for the migration helper.
- api/project.py: detect_branch() via 'git rev-parse --abbrev-ref
HEAD'; Project.__init__ / from_git_repository /
from_local_repository accept branch.
- api/index.py: all Pydantic request models gain
'branch: Optional[str]'; endpoints thread it into
AsyncGraphQuery + info functions; responses include 'branch'.
- api/cli.py: --branch flag on index / index-repo / search /
neighbors / paths / info; new 'cgraph migrate' command.
- api/migrations/per_branch.py (NEW): idempotent migration that
renames legacy '<project>' graphs to 'code:<project>:_default',
'{<project>}_info' Redis keys to '{<project>}:_default_info',
and '{<project>}_git' graphs to '{<project>}:_default_git'.
Supports --dry-run.
Tests:
- tests/test_per_branch_graphs.py (NEW): 24 unit tests covering
compose/parse helpers, Graph constructor branch awareness,
AsyncGraphQuery, info-key shape, GitRepoName shape, and migration
idempotency (with mocked FalkorDB).
- tests/test_async_graph.py, tests/test_cli.py,
tests/endpoints/test_list_repos.py: updated assertions for the
new dict return shape from get_repos / async_get_repos.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New `.github/workflows/mcp-tests.yml` runs `pytest tests/mcp/` against a real FalkorDB service container on port 6379. Triggers only on PRs that touch MCP-relevant paths so the unrelated parts of the repo don't pay the cost. - FalkorDB service with redis-cli ping healthcheck. - uv cache keyed on uv.lock for fast incremental runs. - Sets `FALKORDB_HOST` / `FALKORDB_PORT` env so api/graph.py picks up the service host. - Path filter covers api/mcp/, tests/mcp/, api/llm.py, api/graph.py, pyproject.toml, uv.lock, and the workflow file itself. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New `tests/mcp/fixtures/`:
- `sample_project/python/` — canonical call graph
`entrypoint -> service -> {UserRepo,OrderRepo}.repo -> db`
plus a small class hierarchy (BaseRepo <- UserRepo, OrderRepo)
and inter-file imports so IMPORTS edges exist.
- `expected.yaml` — single source of truth for every per-tool
ticket's integration assertions: minimum per-label counts, named
callers / callees, known paths, prefix-search hits.
New `tests/mcp/conftest.py`:
- `expected_contract` (pure-Python, always available) loads the
YAML once per session.
- `indexed_fixture` (session-scoped) indexes the fixture into a
unique `code:sample_project:test-<uuid>` graph so parallel CI
shards don't contend. Self-skips when FalkorDB is unreachable.
Uses `SourceAnalyzer.analyze_local_folder` directly so the
fixture doesn't need to be a git repository.
New `tests/mcp/test_fixture_contract.py` — regression-tests the
fixture itself: contract shape, on-disk files, and that the
integration fixture indexes cleanly and meets the minimum count
contract.
Multilingual coverage (Java + C#) was dropped from the spec: both
multilspy analyzers demand a Maven / .NET project layout at the
indexed root, which would force this fixture into an awkward shape.
Deferred to a follow-up ticket (likely T16 which adds languages).
All 4 contract tests pass against FalkorDB on 6390.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
First real MCP tool. Wraps the existing Project / SourceAnalyzer
pipeline so AI agents can call `index_repo(path_or_url, branch)` over
stdio to populate code-graph for a repo.
- `api/mcp/tools/structural.py` (NEW) — registers `index_repo` on
the shared FastMCP app. Accepts local paths or git URLs;
auto-detects branch from local git checkouts via T17's
`detect_branch`; honors `ALLOWED_ANALYSIS_DIR` for sandboxing.
Non-git folders are handled by driving SourceAnalyzer directly
(Project requires a git repo).
- `api/mcp/tools/__init__.py` (NEW) — package marker; importing it
registers every tool module's `@app.tool()` decorators.
- `api/mcp/server.py` — imports tools at module load so both direct
`from api.mcp.server import app` and `cgraph-mcp` stdio entry
point see the same tool list.
- `tests/mcp/test_index_repo.py` (NEW) — 5 tests: local-path happy
path, missing-path error, ALLOWED_ANALYSIS_DIR sandboxing,
in-process app registration, JSON serialisability.
- `tests/mcp/test_scaffold.py` — replaced the "zero tools"
assertion with a presence check for `index_repo` so it stays
stable as T5-T8 / T11 add more tools.
Return shape:
{project_name, branch, graph_name, num_nodes, num_edges,
languages_detected, mode}
`incremental` parameter is accepted now and forwarded once T18
lands; the current full-reindex path ignores it and always returns
`mode="full"`.
All 8 tests pass against FalkorDB on 6390.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In source_analyzer.second_pass, the list of files we iterate can include paths that first_pass did not add to self.files (e.g. parse errors, LSP-induced timeouts, or rare edge cases where a candidate file is present in the input list but never makes it into the files map). Previously this raised KeyError and aborted the entire index. Hit on sympy/polys/distributedmodules.py during bench calibration of sympy-12481. Skip with a WARN log instead so a single bad file no longer takes down the whole index. Also bump mini_runner httpx timeout 1800s -> 7200s; observed sympy-12481 index taking >30 min in the field, which previously left the API server indexing successfully but the runner gave up early. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace jedi-based resolution with a pure tree-sitter static resolver
behind CODE_GRAPH_PY_RESOLVER=tree_sitter. Default remains jedi for
backwards compatibility.
Benchmark on pytest-dev/pytest-6202 (204 files):
- jedi: 247.1s wall, CALLS=1976, EXTENDS=71
- tree-sitter: 6.9s wall, CALLS=4833, EXTENDS=83
~36x speedup, broader call recall (jedi returns None ~80% of the time).
Mechanism:
- TreeSitterPythonResolver builds a project-wide symbol table
(top-level funcs/classes/assigns, class methods, import maps)
keyed by id(files) for lazy construction.
- Resolution: head lookup (local module -> import map ->
cross-project bare-name fallback) + tail walk through attributes
and class methods.
- Handles relative imports, aliased imports, import-of-package,
Optional[T]/generic_type subscript unwrapping.
- AbstractAnalyzer.needs_lsp() hook + PythonAnalyzer override let
source_analyzer skip LSP startup and venv setup entirely when
the static resolver is active. This is where the wall-time win
actually lives (jedi warm-up was ~240s of the 247s baseline).
Closes #689.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
AbstractAnalyzer._captures was recompiling its query string on every call. cProfile on pytest-dev/pytest-6202 (204 files) showed tree_sitter.Language.query consuming 3.03s of the 6.36s first_pass — ~48% of analyzer time spent rebuilding queries that never change. Cache them on the analyzer instance, keyed by pattern string. Also switches from the deprecated language.query() to the Query(language, pattern) constructor. Wall-time on pytest-6202 (CODE_GRAPH_PY_RESOLVER=tree_sitter): before: 6.9s after: 3.7s Benefits every tree-sitter analyzer (Python, JavaScript, Kotlin), not just the new static resolver. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After T18 (#691) + query-cache (#692), code_graph indexing on pytest-6202 drops from 247s to 3.7s — but only if the API server is launched with CODE_GRAPH_PY_RESOLVER=tree_sitter. This helper bakes in that env plus the public/permissive flags the bench harness expects, so calibration runs hit the fast path without manual setup. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve conflicts: - source_analyzer: keep needs_lsp() gate from query-cache, keep venv fallback + first_pass-skipped-file defense from bench-mcp-track - analyzer.resolve: keep verbose error logging from bench-mcp-track - llm.py / uv.lock: take bench-mcp-track (graphrag 1.x rewrite)
After merging the bench harness (graphrag-sdk 1.1.1) with the MCP suite (written against 0.8 KnowledgeGraph), the server failed at import. Move the SDK import inside get_or_create_kg so only the 'ask' tool trips the incompatibility — structural tools used by the bench harness (index_repo, search_code, get_callers, ...) work either way.
… context Each cg-mcp bash invocation spawns a fresh cgraph-mcp server, whose DEBUG logs (analyzer init + MCP server.py registration + per-request dispatch) were being merged into the agent's tool-output buffer at ~1.8 kB per call. Across a 50-call trajectory that's ~90 kB of useless log noise replayed each turn, blowing token counts up to ~9x what the HTTP code_graph track produces. Route the spawned server's stderr to /dev/null via stdio_client's errlog kwarg. Verified end-to-end: pytest-6202 code_graph_mcp trajectory dropped from $6+ to $2.48. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
4 of 10 calibration instances (sympy/django) hit TimeoutError during indexing at the 60s default. The sympy graphs alone have 24k+ nodes and 145k+ edges, which legitimately exceeds 60s. 300s matches the HTTP code_graph adapter's behaviour for large repos and removes the indexing-timeout failure mode without slowing happy-path calls. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two safeguards against the 'silent fallback to bash' failure mode that
made our Sonnet calibration headline numbers untrustworthy:
1. verify_tool_available(): before launching the agent in any tool
config (lsp / code_graph / code_graph_mcp), exec the tool's --help
in the same env the agent will see. If it fails (missing PATH,
Python startup crash, etc.) the run aborts with outcome=
'tool_unavailable' instead of silently producing a bash-only
trajectory that we'd later attribute to the tool.
2. compute_tool_usage(): for every trajectory, count how many bash
commands actually invoked the configured tool (cg / cg-mcp / lsp).
Surfaced as tool_usage_rate on TaskMetrics and as a new column in
report.md. Sonnet calibration backfill revealed:
code_graph median rate 12% (8 of 10 ⚠️ )
code_graph_mcp median rate 10% (10 of 10 ⚠️ )
lsp median rate 27% (7 of 10 ⚠️ )
So the agent abandoned the tool after a few attempts and ran
80-90% of bash commands as plain grep/sed/cat — meaning the
'-30.5% MCP vs baseline' headline is mostly preamble effect,
not tool effect. Reframes the experiment substantially.
3. Backfilled tool_usage_rate on all 40 existing Sonnet trajectories
in mcp-t17/bench/cache/results.jsonl so future report renders
show the column.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two fixes addressing the 10-27% tool-usage rate observed in the Sonnet calibration: 1. cg / cg-mcp / lsp shims: redirect stdin from /dev/null on exec. mini-swe-agent's LocalEnvironment runs commands via subprocess.run(shell=True) without specifying stdin. When the runner is nohup-detached or run in a context with a closed FD 0, Python crashes at interpreter startup with init_sys_streams: Bad file descriptor before our argparse code runs. The Opus probe on pytest-6202 showed the first cg call crashing this way, after which the agent wrapped subsequent calls in '|| echo failed' and ran the rest of the trajectory on plain bash. Defense-in-depth only; harmless when FD 0 is already valid. 2. code_graph / code_graph_mcp / lsp preambles: add explicit rules forbidding silent fallback to grep/find. The agent must state tool failure before using a textual search alternative. This gives us a chance to (a) actually diagnose tool failures from trajectories instead of silently scoring bash trajectories as tool wins, and (b) raise tool-usage rates closer to a regime where the tool can plausibly affect outcomes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| import argparse | ||
| from pathlib import Path | ||
|
|
||
| from bench.report import aggregate_to_markdown, load_jsonl, render_markdown, summarize |
| # Register tools on import so both direct ``import api.mcp.server`` and the | ||
| # stdio entry point see the same tool list. Imported below ``app`` because | ||
| # the tool modules need a reference to it. | ||
| from . import tools # noqa: F401, E402 |
|
|
||
| from __future__ import annotations | ||
|
|
||
| from pathlib import Path |
|
|
||
| from __future__ import annotations | ||
|
|
||
| from pathlib import Path |
|
|
||
| from pathlib import Path | ||
|
|
||
| import pytest |
…ken pytest verifier The old verify_instance ran modern pytest 8 from the bench-combined venv against legacy SWE-bench worktrees. Old codebases like pytest-6202 use config keys (rsyncdirs) removed in modern pytest, producing `INTERNALERROR: Unknown config option` at collection time — 0 tests collected, returncode!=0, every trajectory graded 'failed' regardless of patch correctness. The 1-task Opus probe proved this: 3 of 4 configs produced the exact gold patch, all 4 graded 'failed' for the same config error. Replacement: verify_with_swebench_harness(inst, patch, ...) writes a predictions.jsonl in the format the official harness expects and calls swebench.harness.run_evaluation.main, which spins up per-instance Docker images with the correct Python + dependency set and runs the real FAIL_TO_PASS + PASS_TO_PASS selection. The agent's patch comes from trajectory.info.submission (already populated by the runner). When Docker is absent the result is reported as outcome= 'verifier_unavailable' rather than silently graded 'failed' — strictly more honest, and lets the report distinguish 'agent failed' from 'we don't know'. The old verify_instance is kept as a deprecation shim so any leftover caller fails loud. Also adds: - mini_runner --skip-verify / --verify-timeout flags - bench.cli.regrade for retroactively grading existing trajectories without re-running the agent (saves the tokens spent on the 40 Sonnet calibration once Docker is wired up) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines
+27
to
+32
| from bench.datasets.swe_bench import ( | ||
| SweBenchInstance, | ||
| load_instances, | ||
| verify_with_swebench_harness, | ||
| _docker_available, | ||
| ) |
| from __future__ import annotations | ||
|
|
||
| import json | ||
| import os |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Track the share of bash commands on each track that are plain text search (grep/rg/find/ack/ag) rather than the configured tool. Surfaced alongside tool_usage_rate so we can distinguish 'tool answered the question and the rest is normal bash' from 'agent silently abandoned the tool and reverted to grep'. Sonnet 4.5 n=10 headline now shows: - baseline 25% fallback - code_graph 10% (cut 60%) - code_graph_mcp 8% (cut 68%) - lsp 4% (cut 84%) Backfilled all 40 Sonnet trajectories and the 15 Opus trajectories currently on disk; harness writes the metric forward. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| ) | ||
| if result.returncode == 0 and result.stdout.strip(): | ||
| return result.stdout.strip() | ||
| except FileNotFoundError: |
Surfaces median wall-clock seconds per task per config plus delta vs baseline, alongside tokens. wall_clock_sec was already captured in TaskMetrics — just plumbed into report aggregation/rendering. Sonnet 4.5 n=10: - baseline 336s — - code_graph 269s -20.1% - code_graph_mcp 273s -18.7% - lsp 290s -13.7% Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
_ensure_indexed and _ensure_indexed_mcp now return elapsed seconds (0.0 on cache hit). Runner stashes the value on the metrics row as index_sec; report renders median per config. This separates 'how long does indexing the repo take' (one-time setup cost) from 'how long does the agent take to solve the task' (the existing median wall column). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The HTTP /api/list_repos response shape changed from [name, ...] to
[{project, branch, graph}, ...], so the old 'repo_name in repositories'
membership check silently returned False — every cg-track run
re-issued analyze_folder even when the graph existed. With a 7200s
timeout this masked server hangs for over an hour at a time.
New precheck:
- queries FalkorDB GRAPH.LIST directly (matches the MCP track path)
- matches both bare-name (legacy) and code:<name>:<branch> forms
- bounded read timeout at 1800s (was 7200s); surfaces server hangs
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tter resolver
When the API server is launched without CODE_GRAPH_PY_RESOLVER=tree_sitter
the PythonAnalyzer silently falls back to the jedi/multilspy path. On
real-world repos (sphinx-doc/sphinx-8035, sympy, …) that path calls
`python3 -m venv venv && pip install poetry && poetry install` per repo
then runs jedi over the full transitive dep tree; we observed it wedge
the server at 100% CPU + 3.5 GB RSS for 3+ hours with no progress.
bench/scripts/start-api.sh already exports CODE_GRAPH_PY_RESOLVER, but a
human-launched `uvicorn api.index:app …` won't pick it up and the bench
silently degrades to the slow path.
This commit makes the failure mode loud:
1. `GET /api/_health` returns {status, py_resolver, falkordb_host,
falkordb_port, public}. Cheap (no DB call), unauth'd.
2. `_ensure_indexed` in the mini_runner calls /api/_health before any
indexing and raises a clear RuntimeError when py_resolver !=
'tree_sitter', pointing the operator at bench/scripts/start-api.sh.
Verified: sphinx-doc__sphinx-8035 indexes in ~68s end-to-end with the
new server (vs hours unbounded before).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| env_path = REPO_ROOT / ".env" | ||
| if env_path.exists(): | ||
| load_dotenv(env_path) | ||
| except ImportError: |
… raises timeout The MCP adapter spawns a fresh cgraph-mcp stdio server per call. When the caller shell did not export CODE_GRAPH_PY_RESOLVER, the spawned server fell back to the legacy jedi/multilspy resolver, which runs 'python -m venv && pip install poetry && poetry install' per repo and then analyzes the full transitive dep tree. On full SWE-bench worktrees this wedges for >15 min — we observed it timing out indexing sympy__sympy-20154 and sympy__sympy-19040 during a fresh Opus calibration run. Mirror the start-api.sh policy: default CODE_GRAPH_PY_RESOLVER to tree_sitter in _env_for_mcp() so the MCP track is symmetric with HTTP regardless of caller env. Also bump the per-call timeout default 300s -> 900s in both the adapter (CGRAPH_MCP_TIMEOUT_SEC) and the cg-mcp CLI for headroom on cold MCP spawns over big repos. Validated: sympy-20154 (591 .py files, ~49k nodes, ~344k edges) indexes end-to-end via MCP in 220 s with the new default, vs >900 s timeout before. HTTP path on the same repo: 95 s; ~2.3x slower over the stdio spawn is expected and well within the new timeout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| # bash shim is invoked by the agent, the server's stderr is merged | ||
| # into the agent's tool-output buffer, inflating context by ~1.8kB | ||
| # per call. The agent only needs the JSON-RPC result on stdout. | ||
| devnull = open(os.devnull, "w") |
Opus n=10 calibration showed the code_graph track spending +14.5% input
tokens vs baseline, and code_graph_mcp +19.5% — driven by three things
that compound over a 70-80-turn trajectory:
1. /api/get_neighbors returns a verbose vars(node)/vars(edge) dump that
includes empty 'properties: {}' and empty 'alias: ""' on every edge,
plus per-node 'doc' blocks. Every byte we hand back is re-fed to the
LLM on every subsequent turn, so a single 20 KB neighbors call ends
up billed ~50x.
2. JSON was pretty-printed (indent=2). Whitespace is free for humans,
not for token counts.
3. System preambles were 2.8-3.6 KB of duplicated mini-swe-agent
submission boilerplate + repeated rules-of-thumb.
This is a bench-layer-only change (the React frontend and the core API
contract are untouched):
- bench/cli/cg.py:
- _compact_neighbors strips empty properties/alias and projects nodes
to {id, label, name, file, line}.
- _compact_symbols same for find-symbol/auto-complete.
- New --limit flag on get-neighbors (default 50; 0 = unlimited).
- JSON now emits with separators=(',', ':').
- bench/cli/cg_mcp.py: same compact JSON formatting.
- system_preamble.md (code_graph, code_graph_mcp): rewritten as ~1.1 KB
instead of 2.9-3.6 KB, keeping the workflow and sub-command listing
but dropping mini-swe-agent's own submission boilerplate.
Validated against the live indexed graphs:
- pytest-6202 cold neighbor call: 400 -> 148 bytes (63%)
- sympy-19040 hot neighbor (id=23432, 2039 outgoing edges):
raw: 747 KB -> compacted unlim: 252 KB (-66%)
-> compacted lim=50: 6.2 KB (-99.2%)
- sympy-19040 auto-complete prefix='solve': 11.4 KB -> 0.6 KB (-94.6%)
Expected effect on the next Opus run: cg* input-token cost drops below
baseline rather than above it, while the agent still gets the same
structural information. All 24 per-branch-graph tests still pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The compaction pass in dc8534e accidentally dropped the COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT submission instruction (which baseline + lsp still had). Without it Opus has no way to signal 'done' and just loops, re-emitting the final diff every turn. Visible in opus-smoke n=1 (pytest-6202) after dc8534e: config msgs bash_outs input_tok vs baseline baseline 60 28 243k — lsp 52 24 209k -14% code_graph 235 82 1,043k +329% <-- looped code_graph_mcp hung > 16min on Azure round-trip Restore the autonomous-agent framing intro + submission section in both preambles. Keep the trimmed workflow / sub-command list. Also add an explicit anti-loop rule ('do not call the same cg query twice for the same symbol'). Net: ~30% smaller than original (was 60% with the broken trim), and now has the must-have completion contract. Compaction of tool *output* (in cg.py / cg_mcp.py) is unchanged and still valid; this only touches the preambles. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
End-to-end benchmark harness for evaluating code-graph against baseline and LSP on SWE-bench Verified. Four configurations:
cgHTTP CLI against the FastAPI servicecg-mcpJSON-RPC stdio CLI againstcgraph-mcpIncludes resume support, per-instance timeouts, tree-sitter fast resolver (T15 + T18), MCP auto-init (T12-T14), a tool-usage rate metric to detect silent fallback to bash, and the official
swebench.harness.run_evaluationDocker-backed verifier with retroactive regrade CLI.Verified results (Sonnet 4.5, n=10, step-75, official SWE-bench Docker harness)
All resolves checked via the official harness (per-instance Docker images, real FAIL_TO_PASS + PASS_TO_PASS selection). Sympy-19040 is the only universally-hard task; only lsp solves it.
Token efficiency at a glance
Engineering hardening shipped in this branch
38d2411silence cgraph-mcp stderr (was bloating agent context 9×)bbb5d95bump default cgraph-mcp timeout 60s → 300s for sympy/djangoaa850d6tool-availability precheck + tool-usage rate metric (caught the silent-fallback regression that almost shipped)4a6956edefensive stdin redirect on cg/cg-mcp/lsp shims + anti-fallback preamble rules4daad7erewrite verifier to use the official swebench Docker harness (the previous one ran modern pytest 8 against legacy worktrees and graded every trajectoryfailed)bfdf60dgitignore harness outputOut of scope for this PR
Draft for review of the harness mechanics + early numbers. Not for merge until headline n=40 lands.