Add mini-swe-agent harness via ACP shim by bingran-you · Pull Request #576 · benchflow-ai/benchflow

bingran-you · 2026-05-28T20:27:24Z

Summary

Integrates mini-swe-agent as a first-class benchflow agent (mini-swe, aliases mini / minisweagent / mini-swe-agent), following the same integration contract as the other supported agents.

mini-swe is a deliberately minimal, single-bash-tool harness for apples-to-apples model comparison. A new in-process ACP shim runs its DefaultAgent loop and loads its bundled mini.yaml verbatim (minus the interactive mode key), so the upstream guardrails are reproduced faithfully: single bash tool, shared system/instance templates, >10k output truncation, malformed-tool-call retry.

Files

src/benchflow/agents/mini_swe_acp_shim.py (new) — import-safe ACP shim (no top-level side effects; stdout isolation + minisweagent import happen in main()).
src/benchflow/agents/registry.py — one additive AgentConfig + aliases; installs into an isolated /opt/benchflow/mini-swe-venv.
tests/integration/configs/mini-swe.yaml + tests/integration/run.sh — agent integration matrix.
tests/conformance/run_conformance.py — mini-swe smoke model + env keys.
tests/test_mini_swe_routing.py, tests/test_mini_swe_submit.py — routing + submit-lifecycle tests.
src/benchflow/sandbox/docker.py — drive-by fix of a pre-existing ruff UP041 error (added in openhands install + docker concurrency: 4 fixes to make --concurrency 60 viable #575) that failed ruff check src tests for every PR off main. Behavior-preserving.

Provider wiring

The shim reads BENCHFLOW_PROVIDER_* directly (like openclaw/pi/opencode/harvey-lab) — no env.py changes; the usage proxy is honored via the injected litellm api_base, so token usage is captured the same way as other agents.

_litellm_prefix reconstructs the litellm provider prefix from BENCHFLOW_PROVIDER_PROTOCOL. mini-swe drives litellm.completion (chat-completions / anthropic-messages, not the OpenAI Responses API); openai-responses only comes from aws-bedrock, whose proxy also exposes an anthropic-messages surface, so Anthropic models route there. This makes Azure (openai-completions) and Bedrock (Claude via the proxy's /v1/messages) both work.
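The routing policy described above can be sketched roughly as follows. This is a minimal illustration, not the shim's actual code: `sketch_litellm_prefix` and `sketch_litellm_model` are hypothetical names, the "claude" substring check is an assumed way of detecting Anthropic models, and returning an empty prefix for unrecognized input is an assumption about the fallback.

```python
def sketch_litellm_prefix(protocol: str, model: str) -> str:
    """Hypothetical stand-in for the shim's _litellm_prefix policy."""
    if protocol == "anthropic-messages":
        return "anthropic"
    if protocol == "openai-completions":
        return "openai"
    if protocol == "openai-responses" and "claude" in model.lower():
        # litellm.completion does not speak the Responses API, so Claude
        # models go to the proxy's anthropic-messages surface instead.
        return "anthropic"
    # Assumption: unknown or empty protocols get no prefix at all.
    return ""


def sketch_litellm_model(protocol: str, model: str) -> str:
    """Build a 'prefix/model' string, or the bare model when no prefix applies."""
    prefix = sketch_litellm_prefix(protocol, model)
    return f"{prefix}/{model}" if prefix else model
```

Under these assumptions, `sketch_litellm_model("openai-completions", "gpt-5.5")` yields `openai/gpt-5.5`, and a Bedrock Claude model arriving with `openai-responses` is routed with the `anthropic` prefix.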

Review follow-ups (thermo-nuclear, two rounds)

Per-action ACP lifecycle: execute_actions drives the env loop itself — each action emits start→result around its own env.execute; the submit action is closed with the submission; actions that never run (anything after submit in a multi-tool-call turn) emit nothing rather than being falsely marked completed. Fixes both the original dangling-submit bug and the multi-action pollution case.
Infra-error classification: unexpected exceptions in session/prompt return a JSON-RPC error (not a successful end_turn), so auth/provider/protocol/runtime failures are classified as agent/infra errors instead of masquerading as task failures (matches openclaw). The agent's own task failures still return normally with an exit_status.
Import-safe shim: stdout redirect + minisweagent import live in main()/a factory, so the routing policy is importable/unit-testable without the sandbox runtime and importing the module never clobbers stdout.
Parity coverage: mini-swe added to the integration matrix (configs/ + run.sh) and the conformance smoke map.
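The per-action lifecycle from the first follow-up can be illustrated with a small self-contained sketch. Everything here is a stand-in: `Submitted`, `FakeEnv`, the event dictionaries, and the `emit` callback are hypothetical and only mimic the shape of the behaviour described (start, then result, around each `env.execute`; submit closed with the submission; later actions emit nothing).

```python
class Submitted(Exception):
    """Hypothetical stand-in for the exception raised when the agent submits."""

    def __init__(self, submission: str):
        super().__init__(submission)
        self.submission = submission


class FakeEnv:
    """Toy environment: any command starting with 'submit' ends the run."""

    def execute(self, command: str) -> str:
        if command.startswith("submit"):
            raise Submitted("final answer")
        return f"ran: {command}"


def execute_actions(env, actions, emit):
    """Run actions one by one, emitting a start and a result event for each."""
    for index, command in enumerate(actions):
        call_id = f"call-{index}"
        emit({"id": call_id, "status": "in_progress", "command": command})
        try:
            output = env.execute(command)
        except Submitted as done:
            # Close the submit action with the submission, then stop the loop.
            emit({"id": call_id, "status": "completed", "output": done.submission})
            raise
        emit({"id": call_id, "status": "completed", "output": output})


events = []
try:
    execute_actions(FakeEnv(), ["ls", "submit", "echo never runs"], events.append)
except Submitted:
    pass  # the caller treats submission as the normal end of the run
```

In this toy run the third action is never started, so no event with id `call-2` appears, and every emitted call ends in `completed`.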

Test plan

ruff check src tests, ruff format --check src tests, ty check — clean
Full unit suite: 2504 passed, 3 skipped, 0 failed
test_mini_swe_routing.py (Azure/Bedrock/anthropic/empty routing); test_mini_swe_submit.py (multi-action submit lifecycle — executed action → real output completed, submit → submission completed, post-submit action → not emitted)
Real e2e on SkillsBench weighted-gdp-calc, Azure azure-foundry-openai/gpt-5.5: healthy trajectory, 11 tool calls all completed (0 dangling), agent iterates on outputs, usage extracted (provider_response)
Real e2e on SkillsBench weighted-gdp-calc, Bedrock aws-bedrock/us.anthropic.claude-opus-4-7: healthy trajectory, 9 tool calls all completed (0 dangling), usage extracted
Real e2e on SkillsBench pdf-excel-diff: reward 1.0 on both Azure gpt-5.5 and Bedrock opus-4.7
Regression — existing claude-agent-acp through the same pipeline unaffected

Note: reward 0.0 on weighted-gdp-calc reflects model/task difficulty, not the integration — trajectory, ACP lifecycle, and token-usage extraction are all healthy.

Integrates SWE-agent's mini-swe-agent as a benchflow agent. A new in-process ACP shim runs mini-swe's DefaultAgent loop and loads its bundled mini.yaml verbatim (minus the interactive `mode` key), so the upstream guardrails are reproduced faithfully: single bash tool, shared system instructions, >10k output truncation, and malformed-tool-call retry. The shim reads BENCHFLOW_PROVIDER_* directly (like openclaw/pi/opencode), so no env.py wiring is needed; the usage proxy is honored automatically via the injected litellm api_base. Registered as `mini-swe` with aliases mini / minisweagent / mini-swe-agent.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

litellm.completion (what mini-swe drives) speaks chat-completions and anthropic-messages but NOT the OpenAI Responses API. Replace the flat protocol->prefix dict with a policy helper: anthropic-messages -> anthropic, openai-completions -> openai, and openai-responses (only ever aws-bedrock, whose proxy also exposes /v1/messages) -> anthropic for Claude models. Verified end-to-end: Azure gpt-5.5 and Bedrock us.anthropic.claude-opus-4-7 both solve hello-world and the skillsbench pdf-excel-diff task (reward 1.0, token usage extracted via the usage proxy).

`except (TimeoutError, asyncio.TimeoutError)` in docker teardown (added in benchflow-ai#575) is redundant — asyncio.TimeoutError is an alias of builtin TimeoutError on Python 3.11+. ruff UP041 flags it, failing `ruff check src tests` for every PR off main. Collapse to `except TimeoutError`. Behavior-preserving.
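The aliasing behind this fix can be checked directly. The snippet below is a small standalone illustration, not code from docker.py; `timeout_is_builtin_alias` and `wait_briefly` are hypothetical helper names.

```python
import asyncio
import sys


def timeout_is_builtin_alias() -> bool:
    """True when asyncio.TimeoutError is the builtin TimeoutError (Python 3.11+)."""
    return asyncio.TimeoutError is TimeoutError


async def wait_briefly() -> str:
    """Catch an asyncio timeout using only the builtin exception name."""
    try:
        await asyncio.wait_for(asyncio.sleep(1), timeout=0.01)
    except TimeoutError:
        return "timed out"
    return "finished"


if sys.version_info >= (3, 11):
    # On 3.11+ the two names refer to the same class, so one except clause suffices.
    assert timeout_is_builtin_alias()
```

On interpreters older than 3.11 the two classes differ, so the collapsed `except TimeoutError` is only equivalent when the project targets 3.11 or newer.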

Review follow-ups for the mini-swe ACP shim:

- Fix dangling submit tool_call: the `echo COMPLETE_TASK...` command makes env.execute raise Submitted before the parent emits observations, leaving its ACP tool_call stuck in_progress. _ACPAgent.execute_actions now catches Submitted, emits a completed tool_call_update, then re-raises. Verified on Azure gpt-5.5 and Bedrock opus-4.7: every tool call ends `completed`.
- Make the shim import-safe: stdout redirection and the (banner-printing) minisweagent import move into main()/a factory, so the pure routing policy is importable and unit-testable without the sandbox runtime.
- Integration parity: add tests/integration/configs/mini-swe.yaml and register mini-swe in run.sh ALL_AGENTS, matching the other 8 supported agents.
- Add tests/test_mini_swe_routing.py covering Azure/Bedrock/anthropic/empty protocol routing in _litellm_prefix.
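The import-safety idea can be sketched as a module layout in which all side effects are deferred. This is an assumed shape, not the shim's code: `canonical_agent_name`, `build_agent`, and `main` are hypothetical, the `json` import is only a placeholder for the heavy minisweagent import, and pointing `sys.stdout` at `sys.stderr` is one assumed way to keep stray prints off the protocol channel.

```python
import sys

ALIASES = {"mini": "mini-swe", "minisweagent": "mini-swe", "mini-swe-agent": "mini-swe"}


def canonical_agent_name(name: str) -> str:
    """Pure helper: safe to import and unit-test with no runtime present."""
    return ALIASES.get(name, name)


def build_agent():
    """Factory: the heavy import happens here, not at module import time."""
    import json as heavy_dependency  # placeholder for the real agent import

    return heavy_dependency


def main() -> int:
    """Entry point: the only place where stdout is touched."""
    protocol_out = sys.stdout
    sys.stdout = sys.stderr  # assumption: stray prints are diverted to stderr
    try:
        build_agent()
        protocol_out.write('{"ready": true}\n')
        protocol_out.flush()
    finally:
        sys.stdout = protocol_out  # restore so importing callers are unaffected
    return 0
```

Importing this module runs nothing beyond the definitions, so a test can call `canonical_agent_name("mini")` without the sandbox runtime; a real entry point would invoke `main()` under the usual `if __name__ == "__main__":` guard.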

…rmance map

Second-round review follow-ups:

- Model the ACP tool-call lifecycle per action. execute_actions now drives the env loop itself (mirroring DefaultAgent) instead of delegating then patching up after Submitted. Each action emits start→result around its own env.execute; the submit action is closed with the submission; actions that never run (e.g. anything after submit in a multi-tool-call turn) emit nothing instead of being falsely marked completed. tool_call start moves out of query() into the execution loop.
- Return a JSON-RPC error (not a successful end_turn) for unexpected exceptions in session/prompt, so BenchFlow classifies auth/provider/protocol/runtime failures as agent/infra errors rather than masking them as task failures (matches the openclaw shim; the agent's own task failures still return normally with an exit_status).
- Add mini-swe to tests/conformance/run_conformance.py AGENT_MODELS + ENV_KEYS (gemini smoke model + keys) so the conformance run uses the right model and credential check instead of the unknown-agent fallback.
- Add tests/test_mini_swe_submit.py (gated on minisweagent, like the docker smoke test) proving the multi-action submit lifecycle.

Verified: full unit suite green; Azure gpt-5.5 and Bedrock opus-4.7 e2e on SkillsBench weighted-gdp-calc: every tool call completed, 0 dangling, agent iterates on outputs, usage extracted via provider_response.
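The error-classification rule in the second bullet can be sketched as a response builder. This is an illustrative sketch under stated assumptions: `handle_prompt` and `run_agent` are hypothetical names, placing `exit_status` directly in the result is an assumption about the payload shape, and -32603 is the generic JSON-RPC 2.0 "Internal error" code rather than a code confirmed by the PR.

```python
def handle_prompt(request_id, run_agent):
    """Build a JSON-RPC 2.0 response for a session/prompt request.

    run_agent is a hypothetical callable that returns the agent's exit status
    on a normal finish (including the agent's own task failures) and raises
    on unexpected infra problems such as auth or provider errors.
    """
    try:
        exit_status = run_agent()
    except Exception as exc:
        # Unexpected failure: report an error object instead of a result, so
        # the caller can classify it as an agent/infra error.
        return {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": -32603, "message": f"{type(exc).__name__}: {exc}"},
        }
    # Normal finish: a successful end_turn carrying the agent's exit status.
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {"stopReason": "end_turn", "exit_status": exit_status},
    }
```

A response never carries both `result` and `error`, which is what lets the harness distinguish a task that ended badly from a run that never worked.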

bingran-you added 7 commits May 24, 2026 13:11

Merge branch 'benchflow-ai:main' into main

c4dffff

Merge branch 'benchflow-ai:main' into main

6621b9a

Merge branch 'benchflow-ai:main' into main

5a20884

Merge branch 'benchflow-ai:main' into main

13a5746

Merge branch 'benchflow-ai:main' into main

be68a0b

Merge branch 'benchflow-ai:main' into main

961085e

devin-ai-integration Bot reviewed May 28, 2026

bingran-you added 5 commits May 28, 2026 13:44

Apply ruff format to mini-swe shim

53a64d8

bingran-you wants to merge 12 commits into benchflow-ai:main from bingran-you:bry/blissful-allen-516e63