Add mini-swe-agent harness via ACP shim #576
Open
bingran-you wants to merge 12 commits into
Integrates SWE-agent's mini-swe-agent as a benchflow agent. A new in-process ACP shim runs mini-swe's DefaultAgent loop and loads its bundled mini.yaml verbatim (minus the interactive `mode` key), so the upstream guardrails are reproduced faithfully: single bash tool, shared system instructions, >10k output truncation, and malformed-tool-call retry. The shim reads BENCHFLOW_PROVIDER_* directly (like openclaw/pi/opencode), so no env.py wiring is needed; the usage proxy is honored automatically via the injected litellm api_base. Registered as `mini-swe` with aliases mini / minisweagent / mini-swe-agent.
litellm.completion (what mini-swe drives) speaks chat-completions and anthropic-messages but NOT the OpenAI Responses API. Replace the flat protocol->prefix dict with a policy helper: anthropic-messages -> anthropic, openai-completions -> openai, and openai-responses (only ever aws-bedrock, whose proxy also exposes /v1/messages) -> anthropic for Claude models. Verified end-to-end: Azure gpt-5.5 and Bedrock us.anthropic.claude-opus-4-7 both solve hello-world and the skillsbench pdf-excel-diff task (reward 1.0, token usage extracted via the usage proxy).
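The routing policy described in this commit can be sketched as a small pure function. This is an illustrative reconstruction, not the shim's actual `_litellm_prefix`: the name `litellm_prefix_sketch`, its signature, the substring check for Claude models, and the handling of empty or unknown protocols are all assumptions.

```python
def litellm_prefix_sketch(protocol: str, model: str) -> str:
    """Map a provider protocol to a litellm provider prefix (hypothetical helper)."""
    protocol = (protocol or "").strip().lower()
    if protocol == "anthropic-messages":
        return "anthropic"
    if protocol == "openai-completions":
        return "openai"
    if protocol == "openai-responses":
        # litellm.completion does not speak the Responses API, so Claude models
        # are routed to the proxy's anthropic-messages surface instead.
        if "claude" in model.lower():  # assumption: detect Claude by substring
            return "anthropic"
        # Assumption: no usable route exists for other models on this protocol.
        raise ValueError(f"no completion route for {model!r} over openai-responses")
    if not protocol:
        # Assumption: with no protocol set, return no prefix at all.
        return ""
    raise ValueError(f"unknown provider protocol: {protocol!r}")


def qualified_model(protocol: str, model: str) -> str:
    """Build a litellm-style 'provider/model' string from the prefix."""
    prefix = litellm_prefix_sketch(protocol, model)
    return f"{prefix}/{model}" if prefix else model
```

Keeping the policy free of I/O is what makes it unit-testable without the sandbox runtime; for example, `qualified_model("openai-completions", "gpt-5.5")` yields `openai/gpt-5.5`.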
`except (TimeoutError, asyncio.TimeoutError)` in docker teardown (added in benchflow-ai#575) is redundant — asyncio.TimeoutError is an alias of builtin TimeoutError on Python 3.11+. ruff UP041 flags it, failing `ruff check src tests` for every PR off main. Collapse to `except TimeoutError`. Behavior-preserving.
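A minimal sketch of why the collapsed clause is behavior-preserving on Python 3.11+. The coroutine names here are hypothetical stand-ins, not the actual docker teardown code.

```python
import asyncio

# True on Python 3.11+, where asyncio.TimeoutError is an alias of the builtin.
TIMEOUT_IS_ALIASED = asyncio.TimeoutError is TimeoutError


async def _slow_teardown() -> None:
    """Hypothetical stand-in for a teardown step that takes too long."""
    await asyncio.sleep(5)


async def teardown_with_timeout(seconds: float) -> str:
    """Run the teardown under a deadline, catching only the builtin TimeoutError."""
    try:
        await asyncio.wait_for(_slow_teardown(), timeout=seconds)
    except TimeoutError:
        # On 3.11+ this also covers asyncio.TimeoutError, since both names
        # refer to the same class.
        return "timed out"
    return "finished"
```

On interpreters older than 3.11 the two names are distinct classes, so the single-name clause is only equivalent when the project's minimum Python version is 3.11 or later.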
Review follow-ups for the mini-swe ACP shim:
- Fix dangling submit tool_call: the `echo COMPLETE_TASK...` command makes env.execute raise Submitted before the parent emits observations, leaving its ACP tool_call stuck in_progress. _ACPAgent.execute_actions now catches Submitted, emits a completed tool_call_update, then re-raises. Verified on Azure gpt-5.5 and Bedrock opus-4.7: every tool call ends `completed`.
- Make the shim import-safe: stdout redirection and the (banner-printing) minisweagent import move into main()/a factory, so the pure routing policy is importable and unit-testable without the sandbox runtime.
- Integration parity: add tests/integration/configs/mini-swe.yaml and register mini-swe in run.sh ALL_AGENTS, matching the other 8 supported agents.
- Add tests/test_mini_swe_routing.py covering Azure/Bedrock/anthropic/empty protocol routing in _litellm_prefix.
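The import-safe layout can be illustrated with a rough sketch: pure helpers live at module top level, while anything that touches stdout runs only inside `main()`. The function names, the `ready` message, and the assumption that protocol traffic is newline-delimited JSON on stdout are all hypothetical, not taken from the shim.

```python
import contextlib
import json
import sys


def pure_policy(protocol: str) -> str:
    """Side-effect-free helper: importing the module to test this prints nothing."""
    return "anthropic" if protocol == "anthropic-messages" else "openai"


def main(protocol_out=None) -> int:
    """Do all stdout handling and noisy imports here, never at import time."""
    # Keep a handle on the real protocol stream before redirecting prints.
    protocol_out = protocol_out if protocol_out is not None else sys.stdout
    with contextlib.redirect_stdout(sys.stderr):
        # A banner-printing dependency would be imported here; its output
        # lands on stderr instead of corrupting the protocol stream.
        print("banner from a late import")
        message = {"jsonrpc": "2.0", "method": "ready"}  # hypothetical message
        protocol_out.write(json.dumps(message) + "\n")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

With this shape, a unit test can import the module and call `pure_policy` directly, and can also pass an in-memory stream to `main()` to check that only protocol messages reach it.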
…rmance map
Second-round review follow-ups:
- Model the ACP tool-call lifecycle per action. execute_actions now drives the env loop itself (mirroring DefaultAgent) instead of delegating then patching up after Submitted. Each action emits start→result around its own env.execute; the submit action is closed with the submission; actions that never run (e.g. anything after submit in a multi-tool-call turn) emit nothing instead of being falsely marked completed. tool_call start moves out of query() into the execution loop.
- Return a JSON-RPC error (not a successful end_turn) for unexpected exceptions in session/prompt, so BenchFlow classifies auth/provider/protocol/runtime failures as agent/infra errors rather than masking them as task failures (matches the openclaw shim; the agent's own task failures still return normally with an exit_status).
- Add mini-swe to tests/conformance/run_conformance.py AGENT_MODELS + ENV_KEYS (gemini smoke model + keys) so the conformance run uses the right model and credential check instead of the unknown-agent fallback.
- Add tests/test_mini_swe_submit.py (gated on minisweagent, like the docker smoke test) proving the multi-action submit lifecycle.
Verified: full unit suite green; Azure gpt-5.5 and Bedrock opus-4.7 e2e on SkillsBench weighted-gdp-calc. Every tool call completed, 0 dangling, agent iterates on outputs, usage extracted via provider_response.
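The per-action lifecycle can be sketched with stand-in types. Everything below is hypothetical scaffolding (the `Submitted` class, the `"submit"` sentinel command, the event dictionaries, and the function signatures); only the ordering of events follows the description above.

```python
from typing import Callable


class Submitted(Exception):
    """Stand-in for the exception the environment raises on submission."""

    def __init__(self, submission: str) -> None:
        super().__init__(submission)
        self.submission = submission


def fake_execute(command: str) -> str:
    """Hypothetical environment: the literal command 'submit' ends the task."""
    if command == "submit":
        raise Submitted("final answer")
    return f"ran: {command}"


def execute_actions(actions: list[str], execute: Callable[[str], str],
                    emit: Callable[[dict], None]) -> list[str]:
    """Emit start and result events around each action's own execution."""
    outputs = []
    for index, command in enumerate(actions):
        call_id = f"call-{index}"
        emit({"id": call_id, "status": "in_progress", "command": command})
        try:
            output = execute(command)
        except Submitted as done:
            # Close the submit action with the submission, then propagate.
            emit({"id": call_id, "status": "completed", "output": done.submission})
            raise
        emit({"id": call_id, "status": "completed", "output": output})
        outputs.append(output)
    return outputs


def run_demo() -> list[dict]:
    """Run a three-action turn where the second action submits."""
    events: list[dict] = []
    try:
        execute_actions(["ls", "submit", "cat notes.txt"], fake_execute, events.append)
    except Submitted:
        pass
    return events
```

In `run_demo()`, the third action never starts, so no event carries the id `call-2`; the first two actions each get exactly one `in_progress` and one `completed` event.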
Summary
Integrates mini-swe-agent as a first-class benchflow agent (`mini-swe`, aliases `mini`/`minisweagent`/`mini-swe-agent`), following the same integration contract as the other supported agents.

mini-swe is a deliberately minimal, single-bash-tool harness for apples-to-apples model comparison. A new in-process ACP shim runs its `DefaultAgent` loop and loads its bundled `mini.yaml` verbatim (minus the interactive `mode` key), so the upstream guardrails are reproduced faithfully: single `bash` tool, shared system/instance templates, >10k output truncation, malformed-tool-call retry.

Files
- `src/benchflow/agents/mini_swe_acp_shim.py` (new): import-safe ACP shim (no top-level side effects; stdout isolation and the minisweagent import happen in `main()`).
- `src/benchflow/agents/registry.py`: one additive `AgentConfig` plus aliases; installs into an isolated `/opt/benchflow/mini-swe-venv`.
- `tests/integration/configs/mini-swe.yaml` + `tests/integration/run.sh`: agent integration matrix.
- `tests/conformance/run_conformance.py`: mini-swe smoke model + env keys.
- `tests/test_mini_swe_routing.py`, `tests/test_mini_swe_submit.py`: routing + submit-lifecycle tests.
- `src/benchflow/sandbox/docker.py`: drive-by fix of a pre-existing `ruff UP041` error (added in #575, "openhands install + docker concurrency: 4 fixes to make --concurrency 60 viable") that failed `ruff check src tests` for every PR off main. Behavior-preserving.

Provider wiring
The shim reads `BENCHFLOW_PROVIDER_*` directly (like `openclaw`/`pi`/`opencode`/`harvey-lab`), so no `env.py` changes are needed; the usage proxy is honored via the injected litellm `api_base`, so token usage is captured the same way as other agents.

`_litellm_prefix` reconstructs the litellm provider prefix from `BENCHFLOW_PROVIDER_PROTOCOL`. mini-swe drives `litellm.completion` (chat-completions / anthropic-messages, not the OpenAI Responses API); `openai-responses` only comes from aws-bedrock, whose proxy also exposes an anthropic-messages surface, so Anthropic models route there. This makes Azure (`openai-completions`) and Bedrock (Claude via the proxy's `/v1/messages`) both work.

Review follow-ups (thermo-nuclear, two rounds)
- `execute_actions` drives the env loop itself: each action emits start→result around its own `env.execute`; the submit action is closed with the submission; actions that never run (anything after submit in a multi-tool-call turn) emit nothing rather than being falsely marked completed. Fixes both the original dangling-submit bug and the multi-action pollution case.
- Unexpected exceptions in `session/prompt` return a JSON-RPC error (not a successful `end_turn`), so auth/provider/protocol/runtime failures are classified as agent/infra errors instead of masquerading as task failures (matches openclaw). The agent's own task failures still return normally with an `exit_status`.
- Stdout redirection and the minisweagent import move into `main()`/a factory, so the routing policy is importable/unit-testable without the sandbox runtime and importing the module never clobbers stdout.
- mini-swe is registered in the integration matrix (`configs/` + `run.sh`) and the conformance smoke map.

Test plan
- `ruff check src tests`, `ruff format --check src tests`, `ty check`: clean
- `test_mini_swe_routing.py` (Azure/Bedrock/anthropic/empty routing); `test_mini_swe_submit.py` (multi-action submit lifecycle: executed action → real output completed, submit → submission completed, post-submit action → not emitted)
- `weighted-gdp-calc`, Azure `azure-foundry-openai/gpt-5.5`: healthy trajectory, 11 tool calls all `completed` (0 dangling), agent iterates on outputs, usage extracted (`provider_response`)
- `weighted-gdp-calc`, Bedrock `aws-bedrock/us.anthropic.claude-opus-4-7`: healthy trajectory, 9 tool calls all `completed` (0 dangling), usage extracted
- `pdf-excel-diff`: reward 1.0 on both Azure gpt-5.5 and Bedrock opus-4.7
- `claude-agent-acp` through the same pipeline unaffected

Note: `reward 0.0` on `weighted-gdp-calc` reflects model/task difficulty, not the integration; trajectory, ACP lifecycle, and token-usage extraction are all healthy.
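The error-classification behavior listed under the review follow-ups can be sketched as below. This is a rough illustration under assumptions: the handler name, the placement of `exit_status` under `_meta`, and the choice of error code are hypothetical rather than taken from the shim; `-32603` is the "Internal error" code defined by JSON-RPC 2.0.

```python
from typing import Any, Callable

INTERNAL_ERROR = -32603  # "Internal error" in the JSON-RPC 2.0 specification


def handle_prompt(request_id: int, run_agent: Callable[[], str]) -> dict[str, Any]:
    """Return a JSON-RPC result for a finished run, or an error for a crash."""
    try:
        exit_status = run_agent()
    except Exception as exc:
        # Unexpected failures (auth, provider, protocol, runtime) become
        # JSON-RPC errors so the caller can tell them apart from task failures.
        return {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": INTERNAL_ERROR, "message": f"{type(exc).__name__}: {exc}"},
        }
    # The agent's own outcome, success or failure, is a normal response.
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        # Assumption: exit_status travels in a hypothetical _meta field.
        "result": {"stopReason": "end_turn", "_meta": {"exit_status": exit_status}},
    }


def failing_agent() -> str:
    """Hypothetical agent run that dies on a provider auth failure."""
    raise RuntimeError("401 unauthorized")
```

A response carries either `result` or `error`, never both, which is what lets a harness separate an infrastructure failure from a run that completed but scored poorly.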