Add mini-swe-agent harness via ACP shim #576

Open
bingran-you wants to merge 12 commits into benchflow-ai:main from bingran-you:bry/blissful-allen-516e63

Conversation

@bingran-you bingran-you commented May 28, 2026

Summary

Integrates mini-swe-agent as a first-class benchflow agent (mini-swe, aliases mini / minisweagent / mini-swe-agent), following the same integration contract as the other supported agents.

mini-swe is a deliberately minimal, single-bash-tool harness for apples-to-apples model comparison. A new in-process ACP shim runs its DefaultAgent loop and loads its bundled mini.yaml verbatim (minus the interactive mode key), so the upstream guardrails are reproduced faithfully: single bash tool, shared system/instance templates, >10k output truncation, malformed-tool-call retry.
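A minimal sketch of "load the bundled config verbatim, minus the interactive mode key". The dict layout (an `agent` section holding `mode`) and the helper name `strip_interactive_mode` are assumptions for illustration; the real mini.yaml and the shim may structure this differently.

```python
import copy


def strip_interactive_mode(config: dict) -> dict:
    """Return a deep copy of config with the agent-level 'mode' key removed."""
    cleaned = copy.deepcopy(config)
    # pop() with a default is a no-op when the key is absent
    cleaned.get("agent", {}).pop("mode", None)
    return cleaned


# Hypothetical stand-in for the parsed mini.yaml (values are placeholders)
bundled = {
    "agent": {
        "system_template": "...",
        "instance_template": "...",
        "mode": "confirm",
    },
    "model": {},
}

agent_config = strip_interactive_mode(bundled)["agent"]
```

Copying before popping keeps the bundled config untouched, so every other key reaches the agent exactly as upstream ships it.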

Files

  • src/benchflow/agents/mini_swe_acp_shim.py (new) — import-safe ACP shim (no top-level side effects; stdout isolation + minisweagent import happen in main()).
  • src/benchflow/agents/registry.py — one additive AgentConfig + aliases; installs into an isolated /opt/benchflow/mini-swe-venv.
  • tests/integration/configs/mini-swe.yaml + tests/integration/run.sh — agent integration matrix.
  • tests/conformance/run_conformance.py — mini-swe smoke model + env keys.
  • tests/test_mini_swe_routing.py, tests/test_mini_swe_submit.py — routing + submit-lifecycle tests.
  • src/benchflow/sandbox/docker.py — drive-by fix of a pre-existing ruff UP041 error, introduced in #575 ("openhands install + docker concurrency: 4 fixes to make --concurrency 60 viable"), that failed ruff check src tests for every PR off main. Behavior-preserving.
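For context on the UP041 fix: on Python 3.11+ `asyncio.TimeoutError` is the same object as the builtin `TimeoutError`, so listing both in an `except` tuple is redundant. The sketch below illustrates that with a hypothetical coroutine; it is not the actual docker teardown code.

```python
import asyncio
import sys

# True on Python 3.11+, where asyncio.TimeoutError is the builtin TimeoutError.
ALIASED = asyncio.TimeoutError is TimeoutError


async def wait_with_deadline() -> str:
    """Hypothetical teardown step that exceeds its deadline."""
    try:
        await asyncio.wait_for(asyncio.sleep(1), timeout=0.01)
    except TimeoutError:  # on 3.11+ this also covers asyncio.TimeoutError
        return "timed out"
    return "finished"


def run_demo() -> str:
    """Run the coroutine to completion; assumes Python 3.11+."""
    return asyncio.run(wait_with_deadline())
```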

Provider wiring

The shim reads BENCHFLOW_PROVIDER_* directly (like openclaw/pi/opencode/harvey-lab) — no env.py changes; the usage proxy is honored via the injected litellm api_base, so token usage is captured the same way as other agents.

_litellm_prefix reconstructs the litellm provider prefix from BENCHFLOW_PROVIDER_PROTOCOL. mini-swe drives litellm.completion (chat-completions / anthropic-messages, not the OpenAI Responses API); openai-responses only comes from aws-bedrock, whose proxy also exposes an anthropic-messages surface, so Anthropic models route there. This makes Azure (openai-completions) and Bedrock (Claude via the proxy's /v1/messages) both work.
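The routing policy described above can be sketched roughly as follows. This is not the shim's `_litellm_prefix`: the name `litellm_prefix_sketch`, the `(protocol, model)` signature, the substring check for Claude models, and the fallbacks for non-Claude and empty inputs are all assumptions made for illustration.

```python
def litellm_prefix_sketch(protocol: str, model: str) -> str:
    """Map a provider protocol to a litellm provider prefix (illustrative)."""
    if protocol == "anthropic-messages":
        return "anthropic"
    if protocol == "openai-completions":
        return "openai"
    if protocol == "openai-responses":
        # litellm.completion does not speak the Responses API, so Claude
        # models go to the proxy's anthropic-messages surface instead.
        if "claude" in model.lower():
            return "anthropic"
        # Assumption: anything else falls back to chat-completions.
        return "openai"
    # Assumption: an empty or unknown protocol yields no prefix.
    return ""


azure = litellm_prefix_sketch("openai-completions", "gpt-5.5")
bedrock = litellm_prefix_sketch("openai-responses", "us.anthropic.claude-opus-4-7")
```

A function rather than a flat dict lets the `openai-responses` case depend on the model name, which a plain protocol-to-prefix mapping cannot express.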

Review follow-ups (thermo-nuclear, two rounds)

  • Per-action ACP lifecycle: execute_actions drives the env loop itself — each action emits start→result around its own env.execute; the submit action is closed with the submission; actions that never run (anything after submit in a multi-tool-call turn) emit nothing rather than being falsely marked completed. Fixes both the original dangling-submit bug and the multi-action pollution case.
  • Infra-error classification: unexpected exceptions in session/prompt return a JSON-RPC error (not a successful end_turn), so auth/provider/protocol/runtime failures are classified as agent/infra errors instead of masquerading as task failures (matches openclaw). The agent's own task failures still return normally with an exit_status.
  • Import-safe shim: stdout redirect + minisweagent import live in main()/a factory, so the routing policy is importable/unit-testable without the sandbox runtime and importing the module never clobbers stdout.
  • Parity coverage: mini-swe added to the integration matrix (configs/ + run.sh) and the conformance smoke map.
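The per-action lifecycle in the first bullet can be illustrated with a small stand-alone sketch. Everything here is hypothetical (the `Submitted` stand-in, the `SUBMIT` sentinel command, the event dict shape, and the function names); it only mirrors the described behavior: start then result around each executed action, the submit action closed with the submission, and nothing emitted for actions that never run.

```python
class Submitted(Exception):
    """Stand-in for the exception the environment raises on submit."""


def execute_actions_sketch(actions, execute, emit):
    """Run actions in order, emitting start -> result per executed action."""
    outputs = []
    for index, action in enumerate(actions):
        emit({"id": index, "status": "in_progress", "command": action})
        try:
            output = execute(action)
        except Submitted as done:
            # Close the submit call with the submission, then propagate.
            emit({"id": index, "status": "completed", "output": str(done)})
            raise
        emit({"id": index, "status": "completed", "output": output})
        outputs.append(output)
    return outputs


def fake_execute(command):
    """Hypothetical environment: the SUBMIT sentinel ends the episode."""
    if command == "SUBMIT":
        raise Submitted("final answer")
    return f"ran: {command}"


events = []
try:
    execute_actions_sketch(["ls", "SUBMIT", "cat late.txt"], fake_execute, events.append)
except Submitted:
    pass  # the caller would turn this into the final result
```

Because the start event is emitted inside the loop, the third action never produces a start and therefore cannot be left dangling or falsely completed.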

Test plan

  • ruff check src tests, ruff format --check src tests, ty check — clean
  • Full unit suite: 2504 passed, 3 skipped, 0 failed
  • test_mini_swe_routing.py (Azure/Bedrock/anthropic/empty routing); test_mini_swe_submit.py (multi-action submit lifecycle — executed action → real output completed, submit → submission completed, post-submit action → not emitted)
  • Real e2e on SkillsBench weighted-gdp-calc, Azure azure-foundry-openai/gpt-5.5: healthy trajectory, 11 tool calls all completed (0 dangling), agent iterates on outputs, usage extracted (provider_response)
  • Real e2e on SkillsBench weighted-gdp-calc, Bedrock aws-bedrock/us.anthropic.claude-opus-4-7: healthy trajectory, 9 tool calls all completed (0 dangling), usage extracted
  • Real e2e on SkillsBench pdf-excel-diff: reward 1.0 on both Azure gpt-5.5 and Bedrock opus-4.7
  • Regression — existing claude-agent-acp through the same pipeline unaffected

Note: reward 0.0 on weighted-gdp-calc reflects model/task difficulty, not the integration — trajectory, ACP lifecycle, and token-usage extraction are all healthy.

Integrates SWE-agent's mini-swe-agent as a benchflow agent. A new in-process
ACP shim runs mini-swe's DefaultAgent loop and loads its bundled mini.yaml
verbatim (minus the interactive `mode` key), so the upstream guardrails are
reproduced faithfully: single bash tool, shared system instructions, >10k
output truncation, and malformed-tool-call retry.

The shim reads BENCHFLOW_PROVIDER_* directly (like openclaw/pi/opencode), so
no env.py wiring is needed; the usage proxy is honored automatically via the
injected litellm api_base. Registered as `mini-swe` with aliases mini /
minisweagent / mini-swe-agent.

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

litellm.completion (what mini-swe drives) speaks chat-completions and
anthropic-messages but NOT the OpenAI Responses API. Replace the flat
protocol->prefix dict with a policy helper: anthropic-messages -> anthropic,
openai-completions -> openai, and openai-responses (only ever aws-bedrock,
whose proxy also exposes /v1/messages) -> anthropic for Claude models.

Verified end-to-end: Azure gpt-5.5 and Bedrock us.anthropic.claude-opus-4-7
both solve hello-world and the skillsbench pdf-excel-diff task (reward 1.0,
token usage extracted via the usage proxy).

`except (TimeoutError, asyncio.TimeoutError)` in docker teardown (added in
benchflow-ai#575) is redundant: asyncio.TimeoutError is an alias of the builtin
TimeoutError on Python 3.11+. ruff UP041 flags it, failing `ruff check src tests`
for every PR off main. Collapse to `except TimeoutError`. Behavior-preserving.

Review follow-ups for the mini-swe ACP shim:

- Fix dangling submit tool_call: the `echo COMPLETE_TASK...` command makes
  env.execute raise Submitted before the parent emits observations, leaving its
  ACP tool_call stuck in_progress. _ACPAgent.execute_actions now catches
  Submitted, emits a completed tool_call_update, then re-raises. Verified on
  Azure gpt-5.5 and Bedrock opus-4.7: every tool call ends `completed`.

- Make the shim import-safe: stdout redirection and the (banner-printing)
  minisweagent import move into main()/a factory, so the pure routing policy
  is importable and unit-testable without the sandbox runtime.

- Integration parity: add tests/integration/configs/mini-swe.yaml and register
  mini-swe in run.sh ALL_AGENTS, matching the other 8 supported agents.

- Add tests/test_mini_swe_routing.py covering Azure/Bedrock/anthropic/empty
  protocol routing in _litellm_prefix.
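The import-safety point can be sketched as below: pure policy stays at module level, while stdout capture and the noisy import happen only when a function is called. The names `speaks_natively`, `load_agent_factory`, and `main` are hypothetical, and the `print` call merely stands in for the banner-printing `minisweagent` import.

```python
import contextlib
import io
import sys


def speaks_natively(protocol: str) -> bool:
    """Pure policy helper: importing this module has no side effects."""
    return protocol in {"openai-completions", "anthropic-messages"}


def load_agent_factory() -> str:
    """Deferred, noisy import: banner output is captured, not printed."""
    banner = io.StringIO()
    with contextlib.redirect_stdout(banner):
        print("banner from a noisy import")  # stands in for the real import
    return banner.getvalue()


def main() -> int:
    captured = load_agent_factory()
    # Diagnostics go to stderr so stdout stays free for protocol traffic.
    sys.stderr.write(f"suppressed {len(captured)} banner chars\n")
    return 0

# A real entry point would call main() under an `if __name__ == "__main__"` guard.
```

With this layout a unit test can import the module and call the policy helper without the sandbox runtime installed.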
…rmance map

Second-round review follow-ups:

- Model the ACP tool-call lifecycle per action. execute_actions now drives the
  env loop itself (mirroring DefaultAgent) instead of delegating then patching
  up after Submitted. Each action emits start→result around its own
  env.execute; the submit action is closed with the submission; actions that
  never run (e.g. anything after submit in a multi-tool-call turn) emit nothing
  instead of being falsely marked completed. tool_call start moves out of
  query() into the execution loop.

- Return a JSON-RPC error (not a successful end_turn) for unexpected
  exceptions in session/prompt, so BenchFlow classifies auth/provider/protocol/
  runtime failures as agent/infra errors rather than masking them as task
  failures (matches the openclaw shim; the agent's own task failures still
  return normally with an exit_status).

- Add mini-swe to tests/conformance/run_conformance.py AGENT_MODELS + ENV_KEYS
  (gemini smoke model + keys) so the conformance run uses the right model and
  credential check instead of the unknown-agent fallback.

- Add tests/test_mini_swe_submit.py (gated on minisweagent, like the docker
  smoke test) proving the multi-action submit lifecycle.
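The error-classification follow-up can be sketched as a handler that returns a JSON-RPC 2.0 error object for unexpected exceptions and a normal result otherwise. The function name `handle_prompt_sketch`, the `run_agent` callable, and the placement of `exit_status` in the result are assumptions; only the JSON-RPC envelope and the standard internal-error code -32603 come from the JSON-RPC specification.

```python
def handle_prompt_sketch(request_id, run_agent):
    """Answer a session/prompt request (illustrative, not the shim's code)."""
    try:
        # Task-level failures come back as an ordinary exit status.
        exit_status = run_agent()
    except Exception as exc:
        # Auth/provider/protocol/runtime failures surface as a JSON-RPC error.
        return {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": -32603, "message": f"{type(exc).__name__}: {exc}"},
        }
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {"stopReason": "end_turn", "exit_status": exit_status},
    }


def failing_provider():
    """Hypothetical infra failure, e.g. rejected credentials."""
    raise RuntimeError("401 unauthorized")


ok = handle_prompt_sketch(1, lambda: "LimitsExceeded")
bad = handle_prompt_sketch(2, failing_provider)
```

Keeping the two outcomes in separate envelope fields is what lets the caller tell an infrastructure error from a task the agent simply failed.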

Verified: full unit suite green; Azure gpt-5.5 and Bedrock opus-4.7 e2e on
SkillsBench weighted-gdp-calc — every tool call completed, 0 dangling, agent
iterates on outputs, usage extracted via provider_response.