feat(coding_agent_rl): add SWE-bench harness evaluation path by aoshen02 · Pull Request #2079 · THUDM/slime

aoshen02 · 2026-06-15T03:11:17Z

Summary

Add swebench_metadata as a third evaluation route in sandbox.evaluate(), alongside the existing swepro and eval_cmd paths. This allows coding_agent_rl to grade SWE-bench Verified instances directly using the official swebench harness.

Changes

docker/Dockerfile: add uni-agent (source install, --no-deps) and swebench Python packages
examples/coding_agent_rl/sandbox.py: add swebench_metadata param to evaluate(), add _run_swebench_eval() — reuses uni_agent.reward.swe_bench._make_eval_script_list for eval script generation, standard swebench grading API for result parsing
examples/coding_agent_rl/generate.py: pass swebench_metadata through _metadata() and the evaluate() call site

Evaluation priority

swepro           → SWEPro custom scripts (existing, unchanged)
swebench_metadata → SWE-bench official harness (new)
eval_cmd         → shell command fallback (existing, unchanged)
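
The priority order above can be sketched as a simple first-match dispatch. This is only an illustration, not the code in `sandbox.py`: the function name `choose_eval_route` is hypothetical, and both the truthiness checks and the error raised when no route is configured are assumptions about behavior the PR description does not spell out.

```python
from typing import Any, Optional


def choose_eval_route(
    swepro: Optional[dict[str, Any]] = None,
    swebench_metadata: Optional[dict[str, Any]] = None,
    eval_cmd: Optional[str] = None,
) -> str:
    """Return the name of the route that would run, in priority order.

    Hypothetical helper: the real evaluate() may test its arguments
    differently (for example `is not None` instead of truthiness).
    """
    if swepro:
        return "swepro"             # SWEPro custom scripts win when present
    if swebench_metadata:
        return "swebench_metadata"  # new SWE-bench harness route
    if eval_cmd:
        return "eval_cmd"           # shell command fallback
    # Assumption: with nothing configured, fail loudly rather than guess.
    raise ValueError("no evaluation route configured")


# A sample that carries both SWE-bench metadata and a fallback command
# is graded by the harness route, not the shell command.
route = choose_eval_route(swebench_metadata={"instance_id": "demo"}, eval_cmd="pytest")
```

Under these assumptions, existing samples that only set `swepro` or `eval_cmd` take the same route as before, which is consistent with the "existing, unchanged" notes above.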

Context

Validated on the 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B: uniagent mode scores 71.3% (355/498), within 0.3 points of the official 71.6%.

Test plan

`docker build` completes
`python -c "from uni_agent.reward.swe_bench import _make_eval_script_list; print('ok')"` inside container
Run a small SWE-bench eval with `swebench_metadata` in sample metadata

🤖 Generated with Claude Code

…louts Add uni-agent (source install, --no-deps) and swebench to the Docker image so coding_agent_rl can import uni_agent.reward.swe_bench for SWE-bench harness evaluation, and swebench for grading constants/parsers. Previously these were installed at container startup via run scripts, adding cold-start latency and making the image non-self-contained. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: aoshen <aoshen@inferact.ai>

Add `swebench_metadata` as a third evaluation route in `sandbox.evaluate()`, alongside the existing `swepro` and `eval_cmd` paths. This allows coding_agent_rl to grade SWE-bench Verified instances directly using the official swebench harness (constants, log parsers, grading).

Eval script generation reuses `uni_agent.reward.swe_bench._make_eval_script_list` (installed via the Dockerfile change in this PR) so the eval logic stays in one place. Result parsing uses the standard swebench grading API.

Changes:
- sandbox.py: add `swebench_metadata` param to `evaluate()`, add `_run_swebench_eval()` between `_run_swepro` and `_run_eval_cmd`
- generate.py: pass `swebench_metadata` through `_metadata()` and the `evaluate()` call site

Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: aoshen <aoshen@inferact.ai>

Now that uni_agent.reward.swe_bench exposes make_eval_script() and parse_eval_output() as standalone functions, _run_swebench_eval is just 6 lines: build script, execute, parse. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: aoshen <aoshen@inferact.ai>
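
The build, execute, parse flow described in this commit can be sketched as below. This is a hedged illustration, not the PR's `_run_swebench_eval`: the signatures of `make_eval_script()` and `parse_eval_output()` are not shown in this PR, so the sketch takes the three steps as injected callables, and every name here (`run_swebench_eval_sketch`, the `fake_*` stand-ins, the `"PASSED"` marker) is hypothetical.

```python
from typing import Any, Callable


def run_swebench_eval_sketch(
    metadata: dict[str, Any],
    make_script: Callable[[dict[str, Any]], str],
    execute: Callable[[str], str],
    parse_output: Callable[[str, dict[str, Any]], bool],
) -> bool:
    """Grade one instance in three steps; all callables are assumptions."""
    script = make_script(metadata)         # 1. build the eval script
    output = execute(script)               # 2. run it (in the sandbox, in the real code)
    return parse_output(output, metadata)  # 3. turn the test log into a verdict


# Toy stand-ins so the sketch runs without a sandbox or the real packages.
def fake_make_script(metadata: dict[str, Any]) -> str:
    return f"run-tests --instance {metadata['instance_id']}"


def fake_execute(script: str) -> str:
    # Pretend the sandbox ran the script and the tests passed.
    return f"$ {script}\nall tests PASSED\n"


def fake_parse_output(output: str, metadata: dict[str, Any]) -> bool:
    return "PASSED" in output


resolved = run_swebench_eval_sketch(
    {"instance_id": "demo-instance"},
    fake_make_script,
    fake_execute,
    fake_parse_output,
)
```

Keeping script generation and log parsing behind two helper functions is what lets the sandbox side stay this small; how the real helpers consume the metadata is left to the uni-agent change referenced in this PR.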

Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: aoshen <aoshen@inferact.ai>

aoshen02 and others added 2 commits June 15, 2026 03:10

aoshen02 changed the title ~~feat(docker): add uni-agent and swebench dependencies for agentic rollouts~~ feat(coding_agent_rl): add SWE-bench harness evaluation path Jun 15, 2026

aoshen02 and others added 2 commits June 15, 2026 03:44

fix: drop standalone swebench pip install, uni-agent brings it as dep

411e6a5

Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: aoshen <aoshen@inferact.ai>

This was referenced Jun 15, 2026

feat(coding_agent_rl): add SWE-bench harness evaluation + uniagent mode vllm-project/vime#250

Open

[reward] refactor: extract make_eval_script / parse_eval_output as public helpers verl-project/uni-agent#62

Open

aoshen02 wants to merge 4 commits into THUDM:main from aoshen02:feat/agentic-rollout-deps
