feat(coding_agent_rl): add SWE-bench harness evaluation path#2079

Open
aoshen02 wants to merge 4 commits into THUDM:main from aoshen02:feat/agentic-rollout-deps

aoshen02 commented Jun 15, 2026

Contributor

Summary

Add swebench_metadata as a third evaluation route in sandbox.evaluate(), alongside the existing swepro and eval_cmd paths. This allows coding_agent_rl to grade SWE-bench Verified instances directly using the official swebench harness.

Changes

  • docker/Dockerfile: add uni-agent (source install, --no-deps) and swebench Python packages
  • examples/coding_agent_rl/sandbox.py: add swebench_metadata param to evaluate() and add _run_swebench_eval(), which reuses uni_agent.reward.swe_bench._make_eval_script_list for eval script generation and the standard swebench grading API for result parsing
  • examples/coding_agent_rl/generate.py: pass swebench_metadata through _metadata() and the evaluate() call site

Evaluation priority

swepro           → SWEPro custom scripts (existing, unchanged)
swebench_metadata → SWE-bench official harness (new)
eval_cmd         → shell command fallback (existing, unchanged)
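
The priority order above can be sketched as a simple first-match dispatch. This is a minimal illustration only: the PR page does not show the real signature or body of `sandbox.evaluate()`, so the function name `pick_eval_route` and its return values are hypothetical, and only the three parameter names come from the description.

```python
from typing import Any, Optional


def pick_eval_route(
    swepro: Optional[Any] = None,
    swebench_metadata: Optional[dict] = None,
    eval_cmd: Optional[str] = None,
) -> str:
    """Return the name of the route that would run (hypothetical helper).

    Mirrors the documented priority: swepro, then swebench_metadata,
    then eval_cmd. The first non-empty input wins.
    """
    if swepro:
        return "swepro"
    if swebench_metadata:
        return "swebench"
    if eval_cmd:
        return "eval_cmd"
    raise ValueError("no evaluation route configured for this sample")


# A sample carrying both swebench_metadata and eval_cmd uses the harness path.
route = pick_eval_route(swebench_metadata={"instance_id": "demo"}, eval_cmd="pytest -q")
print(route)
```

Checking the routes in a fixed order keeps existing `swepro` samples on their current path even if they also carry the new metadata.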

Context

Validated on a 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B: uniagent mode scores 71.3% (355/498), close to the official 71.6%.

Test plan

  • `docker build` completes
  • `python -c "from uni_agent.reward.swe_bench import _make_eval_script_list; print('ok')"` inside container
  • Run a small SWE-bench eval with `swebench_metadata` in sample metadata
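
The import check in the second step could be widened to cover both new dependencies at once. The sketch below is one possible smoke test, not part of the PR: the helper name `missing_modules` is hypothetical, and the module names are taken from the description above (`swebench`, `uni_agent.reward.swe_bench`).

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the subset of module names that cannot be located."""
    missing = []
    for name in names:
        try:
            # find_spec returns None for an unknown top-level module and
            # raises ModuleNotFoundError when a parent package is absent.
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except ModuleNotFoundError:
            missing.append(name)
    return missing


# Inside the built image this is expected to print an empty list.
print(missing_modules(["swebench", "uni_agent.reward.swe_bench"]))
```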

🤖 Generated with Claude Code

aoshen02 and others added 2 commits June 15, 2026 03:10
…louts

Add uni-agent (source install, --no-deps) and swebench to the Docker
image so coding_agent_rl can import uni_agent.reward.swe_bench for
SWE-bench harness evaluation, and swebench for grading constants/parsers.

Previously these were installed at container startup via run scripts,
adding cold-start latency and making the image non-self-contained.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
Add `swebench_metadata` as a third evaluation route in `sandbox.evaluate()`,
alongside the existing `swepro` and `eval_cmd` paths. This allows
coding_agent_rl to grade SWE-bench Verified instances directly using
the official swebench harness (constants, log parsers, grading).

Eval script generation reuses `uni_agent.reward.swe_bench._make_eval_script_list`
(installed via the Dockerfile change in this PR) so the eval logic stays
in one place. Result parsing uses the standard swebench grading API.

Changes:
- sandbox.py: add `swebench_metadata` param to `evaluate()`, add
  `_run_swebench_eval()` between `_run_swepro` and `_run_eval_cmd`
- generate.py: pass `swebench_metadata` through `_metadata()` and
  the `evaluate()` call site

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
aoshen02 changed the title from "feat(docker): add uni-agent and swebench dependencies for agentic rollouts" to "feat(coding_agent_rl): add SWE-bench harness evaluation path" Jun 15, 2026
aoshen02 and others added 2 commits June 15, 2026 03:44
Now that uni_agent.reward.swe_bench exposes make_eval_script() and
parse_eval_output() as standalone functions, _run_swebench_eval is
just 6 lines: build script, execute, parse.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
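
The "build script, execute, parse" shape from the refactor commit can be sketched as below. The signatures of `make_eval_script()` and `parse_eval_output()` are not shown on this page, so the sketch takes the three steps as injected callables and uses toy stand-ins. All function names here are hypothetical. The `FAIL_TO_PASS` key borrows the SWE-bench dataset field name and is simplified to a Python list.

```python
from typing import Callable


def run_eval_sketch(
    metadata: dict,
    build_script: Callable[[dict], str],
    execute: Callable[[str], str],
    parse: Callable[[dict, str], bool],
) -> bool:
    """Three-step evaluation: build the script, run it, parse the log."""
    script = build_script(metadata)   # step 1: build script
    output = execute(script)          # step 2: execute in the sandbox
    return parse(metadata, output)    # step 3: parse into resolved / not


# Toy stand-ins (assumptions, not the uni_agent or swebench APIs).
def fake_build(meta: dict) -> str:
    return "\n".join(f"echo 'PASSED {t}'" for t in meta["FAIL_TO_PASS"])


def fake_execute(script: str) -> str:
    # Pretend sandbox: each echo line yields its quoted text.
    return "\n".join(line[len("echo '"):-1] for line in script.splitlines())


def fake_parse(meta: dict, output: str) -> bool:
    passed = {
        line.split(" ", 1)[1]
        for line in output.splitlines()
        if line.startswith("PASSED ")
    }
    return all(t in passed for t in meta["FAIL_TO_PASS"])


resolved = run_eval_sketch(
    {"FAIL_TO_PASS": ["test_a", "test_b"]}, fake_build, fake_execute, fake_parse
)
print(resolved)
```

Passing the steps in as callables keeps the sandbox code thin and leaves script generation and log parsing to the shared library, which fits the commit's stated goal of keeping the eval logic in one place.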