feat(coding_agent_rl): add SWE-bench harness evaluation path #2079
Open
aoshen02 wants to merge 4 commits into
…louts

Add uni-agent (source install, --no-deps) and swebench to the Docker image so coding_agent_rl can import uni_agent.reward.swe_bench for SWE-bench harness evaluation, and swebench for grading constants/parsers.

Previously these were installed at container startup via run scripts, adding cold-start latency and making the image non-self-contained.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
Add `swebench_metadata` as a third evaluation route in `sandbox.evaluate()`, alongside the existing `swepro` and `eval_cmd` paths. This allows coding_agent_rl to grade SWE-bench Verified instances directly using the official swebench harness (constants, log parsers, grading).

Eval script generation reuses `uni_agent.reward.swe_bench._make_eval_script_list` (installed via the Dockerfile change in this PR) so the eval logic stays in one place. Result parsing uses the standard swebench grading API.

Changes:
- sandbox.py: add `swebench_metadata` param to `evaluate()`, add `_run_swebench_eval()` between `_run_swepro` and `_run_eval_cmd`
- generate.py: pass `swebench_metadata` through `_metadata()` and the `evaluate()` call site

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
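The three routes described in this commit could be dispatched roughly as sketched below. This is illustrative only: the function name `choose_route`, its parameter names, and the returned labels are hypothetical, and the precedence shown (swepro, then swebench, then eval_cmd) is an assumption inferred solely from `_run_swebench_eval()` being placed between `_run_swepro` and `_run_eval_cmd`. The actual `evaluate()` signature in sandbox.py may differ.

```python
from typing import Optional


def choose_route(
    swepro: Optional[dict] = None,
    swebench_metadata: Optional[dict] = None,
    eval_cmd: Optional[str] = None,
) -> str:
    """Pick which evaluation route to run (hypothetical helper).

    Assumed precedence: swepro > swebench_metadata > eval_cmd.
    """
    if swepro is not None:
        return "swepro"
    if swebench_metadata is not None:
        return "swebench"
    if eval_cmd is not None:
        return "eval_cmd"
    return "none"  # nothing to evaluate against


if __name__ == "__main__":
    print(choose_route(swebench_metadata={"instance_id": "demo"}, eval_cmd="pytest"))
```

Under the assumed precedence, an instance carrying both `swebench_metadata` and an `eval_cmd` would be graded through the swebench route.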
Now that `uni_agent.reward.swe_bench` exposes `make_eval_script()` and `parse_eval_output()` as standalone functions, `_run_swebench_eval` is just 6 lines: build script, execute, parse.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
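The build, execute, parse shape described in this commit can be sketched as follows. This is a minimal sketch, not the code in sandbox.py: the real signatures of `make_eval_script()` and `parse_eval_output()` are not shown in this PR text, so they are passed in as callables with assumed signatures, and `ExecResult`, `execute`, and the float reward are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExecResult:
    """Hypothetical container for a sandbox command result."""
    exit_code: int
    output: str


def run_swebench_eval(
    metadata: dict,
    execute: Callable[[str], ExecResult],
    make_eval_script: Callable[[dict], str],         # assumed signature
    parse_eval_output: Callable[[dict, str], bool],  # assumed signature
) -> float:
    """Build the eval script, run it in the sandbox, parse the log."""
    script = make_eval_script(metadata)                     # build
    result = execute(script)                                # execute
    resolved = parse_eval_output(metadata, result.output)   # parse
    return 1.0 if resolved else 0.0
```

Passing the two helpers in as arguments keeps the sketch runnable without uni-agent installed; in the PR they would come from `uni_agent.reward.swe_bench`.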
Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: aoshen <aoshen@inferact.ai>
Summary
Add `swebench_metadata` as a third evaluation route in `sandbox.evaluate()`, alongside the existing `swepro` and `eval_cmd` paths. This allows `coding_agent_rl` to grade SWE-bench Verified instances directly using the official swebench harness.

Changes
- `docker/Dockerfile`: add `uni-agent` (source install, `--no-deps`) and `swebench` Python packages
- `examples/coding_agent_rl/sandbox.py`: add `swebench_metadata` param to `evaluate()`, add `_run_swebench_eval()`, which reuses `uni_agent.reward.swe_bench._make_eval_script_list` for eval script generation and the standard swebench grading API for result parsing
- `examples/coding_agent_rl/generate.py`: pass `swebench_metadata` through `_metadata()` and the `evaluate()` call site

Evaluation priority
Context
Validated on the 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B: uniagent mode scored 71.3% (355/498), close to the official 71.6%.
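As a quick arithmetic check on the figures above, the resolve rate and its gap to the official number can be recomputed directly. The variable names are illustrative; only the counts 355 and 498 and the 71.6% figure come from the PR.

```python
resolved = 355        # instances resolved, from the PR
graded = 498          # denominator reported in the PR
official_pct = 71.6   # official score quoted in the PR

rate_pct = 100 * resolved / graded
gap_pts = official_pct - rate_pct

print(f"resolve rate: {rate_pct:.1f}%")
print(f"gap to official: {gap_pts:.1f} points")
```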
Test plan
🤖 Generated with Claude Code