feat: ProgramBench reverse-engineering environment#1351
Eval run results (132/193 tasks, gpt-4.1-mini)
Overall avg reward: 0.103
Reward distribution: 102 zeros, 17 in (0, 0.5), 7 in [0.5, 1), 6 perfect 1.0
61/193 tasks incomplete due to prime_sandboxes container lifetime limit (~12-17 min per container on box
C++ 0% rate: needs investigation
All 10 C++ tasks scored 0. This may be a pipeline issue (g++ invocation, missing headers in gcc:13-bookworm, or test harness mismatch) or simply model failure. Worth a targeted smoke test.
Fixes committed this session
Results file: |
Fixes from agent review of PR #1351:
- vf.ensure_keys(["HF_TOKEN","OPENAI_API_KEY"]): add missing key, use vf. prefix
- _TEST_TIMEOUT per-language map; _run_tests now uses it (was fixed 300s)
- Drop unused _compile_hint from info dict
- Fill README.md placeholders
Pre-commit hooks:
- lint-agent: calls `claude --print` with .agents/skills/lint-agent/SKILL.md
- oo-agent: calls `claude --print` with .agents/skills/oo-agent/SKILL.md (only on environments/, set SKIP_OO_CHECK=1 to bypass)
Skill logic lives in markdown, not committed Python.
Also splits _evaluate into _compile + _run_tests (lint-agent threshold fix).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
84d9efe to
e952874
Smoke Test Results (2026-05-13)
Model: qwen3-30b-i (qwen/qwen3-30b-a3b-instruct-2507) ✅ working on pinference.ai
Step 4: C pipeline smoke test ✅
Step 5: Rust pipeline smoke test ✅
Pipeline status
All components validated:
Remaining blocker: Docker image rebuild
The deployed image lacks the mini-swe-agent skip guard sentinel file. Without rebuild,
To unblock: run locally:
Step 6: Full 200-task eval
Ready to run pending your sign-off. Recommend:
Verifiers.v1 environment where an RLM agent reconstructs source code from compiled binaries, scored by fraction of pytest tests passed.
Key implementation decisions:
- golang:1.22-bookworm / gcc:13-bookworm for standard langs; custom Rust image with pre-warmed cargo registry for Rust tasks
- apt-get update + pip --break-system-packages required for pytest on Debian 12 (PEP 668 restriction)
- Test archives hidden from agent at setup; uploaded at scoring time following the rlm_swe_v1 pattern
- Scoring re-runs compile.sh with PATH prepended for toolchain binaries so agents that omit PATH export still get scored correctly
- go_subset.jsonl contains 5 smoke-test tasks; full dataset on PrimeIntellect/programbench-processed (HF private)
Smoke test: 17/36 tests passed (reward=0.472) with gpt-4.1-mini on antonmedv__fx.86d0d34, confirming end-to-end pipeline works.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…2a, fasttext, brotli, halite, blake3)
…ces fragile shell heredocs)
…ures Some go.mod files specify newer patch versions (e.g. ariga/atlas requires go1.24.11) which causes Go to try fetching the toolchain at build time, failing in the subprocess environment. GOTOOLCHAIN=local forces use of the installed Go version instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
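The GOTOOLCHAIN=local fix can be pictured as a small change to the environment passed to the build subprocess. This is a minimal sketch, not the PR's code: the helper name `go_build_env` is hypothetical, and only the `GOTOOLCHAIN=local` setting comes from the commit message.

```python
import os


def go_build_env(base=None):
    """Return a copy of the environment with Go toolchain auto-download disabled."""
    env = dict(os.environ if base is None else base)
    # GOTOOLCHAIN=local makes `go` use the installed toolchain instead of
    # trying to fetch the version named in go.mod.
    env["GOTOOLCHAIN"] = "local"
    return env
```

The returned dict could then be passed as `env=` to `subprocess.run(["go", "build", "./..."], ...)`.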
When `go build ./...` fails with no Go files at root (e.g. gdu, pixterm, hostctl where main is under cmd/), grep for `package main` declarations and build from the containing directory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
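A rough Python equivalent of the `package main` search described above, assuming a plain filesystem walk rather than the grep call the commit uses. The function name `find_main_package_dirs` is hypothetical, and the regex is a heuristic that does not parse Go comments or strings.

```python
import os
import re

# Heuristic match for a `package main` clause at the start of a line.
_PACKAGE_MAIN = re.compile(r"^\s*package\s+main\b", re.MULTILINE)


def find_main_package_dirs(root):
    """Return sorted directories (relative to root) holding a `package main` file."""
    hits = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip non-Go files and test files, which cannot be build targets.
            if not name.endswith(".go") or name.endswith("_test.go"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                if _PACKAGE_MAIN.search(fh.read()):
                    hits.add(os.path.relpath(dirpath, root))
                    break  # one hit is enough for this directory
    return sorted(hits)
```

Each returned directory is a candidate for a `go build` run from that directory when the root-level build finds no Go files.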
test_branches stores filenames as e.g. '1b991a57d4e9.tar' but actual HF files are '1b991a57d4e9.tar.gz'. Appending .tar.gz directly produced a double-extension path that 404'd. Normalise the stem first. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
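The stem normalisation might look like the sketch below. The helper name `archive_filename` is an assumption; the `.tar` and `.tar.gz` suffixes and the sample hash are taken from the commit message.

```python
def archive_filename(branch_name):
    """Map a test_branches entry to its '.tar.gz' filename."""
    stem = branch_name
    # Strip a known suffix first (longest first) so that appending
    # '.tar.gz' never produces a double extension such as '.tar.tar.gz'.
    for suffix in (".tar.gz", ".tar"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    return stem + ".tar.gz"
```

With this, `'1b991a57d4e9.tar'`, `'1b991a57d4e9.tar.gz'`, and a bare `'1b991a57d4e9'` all resolve to the same path.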
Matches the version installed on the build node and satisfies go.mod constraints that require >= 1.26 (e.g. cheat/cheat). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
primeintellect/programbench-toolchain needs Docker Hub creds to push. Using rust:latest as fallback — the setup script installs pytest/tmux inside the container so the pipeline still works, just without pre-warmed cargo registry (slower for full evals). TODO: push custom image and revert to primeintellect/programbench-toolchain:latest Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /usr/local/cargo/bin to PATH (rust:latest puts cargo there, not /root/.cargo/bin)
- Update CARGO_HOME to /usr/local/cargo to match rust:latest layout
- Fix Dockerfile warmup block: heredoc-in-RUN fails Docker parser; use COPY instead
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
30s is too short for long-running agentic sandbox tasks; the worker event loop blocks on concurrent prime_sandboxes API calls and misses heartbeats, causing spurious restarts and lost in-flight results. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the ZMQ client, crashing the whole eval run. Record 0-reward outputs for the affected group and continue so the eval can complete. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the ZMQ client, crashing the whole eval run. Skip the failed group and log the error; it will be retried on the next --resume invocation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…leakage
Switch from verifiers RLM to mini-SWE-agent harness, matching the ProgramBench
paper scaffold. Remove all leaking fields from the agent prompt: nm_output,
strings_output, objdump_head, compile_hint, and explicit language hint. Apply
chmod 111 (execute-only) to the reference binary so the agent can run but not
read/decompile it. Retained fields in info{} (prefixed _) for internal logging.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mini-SWE-agent bash -lc doesn't inherit the Docker image's ENV PATH, so Go/Cargo weren't accessible. Move toolchain env vars from get_env_vars() into the per-task program.env dict, which gets merged into the command process environment before mini-swe-agent's run script executes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… PATH mini-swe-agent runs bash -lc which sources /etc/profile and may reset PATH, dropping Docker's ENV-injected toolchain dirs. Write a profile.d snippet in setup so the bash login shell always has Go/Cargo/Rust in PATH. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…i-swe-agent DEFAULT_CLI_SANDBOX has command_timeout=900s which kills mini-swe-agent before it can finish apt-get install + agent turns + compile (takes 15-17min). Set command_timeout = timeout_minutes * 60 per language in sandbox_config() so the full sandbox lifetime is available for the CLI command. Also revert temporary logger.warning diagnostics back to logger.info. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The execute_command API enforces a 900s max timeout. timeout_min*60 for Go (1200s) and Rust (2700s) exceeded this, causing HTTP 400 errors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…meout hard kill The sandbox API caps command_timeout at 900s. Without AGENT_TIMEOUT_SECONDS, mini-swe-agent runs up to its 3600s default and gets killed at 900s with an error. Set AGENT_TIMEOUT_SECONDS = command_timeout - 60 so the agent exits cleanly ~60s before the hard wall, allowing the harness to collect artifacts normally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
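The timeout arithmetic from this commit and the previous one can be written out as a short sketch. The 900s cap and the 60s headroom come from the commit messages; the function and constant names are hypothetical.

```python
API_MAX_COMMAND_TIMEOUT = 900  # seconds; execute_command cap per the commit message
AGENT_EXIT_HEADROOM = 60       # seconds left for the harness to collect artifacts


def agent_timeouts(sandbox_timeout_minutes):
    """Return (command_timeout, AGENT_TIMEOUT_SECONDS) for a sandbox lifetime."""
    # Clamp to the API cap so the request is not rejected with HTTP 400.
    command_timeout = min(sandbox_timeout_minutes * 60, API_MAX_COMMAND_TIMEOUT)
    # Stop the agent shortly before the hard wall so it exits cleanly.
    agent_timeout = command_timeout - AGENT_EXIT_HEADROOM
    return command_timeout, agent_timeout
```

For Go (20 min, 1200s) and Rust (45 min, 2700s) this yields a 900s command timeout and an 840s agent timeout.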
120s per-command timeout kills large Rust cargo build runs (exit 124). 600s matches the sandbox command_timeout safety margin and lets complex crates compile without hitting the timeout wall. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
10 min was too short: setup (apt-get + mini-swe-agent install) takes ~3-5 min, leaving only 5 min for the agent. Tasks ran for 12+ min and hit SandboxNotRunningError (pod killed at sandbox lifetime boundary). 20 min matches Go/C++ and leaves adequate headroom. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use `is not None` for dataset_name/dataset_split (consistent with other optional params; `or` was semantically wrong for str | None)
- Check exit codes on mkdir and profile.d write in setup (log warnings)
- Extract _hf_download() helper to deduplicate two near-identical hf_hub_download try-except blocks in _upload_binary/_setup_tests
- Fix state ordering in _setup_tests: set _pb_test_branch only after successful extraction (was set before upload/extract for hide=False)
- Add error handling for upload/extract in hide_tests_from_agent=False path
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
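The difference between `or` and `is not None` for a `str | None` parameter shows up with an empty string. The sketch below is illustrative only; the default value `"train"` and both function names are assumptions.

```python
from typing import Optional

DEFAULT_SPLIT = "train"  # hypothetical default, for illustration only


def pick_with_or(value: Optional[str]) -> str:
    # `or` replaces every falsy value, including an explicit empty string.
    return value or DEFAULT_SPLIT


def pick_with_is_not_none(value: Optional[str]) -> str:
    # Only a missing value (None) falls back to the default.
    return value if value is not None else DEFAULT_SPLIT
```

Both return the default for `None`, but only the `or` version silently overrides a caller who passed `""` on purpose.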
- Pin rust:latest → rust:1.92 for reproducibility (paper §4 constants)
- Add strace, ltrace, and no-wrap prohibition to SYSTEM_PROMPT (paper §3)
- Use max(test_branches) instead of [0] to pick most comprehensive branch
- Emit almost (≥95% tests) and resolved (100% tests) as state signals (paper §4)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
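The `almost` and `resolved` signals follow directly from the pass counts. A minimal sketch, assuming the thresholds stated above; the function name and dict layout are hypothetical and not the environment's actual state schema.

```python
def outcome_signals(n_passed, n_total):
    """Derive the dense reward plus the almost/resolved flags from test counts."""
    if n_total <= 0:
        # No tests collected: nothing can count as passed or resolved.
        return {"reward": 0.0, "almost": False, "resolved": False}
    return {
        "reward": n_passed / n_total,
        # Integer comparison avoids float rounding at the 95% boundary.
        "almost": n_passed * 100 >= n_total * 95,
        "resolved": n_passed == n_total,
    }
```

For the 17/36 smoke-test result mentioned earlier in this PR, the reward is 17/36 and both flags are false.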
…est exposure Drop nm_output, strings_output, objdump_head from info entirely — these are reverse-engineering artifacts that have no place in the task dict. example_io (expected I/O pairs) was already excluded from _build_instruction; add a comment making this explicit. Add logger.warning when hide_tests_from_agent=False so anyone running eval with tests visible gets a loud reminder that this violates paper §3. The agent's full context is now: readme + docs (documentation) + binary path. No source code, no test content, no analysis artifacts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The verifiers runtime injects OPENAI_BASE_URL + OPENAI_API_KEY into the container pointing at its own proxy. OPENAI_API_KEY in the outer environment is not needed when running via prime eval run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
huggingface_hub's get_token() checks env var, then ~/.cache/huggingface/token,
then huggingface-cli login cache. Passing token=None to hf_hub_download lets
it use that same chain instead of only os.environ.get('HF_TOKEN').
On remote nodes: either export HF_TOKEN=... or run huggingface-cli login once.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HF_TOKEN is required for private HuggingFace dataset + test archives. Usage: cp .env.example .env, fill in token, then source .env before eval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paper allows up to 1000 steps per rollout. mini-SWE-agent defaults to 0 (unlimited). Passing agent.step_limit=1000 via extra_config_specs caps steps at the paper spec. Wall clock (AGENT_TIMEOUT_SECONDS) remains the binding constraint at our resource level but step limit is now explicit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each task has N oracle test suites (branches) in programbench/ProgramBench-Tests. The official eval scores against all of them. Previously we used one branch (selected arbitrarily by max() on hex hash), which under-counted tests. Now: download all branches concurrently at setup, extract each into its own subdirectory in TEST_DIR to avoid filename collisions, run pytest once across all of them. n_passed/n_total covers the full union of oracle tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
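The per-branch extraction layout can be sketched locally with `tarfile`. This is an assumption-laden illustration: the real environment works inside a sandbox, the function name `extract_branches` is hypothetical, and the concurrent download step is left out.

```python
import os
import tarfile


def extract_branches(archives, test_dir):
    """Extract each archive into test_dir/<stem>/ and return the created dirs."""
    out_dirs = []
    for archive in archives:
        stem = os.path.basename(archive)
        for suffix in (".tar.gz", ".tar"):
            if stem.endswith(suffix):
                stem = stem[: -len(suffix)]
                break
        # One subdirectory per branch, so identically named test files
        # from different branches cannot overwrite each other.
        dest = os.path.join(test_dir, stem)
        os.makedirs(dest, exist_ok=True)
        with tarfile.open(archive) as tar:
            if hasattr(tarfile, "data_filter"):
                tar.extractall(dest, filter="data")  # safer extraction where supported
            else:
                tar.extractall(dest)
        out_dirs.append(dest)
    return out_dirs
```

A single pytest run pointed at `test_dir` would then collect the tests from every branch subdirectory.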
@vf.metric(priority=-1) reads state["resolved"] set by _evaluate. Runs after solved() (priority=0) so the flag is already populated. Dense reward (solved) stays as RL training signal; resolved_binary is the 0/1 "% Resolved" figure the paper reports as primary metric. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- network_access=False in sandbox_config (paper §8.2: infra-level block, not just prompt-level)
- System prompt: explicit ban on internet access, git clone, cargo install, go get, package-manager cache reads, and binary wrapping
- _is_binary_wrap(): compare sha256 of submitted exe vs reference binary; flags eval_error=binary_wrap_detected and returns reward 0
Paper found 36% cheat rate (Sonnet 4.6) on internet-enabled runs, mostly via source-code lookup. These three changes close that attack surface.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
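The hash comparison behind `_is_binary_wrap()` can be illustrated with a local-file sketch. The function names and the use of `hashlib` are assumptions; the PR itself refers to running sha256sum in the sandbox.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binaries are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_binary_wrap(submitted_path, reference_path):
    """True when the submitted executable is byte-identical to the reference."""
    return sha256_of(submitted_path) == sha256_of(reference_path)
```

A check like this only catches an exact copy of the reference binary; a wrapper script that calls the reference would need a different guard.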
…ose config params
- Switch all _LANGUAGE_IMAGES to primeintellect/programbench-toolchain:latest (unified image with Go + Rust + C/C++ + pytest + tmux pre-installed)
- Add tmux to Dockerfile (was missing; needed for TTY emulation in tests)
- setup: check if pytest/tmux pre-installed and skip apt-get entirely, so network_access=False sandboxes work without hitting package mirrors
- Expose cpu_cores, memory_gb, network_access, compile_timeout, test_timeout, sandbox_timeout_minutes as config params in ProgramBenchTasksetConfig, __init__, and load_environment
- Log task_id, image, sandbox_id, and setup timing at INFO level
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…v var Team-scoped Prime registry paths must not appear in public code. Set PRIME_TOOLCHAIN_IMAGE in .env to use a private registry image; falls back to primeintellect/programbench-toolchain:latest on DockerHub. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
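The env-var fallback might be resolved as in the sketch below. The variable name and default image come from the commit message; the helper name `toolchain_image` and the choice to treat an empty value as unset are assumptions.

```python
import os

DEFAULT_TOOLCHAIN_IMAGE = "primeintellect/programbench-toolchain:latest"


def toolchain_image(env=None):
    """Pick the private registry image if configured, else the public default."""
    env = os.environ if env is None else env
    # An empty PRIME_TOOLCHAIN_IMAGE is treated as unset (assumption).
    return env.get("PRIME_TOOLCHAIN_IMAGE") or DEFAULT_TOOLCHAIN_IMAGE
```

Keeping the team-scoped path in `.env` means the public code only ever names the DockerHub default.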
sha256sum timeout 10s → 30s to handle large binaries like gomplate (15-20MB). Go test timeout 120s → 300s for projects with slow integration test suites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
C++: 120s → 300s (7-Zip and other heavy C++ suites timed out at 120s). C: 60s → 120s (conservative bump for consistency). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paper's eval/run.sh has a 3600s timeout for the full evaluation step (container.py). Replace per-language invented timeouts with a flat 3600s for all languages to match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…try logic
- Dockerfile: Go 1.21.0 (paper spec); pre-install mini-swe-agent 2.2.8 to /opt/mini-swe-agent/prefix/ so the harness runs with network_access=False
- mini_swe_agent.py: skip-if-installed guard before rm -rf prefix to avoid re-downloading when the binary is already present in the image
- programbench.py: _SANDBOX_TIMEOUT_MIN all languages -> 360 min (6hr, paper §4); add test_retries config param (default 3) with best-of-N pytest retry loop; first retry uses --max-worker-restart=4 per paper §4
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
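One way to picture the best-of-N retry loop is the sketch below. It makes two assumptions the commit message does not settle: `test_retries` is treated as the total number of attempts, and the loop stops early once every test passes. `run_pytest` is a hypothetical callable that takes extra pytest arguments and returns `(n_passed, n_total)`.

```python
from typing import Callable, List, Tuple

Result = Tuple[int, int]  # (n_passed, n_total)


def best_of_n(run_pytest: Callable[[List[str]], Result], test_retries: int = 3) -> Result:
    """Run the test suite up to test_retries times and keep the best pass count."""
    best: Result = (0, 0)
    for attempt in range(max(1, test_retries)):
        # Only the first retry (second attempt) adds the worker-restart flag.
        extra_args = ["--max-worker-restart=4"] if attempt == 1 else []
        passed, total = run_pytest(extra_args)
        if best[1] == 0 or passed > best[0]:
            best = (passed, total)
        if total > 0 and passed == total:
            break  # nothing left to improve
    return best
```

Taking the best attempt rather than the last keeps a flaky rerun from lowering a score that an earlier run already earned.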
Use pip3 install system-wide instead of --target to avoid pip resolver overhead and rely on prebuilt binary wheels from the package index. The prefix dir stub is still created so the harness skip-if-installed guard fires correctly at container startup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All 13 previously-failed tasks now build successfully. Fixes applied:
- gdal, proj: cmake with optional drivers/tests disabled
- lnav: cmake fallback with bundled deps
- doxygen: cmake Release build
- caps-log: cmake with GCC 14 and C++23 standard
- tree-sitter: cargo build with proper feature detection
- lightningcss: cargo build --features cli
- dog: OPENSSL_DIR/OPENSSL_LIB_DIR env vars for openssl-sys 0.9.61 + OpenSSL 3.x
- eva: cargo build --release --features build-binary
- samtools: inline htslib clone with --recurse-submodules
- duc: ./configure --disable-cairo --disable-x11
- php: buildconf --force + --disable-all --enable-cli
- pingu: GONOSUMCHECK + GOFLAGS=-mod=mod with go mod tidy
Dataset PrimeIntellect/programbench-processed updated: 193/193 tasks have binaries.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dockerfile: --default-toolchain 1.92.0 to match paper spec (constants.py). Previously used 'stable' which resolved to 1.95.0. Requires image rebuild (primeintellect/programbench-toolchain) once Docker Hub credentials available. all_tasks.jsonl: add cslarsen__jp2a and google__brotli (2 new tasks added to PrimeIntellect/programbench-processed after initial build run). Dataset now 195 tasks total, all 195 with binaries. HF dataset updated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tasks
- Add jgm__pandoc.4e37075 (Haskell, 229MB stripped) and stathissideris__ditaa.f2286c4 (Java wrapper, 12MB) to all_tasks.jsonl
- Both use test_hf_repo=PrimeIntellect/programbench-processed; binaries and test archives uploaded to HF
- Add GHC 9.6.7 + cabal 3.14.2.0 via GHCup to Dockerfile for Haskell compilation support
- Extend language support in programbench.py for haskell and java (images, timeouts, disk sizes)
- Add test_hf_repo per-task field to allow new tasks to use PrimeIntellect/programbench-processed/tests/ instead of the read-only programbench/ProgramBench-Tests repo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bootstrap script was putting ghcup binary in XDG paths (GHCUP_USE_XDG_DIRS=0 is still truthy in shell) rather than ~/.ghcup/bin/. Switch to downloading the ghcup binary directly and placing it at the expected path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n not ~/.cabal/bin Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HuggingFace datasets builder requires uniform schema. Old tasks were missing test_hf_repo; backfill with default "programbench/ProgramBench-Tests". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
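The backfill amounts to giving every row the same keys. A minimal sketch over plain dicts, assuming rows are loaded from a JSONL file; the function name and the `task_id` key in the usage note are hypothetical, while the default repo string is the one quoted above.

```python
DEFAULT_TEST_HF_REPO = "programbench/ProgramBench-Tests"


def backfill_test_hf_repo(rows):
    """Return copies of rows where every row carries a test_hf_repo value."""
    return [
        # Keep an existing value; fill the default when the key is absent or empty.
        {**row, "test_hf_repo": row.get("test_hf_repo") or DEFAULT_TEST_HF_REPO}
        for row in rows
    ]
```

For example, a row `{"task_id": "a"}` gains the default repo, while a row that already names `PrimeIntellect/programbench-processed` is left unchanged.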
…e/test
- Add filter_task_ids param to target specific tasks by ID (enables 5-task smoke test)
- Switch _compile() and _run_tests() from sandbox.execute() to sandbox.run_background_job() to bypass the execute_command 900s hard kill limit at the API level
- Remove 900s cap on command_timeout (compile/test use per-call timeouts; agent budget is the full sandbox timeout minus 60s headroom)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
execute_command has a hard 900s API limit — removing the cap broke the mini-swe-agent launch. compile/test now use run_background_job (no cap), so the only thing constrained at 900s is the agent command itself. Remove cap when PR #1364 lands. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rged Mini-swe-agent command now runs via run_background_job (no execute_command hard limit), so the full sandbox lifetime is available as the agent budget. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6735fe5 to
105503c
…ages PRIME_TEAM_ID env var (or sandbox_config["team_id"]) is now forwarded so team-scoped Docker images (e.g. team-*/programbench-toolchain:latest) can be pulled without IMAGE_PULL_FAILED. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- PrimeIntellect/programbench-processed
- test_hf_repo field allows new tasks to use private HF repos for test archives
- network_access=False evals
Task breakdown
New in this PR (5 tasks added to reach 200/200)
- jgm__pandoc.4e37075: Haskell, pandoc 3.9.0.2
- stathissideris__ditaa.f2286c4: Java/C wrapper embedding ditaa JAR
- blake3-team__blake3.91f7308: Rust
- facebookresearch__fasttext.1142dc4: C++
- halitechallenge__halite.822cfb6: C++
Test plan
- network_access=False (no mini-swe-agent timeout)
🤖 Generated with Claude Code