Skip to content

feat: ProgramBench reverse-engineering environment#1351

Draft
sethkarten wants to merge 49 commits into
mainfrom
feat/programbench-env
Draft

feat: ProgramBench reverse-engineering environment#1351
sethkarten wants to merge 49 commits into
mainfrom
feat/programbench-env

Conversation

@sethkarten
Copy link
Copy Markdown

@sethkarten sethkarten commented May 12, 2026

Summary

  • Full ProgramBench reverse-engineering environment: agent gets execute-only binary, must rewrite source code to pass tests
  • 200/200 tasks complete with binaries + pytest test archives in PrimeIntellect/programbench-processed
  • Supports C, C++, Go, Rust, Haskell, Java tasks; Docker toolchain image includes all compilers
  • Per-task test_hf_repo field allows new tasks to use private HF repos for test archives
  • mini-swe-agent 2.2.8 pre-baked into toolchain image for network_access=False evals

Task breakdown

Language Count
C ~40
C++ ~35
Go ~35
Rust ~85
Haskell 1 (pandoc)
Java 1 (ditaa)

New in this PR (5 tasks added to reach 200/200)

  • jgm__pandoc.4e37075 — Haskell, pandoc 3.9.0.2
  • stathissideris__ditaa.f2286c4 — Java/C wrapper embedding ditaa JAR
  • blake3-team__blake3.91f7308 — Rust
  • facebookresearch__fasttext.1142dc4 — C++
  • halitechallenge__halite.822cfb6 — C++

Test plan

  • Docker image rebuild with GHC 9.6.7 + cabal completes successfully
  • Smoke test C task with network_access=False (no mini-swe-agent timeout)
  • Smoke test Rust task
  • Verify 200-task count in dataset

🤖 Generated with Claude Code

@sethkarten
Copy link
Copy Markdown
Author

Eval run results (132/193 tasks, gpt-4.1-mini)

Overall avg reward: 0.103

Language n Non-zero Avg reward Max
C 23 7 (30%) 0.131 0.937
C++ 10 0 (0%) 0.000 0.000
Go 34 11 (32%) 0.089 0.778
Rust 65 12 (18%) 0.117 1.000

Reward distribution: 102 zeros, 17 in (0, 0.5), 7 in [0.5, 1), 6 perfect 1.0

61/193 tasks incomplete due to prime_sandboxes container lifetime limit (~12-17 min per container on box cmp0f64qu0044alfqcjbax5ou). Tasks involving slow Rust/C++ compilation consistently exceed the container window. Fix requires longer-lived sandbox containers or running on a different host.

C++ 0% rate: needs investigation

All 10 C++ tasks scored 0. This may be a pipeline issue (g++ invocation, missing headers in gcc:13-bookworm, or test harness mismatch) or simply model failure. Worth a targeted smoke test.

Fixes committed this session

  • environments/programbench/programbench.py: Rust PATH fix (/usr/local/cargo/bin + CARGO_HOME=/usr/local/cargo)
  • verifiers/serve/server/env_router.py + env_server.py: worker heartbeat timeout 30s → 1800s
  • verifiers/envs/environment.py: catch per-group exceptions, skip+retry instead of crashing eval
  • environments/programbench/docker/: Dockerfile heredoc fix, warmup crate added

Results file: environments/programbench/outputs/evals/programbench--openai--gpt-4.1-mini/cce96e04/results.jsonl

sethkarten pushed a commit that referenced this pull request May 12, 2026
Fixes from agent review of PR #1351:
- vf.ensure_keys(["HF_TOKEN","OPENAI_API_KEY"]) — add missing key, use vf. prefix
- _TEST_TIMEOUT per-language map; _run_tests now uses it (was fixed 300s)
- Drop unused _compile_hint from info dict
- Fill README.md placeholders

Pre-commit hooks:
- lint-agent: calls `claude --print` with .agents/skills/lint-agent/SKILL.md
- oo-agent: calls `claude --print` with .agents/skills/oo-agent/SKILL.md
  (only on environments/, set SKIP_OO_CHECK=1 to bypass)
Skill logic lives in markdown, not committed Python.

Also splits _evaluate into _compile + _run_tests (lint-agent threshold fix).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sethkarten sethkarten force-pushed the feat/programbench-env branch from 84d9efe to e952874 Compare May 12, 2026 22:51
@sethkarten
Copy link
Copy Markdown
Author

Smoke Test Results (2026-05-13)

Model: qwen3-30b-i (qwen/qwen3-30b-a3b-instruct-2507) ✅ working on pinference.ai

Step 4: C pipeline smoke test ✅

  • Task: cmatrix (C, filter_language=c, max_tasks=1, agent_step_limit=5, network_access=True)
  • Agent ran 5 steps, wrote ncurses C implementation, hit SandboxError on compile (expected — task-level, not pipeline)
  • Reward: 0.000 (expected for first reconstruction attempt)
  • Duration: ~12 min (includes ~1 min mini-swe-agent install with network_access=True)

Step 5: Rust pipeline smoke test ✅

  • Task: agourlay__zip-password-finder (Rust, filter_language=rust)
  • Agent ran, hit SandboxError (fast fail, 106s total)
  • Reward: 0.000
  • Pipeline confirmed working end-to-end for Rust toolchain image

Pipeline status

All components validated:

  • Sandbox creation ✅
  • Binary upload to /workspace/binary (chmod 111) ✅
  • mini-swe-agent installation (with network_access=True) ✅
  • LLM API routing via prime_tunnel ✅
  • Agent step execution ✅
  • Reward scoring ✅

Remaining blocker: Docker image rebuild

The deployed image lacks the mini-swe-agent skip guard sentinel file. Without rebuild, network_access=False evals time out on install. Workaround: network_access=True (used for smoke tests above).

To unblock: run locally: cd environments/programbench/docker && docker login && ./build.sh

Step 6: Full 200-task eval

Ready to run pending your sign-off. Recommend:

  • Model: qwen3-30b-i (confirmed working)
  • network_access=True (until Docker image rebuilt)
  • Full 195 tasks across C/C++/Go/Rust

Seth and others added 26 commits May 13, 2026 10:37
Verifiers.v1 environment where an RLM agent reconstructs source code
from compiled binaries, scored by fraction of pytest tests passed.

Key implementation decisions:
- golang:1.22-bookworm / gcc:13-bookworm for standard langs; custom
  Rust image with pre-warmed cargo registry for Rust tasks
- apt-get update + pip --break-system-packages required for pytest on
  Debian 12 (PEP 668 restriction)
- Test archives hidden from agent at setup; uploaded at scoring time
  following the rlm_swe_v1 pattern
- Scoring re-runs compile.sh with PATH prepended for toolchain binaries
  so agents that omit PATH export still get scored correctly
- go_subset.jsonl contains 5 smoke-test tasks; full dataset on
  PrimeIntellect/programbench-processed (HF private)

Smoke test: 17/36 tests passed (reward=0.472) with gpt-4.1-mini on
antonmedv__fx.86d0d34, confirming end-to-end pipeline works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ures

Some go.mod files specify newer patch versions (e.g. ariga/atlas requires
go1.24.11) which causes Go to try fetching the toolchain at build time,
failing in the subprocess environment. GOTOOLCHAIN=local forces use of
the installed Go version instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When `go build ./...` fails with no Go files at root (e.g. gdu, pixterm,
hostctl where main is under cmd/), grep for `package main` declarations
and build from the containing directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test_branches stores filenames as e.g. '1b991a57d4e9.tar' but actual
HF files are '1b991a57d4e9.tar.gz'. Appending .tar.gz directly produced
a double-extension path that 404'd. Normalise the stem first.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Matches the version installed on the build node and satisfies go.mod
constraints that require >= 1.26 (e.g. cheat/cheat).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
primeintellect/programbench-toolchain needs Docker Hub creds to push.
Using rust:latest as fallback — the setup script installs pytest/tmux
inside the container so the pipeline still works, just without pre-warmed
cargo registry (slower for full evals).

TODO: push custom image and revert to primeintellect/programbench-toolchain:latest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /usr/local/cargo/bin to PATH (rust:latest puts cargo there, not /root/.cargo/bin)
- Update CARGO_HOME to /usr/local/cargo to match rust:latest layout
- Fix Dockerfile warmup block: heredoc-in-RUN fails Docker parser; use COPY instead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
30s is too short for long-running agentic sandbox tasks; the worker event
loop blocks on concurrent prime_sandboxes API calls and misses heartbeats,
causing spurious restarts and lost in-flight results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the
ZMQ client, crashing the whole eval run. Record 0-reward outputs for the
affected group and continue so the eval can complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sandbox ReadErrors and tunnel timeouts propagate as RuntimeError from the
ZMQ client, crashing the whole eval run. Skip the failed group and log
the error; it will be retried on the next --resume invocation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…leakage

Switch from verifiers RLM to mini-SWE-agent harness, matching the ProgramBench
paper scaffold. Remove all leaking fields from the agent prompt: nm_output,
strings_output, objdump_head, compile_hint, and explicit language hint. Apply
chmod 111 (execute-only) to the reference binary so the agent can run but not
read/decompile it. Retained fields in info{} (prefixed _) for internal logging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mini-SWE-agent bash -lc doesn't inherit the Docker image's ENV PATH, so
Go/Cargo weren't accessible. Move toolchain env vars from get_env_vars()
into the per-task program.env dict, which gets merged into the command
process environment before mini-swe-agent's run script executes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… PATH

mini-swe-agent runs bash -lc which sources /etc/profile and may reset PATH,
dropping Docker's ENV-injected toolchain dirs. Write a profile.d snippet in
setup so the bash login shell always has Go/Cargo/Rust in PATH.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…i-swe-agent

DEFAULT_CLI_SANDBOX has command_timeout=900s which kills mini-swe-agent
before it can finish apt-get install + agent turns + compile (takes 15-17min).
Set command_timeout = timeout_minutes * 60 per language in sandbox_config()
so the full sandbox lifetime is available for the CLI command.

Also revert temporary logger.warning diagnostics back to logger.info.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The execute_command API enforces a 900s max timeout. timeout_min*60 for Go
(1200s) and Rust (2700s) exceeded this, causing HTTP 400 errors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…meout hard kill

The sandbox API caps command_timeout at 900s. Without AGENT_TIMEOUT_SECONDS,
mini-swe-agent runs up to its 3600s default and gets killed at 900s with an error.
Set AGENT_TIMEOUT_SECONDS = command_timeout - 60 so the agent exits cleanly
~60s before the hard wall, allowing the harness to collect artifacts normally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
120s per-command timeout kills large Rust cargo build runs (exit 124).
600s matches the sandbox command_timeout safety margin and lets
complex crates compile without hitting the timeout wall.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
10 min was too short: setup (apt-get + mini-swe-agent install) takes ~3-5 min,
leaving only 5 min for the agent. Tasks ran for 12+ min and hit
SandboxNotRunningError (pod killed at sandbox lifetime boundary).
20 min matches Go/C++ and leaves adequate headroom.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use `is not None` for dataset_name/dataset_split (consistent with
  other optional params; `or` was semantically wrong for str | None)
- Check exit codes on mkdir and profile.d write in setup (log warnings)
- Extract _hf_download() helper to deduplicate two near-identical
  hf_hub_download try-except blocks in _upload_binary/_setup_tests
- Fix state ordering in _setup_tests: set _pb_test_branch only after
  successful extraction (was set before upload/extract for hide=False)
- Add error handling for upload/extract in hide_tests_from_agent=False path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes from agent review of PR #1351:
- vf.ensure_keys(["HF_TOKEN","OPENAI_API_KEY"]) — add missing key, use vf. prefix
- _TEST_TIMEOUT per-language map; _run_tests now uses it (was fixed 300s)
- Drop unused _compile_hint from info dict
- Fill README.md placeholders

Pre-commit hooks:
- lint-agent: calls `claude --print` with .agents/skills/lint-agent/SKILL.md
- oo-agent: calls `claude --print` with .agents/skills/oo-agent/SKILL.md
  (only on environments/, set SKIP_OO_CHECK=1 to bypass)
Skill logic lives in markdown, not committed Python.

Also splits _evaluate into _compile + _run_tests (lint-agent threshold fix).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pin rust:latest → rust:1.92 for reproducibility (paper §4 constants)
- Add strace, ltrace, and no-wrap prohibition to SYSTEM_PROMPT (paper §3)
- Use max(test_branches) instead of [0] to pick most comprehensive branch
- Emit almost (≥95% tests) and resolved (100% tests) as state signals (paper §4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…est exposure

Drop nm_output, strings_output, objdump_head from info entirely — these are
reverse-engineering artifacts that have no place in the task dict. example_io
(expected I/O pairs) was already excluded from _build_instruction; add a comment
making this explicit.

Add logger.warning when hide_tests_from_agent=False so anyone running eval with
tests visible gets a loud reminder that this violates paper §3.

The agent's full context is now: readme + docs (documentation) + binary path.
No source code, no test content, no analysis artifacts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The verifiers runtime injects OPENAI_BASE_URL + OPENAI_API_KEY into the
container pointing at its own proxy. OPENAI_API_KEY in the outer environment
is not needed when running via prime eval run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Seth and others added 22 commits May 13, 2026 10:37
huggingface_hub's get_token() checks env var, then ~/.cache/huggingface/token,
then huggingface-cli login cache. Passing token=None to hf_hub_download lets
it use that same chain instead of only os.environ.get('HF_TOKEN').

On remote nodes: either export HF_TOKEN=... or run huggingface-cli login once.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HF_TOKEN is required for private HuggingFace dataset + test archives.
Usage: cp .env.example .env, fill in token, then source .env before eval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paper allows up to 1000 steps per rollout. mini-SWE-agent defaults to 0
(unlimited). Passing agent.step_limit=1000 via extra_config_specs caps
steps at the paper spec. Wall clock (AGENT_TIMEOUT_SECONDS) remains the
binding constraint at our resource level but step limit is now explicit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each task has N oracle test suites (branches) in programbench/ProgramBench-Tests.
The official eval scores against all of them. Previously we used one branch
(selected arbitrarily by max() on hex hash), which under-counted tests.

Now: download all branches concurrently at setup, extract each into its own
subdirectory in TEST_DIR to avoid filename collisions, run pytest once across
all of them. n_passed/n_total covers the full union of oracle tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vf.metric(priority=-1) reads state["resolved"] set by _evaluate.
Runs after solved() (priority=0) so the flag is already populated.
Dense reward (solved) stays as RL training signal; resolved_binary
is the 0/1 "% Resolved" figure the paper reports as primary metric.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- network_access=False in sandbox_config (paper §8.2: infra-level block,
  not just prompt-level)
- System prompt: explicit ban on internet access, git clone, cargo install,
  go get, package-manager cache reads, and binary wrapping
- _is_binary_wrap(): compare sha256 of submitted exe vs reference binary;
  flags eval_error=binary_wrap_detected and returns reward 0

Paper found 36% cheat rate (Sonnet 4.6) on internet-enabled runs, mostly
via source-code lookup. These three changes close that attack surface.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ose config params

- Switch all _LANGUAGE_IMAGES to primeintellect/programbench-toolchain:latest
  (unified image with Go+Rust+C/C+++pytest+tmux pre-installed)
- Add tmux to Dockerfile (was missing; needed for TTY emulation in tests)
- setup: check if pytest/tmux pre-installed and skip apt-get entirely,
  so network_access=False sandboxes work without hitting package mirrors
- Expose cpu_cores, memory_gb, network_access, compile_timeout, test_timeout,
  sandbox_timeout_minutes as config params in ProgramBenchTasksetConfig,
  __init__, and load_environment
- Log task_id, image, sandbox_id, and setup timing at INFO level

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…v var

Team-scoped Prime registry paths must not appear in public code.
Set PRIME_TOOLCHAIN_IMAGE in .env to use a private registry image;
falls back to primeintellect/programbench-toolchain:latest on DockerHub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sha256sum timeout 10s → 30s to handle large binaries like gomplate (15-20MB).
Go test timeout 120s → 300s for projects with slow integration test suites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
C++: 120s → 300s (7-Zip and other heavy C++ suites timed out at 120s).
C: 60s → 120s (conservative bump for consistency).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Paper's eval/run.sh has a 3600s timeout for the full evaluation step
(container.py). Replace per-language invented timeouts with a flat 3600s
for all languages to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…try logic

- Dockerfile: Go 1.21.0 (paper spec); pre-install mini-swe-agent 2.2.8 to
  /opt/mini-swe-agent/prefix/ so the harness runs with network_access=False
- mini_swe_agent.py: skip-if-installed guard before rm -rf prefix to avoid
  re-downloading when the binary is already present in the image
- programbench.py: _SANDBOX_TIMEOUT_MIN all languages -> 360 min (6hr, paper §4);
  add test_retries config param (default 3) with best-of-N pytest retry loop;
  first retry uses --max-worker-restart=4 per paper §4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use pip3 install system-wide instead of --target to avoid pip resolver
overhead and rely on prebuilt binary wheels from the package index.
The prefix dir stub is still created so the harness skip-if-installed
guard fires correctly at container startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All 13 previously-failed tasks now build successfully. Fixes applied:
- gdal, proj: cmake with optional drivers/tests disabled
- lnav: cmake fallback with bundled deps
- doxygen: cmake Release build
- caps-log: cmake with GCC 14 and C++23 standard
- tree-sitter: cargo build with proper feature detection
- lightningcss: cargo build --features cli
- dog: OPENSSL_DIR/OPENSSL_LIB_DIR env vars for openssl-sys 0.9.61 + OpenSSL 3.x
- eva: cargo build --release --features build-binary
- samtools: inline htslib clone with --recurse-submodules
- duc: ./configure --disable-cairo --disable-x11
- php: buildconf --force + --disable-all --enable-cli
- pingu: GONOSUMCHECK + GOFLAGS=-mod=mod with go mod tidy

Dataset PrimeIntellect/programbench-processed updated: 193/193 tasks have binaries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dockerfile: --default-toolchain 1.92.0 to match paper spec (constants.py).
Previously used 'stable' which resolved to 1.95.0. Requires image rebuild
(primeintellect/programbench-toolchain) once Docker Hub credentials available.

all_tasks.jsonl: add cslarsen__jp2a and google__brotli (2 new tasks added to
PrimeIntellect/programbench-processed after initial build run). Dataset now
195 tasks total, all 195 with binaries. HF dataset updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tasks

- Add jgm__pandoc.4e37075 (Haskell, 229MB stripped) and stathissideris__ditaa.f2286c4 (Java wrapper, 12MB) to all_tasks.jsonl
- Both use test_hf_repo=PrimeIntellect/programbench-processed; binaries and test archives uploaded to HF
- Add GHC 9.6.7 + cabal 3.14.2.0 via GHCup to Dockerfile for Haskell compilation support
- Extend language support in programbench.py for haskell and java (images, timeouts, disk sizes)
- Add test_hf_repo per-task field to allow new tasks to use PrimeIntellect/programbench-processed/tests/ instead of the read-only programbench/ProgramBench-Tests repo

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bootstrap script was putting ghcup binary in XDG paths (GHCUP_USE_XDG_DIRS=0
is still truthy in shell) rather than ~/.ghcup/bin/. Switch to downloading the
ghcup binary directly and placing it at the expected path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n not ~/.cabal/bin

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HuggingFace datasets builder requires uniform schema. Old tasks were missing
test_hf_repo; backfill with default "programbench/ProgramBench-Tests".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e/test

- Add filter_task_ids param to target specific tasks by ID (enables 5-task smoke test)
- Switch _compile() and _run_tests() from sandbox.execute() to sandbox.run_background_job()
  to bypass the execute_command 900s hard kill limit at the API level
- Remove 900s cap on command_timeout (compile/test use per-call timeouts; agent budget
  is the full sandbox timeout minus 60s headroom)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
execute_command has a hard 900s API limit — removing the cap broke the
mini-swe-agent launch. compile/test now use run_background_job (no cap),
so the only thing constrained at 900s is the agent command itself.
Remove cap when PR #1364 lands.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rged

Mini-swe-agent command now runs via run_background_job (no execute_command
hard limit), so the full sandbox lifetime is available as the agent budget.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sethkarten sethkarten force-pushed the feat/programbench-env branch from 6735fe5 to 105503c Compare May 13, 2026 17:38
…ages

PRIME_TEAM_ID env var (or sandbox_config["team_id"]) is now forwarded so
team-scoped Docker images (e.g. team-*/programbench-toolchain:latest) can
be pulled without IMAGE_PULL_FAILED.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant