APR-262 Add Harbor verifiers support for v1 by xeophon · Pull Request #1401 · PrimeIntellect-ai/verifiers

xeophon · 2026-05-17T13:41:54Z

Summary

add Harbor verifier environment mode resolution for v1 tasksets
run separate Harbor verifiers in a fresh sandbox while reusing shared test execution and archive upload paths
transfer Harbor verifier inputs via one archive round trip and cover shared/separate behavior in tests

Verification

uv run --frozen ruff check --fix .
uv run --frozen ruff format .
uv run --frozen ty check verifiers/v1/packages/tasksets/harbor.py
uv run --frozen pytest tests/test_v1_harbor_cli.py tests/test_opencode_harbor.py -q
git diff --check

Note

Medium Risk
Adds a new execution mode that provisions and tears down additional sandboxes and moves data between them, which can impact grading reliability and resource usage if misconfigured. Changes touch reward execution and sandbox IO paths (tar/command construction), so failures could surface at runtime in Harbor task evaluation.

Overview
Adds verifier environment mode resolution to Harbor v1 task rows, emitting new harbor metadata (verifier_mode, verifier_sandbox, verifier_upload_tests, verifier_env, artifacts) and validating TOML mappings (including memory_mb/storage_mb and shared-vs-separate incompatibilities).

Updates harbor_reward to support separate verifier execution: when verifier_mode="separate", it creates a fresh verifier sandbox via create_sandbox_lease, optionally uploads tests there, bundles agent-side artifact inputs into a tarball, uploads/extracts them in the verifier sandbox, runs /tests/test.sh, then deletes the sandbox; shared mode continues to run tests in the agent sandbox but now goes through a shared run_harbor_tests path and HARBOR_REWARD_COMMAND.

Expands tests to cover mode resolution, reward file preference, and separate-sandbox behavior (including tar upload/download flows and ensuring the agent sandbox is not used for verifier test execution).

^{Reviewed by Cursor Bugbot for commit 8ae240a. Bugbot is set up for automated code reviews on this repo. Configure here.}

Note

Add separate verifier sandbox support to Harbor v1 tasksets

Introduces verifier_mode (shared or separate) to harbor.py, derived from [verifier.environment_mode] and presence of a [verifier.environment] block in task.toml.
In separate mode, run_separate_harbor_verifier creates a fresh sandbox per reward call, transfers agent artifacts via tar archive using transfer_harbor_verifier_inputs, runs tests via bash /tests/test.sh, then deletes the sandbox in a finally block.
Adds HARBOR_REWARD_COMMAND constant that prefers reward.txt over reward.json for reward retrieval, used by both modes.
upload_harbor_tests now raises RuntimeError on non-zero extraction exit status and ensures /oracle, /tests, and /logs/verifier exist before extraction.
Risk: separate mode creates and destroys a sandbox on every reward call; misconfigured artifacts or missing tar content returns exit code 42 and skips transfer silently.

^{Macroscope summarized 8ae240a.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a91f41ae52

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T13:43:40Z

+        if verifier_mode == VERIFIER_MODE_SEPARATE and verifier_environment is None:
+            verifier_sandbox = {**sandbox, "command_timeout": int(test_timeout)}


Upload task tests when reusing the agent image

When a task sets [verifier] environment_mode = "separate" but does not provide [verifier.environment], this branch advertises support by creating a fresh verifier sandbox from the agent sandbox config. That fresh sandbox never receives the task's tests/ directory: run_separate_harbor_verifier only creates an empty /tests directory and transfers /logs/artifacts plus configured artifacts before running bash /tests/test.sh. In this supported mode the script is absent unless the base agent image happens to contain it, so otherwise-valid Harbor tasks always get reward 0.

Useful? React with 👍 / 👎.

Harbor PR #1655 does support environment_mode = "separate" without [verifier.environment]; upstream resolves it to a fresh copy of the top-level environment. The difference is that Harbor can build the verifier image from tests/, while this v1 sandbox path cannot. I fixed the fallback by marking that case with verifier_upload_tests and uploading the task tests into the fresh verifier sandbox before running /tests/test.sh, while explicit [verifier.environment] continues to expect a pre-baked verifier image.

macroscopeapp · 2026-05-17T13:49:22Z

Approvability

Verdict: Needs human review

Unable to check for correctness in 8ae240a. This PR introduces new Harbor verifier sandbox support with significant new runtime behavior (sandbox lifecycle management, file transfers). Two unresolved review comments identify potential bugs: missing test uploads in one mode, and incorrect reward file precedence. These substantive concerns warrant human review.

^{You can customize Macroscope's approvability policy. Learn more.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f02768b6d6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-17T13:50:26Z

+HARBOR_REWARD_COMMAND = (
+    "if [ -s /logs/verifier/reward.json ]; then "
+    "cat /logs/verifier/reward.json; "
+    "elif [ -s /logs/verifier/reward.txt ]; then "
+    "cat /logs/verifier/reward.txt; fi"


Restore reward.txt precedence when reading Harbor rewards

When a Harbor test leaves both /logs/verifier/reward.txt and /logs/verifier/reward.json, this new helper reads the JSON first, whereas the existing Harbor implementations and the pre-change code prefer reward.txt and only fall back to JSON. In tasks that emit a simple final score to reward.txt while also leaving a stale or diagnostic reward.json, this will score the rollout from the wrong file or return 0 for invalid JSON despite a valid text reward.

Useful? React with 👍 / 👎.

Good catch. Restored the existing v1 Harbor behavior so /logs/verifier/reward.txt is read before reward.json, and added a regression assertion for that precedence.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit f02768b. Configure here.}

APR-262 Add Harbor verifier modes for v1

a91f41a

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

APR-262 Upload tests for fallback verifier sandbox

f02768b

chatgpt-codex-connector Bot reviewed May 17, 2026

View reviewed changes

APR-262 Restore Harbor reward text precedence

8ae240a

cursor Bot reviewed May 17, 2026

View reviewed changes

Comment thread verifiers/v1/packages/tasksets/harbor.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

APR-262 Add Harbor verifiers support for v1#1401

APR-262 Add Harbor verifiers support for v1#1401
xeophon wants to merge 3 commits into
mainfrom
apr-262-harbor-v1-verifier-modes

xeophon commented May 17, 2026 •

edited by macroscopeapp Bot

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Uh oh!

xeophon May 17, 2026

Uh oh!

macroscopeapp Bot commented May 17, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Uh oh!

xeophon May 17, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if verifier_mode == VERIFIER_MODE_SEPARATE and verifier_environment is None:
		verifier_sandbox = {**sandbox, "command_timeout": int(test_timeout)}

Conversation

xeophon commented May 17, 2026 • edited by macroscopeapp Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Verification

Add separate verifier sandbox support to Harbor v1 tasksets

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

xeophon May 17, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Approvability

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 17, 2026

Choose a reason for hiding this comment

Uh oh!

xeophon May 17, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xeophon commented May 17, 2026 •

edited by macroscopeapp Bot

Loading

macroscopeapp Bot commented May 17, 2026 •

edited

Loading