Add emulator training eval environments by sethkarten · Pull Request #1390 · PrimeIntellect-ai/verifiers

sethkarten · 2026-05-15T08:07:05Z

Summary

add ten Prime emulator benchmark environments for CHIP-8 through PS1
add shared emulator harness, runner contract, suite adapters, public test manifests, and private artifact mount support
add openai/gpt-5.4-mini eval configs for all-env Prime smoke, all-env suite, and CHIP-8 suite canary
add Hosted Training and self-managed prime-rl CHIP-8 smoke configs
add the GPT-5.4 Mini eval/training report at environments/emulator_common/GPT_5_4_MINI_EVAL_REPORT.md
keep implementation-task verifier scoring reachable after MiniSWEAgent command-loop failures so partial/broken attempts produce reward components
add inline_system_prompt and v1 sampling-args output compatibility needed for self-managed prime-rl rollouts

Verification

uv run pytest tests/test_emulator_benchmark_envs.py tests/test_v1_runtime_lifecycle.py::test_v1_env_apply_controls_mirrors_sampling_args_for_rollout_outputs -q -> 56 passed
uv run pytest tests/test_v1_runtime_lifecycle.py -q -> 56 passed
uv run ruff check environments/emulator_common/env_factory.py tests/test_emulator_benchmark_envs.py tests/test_v1_runtime_lifecycle.py verifiers/v1/env.py -> passed
uv run pre-commit run --files ... on touched files -> passed
git diff --check -> passed
CPU node root@86.38.238.176 (pb-coordinator) -> cloned/pulled branch at 125a9ce6, uv run pytest tests/test_emulator_benchmark_envs.py tests/test_v1_runtime_lifecycle.py::test_v1_env_apply_controls_mirrors_sampling_args_for_rollout_outputs -q -> 56 passed
all-env openai/gpt-5.4-mini Prime smoke -> 10/10 envs reward 1.00; plumbing only, not ROM correctness
all-env openai/gpt-5.4-mini suite eval -> 10 envs run, mean reward 0.58, mean error 0.10, full public pass envs 1/10, nine envs with public ROM pass 0.00; GPT-5.4 Mini did not pass all emulator tests
training node root@31.22.104.54 (chess-dagger-2xpro6000b) -> repo cloned, envs installed editable, prime-rl set up, renderers==0.1.8.dev0 installed in prime-rl/.venv
real self-managed prime-rl smoke on 2 x RTX PRO 6000 node -> Qwen/Qwen3-0.6B, Step 0 reward 0.5400, trainer Step 0 loss 0.0000, entropy 0.5280, KL 0.0019, checkpoint and weights written under /root/verifiers-emulator/outputs/emulator-chip8-smoke

Notes

no ROM/BIOS bytes are committed
private/commercial/generated artifacts are mounted through EMULATOR_PRIVATE_ARTIFACT_DIR or /private/emulator
Genesis reached public pass 1.00 against the currently configured source manifest; keep private artifacts mounted before treating that as comprehensive Genesis coverage
hard systems should stay eval-only until easier emulator tasks show nonzero public ROM pass rates under training

macroscopeapp · 2026-05-15T08:32:32Z

+    cycles = int(runtime.get("smoke_cycles", 10000))
+    frames = max(1, int(runtime.get("smoke_frames", 1)))


🟢 Low emulator_common/suite_adapters.py:383

When the manifest's runtime dict contains an explicit None for smoke_cycles or smoke_frames (e.g., from JSON {"smoke_cycles": null}), int(runtime.get("smoke_cycles", 10000)) raises TypeError because .get() returns None instead of using the default. The default only applies when the key is missing, not when it exists with None value. Consider using int(runtime.get("smoke_cycles") or 10000) to treat None as missing.

Suggested change

-    cycles = int(runtime.get("smoke_cycles", 10000))
-    frames = max(1, int(runtime.get("smoke_frames", 1)))
+    cycles = int(runtime.get("smoke_cycles") or 10000)
+    frames = max(1, int(runtime.get("smoke_frames") or 1))

Also found in 1 other location(s)

environments/emulator_common/env_factory.py:33

In _score_component, if a component value is explicitly None (JSON null) rather than missing, components.get(name, 0.0) returns None (not the default 0.0), and float(None) raises TypeError. The code guards against missing keys but not explicit null values in the verifier output.

Evidence trail: environments/emulator_common/suite_adapters.py lines 383-384 (the vulnerable `int(runtime.get(...))` calls), line 499 (`runtime = dict(manifest.get("runtime", {}))` showing runtime comes from JSON manifest data). Python dict.get() semantics: returns stored value (including None) when key exists, only uses default when key is absent. int(None) raises TypeError.
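A minimal sketch of the None-versus-missing distinction described above. The helper name `int_or_default` is hypothetical and not part of the PR. Unlike the `or` form in the suggested change, it treats only `None` as missing, so an explicit `0` is preserved rather than replaced by the default.

```python
from typing import Any, Mapping


def int_or_default(mapping: Mapping[str, Any], key: str, default: int) -> int:
    """Return mapping[key] as an int, falling back when the key is absent or None."""
    value = mapping.get(key)
    if value is None:
        # Covers both a missing key and an explicit JSON null.
        return default
    return int(value)


# Example manifest fragment with an explicit null, as in {"smoke_cycles": null}.
runtime = {"smoke_cycles": None}
cycles = int_or_default(runtime, "smoke_cycles", 10000)
frames = max(1, int_or_default(runtime, "smoke_frames", 1))
```

Whether a zero cycle count should be honored or overridden is a design choice for the harness; the `or` form in the suggested change overrides it.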

macroscopeapp · 2026-05-15T08:32:32Z

+        cases = tuple(
+            VerificationCase.from_mapping(case)
+            for case in value.get("verification_cases", [])
+        )


🟢 Low emulator_common/task_schema.py:95

EmulatorManifest.from_mapping raises TypeError: 'NoneType' object is not iterable when verification_cases or any list-typed field (e.g., requirements, core_concepts) is explicitly set to None. The code uses value.get("verification_cases", []) but passes the result directly to tuple(), which fails on None rather than producing a clear validation error or defaulting to empty.

-        cases = tuple(
-            VerificationCase.from_mapping(case)
-            for case in value.get("verification_cases", [])
-        )
+        cases_raw = value.get("verification_cases", [])
+        if cases_raw is None:
+            raise ValueError("manifest field 'verification_cases' cannot be None")
+        if not isinstance(cases_raw, list):
+            raise ValueError("manifest field 'verification_cases' must be a list")
+        cases = tuple(
+            VerificationCase.from_mapping(case)
+            for case in cases_raw
+        )

Evidence trail: environments/emulator_common/task_schema.py lines 10-25 (REQUIRED_MANIFEST_FIELDS includes 'verification_cases', 'requirements', etc.), lines 92-97 (from_mapping required-field check only verifies key presence, not value type), lines 98-100 (value.get('verification_cases', []) returns None when key exists with None value, then tuple() iterates over None), lines 109-113 (tuple(map(str, value['requirements'])) etc. also fail on None).
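Because the same failure applies to every list-typed field, one option is a shared validator instead of repeating the check for each field. The sketch below assumes a hypothetical helper named `list_field`; it is not taken from the PR, and the error messages are illustrative.

```python
from typing import Any, List, Mapping


def list_field(value: Mapping[str, Any], name: str) -> List[Any]:
    """Return a list-typed manifest field, raising a clear error on bad input."""
    raw = value.get(name, [])
    if raw is None:
        raise ValueError(f"manifest field {name!r} cannot be None")
    if not isinstance(raw, list):
        raise ValueError(
            f"manifest field {name!r} must be a list, got {type(raw).__name__}"
        )
    return raw


# Usage with a hypothetical manifest mapping.
manifest = {"requirements": ["cpu", "timers"], "core_concepts": None}
requirements = tuple(map(str, list_field(manifest, "requirements")))
```

With this shape, `list_field(manifest, "core_concepts")` raises `ValueError` naming the offending field rather than a bare `TypeError` from `tuple()`.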

macroscopeapp · 2026-05-15T09:34:50Z

+`prime-rl` validation after the hosted config is accepted:
+
+```bash
+ssh root@31.22.104.54 -p 22 -i private_key.pem


🟢 Low emulator_common/GPT_5_4_MINI_EVAL_REPORT.md:131

Line 131 hardcodes an SSH connection string with a real IP address (31.22.104.54), root username, and private key filename (private_key.pem) into a committed Markdown file. This leaks internal network topology and access patterns; even if the repo is private, the details persist in git history and facilitate targeted access attempts if the file is ever shared. Consider replacing with a placeholder like <HOST>, <USER>, and <KEY_PATH> and documenting that users must supply their own values.

-ssh root@31.22.104.54 -p 22 -i private_key.pem
+ssh <USER>@<HOST> -p <PORT> -i <KEY_PATH>


macroscopeapp · 2026-05-15T09:34:50Z

+
+[env.args]
+max_tasks = 1
+sandbox_timeout_minutes = 60


🟠 High rl/emulator-chip8-minimal.toml:20

In the training [env.args] block, sandbox_timeout_minutes = 60 (60 min) is half of environment_timeout = 7200 (7200 s, i.e. 120 min). The sandbox is killed at 60 minutes while the agent expects 120 minutes, so in-progress training work is terminated prematurely. The [eval.env.args] block correctly aligns both to 30 minutes. Consider setting sandbox_timeout_minutes = 120 to match environment_timeout, or reducing environment_timeout to 3600 to match the 60-minute sandbox limit.

-sandbox_timeout_minutes = 60
+sandbox_timeout_minutes = 120

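The unit mismatch above is easy to miss because one setting is in minutes and the other appears to be in seconds. A minimal sketch of a consistency check follows, assuming `environment_timeout` is expressed in seconds (as the comment's "7200 (120 min)" implies). The function name is hypothetical and not part of the PR.

```python
def sandbox_outlives_environment(
    sandbox_timeout_minutes: int, environment_timeout_seconds: int
) -> bool:
    """True when the sandbox lives at least as long as the environment timeout."""
    # Convert minutes to seconds so both values share one unit.
    return sandbox_timeout_minutes * 60 >= environment_timeout_seconds


# Values from the training block flagged above: 60 min sandbox vs 7200 s environment.
training_ok = sandbox_outlives_environment(60, 7200)
# Either proposed fix makes the two limits consistent.
fix_raise_sandbox = sandbox_outlives_environment(120, 7200)
fix_lower_env = sandbox_outlives_environment(60, 3600)
```

A check like this could run in a config test so that mismatched limits fail before a training job is launched.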

macroscopeapp · 2026-05-15T16:06:39Z

+        try:
+            parsed = json.loads(output.stdout or "{}")
+        except json.JSONDecodeError as exc:
+            state["emulator_score"] = {"score": 0.0, "error": str(exc)}
+            return 0.0
+        agent_exit_status = state.get("artifacts", {}).get("emulator_agent_exit_status")
+        if agent_exit_status is not None and isinstance(parsed, dict):
+            parsed["agent_exit_status"] = str(agent_exit_status).strip()
+        if agent_error is not None and isinstance(parsed, dict):
+            parsed["agent_error"] = str(agent_error)
+        state["emulator_score"] = parsed
+        return float(parsed.get("score", 0.0))


🟡 Medium emulator_common/env_factory.py:189

When /tmp/emulator_score.json contains valid JSON that is not a dictionary (e.g., null, [], 123), line 200 raises AttributeError because parsed.get() is called unconditionally. Lines 195-198 check isinstance(parsed, dict) before modifying parsed, but line 200 does not perform this check before accessing .get().

         try:
             parsed = json.loads(output.stdout or "{}")
         except json.JSONDecodeError as exc:
             state["emulator_score"] = {"score": 0.0, "error": str(exc)}
             return 0.0
+        if not isinstance(parsed, dict):
+            state["emulator_score"] = {"score": 0.0, "error": "verifier output is not a dict"}
+            return 0.0
         agent_exit_status = state.get("artifacts", {}).get("emulator_agent_exit_status")
         if agent_exit_status is not None and isinstance(parsed, dict):
             parsed["agent_exit_status"] = str(agent_exit_status).strip()
         if agent_error is not None and isinstance(parsed, dict):
             parsed["agent_error"] = str(agent_error)
         state["emulator_score"] = parsed
         return float(parsed.get("score", 0.0))

Evidence trail: environments/emulator_common/env_factory.py lines 189-200 at REVIEWED_COMMIT: line ~191 `parsed = json.loads(output.stdout or '{}')`, lines 196+198 guard with `isinstance(parsed, dict)`, line 200 calls `parsed.get('score', 0.0)` unconditionally.
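The guard can be exercised in isolation with a standalone parser. The sketch below is an illustration only: `parse_score` is a hypothetical function, not the PR's code, and it additionally treats a null or non-numeric `score` as 0.0, which goes beyond the suggested change.

```python
import json
from typing import Any, Dict, Optional, Tuple


def parse_score(stdout: Optional[str]) -> Tuple[float, Dict[str, Any]]:
    """Parse verifier stdout into (score, record) without raising on odd JSON."""
    try:
        parsed = json.loads(stdout or "{}")
    except json.JSONDecodeError as exc:
        return 0.0, {"score": 0.0, "error": str(exc)}
    if not isinstance(parsed, dict):
        # Valid JSON such as null, [] or 123 has no .get(); reject it explicitly.
        return 0.0, {"score": 0.0, "error": "verifier output is not a dict"}
    raw = parsed.get("score")
    try:
        score = 0.0 if raw is None else float(raw)
    except (TypeError, ValueError):
        return 0.0, {**parsed, "score": 0.0, "error": "score is not numeric"}
    return score, parsed


score, record = parse_score('{"score": 0.5, "public_pass": 0.0}')
bad_score, bad_record = parse_score("null")
```

Returning the record alongside the score keeps the error message available for whatever state the caller stores, in the same spirit as `state["emulator_score"]` in the reviewed code.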

sethkarten force-pushed the feat/emulator-benchmark branch from 7bea448 to 80b20eb on May 15, 2026 08:15

macroscopeapp Bot reviewed May 15, 2026

sethkarten force-pushed the feat/emulator-benchmark branch from 80b20eb to b520fef on May 15, 2026 09:09

macroscopeapp Bot reviewed May 15, 2026

sethkarten force-pushed the feat/emulator-benchmark branch 2 times, most recently from e5b2b3a to 4af1fa1, on May 15, 2026 16:04

Add emulator training eval environments

125a9ce

sethkarten force-pushed the feat/emulator-benchmark branch from 4af1fa1 to 125a9ce on May 15, 2026 16:04

macroscopeapp Bot reviewed May 15, 2026

sethkarten wants to merge 1 commit into main from feat/emulator-benchmark

1 participant
