
Add emulator training eval environments#1390

Draft
sethkarten wants to merge 1 commit into main from feat/emulator-benchmark

sethkarten commented May 15, 2026

Summary

  • add ten Prime emulator benchmark environments for CHIP-8 through PS1
  • add shared emulator harness, runner contract, suite adapters, public test manifests, and private artifact mount support
  • add openai/gpt-5.4-mini eval configs for all-env Prime smoke, all-env suite, and CHIP-8 suite canary
  • add Hosted Training and self-managed prime-rl CHIP-8 smoke configs
  • add the GPT-5.4 Mini eval/training report at environments/emulator_common/GPT_5_4_MINI_EVAL_REPORT.md
  • keep implementation-task verifier scoring reachable after MiniSWEAgent command-loop failures so partial/broken attempts produce reward components
  • add inline_system_prompt and v1 sampling-args output compatibility needed for self-managed prime-rl rollouts

Verification

  • uv run pytest tests/test_emulator_benchmark_envs.py tests/test_v1_runtime_lifecycle.py::test_v1_env_apply_controls_mirrors_sampling_args_for_rollout_outputs -q -> 56 passed
  • uv run pytest tests/test_v1_runtime_lifecycle.py -q -> 56 passed
  • uv run ruff check environments/emulator_common/env_factory.py tests/test_emulator_benchmark_envs.py tests/test_v1_runtime_lifecycle.py verifiers/v1/env.py -> passed
  • uv run pre-commit run --files ... on touched files -> passed
  • git diff --check -> passed
  • CPU node root@86.38.238.176 (pb-coordinator) -> cloned/pulled branch at 125a9ce6, uv run pytest tests/test_emulator_benchmark_envs.py tests/test_v1_runtime_lifecycle.py::test_v1_env_apply_controls_mirrors_sampling_args_for_rollout_outputs -q -> 56 passed
  • all-env openai/gpt-5.4-mini Prime smoke -> 10/10 envs reward 1.00; plumbing only, not ROM correctness
  • all-env openai/gpt-5.4-mini suite eval -> 10 envs run, mean reward 0.58, mean error 0.10, full public pass envs 1/10, nine envs with public ROM pass 0.00; GPT-5.4 Mini did not pass all emulator tests
  • training node root@31.22.104.54 (chess-dagger-2xpro6000b) -> repo cloned, envs installed editable, prime-rl set up, renderers==0.1.8.dev0 installed in prime-rl/.venv
  • real self-managed prime-rl smoke on 2 x RTX PRO 6000 node -> Qwen/Qwen3-0.6B, Step 0 reward 0.5400, trainer Step 0 loss 0.0000, entropy 0.5280, KL 0.0019, checkpoint and weights written under /root/verifiers-emulator/outputs/emulator-chip8-smoke

Notes

  • no ROM/BIOS bytes are committed
  • private/commercial/generated artifacts are mounted through EMULATOR_PRIVATE_ARTIFACT_DIR or /private/emulator
  • Genesis reached public pass 1.00 against the currently configured source manifest; keep private artifacts mounted before treating that as comprehensive Genesis coverage
  • hard systems should stay eval-only until easier emulator tasks show nonzero public ROM pass rates under training

sethkarten force-pushed the feat/emulator-benchmark branch from 7bea448 to 80b20eb on May 15, 2026 08:15
Comment on lines +383 to +384
cycles = int(runtime.get("smoke_cycles", 10000))
frames = max(1, int(runtime.get("smoke_frames", 1)))

🟢 Low emulator_common/suite_adapters.py:383

When the manifest's runtime dict contains an explicit None for smoke_cycles or smoke_frames (e.g., from JSON {"smoke_cycles": null}), int(runtime.get("smoke_cycles", 10000)) raises TypeError because .get() returns None instead of the default. The default applies only when the key is missing, not when it exists with a None value. Consider using int(runtime.get("smoke_cycles") or 10000) to treat None as missing; note that `or` also replaces an explicit 0 with the default.

Suggested change
-cycles = int(runtime.get("smoke_cycles", 10000))
-frames = max(1, int(runtime.get("smoke_frames", 1)))
+cycles = int(runtime.get("smoke_cycles") or 10000)
+frames = max(1, int(runtime.get("smoke_frames") or 1))
Also found in 1 other location(s)

environments/emulator_common/env_factory.py:33

In _score_component, if a component value is explicitly None (JSON null) rather than missing, components.get(name, 0.0) returns None (not the default 0.0), and float(None) raises TypeError. The code guards against missing keys but not explicit null values in the verifier output.

Evidence trail:
environments/emulator_common/suite_adapters.py lines 383-384 (the vulnerable `int(runtime.get(...))` calls), line 499 (`runtime = dict(manifest.get("runtime", {}))` showing runtime comes from JSON manifest data). Python dict.get() semantics: returns stored value (including None) when key exists, only uses default when key is absent. int(None) raises TypeError.
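
The `dict.get` behaviour described in this comment can be reproduced in isolation. The following is a minimal sketch, not code from the PR: `int_or_default` is a hypothetical helper name, and unlike the `or`-based suggestion it keeps an explicit `0` instead of replacing it with the default.

```python
from typing import Any, Mapping


def int_or_default(mapping: Mapping[str, Any], key: str, default: int) -> int:
    """Hypothetical helper: treat a missing key and an explicit None alike."""
    value = mapping.get(key)
    if value is None:
        return default
    return int(value)


# A manifest whose JSON contained {"smoke_cycles": null, "smoke_frames": 0}.
runtime = {"smoke_cycles": None, "smoke_frames": 0}

# The original pattern fails: the key exists, so .get() returns None, not 10000.
try:
    int(runtime.get("smoke_cycles", 10000))
    original_raised = False
except TypeError:
    original_raised = True

cycles = int_or_default(runtime, "smoke_cycles", 10000)
frames = max(1, int_or_default(runtime, "smoke_frames", 1))
```

The same pattern would apply to the `float(components.get(name, 0.0))` call flagged in `_score_component`, with `float` in place of `int`.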

Comment on lines +95 to +98
        cases = tuple(
            VerificationCase.from_mapping(case)
            for case in value.get("verification_cases", [])
        )

🟢 Low emulator_common/task_schema.py:95

EmulatorManifest.from_mapping raises TypeError: 'NoneType' object is not iterable when verification_cases or any list-typed field (e.g., requirements, core_concepts) is explicitly set to None. The code uses value.get("verification_cases", []) but passes the result directly to tuple(), which fails on None rather than producing a clear validation error or defaulting to empty.

-        cases = tuple(
-            VerificationCase.from_mapping(case)
-            for case in value.get("verification_cases", [])
-        )
+        cases_raw = value.get("verification_cases", [])
+        if cases_raw is None:
+            raise ValueError("manifest field 'verification_cases' cannot be None")
+        if not isinstance(cases_raw, list):
+            raise ValueError("manifest field 'verification_cases' must be a list")
+        cases = tuple(
+            VerificationCase.from_mapping(case)
+            for case in cases_raw
+        )

Evidence trail:
environments/emulator_common/task_schema.py lines 10-25 (REQUIRED_MANIFEST_FIELDS includes 'verification_cases', 'requirements', etc.), lines 92-97 (from_mapping required-field check only verifies key presence, not value type), lines 98-100 (value.get('verification_cases', []) returns None when key exists with None value, then tuple() iterates over None), lines 109-113 (tuple(map(str, value['requirements'])) etc. also fail on None).
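
Because the comment notes that other list-typed fields (`requirements`, `core_concepts`) fail the same way, one option is a single shared check. This is only a sketch under that assumption: `require_list` and `describe` are hypothetical names, not functions from `task_schema.py`.

```python
from typing import Any, Mapping


def require_list(value: Mapping[str, Any], field: str) -> list:
    """Hypothetical helper: return value[field] only if it is a real list."""
    raw = value.get(field)
    if raw is None:
        # Covers both a missing key and an explicit JSON null.
        raise ValueError(f"manifest field {field!r} is missing or null")
    if not isinstance(raw, list):
        raise ValueError(
            f"manifest field {field!r} must be a list, got {type(raw).__name__}"
        )
    return raw


def describe(value: Mapping[str, Any], field: str) -> str:
    """Show the outcome of the check as text instead of raising."""
    try:
        return f"ok: {len(require_list(value, field))} item(s)"
    except ValueError as exc:
        return f"error: {exc}"


print(describe({"requirements": ["a", "b"]}, "requirements"))
print(describe({"requirements": None}, "requirements"))
print(describe({"requirements": "a,b"}, "requirements"))
```

Raising `ValueError` with the field name keeps the failure a validation error rather than a bare `TypeError` from `tuple()`.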

sethkarten force-pushed the feat/emulator-benchmark branch from 80b20eb to b520fef on May 15, 2026 09:09
`prime-rl` validation after the hosted config is accepted:

```bash
ssh root@31.22.104.54 -p 22 -i private_key.pem
```

🟢 Low emulator_common/GPT_5_4_MINI_EVAL_REPORT.md:131

Line 131 hardcodes an SSH connection string with a real IP address (31.22.104.54), root username, and private key filename (private_key.pem) into a committed Markdown file. This leaks internal network topology and access patterns; even if the repo is private, the details persist in git history and facilitate targeted access attempts if the file is ever shared. Consider replacing them with placeholders such as <USER>, <HOST>, <PORT>, and <KEY_PATH>, and documenting that users must supply their own values.

-ssh root@31.22.104.54 -p 22 -i private_key.pem
+ssh <USER>@<HOST> -p <PORT> -i <KEY_PATH>

[env.args]
max_tasks = 1
sandbox_timeout_minutes = 60

🟠 High rl/emulator-chip8-minimal.toml:20

In the training [env.args] block, sandbox_timeout_minutes = 60 (60 minutes) is half of environment_timeout = 7200 (7200 seconds, i.e. 120 minutes). The sandbox is killed at 60 minutes while the agent still expects up to 120 minutes, so in-progress training work is terminated prematurely. The [eval.env.args] block correctly aligns both to 30 minutes. Consider setting sandbox_timeout_minutes = 120 to match environment_timeout, or reducing environment_timeout to 3600 to match the 60-minute sandbox limit.

-sandbox_timeout_minutes = 60
+sandbox_timeout_minutes = 120
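
The unit mismatch is easy to miss because one key is in minutes and the other, per this comment, in seconds. A small consistency check could catch it before a run starts. The sketch below assumes both keys sit in the same args table and that `environment_timeout` is in seconds; `check_timeouts` is a hypothetical name, and the eval value of 1800 is inferred from the "30 minutes" mentioned above rather than read from the config.

```python
from typing import Any, List, Mapping


def check_timeouts(env_args: Mapping[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the timeouts agree."""
    problems: List[str] = []
    # Convert the sandbox limit to seconds so both values share a unit.
    sandbox_seconds = int(env_args["sandbox_timeout_minutes"]) * 60
    env_seconds = int(env_args["environment_timeout"])
    if sandbox_seconds < env_seconds:
        problems.append(
            f"sandbox is killed after {sandbox_seconds}s, "
            f"but environment_timeout allows {env_seconds}s"
        )
    return problems


# Values quoted in the review comment for the training block.
training_args = {"sandbox_timeout_minutes": 60, "environment_timeout": 7200}
# Assumed eval values: both aligned to 30 minutes.
eval_args = {"sandbox_timeout_minutes": 30, "environment_timeout": 1800}

training_problems = check_timeouts(training_args)
eval_problems = check_timeouts(eval_args)
```

With the values above, only the training block is reported as inconsistent.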

sethkarten force-pushed the feat/emulator-benchmark branch 2 times, most recently from e5b2b3a to 4af1fa1 on May 15, 2026 16:04
sethkarten force-pushed the feat/emulator-benchmark branch from 4af1fa1 to 125a9ce on May 15, 2026 16:04
Comment on lines +189 to +200
        try:
            parsed = json.loads(output.stdout or "{}")
        except json.JSONDecodeError as exc:
            state["emulator_score"] = {"score": 0.0, "error": str(exc)}
            return 0.0
        agent_exit_status = state.get("artifacts", {}).get("emulator_agent_exit_status")
        if agent_exit_status is not None and isinstance(parsed, dict):
            parsed["agent_exit_status"] = str(agent_exit_status).strip()
        if agent_error is not None and isinstance(parsed, dict):
            parsed["agent_error"] = str(agent_error)
        state["emulator_score"] = parsed
        return float(parsed.get("score", 0.0))

🟡 Medium emulator_common/env_factory.py:189

When /tmp/emulator_score.json contains valid JSON that is not a dictionary (e.g., null, [], 123), line 200 raises AttributeError because parsed.get() is called unconditionally. Lines 195-198 check isinstance(parsed, dict) before modifying parsed, but line 200 does not perform this check before accessing .get().

        try:
            parsed = json.loads(output.stdout or "{}")
        except json.JSONDecodeError as exc:
            state["emulator_score"] = {"score": 0.0, "error": str(exc)}
            return 0.0
+        if not isinstance(parsed, dict):
+            state["emulator_score"] = {"score": 0.0, "error": "verifier output is not a dict"}
+            return 0.0
        agent_exit_status = state.get("artifacts", {}).get("emulator_agent_exit_status")
        if agent_exit_status is not None and isinstance(parsed, dict):
            parsed["agent_exit_status"] = str(agent_exit_status).strip()
        if agent_error is not None and isinstance(parsed, dict):
            parsed["agent_error"] = str(agent_error)
        state["emulator_score"] = parsed
        return float(parsed.get("score", 0.0))

Evidence trail:
environments/emulator_common/env_factory.py lines 189-200 at REVIEWED_COMMIT: line ~191 `parsed = json.loads(output.stdout or '{}')`, lines 196+198 guard with `isinstance(parsed, dict)`, line 200 calls `parsed.get('score', 0.0)` unconditionally.
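
The failure mode and the suggested guard can be exercised without the surrounding environment code. This is a standalone sketch, not the PR's implementation: `parse_score` is a hypothetical function that returns the score together with the record that would be stored in `state["emulator_score"]`.

```python
import json
from typing import Any, Dict, Tuple


def parse_score(stdout: str) -> Tuple[float, Dict[str, Any]]:
    """Parse verifier output, falling back to 0.0 on any unusable payload."""
    try:
        parsed = json.loads(stdout or "{}")
    except json.JSONDecodeError as exc:
        return 0.0, {"score": 0.0, "error": str(exc)}
    # Valid JSON such as null, [] or 123 has no .get(); reject it explicitly.
    if not isinstance(parsed, dict):
        kind = type(parsed).__name__
        return 0.0, {"score": 0.0, "error": f"verifier output is {kind}, not a dict"}
    return float(parsed.get("score", 0.0)), parsed


for payload in ["null", "[]", "123", "not json", "", '{"score": 0.5}']:
    score, record = parse_score(payload)
    print(repr(payload), "->", score, record.get("error"))
```

Once the early return is in place, the later `isinstance(parsed, dict)` checks in the reviewed function become redundant, though leaving them is harmless.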
