Add emulator training eval environments #1390
7bea448 to 80b20eb
    cycles = int(runtime.get("smoke_cycles", 10000))
    frames = max(1, int(runtime.get("smoke_frames", 1)))
🟢 Low emulator_common/suite_adapters.py:383
When the manifest's runtime dict contains an explicit None for smoke_cycles or smoke_frames (e.g., from JSON {"smoke_cycles": null}), int(runtime.get("smoke_cycles", 10000)) raises TypeError because .get() returns None instead of using the default. The default only applies when the key is missing, not when it exists with None value. Consider using int(runtime.get("smoke_cycles") or 10000) to treat None as missing.
- cycles = int(runtime.get("smoke_cycles", 10000))
- frames = max(1, int(runtime.get("smoke_frames", 1)))
+ cycles = int(runtime.get("smoke_cycles") or 10000)
+ frames = max(1, int(runtime.get("smoke_frames") or 1))
Also found in 1 other location(s):
- environments/emulator_common/env_factory.py:33 -- In `_score_component`, if a component value is explicitly `None` (JSON `null`) rather than missing, `components.get(name, 0.0)` returns `None` (not the default `0.0`), and `float(None)` raises `TypeError`. The code guards against missing keys but not explicit null values in the verifier output.
Evidence trail:
environments/emulator_common/suite_adapters.py lines 383-384 (the vulnerable `int(runtime.get(...))` calls), line 499 (`runtime = dict(manifest.get("runtime", {}))` showing runtime comes from JSON manifest data). Python dict.get() semantics: returns stored value (including None) when key exists, only uses default when key is absent. int(None) raises TypeError.
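The `dict.get` behaviour described in this comment can be reproduced in isolation. The sketch below is illustrative only: `int_or_default` is a hypothetical helper that does not appear in the PR, and the manifest fragment is made-up data. It also shows one trade-off of the suggested `or` idiom, which treats every falsy value (including an explicit `0`) as missing, whereas an explicit `None` check does not.

```python
import json


def int_or_default(mapping, key, default):
    """Return int(mapping[key]); fall back to default when the key is absent or null.

    Hypothetical helper for illustration, not part of the PR.
    """
    value = mapping.get(key)
    return default if value is None else int(value)


# Illustrative manifest fragment with an explicit JSON null.
runtime = json.loads('{"smoke_cycles": null, "smoke_frames": 2}')

# dict.get applies its default only when the key is absent, so this is None.
raw_cycles = runtime.get("smoke_cycles", 10000)

# The `or` idiom from the suggestion: any falsy value falls back to the default.
cycles_with_or = int(runtime.get("smoke_cycles") or 10000)

# The explicit None check: only null or absent keys fall back.
cycles = int_or_default(runtime, "smoke_cycles", 10000)
frames = max(1, int_or_default(runtime, "smoke_frames", 1))

# Where the two approaches differ: an explicit zero.
zero_with_or = int({"smoke_cycles": 0}.get("smoke_cycles") or 10000)
zero_with_check = int_or_default({"smoke_cycles": 0}, "smoke_cycles", 10000)
```

For `smoke_frames` the difference is masked by the surrounding `max(1, ...)`, so either form would likely behave the same there; whether a zero `smoke_cycles` should be honoured is a decision for the PR author.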
    cases = tuple(
        VerificationCase.from_mapping(case)
        for case in value.get("verification_cases", [])
    )
🟢 Low emulator_common/task_schema.py:95
EmulatorManifest.from_mapping raises TypeError: 'NoneType' object is not iterable when verification_cases or any list-typed field (e.g., requirements, core_concepts) is explicitly set to None. The code uses value.get("verification_cases", []) but passes the result directly to tuple(), which fails on None rather than producing a clear validation error or defaulting to empty.
- cases = tuple(
-     VerificationCase.from_mapping(case)
-     for case in value.get("verification_cases", [])
- )
+ cases_raw = value.get("verification_cases", [])
+ if cases_raw is None:
+     raise ValueError("manifest field 'verification_cases' cannot be None")
+ if not isinstance(cases_raw, list):
+     raise ValueError("manifest field 'verification_cases' must be a list")
+ cases = tuple(
+     VerificationCase.from_mapping(case)
+     for case in cases_raw
+ )
Evidence trail:
environments/emulator_common/task_schema.py lines 10-25 (REQUIRED_MANIFEST_FIELDS includes 'verification_cases', 'requirements', etc.), lines 92-97 (from_mapping required-field check only verifies key presence, not value type), lines 98-100 (value.get('verification_cases', []) returns None when key exists with None value, then tuple() iterates over None), lines 109-113 (tuple(map(str, value['requirements'])) etc. also fail on None).
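Since the comment notes that other list-typed fields (`requirements`, `core_concepts`) fail the same way, one option is to factor the check into a shared helper. The following is a minimal sketch under that assumption: `require_list` is a hypothetical name, and the manifest dict is invented sample data rather than the real schema.

```python
def require_list(mapping, field):
    """Return mapping[field] as a list, raising ValueError for null or non-list values.

    Hypothetical helper for illustration; a missing key defaults to an empty list.
    """
    raw = mapping.get(field, [])
    if raw is None:
        raise ValueError(f"manifest field {field!r} cannot be None")
    if not isinstance(raw, list):
        raise ValueError(
            f"manifest field {field!r} must be a list, got {type(raw).__name__}"
        )
    return raw


# Invented sample manifest: one valid list field and one explicit null.
manifest = {"verification_cases": None, "requirements": ["cpu", "display"]}

# A well-formed field passes through unchanged.
requirements = tuple(map(str, require_list(manifest, "requirements")))

# An explicit null now produces a clear validation error instead of a TypeError.
try:
    require_list(manifest, "verification_cases")
except ValueError as exc:
    error_message = str(exc)
```

Whether a missing key should default to an empty list or be rejected outright depends on how the required-field check upstream is meant to behave; the sketch keeps the `[]` default used in the original code.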
80b20eb to b520fef
    `prime-rl` validation after the hosted config is accepted:

    ```bash
    ssh root@31.22.104.54 -p 22 -i private_key.pem
🟢 Low emulator_common/GPT_5_4_MINI_EVAL_REPORT.md:131
Line 131 hardcodes an SSH connection string with a real IP address (31.22.104.54), root username, and private key filename (private_key.pem) into a committed Markdown file. This leaks internal network topology and access patterns; even if the repo is private, the details persist in git history and facilitate targeted access attempts if the file is ever shared. Consider replacing with a placeholder like <HOST>, <USER>, and <KEY_PATH> and documenting that users must supply their own values.
-ssh root@31.22.104.54 -p 22 -i private_key.pem
+ssh <USER>@<HOST> -p <PORT> -i <KEY_PATH>
    [env.args]
    max_tasks = 1
    sandbox_timeout_minutes = 60
🟠 High rl/emulator-chip8-minimal.toml:20
In the training [env.args] block, sandbox_timeout_minutes = 60 (60 min) is half of environment_timeout = 7200 (7200 s, i.e. 120 min). The sandbox is killed at 60 minutes while the agent expects 120 minutes, so in-progress training work is terminated prematurely. The [eval.env.args] block correctly aligns both to 30 minutes. Consider setting sandbox_timeout_minutes = 120 to match environment_timeout, or reducing environment_timeout to 3600 to match the 60-minute sandbox limit.
-sandbox_timeout_minutes = 60
+sandbox_timeout_minutes = 120
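The unit mismatch behind this finding (one setting in minutes, the other in seconds) is easy to miss by eye, so a small consistency check may help. The sketch below assumes, as the comment implies, that `environment_timeout` is expressed in seconds and lives in the same table as `sandbox_timeout_minutes`; `sandbox_expires_early` is a hypothetical helper, and the dicts simply mirror the values quoted in the comment rather than loading the real config file.

```python
def sandbox_expires_early(args):
    """Return True when the sandbox would be killed before the environment timeout.

    Assumes sandbox_timeout_minutes is in minutes and environment_timeout in seconds.
    Hypothetical helper for illustration.
    """
    sandbox_seconds = int(args["sandbox_timeout_minutes"]) * 60
    environment_seconds = int(args["environment_timeout"])
    return sandbox_seconds < environment_seconds


# Values as quoted in the review comment for the training block.
train_args = {"max_tasks": 1, "sandbox_timeout_minutes": 60, "environment_timeout": 7200}

# Values after applying the suggested fix.
fixed_args = {"max_tasks": 1, "sandbox_timeout_minutes": 120, "environment_timeout": 7200}

train_is_misaligned = sandbox_expires_early(train_args)
fixed_is_misaligned = sandbox_expires_early(fixed_args)
```

A check like this could run in a config test so that a future edit to either timeout does not silently reintroduce the gap.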
e5b2b3a to 4af1fa1
4af1fa1 to 125a9ce
    try:
        parsed = json.loads(output.stdout or "{}")
    except json.JSONDecodeError as exc:
        state["emulator_score"] = {"score": 0.0, "error": str(exc)}
        return 0.0
    agent_exit_status = state.get("artifacts", {}).get("emulator_agent_exit_status")
    if agent_exit_status is not None and isinstance(parsed, dict):
        parsed["agent_exit_status"] = str(agent_exit_status).strip()
    if agent_error is not None and isinstance(parsed, dict):
        parsed["agent_error"] = str(agent_error)
    state["emulator_score"] = parsed
    return float(parsed.get("score", 0.0))
🟡 Medium emulator_common/env_factory.py:189
When /tmp/emulator_score.json contains valid JSON that is not a dictionary (e.g., null, [], 123), line 200 raises AttributeError because parsed.get() is called unconditionally. Lines 195-198 check isinstance(parsed, dict) before modifying parsed, but line 200 does not perform this check before accessing .get().
  try:
      parsed = json.loads(output.stdout or "{}")
  except json.JSONDecodeError as exc:
      state["emulator_score"] = {"score": 0.0, "error": str(exc)}
      return 0.0
+ if not isinstance(parsed, dict):
+     state["emulator_score"] = {"score": 0.0, "error": "verifier output is not a dict"}
+     return 0.0
  agent_exit_status = state.get("artifacts", {}).get("emulator_agent_exit_status")
  if agent_exit_status is not None and isinstance(parsed, dict):
      parsed["agent_exit_status"] = str(agent_exit_status).strip()
  if agent_error is not None and isinstance(parsed, dict):
      parsed["agent_error"] = str(agent_error)
  state["emulator_score"] = parsed
  return float(parsed.get("score", 0.0))
Evidence trail:
environments/emulator_common/env_factory.py lines 189-200 at REVIEWED_COMMIT: line ~191 `parsed = json.loads(output.stdout or '{}')`, lines 196+198 guard with `isinstance(parsed, dict)`, line 200 calls `parsed.get('score', 0.0)` unconditionally.
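The failure modes in this comment, together with the explicit-null case raised earlier for `env_factory.py`, can be handled in one place. The sketch below is a standalone approximation, not the PR's code: `parse_score` is a hypothetical function that returns a `(score, record)` pair instead of mutating `state`, and it additionally tolerates a `score` value that is `null` or not numeric.

```python
import json


def parse_score(stdout):
    """Parse verifier stdout into (score, record) without raising on odd JSON shapes.

    Hypothetical helper for illustration; the PR stores the record in `state` instead.
    """
    try:
        parsed = json.loads(stdout or "{}")
    except json.JSONDecodeError as exc:
        return 0.0, {"score": 0.0, "error": str(exc)}
    # Valid JSON that is not an object (null, [], 123) has no .get method.
    if not isinstance(parsed, dict):
        return 0.0, {"score": 0.0, "error": "verifier output is not a dict"}
    raw = parsed.get("score")
    try:
        # An explicit null or a non-numeric value falls back to 0.0.
        score = float(raw) if raw is not None else 0.0
    except (TypeError, ValueError):
        score = 0.0
    return score, parsed


good_score, good_record = parse_score('{"score": 0.75}')
null_score, _ = parse_score("null")
list_score, list_record = parse_score("[]")
null_field_score, _ = parse_score('{"score": null}')
broken_score, broken_record = parse_score("not json")
```

Returning the record alongside the score keeps the function free of side effects, which makes the edge cases straightforward to unit test; the caller can still assign the record to `state["emulator_score"]`.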
Summary
- `openai/gpt-5.4-mini` eval configs for all-env Prime smoke, all-env suite, and CHIP-8 suite canary
- `prime-rl` CHIP-8 smoke configs
- `environments/emulator_common/GPT_5_4_MINI_EVAL_REPORT.md`
- `inline_system_prompt` and v1 sampling-args output compatibility needed for self-managed `prime-rl` rollouts

Verification
- `uv run pytest tests/test_emulator_benchmark_envs.py tests/test_v1_runtime_lifecycle.py::test_v1_env_apply_controls_mirrors_sampling_args_for_rollout_outputs -q` -> 56 passed
- `uv run pytest tests/test_v1_runtime_lifecycle.py -q` -> 56 passed
- `uv run ruff check environments/emulator_common/env_factory.py tests/test_emulator_benchmark_envs.py tests/test_v1_runtime_lifecycle.py verifiers/v1/env.py` -> passed
- `uv run pre-commit run --files ...` on touched files -> passed
- `git diff --check` -> passed
- `root@86.38.238.176` (`pb-coordinator`) -> cloned/pulled branch at `125a9ce6`, `uv run pytest tests/test_emulator_benchmark_envs.py tests/test_v1_runtime_lifecycle.py::test_v1_env_apply_controls_mirrors_sampling_args_for_rollout_outputs -q` -> 56 passed
- `openai/gpt-5.4-mini` Prime smoke -> 10/10 envs reward 1.00; plumbing only, not ROM correctness
- `openai/gpt-5.4-mini` suite eval -> 10 envs run, mean reward 0.58, mean error 0.10, full public pass envs 1/10, nine envs with public ROM pass 0.00; GPT-5.4 Mini did not pass all emulator tests
- `root@31.22.104.54` (`chess-dagger-2xpro6000b`) -> repo cloned, envs installed editable, `prime-rl` set up, `renderers==0.1.8.dev0` installed in `prime-rl/.venv`
- `prime-rl` smoke on 2 x RTX PRO 6000 node -> `Qwen/Qwen3-0.6B`, Step 0 reward 0.5400, trainer Step 0 loss 0.0000, entropy 0.5280, KL 0.0019, checkpoint and weights written under `/root/verifiers-emulator/outputs/emulator-chip8-smoke`

Notes
`EMULATOR_PRIVATE_ARTIFACT_DIR` or `/private/emulator`