Mohit/trial episode redesign#35
Conversation
Under multi-episode-per-row rollouts (planned for the trial-redesign), training applies PE by slot-in-segment while rollout applies PE by slot-in-cache (which resets at episode boundary). Same logical step gets different PE values in train vs eval — distribution shift. Removing PE entirely sidesteps the mismatch by construction. Five sites patched in models.py: buffer registration removed, get_positional_embedding helper removed, forward_eval / _forward_eval_legacy / forward / _prime_kv_cache all updated to skip the PE add. _prime_kv_cache primes with zeros instead of pos_embed. Backwards-compat for existing PE-baked checkpoints: - vector.py partner load filters `positional_embedding` from state_dict - pufferl.py:load_policy filters the same key + autodetect heuristic now also accepts any `transformer.layers.*` key as a Transformer marker (since legacy detection relied on the PE buffer's presence). Tests: tests/test_transformer_kv_cache.py PASS (cached vs legacy forward_eval still bit-identical at fp32). Smoke train: 3 epochs at ~100K SPS, entropy declining, no NaN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the NoPE removal from 8237688. Brings back nn.Parameter positional_embedding (matches pre-eb2d7fe2 behavior). Adds compute_pos_within_episode static method that computes per-slot position-within-episode from a (B, T) terminals tensor, used at training time to align PE indexing with rollout's cache-pos indexing. Why: under multi-episode-per-row rollouts (planned for trial-redesign), forward_eval resets pos to 0 at every episode boundary via pufferl.py's done handling. So forward_eval applies pe[pos_within_episode] for the first step of each new episode in the cache. forward() (training) was applying pe[slot_in_segment] across the whole row regardless of boundaries — different absolute positions for the same logical step. This silently corrupted gradients under multi-episode-per-row. Formula (vectorized cummax trick): arange = [0..T-1] shifted = pad(terminals[:, :-1], (1, 0)) # right-shift terminals starts = arange * shifted # mark episode-start slots ep_start = starts.cummax(dim=1).values # propagate forward pos_in_ep = arange - ep_start # = 0 at each episode start Convention matches create_episode_mask: terminal slot belongs to OLD episode; new episode starts at slot terminal+1. Tests added: tests/test_pos_within_episode.py — 9 unit tests, all passing tests/test_pe_train_eval_consistency.py — 3 integration tests proving forward() output bit-matches step-by-step forward_eval (with manual cache reset at episode boundaries) for single-, 2-, and 3-episode-per-row segments. Existing tests/test_transformer_kv_cache.py still PASS (cached vs legacy forward_eval still bit-identical). Smoke train: 3 epochs in 2min, SPS ~100K, entropy stable, no NaN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep the per-episode-reset gather logic in forward() unchanged — only the PE buffer initialization swaps from nn.Parameter (std=0.02 init) to register_buffer with Vaswani sinusoidal values. Per-episode reset applies regardless of PE flavor. Tests still pass: - tests/test_pos_within_episode.py (9/9) - tests/test_pe_train_eval_consistency.py (3/3) — train==eval bit-close - tests/test_transformer_kv_cache.py (3/3) — cached vs legacy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…opy)
Per-agent flag intended to be set in c_step under goal_behavior=GOAL_TRIAL
(coming in M3) when a trial ends — distinct from `terminals`, which fires
only at the EPISODE boundary (after max_trials_per_episode trials).
End-to-end plumbing:
drive.h: add `unsigned char *trial_ended_this_step` to Drive struct;
calloc/free in standalone allocate()/free_allocated();
memset to zero at the top of c_step (guarded on != NULL).
binding.c: parse `trial_ended_this_step` kwarg in both my_init (initial
env_init) and my_put (re-init via _reinit_envs_with_new_maps);
OPTIONAL — older callers without the kwarg get NULL pointer
and the c_step memset is no-op'd.
drive.py: allocate self.trial_ended_this_step = np.zeros(num_agents,
dtype=bool); pass slice via kwargs to env_init at both call
sites (initial + reinit).
env_binding.h is NOT modified — the generic vec_init 6-tuple
(obs, action, reward, term, trunc, seed) remains intact, so this change
doesn't ripple to other ocean envs. The recv() contract is unchanged;
trial_ended_this_step is exposed as a Python-side attribute, not part
of the standard step return.
Tests (tests/test_trial_ended_buffer.py): 3/3 PASS
1. Drive.__init__ exposes the buffer with correct shape/dtype
2. C zeros the buffer in c_step (proven by Python-side pollution)
3. 50 sequential steps survive with buffer still bound
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New behavior, fully gated on goal_behavior == GOAL_TRIAL (=3). All other
goal_behavior values (0/1/2) are unaffected.
Mechanics:
- Each agent runs trials. A trial ends on goal-reach OR per-trial
timeout (`per_trial_timeout` ticks since trial start; default
scenario_length).
- At trial end: trial_ended_this_step[i]=1, trial_count++, agent
respawned to traj[0], trial_start_timestep updated.
- When trial_count == max_trials_per_episode (default 2), terminals[i]=1
fires for that agent (episode boundary), trial_count reset to 0.
- The c_step early-return at scenario_length is suppressed under
GOAL_TRIAL — episode boundaries come from the trial logic instead.
Drive.h struct additions:
- Entity.trial_count, trial_start_timestep (zero under non-TRIAL paths)
- Drive.max_trials_per_episode, per_trial_timeout (config)
Binding.c:
- my_init parses max_trials_per_episode + per_trial_timeout kwargs.
per_trial_timeout=0 means "use default" (scenario_length).
Drive.py:
- max_trials_per_episode + per_trial_timeout added to __init__ signature
and threaded through both env_init call sites.
Tests (tests/test_goal_trial.py): 5/5 PASS
- 60-step runs at gb=0/1/2 confirm trial_ended_this_step stays zero.
- gb=3 with timeout=5, max_trials=2: trial_ended fires every 5 steps,
terminals fires every 10 steps (= max_trials × timeout), trial_count
resets correctly across episode boundaries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…each_rate)
Adds 4 float fields to Log struct:
- n_trials_completed
- n_trials_goal_reached
- n_trials_timed_out
- trial_total_length (running sum; divided by n_trials_completed in my_log)
Under GOAL_TRIAL only, c_step writes directly to env->log on every
trial-end event (bypasses logs[i] aggregation since add_log doesn't
fire under GOAL_TRIAL — the scenario_length early-return is suppressed).
env->log.n is incremented at episode boundaries so vec_log's
total_n >= num_agents gate triggers and metrics emit.
Under goal_behavior 0/1/2: all trial fields stay zero. vec_log path
unchanged.
binding.c my_log emits two derived metrics for convenience:
trial_mean_length = total_length / n_completed
trial_goal_reach_rate = n_goal_reached / n_completed
Tests (tests/test_trial_log_fields.py): 2/2 PASS
- All goal_behavior values (0/1/2/3) expose the new keys.
- Under non-TRIAL: keys present but zero.
- Under TRIAL with timeout=5: random policy times out every trial,
n_completed grows, mean_length ≈ 5, goal_reach_rate = 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds trial-mode rollout loop that branches on goal_behavior:
- goal_behavior in {0,1,2}: existing scenario-mode loop preserved.
- goal_behavior == 3: outer loop runs up to max_trials *
per_trial_timeout ticks; per-agent trial counter advances on
trial_ended_this_step (Python reads env.trial_ended_this_step
after each step). Trial outcome = (reward > goal_reward_threshold)
at the trial-end tick.
Output schema additions for trial mode:
- trial_0_score, trial_1_score, ..., trial_{K-1}_score
- ada_delta_trial_1_minus_0, ada_delta_trial_K_minus_0, etc.
- per_agent_success_log records use t0/t1/... keys (not s0/s1)
Tests (tests/test_evaluator_trial_mode.py): 2/2 PASS
- gb=3: trial_X_score and ada_delta keys appear; records use t0/t1.
- gb=0: scenario mode unchanged; records use s0/s1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ults
utils.py:run_human_replay_eval_in_subprocess:
- Inherit goal_behavior from training env_config (was hardcoded to 0).
- Forward max_trials_per_episode and per_trial_timeout flags.
Wandb log path already auto-forwards every numeric metric under
eval/human_replay_*, so the new trial_X_score and
ada_delta_trial_K_minus_0 keys from M5 flow through automatically.
adaptive.ini, drive.ini:
- Add max_trials_per_episode=2 and per_trial_timeout=0 defaults.
- Update goal_behavior comment to mention 3:"trial".
Without these in the INIs, the pufferargs CLI parser doesn't generate
--env.max-trials-per-episode / --env.per-trial-timeout flags, so
launchers can't override them.
Smoke train (gb=3, max_trials=2, per_trial_timeout=201) on adaptive_drive:
Epoch 2 reached, SPS ~93K, no NaN.
Trial fields flow through vec_log:
n_trials_completed = 2.257, n_trials_goal_reached = 1.692,
n_trials_timed_out = 0.565, trial_mean_length = 56.4 ticks
→ 75% of trials reach goal; mean trial length 56 ticks << 201 timeout.
End of M1-M6. The trial-redesign is end-to-end functional:
M1b: Per-episode-reset PE handles multi-episode-per-row attention
M2: trial_ended_this_step buffer plumbed Python ↔ C
M3: GOAL_TRIAL=3 implements per-trial logic in c_step
M4: Per-trial Log fields exposed via vec_log
M5: HumanReplayEvaluator emits trial_X_score, ada_delta_trial_K_minus_0
M6: Subprocess eval forwards goal_behavior; INIs register the new flags
Known limitation (documented in the chat): episodes can span buffer
segment row boundaries; the spillover slots in the next row are
trained under truncated context vs full-cache rollout context. PPO
clip damps but doesn't block the resulting gradient mismatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drive (drive.h + drive.py):
- Option D: under goal_behavior=3, when an agent's trial_count reaches
max_trials_per_episode, fire terminals + add_log_one_agent + mark agent
as removed=1 and move it off-grid (INVALID_POSITION). Do NOT call
c_reset; the agent idles until Python's resample_frequency triggers
_reinit_envs_with_new_maps. Removes the 1-map-many-episodes pathology
that was driving over-fit scores.
- add_log_one_agent: per-agent variant of add_log used at variable-length
trial-mode episode ends. Mirrors all per-entity state c_reset would
reset (respawn_timestep, current_goal_reached, metrics_array, etc.) so
the next cycle starts clean.
- current_goal_reached=1 set in the goal-reach else branch (GOAL_STOP /
GOAL_TRIAL share that path); reset to 0 in respawn_agent so the next
trial can register a fresh goal-reach. Without this, every in-radius
tick incremented goals_reached_this_episode and the score metric was
meaningless.
- move_expert loops the recorded trajectory under GOAL_TRIAL (t %
array_size) so static experts don't vanish past scenario_length.
- Trial overlay in c_render_with_mode: 'Trial X / K' instead of scenario
counter.
Drive Python (drive.py):
- self.truncations[:] = 0 at the top of step so trial_ended_this_step
-> truncations mirroring stays per-step.
- Mirror trial_ended_this_step -> truncations under gb=3 (carries the
trial-boundary signal through the SHM buffer to pufferl for GAE
bootstrap-stop without triggering KV-cache reset).
- Per-scenario block gated off under gb=3 (variable-length trials would
land mid-trial).
- At resample_frequency boundary under gb=3, force a vec_log emission
with num_agents=1 so slow agents that didn't finish max_trials still
flush their metrics before _reinit_envs_with_new_maps zeros env->log.
adaptive.py:
- Under goal_behavior=3, auto-link max_trials_per_episode = k_scenarios
and per_trial_timeout = scenario_length, matching the mental model
'k_scenarios trials of scenario_length each.'
pufferl.py:
- done_mask = d (was d + t) so cache reset gates on terminals only.
Trial boundaries (truncations) no longer wipe KV cache, which is the
whole point of adaptive training across trials.
- GAE bootstrap_stop = (terminals + truncations).clamp(max=1.0): use
truncations as the bootstrap-stop channel so V[t+1] from the
post-respawn trial does not pull into the value target of the last
step of the old trial.
- truncations persisted into the rollout buffer (parallel to terminals)
so GAE can read the boundary mask.
- Debug logging: PUFFER_TRIAL_DEBUG_FILE env var enables per-event JSON
trace of cache resets, GAE inner/outer, scenario boundaries, etc.
render.py + utils.py:
- Forward --goal-behavior + --max-trials-per-episode + --per-trial-timeout
CLI flags to env init.
- max_steps default under gb=3 = max_trials * per_trial_timeout (the
worst-case episode budget) instead of k_scenarios * scenario_length.
- Video basename uses 'trials{N}' label under gb=3.
rollout.py:
- max_steps default under trial mode matches drive.py / render.py.
- Break when terms.all() so trial-mode rollouts stop at episode boundary
rather than after exactly k_scenarios * scenario_length ticks.
- info dict carries _trial_starts for downstream consumers.
Tests (8 new + 1 updated):
- test_gae_trial_boundary.py: GAE bootstrap-stop fires on truncations
too, not just terminals.
- test_gae_decoupling_integration.py: end-to-end trial_ended_this_step
-> truncations mirror.
- test_adaptive_trial_link.py: auto-link of max_trials/per_trial_timeout
under gb=3.
- test_trial_standard_metrics.py: episode-length, score, offroad,
collision populate under gb=3 via add_log_one_agent.
- test_trial_per_scenario_gate.py: per-scenario logic gated off under
gb=3.
- test_trial_score_semantics.py: score uses max_trials denominator.
- test_trial_overcounting_fix.py: current_goal_reached gates
goals_reached_this_episode increments.
- test_rollout_trial_mode.py: rollout max_steps / break / info match
trial mode.
- test_goal_trial.py: updated test_trial_episode_resets to match Option
D (one episode per resample_frequency; agent idles between).
All 53 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trials 2..K rendered empty: respawn_agent sets respawn_timestep = env->timestep (drive.h:2685), and seven downstream gates treat respawn_timestep != -1 as a "post-respawn ghost" marker, including the 3D mesh draw at drive.h:3482 (visible symptom), the collision checks at 1327/1342, ego obs[6] at 2409, and other-car obs at 2455/2457. The mid-episode trial-respawn branch needs to clear the flag back to -1 since GOAL_TRIAL is not a ghost-fade mode. Single-line fix in the GOAL_TRIAL else-branch of c_step. End-of-episode (Option D: removed=1, off-grid, terminals+add_log) is unaffected — that branch never calls respawn_agent. GOAL_RESPAWN's intentional ghost-render semantic is preserved (different branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arios>2
Symptom: eval logged only trial_0_score, trial_1_score, and
ada_delta_trial_1_minus_0 — missing trial_2/trial_3 even though the
env was actually running 4 trials (k_scenarios=4 with the adaptive.py
auto-link).
Cause: evaluator.py:658 read max_trials_per_episode straight from
args['env'] (= the INI default of 2). The auto-link in
AdaptiveDrivingAgent.__init__ updates the env's attribute but not the
args dict the eval inherits.
Fix: prefer puffer_env.driver_env.max_trials_per_episode (the actual
env value after auto-link); fall back to k_scenarios when args says the
INI default of 2. Same path for per_trial_timeout.
Test: tests/test_evaluator_trial_mode.py::test_trial_mode_auto_link_k4
exercises the k_scenarios=4 / args.max_trials=2 case and asserts
trial_0..3 scores + ada_delta_trial_{1,2,3}_minus_0 all appear.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/src/trial_mode.md — full GOAL_TRIAL design doc: - Two-boundary problem (terminals vs truncations) with the 4×4 table - PPO/GAE bootstrap-stop formula (LaTeX) - KV cache reset gate - Option D semantic (idle-after-max_trials) with rationale - Auto-link of trial parameters (current behavior, to be simplified) - End-to-end signal flow (C → SHM → Python → pufferl → GAE) - Score semantic + threshold ladder - The 7 render gates on respawn_timestep != -1 - Per-trial metrics + evaluator's trial_K_score breakdown - Test coverage map tests/test_render_contract.py — programmatic visibility tests: - gb=3 k=4: ego obs[6] must clear within 2 steps of mid-episode trial respawn (else 3D mesh draw at drive.h:3482 hides ego in trials 2..K) - gb=3 k=2: same for the smallest meaningful trial setup - gb=0: ghost-fade semantics preserved (non-regression) - gb=3 k=4 end-of-episode: clean idle state, no 'invisible-but-onscreen' Run before any drive.h / drive.py / render.py change. Catches the M7-fix bug class without spinning up Xvfb + ffmpeg. Wired trial_mode.md into mdBook SUMMARY under new 'Design' section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix, per-trial breakdown only appeared in eval-time
HumanReplayEvaluator output (every 40 epochs); training-time wandb was
blind to in-episode adaptation. Now every report_interval ticks the env
emits:
- trial_0_score, trial_1_score, ..., trial_7_score
- ada_delta_trial_1_minus_0, ..., ada_delta_trial_{k-1}_minus_0
Implementation:
- drive.h Log: added trial_k_goal_reached[N_TRIAL_K_SLOTS=8]. Incremented
in c_step's GOAL_TRIAL trial-end branch for ego only, at index k =
trial_count - 1 (the just-completed trial).
- binding.c my_log: emits trial_K_score keys, but only if
n_trials_completed > 0 (gates on actual gb=3 activity so gb=0/1/2
output stays clean).
- drive.py _inject_trial_deltas: post-processes the emitted log dict to
add ada_delta_trial_K_minus_0 = trial_K_score - trial_0_score for
K in 1..max_trials-1. Called only under goal_behavior == 3.
Tests: tests/test_ada_delta_train_logging.py (4) — trial_K keys emit,
ada_delta keys appear in info[], values match the subtraction, gb=0
doesn't leak trial keys.
All 16 trial-mode test files pass (61 tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l_timeout
Drops the parallel knob surface (max_trials_per_episode +
per_trial_timeout). Under gb=3 the auto-link is now an invariant, not a
heuristic with INI-default detection. To change the trial count, change
k_scenarios.
Surface changes:
- INI: removed `max_trials_per_episode` and `per_trial_timeout` lines
from drive.ini + adaptive.ini.
- render.py: dropped --max-trials-per-episode + --per-trial-timeout CLI
flags; max_steps default = k_scenarios * scenario_length under all
goal_behaviors.
- pufferlib/utils.py: subprocess-eval no longer forwards the two flags
(the env derives them); video-naming uses k_scenarios.
Internals:
- adaptive.py: auto-link is unconditional under gb=3:
kwargs["max_trials_per_episode"] = k_scenarios
kwargs["per_trial_timeout"] = scenario_length
Removed the "if user passed default of 2 vs override" detection.
- evaluator.py: simplified to read from driver_env (still falls back to
k_scenarios / sim_steps).
- C-side struct keeps max_trials_per_episode + per_trial_timeout fields
(used internally); the only consumers now are the C trial-end branch
and add_log_one_agent's denominator. No external API surface.
Tests:
- test_adaptive_trial_link.py: replaced test_explicit_user_override_wins /
test_explicit_per_trial_timeout_wins with their negations — under gb=3
any override is ignored, k_scenarios / scenario_length win. Documents
the new invariant.
Design doc: simplified the "Auto-link" section and the quick-reference
knob table to reflect the smaller surface.
All 16 trial-mode test files pass in isolation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix, Python mirrored trial_ended_this_step → truncations after every vec_step. Two writers (C: zero, write at trial-end; Python: zero, mirror) made the buffer write-order-sensitive. Now C does everything; Python is read-only under gb=3. C side: - drive.h Env struct: added `unsigned char *truncations`. - env_binding.h: uncommented env->truncations wiring from positional arg (was wired for all sibling buffers, just not this one). Only Drive uses this header so no cross-env risk. - drive.h c_step: zeroes env->truncations at top under gb=3; writes 1 at each trial boundary in the GOAL_TRIAL branch (alongside trial_ended_this_step). Both signals fire on the same agents. Python side (drive.py): - step(): no longer zeroes truncations under gb=3 — C handles it. - step(): dropped the trial_ended_this_step → truncations mirror block. - Under non-trial modes (gb=0/1/2), Python still owns truncations: zeroes it at top of step (so the k_eff-curriculum scenario-boundary write at drive.py:1202 keeps working). Tests: tests/test_truncations_ownership.py (3): - C zeroes truncations each step under gb=3 (Python-side pollute is wiped). - C writes 1 simultaneously to truncations and trial_ended_this_step at trial boundary. - Non-trial mode untouched by C. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removed narrate-the-code comment blocks I added during the trial-mode build-out; kept only why-comments (non-obvious constraints, gotchas, references to specific bug fixes). drive.h: - Trimmed add_log_one_agent header from 6 lines to 3. - Collapsed the per-trial-K-slot doc to 2 lines. - Collapsed the score-threshold-ladder rationale to 4 lines. - Trimmed Option D / mid-trial-respawn ghost-flag blocks; cross-refs the design doc for the long form. - Added the goal-reach invariant block (single source of truth for goals_reached_this_episode gating). drive.py: - Trimmed trial_ended_this_step field doc from 5 lines to 2. - Trimmed per-scenario-gate rationale and resample_frequency block. pufferl.py: - Cache-reset gate explanation: 7 lines → 4. - Truncations persistence: 6 → 2. - GAE bootstrap-stop: 6 → 2. All 17 trial-mode test files pass in isolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lightweight categorization in tests/README.md. Avoids 17 file moves + sys.path rewrites — the suite is already stable, so a doc-only split serves the same intent (signal which tests are load-bearing design invariants vs paper-trail bug locks) without code churn. Records the going-forward rule: write contract tests BEFORE implementation; only add regression tests when fixing real bugs. Also documents the multi-Drive pytest-segfault workaround (run each file in its own invocation) and links to the design doc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces per-agent trial clocks with a single env-level trial clock.
Under gb=3:
- On ego goal-reach mid-trial: ego goes off-map (entity.removed=1,
INVALID_POSITION, vx=vy=0). No truncations / terminals — wait for
env trial-end.
- env trial-end fires when ALL active egos in env have removed=1 OR
env per_trial_timeout elapses. On trial-end, ALL entities (egos +
co-players) reset to init position; env_trial_start updates; env
trial_count++.
- env episode-end (env_trial_count == max_trials): terminals fires for
every active ego; Option D applies (removed stays, off-map until
c_reset). env_episode_ended sentinel prevents repeat trial-ends until
c_reset.
Why: prior per-agent semantic broke the "trial 1 == trial 2 except cache"
invariant — co-players and recorded humans ran on their own clocks, so
trial 2's world differed from trial 1's world. Adaptation Δscore =
trial_K - trial_0 conflated cache effects with stochastic traffic
differences. B'' guarantees trial-conditions identity.
C side (drive.h):
- Env struct: added env_trial_count, env_trial_start_timestep,
env_episode_ended.
- Env struct: added `unsigned char *removed` SHM buffer (per-agent
off-map flag).
- c_reset: zeroes the env-level trial state + per-agent removed.
- Goal-reach else-branch: under gb=3, set entity.removed=1 + off-map
instead of stopped=1.
- c_step GOAL_TRIAL block: rewritten to env-level trial-end detection
+ batch reset. Gates on !env_episode_ended.
- move_expert: uses env_trial_start_timestep for the replay clock, so
recorded humans rewind to frame 0 at each env trial-end.
Python side:
- drive.py: `self.removed` SHM buffer, wired through env_init kwargs.
- binding.c: kwargs read for `removed` (both my_shared_self_play and
my_init paths).
Tests: tests/test_env_level_trial.py (6 new contract tests):
- removed buffer exists + zero at reset
- env trial-end fires on timeout when no ego reaches
- ego goes off-map on goal-reach
- env trial-end resets all entities to init (mid-episode, k>=3)
- episode-end fires after max_trials → Option D removed=1 stays
- truncations NOT fired on individual reach (only at env trial-end)
All 17 prior trial-mode test files still pass in isolation. Pufferl-side
KV cache freeze (so off-map limbo doesn't pollute the transformer cache
across trial boundaries) is the next step — gated on vecenv.recv()
integration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/debug_b_demo.py walks through:
1. env-level trial state evolution (removed flag, trunc, term)
2. move_expert clock reset at env trial-end (verifies humans rewind
to frame 0 by checking obs[20:] is identical at trial 1 tick 1 and
trial 2 tick 1)
3. per-trial-K metrics in vec_log (trial_K_score for K in 0..k-1)
4. KV cache freeze design (documents the pufferl-side integration
that's still pending)
Output is human-readable ASCII traces — for visual inspection, not
unittest assertions. Run: python scripts/debug_b_demo.py
…t reset Two bugs from the user's render of B''-on-pre-B''-checkpoint: 1) Trial overlay always showed "Trial 1 / 4" all four trials. Cause: overlay read per-entity `trial_count` (no longer maintained under B'' — env-level `env_trial_count` is the truth). Switched overlay to env-level counter and clamped to max_trials_per_episode. 2) Recorded humans / static experts vanished in trials 2..K. Cause: my B'' move_expert reset (t = timestep - env_trial_start) replays the same EARLY frames of recorded-human trajectories each trial. Many nuplan agents have late-valid windows (enter the scene mid-trajectory) — short trials never reach those valid frames, so the agents are invisible. Reverted move_expert to OLD looping `t = timestep % array_size`, which keeps humans continuously playing across trials. Trade-off: humans drift slightly across trials (different frame in trial 2 than trial 1). This violates the strict "trial 1 == trial 2" invariant for static experts. For training this is acceptable (humans are stochastic background). For paper-grade human_replay eval (1 ego per env) where strict equivalence matters, the eval path will need to explicitly reset move_expert — captured as a TODO in the function comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per user: "yeah whatever the humans do whatever they want at the timestep. We dont care about what humans do, only what egos do. when trial 2 starts, we reset the humans." Restores the env-relative move_expert clock so humans (recorded experts) restart at frame 0 every env trial-end. Trade-off: if humans have late-valid windows and trials end fast (cache makes ego reach quickly), humans may not appear in those trials — that's faithful to the data, not a bug. Strict trial-equivalence on the ego side is what matters for the adaptation experiment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
…+ respawn from init_steps Two bugs caused trial 1 to differ from trial 2 even after the env-relative move_expert clock was in place: 1) respawn_agent put active entities at traj[0]. But c_reset's set_start_position uses traj[init_steps]. So trial 1 started at recorded frame `init_steps`, trial 2..K started at frame 0 — different recorded poses, different velocities, different lane alignment. Fix: respawn_agent now uses init_steps (clamped to array_size). 2) move_expert mapped (timestep - env_trial_start) to recording frame directly, ignoring init_steps. Static experts were at frame 1 at trial K tick 1, but at frame init_steps+1 at trial 1 tick 1 (because c_reset put them at init_steps then move_expert advanced one tick). Fix: t = init_steps + (timestep - env_trial_start). 3) Even with (1) and (2), the per-entity reset in the trial-end loop didn't touch static agents or rest various e->state fields (collision_state, valid, dynamics integrators). Replaced the per-active respawn_agent loop with a single set_start_position(env) call at trial-end, which resets the EXACT same state c_reset does for every entity in the env. Strict bit-for-bit equivalence verified: trial1 vs trial2 obs differ in 0/1850 dims, max diff = 0.0. Now trial K == trial 1 except for ego's KV cache, as designed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
scripts/trace_b_render.py prints per-step: tick, env trial counter, rem count (egos off-map mid-trial), trial_ended_this_step / truncations / terminals counts, simulated KV cache write position, and event highlights (GOAL-REACH, TRIAL-END, EPISODE-END). Useful for visualizing B'' dynamics alongside the rendered mp4. Verified empirically that the same agents reach goal at the same RELATIVE tick across all 4 trials (agent 33 at relative tick 26 in trials 1-4, agent 42 at relative tick 116 in trials 1-4) — confirms strict trial equivalence. cache_pos is a simulation showing what the transformer's cache write position would do; once task #33 wires the actual pufferl freeze for removed agents, the displayed value will match the real cache.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
bug_003: AdaptiveDrivingAgent.__init__ now asserts k_scenarios <= 8 under gb=3 so the silent-truncation failure mode for per-trial metrics (trial_k_goal_reached[8] in drive.h is fixed-size) is loud. Pre-fix, k=9+ would lose trial_8_score from training-time wandb but eval would still report it — diverging signal. With the assert, the user is told to either bump N_TRIAL_K_SLOTS or pick a smaller k. bug_002: self.removed docstring claimed "pufferl uses this to freeze the KV cache during the off-map limbo" but pufferl has zero references to the buffer — it's reserved for task #33. Reworded the comment. Both flagged as nits by ultrareview, batched in one commit.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Without this, when an ego reaches goal mid-trial and goes off-map (removed=1), pufferl keeps invoking the transformer with that agent's garbage INVALID_POSITION observations. Each invocation appends a new K/V token to that agent's cache. Over a 174-tick limbo period, ~87% of the cache becomes garbage; future trials' attention is dominated by junk tokens. Fix: per-agent attention-mask exclusion of garbage cache slots. models.py TransformerWrapper: - init_eval_state adds state["garbage_mask"]: (B, horizon) bool. - forward_eval reads state["removed"] (B,) bool from pufferl. The attention mask becomes (B, 1, 1, horizon) per-agent = (slot <= current_pos) AND (not garbage_mask). After cache write, garbage_mask[:, slot_t] |= removed marks the just-written slot as garbage for off-map agents. - reset_eval_state zeros garbage_mask alongside k_cache / v_cache, both for full-env reset and per-agent done_indices reset. pufferl.py: - New self.transformer_garbage_mask dict (per state_key, mirrors k_cache layout). Lazy-allocated by the model on first forward_eval. - Before forward_eval: state["garbage_mask"] = ..., state["removed"] = view of vecenv.driver_env.removed sliced by env_id. - After forward_eval: persist state["garbage_mask"] back. - On done_mask reset (terminals): clear garbage_mask rows alongside k_cache / v_cache rows. Tests (tests/test_cache_freeze.py, 4 contract tests): - Slots written while removed=1 have zero softmax weight in next step (verified via _probe_attention path that captures attention weights per layer per agent). - Other agents' attention to the same slots is unaffected. - Full reset (done_indices=None) zeros garbage_mask. - Partial reset (done_indices=[a, b]) zeros only those rows. - Without state["removed"] (non-gb=3 modes), no slots are marked garbage — backward-compat preserved. All 19 trial-mode test files pass. Multi-env caveat: pufferl reads vecenv.driver_env.removed, which works in single-Drive setups (render). For multi-worker training the buffer needs to be unified (SHM); driver_env.removed currently is one Drive's view. Captured as follow-up — same mechanism that propagates terminals across workers needs to do removed too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Previously `self.removed = np.zeros(num_agents)` was allocated as a private
numpy array per Drive instance. In Multiprocessing this means each worker's
`removed` writes were invisible to the main process — `driver_env.removed`
on the parent was just a metadata stub that no worker ever wrote to.
Effect: garbage_mask was never populated in training, so the cache-freeze
mask did nothing.
Plumb `removed` through SHM the same way `terminals`/`truncations` are:
- vector.py Multiprocessing: allocate `removed=RawArray("b", num_agents)`
in self.shm; expose `self.removed = self.buf["removed"].ravel()` for
pufferl. _worker_process maps the worker's row into buf["removed"].
- vector.py Serial: allocate `self.removed` (or accept from outer buf)
and slice per env into buf_i["removed"].
- drive.py: read `buf["removed"]` if present, fall back to local alloc.
- pufferl.py: read unified `vecenv.removed` (works for both backends),
falling back to driver_env.removed for compat.
Verified: 4-worker Multiprocessing test shows worker-side writes
(bit counts [1,2,1,1]) visible from the main process via SHM after one
step. Serial driver_env.removed shares memory with vecenv.removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…O loss Eval forward (`forward_eval`) already excluded off-map limbo cache slots via `garbage_mask`. Training forward (`forward` → `create_episode_mask`) used only `terminals` and saw limbo tokens unmasked. The PPO loss also ignored `masks` (per the pre-existing `# Note: We are not yet handling masks in this version` at pufferl.py:798). Result: the policy attended to garbage during training AND limbo tuples (zero reward, post-reset or INVALID_POSITION obs) contributed gradient to pg_loss/v_loss/entropy. This commit closes the train/eval mismatch: - pufferl.py: add removed_history rollout buffer; capture vec.removed per step alongside terminals/truncations; pass mb_removed into the training forward via state["removed"]; weight pg_loss, v_loss, entropy_loss and diagnostic stats (approx_kl, clipfrac) by valid=~removed, normalized by sum(valid). - models.py: in the training `forward` else-branch (use_episode_mask), add (B, 1, T) limbo bias of -inf to the attention mask on the SOURCE axis. attn_mask[b, t, s] += -inf if removed[b, s], for all queries t. Matches eval-side `garbage_mask` semantics exactly. Verified with tests/test_train_eval_parity.py (out-of-tree): - output at non-limbo positions is bit-invariant to limbo obs values - output WITHOUT the mask leaks (Δ=0.57 confirms the prior bug) - gradient at limbo input positions is exactly 0 with full gating Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit agent flagged: my earlier limbo_bias masked the diagonal (t==s), so a query whose causal prefix is entirely limbo had ZERO valid sources → softmax(all -inf) = NaN. NaN then poisons the (now-masked) per-step losses via 0 * NaN = NaN. Fix: leave attn_mask[b, t, t] = 0 always. This matches eval semantics exactly — garbage_mask[a, slot_t] is set AFTER the forward at step t, so a limbo query during its own forward attends to itself (just as it would on the eval side). Other limbo SOURCES still get -inf from any non-self query, which is the train/eval parity we want. Test 4 in /tmp/test_train_eval_parity.py: constructs removed=[1,1,1,0,...] and confirms no NaN. All prior parity tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPO audit agent flagged three remaining holes in the previous train/eval parity patch (commit e396507): 1. **self.values[idx] = newvalue.detach()** (line ~1112) unconditionally wrote the value head's output at limbo positions back into the buffer. At limbo positions newvalue is computed from INVALID_POSITION obs and is garbage; the next outer GAE call (next minibatch) read it and propagated the poison. Gated: keep prior value at limbo positions. 2. **self.ratio[idx] = ratio.detach()** (line ~1016) cached limbo importance ratios. Outer GAE's v-trace uses these as rho/c coefficients. Limbo ratios are from logprob(garbage_obs, garbage_action) and pollute the next minibatch's advantage chain. Gated identically. 3. **bootstrap_stop** at both outer GAE (line ~900) and inner recomputation (line ~1049) only included `terminals + truncations`. Now also includes `removed_history.float()`, so V(t+1) bootstrapping stops at every limbo step. Otherwise the value at a healthy step t-1 would still bootstrap into V(garbage obs at t) when t is limbo. Together with the in-loop loss gating from e396507, this fully contains limbo poisoning: limbo positions contribute zero gradient AND zero information to the GAE chain AND don't poison the value/ratio buffers across PPO epochs. All existing tests pass: - /tmp/test_train_eval_parity.py (4/4 tests) - tests/test_cache_freeze.py - tests/test_env_level_trial.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r ckpt Two targeted fixes from the audit pass: 1. drive.h:447-450 — `env->log.goals_reached_this_episode` and `goals_sampled_this_episode` were accumulated OUTSIDE the `if (e->is_ego)` guard inside `add_log_one_agent`. With population_play this meant co-player goal counts contaminated the ego aggregate, biasing the `completion_rate` metric upward (and indirectly any downstream consumer). Moved the two increments inside the ego guard. 2. vector.py: module-level cache `_CO_PLAYER_STATE_DICT_CACHE` keyed by (checkpoint_path, mtime). Render epochs were calling `pufferlib.vector.make` and reloading the (~50MB) co-player checkpoint from disk each time, contributing to the OOM seen at first-eval/render epochs. Cache invalidates if the file is replaced. Training path also benefits (only relevant if you construct multiple vec envs in one process). c_reset-at-episode-end was prototyped, then reverted after discussion: existing `bootstrap_stop ∨ removed` + loss mask already neutralizes the limbo gap's gradient damage; forcing c_reset would repeat the same map ~7× per resample window and reduce data diversity. Trade-off not worth the 5-10% wall-clock win. All tests pass: - /tmp/test_train_eval_parity.py (4/4) - tests/test_cache_freeze.py - tests/test_env_level_trial.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PPO loss-gating patch broadcast `valid_mask` (B, T) against `entropy` (B*T,) — sample_logits returns entropy flat, not 2D. RuntimeError: tensor size mismatch (20100 vs 402). Fix: flatten valid_mask once (`valid_mask_flat`) and use it for the entropy term in both the entropy_conditioned and the standard branch. Normalize by the flat sum. pg_loss / v_loss are unaffected (they use (B, T) tensors throughout). All parity tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…outing
* tests/inspect_system.py: end-to-end system probe (env + policy + KV cache
+ garbage_mask + attention) producing per-tick JSONL, npz snapshots, and
rendered mp4 for each of {coplayer, human_replay, ego_only} modes.
* tests/plot_attention.py: heatmaps + per-head bar plots + limbo-mask
verification plots from inspect_system outputs.
* scripts/adaptive/nuplan_transformer_local_k2_201_gb3_{2,4}partners.sh: the
k=2 gb=3 sweep launchers we've been iterating on (nw=16 for 2 partners,
nw=8 for 4 partners; eval shrunk to 64 maps to keep memory under cap).
* scripts/adaptive/nuplan_transformer_local_k4_201_gb3_2partners.sh: k=4
variant with nw=8 (k=4 doubles pinned RAM so nw drops for 2-way parallel).
* pufferlib/utils.py: render_videos honors eval.map_dir if set, so renders
match the eval distribution (e.g. nuplan_hard) rather than training maps.
* pufferlib/ocean/drive/drive.h: ffmpeg render fps lowered to 10 (matches
env dt=0.1; previously 30 was 3x wall-clock playback).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Repo-local scores CSV (611 KB) so a fresh clone can run build_nuplan_hard.py without copying the file separately. Default --scores path updated to scripts/nuplan_201_hardness_scores.csv. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Forked from scripts/adaptive/nuplan_transformer.sh with:
* gb=3 GOAL_TRIAL flags (--env.goal-behavior 3 + k_scenarios + scenario_length)
* nw=32 (cluster has more RAM than local container; per-task mem=256GB
covers 181 GiB pinned + 30 GiB overhead + 15 GiB eval-spike margin)
* cpus-per-task=40 (nw=32 + 8 overhead)
* minibatch_multiplier=25 → keeps minibatch_size=20100 at horizon=804
* eval shrunk to 64 maps for the eval subprocess spike
* PYTORCH_CUDA_ALLOC_CONF=expandable_segments + WANDB_MODE=online forced
* --array=0-1 for 2 partners (miku2puk @ 0.05, 2e029h15 @ 0.10)
Submit with: sbatch scripts/adaptive/cluster_nuplan_transformer_k4_gb3.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Was 0-1 (2 partners, 1 seed each). Now array=0-11, indexed as: TASK_ID = partner_idx * 3 + seed_idx partners = [miku2puk@e_ub=0.05, 2e029h15@0.10, m2ygolog@0.20, 6rauydj2@0.50] seeds = [42, 43, 44] Each task gets one GPU + 256GB + 40 CPUs (cluster handles the 12-way parallelism via the array). 2B steps per task → ~5h each at nw=32. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Validates the cluster setup before submitting the 12-run sweep: singularity + venv + binding load, nuplan_201/nuplan_hard binaries present, partner ckpt loads, co-player + ego forward pass works, wandb sync online, no OOM/NaN at first dashboard frame. Resources: 48GB / 8 CPUs / 1 GPU / 30 min. ~3-4 min wall-clock once running. Submit with sbatch, watch /scratch/.../smoke_<jobid>.out, scancel as soon as Steps advances. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Force-adds the 4 partner checkpoints (gitignored experiments/ dir): puffer_drive_miku2puk.pt (e_ub=0.05) puffer_drive_2e029h15.pt (e_ub=0.10) puffer_drive_m2ygolog.pt (e_ub=0.20) puffer_drive_6rauydj2.pt (e_ub=0.50) Used by cluster_nuplan_transformer_k4_gb3.sh (the 4×3-seed array sweep) and the local k=4 2-partner launcher. 5 MB × 4 = 20 MB one-time cost. Also bumps the local k=4 2-partner launcher from nw=8 to nw=10 (after container memory.max increased from 132 GiB to 154 GiB). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Previous cluster smoke OOM'd at startup despite nw=4. Reducing: --mem 48GB → 64GB (more CPU headroom) --train.minibatch-multiplier 25 → 4 (minibatch 20100 → 3216, 6x less GPU memory) --train.max-minibatch-size 20100 → 3216 --train.total-timesteps 5M → 2M (finish faster) --time 30min → 20min This matches the config of the first 50M smoke on the vast box that ran cleanly end-to-end. If THIS still OOMs, the issue is cluster container limits, not our minibatch sizing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Rewrite `_fill_external_co_player_actions` to handle vec.batch_size > 1. When the env is launched with `--vec.batch-size=$NUM_WORKERS`, recv() returns all workers' obs stacked, and this function now does a single batched GPU forward over the combined co-player batch instead of launching one forward per worker. Mechanism: - Parse N workers' co_ids + reset flags from info[] - Concatenate per-worker co_obs slices into one batched tensor - Merge per-worker KV caches along batch dim for the forward - Single forward_eval call - Split updated state back to per-worker stores - Scatter actions to each worker's SHM action buffer slot When batch_size == 1 the new code is a no-op equivalent of the old: n_in_batch=1, no concat/split work, single worker processed. Smoke result at nw=4 (RTX 4090): SPS 10.2K -> 32.5K (3.2x). At cluster nw=32 the per-step Python overhead reduction should give ~2-2.5x SPS (16h -> ~6-8h per partner-seed run). To enable, add to sbatch: --vec.batch-size <NUM_WORKERS> Without that flag the code path is unchanged (batch_size=1 default). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Two fixes that remove the two main triggers preventing torch.compile from
producing a clean, stable forward_eval graph:
1. models.py: garbage_mask update was using a 0-d scalar tensor index
(slot_t.squeeze()), which causes Dynamo to call aten._local_scalar_dense
and graph-break every call. Replaced with a functional broadcast:
slots = torch.arange(self.horizon, device=device)
position_mask = slot_t.view(-1, 1) == slots.view(1, -1)
garbage_mask = garbage_mask | (r.unsqueeze(1) & position_mask)
No semantic change — same final values, no scalar extraction.
2. pufferl.py: self.amp_context was being entered/exited per recv()
iteration inside the eval loop. Dynamo sees autocast GLOBAL_STATE flip
between calls and recompiles forward_eval each step. Hoisted the
amp_context to wrap the entire rollout loop with explicit enter/exit.
Smoke at nw=4 (RTX 4090, default compile mode):
- opt432 (#2 only): 1.2M steps in 79s, Eval.Forward 17s
- opt432 + Path A: 1.2M steps in 65s, Eval.Forward 10s
- Wall-clock speedup: ~18% (Forward time dropped ~41%)
This does NOT unlock CUDA graphs (compile_mode=reduce-overhead still
crashes on state-dict mutation/lazy alloc). The win comes from a clean
single compile of forward_eval that doesn't re-fire each step.
Stacks multiplicatively with #2. Projected at cluster nw=32:
7h post-#2 -> ~5.9h post-Path A.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
No description provided.