Mohit/trial episode redesign by m2kulkarni · Pull Request #35 · Emerge-Lab/Adaptive_Driving_Agent

m2kulkarni · 2026-05-15T10:30:14Z

No description provided.

Under multi-episode-per-row rollouts (planned for the trial-redesign), training applies PE by slot-in-segment while rollout applies PE by slot-in-cache (which resets at episode boundary). Same logical step gets different PE values in train vs eval — distribution shift. Removing PE entirely sidesteps the mismatch by construction. Five sites patched in models.py: buffer registration removed, get_positional_embedding helper removed, forward_eval / _forward_eval_legacy / forward / _prime_kv_cache all updated to skip the PE add. _prime_kv_cache primes with zeros instead of pos_embed. Backwards-compat for existing PE-baked checkpoints: - vector.py partner load filters `positional_embedding` from state_dict - pufferl.py:load_policy filters the same key + autodetect heuristic now also accepts any `transformer.layers.*` key as a Transformer marker (since legacy detection relied on the PE buffer's presence). Tests: tests/test_transformer_kv_cache.py PASS (cached vs legacy forward_eval still bit-identical at fp32). Smoke train: 3 epochs at ~100K SPS, entropy declining, no NaN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reverts the NoPE removal from 8237688. Brings back nn.Parameter positional_embedding (matches pre-eb2d7fe2 behavior). Adds compute_pos_within_episode static method that computes per-slot position-within-episode from a (B, T) terminals tensor, used at training time to align PE indexing with rollout's cache-pos indexing. Why: under multi-episode-per-row rollouts (planned for trial-redesign), forward_eval resets pos to 0 at every episode boundary via pufferl.py's done handling. So forward_eval applies pe[pos_within_episode] for the first step of each new episode in the cache. forward() (training) was applying pe[slot_in_segment] across the whole row regardless of boundaries — different absolute positions for the same logical step. This silently corrupted gradients under multi-episode-per-row. Formula (vectorized cummax trick): arange = [0..T-1] shifted = pad(terminals[:, :-1], (1, 0)) # right-shift terminals starts = arange * shifted # mark episode-start slots ep_start = starts.cummax(dim=1).values # propagate forward pos_in_ep = arange - ep_start # = 0 at each episode start Convention matches create_episode_mask: terminal slot belongs to OLD episode; new episode starts at slot terminal+1. Tests added: tests/test_pos_within_episode.py — 9 unit tests, all passing tests/test_pe_train_eval_consistency.py — 3 integration tests proving forward() output bit-matches step-by-step forward_eval (with manual cache reset at episode boundaries) for single-, 2-, and 3-episode-per-row segments. Existing tests/test_transformer_kv_cache.py still PASS (cached vs legacy forward_eval still bit-identical). Smoke train: 3 epochs in 2min, SPS ~100K, entropy stable, no NaN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep the per-episode-reset gather logic in forward() unchanged — only the PE buffer initialization swaps from nn.Parameter (std=0.02 init) to register_buffer with Vaswani sinusoidal values. Per-episode reset applies regardless of PE flavor. Tests still pass: - tests/test_pos_within_episode.py (9/9) - tests/test_pe_train_eval_consistency.py (3/3) — train==eval bit-close - tests/test_transformer_kv_cache.py (3/3) — cached vs legacy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…opy) Per-agent flag intended to be set in c_step under goal_behavior=GOAL_TRIAL (coming in M3) when a trial ends — distinct from `terminals`, which fires only at the EPISODE boundary (after max_trials_per_episode trials). End-to-end plumbing: drive.h: add `unsigned char *trial_ended_this_step` to Drive struct; calloc/free in standalone allocate()/free_allocated(); memset to zero at the top of c_step (guarded on != NULL). binding.c: parse `trial_ended_this_step` kwarg in both my_init (initial env_init) and my_put (re-init via _reinit_envs_with_new_maps); OPTIONAL — older callers without the kwarg get NULL pointer and the c_step memset is no-op'd. drive.py: allocate self.trial_ended_this_step = np.zeros(num_agents, dtype=bool); pass slice via kwargs to env_init at both call sites (initial + reinit). env_binding.h is NOT modified — the generic vec_init 6-tuple (obs, action, reward, term, trunc, seed) remains intact, so this change doesn't ripple to other ocean envs. The recv() contract is unchanged; trial_ended_this_step is exposed as a Python-side attribute, not part of the standard step return. Tests (tests/test_trial_ended_buffer.py): 3/3 PASS 1. Drive.__init__ exposes the buffer with correct shape/dtype 2. C zeros the buffer in c_step (proven by Python-side pollution) 3. 50 sequential steps survive with buffer still bound Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New behavior, fully gated on goal_behavior == GOAL_TRIAL (=3). All other goal_behavior values (0/1/2) are unaffected. Mechanics: - Each agent runs trials. A trial ends on goal-reach OR per-trial timeout (`per_trial_timeout` ticks since trial start; default scenario_length). - At trial end: trial_ended_this_step[i]=1, trial_count++, agent respawned to traj[0], trial_start_timestep updated. - When trial_count == max_trials_per_episode (default 2), terminals[i]=1 fires for that agent (episode boundary), trial_count reset to 0. - The c_step early-return at scenario_length is suppressed under GOAL_TRIAL — episode boundaries come from the trial logic instead. Drive.h struct additions: - Entity.trial_count, trial_start_timestep (zero under non-TRIAL paths) - Drive.max_trials_per_episode, per_trial_timeout (config) Binding.c: - my_init parses max_trials_per_episode + per_trial_timeout kwargs. per_trial_timeout=0 means "use default" (scenario_length). Drive.py: - max_trials_per_episode + per_trial_timeout added to __init__ signature and threaded through both env_init call sites. Tests (tests/test_goal_trial.py): 5/5 PASS - 60-step runs at gb=0/1/2 confirm trial_ended_this_step stays zero. - gb=3 with timeout=5, max_trials=2: trial_ended fires every 5 steps, terminals fires every 10 steps (= max_trials × timeout), trial_count resets correctly across episode boundaries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…each_rate) Adds 4 float fields to Log struct: - n_trials_completed - n_trials_goal_reached - n_trials_timed_out - trial_total_length (running sum; divided by n_trials_completed in my_log) Under GOAL_TRIAL only, c_step writes directly to env->log on every trial-end event (bypasses logs[i] aggregation since add_log doesn't fire under GOAL_TRIAL — the scenario_length early-return is suppressed). env->log.n is incremented at episode boundaries so vec_log's total_n >= num_agents gate triggers and metrics emit. Under goal_behavior 0/1/2: all trial fields stay zero. vec_log path unchanged. binding.c my_log emits two derived metrics for convenience: trial_mean_length = total_length / n_completed trial_goal_reach_rate = n_goal_reached / n_completed Tests (tests/test_trial_log_fields.py): 2/2 PASS - All goal_behavior values (0/1/2/3) expose the new keys. - Under non-TRIAL: keys present but zero. - Under TRIAL with timeout=5: random policy times out every trial, n_completed grows, mean_length ≈ 5, goal_reach_rate = 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds trial-mode rollout loop that branches on goal_behavior: - goal_behavior in {0,1,2}: existing scenario-mode loop preserved. - goal_behavior == 3: outer loop runs up to max_trials * per_trial_timeout ticks; per-agent trial counter advances on trial_ended_this_step (Python reads env.trial_ended_this_step after each step). Trial outcome = (reward > goal_reward_threshold) at the trial-end tick. Output schema additions for trial mode: - trial_0_score, trial_1_score, ..., trial_{K-1}_score - ada_delta_trial_1_minus_0, ada_delta_trial_K_minus_0, etc. - per_agent_success_log records use t0/t1/... keys (not s0/s1) Tests (tests/test_evaluator_trial_mode.py): 2/2 PASS - gb=3: trial_X_score and ada_delta keys appear; records use t0/t1. - gb=0: scenario mode unchanged; records use s0/s1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ults utils.py:run_human_replay_eval_in_subprocess: - Inherit goal_behavior from training env_config (was hardcoded to 0). - Forward max_trials_per_episode and per_trial_timeout flags. Wandb log path already auto-forwards every numeric metric under eval/human_replay_*, so the new trial_X_score and ada_delta_trial_K_minus_0 keys from M5 flow through automatically. adaptive.ini, drive.ini: - Add max_trials_per_episode=2 and per_trial_timeout=0 defaults. - Update goal_behavior comment to mention 3:"trial". Without these in the INIs, the pufferargs CLI parser doesn't generate --env.max-trials-per-episode / --env.per-trial-timeout flags, so launchers can't override them. Smoke train (gb=3, max_trials=2, per_trial_timeout=201) on adaptive_drive: Epoch 2 reached, SPS ~93K, no NaN. Trial fields flow through vec_log: n_trials_completed = 2.257, n_trials_goal_reached = 1.692, n_trials_timed_out = 0.565, trial_mean_length = 56.4 ticks → 75% of trials reach goal; mean trial length 56 ticks << 201 timeout. End of M1-M6. The trial-redesign is end-to-end functional: M1b: Per-episode-reset PE handles multi-episode-per-row attention M2: trial_ended_this_step buffer plumbed Python ↔ C M3: GOAL_TRIAL=3 implements per-trial logic in c_step M4: Per-trial Log fields exposed via vec_log M5: HumanReplayEvaluator emits trial_X_score, ada_delta_trial_K_minus_0 M6: Subprocess eval forwards goal_behavior; INIs register the new flags Known limitation (documented in the chat): episodes can span buffer segment row boundaries; the spillover slots in the next row are trained under truncated context vs full-cache rollout context. PPO clip damps but doesn't block the resulting gradient mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drive (drive.h + drive.py): - Option D: under goal_behavior=3, when an agent's trial_count reaches max_trials_per_episode, fire terminals + add_log_one_agent + mark agent as removed=1 and move it off-grid (INVALID_POSITION). Do NOT call c_reset; the agent idles until Python's resample_frequency triggers _reinit_envs_with_new_maps. Removes the 1-map-many-episodes pathology that was driving over-fit scores. - add_log_one_agent: per-agent variant of add_log used at variable-length trial-mode episode ends. Mirrors all per-entity state c_reset would reset (respawn_timestep, current_goal_reached, metrics_array, etc.) so the next cycle starts clean. - current_goal_reached=1 set in the goal-reach else branch (GOAL_STOP / GOAL_TRIAL share that path); reset to 0 in respawn_agent so the next trial can register a fresh goal-reach. Without this, every in-radius tick incremented goals_reached_this_episode and the score metric was meaningless. - move_expert loops the recorded trajectory under GOAL_TRIAL (t % array_size) so static experts don't vanish past scenario_length. - Trial overlay in c_render_with_mode: 'Trial X / K' instead of scenario counter. Drive Python (drive.py): - self.truncations[:] = 0 at the top of step so trial_ended_this_step -> truncations mirroring stays per-step. - Mirror trial_ended_this_step -> truncations under gb=3 (carries the trial-boundary signal through the SHM buffer to pufferl for GAE bootstrap-stop without triggering KV-cache reset). - Per-scenario block gated off under gb=3 (variable-length trials would land mid-trial). - At resample_frequency boundary under gb=3, force a vec_log emission with num_agents=1 so slow agents that didn't finish max_trials still flush their metrics before _reinit_envs_with_new_maps zeros env->log. adaptive.py: - Under goal_behavior=3, auto-link max_trials_per_episode = k_scenarios and per_trial_timeout = scenario_length, matching the mental model 'k_scenarios trials of scenario_length each.' pufferl.py: - done_mask = d (was d + t) so cache reset gates on terminals only. Trial boundaries (truncations) no longer wipe KV cache, which is the whole point of adaptive training across trials. - GAE bootstrap_stop = (terminals + truncations).clamp(max=1.0): use truncations as the bootstrap-stop channel so V[t+1] from the post-respawn trial does not pull into the value target of the last step of the old trial. - truncations persisted into the rollout buffer (parallel to terminals) so GAE can read the boundary mask. - Debug logging: PUFFER_TRIAL_DEBUG_FILE env var enables per-event JSON trace of cache resets, GAE inner/outer, scenario boundaries, etc. render.py + utils.py: - Forward --goal-behavior + --max-trials-per-episode + --per-trial-timeout CLI flags to env init. - max_steps default under gb=3 = max_trials * per_trial_timeout (the worst-case episode budget) instead of k_scenarios * scenario_length. - Video basename uses 'trials{N}' label under gb=3. rollout.py: - max_steps default under trial mode matches drive.py / render.py. - Break when terms.all() so trial-mode rollouts stop at episode boundary rather than after exactly k_scenarios * scenario_length ticks. - info dict carries _trial_starts for downstream consumers. Tests (8 new + 1 updated): - test_gae_trial_boundary.py: GAE bootstrap-stop fires on truncations too, not just terminals. - test_gae_decoupling_integration.py: end-to-end trial_ended_this_step -> truncations mirror. - test_adaptive_trial_link.py: auto-link of max_trials/per_trial_timeout under gb=3. - test_trial_standard_metrics.py: episode-length, score, offroad, collision populate under gb=3 via add_log_one_agent. - test_trial_per_scenario_gate.py: per-scenario logic gated off under gb=3. - test_trial_score_semantics.py: score uses max_trials denominator. - test_trial_overcounting_fix.py: current_goal_reached gates goals_reached_this_episode increments. - test_rollout_trial_mode.py: rollout max_steps / break / info match trial mode. - test_goal_trial.py: updated test_trial_episode_resets to match Option D (one episode per resample_frequency; agent idles between). All 53 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Trials 2..K rendered empty: respawn_agent sets respawn_timestep = env->timestep (drive.h:2685), and seven downstream gates treat respawn_timestep != -1 as a "post-respawn ghost" marker, including the 3D mesh draw at drive.h:3482 (visible symptom), the collision checks at 1327/1342, ego obs[6] at 2409, and other-car obs at 2455/2457. The mid-episode trial-respawn branch needs to clear the flag back to -1 since GOAL_TRIAL is not a ghost-fade mode. Single-line fix in the GOAL_TRIAL else-branch of c_step. End-of-episode (Option D: removed=1, off-grid, terminals+add_log) is unaffected — that branch never calls respawn_agent. GOAL_RESPAWN's intentional ghost-render semantic is preserved (different branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…arios>2 Symptom: eval logged only trial_0_score, trial_1_score, and ada_delta_trial_1_minus_0 — missing trial_2/trial_3 even though the env was actually running 4 trials (k_scenarios=4 with the adaptive.py auto-link). Cause: evaluator.py:658 read max_trials_per_episode straight from args['env'] (= the INI default of 2). The auto-link in AdaptiveDrivingAgent.__init__ updates the env's attribute but not the args dict the eval inherits. Fix: prefer puffer_env.driver_env.max_trials_per_episode (the actual env value after auto-link); fall back to k_scenarios when args says the INI default of 2. Same path for per_trial_timeout. Test: tests/test_evaluator_trial_mode.py::test_trial_mode_auto_link_k4 exercises the k_scenarios=4 / args.max_trials=2 case and asserts trial_0..3 scores + ada_delta_trial_{1,2,3}_minus_0 all appear. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/src/trial_mode.md — full GOAL_TRIAL design doc: - Two-boundary problem (terminals vs truncations) with the 4×4 table - PPO/GAE bootstrap-stop formula (LaTeX) - KV cache reset gate - Option D semantic (idle-after-max_trials) with rationale - Auto-link of trial parameters (current behavior, to be simplified) - End-to-end signal flow (C → SHM → Python → pufferl → GAE) - Score semantic + threshold ladder - The 7 render gates on respawn_timestep != -1 - Per-trial metrics + evaluator's trial_K_score breakdown - Test coverage map tests/test_render_contract.py — programmatic visibility tests: - gb=3 k=4: ego obs[6] must clear within 2 steps of mid-episode trial respawn (else 3D mesh draw at drive.h:3482 hides ego in trials 2..K) - gb=3 k=2: same for the smallest meaningful trial setup - gb=0: ghost-fade semantics preserved (non-regression) - gb=3 k=4 end-of-episode: clean idle state, no 'invisible-but-onscreen' Run before any drive.h / drive.py / render.py change. Catches the M7-fix bug class without spinning up Xvfb + ffmpeg. Wired trial_mode.md into mdBook SUMMARY under new 'Design' section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-fix, per-trial breakdown only appeared in eval-time HumanReplayEvaluator output (every 40 epochs); training-time wandb was blind to in-episode adaptation. Now every report_interval ticks the env emits: - trial_0_score, trial_1_score, ..., trial_7_score - ada_delta_trial_1_minus_0, ..., ada_delta_trial_{k-1}_minus_0 Implementation: - drive.h Log: added trial_k_goal_reached[N_TRIAL_K_SLOTS=8]. Incremented in c_step's GOAL_TRIAL trial-end branch for ego only, at index k = trial_count - 1 (the just-completed trial). - binding.c my_log: emits trial_K_score keys, but only if n_trials_completed > 0 (gates on actual gb=3 activity so gb=0/1/2 output stays clean). - drive.py _inject_trial_deltas: post-processes the emitted log dict to add ada_delta_trial_K_minus_0 = trial_K_score - trial_0_score for K in 1..max_trials-1. Called only under goal_behavior == 3. Tests: tests/test_ada_delta_train_logging.py (4) — trial_K keys emit, ada_delta keys appear in info[], values match the subtraction, gb=0 doesn't leak trial keys. All 16 trial-mode test files pass (61 tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l_timeout Drops the parallel knob surface (max_trials_per_episode + per_trial_timeout). Under gb=3 the auto-link is now an invariant, not a heuristic with INI-default detection. To change the trial count, change k_scenarios. Surface changes: - INI: removed `max_trials_per_episode` and `per_trial_timeout` lines from drive.ini + adaptive.ini. - render.py: dropped --max-trials-per-episode + --per-trial-timeout CLI flags; max_steps default = k_scenarios * scenario_length under all goal_behaviors. - pufferlib/utils.py: subprocess-eval no longer forwards the two flags (the env derives them); video-naming uses k_scenarios. Internals: - adaptive.py: auto-link is unconditional under gb=3: kwargs["max_trials_per_episode"] = k_scenarios kwargs["per_trial_timeout"] = scenario_length Removed the "if user passed default of 2 vs override" detection. - evaluator.py: simplified to read from driver_env (still falls back to k_scenarios / sim_steps). - C-side struct keeps max_trials_per_episode + per_trial_timeout fields (used internally); the only consumers now are the C trial-end branch and add_log_one_agent's denominator. No external API surface. Tests: - test_adaptive_trial_link.py: replaced test_explicit_user_override_wins / test_explicit_per_trial_timeout_wins with their negations — under gb=3 any override is ignored, k_scenarios / scenario_length win. Documents the new invariant. Design doc: simplified the "Auto-link" section and the quick-reference knob table to reflect the smaller surface. All 16 trial-mode test files pass in isolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-fix, Python mirrored trial_ended_this_step → truncations after every vec_step. Two writers (C: zero, write at trial-end; Python: zero, mirror) made the buffer write-order-sensitive. Now C does everything; Python is read-only under gb=3. C side: - drive.h Env struct: added `unsigned char *truncations`. - env_binding.h: uncommented env->truncations wiring from positional arg (was wired for all sibling buffers, just not this one). Only Drive uses this header so no cross-env risk. - drive.h c_step: zeroes env->truncations at top under gb=3; writes 1 at each trial boundary in the GOAL_TRIAL branch (alongside trial_ended_this_step). Both signals fire on the same agents. Python side (drive.py): - step(): no longer zeroes truncations under gb=3 — C handles it. - step(): dropped the trial_ended_this_step → truncations mirror block. - Under non-trial modes (gb=0/1/2), Python still owns truncations: zeroes it at top of step (so the k_eff-curriculum scenario-boundary write at drive.py:1202 keeps working). Tests: tests/test_truncations_ownership.py (3): - C zeroes truncations each step under gb=3 (Python-side pollute is wiped). - C writes 1 simultaneously to truncations and trial_ended_this_step at trial boundary. - Non-trial mode untouched by C. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Removed narrate-the-code comment blocks I added during the trial-mode build-out; kept only why-comments (non-obvious constraints, gotchas, references to specific bug fixes). drive.h: - Trimmed add_log_one_agent header from 6 lines to 3. - Collapsed the per-trial-K-slot doc to 2 lines. - Collapsed the score-threshold-ladder rationale to 4 lines. - Trimmed Option D / mid-trial-respawn ghost-flag blocks; cross-refs the design doc for the long form. - Added the goal-reach invariant block (single source of truth for goals_reached_this_episode gating). drive.py: - Trimmed trial_ended_this_step field doc from 5 lines to 2. - Trimmed per-scenario-gate rationale and resample_frequency block. pufferl.py: - Cache-reset gate explanation: 7 lines → 4. - Truncations persistence: 6 → 2. - GAE bootstrap-stop: 6 → 2. All 17 trial-mode test files pass in isolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lightweight categorization in tests/README.md. Avoids 17 file moves + sys.path rewrites — the suite is already stable, so a doc-only split serves the same intent (signal which tests are load-bearing design invariants vs paper-trail bug locks) without code churn. Records the going-forward rule: write contract tests BEFORE implementation; only add regression tests when fixing real bugs. Also documents the multi-Drive pytest-segfault workaround (run each file in its own invocation) and links to the design doc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces per-agent trial clocks with a single env-level trial clock. Under gb=3: - On ego goal-reach mid-trial: ego goes off-map (entity.removed=1, INVALID_POSITION, vx=vy=0). No truncations / terminals — wait for env trial-end. - env trial-end fires when ALL active egos in env have removed=1 OR env per_trial_timeout elapses. On trial-end, ALL entities (egos + co-players) reset to init position; env_trial_start updates; env trial_count++. - env episode-end (env_trial_count == max_trials): terminals fires for every active ego; Option D applies (removed stays, off-map until c_reset). env_episode_ended sentinel prevents repeat trial-ends until c_reset. Why: prior per-agent semantic broke the "trial 1 == trial 2 except cache" invariant — co-players and recorded humans ran on their own clocks, so trial 2's world differed from trial 1's world. Adaptation Δscore = trial_K - trial_0 conflated cache effects with stochastic traffic differences. B'' guarantees trial-conditions identity. C side (drive.h): - Env struct: added env_trial_count, env_trial_start_timestep, env_episode_ended. - Env struct: added `unsigned char *removed` SHM buffer (per-agent off-map flag). - c_reset: zeroes the env-level trial state + per-agent removed. - Goal-reach else-branch: under gb=3, set entity.removed=1 + off-map instead of stopped=1. - c_step GOAL_TRIAL block: rewritten to env-level trial-end detection + batch reset. Gates on !env_episode_ended. - move_expert: uses env_trial_start_timestep for the replay clock, so recorded humans rewind to frame 0 at each env trial-end. Python side: - drive.py: `self.removed` SHM buffer, wired through env_init kwargs. - binding.c: kwargs read for `removed` (both my_shared_self_play and my_init paths). Tests: tests/test_env_level_trial.py (6 new contract tests): - removed buffer exists + zero at reset - env trial-end fires on timeout when no ego reaches - ego goes off-map on goal-reach - env trial-end resets all entities to init (mid-episode, k>=3) - episode-end fires after max_trials → Option D removed=1 stays - truncations NOT fired on individual reach (only at env trial-end) All 17 prior trial-mode test files still pass in isolation. Pufferl-side KV cache freeze (so off-map limbo doesn't pollute the transformer cache across trial boundaries) is the next step — gated on vecenv.recv() integration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

scripts/debug_b_demo.py walks through: 1. env-level trial state evolution (removed flag, trunc, term) 2. move_expert clock reset at env trial-end (verifies humans rewind to frame 0 by checking obs[20:] is identical at trial 1 tick 1 and trial 2 tick 1) 3. per-trial-K metrics in vec_log (trial_K_score for K in 0..k-1) 4. KV cache freeze design (documents the pufferl-side integration that's still pending) Output is human-readable ASCII traces — for visual inspection, not unittest assertions. Run: python scripts/debug_b_demo.py

…t reset Two bugs from the user's render of B''-on-pre-B''-checkpoint: 1) Trial overlay always showed "Trial 1 / 4" all four trials. Cause: overlay read per-entity `trial_count` (no longer maintained under B'' — env-level `env_trial_count` is the truth). Switched overlay to env-level counter and clamped to max_trials_per_episode. 2) Recorded humans / static experts vanished in trials 2..K. Cause: my B'' move_expert reset (t = timestep - env_trial_start) replays the same EARLY frames of recorded-human trajectories each trial. Many nuplan agents have late-valid windows (enter the scene mid-trajectory) — short trials never reach those valid frames, so the agents are invisible. Reverted move_expert to OLD looping `t = timestep % array_size`, which keeps humans continuously playing across trials. Trade-off: humans drift slightly across trials (different frame in trial 2 than trial 1). This violates the strict "trial 1 == trial 2" invariant for static experts. For training this is acceptable (humans are stochastic background). For paper-grade human_replay eval (1 ego per env) where strict equivalence matters, the eval path will need to explicitly reset move_expert — captured as a TODO in the function comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per user: "yeah whatever the humans do whatever they want at the timestep. We dont care about what humans do, only what egos do. when trial 2 starts, we reset the humans." Restores the env-relative move_expert clock so humans (recorded experts) restart at frame 0 every env trial-end. Trade-off: if humans have late-valid windows and trials end fast (cache makes ego reach quickly), humans may not appear in those trials — that's faithful to the data, not a bug. Strict trial-equivalence on the ego side is what matters for the adaptation experiment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

…+ respawn from init_steps Two bugs caused trial 1 to differ from trial 2 even after the env-relative move_expert clock was in place: 1) respawn_agent put active entities at traj[0]. But c_reset's set_start_position uses traj[init_steps]. So trial 1 started at recorded frame `init_steps`, trial 2..K started at frame 0 — different recorded poses, different velocities, different lane alignment. Fix: respawn_agent now uses init_steps (clamped to array_size). 2) move_expert mapped (timestep - env_trial_start) to recording frame directly, ignoring init_steps. Static experts were at frame 1 at trial K tick 1, but at frame init_steps+1 at trial 1 tick 1 (because c_reset put them at init_steps then move_expert advanced one tick). Fix: t = init_steps + (timestep - env_trial_start). 3) Even with (1) and (2), the per-entity reset in the trial-end loop didn't touch static agents or rest various e->state fields (collision_state, valid, dynamics integrators). Replaced the per-active respawn_agent loop with a single set_start_position(env) call at trial-end, which resets the EXACT same state c_reset does for every entity in the env. Strict bit-for-bit equivalence verified: trial1 vs trial2 obs differ in 0/1850 dims, max diff = 0.0. Now trial K == trial 1 except for ego's KV cache, as designed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

scripts/trace_b_render.py prints per-step: tick, env trial counter, rem count (egos off-map mid-trial), trial_ended_this_step / truncations / terminals counts, simulated KV cache write position, and event highlights (GOAL-REACH, TRIAL-END, EPISODE-END). Useful for visualizing B'' dynamics alongside the rendered mp4. Verified empirically that the same agents reach goal at the same RELATIVE tick across all 4 trials (agent 33 at relative tick 26 in trials 1-4, agent 42 at relative tick 116 in trials 1-4) — confirms strict trial equivalence. cache_pos is a simulation showing what the transformer's cache write position would do; once task #33 wires the actual pufferl freeze for removed agents, the displayed value will match the real cache.

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

bug_003: AdaptiveDrivingAgent.__init__ now asserts k_scenarios <= 8 under gb=3 so the silent-truncation failure mode for per-trial metrics (trial_k_goal_reached[8] in drive.h is fixed-size) is loud. Pre-fix, k=9+ would lose trial_8_score from training-time wandb but eval would still report it — diverging signal. With the assert, the user is told to either bump N_TRIAL_K_SLOTS or pick a smaller k. bug_002: self.removed docstring claimed "pufferl uses this to freeze the KV cache during the off-map limbo" but pufferl has zero references to the buffer — it's reserved for task #33. Reworded the comment. Both flagged as nits by ultrareview, batched in one commit.

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Without this, when an ego reaches goal mid-trial and goes off-map (removed=1), pufferl keeps invoking the transformer with that agent's garbage INVALID_POSITION observations. Each invocation appends a new K/V token to that agent's cache. Over a 174-tick limbo period, ~87% of the cache becomes garbage; future trials' attention is dominated by junk tokens. Fix: per-agent attention-mask exclusion of garbage cache slots. models.py TransformerWrapper: - init_eval_state adds state["garbage_mask"]: (B, horizon) bool. - forward_eval reads state["removed"] (B,) bool from pufferl. The attention mask becomes (B, 1, 1, horizon) per-agent = (slot <= current_pos) AND (not garbage_mask). After cache write, garbage_mask[:, slot_t] |= removed marks the just-written slot as garbage for off-map agents. - reset_eval_state zeros garbage_mask alongside k_cache / v_cache, both for full-env reset and per-agent done_indices reset. pufferl.py: - New self.transformer_garbage_mask dict (per state_key, mirrors k_cache layout). Lazy-allocated by the model on first forward_eval. - Before forward_eval: state["garbage_mask"] = ..., state["removed"] = view of vecenv.driver_env.removed sliced by env_id. - After forward_eval: persist state["garbage_mask"] back. - On done_mask reset (terminals): clear garbage_mask rows alongside k_cache / v_cache rows. Tests (tests/test_cache_freeze.py, 4 contract tests): - Slots written while removed=1 have zero softmax weight in next step (verified via _probe_attention path that captures attention weights per layer per agent). - Other agents' attention to the same slots is unaffected. - Full reset (done_indices=None) zeros garbage_mask. - Partial reset (done_indices=[a, b]) zeros only those rows. - Without state["removed"] (non-gb=3 modes), no slots are marked garbage — backward-compat preserved. All 19 trial-mode test files pass. Multi-env caveat: pufferl reads vecenv.driver_env.removed, which works in single-Drive setups (render). For multi-worker training the buffer needs to be unified (SHM); driver_env.removed currently is one Drive's view. Captured as follow-up — same mechanism that propagates terminals across workers needs to do removed too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Previously `self.removed = np.zeros(num_agents)` was allocated as a private numpy array per Drive instance. In Multiprocessing this means each worker's `removed` writes were invisible to the main process — `driver_env.removed` on the parent was just a metadata stub that no worker ever wrote to. Effect: garbage_mask was never populated in training, so the cache-freeze mask did nothing. Plumb `removed` through SHM the same way `terminals`/`truncations` are: - vector.py Multiprocessing: allocate `removed=RawArray("b", num_agents)` in self.shm; expose `self.removed = self.buf["removed"].ravel()` for pufferl. _worker_process maps the worker's row into buf["removed"]. - vector.py Serial: allocate `self.removed` (or accept from outer buf) and slice per env into buf_i["removed"]. - drive.py: read `buf["removed"]` if present, fall back to local alloc. - pufferl.py: read unified `vecenv.removed` (works for both backends), falling back to driver_env.removed for compat. Verified: 4-worker Multiprocessing test shows worker-side writes (bit counts [1,2,1,1]) visible from the main process via SHM after one step. Serial driver_env.removed shares memory with vecenv.removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…O loss Eval forward (`forward_eval`) already excluded off-map limbo cache slots via `garbage_mask`. Training forward (`forward` → `create_episode_mask`) used only `terminals` and saw limbo tokens unmasked. The PPO loss also ignored `masks` (per the pre-existing `# Note: We are not yet handling masks in this version` at pufferl.py:798). Result: the policy attended to garbage during training AND limbo tuples (zero reward, post-reset or INVALID_POSITION obs) contributed gradient to pg_loss/v_loss/entropy. This commit closes the train/eval mismatch: - pufferl.py: add removed_history rollout buffer; capture vec.removed per step alongside terminals/truncations; pass mb_removed into the training forward via state["removed"]; weight pg_loss, v_loss, entropy_loss and diagnostic stats (approx_kl, clipfrac) by valid=~removed, normalized by sum(valid). - models.py: in the training `forward` else-branch (use_episode_mask), add (B, 1, T) limbo bias of -inf to the attention mask on the SOURCE axis. attn_mask[b, t, s] += -inf if removed[b, s], for all queries t. Matches eval-side `garbage_mask` semantics exactly. Verified with tests/test_train_eval_parity.py (out-of-tree): - output at non-limbo positions is bit-invariant to limbo obs values - output WITHOUT the mask leaks (Δ=0.57 confirms the prior bug) - gradient at limbo input positions is exactly 0 with full gating Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Audit agent flagged: my earlier limbo_bias masked the diagonal (t==s), so a query whose causal prefix is entirely limbo had ZERO valid sources → softmax(all -inf) = NaN. NaN then poisons the (now-masked) per-step losses via 0 * NaN = NaN. Fix: leave attn_mask[b, t, t] = 0 always. This matches eval semantics exactly — garbage_mask[a, slot_t] is set AFTER the forward at step t, so a limbo query during its own forward attends to itself (just as it would on the eval side). Other limbo SOURCES still get -inf from any non-self query, which is the train/eval parity we want. Test 4 in /tmp/test_train_eval_parity.py: constructs removed=[1,1,1,0,...] and confirms no NaN. All prior parity tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PPO audit agent flagged three remaining holes in the previous train/eval parity patch (commit e396507): 1. **self.values[idx] = newvalue.detach()** (line ~1112) unconditionally wrote the value head's output at limbo positions back into the buffer. At limbo positions newvalue is computed from INVALID_POSITION obs and is garbage; the next outer GAE call (next minibatch) read it and propagated the poison. Gated: keep prior value at limbo positions. 2. **self.ratio[idx] = ratio.detach()** (line ~1016) cached limbo importance ratios. Outer GAE's v-trace uses these as rho/c coefficients. Limbo ratios are from logprob(garbage_obs, garbage_action) and pollute the next minibatch's advantage chain. Gated identically. 3. **bootstrap_stop** at both outer GAE (line ~900) and inner recomputation (line ~1049) only included `terminals + truncations`. Now also includes `removed_history.float()`, so V(t+1) bootstrapping stops at every limbo step. Otherwise the value at a healthy step t-1 would still bootstrap into V(garbage obs at t) when t is limbo. Together with the in-loop loss gating from e396507, this fully contains limbo poisoning: limbo positions contribute zero gradient AND zero information to the GAE chain AND don't poison the value/ratio buffers across PPO epochs. All existing tests pass: - /tmp/test_train_eval_parity.py (4/4 tests) - tests/test_cache_freeze.py - tests/test_env_level_trial.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…r ckpt Two targeted fixes from the audit pass: 1. drive.h:447-450 — `env->log.goals_reached_this_episode` and `goals_sampled_this_episode` were accumulated OUTSIDE the `if (e->is_ego)` guard inside `add_log_one_agent`. With population_play this meant co-player goal counts contaminated the ego aggregate, biasing the `completion_rate` metric upward (and indirectly any downstream consumer). Moved the two increments inside the ego guard. 2. vector.py: module-level cache `_CO_PLAYER_STATE_DICT_CACHE` keyed by (checkpoint_path, mtime). Render epochs were calling `pufferlib.vector.make` and reloading the (~50MB) co-player checkpoint from disk each time, contributing to the OOM seen at first-eval/render epochs. Cache invalidates if the file is replaced. Training path also benefits (only relevant if you construct multiple vec envs in one process). c_reset-at-episode-end was prototyped, then reverted after discussion: existing `bootstrap_stop ∨ removed` + loss mask already neutralizes the limbo gap's gradient damage; forcing c_reset would repeat the same map ~7× per resample window and reduce data diversity. Trade-off not worth the 5-10% wall-clock win. All tests pass: - /tmp/test_train_eval_parity.py (4/4) - tests/test_cache_freeze.py - tests/test_env_level_trial.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The PPO loss-gating patch broadcast `valid_mask` (B, T) against `entropy` (B*T,) — sample_logits returns entropy flat, not 2D. RuntimeError: tensor size mismatch (20100 vs 402). Fix: flatten valid_mask once (`valid_mask_flat`) and use it for the entropy term in both the entropy_conditioned and the standard branch. Normalize by the flat sum. pg_loss / v_loss are unaffected (they use (B, T) tensors throughout). All parity tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…outing * tests/inspect_system.py: end-to-end system probe (env + policy + KV cache + garbage_mask + attention) producing per-tick JSONL, npz snapshots, and rendered mp4 for each of {coplayer, human_replay, ego_only} modes. * tests/plot_attention.py: heatmaps + per-head bar plots + limbo-mask verification plots from inspect_system outputs. * scripts/adaptive/nuplan_transformer_local_k2_201_gb3_{2,4}partners.sh: the k=2 gb=3 sweep launchers we've been iterating on (nw=16 for 2 partners, nw=8 for 4 partners; eval shrunk to 64 maps to keep memory under cap). * scripts/adaptive/nuplan_transformer_local_k4_201_gb3_2partners.sh: k=4 variant with nw=8 (k=4 doubles pinned RAM so nw drops for 2-way parallel). * pufferlib/utils.py: render_videos honors eval.map_dir if set, so renders match the eval distribution (e.g. nuplan_hard) rather than training maps. * pufferlib/ocean/drive/drive.h: ffmpeg render fps lowered to 10 (matches env dt=0.1; previously 30 was 3x wall-clock playback). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Repo-local scores CSV (611 KB) so a fresh clone can run build_nuplan_hard.py without copying the file separately. Default --scores path updated to scripts/nuplan_201_hardness_scores.csv. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Forked from scripts/adaptive/nuplan_transformer.sh with: * gb=3 GOAL_TRIAL flags (--env.goal-behavior 3 + k_scenarios + scenario_length) * nw=32 (cluster has more RAM than local container; per-task mem=256GB covers 181 GiB pinned + 30 GiB overhead + 15 GiB eval-spike margin) * cpus-per-task=40 (nw=32 + 8 overhead) * minibatch_multiplier=25 → keeps minibatch_size=20100 at horizon=804 * eval shrunk to 64 maps for the eval subprocess spike * PYTORCH_CUDA_ALLOC_CONF=expandable_segments + WANDB_MODE=online forced * --array=0-1 for 2 partners (miku2puk @ 0.05, 2e029h15 @ 0.10) Submit with: sbatch scripts/adaptive/cluster_nuplan_transformer_k4_gb3.sh Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Was 0-1 (2 partners, 1 seed each). Now array=0-11, indexed as: TASK_ID = partner_idx * 3 + seed_idx partners = [miku2puk@e_ub=0.05, 2e029h15@0.10, m2ygolog@0.20, 6rauydj2@0.50] seeds = [42, 43, 44] Each task gets one GPU + 256GB + 40 CPUs (cluster handles the 12-way parallelism via the array). 2B steps per task → ~5h each at nw=32. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Validates the cluster setup before submitting the 12-run sweep: singularity + venv + binding load, nuplan_201/nuplan_hard binaries present, partner ckpt loads, co-player + ego forward pass works, wandb sync online, no OOM/NaN at first dashboard frame. Resources: 48GB / 8 CPUs / 1 GPU / 30 min. ~3-4 min wall-clock once running. Submit with sbatch, watch /scratch/.../smoke_<jobid>.out, scancel as soon as Steps advances. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Force-adds the 4 partner checkpoints (gitignored experiments/ dir): puffer_drive_miku2puk.pt (e_ub=0.05) puffer_drive_2e029h15.pt (e_ub=0.10) puffer_drive_m2ygolog.pt (e_ub=0.20) puffer_drive_6rauydj2.pt (e_ub=0.50) Used by cluster_nuplan_transformer_k4_gb3.sh (the 4×3-seed array sweep) and the local k=4 2-partner launcher. 5 MB × 4 = 20 MB one-time cost. Also bumps the local k=4 2-partner launcher from nw=8 to nw=10 (after container memory.max increased from 132 GiB to 154 GiB). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Previous cluster smoke OOM'd at startup despite nw=4. Reducing: --mem 48GB → 64GB (more CPU headroom) --train.minibatch-multiplier 25 → 4 (minibatch 20100 → 3216, 6x less GPU memory) --train.max-minibatch-size 20100 → 3216 --train.total-timesteps 5M → 2M (finish faster) --time 30min → 20min This matches the config of the first 50M smoke on the vast box that ran cleanly end-to-end. If THIS still OOMs, the issue is cluster container limits, not our minibatch sizing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Rewrite `_fill_external_co_player_actions` to handle vec.batch_size > 1. When the env is launched with `--vec.batch-size=$NUM_WORKERS`, recv() returns all workers' obs stacked, and this function now does a single batched GPU forward over the combined co-player batch instead of launching one forward per worker. Mechanism: - Parse N workers' co_ids + reset flags from info[] - Concatenate per-worker co_obs slices into one batched tensor - Merge per-worker KV caches along batch dim for the forward - Single forward_eval call - Split updated state back to per-worker stores - Scatter actions to each worker's SHM action buffer slot When batch_size == 1 the new code is a no-op equivalent of the old: n_in_batch=1, no concat/split work, single worker processed. Smoke result at nw=4 (RTX 4090): SPS 10.2K -> 32.5K (3.2x). At cluster nw=32 the per-step Python overhead reduction should give ~2-2.5x SPS (16h -> ~6-8h per partner-seed run). To enable, add to sbatch: --vec.batch-size <NUM_WORKERS> Without that flag the code path is unchanged (batch_size=1 default). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Two fixes that remove the two main triggers preventing torch.compile from producing a clean, stable forward_eval graph: 1. models.py: garbage_mask update was using a 0-d scalar tensor index (slot_t.squeeze()), which causes Dynamo to call aten._local_scalar_dense and graph-break every call. Replaced with a functional broadcast: slots = torch.arange(self.horizon, device=device) position_mask = slot_t.view(-1, 1) == slots.view(1, -1) garbage_mask = garbage_mask | (r.unsqueeze(1) & position_mask) No semantic change — same final values, no scalar extraction. 2. pufferl.py: self.amp_context was being entered/exited per recv() iteration inside the eval loop. Dynamo sees autocast GLOBAL_STATE flip between calls and recompiles forward_eval each step. Hoisted the amp_context to wrap the entire rollout loop with explicit enter/exit. Smoke at nw=4 (RTX 4090, default compile mode): - opt432 (#2 only): 1.2M steps in 79s, Eval.Forward 17s - opt432 + Path A: 1.2M steps in 65s, Eval.Forward 10s - Wall-clock speedup: ~18% (Forward time dropped ~41%) This does NOT unlock CUDA graphs (compile_mode=reduce-overhead still crashes on state-dict mutation/lazy alloc). The win comes from a clean single compile of forward_eval that doesn't re-fire each step. Stacks multiplicatively with #2. Projected at cluster nw=32: 7h post-#2 -> ~5.9h post-Path A. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

mohitmk01 and others added 21 commits May 4, 2026 19:07

greptile-apps Bot reviewed May 15, 2026

View reviewed changes

render: 30fps → 15fps for ffmpeg encoding

16ca6b3

greptile-apps Bot reviewed May 15, 2026

View reviewed changes

mohitmk01 and others added 7 commits May 15, 2026 11:41

greptile-apps Bot reviewed May 16, 2026

View reviewed changes

Conversation

m2kulkarni commented May 15, 2026

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants