Skip to content

Mohit/trial episode redesign#35

Open
m2kulkarni wants to merge 41 commits into
mohit/cleanup-apr25from
mohit/trial-episode-redesign
Open

Mohit/trial episode redesign#35
m2kulkarni wants to merge 41 commits into
mohit/cleanup-apr25from
mohit/trial-episode-redesign

Conversation

@m2kulkarni
Copy link
Copy Markdown

No description provided.

mohitmk01 and others added 21 commits May 4, 2026 19:07
Under multi-episode-per-row rollouts (planned for the trial-redesign),
training applies PE by slot-in-segment while rollout applies PE by
slot-in-cache (which resets at episode boundary). Same logical step
gets different PE values in train vs eval — distribution shift.

Removing PE entirely sidesteps the mismatch by construction. Five sites
patched in models.py: buffer registration removed, get_positional_embedding
helper removed, forward_eval / _forward_eval_legacy / forward / _prime_kv_cache
all updated to skip the PE add. _prime_kv_cache primes with zeros instead
of pos_embed.

Backwards-compat for existing PE-baked checkpoints:
- vector.py partner load filters `positional_embedding` from state_dict
- pufferl.py:load_policy filters the same key + autodetect heuristic
  now also accepts any `transformer.layers.*` key as a Transformer
  marker (since legacy detection relied on the PE buffer's presence).

Tests: tests/test_transformer_kv_cache.py PASS (cached vs legacy
forward_eval still bit-identical at fp32). Smoke train: 3 epochs at
~100K SPS, entropy declining, no NaN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the NoPE removal from 8237688. Brings back nn.Parameter
positional_embedding (matches pre-eb2d7fe2 behavior). Adds
compute_pos_within_episode static method that computes per-slot
position-within-episode from a (B, T) terminals tensor, used at
training time to align PE indexing with rollout's cache-pos indexing.

Why: under multi-episode-per-row rollouts (planned for trial-redesign),
forward_eval resets pos to 0 at every episode boundary via pufferl.py's
done handling. So forward_eval applies pe[pos_within_episode] for the
first step of each new episode in the cache. forward() (training) was
applying pe[slot_in_segment] across the whole row regardless of
boundaries — different absolute positions for the same logical step.
This silently corrupted gradients under multi-episode-per-row.

Formula (vectorized cummax trick):
  arange   = [0..T-1]
  shifted  = pad(terminals[:, :-1], (1, 0))         # right-shift terminals
  starts   = arange * shifted                       # mark episode-start slots
  ep_start = starts.cummax(dim=1).values            # propagate forward
  pos_in_ep = arange - ep_start                     # = 0 at each episode start

Convention matches create_episode_mask: terminal slot belongs to OLD
episode; new episode starts at slot terminal+1.

Tests added:
  tests/test_pos_within_episode.py — 9 unit tests, all passing
  tests/test_pe_train_eval_consistency.py — 3 integration tests proving
    forward() output bit-matches step-by-step forward_eval (with
    manual cache reset at episode boundaries) for single-, 2-, and
    3-episode-per-row segments.

Existing tests/test_transformer_kv_cache.py still PASS (cached vs
legacy forward_eval still bit-identical).

Smoke train: 3 epochs in 2min, SPS ~100K, entropy stable, no NaN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep the per-episode-reset gather logic in forward() unchanged — only
the PE buffer initialization swaps from nn.Parameter (std=0.02 init) to
register_buffer with Vaswani sinusoidal values. Per-episode reset
applies regardless of PE flavor.

Tests still pass:
- tests/test_pos_within_episode.py (9/9)
- tests/test_pe_train_eval_consistency.py (3/3) — train==eval bit-close
- tests/test_transformer_kv_cache.py (3/3) — cached vs legacy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…opy)

Per-agent flag intended to be set in c_step under goal_behavior=GOAL_TRIAL
(coming in M3) when a trial ends — distinct from `terminals`, which fires
only at the EPISODE boundary (after max_trials_per_episode trials).

End-to-end plumbing:
  drive.h: add `unsigned char *trial_ended_this_step` to Drive struct;
           calloc/free in standalone allocate()/free_allocated();
           memset to zero at the top of c_step (guarded on != NULL).
  binding.c: parse `trial_ended_this_step` kwarg in both my_init (initial
           env_init) and my_put (re-init via _reinit_envs_with_new_maps);
           OPTIONAL — older callers without the kwarg get NULL pointer
           and the c_step memset is no-op'd.
  drive.py: allocate self.trial_ended_this_step = np.zeros(num_agents,
           dtype=bool); pass slice via kwargs to env_init at both call
           sites (initial + reinit).

env_binding.h is NOT modified — the generic vec_init 6-tuple
(obs, action, reward, term, trunc, seed) remains intact, so this change
doesn't ripple to other ocean envs. The recv() contract is unchanged;
trial_ended_this_step is exposed as a Python-side attribute, not part
of the standard step return.

Tests (tests/test_trial_ended_buffer.py): 3/3 PASS
  1. Drive.__init__ exposes the buffer with correct shape/dtype
  2. C zeros the buffer in c_step (proven by Python-side pollution)
  3. 50 sequential steps survive with buffer still bound

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New behavior, fully gated on goal_behavior == GOAL_TRIAL (=3). All other
goal_behavior values (0/1/2) are unaffected.

Mechanics:
  - Each agent runs trials. A trial ends on goal-reach OR per-trial
    timeout (`per_trial_timeout` ticks since trial start; default
    scenario_length).
  - At trial end: trial_ended_this_step[i]=1, trial_count++, agent
    respawned to traj[0], trial_start_timestep updated.
  - When trial_count == max_trials_per_episode (default 2), terminals[i]=1
    fires for that agent (episode boundary), trial_count reset to 0.
  - The c_step early-return at scenario_length is suppressed under
    GOAL_TRIAL — episode boundaries come from the trial logic instead.

Drive.h struct additions:
  - Entity.trial_count, trial_start_timestep (zero under non-TRIAL paths)
  - Drive.max_trials_per_episode, per_trial_timeout (config)

Binding.c:
  - my_init parses max_trials_per_episode + per_trial_timeout kwargs.
    per_trial_timeout=0 means "use default" (scenario_length).

Drive.py:
  - max_trials_per_episode + per_trial_timeout added to __init__ signature
    and threaded through both env_init call sites.

Tests (tests/test_goal_trial.py): 5/5 PASS
  - 60-step runs at gb=0/1/2 confirm trial_ended_this_step stays zero.
  - gb=3 with timeout=5, max_trials=2: trial_ended fires every 5 steps,
    terminals fires every 10 steps (= max_trials × timeout), trial_count
    resets correctly across episode boundaries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…each_rate)

Adds 4 float fields to Log struct:
  - n_trials_completed
  - n_trials_goal_reached
  - n_trials_timed_out
  - trial_total_length (running sum; divided by n_trials_completed in my_log)

Under GOAL_TRIAL only, c_step writes directly to env->log on every
trial-end event (bypasses logs[i] aggregation since add_log doesn't
fire under GOAL_TRIAL — the scenario_length early-return is suppressed).
env->log.n is incremented at episode boundaries so vec_log's
total_n >= num_agents gate triggers and metrics emit.

Under goal_behavior 0/1/2: all trial fields stay zero. vec_log path
unchanged.

binding.c my_log emits two derived metrics for convenience:
  trial_mean_length = total_length / n_completed
  trial_goal_reach_rate = n_goal_reached / n_completed

Tests (tests/test_trial_log_fields.py): 2/2 PASS
  - All goal_behavior values (0/1/2/3) expose the new keys.
  - Under non-TRIAL: keys present but zero.
  - Under TRIAL with timeout=5: random policy times out every trial,
    n_completed grows, mean_length ≈ 5, goal_reach_rate = 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds trial-mode rollout loop that branches on goal_behavior:
  - goal_behavior in {0,1,2}: existing scenario-mode loop preserved.
  - goal_behavior == 3: outer loop runs up to max_trials *
    per_trial_timeout ticks; per-agent trial counter advances on
    trial_ended_this_step (Python reads env.trial_ended_this_step
    after each step). Trial outcome = (reward > goal_reward_threshold)
    at the trial-end tick.

Output schema additions for trial mode:
  - trial_0_score, trial_1_score, ..., trial_{K-1}_score
  - ada_delta_trial_1_minus_0, ada_delta_trial_K_minus_0, etc.
  - per_agent_success_log records use t0/t1/... keys (not s0/s1)

Tests (tests/test_evaluator_trial_mode.py): 2/2 PASS
  - gb=3: trial_X_score and ada_delta keys appear; records use t0/t1.
  - gb=0: scenario mode unchanged; records use s0/s1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ults

utils.py:run_human_replay_eval_in_subprocess:
  - Inherit goal_behavior from training env_config (was hardcoded to 0).
  - Forward max_trials_per_episode and per_trial_timeout flags.
  Wandb log path already auto-forwards every numeric metric under
  eval/human_replay_*, so the new trial_X_score and
  ada_delta_trial_K_minus_0 keys from M5 flow through automatically.

adaptive.ini, drive.ini:
  - Add max_trials_per_episode=2 and per_trial_timeout=0 defaults.
  - Update goal_behavior comment to mention 3:"trial".
  Without these in the INIs, the pufferargs CLI parser doesn't generate
  --env.max-trials-per-episode / --env.per-trial-timeout flags, so
  launchers can't override them.

Smoke train (gb=3, max_trials=2, per_trial_timeout=201) on adaptive_drive:
  Epoch 2 reached, SPS ~93K, no NaN.
  Trial fields flow through vec_log:
    n_trials_completed = 2.257, n_trials_goal_reached = 1.692,
    n_trials_timed_out = 0.565, trial_mean_length = 56.4 ticks
  → 75% of trials reach goal; mean trial length 56 ticks << 201 timeout.

End of M1-M6. The trial-redesign is end-to-end functional:
  M1b: Per-episode-reset PE handles multi-episode-per-row attention
  M2:  trial_ended_this_step buffer plumbed Python ↔ C
  M3:  GOAL_TRIAL=3 implements per-trial logic in c_step
  M4:  Per-trial Log fields exposed via vec_log
  M5:  HumanReplayEvaluator emits trial_X_score, ada_delta_trial_K_minus_0
  M6:  Subprocess eval forwards goal_behavior; INIs register the new flags

Known limitation (documented in the chat): episodes can span buffer
segment row boundaries; the spillover slots in the next row are
trained under truncated context vs full-cache rollout context. PPO
clip damps but doesn't block the resulting gradient mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drive (drive.h + drive.py):
- Option D: under goal_behavior=3, when an agent's trial_count reaches
  max_trials_per_episode, fire terminals + add_log_one_agent + mark agent
  as removed=1 and move it off-grid (INVALID_POSITION). Do NOT call
  c_reset; the agent idles until Python's resample_frequency triggers
  _reinit_envs_with_new_maps. Removes the 1-map-many-episodes pathology
  that was driving over-fit scores.
- add_log_one_agent: per-agent variant of add_log used at variable-length
  trial-mode episode ends. Mirrors all per-entity state c_reset would
  reset (respawn_timestep, current_goal_reached, metrics_array, etc.) so
  the next cycle starts clean.
- current_goal_reached=1 set in the goal-reach else branch (GOAL_STOP /
  GOAL_TRIAL share that path); reset to 0 in respawn_agent so the next
  trial can register a fresh goal-reach. Without this, every in-radius
  tick incremented goals_reached_this_episode and the score metric was
  meaningless.
- move_expert loops the recorded trajectory under GOAL_TRIAL (t %
  array_size) so static experts don't vanish past scenario_length.
- Trial overlay in c_render_with_mode: 'Trial X / K' instead of scenario
  counter.

Drive Python (drive.py):
- self.truncations[:] = 0 at the top of step so trial_ended_this_step
  -> truncations mirroring stays per-step.
- Mirror trial_ended_this_step -> truncations under gb=3 (carries the
  trial-boundary signal through the SHM buffer to pufferl for GAE
  bootstrap-stop without triggering KV-cache reset).
- Per-scenario block gated off under gb=3 (variable-length trials would
  land mid-trial).
- At resample_frequency boundary under gb=3, force a vec_log emission
  with num_agents=1 so slow agents that didn't finish max_trials still
  flush their metrics before _reinit_envs_with_new_maps zeros env->log.

adaptive.py:
- Under goal_behavior=3, auto-link max_trials_per_episode = k_scenarios
  and per_trial_timeout = scenario_length, matching the mental model
  'k_scenarios trials of scenario_length each.'

pufferl.py:
- done_mask = d (was d + t) so cache reset gates on terminals only.
  Trial boundaries (truncations) no longer wipe KV cache, which is the
  whole point of adaptive training across trials.
- GAE bootstrap_stop = (terminals + truncations).clamp(max=1.0): use
  truncations as the bootstrap-stop channel so V[t+1] from the
  post-respawn trial does not pull into the value target of the last
  step of the old trial.
- truncations persisted into the rollout buffer (parallel to terminals)
  so GAE can read the boundary mask.
- Debug logging: PUFFER_TRIAL_DEBUG_FILE env var enables per-event JSON
  trace of cache resets, GAE inner/outer, scenario boundaries, etc.

render.py + utils.py:
- Forward --goal-behavior + --max-trials-per-episode + --per-trial-timeout
  CLI flags to env init.
- max_steps default under gb=3 = max_trials * per_trial_timeout (the
  worst-case episode budget) instead of k_scenarios * scenario_length.
- Video basename uses 'trials{N}' label under gb=3.

rollout.py:
- max_steps default under trial mode matches drive.py / render.py.
- Break when terms.all() so trial-mode rollouts stop at episode boundary
  rather than after exactly k_scenarios * scenario_length ticks.
- info dict carries _trial_starts for downstream consumers.

Tests (8 new + 1 updated):
- test_gae_trial_boundary.py: GAE bootstrap-stop fires on truncations
  too, not just terminals.
- test_gae_decoupling_integration.py: end-to-end trial_ended_this_step
  -> truncations mirror.
- test_adaptive_trial_link.py: auto-link of max_trials/per_trial_timeout
  under gb=3.
- test_trial_standard_metrics.py: episode-length, score, offroad,
  collision populate under gb=3 via add_log_one_agent.
- test_trial_per_scenario_gate.py: per-scenario logic gated off under
  gb=3.
- test_trial_score_semantics.py: score uses max_trials denominator.
- test_trial_overcounting_fix.py: current_goal_reached gates
  goals_reached_this_episode increments.
- test_rollout_trial_mode.py: rollout max_steps / break / info match
  trial mode.
- test_goal_trial.py: updated test_trial_episode_resets to match Option
  D (one episode per resample_frequency; agent idles between).

All 53 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trials 2..K rendered empty: respawn_agent sets respawn_timestep =
env->timestep (drive.h:2685), and seven downstream gates treat
respawn_timestep != -1 as a "post-respawn ghost" marker, including the
3D mesh draw at drive.h:3482 (visible symptom), the collision checks
at 1327/1342, ego obs[6] at 2409, and other-car obs at 2455/2457. The
mid-episode trial-respawn branch needs to clear the flag back to -1
since GOAL_TRIAL is not a ghost-fade mode.

Single-line fix in the GOAL_TRIAL else-branch of c_step.

End-of-episode (Option D: removed=1, off-grid, terminals+add_log) is
unaffected — that branch never calls respawn_agent. GOAL_RESPAWN's
intentional ghost-render semantic is preserved (different branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arios>2

Symptom: eval logged only trial_0_score, trial_1_score, and
ada_delta_trial_1_minus_0 — missing trial_2/trial_3 even though the
env was actually running 4 trials (k_scenarios=4 with the adaptive.py
auto-link).

Cause: evaluator.py:658 read max_trials_per_episode straight from
args['env'] (= the INI default of 2). The auto-link in
AdaptiveDrivingAgent.__init__ updates the env's attribute but not the
args dict the eval inherits.

Fix: prefer puffer_env.driver_env.max_trials_per_episode (the actual
env value after auto-link); fall back to k_scenarios when args says the
INI default of 2. Same path for per_trial_timeout.

Test: tests/test_evaluator_trial_mode.py::test_trial_mode_auto_link_k4
exercises the k_scenarios=4 / args.max_trials=2 case and asserts
trial_0..3 scores + ada_delta_trial_{1,2,3}_minus_0 all appear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/src/trial_mode.md — full GOAL_TRIAL design doc:
- Two-boundary problem (terminals vs truncations) with the 4×4 table
- PPO/GAE bootstrap-stop formula (LaTeX)
- KV cache reset gate
- Option D semantic (idle-after-max_trials) with rationale
- Auto-link of trial parameters (current behavior, to be simplified)
- End-to-end signal flow (C → SHM → Python → pufferl → GAE)
- Score semantic + threshold ladder
- The 7 render gates on respawn_timestep != -1
- Per-trial metrics + evaluator's trial_K_score breakdown
- Test coverage map

tests/test_render_contract.py — programmatic visibility tests:
- gb=3 k=4: ego obs[6] must clear within 2 steps of mid-episode trial
  respawn (else 3D mesh draw at drive.h:3482 hides ego in trials 2..K)
- gb=3 k=2: same for the smallest meaningful trial setup
- gb=0: ghost-fade semantics preserved (non-regression)
- gb=3 k=4 end-of-episode: clean idle state, no 'invisible-but-onscreen'

Run before any drive.h / drive.py / render.py change. Catches the M7-fix
bug class without spinning up Xvfb + ffmpeg.

Wired trial_mode.md into mdBook SUMMARY under new 'Design' section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix, per-trial breakdown only appeared in eval-time
HumanReplayEvaluator output (every 40 epochs); training-time wandb was
blind to in-episode adaptation. Now every report_interval ticks the env
emits:
  - trial_0_score, trial_1_score, ..., trial_7_score
  - ada_delta_trial_1_minus_0, ..., ada_delta_trial_{k-1}_minus_0

Implementation:
- drive.h Log: added trial_k_goal_reached[N_TRIAL_K_SLOTS=8]. Incremented
  in c_step's GOAL_TRIAL trial-end branch for ego only, at index k =
  trial_count - 1 (the just-completed trial).
- binding.c my_log: emits trial_K_score keys, but only if
  n_trials_completed > 0 (gates on actual gb=3 activity so gb=0/1/2
  output stays clean).
- drive.py _inject_trial_deltas: post-processes the emitted log dict to
  add ada_delta_trial_K_minus_0 = trial_K_score - trial_0_score for
  K in 1..max_trials-1. Called only under goal_behavior == 3.

Tests: tests/test_ada_delta_train_logging.py (4) — trial_K keys emit,
ada_delta keys appear in info[], values match the subtraction, gb=0
doesn't leak trial keys.

All 16 trial-mode test files pass (61 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l_timeout

Drops the parallel knob surface (max_trials_per_episode +
per_trial_timeout). Under gb=3 the auto-link is now an invariant, not a
heuristic with INI-default detection. To change the trial count, change
k_scenarios.

Surface changes:
- INI: removed `max_trials_per_episode` and `per_trial_timeout` lines
  from drive.ini + adaptive.ini.
- render.py: dropped --max-trials-per-episode + --per-trial-timeout CLI
  flags; max_steps default = k_scenarios * scenario_length under all
  goal_behaviors.
- pufferlib/utils.py: subprocess-eval no longer forwards the two flags
  (the env derives them); video-naming uses k_scenarios.

Internals:
- adaptive.py: auto-link is unconditional under gb=3:
    kwargs["max_trials_per_episode"] = k_scenarios
    kwargs["per_trial_timeout"]      = scenario_length
  Removed the "if user passed default of 2 vs override" detection.
- evaluator.py: simplified to read from driver_env (still falls back to
  k_scenarios / sim_steps).
- C-side struct keeps max_trials_per_episode + per_trial_timeout fields
  (used internally); the only consumers now are the C trial-end branch
  and add_log_one_agent's denominator. No external API surface.

Tests:
- test_adaptive_trial_link.py: replaced test_explicit_user_override_wins /
  test_explicit_per_trial_timeout_wins with their negations — under gb=3
  any override is ignored, k_scenarios / scenario_length win. Documents
  the new invariant.

Design doc: simplified the "Auto-link" section and the quick-reference
knob table to reflect the smaller surface.

All 16 trial-mode test files pass in isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix, Python mirrored trial_ended_this_step → truncations after every
vec_step. Two writers (C: zero, write at trial-end; Python: zero, mirror)
made the buffer write-order-sensitive. Now C does everything; Python is
read-only under gb=3.

C side:
- drive.h Env struct: added `unsigned char *truncations`.
- env_binding.h: uncommented env->truncations wiring from positional arg
  (was wired for all sibling buffers, just not this one). Only Drive uses
  this header so no cross-env risk.
- drive.h c_step: zeroes env->truncations at top under gb=3; writes 1 at
  each trial boundary in the GOAL_TRIAL branch (alongside
  trial_ended_this_step). Both signals fire on the same agents.

Python side (drive.py):
- step(): no longer zeroes truncations under gb=3 — C handles it.
- step(): dropped the trial_ended_this_step → truncations mirror block.
- Under non-trial modes (gb=0/1/2), Python still owns truncations: zeroes
  it at top of step (so the k_eff-curriculum scenario-boundary write at
  drive.py:1202 keeps working).

Tests: tests/test_truncations_ownership.py (3):
- C zeroes truncations each step under gb=3 (Python-side pollute is wiped).
- C writes 1 simultaneously to truncations and trial_ended_this_step at
  trial boundary.
- Non-trial mode untouched by C.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removed narrate-the-code comment blocks I added during the trial-mode
build-out; kept only why-comments (non-obvious constraints, gotchas,
references to specific bug fixes).

drive.h:
- Trimmed add_log_one_agent header from 6 lines to 3.
- Collapsed the per-trial-K-slot doc to 2 lines.
- Collapsed the score-threshold-ladder rationale to 4 lines.
- Trimmed Option D / mid-trial-respawn ghost-flag blocks; cross-refs the
  design doc for the long form.
- Added the goal-reach invariant block (single source of truth for
  goals_reached_this_episode gating).

drive.py:
- Trimmed trial_ended_this_step field doc from 5 lines to 2.
- Trimmed per-scenario-gate rationale and resample_frequency block.

pufferl.py:
- Cache-reset gate explanation: 7 lines → 4.
- Truncations persistence: 6 → 2.
- GAE bootstrap-stop: 6 → 2.

All 17 trial-mode test files pass in isolation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lightweight categorization in tests/README.md. Avoids 17 file moves +
sys.path rewrites — the suite is already stable, so a doc-only split
serves the same intent (signal which tests are load-bearing design
invariants vs paper-trail bug locks) without code churn.

Records the going-forward rule: write contract tests BEFORE
implementation; only add regression tests when fixing real bugs.

Also documents the multi-Drive pytest-segfault workaround (run each
file in its own invocation) and links to the design doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces per-agent trial clocks with a single env-level trial clock.
Under gb=3:
  - On ego goal-reach mid-trial: ego goes off-map (entity.removed=1,
    INVALID_POSITION, vx=vy=0). No truncations / terminals — wait for
    env trial-end.
  - env trial-end fires when ALL active egos in env have removed=1 OR
    env per_trial_timeout elapses. On trial-end, ALL entities (egos +
    co-players) reset to init position; env_trial_start updates; env
    trial_count++.
  - env episode-end (env_trial_count == max_trials): terminals fires for
    every active ego; Option D applies (removed stays, off-map until
    c_reset). env_episode_ended sentinel prevents repeat trial-ends until
    c_reset.

Why: prior per-agent semantic broke the "trial 1 == trial 2 except cache"
invariant — co-players and recorded humans ran on their own clocks, so
trial 2's world differed from trial 1's world. Adaptation Δscore =
trial_K - trial_0 conflated cache effects with stochastic traffic
differences. B'' guarantees trial-conditions identity.

C side (drive.h):
  - Env struct: added env_trial_count, env_trial_start_timestep,
    env_episode_ended.
  - Env struct: added `unsigned char *removed` SHM buffer (per-agent
    off-map flag).
  - c_reset: zeroes the env-level trial state + per-agent removed.
  - Goal-reach else-branch: under gb=3, set entity.removed=1 + off-map
    instead of stopped=1.
  - c_step GOAL_TRIAL block: rewritten to env-level trial-end detection
    + batch reset. Gates on !env_episode_ended.
  - move_expert: uses env_trial_start_timestep for the replay clock, so
    recorded humans rewind to frame 0 at each env trial-end.

Python side:
  - drive.py: `self.removed` SHM buffer, wired through env_init kwargs.
  - binding.c: kwargs read for `removed` (both my_shared_self_play and
    my_init paths).

Tests: tests/test_env_level_trial.py (6 new contract tests):
  - removed buffer exists + zero at reset
  - env trial-end fires on timeout when no ego reaches
  - ego goes off-map on goal-reach
  - env trial-end resets all entities to init (mid-episode, k>=3)
  - episode-end fires after max_trials → Option D removed=1 stays
  - truncations NOT fired on individual reach (only at env trial-end)

All 17 prior trial-mode test files still pass in isolation. Pufferl-side
KV cache freeze (so off-map limbo doesn't pollute the transformer cache
across trial boundaries) is the next step — gated on vecenv.recv()
integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/debug_b_demo.py walks through:
  1. env-level trial state evolution (removed flag, trunc, term)
  2. move_expert clock reset at env trial-end (verifies humans rewind
     to frame 0 by checking obs[20:] is identical at trial 1 tick 1 and
     trial 2 tick 1)
  3. per-trial-K metrics in vec_log (trial_K_score for K in 0..k-1)
  4. KV cache freeze design (documents the pufferl-side integration
     that's still pending)

Output is human-readable ASCII traces — for visual inspection, not
unittest assertions. Run: python scripts/debug_b_demo.py
…t reset

Two bugs from the user's render of B''-on-pre-B''-checkpoint:

1) Trial overlay always showed "Trial 1 / 4" all four trials.
   Cause: overlay read per-entity `trial_count` (no longer maintained
   under B'' — env-level `env_trial_count` is the truth). Switched
   overlay to env-level counter and clamped to max_trials_per_episode.

2) Recorded humans / static experts vanished in trials 2..K.
   Cause: my B'' move_expert reset (t = timestep - env_trial_start)
   replays the same EARLY frames of recorded-human trajectories each
   trial. Many nuplan agents have late-valid windows (enter the scene
   mid-trajectory) — short trials never reach those valid frames, so
   the agents are invisible. Reverted move_expert to OLD looping
   `t = timestep % array_size`, which keeps humans continuously playing
   across trials.

   Trade-off: humans drift slightly across trials (different frame in
   trial 2 than trial 1). This violates the strict "trial 1 == trial 2"
   invariant for static experts. For training this is acceptable
   (humans are stochastic background). For paper-grade human_replay
   eval (1 ego per env) where strict equivalence matters, the eval path
   will need to explicitly reset move_expert — captured as a TODO in
   the function comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per user: "yeah whatever the humans do whatever they want at the timestep.
We dont care about what humans do, only what egos do. when trial 2 starts,
we reset the humans."

Restores the env-relative move_expert clock so humans (recorded experts)
restart at frame 0 every env trial-end. Trade-off: if humans have
late-valid windows and trials end fast (cache makes ego reach quickly),
humans may not appear in those trials — that's faithful to the data,
not a bug. Strict trial-equivalence on the ego side is what matters for
the adaptation experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

…+ respawn from init_steps

Two bugs caused trial 1 to differ from trial 2 even after the env-relative
move_expert clock was in place:

1) respawn_agent put active entities at traj[0]. But c_reset's
   set_start_position uses traj[init_steps]. So trial 1 started at
   recorded frame `init_steps`, trial 2..K started at frame 0 — different
   recorded poses, different velocities, different lane alignment.
   Fix: respawn_agent now uses init_steps (clamped to array_size).

2) move_expert mapped (timestep - env_trial_start) to recording frame
   directly, ignoring init_steps. Static experts were at frame 1 at
   trial K tick 1, but at frame init_steps+1 at trial 1 tick 1 (because
   c_reset put them at init_steps then move_expert advanced one tick).
   Fix: t = init_steps + (timestep - env_trial_start).

3) Even with (1) and (2), the per-entity reset in the trial-end loop
   didn't touch static agents or rest various e->state fields
   (collision_state, valid, dynamics integrators). Replaced the per-active
   respawn_agent loop with a single set_start_position(env) call at
   trial-end, which resets the EXACT same state c_reset does for every
   entity in the env. Strict bit-for-bit equivalence verified: trial1
   vs trial2 obs differ in 0/1850 dims, max diff = 0.0.

Now trial K == trial 1 except for ego's KV cache, as designed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

scripts/trace_b_render.py prints per-step: tick, env trial counter, rem
count (egos off-map mid-trial), trial_ended_this_step / truncations /
terminals counts, simulated KV cache write position, and event
highlights (GOAL-REACH, TRIAL-END, EPISODE-END).

Useful for visualizing B'' dynamics alongside the rendered mp4. Verified
empirically that the same agents reach goal at the same RELATIVE tick
across all 4 trials (agent 33 at relative tick 26 in trials 1-4, agent
42 at relative tick 116 in trials 1-4) — confirms strict trial
equivalence.

cache_pos is a simulation showing what the transformer's cache write
position would do; once task #33 wires the actual pufferl freeze for
removed agents, the displayed value will match the real cache.
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

bug_003: AdaptiveDrivingAgent.__init__ now asserts k_scenarios <= 8 under
gb=3 so the silent-truncation failure mode for per-trial metrics
(trial_k_goal_reached[8] in drive.h is fixed-size) is loud. Pre-fix,
k=9+ would lose trial_8_score from training-time wandb but eval would
still report it — diverging signal. With the assert, the user is told
to either bump N_TRIAL_K_SLOTS or pick a smaller k.

bug_002: self.removed docstring claimed "pufferl uses this to freeze the
KV cache during the off-map limbo" but pufferl has zero references to
the buffer — it's reserved for task #33. Reworded the comment.

Both flagged as nits by ultrareview, batched in one commit.
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Without this, when an ego reaches goal mid-trial and goes off-map
(removed=1), pufferl keeps invoking the transformer with that agent's
garbage INVALID_POSITION observations. Each invocation appends a new
K/V token to that agent's cache. Over a 174-tick limbo period, ~87%
of the cache becomes garbage; future trials' attention is dominated
by junk tokens.

Fix: per-agent attention-mask exclusion of garbage cache slots.

models.py TransformerWrapper:
- init_eval_state adds state["garbage_mask"]: (B, horizon) bool.
- forward_eval reads state["removed"] (B,) bool from pufferl. The
  attention mask becomes (B, 1, 1, horizon) per-agent =
  (slot <= current_pos) AND (not garbage_mask). After cache write,
  garbage_mask[:, slot_t] |= removed marks the just-written slot as
  garbage for off-map agents.
- reset_eval_state zeros garbage_mask alongside k_cache / v_cache,
  both for full-env reset and per-agent done_indices reset.

pufferl.py:
- New self.transformer_garbage_mask dict (per state_key, mirrors
  k_cache layout). Lazy-allocated by the model on first forward_eval.
- Before forward_eval: state["garbage_mask"] = ..., state["removed"]
  = view of vecenv.driver_env.removed sliced by env_id.
- After forward_eval: persist state["garbage_mask"] back.
- On done_mask reset (terminals): clear garbage_mask rows alongside
  k_cache / v_cache rows.

Tests (tests/test_cache_freeze.py, 4 contract tests):
- Slots written while removed=1 have zero softmax weight in next step
  (verified via _probe_attention path that captures attention weights
  per layer per agent).
- Other agents' attention to the same slots is unaffected.
- Full reset (done_indices=None) zeros garbage_mask.
- Partial reset (done_indices=[a, b]) zeros only those rows.
- Without state["removed"] (non-gb=3 modes), no slots are marked
  garbage — backward-compat preserved.

All 19 trial-mode test files pass.

Multi-env caveat: pufferl reads vecenv.driver_env.removed, which works
in single-Drive setups (render). For multi-worker training the buffer
needs to be unified (SHM); driver_env.removed currently is one Drive's
view. Captured as follow-up — same mechanism that propagates terminals
across workers needs to do removed too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

mohitmk01 and others added 7 commits May 15, 2026 11:41
Previously `self.removed = np.zeros(num_agents)` was allocated as a private
numpy array per Drive instance. In Multiprocessing this means each worker's
`removed` writes were invisible to the main process — `driver_env.removed`
on the parent was just a metadata stub that no worker ever wrote to.
Effect: garbage_mask was never populated in training, so the cache-freeze
mask did nothing.

Plumb `removed` through SHM the same way `terminals`/`truncations` are:
- vector.py Multiprocessing: allocate `removed=RawArray("b", num_agents)`
  in self.shm; expose `self.removed = self.buf["removed"].ravel()` for
  pufferl. _worker_process maps the worker's row into buf["removed"].
- vector.py Serial: allocate `self.removed` (or accept from outer buf)
  and slice per env into buf_i["removed"].
- drive.py: read `buf["removed"]` if present, fall back to local alloc.
- pufferl.py: read unified `vecenv.removed` (works for both backends),
  falling back to driver_env.removed for compat.

Verified: 4-worker Multiprocessing test shows worker-side writes
(bit counts [1,2,1,1]) visible from the main process via SHM after one
step. Serial driver_env.removed shares memory with vecenv.removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…O loss

Eval forward (`forward_eval`) already excluded off-map limbo cache slots
via `garbage_mask`. Training forward (`forward` → `create_episode_mask`)
used only `terminals` and saw limbo tokens unmasked. The PPO loss also
ignored `masks` (per the pre-existing `# Note: We are not yet handling
masks in this version` at pufferl.py:798). Result: the policy attended
to garbage during training AND limbo tuples (zero reward, post-reset or
INVALID_POSITION obs) contributed gradient to pg_loss/v_loss/entropy.

This commit closes the train/eval mismatch:
- pufferl.py: add removed_history rollout buffer; capture vec.removed
  per step alongside terminals/truncations; pass mb_removed into the
  training forward via state["removed"]; weight pg_loss, v_loss,
  entropy_loss and diagnostic stats (approx_kl, clipfrac) by
  valid=~removed, normalized by sum(valid).
- models.py: in the training `forward` else-branch (use_episode_mask),
  add (B, 1, T) limbo bias of -inf to the attention mask on the SOURCE
  axis. attn_mask[b, t, s] += -inf if removed[b, s], for all queries t.
  Matches eval-side `garbage_mask` semantics exactly.

Verified with tests/test_train_eval_parity.py (out-of-tree):
- output at non-limbo positions is bit-invariant to limbo obs values
- output WITHOUT the mask leaks (Δ=0.57 confirms the prior bug)
- gradient at limbo input positions is exactly 0 with full gating

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit agent flagged: my earlier limbo_bias masked the diagonal (t==s),
so a query whose causal prefix is entirely limbo had ZERO valid sources
→ softmax(all -inf) = NaN. NaN then poisons the (now-masked) per-step
losses via 0 * NaN = NaN.

Fix: leave attn_mask[b, t, t] = 0 always. This matches eval semantics
exactly — garbage_mask[a, slot_t] is set AFTER the forward at step t,
so a limbo query during its own forward attends to itself (just as it
would on the eval side). Other limbo SOURCES still get -inf from any
non-self query, which is the train/eval parity we want.

Test 4 in /tmp/test_train_eval_parity.py: constructs removed=[1,1,1,0,...]
and confirms no NaN. All prior parity tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PPO audit agent flagged three remaining holes in the previous train/eval
parity patch (commit e396507):

1. **self.values[idx] = newvalue.detach()** (line ~1112) unconditionally
   wrote the value head's output at limbo positions back into the buffer.
   At limbo positions newvalue is computed from INVALID_POSITION obs and
   is garbage; the next outer GAE call (next minibatch) read it and
   propagated the poison. Gated: keep prior value at limbo positions.

2. **self.ratio[idx] = ratio.detach()** (line ~1016) cached limbo
   importance ratios. Outer GAE's v-trace uses these as rho/c
   coefficients. Limbo ratios are from logprob(garbage_obs, garbage_action)
   and pollute the next minibatch's advantage chain. Gated identically.

3. **bootstrap_stop** at both outer GAE (line ~900) and inner
   recomputation (line ~1049) only included `terminals + truncations`.
   Now also includes `removed_history.float()`, so V(t+1) bootstrapping
   stops at every limbo step. Otherwise the value at a healthy step t-1
   would still bootstrap into V(garbage obs at t) when t is limbo.

Together with the in-loop loss gating from e396507, this fully contains
limbo poisoning: limbo positions contribute zero gradient AND zero
information to the GAE chain AND don't poison the value/ratio buffers
across PPO epochs.

All existing tests pass:
- /tmp/test_train_eval_parity.py (4/4 tests)
- tests/test_cache_freeze.py
- tests/test_env_level_trial.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r ckpt

Two targeted fixes from the audit pass:

1. drive.h:447-450 — `env->log.goals_reached_this_episode` and
   `goals_sampled_this_episode` were accumulated OUTSIDE the
   `if (e->is_ego)` guard inside `add_log_one_agent`. With
   population_play this meant co-player goal counts contaminated the
   ego aggregate, biasing the `completion_rate` metric upward (and
   indirectly any downstream consumer). Moved the two increments
   inside the ego guard.

2. vector.py: module-level cache `_CO_PLAYER_STATE_DICT_CACHE` keyed
   by (checkpoint_path, mtime). Render epochs were calling
   `pufferlib.vector.make` and reloading the (~50MB) co-player
   checkpoint from disk each time, contributing to the OOM seen at
   first-eval/render epochs. Cache invalidates if the file is
   replaced. Training path also benefits (only relevant if you
   construct multiple vec envs in one process).

c_reset-at-episode-end was prototyped, then reverted after
discussion: existing `bootstrap_stop ∨ removed` + loss mask already
neutralizes the limbo gap's gradient damage; forcing c_reset would
repeat the same map ~7× per resample window and reduce data
diversity. Trade-off not worth the 5-10% wall-clock win.

All tests pass:
- /tmp/test_train_eval_parity.py (4/4)
- tests/test_cache_freeze.py
- tests/test_env_level_trial.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PPO loss-gating patch broadcast `valid_mask` (B, T) against
`entropy` (B*T,) — sample_logits returns entropy flat, not 2D.
RuntimeError: tensor size mismatch (20100 vs 402).

Fix: flatten valid_mask once (`valid_mask_flat`) and use it for the
entropy term in both the entropy_conditioned and the standard branch.
Normalize by the flat sum.

pg_loss / v_loss are unaffected (they use (B, T) tensors throughout).

All parity tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…outing

* tests/inspect_system.py: end-to-end system probe (env + policy + KV cache
  + garbage_mask + attention) producing per-tick JSONL, npz snapshots, and
  rendered mp4 for each of {coplayer, human_replay, ego_only} modes.
* tests/plot_attention.py: heatmaps + per-head bar plots + limbo-mask
  verification plots from inspect_system outputs.
* scripts/adaptive/nuplan_transformer_local_k2_201_gb3_{2,4}partners.sh: the
  k=2 gb=3 sweep launchers we've been iterating on (nw=16 for 2 partners,
  nw=8 for 4 partners; eval shrunk to 64 maps to keep memory under cap).
* scripts/adaptive/nuplan_transformer_local_k4_201_gb3_2partners.sh: k=4
  variant with nw=8 (k=4 doubles pinned RAM so nw drops for 2-way parallel).
* pufferlib/utils.py: render_videos honors eval.map_dir if set, so renders
  match the eval distribution (e.g. nuplan_hard) rather than training maps.
* pufferlib/ocean/drive/drive.h: ffmpeg render fps lowered to 10 (matches
  env dt=0.1; previously 30 was 3x wall-clock playback).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Repo-local scores CSV (611 KB) so a fresh clone can run
build_nuplan_hard.py without copying the file separately. Default
--scores path updated to scripts/nuplan_201_hardness_scores.csv.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Forked from scripts/adaptive/nuplan_transformer.sh with:
  * gb=3 GOAL_TRIAL flags (--env.goal-behavior 3 + k_scenarios + scenario_length)
  * nw=32 (cluster has more RAM than local container; per-task mem=256GB
    covers 181 GiB pinned + 30 GiB overhead + 15 GiB eval-spike margin)
  * cpus-per-task=40 (nw=32 + 8 overhead)
  * minibatch_multiplier=25 → keeps minibatch_size=20100 at horizon=804
  * eval shrunk to 64 maps for the eval subprocess spike
  * PYTORCH_CUDA_ALLOC_CONF=expandable_segments + WANDB_MODE=online forced
  * --array=0-1 for 2 partners (miku2puk @ 0.05, 2e029h15 @ 0.10)

Submit with: sbatch scripts/adaptive/cluster_nuplan_transformer_k4_gb3.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Was 0-1 (2 partners, 1 seed each). Now array=0-11, indexed as:
  TASK_ID = partner_idx * 3 + seed_idx
  partners = [miku2puk@e_ub=0.05, 2e029h15@0.10, m2ygolog@0.20, 6rauydj2@0.50]
  seeds    = [42, 43, 44]

Each task gets one GPU + 256GB + 40 CPUs (cluster handles the 12-way
parallelism via the array). 2B steps per task → ~5h each at nw=32.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Validates the cluster setup before submitting the 12-run sweep:
singularity + venv + binding load, nuplan_201/nuplan_hard binaries
present, partner ckpt loads, co-player + ego forward pass works,
wandb sync online, no OOM/NaN at first dashboard frame.

Resources: 48GB / 8 CPUs / 1 GPU / 30 min. ~3-4 min wall-clock once
running. Submit with sbatch, watch /scratch/.../smoke_<jobid>.out,
scancel as soon as Steps advances.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Force-adds the 4 partner checkpoints (gitignored experiments/ dir):
  puffer_drive_miku2puk.pt   (e_ub=0.05)
  puffer_drive_2e029h15.pt   (e_ub=0.10)
  puffer_drive_m2ygolog.pt   (e_ub=0.20)
  puffer_drive_6rauydj2.pt   (e_ub=0.50)

Used by cluster_nuplan_transformer_k4_gb3.sh (the 4×3-seed array sweep)
and the local k=4 2-partner launcher. 5 MB × 4 = 20 MB one-time cost.

Also bumps the local k=4 2-partner launcher from nw=8 to nw=10 (after
container memory.max increased from 132 GiB to 154 GiB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Previous cluster smoke OOM'd at startup despite nw=4. Reducing:
  --mem 48GB → 64GB              (more CPU headroom)
  --train.minibatch-multiplier 25 → 4   (minibatch 20100 → 3216, 6x less GPU memory)
  --train.max-minibatch-size 20100 → 3216
  --train.total-timesteps 5M → 2M       (finish faster)
  --time 30min → 20min

This matches the config of the first 50M smoke on the vast box that
ran cleanly end-to-end. If THIS still OOMs, the issue is cluster
container limits, not our minibatch sizing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Rewrite `_fill_external_co_player_actions` to handle vec.batch_size > 1.
When the env is launched with `--vec.batch-size=$NUM_WORKERS`, recv()
returns all workers' obs stacked, and this function now does a single
batched GPU forward over the combined co-player batch instead of
launching one forward per worker.

Mechanism:
  - Parse N workers' co_ids + reset flags from info[]
  - Concatenate per-worker co_obs slices into one batched tensor
  - Merge per-worker KV caches along batch dim for the forward
  - Single forward_eval call
  - Split updated state back to per-worker stores
  - Scatter actions to each worker's SHM action buffer slot

When batch_size == 1 the new code is a no-op equivalent of the old:
n_in_batch=1, no concat/split work, single worker processed.

Smoke result at nw=4 (RTX 4090): SPS 10.2K -> 32.5K (3.2x).
At cluster nw=32 the per-step Python overhead reduction should give
~2-2.5x SPS (16h -> ~6-8h per partner-seed run).

To enable, add to sbatch:
  --vec.batch-size <NUM_WORKERS>

Without that flag the code path is unchanged (batch_size=1 default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Two fixes that remove the two main triggers preventing torch.compile from
producing a clean, stable forward_eval graph:

1. models.py: garbage_mask update was using a 0-d scalar tensor index
   (slot_t.squeeze()), which causes Dynamo to call aten._local_scalar_dense
   and graph-break every call. Replaced with a functional broadcast:

     slots = torch.arange(self.horizon, device=device)
     position_mask = slot_t.view(-1, 1) == slots.view(1, -1)
     garbage_mask = garbage_mask | (r.unsqueeze(1) & position_mask)

   No semantic change — same final values, no scalar extraction.

2. pufferl.py: self.amp_context was being entered/exited per recv()
   iteration inside the eval loop. Dynamo sees autocast GLOBAL_STATE flip
   between calls and recompiles forward_eval each step. Hoisted the
   amp_context to wrap the entire rollout loop with explicit enter/exit.

Smoke at nw=4 (RTX 4090, default compile mode):
  - opt432 (#2 only):    1.2M steps in 79s, Eval.Forward 17s
  - opt432 + Path A:     1.2M steps in 65s, Eval.Forward 10s
  - Wall-clock speedup:  ~18% (Forward time dropped ~41%)

This does NOT unlock CUDA graphs (compile_mode=reduce-overhead still
crashes on state-dict mutation/lazy alloc). The win comes from a clean
single compile of forward_eval that doesn't re-fire each step.

Stacks multiplicatively with #2. Projected at cluster nw=32:
  7h post-#2 -> ~5.9h post-Path A.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants