Mohit/cleanup apr25 #36

Open
m2kulkarni wants to merge 159 commits into main from mohit/cleanup-apr25

Conversation

@m2kulkarni

No description provided.

daphne-cornelisse and others added 30 commits November 11, 2025 13:35
* Make sure we can overwrite goal_behavior from python side and other minor improvements.

* Fix stop goal behavior bug.

* Make goal radius configurable for WOSAC eval.

* Reset to defaults + cleanup.

* Minor

* Minor

* Incorporate feedback.
Accel is being cut in half for no reason
* Fix incorrect obs dim in draw_agent_obs

* Update drive.h

---------

Co-authored-by: Daphne Cornelisse <cor.daphne@gmail.com>
…erge-Lab#104)

* make joint action space, currently uses multidiscrete and should be replaced with discrete

* Fix shape mismatch in logits.

* Minor

* Revert: Puffer doesn't like Discrete

* Minor

* Make action dim conditional on dynamics model.

---------

Co-authored-by: Daphne Cornelisse <cor.daphne@gmail.com>
* Replace default learning rate and ent_coef.

* Minor

* Round.
* Quick integration of WOSAC eval during training, will clean up tomorrow.

* Refactor eval code into separate util functions.

* Refactor code to support more eval modes.

* Add human replay evaluation mode.

* Address comments.

* Fix args and add to readme

* Improve and simplify code.

* Minor.

* Reset to default ini settings.
* Add python test for ini file parsing

- Check values from default.ini
- Check values from drive.ini
- Additional checks for comments capabilities

* Add C test for ini file parsing

- Add CMake project to configure, build and test
- Test value parsing
- Test comments format
- Add comments for (un)expected results

* FIX: Solve all memory errors in tests

- Compile with asan

* Remove unprinted messages

* Add utest to the CI

- Ini parsing tests
- Update comments to clarify intent

* Update tests/ini_parser/ini_tester.c

- Change check conditions to if/else instead of ifs
- Speed up parsing (exit as soon as a match is found)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update tests/ini_parser/ini_tester.c

- Fix mismatched assignment

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* FIX: Move num_map to the high level of testing

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…-Lab#138)

* Adding Interaction features

Notes:
- Need to add safeguards to load each map only once
- Might be slow if we increase num_agents per scenario, next step will
be torch.

I added some tests to check that the distance and ttc computations are correct,
and metrics_sanity_check looks okay. I'll keep making some plots to
validate it.

* Added the additive smoothing logic for Bernoulli estimate.

Ref in original code:
message BernoulliEstimate {
    // Additive smoothing to apply to the underlying 2-bins histogram, to avoid
    // infinite values for empty bins.
    optional float additive_smoothing_pseudocount = 4 [default = 0.001];
  }
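
The smoothing referenced above can be sketched as follows. This is a minimal illustration, assuming the estimate is a two-bin histogram in which each bin receives the pseudocount before normalizing; the function name is hypothetical and is not taken from estimators.py.

```python
import math


def smoothed_bernoulli_log_likelihood(samples, observed, pseudocount=0.001):
    """Log-likelihood of `observed` under a smoothed two-bin histogram of `samples`."""
    true_count = sum(1 for s in samples if s)
    false_count = len(samples) - true_count
    # Each of the two bins gets the pseudocount, so the total grows by 2x.
    total = len(samples) + 2 * pseudocount
    p_true = (true_count + pseudocount) / total
    p_false = (false_count + pseudocount) / total
    return math.log(p_true if observed else p_false)


# Without smoothing, an empty "True" bin would give log(0) = -inf.
ll = smoothed_bernoulli_log_likelihood([False, False, False, False], observed=True)
print(math.isfinite(ll))
```

With the default pseudocount the empty bin keeps a small but non-zero probability, so the log-likelihood stays finite.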

* Little cleanup of estimators.py

* Towards map-based realism metrics:

First step: extract the map from the vecenv

* Second step: Map features (signed distance to road edges)

A bunch of little tests in test_map_metric_features.py to ensure this does what it is supposed to do.

python -m pufferlib.ocean.benchmark.test_map_metrics

Next steps should be straightforward.

Will need to check at some point whether doing this in numpy is too slow

* Map-based features.

This works and passes all the tests. I would still like to make additional checks with the renderer, just in case.

With this, we have the whole set of WOSAC metrics (except for traffic lights), and we might also have the same issue as the original WOSAC code: it is slow.

Next step would be to transition from numpy to torch.

* Added a visual sanity check: plot random trajectories and indicate when WOSAC sees an offroad or a collision

python pufferlib/ocean/benchmark/visual_sanity_check.py

* Update WOSAC control mode and ids.

* Eval mask for tracks_to_predict agents

* Replacing numpy by torch for the computation of interaction and map metrics.

It makes the computation way faster, and all the tests pass.

I didn't switch kinematics to torch because it was already fast, but I might make the change for consistency.

* Precommit

* Resolve small comments.

* More descriptive error message when going OOM.

---------

Co-authored-by: WaelDLZ <wawa@CRE1-W60060.vnet.valeo.com>
Co-authored-by: Waël Doulazmi <wawa@10-20-1-143.dynapool.wireless.nyu.edu>
Co-authored-by: Waël Doulazmi <wawa@Waels-MacBook-Air.local>
Co-authored-by: Daphne Cornelisse <cor.daphne@gmail.com>
Co-authored-by: Pragnay Mandavilli <pm3881@gr052.hpc.nyu.edu>
* Add option for targeted experiments.

* Rename for clarity.

* Minor

* Remove tag

* Add to help message and make deepcopy of args to prevent state pollution.
…merge-Lab#146)

* Little optimizations to use less memory in interaction_features.py

They mostly consist of using in-place operations and deleting unused variables.

Code passes the tests.

Next steps:
- clean the .cpu().numpy() in ttc computation
- memory optimization for the map_features as well

* Add future todo.

---------

Co-authored-by: Waël Doulazmi <waeldoulazmi@gmail.com>
…rent dataset (Emerge-Lab#151)

* Support train/test split with datasets.

* Switch defaults.

* Minor.

* Typo.

* More robust way of parsing the path.
* Load the sprites inside eval-gif()

* Color consistency.

* pedestrians and cyclists 3d models

* Minor.

---------

Co-authored-by: Spencer Cheng <spenccheng@gmail.com>
* multiprocessing and progbar

* cleanup
* Test

* Edit.

* Edit.
* Get rid of magic numbers in torch net.

* Stop recording agent view once agent reaches its first goal. Respawning vids look confusing.

* Add in missing models for headless rendering.

* Fix bbox rotation bug in render function.

* Remove magic numbers. Define constants once in drive.h and read from there.
Co-authored-by: Daphne Cornelisse <cor.daphne@gmail.com>
…merge-Lab#165)

* Get rid of magic numbers in torch net.

* Stop recording agent view once agent reaches its first goal. Respawning vids look confusing.

* Add in missing models for headless rendering.

* Fix bbox rotation bug in render function.

* Remove magic numbers. Define constants once in drive.h and read from there.

* Remove all magic numbers in drivenet.h

* Clean up more magic numbers.

* Minor

* Minor.
mohitmk01 and others added 30 commits April 25, 2026 20:57
…uiv test

The previous version did `int(pos.item()) % horizon` to get a Python int
for cache slot indexing. On the GPU ego policy under torch.compile this
caused a Dynamo graph break and a CUDA sync every step. Refactored to
keep `slot` as a 1-element long tensor throughout:

- Slot computed as `(pos % horizon).long()` (no .item()).
- Position embedding pulled with `index_select(1, slot_t)`.
- Cache write uses `index_copy_(2, slot_t, k)` instead of `[..., slot, :] =`.
- Attention mask built from `arange <= slot_t` instead of slot-keyed dict.

Verified: torch.compile of forward_eval on CUDA bf16 completes 5 calls
with no graph breaks logged.

Also adds tests/test_co_player_equivalence.py — equivalence check on the
*real* puffer_drive_b6big5j1.pt co-player checkpoint over 200 streaming
steps with multiple horizon wraps + per-row reset, on both CPU and GPU,
in fp32 and bf16. Production paths (CPU fp32 for co-player, CUDA bf16 for
ego) match legacy to ~1e-5 and ~0.16 max-logit-diff respectively, both
well within RL noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep on a 1-GPU test mirroring the production puffer_adaptive_drive
command (KV-cache fix on, MM=400, MAXMB=36400) showed:

  nwork=16 (old default):  70.9K SPS  (eval 44s, train 5s)
  nwork=32 (new default):  90.1K SPS  (eval 1m 9s, train 7s)  -> +27%

Eval scales sublinearly because of CPU/memory-bandwidth contention,
but the gain is real and free of any algorithmic change. nwork=48
needed --train.cpu-offload True (~33 GB obs buffer doesn't fit on a
5090); cpu_offload has a CPU/GPU index-device bug in the adaptive
code path so 48 isn't reachable yet.

For the typical multi-job pattern (4 simultaneous 1-GPU runs), 4 jobs
x 32 workers = 128 cores; the box has 224, so still well within
budget.
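
The throughput gain and the core budget quoted above can be re-derived with a few lines of arithmetic. This sketch assumes one CPU core per worker, which the commit message implies but does not state.

```python
# Figures copied from the sweep above.
sps_old, sps_new = 70.9e3, 90.1e3
gain = sps_new / sps_old - 1.0

# Multi-job pattern: 4 simultaneous 1-GPU runs at the new default.
jobs, workers_per_job, cores = 4, 32, 224
cores_used = jobs * workers_per_job  # assumes one core per worker

print(f"SPS gain: {gain:+.0%}")                 # SPS gain: +27%
print(f"cores used: {cores_used} of {cores}")   # cores used: 128 of 224
```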

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… goal-behavior

When --load-model-path points at a checkpoint that has a sibling
trainer_state.pt, restore optimizer state, epoch, global_step, and
advance the cosine LR scheduler to the saved position. When --load-id
is also passed, the wandb logger now reuses that run id (resume="allow"
was already the default in WandbLogger), so the resumed training appends
to the original wandb history instead of starting a fresh run.

Also pin --env.goal-behavior 2 in the eval-time render path so the
human_replay videos match the eval metric semantics (stop-on-goal, not
respawn).

torch.load needs weights_only=False because trainer_state.pt holds the
optimizer state dict (with class refs), not just tensors — otherwise
PyTorch 2.6's default weights_only=True silently rejects the load and
the resume falls back to a cold-start.

Two new launcher scripts (nuplan_transformer_local_resume.sh and the
k=3 variant) automate the resume flow: snapshot the latest checkpoint
of each run, pass --load-id + --load-model-path, and re-spawn the 8
killed runs in fresh tmux sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With cpu_offload=True, the observations buffer is allocated on CPU (with
pin_memory) so it doesn't eat GPU VRAM. The training-time minibatch
sampling then crashed at:

  RuntimeError: indices should be either on cpu or on the same device as
  the indexed tensor (cpu)

because the prio-sampled `idx` tensor lives on the training device (GPU)
but is used to fancy-index the CPU obs tensor. Move the index to CPU for
the gather, then ship the gathered minibatch back to the device with a
pinned-memory async copy. Path is gated on cpu_offload so the
non-offload default is bit-identical.

Unlocks --vec.num-workers > 16 for k=3 adaptive training (was OOM-bound
on the GPU obs buffer; with offload, obs lives in 500GB RAM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the per-step co-player Transformer forward pass out of the forked
worker subprocesses (where it's stuck on a single CPU thread) and onto
the main GPU process. Workers now skip get_co_player_actions and just
read the co-player slots of the shared-memory `actions` buffer that the
main process fills before each vec_step.

For k=3 adaptive runs this lifts the env-stepping bottleneck (single
worker = 273 steps × ~50-100 ms CPU forward = ~14-27 s of inference per
worker per epoch). On GPU with the existing KV cache the same forward
is ~5-10 ms, so the env-stepping section becomes essentially free.

Wiring:
  * adaptive.ini: new opt-in flag `external_co_player_actions`. Default
    False so existing per-worker CPU runs are bit-identical.
  * vector.py: when the flag + `co_player_enabled` are set, allocate a
    `co_player_conditioning` SHM buffer (per-worker x co-players x
    conditioning_dim), expose the GPU co-player policy on `vecenv`, and
    skip the parent-process single-thread torch lockdown (workers no
    longer run torch).
  * drive.py: env step() skips the local CPU forward; reset() and the
    scenario-boundary still resample conditioning and write it to the
    SHM (main reads it before each forward). State management is owned
    entirely by main now.
  * pufferl.py: PuffeRL.__init__ moves the co-player to the training
    device and sets up per-worker state dicts (lazy-allocated by
    forward_eval to avoid dtype mismatch with autocast). evaluate()
    gains _fill_external_co_player_actions which extracts co-player obs
    from the recv batch, concatenates SHM-resident conditioning (matches
    drive.py:_add_co_player_conditioning exactly), runs forward_eval on
    GPU, and writes argmax actions into vecenv.actions[worker_id]
    before send.
  * models.py: defensive cast in _prime_kv_cache so a dtype mismatch
    between the K/V cache and the prime-time layer output (which can
    appear under autocast/mixed precision) doesn't crash with
    `Index put requires the source and destination dtypes match`.

Smoke run on GPU 6 (puffer_adaptive_drive, k=2, nw=2, b6big5j1
co-player ckpt with all-conditioning) advances cleanly through 3+
epochs, no errors, SPS = 17K → 23K.

Known limitations / followups:
  * batch_size > 1 in vec is not yet supported (raises NotImplemented).
    Production runs use batch_size=1 so this is fine, but the loop
    needs a small generalization for completeness.
  * Numerical equivalence to the per-worker CPU path has not yet been
    verified end-to-end on a long run; smoke matches by inspection of
    losses (no NaN, sensible policy_loss / kl).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds trajectory_length and output_subdir kwargs to process_all_maps /
load_map / save_map_binary so we can regenerate nuplan binaries at the
full 201-frame length (the underlying JSONs go up to 201 frames; the
old 91-frame default truncates ~55% of the data on average) without
overwriting the existing 91-frame nuplan/ binaries.

Defaults preserved at 91 so all existing code paths and binaries keep
working unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The aggregate ada_delta_score was hovering at ~0 across all 4 k=3 trained
checkpoints, but most nuplan scenes are easy (ego trivially succeeds in
scenario 0) so the signal in the few hard cases gets diluted. This adds
a per-(rollout, agent, scenario) success log so downstream analysis can
compute conditional rates like P(succeed s_k | failed s_0) — the actual
in-context-adaptation signal lives there.
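
A conditional rate of this kind can be computed from a flat list of records roughly as follows. This is a sketch only: the record keys (`rollout`, `agent`, `scenario`, `success`) and the function name are hypothetical, since the actual schema of per_agent_success_log is not shown here.

```python
def conditional_success_rate(records, k):
    """Estimate P(succeed s_k | failed s_0) from flat per-agent records."""
    # Index outcomes by (rollout, agent, scenario).
    outcome = {}
    for r in records:
        outcome[(r["rollout"], r["agent"], r["scenario"])] = bool(r["success"])

    # Condition on the (rollout, agent) pairs that failed scenario 0.
    failed_first = [
        (rollout, agent)
        for (rollout, agent, scenario), ok in outcome.items()
        if scenario == 0 and not ok
    ]
    later = [outcome[(ro, ag, k)] for ro, ag in failed_first if (ro, ag, k) in outcome]
    if not later:
        return None  # no agent failed s_0, so the rate is undefined
    return sum(later) / len(later)


records = [
    {"rollout": 0, "agent": 0, "scenario": 0, "success": False},
    {"rollout": 0, "agent": 0, "scenario": 1, "success": True},
    {"rollout": 0, "agent": 1, "scenario": 0, "success": False},
    {"rollout": 0, "agent": 1, "scenario": 1, "success": False},
    {"rollout": 0, "agent": 2, "scenario": 0, "success": True},
    {"rollout": 0, "agent": 2, "scenario": 1, "success": True},
]
print(conditional_success_rate(records, k=1))  # 0.5
```

Agent 2 succeeds in s_0 and is therefore excluded from the denominator, which is what keeps the easy scenes from diluting the signal.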

Pipeline:
  - HumanReplayEvaluator now tracks per-agent goal-reach via the +1 reward
    spike (in stop-on-goal eval, dones are not set per agent). Dumps a
    flat list-of-records to wandb metrics under per_agent_success_log.
  - RECOVERY_CACHE_RESET_PER_SCENARIO=1 env var resets the policy's K/V
    cache at every scenario boundary — the control variant that lets us
    isolate "is the cache helping?" from "is current obs alone enough?"
  - scripts/eval_recovery_{all,control}.sh launch the 4 k=3 checkpoints
    in parallel on GPUs 4-7. recovery_compare.py renders adaptive-vs-
    control side-by-side with all conditional rates.

Result on the 4 current k=3 checkpoints (300 rollouts × 99 agents):
  - Cache contributes a ~+1% lift on P(succeed s_k | failed s_0) — small
    but positive across all 4, statistically significant (~3σ) on 2/4.
  - Bulk of policy capacity is single-scenario observation; cross-scenario
    context provides marginal signal. Map rotation per scenario (TODO)
    should force the policy to lean on the cache more.

Also includes:
  - scripts/coplayers/nuplan_transformer_local_201.sh: launcher for new
    co-players trained on the regenerated 201-frame nuplan binaries.
  - Resume scripts updated to point at /workspace/ADA (post-merge of the
    coplayer-to-gpu branch) and use the optimized flags where appropriate.
  - TODO_paper.md tracks pending work (resume k=3, map rotation, etc.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-scenario map rotation: at each scenario boundary the env hard-reinits
with fresh map_ids while leaving the ego policy's K/V cache (held in main)
intact. Forces the policy to use cross-scenario context because the current
scene is genuinely new each scenario.

Render path needed C-side help to survive the in-step vec_close + revec:
- New c_donate_client / c_adopt_client (env_binding.h, drive.h) stash
  env[0]->client into a global before vec_close so raylib + ffmpeg pipe
  survive the swap.
- Render env sets _render_keep_client_on_swap=True; reinit calls donate
  before vec_close and adopt after revec.
- Reinit's env_init now passes render_mode=self._render_mode_int so the
  new env stays in HEADLESS and write_frame_to_pipe keeps firing past
  the first scenario.

Without this, render segfaulted at the boundary (raylib's CloseWindow ->
InitWindow cycle is not safe under xvfb) or stopped writing frames at
frame 91 (render_mode silently reset to OFF).

Launchers:
- nuplan_transformer_local_k3_maprand.sh: 4 k=3 runs vs old 91-frame
  coplayers with map_rand on. Default GPUs 4-7.
- nuplan_transformer_local_k2_201.sh: 4 k=2 runs vs new 201-frame
  coplayers with map_rand on. Default GPUs 0-3, nw=32.

The 4 k=2/201/maprand runs that were started today (wandb runs
1f7gi2r3, 99vg079m, i8qy2lc8, szzu12a2) were trained on this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Experiments enabled by this commit (in approx order of run):
- City-adapt log-replay training: train on US (Boston+Pittsburgh+Vegas,
  5072 scenes), eval on Singapore (329 held-out scenes). co_player_enabled
  off, scenarios continue on the same map (map_rand off). Launcher:
  scripts/adaptive/nuplan_transformer_local_k2_201_city.sh.

- Entropy-sweep coplayer training: 5 partners trained on the 201-frame
  nuPlan data with entropy_ub ∈ {0.05, 0.10, 0.20, 0.50, 1.00} at fixed
  discount_lb=0.4. Gives a partner pool spanning near-deterministic to
  very stochastic. Launcher:
  scripts/coplayers/nuplan_transformer_local_201_entropy_sweep.sh.

- Diverse-partner adaptive training: pair ego with one entropy-sweep
  partner, sample partner conditioning per episode from the partner's wide
  trained range. map_rand off so ego batch row → agent identity stays
  stable across the boundary; cache encodes partner type from s_0,
  applies in s_1. Launcher:
  scripts/adaptive/nuplan_transformer_local_k2_201_diverse.sh.

Architecture / env additions:
- drive.py: condition_rand_per_scenario flag (default False). When True
  the partner conditioning vector is re-sampled at every scenario
  boundary; partner POLICY weights are unchanged but its conditioning
  input changes. Defined as a separate axis from map_rand (the two
  resamplings happen via different code paths). Currently unused by the
  diverse launcher (per-episode sampling already gives the right
  semantics for adaptation), but kept for future curriculum work.
- adaptive.ini: exposes condition_rand_per_scenario as a CLI flag.
- utils.py: render_videos forces external_co_player_actions=False on
  the render env (centralized inference is only set up for the trainer's
  workers, not the standalone render env, so co-players froze without
  this).
- drive.h: render_mode propagation in env reinit + per-agent color
  distinguishing ego (magenta) from co-player (blue) so debug renders
  show which policy controls which car.
- models.py: probe_attention support — when state["_probe_attention"]
  is True, forward_eval computes attention weights via explicit softmax
  (instead of SDPA's functional which doesn't return weights) and
  appends per-(layer, slot) tensors to state["_attn_weights"]. Gated so
  no overhead in production.

Analysis tooling:
- scripts/probe_attention.py: load checkpoint + env, run a probed
  rollout, dump (layer, head, query_step, key_position) attention
  tensors as npz.
- scripts/visualize_attention.py: render heatmaps and a cross-scenario
  attention-mass time series from probe outputs.
- scripts/counterfactual_cache.py: paired rollouts with cache preserved
  vs zeroed at scenario boundary; reports paired s_1 lift and
  conditional recovery P(succ s_1 | fail s_0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In-process curriculum that anneals partner's entropy_weight_ub through
4 stages (5% → 20% → 50% → 100% of the user-passed final value), each
30 episodes per worker. Stash sampled entropy stats per resample and
drain to wandb at next report_interval — gives a per-episode trace of
the partner-conditioning distribution actually fed to the partner
policy.
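
The stage schedule described above can be sketched as a lookup on the per-worker episode counter. The names below are hypothetical, and the sketch assumes the last stage simply persists once it is reached.

```python
STAGE_FRACTIONS = (0.05, 0.20, 0.50, 1.00)  # fraction of the final entropy_weight_ub
EPISODES_PER_STAGE = 30                     # per worker


def entropy_ub_for_episode(episode, final_ub):
    """Return (stage index, entropy upper bound) for a per-worker episode count."""
    stage = min(episode // EPISODES_PER_STAGE, len(STAGE_FRACTIONS) - 1)
    return stage, STAGE_FRACTIONS[stage] * final_ub


print(entropy_ub_for_episode(0, 1.0))    # (0, 0.05)
print(entropy_ub_for_episode(80, 0.5))   # (2, 0.25)
print(entropy_ub_for_episode(500, 1.0))  # (3, 1.0)
```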

Launcher: 4-run ablation (curriculum vs no-curriculum × {final_ub=0.5,
1.0}) using matched partners from the entropy sweep (n48teqjs for
e=1.0, 6rauydj2 for e=0.5).
Curriculum stages cut the ego K/V cache at within-episode scenario
boundaries:
  Stage 0 (k_eff=1): cut every boundary → 4 independent scenarios
  Stage 1 (k_eff=2): cut middle boundary → 2 clean k=2 chunks
  Stage 2 (k_eff=K_max): no cuts → full cross-scenario context

Reset mechanism: at boundaries to cut, drive.py sets
truncations[ego_ids]=1 and terminals[ego_ids]=1. pufferl drops the
cache via done_mask=t+d at eval, and create_episode_mask blocks
cross-boundary attention during training — eval and train see the
same effective context. Trade-off: PPO treats the cut as episode-
end for advantage, slightly under-credits cross-segment value;
effect shrinks to zero at stage 2 (no cuts).

Stage formula: cut at boundary iff current_scenario % k_eff == 0
(within-episode only). With K_max=4 and stages k_eff∈{1,2,4} this
gives uniform splits.
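
The stage formula can be checked with a short sketch. It assumes `current_scenario` is the index of the scenario being entered at a boundary, so within-episode boundaries run from 1 to K_max - 1; the helper names are hypothetical.

```python
def cut_boundaries(k_max, k_eff):
    """Scenario indices at which the ego K/V cache is cut within one episode."""
    return [s for s in range(1, k_max) if s % k_eff == 0]


def chunks(k_max, k_eff):
    """Groups of scenarios that share one uninterrupted cache."""
    edges = [0] + cut_boundaries(k_max, k_eff) + [k_max]
    return [list(range(a, b)) for a, b in zip(edges, edges[1:])]


for k_eff in (1, 2, 4):
    print(k_eff, chunks(4, k_eff))
# 1 [[0], [1], [2], [3]]
# 2 [[0, 1], [2, 3]]
# 4 [[0, 1, 2, 3]]
```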

Launcher: 2 runs at k_max=4 horizon=804, partner n48teqjs at
e_ub=1.0; differs only in k_eff_curriculum_enabled. Sized memory:
mb_mult=50 keeps minibatch_size=40200 (same as k=2/100). Per-run
RAM ≈ 2× k=2 → can't run alongside the active k=2 entropy ablation.
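
The memory sizing above is consistent with a simple product. This sketch assumes minibatch_size = horizon × mb_mult and reads "k=2/100" as k=2 with mb_mult=100 at 201 frames per scenario; both are inferences from the numbers, not confirmed by the code.

```python
def minibatch_size(horizon, mb_mult):
    # Assumed relation, inferred from the figures in the commit message.
    return horizon * mb_mult


frames_per_scenario = 201
k4 = minibatch_size(4 * frames_per_scenario, 50)   # horizon = 804
k2 = minibatch_size(2 * frames_per_scenario, 100)  # horizon = 402
print(k4, k2)  # 40200 40200
```
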
The per-worker episode counter that drives the entropy curriculum is
not part of the model checkpoint, so a naive resume of a curriculum
run restarts the curriculum from stage 0 (entropy_ub = 5% of final)
even if the original run was already in stage 2 or 3. The new kwarg
lets the resume launcher seed the counter to the original run's ending
episode count so the curriculum picks up at the right stage.

Resume launcher (scripts/adaptive/nuplan_transformer_local_k2_201_curriculum_resume.sh)
restores the 3 killed runs from the entropy_ub ablation:
  - exp1 nocurr_e1.0 from epoch 60 (no curriculum)
  - exp2 curr_e0.5  from epoch 80 (curriculum, episodes_start=80
                                    → resumes in stage 2)
  - exp3 nocurr_e0.5 from epoch 130 (no curriculum)
exp0 curr_e1.0 (uufybjgm) finished cleanly and is not resumed.

Default LR is preserved — pufferl already restores optimizer state
and global_step from sibling trainer_state.pt (pufferl.py:280) so
cosine LR resumes mid-schedule, not at peak.
pufferl.evaluate() created a fresh state dict every step in the
rollout loop and pulled `transformer_context` / `transformer_position`
from persistent storage — but NEVER persisted `k_cache` / `v_cache`.
The model wrote them into the local state dict at the end of each
forward_eval, but pufferl threw them away.

Effect: on every rollout step the model saw `state.get("k_cache")
is None`, triggered the `need_alloc` branch, allocated fresh empty
K/V tensors, AND reset `pos` to 0. So every step the transformer
attended only to the current observation; the cache was never used.

Meanwhile the training pass (full-sequence forward) DID apply
attention across all timesteps. So the policy's gradient was
computed assuming it had used past context, but at action-selection
time it never did. Massive train/eval mismatch — the policy weights
were pulled toward "use context" but the policy never actually had
context at decision time.

This silently broke every in-context-learning experiment in the
project (k_eff curriculum, entropy curriculum, oracle, partner-pool
diversity, gamma sweep) — none of them could have ever worked
because the cache they were trying to teach the policy to use was
empty at inference. The eval pipeline (HumanReplayEvaluator),
render rollouts (drive/rollout.py), and analysis scripts
(probe_attention.py) all persist state correctly, so they showed
the policy under "with cache" conditions — but the policy had been
trained against "no cache" rollouts, so its weights couldn't
actually leverage the context they were now being given.

Fix: add `transformer_k_cache` and `transformer_v_cache` persistent
dicts alongside the existing transformer_context/position. Pull them
into per-step state, write back after forward_eval, zero the rows on
episode boundary (so a fresh episode doesn't inherit prior K/V).

Verified with /tmp/test_kv_cache_persistence.py: with persisted
state, pos increments 1→2→3→…, and past slots fill with non-zero
K/V values. Without persistence (the old behavior), pos stays at 1
and past slots are always zero.
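
The failure mode can be reproduced with a toy stand-in for the streaming forward pass. Everything below is hypothetical (plain lists instead of tensors, a made-up `rollout` helper); it only illustrates why dropping `k_cache` between steps pins `pos` at 1.

```python
def forward_eval(obs, state, horizon=4):
    """Toy stand-in for the model's streaming forward; mutates `state`."""
    if state.get("k_cache") is None:      # the need_alloc branch
        state["k_cache"] = [0.0] * horizon
        state["pos"] = 0
    slot = state["pos"] % horizon
    state["k_cache"][slot] = obs          # write the current token
    state["pos"] += 1
    return state["pos"]


def rollout(observations, persist):
    saved = {}
    positions = []
    for obs in observations:
        state = {}                        # fresh per-step dict, as in evaluate()
        if persist:
            state.update(saved)           # pull the persisted cache and position
        positions.append(forward_eval(obs, state))
        if persist:
            saved = {"k_cache": state["k_cache"], "pos": state["pos"]}
    return positions


print(rollout([1.0, 2.0, 3.0], persist=False))  # [1, 1, 1]
print(rollout([1.0, 2.0, 3.0], persist=True))   # [1, 2, 3]
```
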
With the K/V cache now persisted across rollout steps, pos=-1 caused
the first post-reset step to write to slot horizon-1 instead of slot
0 (since (-1) % horizon == horizon-1). The cache zeroing handled the
correctness of attention values, but it left the new episode's first
token in a slot that the rest of the episode would never naturally
overwrite via index_copy_ at slot=pos%horizon. Cleaner to mirror the
need_alloc/first-call branch: pos=0 → slot=0 → write the new
episode's first token to slot 0 like a fresh run would.
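
The slot arithmetic is easy to check in isolation. This sketch assumes the write slot is `pos % horizon` and that `pos` advances by one per step after a reset; the helper name is hypothetical.

```python
def slots_written(reset_pos, steps, horizon):
    """Cache slots written on the first `steps` steps after a reset."""
    return [(reset_pos + t) % horizon for t in range(steps)]


horizon = 8
print(slots_written(-1, 3, horizon))  # [7, 0, 1]
print(slots_written(0, 3, horizon))   # [0, 1, 2]
```
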
Learnable PE (nn.Parameter, init zeros + std=0.02 noise) requires the
model to learn temporal structure from gradients. With our sparse
adaptation reward, that structure was probably never being learned
well — at start of training PE is essentially zero and the model
sees a 'bag of timesteps'.

Sinusoidal PE encodes absolute position via sin/cos at multiple
frequencies. The model has temporal structure available from step
zero of training, no waiting for gradients to teach it.

Implementation: register_buffer (non-trainable, no grad). Same shape
(1, horizon, hidden_size) as before, accessed via the same
get_positional_embedding helper, so both forward (training) and
forward_eval (rollout) paths work unchanged.
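
For reference, a fixed sinusoidal table can be built as in the sketch below. It assumes the standard formulation with base 10000 and interleaved sin/cos channels, which this commit message does not spell out, and it uses plain lists with shape (horizon, hidden_size) rather than the registered (1, horizon, hidden_size) buffer.

```python
import math


def sinusoidal_pe(horizon, hidden_size, base=10000.0):
    """Return a horizon x hidden_size table of fixed sin/cos position codes."""
    table = [[0.0] * hidden_size for _ in range(horizon)]
    for pos in range(horizon):
        for i in range(0, hidden_size, 2):
            # Even channel i and odd channel i + 1 share one frequency.
            angle = pos / (base ** (i / hidden_size))
            table[pos][i] = math.sin(angle)
            if i + 1 < hidden_size:
                table[pos][i + 1] = math.cos(angle)
    return table


pe = sinusoidal_pe(horizon=4, hidden_size=6)
print(pe[0][:2])  # [0.0, 1.0]
```

Because the table depends only on position and channel, it needs no gradient and is identical at step zero of training and at rollout time.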

Note: existing checkpoints with the trained learnable PE param will
need strict=False loading to migrate to this version. Fresh runs
start with the sinusoidal buffer immediately.
Default eval (50 rollouts on full nuplan_201) showed ada_delta_score ~0
across all configs. Inspection of the underlying trajectory data: 54%
of nuplan_201 maps have zero SDC-vehicle interaction, diluting the
average. nuplan_hard is the top 10% (540 maps) by SDC interaction-step
count, defined purely from the recorded human trajectories in
data/nuplan_gpudrive/nuplan/ — no policy involvement.
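
The selection step amounts to ranking maps by a per-map score and keeping the top fraction. The sketch below assumes the scores are already available as a name-to-count mapping; the function name and map names are hypothetical and not taken from score_maps_interaction.py or build_nuplan_hard.py.

```python
def select_hard_maps(interaction_steps, fraction=0.10):
    """Return the map names in the top `fraction` by interaction-step count."""
    k = int(len(interaction_steps) * fraction)
    ranked = sorted(interaction_steps, key=interaction_steps.get, reverse=True)
    return ranked[:k]


# Toy scores: map_i has i interaction steps.
scores = {f"map_{i}": i for i in range(20)}
print(select_hard_maps(scores))  # ['map_19', 'map_18']
```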

On 4lm6kkh7 epoch 40 with 200 rollouts, ada_delta_score moves from
±0.005 (full set) to +0.222 ± 0.18 on nuplan_hard. Collisions in s_1
roughly halve vs s_0; episode_return nearly doubles. The model adapts;
the metric was diluted.

Also adds reward_only_last_scenario flag to Drive: zeros rewards in
scenarios 0..k-2, used to test whether reward shape (rather than
architecture) is the bottleneck for cross-scenario adaptation.
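
The effect of the flag can be sketched on a flat per-step reward list. This assumes equal-length scenarios laid out back to back within an episode; the function and argument names are hypothetical.

```python
def keep_last_scenario_rewards(rewards, scenario_length, k):
    """Zero the rewards of scenarios 0..k-2 and keep those of scenario k-1."""
    masked = []
    for t, r in enumerate(rewards):
        scenario = t // scenario_length
        masked.append(r if scenario == k - 1 else 0.0)
    return masked


print(keep_last_scenario_rewards([1.0] * 6, scenario_length=2, k=3))
# [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
```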

Files:
- scripts/score_maps_interaction.py (compute hardness per map)
- scripts/build_nuplan_hard.py     (symlink top-K maps into a new dir)
- scripts/adaptive/nuplan_transformer_local_k2_201_lastscen.sh
- pufferlib/ocean/drive/drive.py   (+reward_only_last_scenario)
- pufferlib/config/ocean/adaptive.ini (expose new kwarg)
- notes/nuplan_hard.md             (recreation + eval recipe + caveats)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- partner_sweep: 5 ego runs against the entropy-conditioned partners
  (miku2puk, 2e029h15, m2ygolog, 6rauydj2, n48teqjs) at γ=0.995,
  lane_align=0.025, eval against nuplan_hard. Holds everything else
  fixed so ada_delta vs partner-entropy is the only varying dimension.
- gamma_sweep: γ ∈ {0.99, 0.995, 0.999} × 2 partners. The 0.995 winner
  (4lm6kkh7) is the basis for all later runs.
- adam_test, lr_test, prio_test, prio_clip_test, prio_ent_test:
  ablation launchers for optimizer/LR/priority sampling/clip-coef
  investigations into the high clipfrac.
- oracle_g0995: oracle-conditioned ego at γ=0.995 (oracle is incompatible
  with human_replay eval, kept for completeness).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>