Open
41 commits
8237688
M1: Remove positional embeddings (NoPE)
mohitmk01 May 4, 2026
8899b40
M1b: Restore learnable PE + per-episode reset for multi-episode rollouts
mohitmk01 May 4, 2026
09b92b3
Switch positional embedding from learnable to sinusoidal
mohitmk01 May 4, 2026
a7e3840
M2: Add trial_ended_this_step buffer end-to-end (Python-owned, zero-c…
mohitmk01 May 4, 2026
fbaf4d8
M3: GOAL_TRIAL (goal_behavior=3) — variable-length trials per episode
mohitmk01 May 4, 2026
d9641eb
M4: per-trial Log fields (n_trials_*, trial_mean_length, trial_goal_r…
mohitmk01 May 4, 2026
3b9b060
M5: HumanReplayEvaluator trial mode (goal_behavior=3)
mohitmk01 May 4, 2026
ea55e49
M6: Forward goal_behavior + trial config to subprocess eval; INI defa…
mohitmk01 May 4, 2026
0204d23
M7: GOAL_TRIAL Option D (idle-after-max_trials) + GAE/test suite
mohitmk01 May 15, 2026
70c703a
M7-fix: clear respawn_timestep after GOAL_TRIAL mid-episode respawn
mohitmk01 May 15, 2026
c081ced
M7-fix: evaluator picks up auto-linked max_trials under gb=3 + k_scen…
mohitmk01 May 15, 2026
157ad5c
Docs + render contract tests for trial mode
mohitmk01 May 15, 2026
6ade130
gb=3: emit trial_K_score + ada_delta_trial_K_minus_0 during training
mohitmk01 May 15, 2026
b6c2ad5
gb=3: Option A — k_scenarios IS n_trials, scenario_length IS per_tria…
mohitmk01 May 15, 2026
3e748d5
gb=3: C owns truncations + trial_ended_this_step (single-writer)
mohitmk01 May 15, 2026
a6c2aae
Comment cleanup pass on trial-mode files
mohitmk01 May 15, 2026
34bee5f
tests: document contract vs regression split
mohitmk01 May 15, 2026
20c25cd
gb=3 B'': env-level trials, off-map on reach, sync world reset
mohitmk01 May 15, 2026
49b7922
Add debug demo script for B'' env-level trial semantic
mohitmk01 May 15, 2026
30efe78
gb=3 B'': fix trial overlay reading wrong counter + revert move_exper…
mohitmk01 May 15, 2026
34af649
gb=3 B'': reset humans to frame 0 at env trial-end (re-applied)
mohitmk01 May 15, 2026
16ca6b3
render: 30fps → 15fps for ffmpeg encoding
mohitmk01 May 15, 2026
bfc094f
gb=3 B'': strict trial equivalence — set_start_position at trial-end …
mohitmk01 May 15, 2026
bf8fee9
Add per-step text tracer for B'' rollouts
mohitmk01 May 15, 2026
29084a6
Address ultrareview nits: k_scenarios>8 assert + removed docstring
mohitmk01 May 15, 2026
808edca
gb=3 B'': pufferl KV-cache freeze for off-map (removed=1) agents
mohitmk01 May 15, 2026
ee689cd
M7: SHM-back `removed` so KV-cache freeze works in multi-worker training
mohitmk01 May 15, 2026
e396507
gb=3 train/eval parity: thread `removed` into training attention + PP…
mohitmk01 May 15, 2026
69e08b3
gb=3 train mask: leave diagonal open to prevent NaN in all-limbo rows
mohitmk01 May 15, 2026
98610ed
gb=3 PPO: gate value/ratio writebacks + bootstrap_stop by `removed`
mohitmk01 May 15, 2026
b0034fe
gb=3: fix co-player goals leaking into ego aggregate + cache co-playe…
mohitmk01 May 15, 2026
6deb6a4
fix: entropy shape mismatch in PPO loss-gating
mohitmk01 May 15, 2026
06ab40d
add: inspection tools + sweep launchers + render fps fix + eval map r…
mohitmk01 May 16, 2026
1c1c8ec
add nuplan_201 hardness scores + build_nuplan_hard reads from repo
mohitmk01 May 16, 2026
ed0d917
add cluster sbatch script for k=4 gb=3 2-partner sweep
mohitmk01 May 16, 2026
822c5e9
cluster k4_gb3 sweep: 4 partners x 3 seeds = 12 array tasks
mohitmk01 May 16, 2026
f34a87b
add cluster_smoke_test.sh: single-partner nw=4 sanity check
mohitmk01 May 16, 2026
a379013
add 4 co-player partner checkpoints + bump local k4 nw 8→10
mohitmk01 May 16, 2026
a82a006
smoke: shrink to mb_mult=4 (matches first working 50M smoke)
mohitmk01 May 16, 2026
de06161
perf: batched co-player forward across workers (#2)
mohitmk01 May 16, 2026
b72cc1a
perf: eliminate forward_eval graph breaks (Path A)
mohitmk01 May 16, 2026
4 changes: 4 additions & 0 deletions docs/src/SUMMARY.md
@@ -20,6 +20,10 @@
- [Evaluation overview](evaluation.md)
- [WOSAC](wosac.md)

# Design

- [Trial mode (`goal_behavior=3`)](trial_mode.md)

# Blog

- [PufferDrive 2.0 release](pufferdrive-2.0.md)
352 changes: 352 additions & 0 deletions docs/src/trial_mode.md
@@ -0,0 +1,352 @@
# Trial Mode (`goal_behavior=3`)

Design and contract for the in-context adaptation training mode.

## Why

The adaptive ego is a Transformer with a KV cache. We want it to **adapt
across attempts within a single fixed-budget episode** — i.e., use what it
saw in trial 1 to do better in trial 2, etc. That requires:

1. Multiple goal-reach attempts ("trials") inside one episode.
2. KV cache that **persists across trial boundaries** (so context is
preserved) but **resets at episode boundaries** (so episodes are i.i.d.).
3. PPO/GAE that **stops bootstrap at trial boundaries** (because the
agent's value at $t+1$ is computed post-respawn, from a different
state, and bootstrapping it into the last step of the old trial
contaminates the target).

These three things — cache reset, GAE bootstrap-stop, episode-vs-trial
distinction — have different gates. The next sections specify each.

## Terms

| Term | Meaning |
|---|---|
| **Trial** | One goal-reach attempt. Ends on goal-reach OR `per_trial_timeout` ticks. |
| **Episode** | A sequence of at most `max_trials_per_episode` trials, sharing a single KV cache. |
| **Scenario** | A map. Under `goal_behavior=3`, each episode runs on **one** map (no per-trial map swap). |
| **`terminals[t]`** | 1 ⇔ the *episode* ended at step $t$. Used for both **cache reset** and **GAE bootstrap-stop**. |
| **`truncations[t]`** | 1 ⇔ a *trial* ended at step $t$. Mirrored from `trial_ended_this_step`, so it also fires on the last trial of an episode, alongside `terminals`. Used **only for GAE bootstrap-stop**; it never resets the cache. |
| **`trial_ended_this_step[i]`** | Per-agent C-side flag, set every trial boundary (goal-reach or timeout). Mirrored to `truncations` by Python. |
| **Cache reset** | Zero out the Transformer's K/V tensors. Done at episode boundary only. |
| **GAE bootstrap-stop** | Setting $(1-\text{stop}_t) = 0$ in the GAE recursion to prevent $V_{t+1}$ contamination across the boundary. |

## The two-boundary problem

Standard PPO has one boundary signal (`dones`). We need two, because
cache reset and GAE bootstrap-stop must fire at different boundaries: the
bootstrap stops at every trial end, while the cache resets only at episode end:

| Event | `terminals` | `truncations` | KV cache | GAE bootstrap |
|---|:-:|:-:|:-:|:-:|
| Within-trial step | 0 | 0 | continues | continues |
| **Trial end** (goal or timeout), more trials to go | 0 | **1** | **continues** | **stops** |
| **Episode end** (last trial done) | **1** | **1** | **resets** | stops |
| Scenario boundary (gb≠3 only) | 0 | 0 | n/a here | n/a |

Mnemonic: **`terminals` ⇒ cache reset; (`terminals` OR `truncations`) ⇒ bootstrap stop.**

## State machine — per agent, per `c_step`

```
                       ┌──────────────────────┐
                       │  agent.removed == 1  │ ───── skip (Option D)
                       └──────────┬───────────┘
                                  │ no
  ┌─────── trial_ended? ──────────┴───── neither ───→ continue trial
  │   (reached || timed_out)
  │
  │   trial_ended_this_step[i] = 1
  │   trial_count++
  │
  ├── trial_count >= max_trials_per_episode ───→ EPISODE END
  │                                                  │
  │                                                  ▼
  │       terminals[i] = 1               // Python: cache reset HERE
  │       add_log_one_agent(env, i)      // flush this agent's metrics
  │       agent.removed = 1              // Option D: idle
  │       agent.x, agent.y = INVALID     // off-grid
  │
  └── otherwise ───────────────────────────────→ TRIAL END
          respawn_agent(env, i)                        // back to start
          agent.respawn_timestep = -1                  // clear ghost flag (see "Render gates")
          agent.trial_start_timestep = env->timestep
```

The Python side mirrors `trial_ended_this_step → truncations` after every
`vec_step`. So:

* Trial-end branch → C sets `trial_ended_this_step[i] = 1`. Python sets
`truncations[i] = 1`. `terminals[i] = 0` (it was zeroed at the top of
`step`).
* Episode-end branch → C sets BOTH `trial_ended_this_step[i] = 1` AND
`terminals[i] = 1`. Python sets `truncations[i] = 1`.

Both signals fire at the last trial end. That's intentional — the cache
reset gate (terminals) and the bootstrap-stop gate (terminals OR
truncations) both want to fire there.
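
The state machine above can be sketched as a small Python model. This is an illustration only, not the C implementation in `drive.h`: the `Agent` dataclass, the `step_agent` helper, and the exact timeout comparison (`>=` against ticks since `trial_start_timestep`) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical stand-in for the per-agent fields named in the diagram."""
    trial_count: int = 0
    trial_start_timestep: int = 0
    removed: bool = False


def step_agent(agent, timestep, reached, max_trials, per_trial_timeout):
    """Return (trial_ended_this_step, terminal) for one agent at one tick."""
    if agent.removed:
        return 0, 0  # Option D: idle agents are skipped entirely
    # Assumed timeout rule: ticks elapsed since the trial started.
    timed_out = timestep - agent.trial_start_timestep >= per_trial_timeout
    if not (reached or timed_out):
        return 0, 0  # trial continues
    agent.trial_count += 1
    if agent.trial_count >= max_trials:
        agent.removed = True  # episode end: both flags fire, agent idles
        return 1, 1
    agent.trial_start_timestep = timestep  # trial end: respawn, restart timer
    return 1, 0


agent = Agent()
flags = []
for t in range(1, 10):
    # The agent reaches its goal only at tick 2; later trials time out.
    flags.append(step_agent(agent, t, reached=(t == 2),
                            max_trials=3, per_trial_timeout=3))
```

With three trials and a three-tick timeout, the trial-end flag fires alone at ticks 2 and 5, both flags fire together at tick 8, and tick 9 is skipped because the agent is idle.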

## PPO / GAE formulation

Standard GAE (Schulman et al. 2016) with a single `done` signal:

$$
\delta_t = r_t + \gamma\,(1-d_t)\,V_{t+1} - V_t
$$

$$
\hat A_t = \delta_t + \gamma\lambda\,(1-d_t)\,\hat A_{t+1}
$$

In vanilla PPO, $d_t = \text{terminals}_t$. The $(1-d_t)$ factors zero out
the $V_{t+1}$ bootstrap and the recursive advantage at episode boundaries
(where state $t+1$ is a fresh env reset — no semantic relation to state
$t$).

**Trial-mode modification.** At every trial boundary (not just episode
boundary), state $t+1$ is the post-respawn state — back at the trajectory
start position with reset velocity. $V_{t+1}$ from that state is **not** a
valid bootstrap for state $t$ (the last step of the old trial, somewhere
else in the map). We define:

$$
\text{bootstrap\_stop}_t \;=\; \min\!\bigl(\text{terminals}_t + \text{truncations}_t,\; 1\bigr)
$$

and replace $d_t$ in BOTH GAE equations:

$$
\delta_t = r_t + \gamma\,(1-\text{bootstrap\_stop}_t)\,V_{t+1} - V_t
$$

$$
\hat A_t = \delta_t + \gamma\lambda\,(1-\text{bootstrap\_stop}_t)\,\hat A_{t+1}
$$

This is `pufferlib/pufferl.py`'s `bootstrap_stop = (self.terminals + self.truncations).clamp(max=1.0)`.

The KV cache reset is **independent**: it gates on `terminals` alone, NOT
on `bootstrap_stop`. Otherwise we'd lose the cross-trial context that is
the entire point of trial mode.

```
PPO/GAE bootstrap-stop KV cache reset
───────────────────── ──────────────
trial boundary YES (truncations[t]=1) no ← preserves context
episode boundary YES (terminals[t]=1) YES (fresh i.i.d. start)
```
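
A minimal NumPy sketch of the modified recursion, for a single agent's rollout. It is not the code in `pufferlib/pufferl.py`; the function name, argument layout, and default `gamma` / `lam` values are assumptions chosen for illustration.

```python
import numpy as np


def gae_with_bootstrap_stop(rewards, values, last_value, terminals, truncations,
                            gamma=0.99, lam=0.95):
    """GAE over one trajectory, stopping the bootstrap at trial AND episode ends."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    stop = np.minimum(np.asarray(terminals, dtype=float)
                      + np.asarray(truncations, dtype=float), 1.0)
    adv = np.zeros(len(rewards))
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(len(rewards))):
        keep = 1.0 - stop[t]  # 0 at any boundary, 1 inside a trial
        delta = rewards[t] + gamma * keep * next_value - values[t]
        adv[t] = delta + gamma * lam * keep * next_adv
        next_value, next_adv = values[t], adv[t]
    return adv


# Trial boundary at t=1 (truncation), episode boundary at t=3 (terminal).
adv = gae_with_bootstrap_stop(
    rewards=[1.0, 1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5, 0.5],
    last_value=123.0,  # ignored: the final step is a terminal
    terminals=[0, 0, 0, 1],
    truncations=[0, 1, 0, 0],
)
```

At both boundary steps the advantage collapses to $r_t - V_t = 0.5$: neither $V_{t+1}$ nor $\hat A_{t+1}$ leaks across the respawn.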

## Cache reset gate (`pufferl.py`)

```python
done_mask = d  # terminals only; was d + t, which also wiped the cache at trial ends
self.transformer_context[done_mask.bool()] = 0
```

If we used `d + t`, every trial boundary would wipe the cache — exactly
the opposite of what we want. Trial mode breaks without this fix.
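
To make the difference concrete, here is a toy comparison of the two gates on a NumPy array standing in for the per-agent context. The `(agents, features)` shape and the `reset_cache` helper are assumptions; the real `transformer_context` tensor in `pufferl.py` may be laid out differently.

```python
import numpy as np


def reset_cache(context, done_mask):
    """Return a copy of `context` with the rows selected by `done_mask` zeroed."""
    context = context.copy()
    context[done_mask.astype(bool)] = 0
    return context


context = np.ones((3, 2))          # three agents, all with non-empty context
terminals = np.array([0, 0, 1])    # agent 2: episode end
truncations = np.array([0, 1, 1])  # agents 1 and 2: a trial ended this step

correct = reset_cache(context, terminals)               # gate on d
buggy = reset_cache(context, terminals + truncations)   # gate on d + t
```

Agent 1 only finished a trial: under the `d` gate its context survives, while under the `d + t` gate it is wiped along with agent 2's.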

## Option D — idle-after-max_trials

Naïve trial mode would reset the env and start a new episode as soon as
the agent completes `max_trials_per_episode` trials. Because Python only
swaps maps once every `resample_frequency` ticks, this produces
**many short episodes on the same map**: the agent overfits to a tiny
subset of maps within a single resample window, and gradient updates see
the same map repeatedly.

Option D fixes this by **idling the agent after its episode ends**:

```c
if (e->trial_count >= env->max_trials_per_episode) {
    env->terminals[i] = 1;
    add_log_one_agent(env, i);
    e->removed = 1;
    e->x = e->y = INVALID_POSITION;  // off-grid
    e->vx = e->vy = 0.0f;
    // do NOT call c_reset
}
```

The agent is invisible to subsequent `c_step`s (the top-of-loop
`if (e->removed) continue;` gates it out). It stays idle until Python's
`_reinit_envs_with_new_maps` fires at the next `resample_frequency`
boundary — that's when the env loads a fresh map and `c_reset` resets
`removed = 0`.

**Net effect**: 1 episode per resample window, exactly one fresh map per
episode. Map diversity restored.

## Trial parameter naming

Under `goal_behavior=3`:

* `k_scenarios` IS the number of trials per episode.
* `scenario_length` IS the per-trial timeout.

These are the canonical names — the only two knobs you set. The C side
exposes internal fields named `max_trials_per_episode` and
`per_trial_timeout` (legacy: shared with non-trial code paths), and
`AdaptiveDrivingAgent.__init__` unconditionally sets them from
`k_scenarios` / `scenario_length` under gb=3. **There is no override.**
If you want a different trial count, change `k_scenarios`.

`resample_frequency` is also derived: $k \times L$ (with $k$ =
`k_scenarios` and $L$ = `scenario_length`), the worst-case episode budget.

So `--env.k-scenarios 4 --env.goal-behavior 3 --env.scenario-length 201`
gives 4 trials of 201 ticks each, episode budget = 804 ticks, resample
at tick 804. (Pre-Option-A there were two more CLI flags
`--env.max-trials-per-episode` and `--env.per-trial-timeout`; both gone.)
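
The auto-link amounts to the following arithmetic. The helper `derive_trial_config` is hypothetical and only mirrors the rules stated above; the actual assignment happens in `AdaptiveDrivingAgent.__init__`.

```python
def derive_trial_config(k_scenarios, scenario_length):
    """Derive the internal trial fields from the two user-facing knobs (gb=3)."""
    return {
        "max_trials_per_episode": k_scenarios,
        "per_trial_timeout": scenario_length,
        # Worst case: every trial runs to its timeout.
        "resample_frequency": k_scenarios * scenario_length,
    }


config = derive_trial_config(k_scenarios=4, scenario_length=201)
```

For the example flags above this yields 4 trials, a 201-tick timeout, and a resample at tick 804.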

## End-to-end signal flow

```
┌────────────────────┐
│  C (drive.h)       │  trial_count++, trial_ended_this_step[i] = 1
│  c_step trial loop │  ── if last trial: terminals[i] = 1, removed = 1
└─────────┬──────────┘
          │  zero-copy NumPy view of trial_ended_this_step (1D u8)
┌────────────────────┐
│  Python (drive.py) │  truncations[:] = 0                      # top of step
│  step()            │  vec_step(c_envs)
│                    │  truncations[trial_ended_this_step] = 1  # mirror, gb=3 only
│                    │  terminals already set by C if episode end
└─────────┬──────────┘
          │  PufferLib SHM (np.bool views)
┌────────────────────┐
│  pufferl.py        │  rollout buffers store BOTH d_t and t_t
│  rollout + GAE     │  done_mask = d                           # cache-reset gate
│                    │  bootstrap = (d + t).clamp(max=1)
│                    │  δ_t = r_t + γ(1-bootstrap_t) V_{t+1} - V_t
│                    │  Â_t = δ_t + γλ(1-bootstrap_t) Â_{t+1}
└─────────┬──────────┘
          PPO update
```
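
The Python-side mirror step can be sketched with a `bytearray` standing in for the C-owned buffer. `np.frombuffer` gives a zero-copy view, so writes on the "C" side are visible without copying. The buffer size and the `mirror` helper are assumptions for illustration; the real plumbing lives in `drive.py`, where zeroing and mirroring happen before and after `vec_step` respectively.

```python
import numpy as np

num_agents = 4
c_buffer = bytearray(num_agents)  # stand-in for C's trial_ended_this_step
trial_ended_this_step = np.frombuffer(c_buffer, dtype=np.uint8)  # zero-copy view
truncations = np.zeros(num_agents, dtype=bool)


def mirror(truncations, trial_ended_this_step):
    """Zero truncations, then copy the per-agent trial-end flags into them."""
    truncations[:] = False
    truncations[trial_ended_this_step.astype(bool)] = True


c_buffer[2] = 1  # the "C side" marks a trial end for agent 2
mirror(truncations, trial_ended_this_step)
```

After the call only agent 2 has `truncations` set, and the view reflects the write to `c_buffer` without any explicit copy.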

## Score semantic

Standard non-trial modes set `score = 1` if the agent reaches goal "well
enough" in a single scenario (fraction of goals reached above a threshold,
and no collisions). Under trial mode each episode has `max_trials`
attempts, so:

* `goals_reached_this_episode` ∈ {0, 1, …, max_trials} (one increment per
successful trial; gated by `current_goal_reached` to prevent
over-counting within a trial)
* `frac = goals_reached_this_episode / max_trials_per_episode`
* threshold $\tau$ ladder by $k$:
* $k = 2$: $\tau = 0.5$ (both trials must succeed for $\text{frac} > 0.5$)
* $k \in \{3, 4\}$: $\tau = 0.8$
* $k \geq 5$: $\tau = 0.9$
* `score = 1` iff `frac > τ AND !collided_in_episode`
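
The rules above can be written out as a short function. The names `score_threshold` and `trial_score` are hypothetical, and raising for $k < 2$ is an assumption, since the ladder as stated does not cover that case.

```python
def score_threshold(k):
    """Threshold ladder from the list above; unspecified for k < 2."""
    if k == 2:
        return 0.5
    if k in (3, 4):
        return 0.8
    if k >= 5:
        return 0.9
    raise ValueError("threshold ladder is only specified for k >= 2")


def trial_score(goals_reached, max_trials, collided):
    """Return 1 iff frac > tau and the agent never collided in the episode."""
    frac = goals_reached / max_trials
    return int(frac > score_threshold(max_trials) and not collided)


scores = {
    "k2_one_of_two": trial_score(1, 2, collided=False),
    "k4_three_of_four": trial_score(3, 4, collided=False),
    "k4_all_four": trial_score(4, 4, collided=False),
    "k4_all_four_collided": trial_score(4, 4, collided=True),
}
```

Because the comparison is strict, three out of four successes (`frac = 0.75`) does not clear $\tau = 0.8$. Working through the ladder, every $k$ from 2 to 10 requires all trials to succeed; the first $k$ that tolerates one failed trial is 11, where $10/11 > 0.9$.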

## Render gates

`respawn_agent()` is shared between `goal_behavior=0` (RESPAWN, with
intentional ghost-fade post-respawn) and the trial-mode mid-episode
respawn (no ghost — agent should be fully visible immediately for
trial 2..K). The function sets `respawn_timestep = env->timestep`, and
**seven** downstream gates use `respawn_timestep != -1` as a "ghosted"
marker:

| Location | Effect when active |
|---|---|
| `drive.h:1327` | Skip self-side collision check |
| `drive.h:1342` | Skip other-as-target collision check |
| `drive.h:2409` | Force `obs[6] = 1` (post-respawn flag) |
| `drive.h:2455` | Other-car obs (self ghosted) zeroed |
| `drive.h:2457` | Other-car obs (other ghosted) zeroed |
| `drive.h:3482` | Skip 3D mesh draw (visible symptom) |
| `drive.h:3688` | Skip WOSAC track-index overlay |

In trial mode, after `respawn_agent` in the mid-episode branch we
**must** clear the flag immediately:

```c
respawn_agent(env, agent_idx);
e->respawn_timestep = -1; // GOAL_TRIAL is NOT a ghost-fade mode
e->trial_start_timestep = env->timestep;
```

Pre-fix symptom: trial 1 rendered correctly, trials 2..K appeared empty.

## Per-trial metrics

`add_log_one_agent` (in C) is called when an agent's episode ends. It
aggregates that single agent's metrics into `env->log` (the vec_log
sink), then resets all per-entity state the agent's next episode would
otherwise inherit (respawn_timestep, current_goal_reached, the
`metrics_array` slots, etc.).

New per-trial log fields, in addition to the standard episode metrics:

| Field | Meaning |
|---|---|
| `n_trials_completed` | Trials finished this episode (always equals `max_trials` for ego under Option D) |
| `n_trials_goal_reached` | Of those, how many reached goal |
| `n_trials_timed_out` | Of those, how many timed out |
| `trial_total_length` | Sum of trial lengths (ticks) |
| `trial_mean_length` | `trial_total_length / n_trials_completed` (computed in `add_log`) |
| `trial_goal_reach_rate` | `n_trials_goal_reached / n_trials_completed` |

These are populated **only** under `goal_behavior=3`. The standard
metrics (score, collision_rate, episode_length, …) still populate via
the same `add_log_one_agent` path.

The evaluator (`HumanReplayEvaluator`) computes additional per-trial
breakdowns from its own success array:

| Field | Definition |
|---|---|
| `trial_K_score` | $\Pr$(reached in trial $K$) over the eval rollouts |
| `ada_delta_trial_K_minus_0` | `trial_K_score - trial_0_score` (the in-context adaptation signal) |

Here $K$ is a zero-based trial index. With `max_trials_per_episode` = 4
(auto-linked from `k_scenarios=4`), that gives `trial_0_score`, …,
`trial_3_score` and `ada_delta_trial_{1,2,3}_minus_0`.
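
As a rough sketch of how these fields follow from a success array, assuming a `(n_rollouts, n_trials)` layout of 0/1 entries (the actual layout inside `HumanReplayEvaluator` is not specified here, and `trial_metrics` is a hypothetical helper):

```python
import numpy as np


def trial_metrics(success):
    """Per-trial success rates and their deltas against trial 0."""
    success = np.asarray(success, dtype=float)
    per_trial = success.mean(axis=0)  # average over rollouts
    metrics = {f"trial_{k}_score": float(per_trial[k])
               for k in range(len(per_trial))}
    for k in range(1, len(per_trial)):
        metrics[f"ada_delta_trial_{k}_minus_0"] = float(per_trial[k] - per_trial[0])
    return metrics


# Made-up data: four rollouts (rows) of four trials each (columns).
metrics = trial_metrics([
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])
```

In this made-up example the per-trial scores climb from 0.25 to 1.0, so `ada_delta_trial_3_minus_0` is 0.75; a positive delta is the signature of in-context adaptation.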

## Test coverage

| File | What it covers |
|---|---|
| `test_goal_trial.py` | Trial timer fires; episode boundary fires at `trial_count == max_trials`; non-regression for gb∈{0,1,2} |
| `test_trial_ended_buffer.py` | `trial_ended_this_step` Python ↔ C buffer plumbing |
| `test_trial_log_fields.py` | Per-trial Log fields populate |
| `test_trial_standard_metrics.py` | Standard episode metrics still populate via `add_log_one_agent` |
| `test_trial_per_scenario_gate.py` | Per-scenario logic gated off under gb=3 |
| `test_trial_score_semantics.py` | Score uses `max_trials_per_episode` denominator |
| `test_trial_overcounting_fix.py` | `current_goal_reached` gates `goals_reached_this_episode` increments |
| `test_gae_trial_boundary.py` | GAE bootstrap-stop fires on truncations |
| `test_gae_decoupling_integration.py` | End-to-end `trial_ended_this_step → truncations` mirror |
| `test_adaptive_trial_link.py` | Auto-link of `max_trials_per_episode` and `per_trial_timeout` |
| `test_rollout_trial_mode.py` | Rollout `max_steps` / break / info under trial mode |
| `test_evaluator_trial_mode.py` | `HumanReplayEvaluator` emits `trial_K_score` + auto-link case |
| `test_pe_train_eval_consistency.py` | Transformer PE indexing matches between train and eval |
| `test_pos_within_episode.py` | `compute_pos_within_episode` correctness |

All 53 tests pass on `mohit/trial-episode-redesign` HEAD.

## Quick reference

```
goal_behavior = 3 # the toggle
k_scenarios = 4 # the number of trials, by auto-link
scenario_length = 201 # nuplan trajectories are 201 ticks
# → per_trial_timeout = 201, by auto-link
# → episode budget = 804 ticks
# → resample_frequency = 804, by auto-link
```

| Knob | Type | Default | Notes |
|---|---|---|---|
| `--env.goal-behavior` | int | 0 | 3 = trial mode |
| `--env.k-scenarios` | int | 1 | Under gb=3: number of trials per episode |
| `--env.scenario-length` | int | 91 | Under gb=3: per-trial timeout (ticks) |
Binary file added experiments/puffer_drive_2e029h15.pt
Binary file not shown.
Binary file added experiments/puffer_drive_6rauydj2.pt
Binary file not shown.
Binary file added experiments/puffer_drive_m2ygolog.pt
Binary file not shown.
Binary file added experiments/puffer_drive_miku2puk.pt
Binary file not shown.
6 changes: 5 additions & 1 deletion pufferlib/config/ocean/adaptive.ini
@@ -52,7 +52,11 @@ reward_vel_align = 1.0
goal_radius = 2.0
; Max target speed in m/s for the agent to maintain towards the goal
goal_speed = 100.0
; What to do when the goal is reached. Options: 0:"respawn", 1:"generate_new_goals", 2:"stop"
; What to do when the goal is reached. Options: 0:"respawn", 1:"generate_new_goals", 2:"stop", 3:"trial"
; Under 3 (trial): k_scenarios = number of trials, scenario_length = per-trial timeout.
; The C side still exposes `max_trials_per_episode` and `per_trial_timeout` for
; tests that want fine-grained control; runtime path overrides them from
; k_scenarios / scenario_length in AdaptiveDrivingAgent.__init__.
goal_behavior = 0
; Determines the target distance to the new goal in the case of goal_behavior = generate_new_goals.
; Large numbers will select a goal point further away from the agent's current position.
3 changes: 2 additions & 1 deletion pufferlib/config/ocean/drive.ini
@@ -49,7 +49,8 @@ reward_vel_align = 1.0
goal_radius = 2.0
; Max target speed in m/s for the agent to maintain towards the goal
goal_speed = 100.0
; What to do when the goal is reached. Options: 0:"respawn", 1:"generate_new_goals", 2:"stop"
; What to do when the goal is reached. Options: 0:"respawn", 1:"generate_new_goals", 2:"stop", 3:"trial"
; Under 3 (trial), k_scenarios = number of trials, scenario_length = per-trial timeout.
goal_behavior = 0
; Determines the target distance to the new goal in the case of goal_behavior = generate_new_goals.
; Large numbers will select a goal point further away from the agent's current position.