47 commits
479b24b
Add periodic safe eval during training with reward conditioning
Mar 8, 2026
14e288a
Move bin file cleanup from render_videos to caller
Mar 8, 2026
c655175
Fix eval subprocess: pass device and handle negative arg values
Mar 9, 2026
a674189
Apply ruff formatting to utils.py
Mar 9, 2026
4517070
Decouple render and safe eval from checkpoint interval
Mar 9, 2026
a97ed98
Simplify safe_eval config: remove redundant prefixes
Mar 9, 2026
1d8f526
Replace SAFE_EVAL_REWARD_BOUNDS list with dynamic config iteration
Mar 9, 2026
d98d553
Discover reward bounds from env config instead of hardcoding
Mar 9, 2026
fa599f0
Fix wosac eval: use correct CLI flag and show more stderr on failure
Mar 9, 2026
f6f8d07
Pass episode_length=91 to wosac eval subprocess to match ground truth…
Mar 9, 2026
1db6914
Add eval_async option to run eval subprocesses in background threads
Mar 9, 2026
3b91abc
Fix ruff formatting for eval_async call sites
Mar 9, 2026
648bd0b
Adjust lane alignment value in drive.ini
eugenevinitsky Mar 10, 2026
e73380d
Update drive.ini configuration parameters
eugenevinitsky Mar 10, 2026
741401b
Fix eval config parameters not being passed to subprocesses
Mar 10, 2026
ea1c7af
Fix safe eval: count episodes not steps, fix wandb async logging
Mar 10, 2026
f6413c2
Fix async render bin file race condition
Mar 11, 2026
1d37f29
Merge 3.0_beta: adopt new WOSAC batched eval API
Mar 11, 2026
0a9277a
Use lexicographic sort for checkpoints, update cluster config for torch
Mar 11, 2026
b3a03dc
Switch cluster account to torch_pr_355_general
Mar 11, 2026
c7563d6
Fix sbatch exclude error: don't pass empty --exclude
Mar 11, 2026
46d74f6
Make all evals async by default, fix human replay --eval.num-maps arg
Mar 13, 2026
5781223
Fix async render cleanup, remove env_config forwarding, remove TIMING…
Mar 13, 2026
77be25e
Forward training map_dir and num_maps to safe eval subprocess
Mar 13, 2026
93ce117
Add human replay video rendering, forward map_dir to safe eval
Mar 13, 2026
d9328e6
Extract _dispatch_render to eliminate triplicated render code
Mar 13, 2026
58961db
Organize wandb metrics into separate tabs per eval type
Mar 13, 2026
a578800
Fix safe eval env_config key, fix wandb non-monotonic step errors
Mar 13, 2026
e2d758d
Add thread-safe wandb logging, restrict evals to rank 0
Mar 13, 2026
e04e0e1
Replace lock-based wandb logging with queue-based approach
Mar 13, 2026
f2df83c
Fix memory leak in shared() map counting loop
Mar 14, 2026
8a1c261
Update evaluation docs: async defaults, human replay eval types
Mar 14, 2026
9f9da0f
Fix bugs: C memory leaks, queue drain, stats mean, missing defaults
Mar 14, 2026
455ab55
Default evals to sync instead of async
Mar 14, 2026
43413b2
Revert Py_DECREF change in shared() — pre-existing, not our bug
Mar 14, 2026
14eaa42
Refactor shared() to use continue pattern instead of if/else
Mar 14, 2026
afb8a33
Remove bare try/except blocks from eval launching code
Mar 14, 2026
b00ec64
Reduce eval thread join timeout from 660s to 10s in close()
Mar 14, 2026
7a77642
Enable human replay eval by default, fix ruff formatting
Mar 15, 2026
2c17312
Make safe eval render match metrics subprocess setup
Mar 15, 2026
0a42f38
Add --scale CLI arg to visualize binary for controlling render resolu…
Mar 16, 2026
8125393
Increase render subprocess timeout to 3600s
Mar 16, 2026
2c49401
Enable async rendering by default
Mar 16, 2026
47caa0c
Randomize agent positions on every respawn in variable agent mode
Mar 16, 2026
4c9bcdb
Fix: sample new goals after randomizing agent positions
Mar 17, 2026
66f9ba3
Fix: don't overwrite sampled goals with stale init_goal in variable a…
Mar 17, 2026
7fcebc2
Fix collision check and velocity restoration in respawn
Mar 17, 2026
76 changes: 73 additions & 3 deletions docs/src/evaluation.md
@@ -2,7 +2,77 @@

Driving is a safety-critical multi-agent application, making careful evaluation and risk assessment essential. Mistakes in the real world are costly, so simulations are used to catch errors before deployment. To support rapid iteration, evaluations should also run efficiently, which is why we paid attention to optimizing their speed. This page gives an overview of the available benchmarks and evals.

## Sanity maps 🐛
## Evaluation during training

PufferDrive supports running evaluations automatically during training. There are six evaluation types that can run periodically:

| Eval type | What it does | CLI flag to enable | Interval flag |
|---|---|---|---|
| **Render** | Records top-down and agent-view videos | `--train.render True` | `--train.render-interval N` |
| **Safe eval render** | Records videos with safe reward conditioning | `--safe-eval.enabled True` | `--safe-eval.interval N` |
| **Safe eval metrics** | Runs policy in subprocess, logs driving metrics | `--safe-eval.enabled True` | `--safe-eval.interval N` |
| **WOSAC realism** | Measures distributional realism (WOSAC benchmark) | `--eval.wosac-realism-eval True` | `--eval.eval-interval N` |
| **Human replay render** | Records videos with policy-controlled SDC + replayed humans | `--eval.human-replay-eval True` | `--eval.eval-interval N` |
| **Human replay metrics** | Logs collision/offroad/completion rates vs human replays | `--eval.human-replay-eval True` | `--eval.eval-interval N` |

All eval types trigger at `epoch % interval == 0`. They require a saved checkpoint, so **`checkpoint-interval` must be <= the smallest eval interval**.
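The interval rule can be sketched as two small predicates (illustrative only — the function names are ours, not the actual PufferDrive scheduling code):

```c
#include <assert.h>
#include <stdbool.h>

// Sketch of the trigger rule described above; not the real scheduler.
static bool eval_due(int epoch, int eval_interval) {
    return eval_interval > 0 && epoch % eval_interval == 0;
}

// With checkpoint_interval <= eval_interval, a checkpoint epoch has always
// occurred by the time an eval fires, so the subprocess has weights to load.
static bool checkpoint_available(int epoch, int checkpoint_interval) {
    return checkpoint_interval > 0 && epoch >= checkpoint_interval;
}
```

For example, with `--train.checkpoint-interval 500` but `--safe-eval.interval 250`, the eval firing at epoch 250 would find no checkpoint yet — hence the constraint above.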

### Example: enable all evals

```bash
puffer train puffer_drive \
--wandb --wandb-project pufferdrive \
--train.checkpoint-interval 250 \
--train.render True --train.render-interval 250 \
--safe-eval.enabled True --safe-eval.interval 250 \
--eval.wosac-realism-eval True \
--eval.human-replay-eval True \
--eval.eval-interval 250
```

### Safe eval

Safe eval measures how well the policy drives when given "safe" reward conditioning values (high penalties for collisions and offroad driving, rewards for lane keeping). It runs in a **separate subprocess** that loads the latest checkpoint, creates a fresh environment, and collects metrics over multiple episodes.

The safe eval subprocess inherits the training environment configuration (map directory, reward bounds, etc.) but overrides a few parameters:

- `num_agents`: Number of agents in the eval environment (default: 64)
- `episode_length`: How long each eval episode runs (default: 1000 steps)
- `num_episodes`: How many episode completions to collect before reporting (default: 100)
- `resample_frequency`: Automatically set to 0 (disabled) so episodes can run to completion

Metrics logged to wandb under `eval/*`:

- `eval/score`, `eval/collision_rate`, `eval/offroad_rate`
- `eval/completion_rate`, `eval/dnf_rate`
- `eval/episode_length`, `eval/episode_return`
- `eval/lane_alignment_rate`, `eval/lane_center_rate`
- And more (see `drive.h` `Log` struct for the full list)

Configure safe eval reward conditioning in `drive.ini` under `[safe_eval]`:

```ini
[safe_eval]
enabled = True
interval = 250
num_agents = 64
num_episodes = 100
episode_length = 1000

; Fixed reward conditioning values (min=max pins the value)
collision = -3.0
offroad = -3.0
overspeed = -1.0
traffic_light = -1.0
lane_align = 0.025
velocity = 0.005
```
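The `min=max` convention works because the environment samples each reward coefficient between its bounds, so equal bounds collapse the range to a single value. A minimal sketch of the idea (hypothetical helper, assuming uniform sampling — the real logic lives in `generate_reward_coefs` in `drive.h`):

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical helper, assuming uniform sampling between the bounds;
// when lo == hi the coefficient is pinned regardless of the RNG.
static float sample_coef(float lo, float hi) {
    if (lo == hi) return lo;  // pinned, e.g. collision = -3.0
    float u = (float)rand() / (float)RAND_MAX;
    return lo + u * (hi - lo);
}
```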

### Async vs sync evaluation

By default, all evals run synchronously (blocking training until they finish). Set `--train.render-async True` to run video renders in separate processes, and `--eval.eval-async True` to run metric evals (safe eval, WOSAC, human replay) in background threads. When async, results are queued and logged to wandb on the main thread during the next training epoch.
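The queued hand-off works like a producer/consumer ring buffer: background eval threads only push results, and the training loop drains and logs on the main thread, so the logger is never called from two threads at once. A minimal C sketch of the pattern (illustrative only — the actual implementation is in the Python training loop, and all names here are ours):

```c
#include <assert.h>
#include <pthread.h>

// Mutex-guarded ring buffer: eval threads push, main thread drains.
#define QCAP 64
typedef struct { char name[32]; double value; int epoch; } EvalResult;
typedef struct {
    EvalResult buf[QCAP];
    int head, tail;            // head == tail means empty
    pthread_mutex_t mu;
} ResultQueue;

static void queue_init(ResultQueue *q) {
    q->head = q->tail = 0;
    pthread_mutex_init(&q->mu, NULL);
}

// Called from an eval thread; returns 0 (drops the result) if full.
static int queue_push(ResultQueue *q, EvalResult r) {
    pthread_mutex_lock(&q->mu);
    int next = (q->tail + 1) % QCAP;
    int ok = next != q->head;
    if (ok) { q->buf[q->tail] = r; q->tail = next; }
    pthread_mutex_unlock(&q->mu);
    return ok;
}

// Called once per training epoch on the main thread; returns count drained.
static int queue_drain(ResultQueue *q, EvalResult *out, int max) {
    pthread_mutex_lock(&q->mu);
    int n = 0;
    while (q->head != q->tail && n < max) {
        out[n++] = q->buf[q->head];
        q->head = (q->head + 1) % QCAP;
    }
    pthread_mutex_unlock(&q->mu);
    return n;
}
```

Draining on the main thread also sidesteps the non-monotonic wandb step errors mentioned in the commit history, since one thread controls the logging order.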

## Sanity maps

Quickly test the training on curated, lightweight scenarios without downloading the full dataset. Each sanity map tests a specific behavior.

@@ -33,7 +103,7 @@ Available maps:

![Sanity map gallery placeholder](images/maps_screenshot.png)

## Distributional realism benchmark 📊
## Distributional realism benchmark (WOSAC)

We provide a PufferDrive implementation of the Waymo Open Sim Agents Challenge (WOSAC) for fast, easy evaluation of how well your trained agent matches distributional properties of human behavior.

Expand All @@ -45,7 +115,7 @@ Add `--load-model-path <path_to_checkpoint>.pt` to score a trained policy, inste

See [the WOSAC benchmark page](wosac.md) for the metric pipeline and all the details.

## Human-compatibility benchmark 🤝
## Human-compatibility benchmark

You may be interested in how compatible your agent is with human partners. For this purpose, we support an eval where your policy only controls the self-driving car (SDC). The rest of the agents in the scene are stepped using the logs. While it is not a perfect eval since the human partners here are static, it will still give you a sense of how closely aligned your agent's behavior is to how people drive. You can run it like this:

50 changes: 46 additions & 4 deletions pufferlib/config/ocean/drive.ini
@@ -158,7 +158,7 @@ vtrace_rho_clip = 1
checkpoint_interval = 1000
; Rendering options
render = True
render_async = False # Render interval of below 50 might cause process starvation and slowness in training
render_async = True
render_interval = 1000
; If True, show exactly what the agent sees in agent observation
obs_only = True
@@ -175,6 +175,8 @@ render_map = none

[eval]
eval_interval = 1000
; If True, run eval subprocesses (wosac, human replay, safe eval metrics) in background threads
eval_async = False # Run eval subprocesses (wosac, human replay, safe eval metrics) in background threads
; Path to dataset used for evaluation
map_dir = "resources/drive/binaries/training"
; Number of scenarios to process per batch
@@ -204,14 +206,54 @@ wosac_goal_radius = 2.0
wosac_sanity_check = False
; Only return aggregate results across all scenes
wosac_aggregate_results = True
; Episode length for WOSAC eval (ground truth logs are 9.1s at 10Hz = 91 steps)
wosac_episode_length = 91
; Evaluation mode: "policy", "ground_truth"
wosac_eval_mode = "policy"
; If True, enable human replay evaluation (pair policy-controlled agent with human replays)
human_replay_eval = False
human_replay_eval = True
; Number of agents for human replay evaluation
human_replay_num_agents = 16
; Control only the self-driving car
human_replay_control_mode = "control_sdc_only"
; Number of scenarios for human replay evaluation equals the number of agents
human_replay_num_agents = 16

[safe_eval]
; If True, periodically run policy with safe/law-abiding reward conditioning and log videos + metrics
enabled = True
; How often to run safe eval (in training epochs). Defaults to render_interval.
interval = 250
; Number of agents to run in the eval environment
num_agents = 64
; Number of episodes to collect metrics over
num_episodes = 100
; Episode length in steps
episode_length = 1000
min_goal_distance = 0.5
max_goal_distance = 1000.0

; Reward conditioning values (min=max to fix the value).
; Names match the env reward_bound_* keys.
; High penalties for unsafe behavior
collision = -3.0
offroad = -3.0
overspeed = -1.0
traffic_light = -1.0
reverse = -0.0075
comfort = -0.1

; Standard driving rewards
goal_radius = 2.0
lane_align = 0.025
lane_center = -0.00075
velocity = 0.005
center_bias = 0.0
vel_align = 1.0
timestep = -0.00005

; Neutral scaling factors
throttle = 1.0
steer = 1.0
acc = 1.0

[render]
; Mode to render a bunch of maps with a given policy
33 changes: 25 additions & 8 deletions pufferlib/ocean/drive/binding.c
@@ -245,6 +245,7 @@ static PyObject *my_shared(PyObject *self, PyObject *args, PyObject *kwargs) {
free(env->agents);
free(env->road_elements);
free(env->road_scenario_ids);
free(env->tracks_to_predict_indices);
free(env->active_agent_indices);
free(env->static_agent_indices);
free(env->expert_static_agent_indices);
@@ -257,14 +258,6 @@
return NULL;
}

// Store map_id
PyObject *map_id_obj = PyLong_FromLong(map_id);
PyList_SetItem(map_ids, env_count, map_id_obj);
// Store agent offset
PyObject *offset = PyLong_FromLong(total_agent_count);
PyList_SetItem(agent_offsets, env_count, offset);
total_agent_count += env->active_agent_count;
env_count++;
for (int j = 0; j < env->num_objects; j++) {
free_agent(&env->agents[j]);
}
@@ -274,12 +267,36 @@
free(env->agents);
free(env->road_elements);
free(env->road_scenario_ids);
free(env->tracks_to_predict_indices);
free(env->active_agent_indices);
free(env->static_agent_indices);
free(env->expert_static_agent_indices);
free(env);
continue;
}

// Map has active agents — record it
PyObject *map_id_obj = PyLong_FromLong(map_id);
PyList_SetItem(map_ids, env_count, map_id_obj);
PyObject *offset = PyLong_FromLong(total_agent_count);
PyList_SetItem(agent_offsets, env_count, offset);
total_agent_count += env->active_agent_count;
env_count++;

for (int j = 0; j < env->num_objects; j++) {
free_agent(&env->agents[j]);
}
for (int j = 0; j < env->num_roads; j++) {
free_road_element(&env->road_elements[j]);
}
free(env->agents);
free(env->road_elements);
free(env->road_scenario_ids);
free(env->tracks_to_predict_indices);
free(env->active_agent_indices);
free(env->static_agent_indices);
free(env->expert_static_agent_indices);
free(env);
}
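The same seven `free()` calls now appear in three exit paths of `my_shared()`. A sketch of how that cleanup could be centralized (the struct here is a reduced stand-in, not the real `Drive`; nulling the pointers makes accidental reuse fail loudly instead of corrupting memory):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Reduced stand-in for the Drive struct, using field names from the diff.
typedef struct {
    void *agents, *road_elements, *road_scenario_ids;
    void *tracks_to_predict_indices, *active_agent_indices;
    void *static_agent_indices, *expert_static_agent_indices;
} EnvBuffers;

// One helper instead of three copies; pointers are nulled after freeing.
static void free_env_buffers(EnvBuffers *env) {
    free(env->agents);                      env->agents = NULL;
    free(env->road_elements);               env->road_elements = NULL;
    free(env->road_scenario_ids);           env->road_scenario_ids = NULL;
    free(env->tracks_to_predict_indices);   env->tracks_to_predict_indices = NULL;
    free(env->active_agent_indices);        env->active_agent_indices = NULL;
    free(env->static_agent_indices);        env->static_agent_indices = NULL;
    free(env->expert_static_agent_indices); env->expert_static_agent_indices = NULL;
}
```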

if (total_agent_count >= num_agents) {
122 changes: 111 additions & 11 deletions pufferlib/ocean/drive/drive.h
@@ -2634,16 +2634,103 @@ void compute_observations(Drive *env) {
}
}

// Find a random collision-free position on a drivable lane for an existing agent.
// Returns true if a valid position was found and updates the agent's sim_x/y/z/heading.
static bool randomize_agent_position(Drive *env, int agent_idx) {
Agent *agent = &env->agents[agent_idx];

// Pre-compute drivable lanes
int drivable_lanes[env->num_roads];
float lane_lengths[env->num_roads];
int num_drivable = 0;
float total_lane_length = 0.0f;
for (int i = 0; i < env->num_roads; i++) {
if (env->road_elements[i].type == ROAD_LANE && env->road_elements[i].polyline_length > 0.0f) {
drivable_lanes[num_drivable] = i;
lane_lengths[num_drivable] = env->road_elements[i].polyline_length;
total_lane_length += lane_lengths[num_drivable];
num_drivable++;
}
}

if (num_drivable == 0) return false;

for (int attempt = 0; attempt < MAX_SPAWN_ATTEMPTS; attempt++) {
// Length-weighted lane selection
float r = ((float)rand() / (float)RAND_MAX) * total_lane_length;
float cumulative = 0.0f;
int selected = num_drivable - 1;
for (int k = 0; k < num_drivable; k++) {
cumulative += lane_lengths[k];
if (r < cumulative) {
selected = k;
break;
}
}
RoadMapElement *lane = &env->road_elements[drivable_lanes[selected]];

float spawn_x, spawn_y, spawn_z, spawn_heading;
get_random_point_on_lane(lane, &spawn_x, &spawn_y, &spawn_z, &spawn_heading);
spawn_z += agent->sim_height / 2.0f;

// Temporarily invalidate this agent so check_spawn_collision skips it
float saved_x = agent->sim_x;
agent->sim_x = INVALID_POSITION;
bool collision = check_spawn_collision(env, env->active_agent_count, spawn_x, spawn_y, spawn_z,
spawn_heading, agent->sim_length, agent->sim_width, agent->sim_height);
agent->sim_x = saved_x;
if (collision) continue;

// Check offroad
if (check_spawn_offroad(env, spawn_x, spawn_y, spawn_z, spawn_heading,
agent->sim_length, agent->sim_width, agent->sim_height))
continue;

agent->sim_x = spawn_x;
agent->sim_y = spawn_y;
agent->sim_z = spawn_z;
agent->sim_heading = spawn_heading;
agent->heading_x = cosf(spawn_heading);
agent->heading_y = sinf(spawn_heading);
// Update stored initial position so future non-random resets are consistent
agent->log_trajectory_x[0] = spawn_x;
agent->log_trajectory_y[0] = spawn_y;
agent->log_trajectory_z[0] = spawn_z;
agent->log_heading[0] = spawn_heading;
return true;
}
return false;
}
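The length-weighted lane selection inside the attempt loop above can be exercised in isolation. This standalone sketch (our function name, not part of `drive.h`) reproduces the cumulative-weight scan, including the fallback to the last lane when rounding pushes `r` past the total:

```c
#include <assert.h>

// Standalone copy of the cumulative-weight scan used for length-weighted
// lane selection; r must be a sample in [0, sum(weights)).
static int pick_weighted(const float *weights, int n, float r) {
    float cumulative = 0.0f;
    for (int k = 0; k < n; k++) {
        cumulative += weights[k];
        if (r < cumulative) return k;
    }
    return n - 1;  // fallback if float rounding leaves r >= total
}
```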

void respawn_agent(Drive *env, int agent_idx) {
Agent *agent = &env->agents[agent_idx];
agent->sim_x = agent->log_trajectory_x[0];
agent->sim_y = agent->log_trajectory_y[0];
agent->sim_z = agent->log_trajectory_z[0];
agent->sim_heading = agent->log_heading[0];
agent->heading_x = cosf(agent->sim_heading);
agent->heading_y = sinf(agent->sim_heading);
agent->sim_vx = agent->log_velocity_x[0];
agent->sim_vy = agent->log_velocity_y[0];

if (env->init_mode == INIT_VARIABLE_AGENT_NUMBER) {
if (!randomize_agent_position(env, agent_idx)) {
// Fallback to original position if no valid spawn found
agent->sim_x = agent->log_trajectory_x[0];
agent->sim_y = agent->log_trajectory_y[0];
agent->sim_z = agent->log_trajectory_z[0];
agent->sim_heading = agent->log_heading[0];
agent->heading_x = cosf(agent->sim_heading);
agent->heading_y = sinf(agent->sim_heading);
}
// Sample a new goal relative to the new position
sample_new_goal(env, agent_idx);
agent->sim_vx = 0.0f;
agent->sim_vy = 0.0f;
agent->sim_speed = 0.0f;
agent->sim_speed_signed = 0.0f;
} else {
agent->sim_x = agent->log_trajectory_x[0];
agent->sim_y = agent->log_trajectory_y[0];
agent->sim_z = agent->log_trajectory_z[0];
agent->sim_heading = agent->log_heading[0];
agent->heading_x = cosf(agent->sim_heading);
agent->heading_y = sinf(agent->sim_heading);
agent->sim_vx = agent->log_velocity_x[0];
agent->sim_vy = agent->log_velocity_y[0];
}
agent->metrics_array[COLLISION_IDX] = 0.0f;
agent->metrics_array[OFFROAD_IDX] = 0.0f;
agent->metrics_array[REACHED_GOAL_IDX] = 0.0f;
@@ -2908,8 +2995,21 @@

void c_reset(Drive *env) {
env->timestep = env->init_steps;
set_start_position(env);
reset_goal_positions(env);
if (env->init_mode == INIT_VARIABLE_AGENT_NUMBER) {
// Randomize all agent positions on reset
for (int x = 0; x < env->active_agent_count; x++) {
int agent_idx = env->active_agent_indices[x];
randomize_agent_position(env, agent_idx);
}
// Sample new goals relative to new positions
for (int x = 0; x < env->active_agent_count; x++) {
int agent_idx = env->active_agent_indices[x];
sample_new_goal(env, agent_idx);
}
} else {
set_start_position(env);
reset_goal_positions(env);
}
for (int x = 0; x < env->active_agent_count; x++) {
env->logs[x] = (Log){0};
int agent_idx = env->active_agent_indices[x];
@@ -2939,7 +3039,7 @@
agent->prev_goal_z = agent->sim_z;
generate_reward_coefs(env, agent);

if (env->goal_behavior == GOAL_GENERATE_NEW) {
if (env->goal_behavior == GOAL_GENERATE_NEW && env->init_mode != INIT_VARIABLE_AGENT_NUMBER) {
agent->goal_position_x = agent->init_goal_x;
agent->goal_position_y = agent->init_goal_y;
agent->goal_position_z = agent->init_goal_z;