47 commits
479b24b
Add periodic safe eval during training with reward conditioning
Mar 8, 2026
14e288a
Move bin file cleanup from render_videos to caller
Mar 8, 2026
c655175
Fix eval subprocess: pass device and handle negative arg values
Mar 9, 2026
a674189
Apply ruff formatting to utils.py
Mar 9, 2026
4517070
Decouple render and safe eval from checkpoint interval
Mar 9, 2026
a97ed98
Simplify safe_eval config: remove redundant prefixes
Mar 9, 2026
1d8f526
Replace SAFE_EVAL_REWARD_BOUNDS list with dynamic config iteration
Mar 9, 2026
d98d553
Discover reward bounds from env config instead of hardcoding
Mar 9, 2026
fa599f0
Fix wosac eval: use correct CLI flag and show more stderr on failure
Mar 9, 2026
f6f8d07
Pass episode_length=91 to wosac eval subprocess to match ground truth…
Mar 9, 2026
1db6914
Add eval_async option to run eval subprocesses in background threads
Mar 9, 2026
3b91abc
Fix ruff formatting for eval_async call sites
Mar 9, 2026
648bd0b
Adjust lane alignment value in drive.ini
eugenevinitsky Mar 10, 2026
e73380d
Update drive.ini configuration parameters
eugenevinitsky Mar 10, 2026
741401b
Fix eval config parameters not being passed to subprocesses
Mar 10, 2026
ea1c7af
Fix safe eval: count episodes not steps, fix wandb async logging
Mar 10, 2026
f6413c2
Fix async render bin file race condition
Mar 11, 2026
1d37f29
Merge 3.0_beta: adopt new WOSAC batched eval API
Mar 11, 2026
0a9277a
Use lexicographic sort for checkpoints, update cluster config for torch
Mar 11, 2026
b3a03dc
Switch cluster account to torch_pr_355_general
Mar 11, 2026
c7563d6
Fix sbatch exclude error: don't pass empty --exclude
Mar 11, 2026
46d74f6
Make all evals async by default, fix human replay --eval.num-maps arg
Mar 13, 2026
5781223
Fix async render cleanup, remove env_config forwarding, remove TIMING…
Mar 13, 2026
77be25e
Forward training map_dir and num_maps to safe eval subprocess
Mar 13, 2026
93ce117
Add human replay video rendering, forward map_dir to safe eval
Mar 13, 2026
d9328e6
Extract _dispatch_render to eliminate triplicated render code
Mar 13, 2026
58961db
Organize wandb metrics into separate tabs per eval type
Mar 13, 2026
a578800
Fix safe eval env_config key, fix wandb non-monotonic step errors
Mar 13, 2026
e2d758d
Add thread-safe wandb logging, restrict evals to rank 0
Mar 13, 2026
e04e0e1
Replace lock-based wandb logging with queue-based approach
Mar 13, 2026
f2df83c
Fix memory leak in shared() map counting loop
Mar 14, 2026
8a1c261
Update evaluation docs: async defaults, human replay eval types
Mar 14, 2026
9f9da0f
Fix bugs: C memory leaks, queue drain, stats mean, missing defaults
Mar 14, 2026
455ab55
Default evals to sync instead of async
Mar 14, 2026
43413b2
Revert Py_DECREF change in shared() — pre-existing, not our bug
Mar 14, 2026
14eaa42
Refactor shared() to use continue pattern instead of if/else
Mar 14, 2026
afb8a33
Remove bare try/except blocks from eval launching code
Mar 14, 2026
b00ec64
Reduce eval thread join timeout from 660s to 10s in close()
Mar 14, 2026
7a77642
Enable human replay eval by default, fix ruff formatting
Mar 15, 2026
2c17312
Make safe eval render match metrics subprocess setup
Mar 15, 2026
0a42f38
Add --scale CLI arg to visualize binary for controlling render resolu…
Mar 16, 2026
8125393
Increase render subprocess timeout to 3600s
Mar 16, 2026
2c49401
Enable async rendering by default
Mar 16, 2026
47caa0c
Randomize agent positions on every respawn in variable agent mode
Mar 16, 2026
4c9bcdb
Fix: sample new goals after randomizing agent positions
Mar 17, 2026
66f9ba3
Fix: don't overwrite sampled goals with stale init_goal in variable a…
Mar 17, 2026
7fcebc2
Fix collision check and velocity restoration in respawn
Mar 17, 2026
76 changes: 73 additions & 3 deletions docs/src/evaluation.md
@@ -2,7 +2,77 @@

Driving is a safety-critical multi-agent application, making careful evaluation and risk assessment essential. Mistakes in the real world are costly, so simulations are used to catch errors before deployment. To support rapid iteration, evaluations should also run efficiently, which is why we paid attention to optimizing their speed. This page gives an overview of the available benchmarks and evals.

## Sanity maps 🐛
## Evaluation during training

PufferDrive supports running evaluations automatically during training. There are six evaluation types that can run periodically:

| Eval type | What it does | CLI flag to enable | Interval flag |
|---|---|---|---|
| **Render** | Records top-down and agent-view videos | `--train.render True` | `--train.render-interval N` |
| **Safe eval render** | Records videos with safe reward conditioning | `--safe-eval.enabled True` | `--safe-eval.interval N` |
| **Safe eval metrics** | Runs policy in subprocess, logs driving metrics | `--safe-eval.enabled True` | `--safe-eval.interval N` |
| **WOSAC realism** | Measures distributional realism (WOSAC benchmark) | `--eval.wosac-realism-eval True` | `--eval.eval-interval N` |
| **Human replay render** | Records videos with policy-controlled SDC + replayed humans | `--eval.human-replay-eval True` | `--eval.eval-interval N` |
| **Human replay metrics** | Logs collision/offroad/completion rates vs human replays | `--eval.human-replay-eval True` | `--eval.eval-interval N` |

All eval types trigger at `epoch % interval == 0`. They require a saved checkpoint, so **`checkpoint-interval` must be <= the smallest eval interval**.
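The interval rule can be sketched as two small predicates (illustrative only — the function names are ours, not the actual PufferDrive scheduling code):

```c
#include <assert.h>
#include <stdbool.h>

// Sketch of the trigger rule described above; not the real scheduler.
static bool eval_due(int epoch, int eval_interval) {
    return eval_interval > 0 && epoch % eval_interval == 0;
}

// With checkpoint_interval <= eval_interval, a checkpoint epoch has always
// occurred by the time an eval fires, so the subprocess has weights to load.
static bool checkpoint_available(int epoch, int checkpoint_interval) {
    return checkpoint_interval > 0 && epoch >= checkpoint_interval;
}
```

For example, with `--train.checkpoint-interval 500` but `--safe-eval.interval 250`, the eval firing at epoch 250 would find no checkpoint yet — hence the constraint above.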

### Example: enable all evals

```bash
puffer train puffer_drive \
--wandb --wandb-project pufferdrive \
--train.checkpoint-interval 250 \
--train.render True --train.render-interval 250 \
--safe-eval.enabled True --safe-eval.interval 250 \
--eval.wosac-realism-eval True \
--eval.human-replay-eval True \
--eval.eval-interval 250
```

### Safe eval

Safe eval measures how well the policy drives when given "safe" reward conditioning values (high penalties for collisions and offroad driving, rewards for lane keeping). It runs in a **separate subprocess** that loads the latest checkpoint, creates a fresh environment, and collects metrics over multiple episodes.

The safe eval subprocess inherits the training environment configuration (map directory, reward bounds, etc.) but overrides a few parameters:

- `num_agents`: Number of agents in the eval environment (default: 64)
- `episode_length`: How long each eval episode runs (default: 1000 steps)
- `num_episodes`: How many episode completions to collect before reporting (default: 100)
- `resample_frequency`: Automatically set to 0 (disabled) so episodes can run to completion

Metrics logged to wandb under `eval/*`:

- `eval/score`, `eval/collision_rate`, `eval/offroad_rate`
- `eval/completion_rate`, `eval/dnf_rate`
- `eval/episode_length`, `eval/episode_return`
- `eval/lane_alignment_rate`, `eval/lane_center_rate`
- And more (see `drive.h` `Log` struct for the full list)

Configure safe eval reward conditioning in `drive.ini` under `[safe_eval]`:

```ini
[safe_eval]
enabled = True
interval = 250
num_agents = 64
num_episodes = 100
episode_length = 1000

; Fixed reward conditioning values (min=max pins the value)
collision = -3.0
offroad = -3.0
overspeed = -1.0
traffic_light = -1.0
lane_align = 0.025
velocity = 0.005
```
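The `min=max` convention works because the environment samples each reward coefficient between its bounds, so equal bounds collapse the range to a single value. A minimal sketch of the idea (hypothetical helper, assuming uniform sampling — the real logic lives in `generate_reward_coefs` in `drive.h`):

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical helper, assuming uniform sampling between the bounds;
// when lo == hi the coefficient is pinned regardless of the RNG.
static float sample_coef(float lo, float hi) {
    if (lo == hi) return lo;  // pinned, e.g. collision = -3.0
    float u = (float)rand() / (float)RAND_MAX;
    return lo + u * (hi - lo);
}
```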

### Async vs sync evaluation

By default, all evals run synchronously (blocking training until they finish). Set `--train.render-async True` to run video renders in separate processes, and `--eval.eval-async True` to run metric evals (safe eval, WOSAC, human replay) in background threads. When async, results are queued and logged to wandb on the main thread during the next training epoch.
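The queued hand-off works like a producer/consumer ring buffer: background eval threads only push results, and the training loop drains and logs on the main thread, so the logger is never called from two threads at once. A minimal C sketch of the pattern (illustrative only — the actual implementation is in the Python training loop, and all names here are ours):

```c
#include <assert.h>
#include <pthread.h>

// Mutex-guarded ring buffer: eval threads push, main thread drains.
#define QCAP 64
typedef struct { char name[32]; double value; int epoch; } EvalResult;
typedef struct {
    EvalResult buf[QCAP];
    int head, tail;            // head == tail means empty
    pthread_mutex_t mu;
} ResultQueue;

static void queue_init(ResultQueue *q) {
    q->head = q->tail = 0;
    pthread_mutex_init(&q->mu, NULL);
}

// Called from an eval thread; returns 0 (drops the result) if full.
static int queue_push(ResultQueue *q, EvalResult r) {
    pthread_mutex_lock(&q->mu);
    int next = (q->tail + 1) % QCAP;
    int ok = next != q->head;
    if (ok) { q->buf[q->tail] = r; q->tail = next; }
    pthread_mutex_unlock(&q->mu);
    return ok;
}

// Called once per training epoch on the main thread; returns count drained.
static int queue_drain(ResultQueue *q, EvalResult *out, int max) {
    pthread_mutex_lock(&q->mu);
    int n = 0;
    while (q->head != q->tail && n < max) {
        out[n++] = q->buf[q->head];
        q->head = (q->head + 1) % QCAP;
    }
    pthread_mutex_unlock(&q->mu);
    return n;
}
```

Draining on the main thread also sidesteps the non-monotonic wandb step errors mentioned in the commit history, since one thread controls the logging order.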

## Sanity maps

Quickly test the training on curated, lightweight scenarios without downloading the full dataset. Each sanity map tests a specific behavior.

@@ -33,7 +103,7 @@ Available maps:

![Sanity map gallery placeholder](images/maps_screenshot.png)

## Distributional realism benchmark 📊
## Distributional realism benchmark (WOSAC)

We provide a PufferDrive implementation of the Waymo Open Sim Agents Challenge (WOSAC) for fast, easy evaluation of how well your trained agent matches distributional properties of human behavior.

Expand All @@ -45,7 +115,7 @@ Add `--load-model-path <path_to_checkpoint>.pt` to score a trained policy, inste

See [the WOSAC benchmark page](wosac.md) for the metric pipeline and all the details.

## Human-compatibility benchmark 🤝
## Human-compatibility benchmark

You may be interested in how compatible your agent is with human partners. For this purpose, we support an eval where your policy only controls the self-driving car (SDC). The rest of the agents in the scene are stepped using the logs. While it is not a perfect eval since the human partners here are static, it will still give you a sense of how closely aligned your agent's behavior is to how people drive. You can run it like this:

50 changes: 46 additions & 4 deletions pufferlib/config/ocean/drive.ini
@@ -158,7 +158,7 @@ vtrace_rho_clip = 1
checkpoint_interval = 1000
; Rendering options
render = True
render_async = False # Render interval of below 50 might cause process starvation and slowness in training
render_async = True
render_interval = 1000
; If True, show exactly what the agent sees in agent observation
obs_only = True
@@ -175,6 +175,8 @@ render_map = none

[eval]
eval_interval = 1000
; If True, run eval subprocesses (wosac, human replay, safe eval metrics) in background threads
eval_async = False # Run eval subprocesses (wosac, human replay, safe eval metrics) in background threads
; Path to dataset used for evaluation
map_dir = "resources/drive/binaries/training"
; Number of scenarios to process per batch
@@ -204,14 +206,54 @@ wosac_goal_radius = 2.0
wosac_sanity_check = False
; Only return aggregate results across all scenes
wosac_aggregate_results = True
; Episode length for WOSAC eval (ground truth logs are 9.1s at 10Hz = 91 steps)
wosac_episode_length = 91
; Evaluation mode: "policy", "ground_truth"
wosac_eval_mode = "policy"
; If True, enable human replay evaluation (pair policy-controlled agent with human replays)
human_replay_eval = False
human_replay_eval = True
; Number of agents for human replay evaluation
human_replay_num_agents = 16
; Control only the self-driving car
human_replay_control_mode = "control_sdc_only"
; Number of scenarios for human replay evaluation equals the number of agents
human_replay_num_agents = 16

[safe_eval]
; If True, periodically run policy with safe/law-abiding reward conditioning and log videos + metrics
enabled = True
; How often to run safe eval (in training epochs). Defaults to render_interval.
interval = 250
; Number of agents to run in the eval environment
num_agents = 64
; Number of episodes to collect metrics over
num_episodes = 100
; Episode length in steps
episode_length = 1000
min_goal_distance = 0.5
max_goal_distance = 1000.0

; Reward conditioning values (min=max to fix the value).
; Names match the env reward_bound_* keys.
; High penalties for unsafe behavior
collision = -3.0
offroad = -3.0
overspeed = -1.0
traffic_light = -1.0
reverse = -0.0075
comfort = -0.1

; Standard driving rewards
goal_radius = 2.0
lane_align = 0.025
lane_center = -0.00075
velocity = 0.005
center_bias = 0.0
vel_align = 1.0
timestep = -0.00005

; Neutral scaling factors
throttle = 1.0
steer = 1.0
acc = 1.0

[render]
; Mode to render a bunch of maps with a given policy
33 changes: 25 additions & 8 deletions pufferlib/ocean/drive/binding.c
@@ -245,6 +245,7 @@ static PyObject *my_shared(PyObject *self, PyObject *args, PyObject *kwargs) {
free(env->agents);
free(env->road_elements);
free(env->road_scenario_ids);
free(env->tracks_to_predict_indices);
free(env->active_agent_indices);
free(env->static_agent_indices);
free(env->expert_static_agent_indices);
@@ -257,14 +258,6 @@
return NULL;
}

// Store map_id
PyObject *map_id_obj = PyLong_FromLong(map_id);
PyList_SetItem(map_ids, env_count, map_id_obj);
// Store agent offset
PyObject *offset = PyLong_FromLong(total_agent_count);
PyList_SetItem(agent_offsets, env_count, offset);
total_agent_count += env->active_agent_count;
env_count++;
for (int j = 0; j < env->num_objects; j++) {
free_agent(&env->agents[j]);
}
@@ -274,12 +267,36 @@
free(env->agents);
free(env->road_elements);
free(env->road_scenario_ids);
free(env->tracks_to_predict_indices);
free(env->active_agent_indices);
free(env->static_agent_indices);
free(env->expert_static_agent_indices);
free(env);
continue;
}

// Map has active agents — record it
PyObject *map_id_obj = PyLong_FromLong(map_id);
PyList_SetItem(map_ids, env_count, map_id_obj);
PyObject *offset = PyLong_FromLong(total_agent_count);
PyList_SetItem(agent_offsets, env_count, offset);
total_agent_count += env->active_agent_count;
env_count++;

for (int j = 0; j < env->num_objects; j++) {
free_agent(&env->agents[j]);
}
for (int j = 0; j < env->num_roads; j++) {
free_road_element(&env->road_elements[j]);
}
free(env->agents);
free(env->road_elements);
free(env->road_scenario_ids);
free(env->tracks_to_predict_indices);
free(env->active_agent_indices);
free(env->static_agent_indices);
free(env->expert_static_agent_indices);
free(env);
}
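The same seven `free()` calls now appear in three exit paths of `my_shared()`. A sketch of how that cleanup could be centralized (the struct here is a reduced stand-in, not the real `Drive`; nulling the pointers makes accidental reuse fail loudly instead of corrupting memory):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Reduced stand-in for the Drive struct, using field names from the diff.
typedef struct {
    void *agents, *road_elements, *road_scenario_ids;
    void *tracks_to_predict_indices, *active_agent_indices;
    void *static_agent_indices, *expert_static_agent_indices;
} EnvBuffers;

// One helper instead of three copies; pointers are nulled after freeing.
static void free_env_buffers(EnvBuffers *env) {
    free(env->agents);                      env->agents = NULL;
    free(env->road_elements);               env->road_elements = NULL;
    free(env->road_scenario_ids);           env->road_scenario_ids = NULL;
    free(env->tracks_to_predict_indices);   env->tracks_to_predict_indices = NULL;
    free(env->active_agent_indices);        env->active_agent_indices = NULL;
    free(env->static_agent_indices);        env->static_agent_indices = NULL;
    free(env->expert_static_agent_indices); env->expert_static_agent_indices = NULL;
}
```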

if (total_agent_count >= num_agents) {
122 changes: 111 additions & 11 deletions pufferlib/ocean/drive/drive.h
@@ -2634,16 +2634,103 @@ void compute_observations(Drive *env) {
}
}

// Find a random collision-free position on a drivable lane for an existing agent.
// Returns true if a valid position was found and updates the agent's sim_x/y/z/heading.
static bool randomize_agent_position(Drive *env, int agent_idx) {
Agent *agent = &env->agents[agent_idx];

// Pre-compute drivable lanes
int drivable_lanes[env->num_roads];
float lane_lengths[env->num_roads];
int num_drivable = 0;
float total_lane_length = 0.0f;
for (int i = 0; i < env->num_roads; i++) {
if (env->road_elements[i].type == ROAD_LANE && env->road_elements[i].polyline_length > 0.0f) {
drivable_lanes[num_drivable] = i;
lane_lengths[num_drivable] = env->road_elements[i].polyline_length;
total_lane_length += lane_lengths[num_drivable];
num_drivable++;
}
}

if (num_drivable == 0) return false;

for (int attempt = 0; attempt < MAX_SPAWN_ATTEMPTS; attempt++) {
// Length-weighted lane selection
float r = ((float)rand() / (float)RAND_MAX) * total_lane_length;
float cumulative = 0.0f;
int selected = num_drivable - 1;
for (int k = 0; k < num_drivable; k++) {
cumulative += lane_lengths[k];
if (r < cumulative) {
selected = k;
break;
}
}
RoadMapElement *lane = &env->road_elements[drivable_lanes[selected]];

float spawn_x, spawn_y, spawn_z, spawn_heading;
get_random_point_on_lane(lane, &spawn_x, &spawn_y, &spawn_z, &spawn_heading);
spawn_z += agent->sim_height / 2.0f;

// Temporarily invalidate this agent so check_spawn_collision skips it
float saved_x = agent->sim_x;
agent->sim_x = INVALID_POSITION;
bool collision = check_spawn_collision(env, env->active_agent_count, spawn_x, spawn_y, spawn_z,
spawn_heading, agent->sim_length, agent->sim_width, agent->sim_height);
agent->sim_x = saved_x;
if (collision) continue;

// Check offroad
if (check_spawn_offroad(env, spawn_x, spawn_y, spawn_z, spawn_heading,
agent->sim_length, agent->sim_width, agent->sim_height))
continue;

agent->sim_x = spawn_x;
agent->sim_y = spawn_y;
agent->sim_z = spawn_z;
agent->sim_heading = spawn_heading;
agent->heading_x = cosf(spawn_heading);
agent->heading_y = sinf(spawn_heading);
// Update stored initial position so future non-random resets are consistent
agent->log_trajectory_x[0] = spawn_x;
agent->log_trajectory_y[0] = spawn_y;
agent->log_trajectory_z[0] = spawn_z;
agent->log_heading[0] = spawn_heading;
return true;
}
return false;
}
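The length-weighted lane selection inside the attempt loop above can be exercised in isolation. This standalone sketch (our function name, not part of `drive.h`) reproduces the cumulative-weight scan, including the fallback to the last lane when rounding pushes `r` past the total:

```c
#include <assert.h>

// Standalone copy of the cumulative-weight scan used for length-weighted
// lane selection; r must be a sample in [0, sum(weights)).
static int pick_weighted(const float *weights, int n, float r) {
    float cumulative = 0.0f;
    for (int k = 0; k < n; k++) {
        cumulative += weights[k];
        if (r < cumulative) return k;
    }
    return n - 1;  // fallback if float rounding leaves r >= total
}
```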

void respawn_agent(Drive *env, int agent_idx) {
Agent *agent = &env->agents[agent_idx];
agent->sim_x = agent->log_trajectory_x[0];
agent->sim_y = agent->log_trajectory_y[0];
agent->sim_z = agent->log_trajectory_z[0];
agent->sim_heading = agent->log_heading[0];
agent->heading_x = cosf(agent->sim_heading);
agent->heading_y = sinf(agent->sim_heading);
agent->sim_vx = agent->log_velocity_x[0];
agent->sim_vy = agent->log_velocity_y[0];

if (env->init_mode == INIT_VARIABLE_AGENT_NUMBER) {
if (!randomize_agent_position(env, agent_idx)) {
// Fallback to original position if no valid spawn found
agent->sim_x = agent->log_trajectory_x[0];
agent->sim_y = agent->log_trajectory_y[0];
agent->sim_z = agent->log_trajectory_z[0];
agent->sim_heading = agent->log_heading[0];
agent->heading_x = cosf(agent->sim_heading);
agent->heading_y = sinf(agent->sim_heading);
}
// Sample a new goal relative to the new position
sample_new_goal(env, agent_idx);
agent->sim_vx = 0.0f;
agent->sim_vy = 0.0f;
agent->sim_speed = 0.0f;
agent->sim_speed_signed = 0.0f;
} else {
agent->sim_x = agent->log_trajectory_x[0];
agent->sim_y = agent->log_trajectory_y[0];
agent->sim_z = agent->log_trajectory_z[0];
agent->sim_heading = agent->log_heading[0];
agent->heading_x = cosf(agent->sim_heading);
agent->heading_y = sinf(agent->sim_heading);
agent->sim_vx = agent->log_velocity_x[0];
agent->sim_vy = agent->log_velocity_y[0];
}
agent->metrics_array[COLLISION_IDX] = 0.0f;
agent->metrics_array[OFFROAD_IDX] = 0.0f;
agent->metrics_array[REACHED_GOAL_IDX] = 0.0f;
@@ -2908,8 +2995,21 @@

void c_reset(Drive *env) {
env->timestep = env->init_steps;
set_start_position(env);
reset_goal_positions(env);
if (env->init_mode == INIT_VARIABLE_AGENT_NUMBER) {
// Randomize all agent positions on reset
for (int x = 0; x < env->active_agent_count; x++) {
int agent_idx = env->active_agent_indices[x];
randomize_agent_position(env, agent_idx);
}
// Sample new goals relative to new positions
for (int x = 0; x < env->active_agent_count; x++) {
int agent_idx = env->active_agent_indices[x];
sample_new_goal(env, agent_idx);
}
} else {
set_start_position(env);
reset_goal_positions(env);
}
for (int x = 0; x < env->active_agent_count; x++) {
env->logs[x] = (Log){0};
int agent_idx = env->active_agent_indices[x];
@@ -2939,7 +3039,7 @@
agent->prev_goal_z = agent->sim_z;
generate_reward_coefs(env, agent);

if (env->goal_behavior == GOAL_GENERATE_NEW) {
if (env->goal_behavior == GOAL_GENERATE_NEW && env->init_mode != INIT_VARIABLE_AGENT_NUMBER) {
agent->goal_position_x = agent->init_goal_x;
agent->goal_position_y = agent->init_goal_y;
agent->goal_position_z = agent->init_goal_z;