STOP behavior reward and learning masking #353
Open
riccardosavorgnan wants to merge 14 commits into 3.0 from
Conversation
…ask for rewards and loss
… advantages, masked gradients correctly -filter data-
    self.map_ids = map_ids
    self.num_envs = num_envs
    super().__init__(buf=buf)
    if buf is not None and "is_invalid_step" in buf:
When does this `if` get triggered?
    self.terminals[cur:nxt],
    self.truncations[cur:nxt],
    seed,
    self.is_invalid_step[cur:nxt],
Total aesthetic nit, but I feel like this should go before `seed`, next to `truncations`.
Comment on lines +162 to +175
if (!PyObject_TypeCheck(inv, &PyArray_Type)) {
    PyErr_SetString(PyExc_TypeError, "is_invalid_step must be a NumPy array");
    return NULL;
}
PyArrayObject *is_invalid_step = (PyArrayObject *)inv;
if (!PyArray_ISCONTIGUOUS(is_invalid_step)) {
    PyErr_SetString(PyExc_ValueError, "is_invalid_step must be contiguous");
    return NULL;
}
if (PyArray_NDIM(is_invalid_step) != 1) {
    PyErr_SetString(PyExc_ValueError, "is_invalid_step must be 1D");
    return NULL;
}
env->is_invalid_step = PyArray_DATA(is_invalid_step);
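For reference, a minimal Python-side sketch of a buffer that would pass the C validation above (the function name is hypothetical, not the actual pufferlib API; only the three checked properties matter):

```python
import numpy as np

def make_is_invalid_step_buffer(num_agents):
    # Hypothetical helper: one flag per agent slot. The C side reads
    # raw memory via PyArray_DATA, so the array must be 1D and
    # C-contiguous, exactly as the binding checks.
    buf = np.zeros(num_agents, dtype=np.uint8)
    assert buf.ndim == 1                  # mirrors the PyArray_NDIM check
    assert buf.flags["C_CONTIGUOUS"]      # mirrors PyArray_ISCONTIGUOUS
    return buf
```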
    while self.full_rows < self.segments:
        profile("env", epoch)
-       o, r, d, t, info, env_id, mask = self.vecenv.recv()
+       o, r, d, t, info, env_id, mask, is_invalid_step = self.vecenv.recv()
This is not important, and we don't have to change it, but a more common semantic is "is_valid_step" because people find negations harder to reason about.
…ask for rewards and loss
… advantages, masked gradients correctly -filter data-
…havior_gradient_fix
pg_loss and v_loss were already masked with ~mb_is_invalid_step, but entropy_loss was computed over all steps, including stopped agents. This meant the entropy bonus was encouraging exploration for agents that can't act, wasting gradient signal.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
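The fix described above can be sketched as follows (shapes and names are assumed from the commit message, not copied from the actual diff): the entropy bonus is averaged only over valid steps, the same way pg_loss and v_loss already are.

```python
import numpy as np

def masked_entropy_loss(entropy, mb_is_invalid_step):
    # entropy: 1D [B*TT]; mb_is_invalid_step: 2D boolean [minibatch, bptt].
    # Flatten the mask so it indexes entropy's shape, then average
    # only over steps where the agent could actually act.
    valid = ~mb_is_invalid_step.reshape(-1)
    if valid.any():
        return entropy[valid].mean()
    return 0.0  # guard: all-invalid minibatch contributes no entropy bonus
```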
Truncate episodes when 50% or more of agents are stopped, rather than waiting for the full episode length. Stopped agents accumulate no useful learning signal but consume compute.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
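The truncation rule above amounts to a one-line predicate; this sketch uses assumed names (the commit's actual implementation may differ):

```python
import numpy as np

def should_truncate(stopped_flags, threshold=0.5):
    # stopped_flags: boolean array with one entry per agent.
    # End the episode early once at least `threshold` of the agents
    # are STOPPED, since they produce no further learning signal.
    return bool(stopped_flags.mean() >= threshold)
```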
entropy is 1D [B*TT] but mb_is_invalid_step is 2D [minibatch, bptt]. Reshape mask to 1D before indexing to match entropy's shape.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stopped agents can't act, so computing metrics for them pollutes logged statistics (e.g. offroad_per_agent keeps incrementing while stopped-in-place). Now we skip compute_agent_metrics and all reward/metric accumulation for stopped agents, setting reward to 0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
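A minimal sketch of the accumulation guard described in this commit, with assumed names and array-based metrics (the real code calls compute_agent_metrics per agent):

```python
import numpy as np

def accumulate_step(rewards, metrics_sum, step_metrics, stopped):
    # Zero out reward for stopped agents so no return accumulates.
    rewards = np.where(stopped, 0.0, rewards)
    # Skip metric accumulation for stopped agents so counters like
    # offroad_per_agent stop incrementing while stopped-in-place.
    active = ~stopped
    metrics_sum[active] += step_metrics[active]
    return rewards, metrics_sum
```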
This PR updates the code so that we neither accumulate returns nor update the policy from steps where the car is STOPPED (STOP behavior from either collision or offroad).
High level:
NOTE: we do not mask the entropy bonus at the moment.
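Putting the PR's core idea in one place, here is a hedged sketch of masking the policy-gradient loss with the invalid-step flag (names and shapes are assumptions, not the PR's exact code):

```python
import numpy as np

def masked_pg_loss(advantages, log_probs, is_invalid_step):
    # STOPPED steps contribute neither return mass nor gradient signal:
    # compute per-step losses, then average only over valid steps.
    valid = ~is_invalid_step
    losses = -(advantages * log_probs)
    return losses[valid].mean() if valid.any() else 0.0
```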