STOP behavior reward and learning masking. #353

Open
riccardosavorgnan wants to merge 14 commits into 3.0 from ricky/stop_behavior_gradient_fix

@riccardosavorgnan riccardosavorgnan commented Mar 20, 2026

This PR updates the code so that we neither accumulate returns nor update the policy on steps where the car is STOPPED (STOP behavior triggered by either a collision or going offroad).

High level:

  • we add a flag for an invalid step and pass it up to the Python hierarchy
  • we set rewards to 0 when the flag is true
  • we mask gradients on steps where the flag is true, which prevents policy updates from those steps (and saves gradient operations).

NOTE: we do not mask the entropy bonus at the moment.
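
The two masking steps listed above can be sketched in NumPy as follows. This is a minimal illustration, not the PR's implementation: the helper names `zero_invalid_rewards` and `masked_mean` are hypothetical, and `is_invalid_step` is assumed to be a boolean array aligned with the per-step rewards and losses.

```python
import numpy as np

def zero_invalid_rewards(rewards, is_invalid_step):
    """Return a copy of rewards with invalid (STOPPED) steps set to 0."""
    return np.where(is_invalid_step, 0.0, rewards)

def masked_mean(per_step_loss, is_invalid_step):
    """Average a per-step loss over valid steps only.

    Invalid steps are excluded from the mean, so they contribute
    nothing to the loss and therefore nothing to its gradient.
    """
    valid = ~is_invalid_step  # requires a boolean dtype
    if valid.sum() == 0:
        return 0.0  # assumption: an all-invalid batch yields zero loss
    return float(per_step_loss[valid].mean())

# Hypothetical data: the last two steps belong to a stopped agent.
rewards = np.array([1.0, 0.5, -2.0, -2.0])
is_invalid_step = np.array([False, False, True, True])
per_step_loss = np.array([0.2, 0.4, 9.0, 9.0])

masked_rewards = zero_invalid_rewards(rewards, is_invalid_step)
loss = masked_mean(per_step_loss, is_invalid_step)
```

Filtering before the mean, rather than multiplying by a 0/1 mask and dividing by the batch size, keeps the loss scale independent of how many agents happen to be stopped.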

@riccardosavorgnan riccardosavorgnan marked this pull request as ready for review March 20, 2026 16:15
self.map_ids = map_ids
self.num_envs = num_envs
super().__init__(buf=buf)
if buf is not None and "is_invalid_step" in buf:

when does this if get triggered?

self.terminals[cur:nxt],
self.truncations[cur:nxt],
seed,
self.is_invalid_step[cur:nxt],

total aesthetic nit, but I feel like this should go before the seed and next to truncations

Comment on lines +162 to +175
if (!PyObject_TypeCheck(inv, &PyArray_Type)) {
PyErr_SetString(PyExc_TypeError, "is_invalid_step must be a NumPy array");
return NULL;
}
PyArrayObject *is_invalid_step = (PyArrayObject *)inv;
if (!PyArray_ISCONTIGUOUS(is_invalid_step)) {
PyErr_SetString(PyExc_ValueError, "is_invalid_step must be contiguous");
return NULL;
}
if (PyArray_NDIM(is_invalid_step) != 1) {
PyErr_SetString(PyExc_ValueError, "is_invalid_step must be 1D");
return NULL;
}
env->is_invalid_step = PyArray_DATA(is_invalid_step);

love it

while self.full_rows < self.segments:
profile("env", epoch)
- o, r, d, t, info, env_id, mask = self.vecenv.recv()
+ o, r, d, t, info, env_id, mask, is_invalid_step = self.vecenv.recv()

this is not important, and we don't have to change it, but a more common semantic is "is_valid_step" because people find it harder to do negations

riccardosavorgnan and others added 11 commits March 23, 2026 12:40
… advantages, masked gradients correctly -filter data-
pg_loss and v_loss were already masked with ~mb_is_invalid_step, but
entropy_loss was computed over all steps including stopped agents.
This meant the entropy bonus was encouraging exploration for agents
that can't act, wasting gradient signal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Truncate episodes when 50% or more of agents are stopped, rather than
waiting for the full episode length. Stopped agents accumulate no useful
learning signal but consume compute.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
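
A rough sketch of the truncation rule described in this commit, in plain Python. The function name `should_truncate` and the `threshold` parameter are hypothetical; the commit message states only the 50% rule, and the actual check may live elsewhere in the environment code.

```python
def should_truncate(is_stopped, threshold=0.5):
    """Return True when the fraction of stopped agents reaches the threshold.

    `is_stopped` is a sequence of per-agent booleans. An empty episode
    is never truncated (an assumption made for this sketch).
    """
    if len(is_stopped) == 0:
        return False
    stopped_fraction = sum(bool(s) for s in is_stopped) / len(is_stopped)
    # ">=" matches the "50% or more" wording of the commit message.
    return stopped_fraction >= threshold
```

With two agents, one stopped agent is enough to truncate; with three agents, two must be stopped.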
entropy is 1D [B*TT] but mb_is_invalid_step is 2D [minibatch, bptt].
Reshape mask to 1D before indexing to match entropy's shape.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
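
The shape fix can be illustrated with a small NumPy sketch. The array values are invented, and the sketch assumes that `entropy` was flattened in row-major order from `[minibatch, bptt]`, so that flattening the mask the same way keeps the two aligned.

```python
import numpy as np

minibatch, bptt = 2, 3

# Entropy arrives already flattened: shape [minibatch * bptt].
entropy = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])

# The invalid-step mask is still 2D: shape [minibatch, bptt].
mb_is_invalid_step = np.array([[False, False, True],
                               [False, True, True]])

# Flatten the mask to 1D so it matches entropy's shape, then invert it
# to select only the valid steps.
valid_flat = ~mb_is_invalid_step.reshape(-1)
valid_entropy = entropy[valid_flat]

# Guard against a minibatch in which every step is invalid.
entropy_mean = float(valid_entropy.mean()) if valid_entropy.size else 0.0
```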
Stopped agents can't act, so computing metrics for them pollutes
logged statistics (e.g. offroad_per_agent keeps incrementing while
stopped-in-place). Now we skip compute_agent_metrics and all
reward/metric accumulation for stopped agents, setting reward to 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
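
The skip logic from this commit might look roughly like the following, written in Python for illustration even though `compute_agent_metrics` belongs to the environment code. The dictionary fields and the function name `accumulate_step` are hypothetical.

```python
def accumulate_step(agents, step_rewards):
    """Accumulate rewards and metrics only for agents that are not stopped.

    Stopped agents receive a reward of 0 and their counters are left
    untouched, so a car stopped offroad no longer inflates
    offroad_count on every subsequent step.
    """
    rewards_out = []
    for agent, reward in zip(agents, step_rewards):
        if agent["stopped"]:
            rewards_out.append(0.0)
            continue  # skip all metric and return accumulation
        agent["offroad_count"] += int(agent["is_offroad"])
        agent["episode_return"] += reward
        rewards_out.append(reward)
    return rewards_out

# Hypothetical example: the second agent is stopped while offroad.
agents = [
    {"stopped": False, "is_offroad": True, "offroad_count": 0, "episode_return": 0.0},
    {"stopped": True, "is_offroad": True, "offroad_count": 0, "episode_return": 0.0},
]
step_out = accumulate_step(agents, [1.0, -2.0])
```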