STOP behavior reward and learning masking #353
Open
riccardosavorgnan wants to merge 14 commits into 3.0 from
Conversation
…ask for rewards and loss
… advantages, masked gradients correctly -filter data-
    self.map_ids = map_ids
    self.num_envs = num_envs
    super().__init__(buf=buf)
    if buf is not None and "is_invalid_step" in buf:
When does this `if` get triggered?
    self.terminals[cur:nxt],
    self.truncations[cur:nxt],
    seed,
    self.is_invalid_step[cur:nxt],
Total aesthetic nit, but I feel like this should go before `seed`, next to `truncations`.
Comment on lines +162 to +175
if (!PyObject_TypeCheck(inv, &PyArray_Type)) {
    PyErr_SetString(PyExc_TypeError, "is_invalid_step must be a NumPy array");
    return NULL;
}
PyArrayObject *is_invalid_step = (PyArrayObject *)inv;
if (!PyArray_ISCONTIGUOUS(is_invalid_step)) {
    PyErr_SetString(PyExc_ValueError, "is_invalid_step must be contiguous");
    return NULL;
}
if (PyArray_NDIM(is_invalid_step) != 1) {
    PyErr_SetString(PyExc_ValueError, "is_invalid_step must be 1D");
    return NULL;
}
env->is_invalid_step = PyArray_DATA(is_invalid_step);
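For reference, a minimal Python-side sketch of a buffer that would pass the C validation above (the function name is hypothetical, not the actual pufferlib API; only the three checked properties matter):

```python
import numpy as np

def make_is_invalid_step_buffer(num_agents):
    # Hypothetical helper: one flag per agent slot. The C side reads
    # raw memory via PyArray_DATA, so the array must be 1D and
    # C-contiguous, exactly as the binding checks.
    buf = np.zeros(num_agents, dtype=np.uint8)
    assert buf.ndim == 1                  # mirrors the PyArray_NDIM check
    assert buf.flags["C_CONTIGUOUS"]      # mirrors PyArray_ISCONTIGUOUS
    return buf
```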
    while self.full_rows < self.segments:
        profile("env", epoch)
-       o, r, d, t, info, env_id, mask = self.vecenv.recv()
+       o, r, d, t, info, env_id, mask, is_invalid_step = self.vecenv.recv()
This is not important, and we don't have to change it, but a more common semantic is "is_valid_step" because people find negations harder to reason about.
…ask for rewards and loss
… advantages, masked gradients correctly -filter data-
…havior_gradient_fix
pg_loss and v_loss were already masked with ~mb_is_invalid_step, but entropy_loss was computed over all steps, including stopped agents. This meant the entropy bonus was encouraging exploration for agents that can't act, wasting gradient signal.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
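The fix described above can be sketched as follows (shapes and names are assumed from the commit message, not copied from the actual diff): the entropy bonus is averaged only over valid steps, the same way pg_loss and v_loss already are.

```python
import numpy as np

def masked_entropy_loss(entropy, mb_is_invalid_step):
    # entropy: 1D [B*TT]; mb_is_invalid_step: 2D boolean [minibatch, bptt].
    # Flatten the mask so it indexes entropy's shape, then average
    # only over steps where the agent could actually act.
    valid = ~mb_is_invalid_step.reshape(-1)
    if valid.any():
        return entropy[valid].mean()
    return 0.0  # guard: all-invalid minibatch contributes no entropy bonus
```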
Truncate episodes when 50% or more of agents are stopped, rather than waiting for the full episode length. Stopped agents accumulate no useful learning signal but consume compute.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
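The truncation rule above amounts to a one-line predicate; this sketch uses assumed names (the commit's actual implementation may differ):

```python
import numpy as np

def should_truncate(stopped_flags, threshold=0.5):
    # stopped_flags: boolean array with one entry per agent.
    # End the episode early once at least `threshold` of the agents
    # are STOPPED, since they produce no further learning signal.
    return bool(stopped_flags.mean() >= threshold)
```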
entropy is 1D [B*TT] but mb_is_invalid_step is 2D [minibatch, bptt]. Reshape mask to 1D before indexing to match entropy's shape.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stopped agents can't act, so computing metrics for them pollutes logged statistics (e.g. offroad_per_agent keeps incrementing while stopped-in-place). Now we skip compute_agent_metrics and all reward/metric accumulation for stopped agents, setting reward to 0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
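A minimal sketch of the accumulation guard described in this commit, with assumed names and array-based metrics (the real code calls compute_agent_metrics per agent):

```python
import numpy as np

def accumulate_step(rewards, metrics_sum, step_metrics, stopped):
    # Zero out reward for stopped agents so no return accumulates.
    rewards = np.where(stopped, 0.0, rewards)
    # Skip metric accumulation for stopped agents so counters like
    # offroad_per_agent stop incrementing while stopped-in-place.
    active = ~stopped
    metrics_sum[active] += step_metrics[active]
    return rewards, metrics_sum
```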
This PR updates the code so that we neither accumulate returns nor update the policy from steps where the car is STOPPED (STOP behavior from either collision or offroad).
High level:
NOTE: we do not mask the entropy bonus at the moment.
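Putting the PR's core idea in one place, here is a hedged sketch of masking the policy-gradient loss with the invalid-step flag (names and shapes are assumptions, not the PR's exact code):

```python
import numpy as np

def masked_pg_loss(advantages, log_probs, is_invalid_step):
    # STOPPED steps contribute neither return mass nor gradient signal:
    # compute per-step losses, then average only over valid steps.
    valid = ~is_invalid_step
    losses = -(advantages * log_probs)
    return losses[valid].mean() if valid.any() else 0.0
```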