clip speed at limit, remove overspeed reward, move advantages out of loop #378

eugenevinitsky wants to merge 10 commits into `3.0`
Conversation
- `clipSpeed` now takes a `maxSpeed` parameter instead of using `MAX_SPEED` (100 m/s)
- Classic dynamics clips at `SPEED_LIMIT` (20 m/s)
- Jerk dynamics uses `SPEED_LIMIT` instead of hardcoded `20.0f`
- Remove the overspeed reward penalty (was ~1.0/step, dwarfing all other rewards)

Speed is now physically clamped, so agents can't overspeed.
The GOAL_STOP branch used assignment (=) instead of accumulation (+=) for rewards[i], discarding all other reward components (collision, lane, velocity) computed earlier in the same step. episode_return already used +=, so it diverged from the actual training reward.
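The assignment-vs-accumulation bug can be illustrated with a minimal Python sketch (the actual code is C; function and variable names here are purely illustrative):

```python
def step_rewards_buggy(collision_r, lane_r, velocity_r, goal_r):
    # Earlier reward components are accumulated...
    reward = collision_r + lane_r + velocity_r
    # ...then the GOAL_STOP branch overwrites them with plain assignment,
    # discarding collision/lane/velocity terms for this step.
    reward = goal_r
    return reward

def step_rewards_fixed(collision_r, lane_r, velocity_r, goal_r):
    reward = collision_r + lane_r + velocity_r
    # Accumulate instead of assign, so the training reward matches
    # what episode_return (which already used +=) was tracking.
    reward += goal_r
    return reward
```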
Previously required respawn_timestep != -1, which is only set after the first respawn. The agent never received a reward for reaching its initial goal. Now uses reward_goal for the first goal and reward_goal_post_respawn for subsequent goals.
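A sketch of the corrected gating, per the description above (Python stand-in; the helper name is hypothetical, only `respawn_timestep`, `reward_goal`, and `reward_goal_post_respawn` come from the source):

```python
def goal_reward(respawn_timestep, reward_goal, reward_goal_post_respawn):
    # Before the fix, any goal reward required respawn_timestep != -1,
    # which is only set after the first respawn, so the initial goal
    # paid nothing.
    if respawn_timestep == -1:
        return reward_goal              # first goal of the episode
    return reward_goal_post_respawn     # goals reached after a respawn
```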
Pull request overview
This PR adjusts Drive environment dynamics and rewards so agent speed is physically clamped to the configured speed limit, and removes the now-redundant overspeed reward term.
Changes:
- Update `clipSpeed` to accept a `maxSpeed` parameter and clamp classic dynamics at `SPEED_LIMIT` (20 m/s) instead of `MAX_SPEED` (100 m/s).
- Replace a hardcoded jerk dynamics speed cap (`20.0f`) with the `SPEED_LIMIT` constant.
- Remove the overspeed reward/penalty term in `c_step`, since speed is clamped in dynamics.
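The parameterized clamp described above can be sketched as follows (a Python stand-in for the C helper; the `[0, max_speed]` range is an assumption, the constants come from the PR description):

```python
SPEED_LIMIT = 20.0   # m/s, per the PR description
MAX_SPEED = 100.0    # m/s, the old hardcoded cap

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def clip_speed(speed, max_speed):
    # clipSpeed now takes the cap as a parameter, so classic dynamics
    # can pass SPEED_LIMIT instead of implicitly using MAX_SPEED.
    return clip(speed, 0.0, max_speed)
```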
```diff
  if (signed_v * v_new < 0) {
      v_new = 0.0f;
  } else {
-     v_new = clip(v_new, -2.0f, 20.0f);
+     v_new = clip(v_new, -2.0f, SPEED_LIMIT);
  }
```
In JERK dynamics, v_new is clipped to SPEED_LIMIT, but the subsequent displacement uses d = 0.5f * (v_new + signed_v) * env->dt. Since signed_v comes from the current state (which can be > SPEED_LIMIT right after set_start_position()), the agent can still move farther than the speed limit would allow for a step even though the stored velocity is clamped. Consider also clamping signed_v (or using a clamped signed_v in the d computation) when v_new is clipped so the per-step displacement is physically consistent with the speed limit.
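The suggested fix could look like this sketch (Python stand-in for the C step; variable names follow the comment above, while `dt` handling and the `-2.0` lower bound are taken from the diff shown earlier):

```python
SPEED_LIMIT = 20.0

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def jerk_step_displacement(signed_v, v_new, dt):
    v_new = clip(v_new, -2.0, SPEED_LIMIT)
    # Also clamp the current velocity used in the trapezoidal
    # displacement, so d can never exceed what SPEED_LIMIT allows
    # for one step even if the stored state started overspeed.
    signed_v_clamped = clip(signed_v, -2.0, SPEED_LIMIT)
    d = 0.5 * (v_new + signed_v_clamped) * dt
    return v_new, d
```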
…atch

Previously computed after training using partially-overwritten values and last-minibatch advantages. Now captured on the first minibatch iteration when values are still pristine and advantages use ratio=1, matching SB3's approach. No extra advantage computation needed.
After GOAL_RESPAWN, current_goal_reached stayed 1, so the !current_goal_reached check at the goal reward gate always failed. Combined with bug #5, GOAL_RESPAWN agents got zero goal reward for the entire episode. Also reset score_for_current_goal.
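A minimal sketch of the reset described above (plain Python with a dict stand-in for the agent state; field names follow the comment):

```python
def on_goal_respawn(agent):
    # Reset per-goal state so the !current_goal_reached gate can
    # pass again for the next goal instead of failing all episode.
    agent["current_goal_reached"] = 0
    agent["score_for_current_goal"] = 0.0
```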
…ved agents

Stopped/removed agents can't act, so their metrics pollute logged statistics (e.g. lane center rate, comfort violations, avg speed). Now skips compute_agent_metrics and all reward/metric accumulation for these agents, zeroing their reward.
…values

Previously computed after training using partially-overwritten values and last-minibatch advantages. Now computed once before the training loop using original value predictions and full-buffer advantages with ratio=1, matching SB3's approach.
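The corrected flow can be sketched in plain Python (standard GAE with an implicit importance ratio of 1, as in SB3; array handling is simplified and the hyperparameter values are illustrative):

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # Standard GAE backward pass; no V-trace importance weighting,
    # i.e. ratio fixed at 1.
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    next_value = last_value
    for t in reversed(range(n)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

# Fixed structure: compute once on the pristine value predictions,
# then reuse the same advantages for every minibatch:
#
#   advantages = compute_gae(rewards, values, dones, last_value)
#   for minibatch in minibatches:
#       ...  # no GAE recomputation inside the loop
```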
Force-pushed from 7e22de8 to f7904da
GAE was recomputed every minibatch for V-trace importance weighting, but the result was immediately overwritten (dead code). Now computed once before the loop like SB3, saving total_minibatches-1 redundant GAE passes. Also removes unused self.ratio and self.importance tensors.
wandb runs to evaluate: https://wandb.ai/emerge_/mar-29-bugfix-comparison?nw=nwusereugenevinitsky
`self.values[idx] = newvalue` overwrote the stored value predictions during training; it had been added for V-trace (now removed). This broke value function clipping (the clipping window moved with each update instead of staying anchored to the original predictions) and corrupted explained variance. Values are now frozen during training, as in SB3.
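Why freezing matters can be seen in a PPO-style clipped value loss sketch (plain Python, names illustrative): the clip window must anchor to the rollout-time prediction, which the per-epoch overwrite destroyed.

```python
def clipped_value_loss(new_value, old_value, ret, clip_coef=0.2):
    # old_value must be the frozen rollout-time prediction. If it is
    # overwritten each epoch (self.values[idx] = newvalue), the clip
    # window [old - coef, old + coef] drifts along with training
    # instead of bounding the update.
    clipped = old_value + max(-clip_coef, min(clip_coef, new_value - old_value))
    # PPO takes the max of the unclipped and clipped squared errors.
    return max((new_value - ret) ** 2, (clipped - ret) ** 2)
```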
Summary
- `clipSpeed` now takes a configurable max speed parameter
- Classic dynamics clips at `SPEED_LIMIT` (20 m/s) instead of `MAX_SPEED` (100 m/s)
- Jerk dynamics uses the `SPEED_LIMIT` constant instead of hardcoded `20.0f`