
clip speed at limit, remove overspeed reward, move advantages out of loop #378

Open
eugenevinitsky wants to merge 10 commits into 3.0 from ev/bug_fixes

Conversation


@eugenevinitsky eugenevinitsky commented Mar 29, 2026

Summary

  • clipSpeed now takes a configurable max speed parameter
  • Classic dynamics clips at SPEED_LIMIT (20 m/s) instead of MAX_SPEED (100 m/s)
  • Jerk dynamics uses SPEED_LIMIT constant instead of hardcoded 20.0f
  • Remove overspeed reward penalty: speed is now physically clamped, so agents can't overspeed. The penalty was ~1.0/step without dt scaling, dwarfing all other per-step rewards (~0.001-0.01).
  • Remove stopped agents from the metrics
  • Move advantage computation out of the loop
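The environment itself is C, but the first change can be sketched in Python. The SPEED_LIMIT and MAX_SPEED values come from the PR description; the [0, max_speed] clamp range is an assumption here (the real clipSpeed may permit reverse):

```python
SPEED_LIMIT = 20.0   # m/s, value stated in the PR description
MAX_SPEED = 100.0    # m/s, the previous hardcoded cap

def clip_speed(speed: float, max_speed: float = SPEED_LIMIT) -> float:
    """Clamp speed to [0, max_speed]; the lower bound is an assumption."""
    return max(0.0, min(speed, max_speed))
```

With a configurable parameter, classic dynamics can pass SPEED_LIMIT while any caller that genuinely needs the old behavior can still pass MAX_SPEED explicitly.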

Copilot AI review requested due to automatic review settings March 29, 2026 15:18
The GOAL_STOP branch used assignment (=) instead of accumulation (+=)
for rewards[i], discarding all other reward components (collision,
lane, velocity) computed earlier in the same step. episode_return
already used +=, so it diverged from the actual training reward.
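The `=` vs `+=` difference can be illustrated with a minimal Python sketch (the component names are hypothetical; the actual code is C):

```python
def step_reward_buggy(collision_r, lane_r, velocity_r, goal_r):
    reward = collision_r + lane_r + velocity_r
    reward = goal_r   # bug: assignment discards the components above
    return reward

def step_reward_fixed(collision_r, lane_r, velocity_r, goal_r):
    reward = collision_r + lane_r + velocity_r
    reward += goal_r  # accumulate, matching episode_return's +=
    return reward
```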
The goal reward previously required respawn_timestep != -1, which is
only set after the first respawn, so the agent never received a reward
for reaching its initial goal. Now uses reward_goal for the first goal
and reward_goal_post_respawn for subsequent goals.
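A sketch of the fixed gating, assuming respawn_timestep starts at -1 as the commit message states (parameter names mirror the message; this is not the repo's code):

```python
def goal_reward(respawn_timestep: int,
                reward_goal: float,
                reward_goal_post_respawn: float) -> float:
    # respawn_timestep stays -1 until the first respawn, so this branch
    # now rewards the initial goal (previously it paid nothing).
    if respawn_timestep == -1:
        return reward_goal
    return reward_goal_post_respawn
```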

Copilot AI left a comment


Pull request overview

This PR adjusts Drive environment dynamics and rewards so agent speed is physically clamped to the configured speed limit, and removes the now-redundant overspeed reward term.

Changes:

  • Update clipSpeed to accept a maxSpeed parameter and clamp classic dynamics at SPEED_LIMIT (20 m/s) instead of MAX_SPEED (100 m/s).
  • Replace a hardcoded jerk dynamics speed cap (20.0f) with the SPEED_LIMIT constant.
  • Remove the overspeed reward/penalty term in c_step since speed is clamped in dynamics.


Comment on lines 3010 to 3014
  if (signed_v * v_new < 0) {
      v_new = 0.0f;
  } else {
-     v_new = clip(v_new, -2.0f, 20.0f);
+     v_new = clip(v_new, -2.0f, SPEED_LIMIT);
  }

Copilot AI Mar 29, 2026


In JERK dynamics, v_new is clipped to SPEED_LIMIT, but the subsequent displacement uses d = 0.5f * (v_new + signed_v) * env->dt. Since signed_v comes from the current state (which can be > SPEED_LIMIT right after set_start_position()), the agent can still move farther than the speed limit would allow for a step even though the stored velocity is clamped. Consider also clamping signed_v (or using a clamped signed_v in the d computation) when v_new is clipped so the per-step displacement is physically consistent with the speed limit.
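The suggested fix can be sketched in Python (the real code is C; the clip bounds come from the snippet above, and the dt value is an assumption):

```python
def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(x, hi))

SPEED_LIMIT = 20.0  # m/s
DT = 0.1            # s, assumed timestep for illustration

def jerk_displacement(signed_v: float, v_new: float, dt: float = DT) -> float:
    """Per-step displacement with BOTH endpoint speeds clamped, so the
    trapezoidal d = 0.5 * (v_new + signed_v) * dt can never exceed
    SPEED_LIMIT * dt even right after set_start_position()."""
    v_new = clip(v_new, -2.0, SPEED_LIMIT)
    signed_v = clip(signed_v, -2.0, SPEED_LIMIT)  # the suggested extra clamp
    return 0.5 * (v_new + signed_v) * dt
```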

…atch

Previously computed after training using partially-overwritten values
and last-minibatch advantages. Now captured on the first minibatch
iteration when values are still pristine and advantages use ratio=1,
matching SB3's approach. No extra advantage computation needed.
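Explained variance is named later in this thread as the affected statistic; assuming that is what is captured here, the SB3-style formula over pristine rollout values is:

```python
import numpy as np

def explained_variance(values: np.ndarray, returns: np.ndarray) -> float:
    """SB3-style explained variance: 1 - Var(returns - values) / Var(returns).
    Meaningful only while `values` still holds the rollout-time predictions,
    i.e. on the first minibatch pass, before any updates overwrite them."""
    var_returns = np.var(returns)
    if var_returns == 0:
        return float("nan")
    return float(1.0 - np.var(returns - values) / var_returns)
```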
After GOAL_RESPAWN, current_goal_reached stayed 1, so the
!current_goal_reached check at the goal reward gate always failed.
Combined with bug #5, GOAL_RESPAWN agents got zero goal reward
for the entire episode. The fix also resets score_for_current_goal.
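A toy sketch of the respawn reset (a dict stands in for the C agent struct; field names are taken from the commit message):

```python
def on_respawn(agent: dict) -> None:
    """Reset per-goal state on respawn so the !current_goal_reached gate
    can pass again for the next goal."""
    agent["current_goal_reached"] = 0    # previously stayed 1 forever
    agent["score_for_current_goal"] = 0.0
```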
…ved agents

Stopped/removed agents can't act, so their metrics pollute logged
statistics (e.g. lane center rate, comfort violations, avg speed).
Now skips compute_agent_metrics and all reward/metric accumulation
for these agents, zeroing their reward.
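The skip logic might look like this in Python (the actual code is C, and the field names are hypothetical):

```python
def accumulate_metrics(agents, rewards, compute_agent_metrics):
    """Skip stopped/removed agents entirely: they receive zero reward and
    contribute nothing to logged statistics."""
    stats = []
    for i, agent in enumerate(agents):
        if agent.get("stopped") or agent.get("removed"):
            rewards[i] = 0.0   # zero the reward, skip metric accumulation
            continue
        stats.append(compute_agent_metrics(agent))
    return stats
```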
…values

Previously computed after training using partially-overwritten values
and last-minibatch advantages. Now computed once before the training
loop using original value predictions and full-buffer advantages with
ratio=1, matching SB3's approach.
GAE was recomputed every minibatch for V-trace importance weighting,
but the result was immediately overwritten (dead code). Now computed
once before the loop like SB3, saving total_minibatches-1 redundant
GAE passes. Also removes unused self.ratio and self.importance tensors.
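For reference, a single GAE(lambda) pass over the full buffer; this is the textbook recurrence run once before the minibatch loop, not the repo's exact code:

```python
import numpy as np

def compute_gae(rewards, values, dones, last_value,
                gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Standard GAE: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t),
    A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}, swept backwards."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    return advantages
```

Running this once (rather than per minibatch) is exactly the total_minibatches-1 savings the commit describes, since the per-minibatch recomputation was overwritten anyway.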

@eugenevinitsky changed the title from "Bug fixes: clip speed at limit, remove overspeed reward" to "Bug fixes: clip speed at limit, remove overspeed reward, move advantages out of loop" on Mar 29, 2026
self.values[idx] = newvalue was overwriting stored values during
training, added for V-trace (now removed). This broke value function
clipping (the clipping window moved with updates instead of anchoring
to original predictions) and corrupted explained variance. Values are
now frozen during training like SB3.
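Why frozen values matter can be shown with the standard PPO clipped value loss (an illustrative sketch, not the repo's code): the clipping window is anchored to old_values, so overwriting them each minibatch makes the window chase the updates instead of constraining them.

```python
import numpy as np

def clipped_value_loss(new_values: np.ndarray, old_values: np.ndarray,
                       returns: np.ndarray, clip_coef: float = 0.2) -> float:
    """PPO value clipping anchored to the ORIGINAL rollout predictions.
    old_values must stay frozen for the whole epoch for this to work."""
    v_clipped = old_values + np.clip(new_values - old_values,
                                     -clip_coef, clip_coef)
    loss_unclipped = (new_values - returns) ** 2
    loss_clipped = (v_clipped - returns) ** 2
    return float(0.5 * np.mean(np.maximum(loss_unclipped, loss_clipped)))
```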
@eugenevinitsky changed the title from "Bug fixes: clip speed at limit, remove overspeed reward, move advantages out of loop" to "clip speed at limit, remove overspeed reward, move advantages out of loop" on Mar 29, 2026