Stagger map resampling across workers#373

Open
eugenevinitsky wants to merge 1 commit into 3.0 from ev/stagger-resample

Conversation

@eugenevinitsky

Summary

  • All workers started with tick=0 and hit resample_frequency at the same step, causing every worker to reset maps simultaneously
  • This produced a large correlated shift in the environment distribution within a single PPO update, destabilizing training and causing spiky losses
  • Fix: on the first reset(), randomize each worker's initial tick to a value in [0, resample_frequency) using a seeded RNG
  • The first episode per worker is shorter than usual, but workers stay desynchronized for the rest of training
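The fix above can be sketched as follows. This is a minimal illustration, not the PR's actual diff: the class name and constructor signature are invented for the example, while tick, resample_frequency, and the seeded np.random.default_rng come from the PR and review.

```python
import numpy as np

class DriveEnv:
    """Minimal sketch of the staggering fix (class/constructor are illustrative)."""

    def __init__(self, resample_frequency: int, seed=None):
        self.resample_frequency = resample_frequency
        # Per-worker seeded RNG so the stagger offset is deterministic.
        self._rng = np.random.default_rng(seed)
        self.tick = 0

    def reset(self):
        # Stagger the initial tick so workers don't all hit
        # resample_frequency (and resample maps) on the same global step.
        if self.resample_frequency > 0:
            self.tick = int(self._rng.integers(self.resample_frequency))
        else:
            self.tick = 0
        return self.tick
```

With different seeds per worker, each worker's first episode is shortened by its offset, after which resample boundaries stay spread apart.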

Test plan

  • Verify losses are smoother (no periodic spikes at resample boundaries)
  • Verify different workers resample at different global steps

All workers started with tick=0 and hit resample_frequency at the same
step, causing every worker to reset maps simultaneously. This produced
a large correlated shift in the environment distribution within a single
PPO update, destabilizing training and causing spiky losses.

Fix: on first reset, randomize each worker's initial tick to a value
in [0, resample_frequency). The first episode per worker will be shorter
than usual, but from then on workers resample at different times,
spreading the environment distribution shift across updates.
Copilot AI review requested due to automatic review settings March 28, 2026 19:50

Copilot AI left a comment

Pull request overview

This PR aims to prevent all Drive workers from resampling maps on the same global step by staggering each worker’s initial tick, reducing correlated environment shifts that can destabilize PPO training.

Changes:

  • Initialize a per-environment RNG (np.random.default_rng(seed)) to drive deterministic staggering.
  • On reset(), set tick to a random offset in [0, resample_frequency) (or 0 if resampling is disabled).


Comment on lines +430 to +433
if self.resample_frequency > 0:
self.tick = int(self._rng.integers(self.resample_frequency))
else:
self.tick = 0

Copilot AI Mar 28, 2026

This change introduces new behavior (staggering map resampling via randomized initial tick) that is central to training stability, but there’s no automated test asserting workers/env instances start with different ticks when resample_frequency > 0. Adding a small pytest regression test (e.g., create multiple Drive instances with different seeds and assert their initial tick values aren’t all equal, and/or that only the first reset staggers if that’s the intended behavior) would help prevent accidental re-synchronization regressions.
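The suggested regression test could look roughly like this. The stub class below stands in for the real environment (the actual test would import it); only the staggering logic is reproduced.

```python
import numpy as np

class _StubEnv:
    """Stand-in for the Drive env with only the stagger logic."""
    def __init__(self, resample_frequency, seed):
        self.resample_frequency = resample_frequency
        self._rng = np.random.default_rng(seed)
        self.tick = 0

    def reset(self):
        if self.resample_frequency > 0:
            self.tick = int(self._rng.integers(self.resample_frequency))

def test_initial_ticks_are_staggered():
    envs = [_StubEnv(resample_frequency=10_000, seed=s) for s in range(8)]
    for env in envs:
        env.reset()
    ticks = [env.tick for env in envs]
    # Every offset is in range, and with 8 seeds over 10_000 values
    # the draws are not all identical.
    assert all(0 <= t < 10_000 for t in ticks)
    assert len(set(ticks)) > 1
```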

Comment on lines +427 to +431
# Stagger initial tick so workers don't all resample maps at the same step.
# The first episode will be shorter than resample_frequency, but this
# desynchronizes resets across workers for the rest of training.
if self.resample_frequency > 0:
self.tick = int(self._rng.integers(self.resample_frequency))

Copilot AI Mar 28, 2026

reset() now randomizes self.tick on every reset, but the PR description says to stagger only on the first reset per worker. If reset() is called at every episode boundary, this will repeatedly shorten the post-reset interval until the next resample (and may change training dynamics beyond the intended one-time desync). Consider gating this with a flag (e.g., only stagger once, then reset tick to 0 on subsequent resets).
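The gating the reviewer describes could be sketched like this. The _staggered flag is a hypothetical name, and setting tick to 0 on later resets follows the reviewer's suggestion; whether later resets should instead preserve the running tick depends on the env's episode semantics.

```python
import numpy as np

class DriveEnv:
    """Sketch: stagger only on the first reset (names are illustrative)."""
    def __init__(self, resample_frequency, seed=None):
        self.resample_frequency = resample_frequency
        self._rng = np.random.default_rng(seed)
        self._staggered = False  # hypothetical one-shot gate
        self.tick = 0

    def reset(self):
        if self.resample_frequency > 0 and not self._staggered:
            # One-time random offset; later resets skip this branch.
            self.tick = int(self._rng.integers(self.resample_frequency))
            self._staggered = True
        else:
            self.tick = 0
        return self.tick
```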

Comment on lines 127 to 129
self.resample_frequency = resample_frequency
self._rng = np.random.default_rng(seed)
self.dynamics_model = dynamics_model

Copilot AI Mar 28, 2026

self._rng is seeded from the constructor argument seed, but reset(seed=...) also accepts a seed and is used by PufferLib to control per-run determinism. Because self.tick staggering draws from self._rng, changing the reset() seed will not affect the stagger offset, which can be surprising for reproducibility. Consider deriving the stagger RNG from the reset() seed (or re-seeding _rng on reset) so reset(seed=...) fully controls stochastic reset behavior.
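One way to implement the reviewer's suggestion is to re-seed the RNG inside reset() when a seed is passed, mirroring the Gymnasium-style reset(seed=...) convention. A sketch, with illustrative names:

```python
import numpy as np

class DriveEnv:
    """Sketch: let reset(seed=...) control the stagger offset."""
    def __init__(self, resample_frequency, seed=None):
        self.resample_frequency = resample_frequency
        self._rng = np.random.default_rng(seed)
        self.tick = 0

    def reset(self, seed=None):
        if seed is not None:
            # Re-seed so the reset seed, not the constructor seed,
            # determines all stochastic reset behavior.
            self._rng = np.random.default_rng(seed)
        if self.resample_frequency > 0:
            self.tick = int(self._rng.integers(self.resample_frequency))
        else:
            self.tick = 0
        return self.tick
```

With this, two instances built with different constructor seeds still produce the same stagger offset when reset with the same seed.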
