All workers started with tick=0 and hit resample_frequency at the same step, causing every worker to reset maps simultaneously. This produced a large correlated shift in the environment distribution within a single PPO update, destabilizing training and causing spiky losses. Fix: on first reset, randomize each worker's initial tick to a value in [0, resample_frequency). The first episode per worker will be shorter than usual, but from then on workers resample at different times, spreading the environment distribution shift across updates.
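The effect of staggering can be sketched numerically. The worker count and resample frequency below are illustrative values, not taken from the PR:

```python
import numpy as np

RESAMPLE_FREQUENCY = 100  # illustrative, not from the PR
NUM_WORKERS = 8

# Before the fix: every worker starts at tick=0, so all of them hit the
# resample threshold on the same global step.
synced_first_resample = [RESAMPLE_FREQUENCY] * NUM_WORKERS
print(len(set(synced_first_resample)))  # 1 distinct resample step

# After the fix: each worker draws a random initial tick, so the step at
# which it first resamples (its distance to the threshold) is spread out.
rng = np.random.default_rng(0)
initial_ticks = rng.integers(RESAMPLE_FREQUENCY, size=NUM_WORKERS)
staggered_first_resample = RESAMPLE_FREQUENCY - initial_ticks
print(len(set(staggered_first_resample.tolist())))  # several distinct steps
```

With staggered offsets the map resets fall on different global steps, so each PPO update sees at most a small fraction of workers on freshly resampled maps.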
Pull request overview
This PR aims to prevent all Drive workers from resampling maps on the same global step by staggering each worker’s initial tick, reducing correlated environment shifts that can destabilize PPO training.
Changes:
- Initialize a per-environment RNG (`np.random.default_rng(seed)`) to drive deterministic staggering.
- On `reset()`, set `tick` to a random offset in `[0, resample_frequency)` (or `0` if resampling is disabled).
```python
if self.resample_frequency > 0:
    self.tick = int(self._rng.integers(self.resample_frequency))
else:
    self.tick = 0
```
This change introduces new behavior (staggering map resampling via randomized initial tick) that is central to training stability, but there’s no automated test asserting workers/env instances start with different ticks when resample_frequency > 0. Adding a small pytest regression test (e.g., create multiple Drive instances with different seeds and assert their initial tick values aren’t all equal, and/or that only the first reset staggers if that’s the intended behavior) would help prevent accidental re-synchronization regressions.
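A regression test along the suggested lines could look like the following sketch. The real `Drive` constructor signature is not shown in this thread, so `FakeDrive` is a hypothetical stand-in that reproduces only the tick-staggering logic from the diff:

```python
import numpy as np


class FakeDrive:
    """Hypothetical stand-in for the real Drive env, reproducing only
    the tick-staggering logic from the diff."""

    def __init__(self, seed, resample_frequency=100):
        self.resample_frequency = resample_frequency
        self._rng = np.random.default_rng(seed)
        self.tick = 0

    def reset(self):
        if self.resample_frequency > 0:
            self.tick = int(self._rng.integers(self.resample_frequency))
        else:
            self.tick = 0


def test_initial_ticks_are_staggered():
    envs = [FakeDrive(seed=s) for s in range(8)]
    for env in envs:
        env.reset()
    ticks = [env.tick for env in envs]
    # Differently-seeded workers should not all share one initial tick.
    assert len(set(ticks)) > 1


def test_resampling_disabled_keeps_tick_zero():
    env = FakeDrive(seed=0, resample_frequency=0)
    env.reset()
    assert env.tick == 0
```

Swapping `FakeDrive` for the actual `Drive` class (with whatever constructor arguments it requires) would turn this into the regression test proposed above.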
```python
# Stagger initial tick so workers don't all resample maps at the same step.
# The first episode will be shorter than resample_frequency, but this
# desynchronizes resets across workers for the rest of training.
if self.resample_frequency > 0:
    self.tick = int(self._rng.integers(self.resample_frequency))
```
reset() now randomizes self.tick on every reset, but the PR description says to stagger only on the first reset per worker. If reset() is called at every episode boundary, this will repeatedly shorten the post-reset interval until the next resample (and may change training dynamics beyond the intended one-time desync). Consider gating this with a flag (e.g., only stagger once, then reset tick to 0 on subsequent resets).
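One way to gate the stagger to the first reset only is a boolean flag, as in this sketch (class and attribute names are hypothetical, not from the PR):

```python
import numpy as np


class StaggeredResetSketch:
    """Sketch of a one-time stagger: random offset on the first reset,
    tick=0 on every subsequent reset (names are hypothetical)."""

    def __init__(self, seed, resample_frequency):
        self.resample_frequency = resample_frequency
        self._rng = np.random.default_rng(seed)
        self._staggered = False
        self.tick = 0

    def reset(self):
        if self.resample_frequency > 0 and not self._staggered:
            # One-time desync: random offset on the very first reset.
            self.tick = int(self._rng.integers(self.resample_frequency))
            self._staggered = True
        else:
            # Later resets keep the full resample interval.
            self.tick = 0
```

After the first reset the workers stay desynchronized (their resample steps remain offset), so repeated randomization is not needed to preserve the stagger.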
```python
self.resample_frequency = resample_frequency
self._rng = np.random.default_rng(seed)
self.dynamics_model = dynamics_model
```
self._rng is seeded from the constructor argument seed, but reset(seed=...) also accepts a seed and is used by PufferLib to control per-run determinism. Because self.tick staggering draws from self._rng, changing the reset() seed will not affect the stagger offset, which can be surprising for reproducibility. Consider deriving the stagger RNG from the reset() seed (or re-seeding _rng on reset) so reset(seed=...) fully controls stochastic reset behavior.
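Deriving the stagger from the `reset()` seed could look like this sketch (the helper name and signature are hypothetical; the real env would fold this into its `reset()`):

```python
import numpy as np


def stagger_offset(reset_seed, resample_frequency):
    """Derive the stagger offset from reset()'s seed so that
    reset(seed=...) fully controls stochastic reset behavior
    (hypothetical helper, not from the PR)."""
    if resample_frequency <= 0:
        return 0
    rng = np.random.default_rng(reset_seed)
    return int(rng.integers(resample_frequency))


# The same reset seed always yields the same offset, restoring
# per-run determinism under PufferLib's seeding protocol.
print(stagger_offset(123, 100) == stagger_offset(123, 100))  # True
```

Distinct per-worker seeds would still stagger the workers, while a fixed seed set makes the whole stagger pattern reproducible across runs.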
Summary
- All workers started with `tick=0` and hit `resample_frequency` at the same step, causing every worker to reset maps simultaneously.
- Fix: on `reset()`, randomize each worker's initial tick to `[0, resample_frequency)` using a seeded RNG.

Test plan