Skip to content

Fix clean runs not always saving on last step#129

Open
camilobrownpinilla wants to merge 3 commits into
mainfrom
fix/save-on-train-end
Open

Fix clean runs not always saving on last step#129
camilobrownpinilla wants to merge 3 commits into
mainfrom
fix/save-on-train-end

Conversation

@camilobrownpinilla

@camilobrownpinilla camilobrownpinilla commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Problem. A clean training finish wrote a final checkpoint only when max_steps happened to land on the save schedule. The per-step gate config.checkpoint.should_save(step) runs inside the loop on the post-increment step; after the loop only ckpt_mgr.wait() runs, which flushes an in-flight async save but never starts a new one. So when max_steps is neither a multiple of checkpoint.interval nor a dyn_ckpt_window milestone, the fully-trained model — including the entire WSD learning-rate decay tail — was never persisted, and latest resolved to the highest scheduled step ≤ max_steps. This was inconsistent with the preemption path, which already saves an emergency checkpoint at the current step on shutdown.
  • Fix (scripts/train.py). Add one unconditional checkpoint at max_steps in the training-loop teardown, guarded so it fires only on a clean, off-schedule finish — giving clean completion the same guarantee as the preemption path. Reuses the existing CheckpointManager.save and the on_checkpoint_save hook; the savewait order is preserved (correct for both sync and async modes).
  • Guard mechanism. A completed_normally flag is initialized False before the loop and set True only as the last statement of the loop body when step >= tc.max_steps. That line is reachable only on the iteration that reaches max_steps with no break, so it excludes all early exits — NaN rollback, NCCL-health-check failure, and graceful shutdown — and is never reached on a zero-iteration resume, which also guarantees ckpt_extra is bound when the final save runs. The save is additionally gated on not config.checkpoint.should_save(step) to avoid a duplicate when max_steps already coincided with the schedule.
  • Backward compatibility. Purely additive: no existing checkpoint is changed, moved, or pruned, and the schedule/retention knobs (interval, keep_last_n, dyn_ckpt_window) are untouched. The only behavioral change is that an off-schedule final step now also produces a checkpoint.
  • Docs (docs/training/training-loop.md). The ## Shutdown section now documents the final-checkpoint guarantee and shows it in the abridged teardown snippet.
  • Changelog (CHANGELOG.md). Entry added under ### Fixed in ## [Unreleased].
  • Test (tests/e2e/test_training_e2e.py). test_final_checkpoint_on_clean_completion runs a single-process training run to an off-schedule completion (--train.max_steps=10 --checkpoint.interval=1000, and debug.toml configures no dyn_ckpt_window, so no scheduled save lands) and asserts on the filesystem that step_10/ exists and latest resolves to it.

Testing

  • uv run ruff check kempnerforge/ tests/ passes
  • uv run ruff format --check kempnerforge/ tests/ scripts/ passes
  • uv run pyright kempnerforge/ passes (0 errors)
  • uv run pytest tests/unit/ -v --timeout=60 passes
    NOTE I can currently only access 1 GPU. 20 distributed tests were skipped; the rest all pass.
  • If training loop / parallelism / optimizers changed: uv run pytest tests/e2e/ --e2e -v — applies; new test: tests/e2e/test_training_e2e.py::test_final_checkpoint_on_clean_completion

Closes #125

@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR ensures a training run that finishes cleanly always persists a final checkpoint at max_steps even when max_steps does not land on the configured checkpoint schedule, aligning clean completion behavior with the existing preemption/emergency-checkpoint guarantee.

Changes:

  • Add an unconditional final checkpoint save in the training teardown path when completion is clean and max_steps is off-schedule.
  • Add an end-to-end test that runs an off-schedule clean completion and asserts step_{max_steps} exists and latest points to it.
  • Update training-loop documentation and the changelog to describe the new final-checkpoint guarantee.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
scripts/train.py Adds a “clean completion” marker and a guarded unconditional final checkpoint save in teardown.
tests/e2e/test_training_e2e.py Adds an e2e regression test asserting an off-schedule clean completion still writes step_{max_steps} and updates latest.
docs/training/training-loop.md Documents the final-checkpoint-on-clean-completion behavior in the shutdown section.
CHANGELOG.md Adds an Unreleased “Fixed” entry describing the new final checkpoint guarantee.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/train.py
Comment on lines +1021 to +1027
# Clean-completion marker for the unconditional final save after the
# loop. Reached only on the iteration that hits max_steps with no
# break, so it is never set after a NaN/NCCL/shutdown break and never
# on a zero-iteration resume — guaranteeing ckpt_extra is bound when
# the final save runs.
if step >= tc.max_steps:
completed_normally = True

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional. On NaN, model weights are not updated so model is still at step-1 state. Saving this as the last step would not technically be correct.

This fix only saves clean runs, i.e., not runs with NaN at last step or NCCL errors.

Will wait for further review @amazloumi

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes maybe the comment can tell this is intentional more clearly.

Comment thread docs/training/training-loop.md
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@camilobrownpinilla camilobrownpinilla marked this pull request as ready for review June 26, 2026 21:19

@amazloumi amazloumi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving,
Deferring to a follow-up (not blocking this PR): the fix's decision lives inside the ~1000-line main() in scripts/train.py, so today it's only reachable via the --e2e tier (which doesn't run in default CI). Extracting the training loop into its own function/class (run_training_loop(...) or a small Trainer) would create a seam to inject a fake CheckpointManager and unit-test the save decision directly — fast, and in default CI. Since that's a refactor of the central loop, it deserves its own PR + review rather than expanding this one. I'll open an issue.

potential tests in the above followup:

  • Clean finish, off-schedule step → save once at max_steps (on_checkpoint_save hook)
  • Clean finish, on-schedule step → no second save (dedup)
  • NaN on the last step → no final save (intentional — skipped optimizer step, weights are step-(N−1))
  • NCCL health failure on the last step → no final save
  • Graceful shutdown on the last step → no final save (emergency save already covers it)
  • Resume from a checkpoint already at max_steps → no save and no crash (the ckpt_extra unbound case)

Minor:

  • Copilot's NaN comment: right about the mechanism, but the behavior is intentional resolving as intended. Worth expanding the code comment to state why (skipped optimizer step → step-(N−1) weights) so it isn't "fixed" later.
  • should_shutdown() isn't all-reduced across ranks, so a preemption on the final step could desync the emergency save. Pre-existing — flagging for a separate issue.

Comment thread scripts/train.py
Comment on lines +1021 to +1027
# Clean-completion marker for the unconditional final save after the
# loop. Reached only on the iteration that hits max_steps with no
# break, so it is never set after a NaN/NCCL/shutdown break and never
# on a zero-iteration resume — guaranteeing ckpt_extra is bound when
# the final save runs.
if step >= tc.max_steps:
completed_normally = True

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes maybe the comment can tell this is intentional more clearly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Always save a final checkpoint when training completes

3 participants