Fix clean runs not always saving on last step by camilobrownpinilla · Pull Request #129 · KempnerInstitute/KempnerForge

camilobrownpinilla · 2026-06-26T20:59:12Z

Summary

Problem. A clean training finish wrote a final checkpoint only when max_steps happened to land on the save schedule. The per-step gate config.checkpoint.should_save(step) runs inside the loop on the post-increment step; after the loop only ckpt_mgr.wait() runs, which flushes an in-flight async save but never starts a new one. So when max_steps is neither a multiple of checkpoint.interval nor a dyn_ckpt_window milestone, the fully-trained model — including the entire WSD learning-rate decay tail — was never persisted, and latest resolved to the highest scheduled step ≤ max_steps. This was inconsistent with the preemption path, which already saves an emergency checkpoint at the current step on shutdown.
Fix (scripts/train.py). Add one unconditional checkpoint at max_steps in the training-loop teardown, guarded so it fires only on a clean, off-schedule finish — giving clean completion the same guarantee as the preemption path. Reuses the existing CheckpointManager.save and the on_checkpoint_save hook; the save → wait order is preserved (correct for both sync and async modes).
Guard mechanism. A completed_normally flag is initialized False before the loop and set True only as the last statement of the loop body when step >= tc.max_steps. That line is reachable only on the iteration that reaches max_steps with no break, so it excludes all early exits — NaN rollback, NCCL-health-check failure, and graceful shutdown — and is never reached on a zero-iteration resume, which also guarantees ckpt_extra is bound when the final save runs. The save is additionally gated on not config.checkpoint.should_save(step) to avoid a duplicate when max_steps already coincided with the schedule.
Backward compatibility. Purely additive: no existing checkpoint is changed, moved, or pruned, and the schedule/retention knobs (interval, keep_last_n, dyn_ckpt_window) are untouched. The only behavioral change is that an off-schedule final step now also produces a checkpoint.
Docs (docs/training/training-loop.md). The ## Shutdown section now documents the final-checkpoint guarantee and shows it in the abridged teardown snippet.
Changelog (CHANGELOG.md). Entry added under ### Fixed in ## [Unreleased].
Test (tests/e2e/test_training_e2e.py). test_final_checkpoint_on_clean_completion runs a single-process training run to an off-schedule completion (--train.max_steps=10 --checkpoint.interval=1000, and debug.toml configures no dyn_ckpt_window, so no scheduled save lands) and asserts on the filesystem that step_10/ exists and latest resolves to it.

Testing

uv run ruff check kempnerforge/ tests/ passes
uv run ruff format --check kempnerforge/ tests/ scripts/ passes
uv run pyright kempnerforge/ passes (0 errors)
uv run pytest tests/unit/ -v --timeout=60 passes
NOTE I can currently only access 1 GPU. 20 distributed tests were skipped; the rest all pass.
If training loop / parallelism / optimizers changed: uv run pytest tests/e2e/ --e2e -v — applies; new test: tests/e2e/test_training_e2e.py::test_final_checkpoint_on_clean_completion

Closes #125

codecov · 2026-06-26T21:03:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

This PR ensures a training run that finishes cleanly always persists a final checkpoint at max_steps even when max_steps does not land on the configured checkpoint schedule, aligning clean completion behavior with the existing preemption/emergency-checkpoint guarantee.

Changes:

Add an unconditional final checkpoint save in the training teardown path when completion is clean and max_steps is off-schedule.
Add an end-to-end test that runs an off-schedule clean completion and asserts step_{max_steps} exists and latest points to it.
Update training-loop documentation and the changelog to describe the new final-checkpoint guarantee.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
scripts/train.py	Adds a “clean completion” marker and a guarded unconditional final checkpoint save in teardown.
tests/e2e/test_training_e2e.py	Adds an e2e regression test asserting an off-schedule clean completion still writes `step_{max_steps}` and updates `latest`.
docs/training/training-loop.md	Documents the final-checkpoint-on-clean-completion behavior in the shutdown section.
CHANGELOG.md	Adds an Unreleased “Fixed” entry describing the new final checkpoint guarantee.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

camilobrownpinilla · 2026-06-26T21:19:27Z

+        # Clean-completion marker for the unconditional final save after the
+        # loop. Reached only on the iteration that hits max_steps with no
+        # break, so it is never set after a NaN/NCCL/shutdown break and never
+        # on a zero-iteration resume — guaranteeing ckpt_extra is bound when
+        # the final save runs.
+        if step >= tc.max_steps:
+            completed_normally = True


Intentional. On NaN, model weights are not updated so model is still at step-1 state. Saving this as the last step would not technically be correct.

This fix only saves clean runs, i.e., not runs with NaN at last step or NCCL errors.

Will wait for further review @amazloumi

yes maybe the comment can tell this is intentional more clearly.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

amazloumi

Approving,
Deferring to a follow-up (not blocking this PR): the fix's decision lives inside the ~1000-line main() in scripts/train.py, so today it's only reachable via the --e2e tier (which doesn't run in default CI). Extracting the training loop into its own function/class (run_training_loop(...) or a small Trainer) would create a seam to inject a fake CheckpointManager and unit-test the save decision directly — fast, and in default CI. Since that's a refactor of the central loop, it deserves its own PR + review rather than expanding this one. I'll open an issue.

potential tests in the above followup:

Clean finish, off-schedule step → save once at max_steps (on_checkpoint_save hook)
Clean finish, on-schedule step → no second save (dedup)
NaN on the last step → no final save (intentional — skipped optimizer step, weights are step-(N−1))
NCCL health failure on the last step → no final save
Graceful shutdown on the last step → no final save (emergency save already covers it)
Resume from a checkpoint already at max_steps → no save and no crash (the ckpt_extra unbound case)

Minor:

Copilot's NaN comment: right about the mechanism, but the behavior is intentional resolving as intended. Worth expanding the code comment to state why (skipped optimizer step → step-(N−1) weights) so it isn't "fixed" later.
should_shutdown() isn't all-reduced across ranks, so a preemption on the final step could desync the emergency save. Pre-existing — flagging for a separate issue.

amazloumi · 2026-06-26T23:56:47Z

+        # Clean-completion marker for the unconditional final save after the
+        # loop. Reached only on the iteration that hits max_steps with no
+        # break, so it is never set after a NaN/NCCL/shutdown break and never
+        # on a zero-iteration resume — guaranteeing ckpt_extra is bound when
+        # the final save runs.
+        if step >= tc.max_steps:
+            completed_normally = True


yes maybe the comment can tell this is intentional more clearly.

Fix clean runs not always saving on last step

da71b46

camilobrownpinilla requested a review from Copilot June 26, 2026 20:59

Copilot started reviewing on behalf of camilobrownpinilla June 26, 2026 20:59 View session

camilobrownpinilla requested review from Naeemkh, amazloumi and mmshad June 26, 2026 21:04

Copilot AI reviewed Jun 26, 2026

View reviewed changes

Fix inaccurate comment

2a92565

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

camilobrownpinilla marked this pull request as ready for review June 26, 2026 21:19

Merge branch 'main' into fix/save-on-train-end

51c6ebf

amazloumi approved these changes Jun 27, 2026

View reviewed changes

amazloumi mentioned this pull request Jun 27, 2026

Decompose scripts/train.py:main() into cohesive, testable units #130

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix clean runs not always saving on last step#129

Fix clean runs not always saving on last step#129
camilobrownpinilla wants to merge 3 commits into
mainfrom
fix/save-on-train-end

camilobrownpinilla commented Jun 26, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

camilobrownpinilla Jun 26, 2026

Uh oh!

amazloumi Jun 26, 2026

Uh oh!

Uh oh!

amazloumi left a comment

Uh oh!

amazloumi Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

camilobrownpinilla commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Uh oh!

codecov Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

camilobrownpinilla Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

amazloumi Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

amazloumi left a comment

Choose a reason for hiding this comment

Uh oh!

amazloumi Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

camilobrownpinilla commented Jun 26, 2026 •

edited

Loading

codecov Bot commented Jun 26, 2026 •

edited

Loading