Add abort_on_nan_loss and abort_on_inf_loss options #3440
Open
Steboss wants to merge 2 commits into AI-Hypercomputer:main from
Conversation
Commits: add configs tests; modify only when training loss is inf or nan; move values to bool; update configs tests; place the checks after write metrics for platforms; fix lines; fix lines; update description; fix types and checks; fix whitespaces; fix whitespace; fix whitespaces again
gobbleturk reviewed on Mar 18, 2026
tests/unit/configs_value_test.py (outdated)

    config = pyconfig.initialize(argv)
    self.assertEqual(config.tokenizer_type, "tiktoken")

    def test_abort_on_nan_loss_defaults_and_cli_override(self):
Collaborator
I don't think these tests are necessary - this is generic to our config setup, not specific to these values, we don't need individual tests for every value (the other tests provide great coverage in your new file!)
gobbleturk reviewed on Mar 18, 2026
    eval_interval: -1  # the specific number of train steps between eval steps
    eval_steps: -1  # run this number of steps for eval; recommended to set this to prevent errors due to running out of eval data
    target_eval_loss: 0.  # early stop once reaching target eval_loss
    abort_on_nan_loss: True  # Check for NaN in the training loss and abort if found
Collaborator
will check with team about these defaults - we may prefer default to false
Collaborator
fail fast - leave as True it is!
Contributor
Author
@gobbleturk thanks for reviewing this. I removed the tests from
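For context, a minimal sketch of the fail-fast behavior discussed in this thread. The names below (`end_of_step`, the dict-based config, the `write_metrics` callable) are illustrative placeholders, not MaxText's actual API; the point is only the ordering: metrics are flushed first, then the loss is checked.

```python
import math


def end_of_step(config, metrics, step, write_metrics):
    """Hypothetical end-of-step hook: log metrics, then abort on a bad loss."""
    # Metrics are written first, so the offending step is still recorded.
    write_metrics(metrics, step)
    loss = metrics["train_loss"]
    if config.get("abort_on_nan_loss", True) and math.isnan(loss):
        raise ValueError(f"Aborting: training loss is NaN at step {step}")
    if config.get("abort_on_inf_loss", True) and math.isinf(loss):
        raise ValueError(f"Aborting: training loss is Inf at step {step}")
```

With the defaults left as True (fail fast), a NaN or Inf loss stops the job immediately, but the step that produced it is still visible in the logged metrics.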
Description
We experienced the following problem from time to time: while a model training run was in progress, the loss was NaN. This should be flagged as an error, and the training job should be stopped.
For this reason, we are introducing abort_on_nan_loss and abort_on_inf_loss for checking for NaN/Inf in the training loss. Both options are boolean, and the check is performed after all the metrics have been written with the write_metrics function. Two unit tests have been added: one in tests/unit/configs_value_test.py to check CLI overrides of the values, and one in tests/unit/metric_logger_abort_test.py to check that the implementation works correctly.
Future improvements: check for NaN/Inf in grad or params or other variables.
Tests
The changes were tested through the created unit tests in tests/unit/configs_value_test.py and tests/unit/metric_logger_abort_test.py.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.
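In the spirit of the abort test mentioned in the description (tests/unit/metric_logger_abort_test.py), here is a self-contained sketch of how such a check could be unit-tested. `maybe_abort` is a hypothetical stand-in for the PR's real implementation; the actual test file may differ.

```python
import math
import unittest


def maybe_abort(loss, abort_on_nan_loss=True, abort_on_inf_loss=True):
    """Raise ValueError if the loss is NaN/Inf and the matching flag is set."""
    if abort_on_nan_loss and math.isnan(loss):
        raise ValueError("training loss is NaN")
    if abort_on_inf_loss and math.isinf(loss):
        raise ValueError("training loss is Inf")


class AbortOnBadLossTest(unittest.TestCase):
    def test_nan_aborts_by_default(self):
        with self.assertRaises(ValueError):
            maybe_abort(float("nan"))

    def test_inf_aborts_by_default(self):
        with self.assertRaises(ValueError):
            maybe_abort(float("inf"))

    def test_finite_loss_does_not_abort(self):
        self.assertIsNone(maybe_abort(1.25))

    def test_flags_can_disable_the_check(self):
        self.assertIsNone(maybe_abort(float("nan"), abort_on_nan_loss=False))
```

Run with `python -m unittest` in the directory containing the file. Testing both the default (abort) and the disabled (no abort) paths guards against a regression where the flags are silently ignored.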