
Add abort_on_nan_loss and abort_on_inf_loss options #3440

Open
Steboss wants to merge 2 commits into AI-Hypercomputer:main from Steboss:main

Conversation


@Steboss Steboss commented Mar 18, 2026

Description

We occasionally ran into the following problem:

0: I0122 21:20:55.494193 140737350402816 metric_logger.py:180] completed step: 30, seconds: 0.449, TFLOP/s/device: 1877.176, Tokens/s/device: 36470.971, total_weights: 131072, loss: nan

While a model training run kept going, the loss was NaN.
This should be flagged as an error, and the training job should be stopped.

For this reason, this PR introduces abort_on_nan_loss and abort_on_inf_loss to check the training loss for NaN/Inf. Both options are booleans, and the check runs after all metrics have been written with the write_metrics function. Two unit tests have been added: one in tests/unit/configs_value_test.py to check CLI overrides of the values, and one in tests/unit/metric_logger_abort_test.py to verify that the implementation works correctly.
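A minimal sketch of the check described above (names such as check_training_loss are hypothetical illustrations of the PR's logic, not the actual MaxText code in metric_logger.py):

```python
import math

def check_training_loss(loss, abort_on_nan_loss=True, abort_on_inf_loss=True):
  """Raise if the scalar training loss is NaN/Inf and the matching flag is set.

  Illustrative only: the flag names follow this PR's description, but the
  surrounding code (metric_logger.py, write_metrics) is not reproduced here.
  """
  loss = float(loss)
  if abort_on_nan_loss and math.isnan(loss):
    raise ValueError(f"Training loss is NaN ({loss}); aborting because abort_on_nan_loss=True.")
  if abort_on_inf_loss and math.isinf(loss):
    raise ValueError(f"Training loss is Inf ({loss}); aborting because abort_on_inf_loss=True.")
```

Running the check right after metrics are written means the offending step (e.g. "completed step: 30 ... loss: nan") is still logged before the job aborts.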

Future improvements:

  • These two options could become lists, so that we can also monitor NaN or Inf in gradients, parameters, or other variables.

Tests

The changes were tested through the unit tests added in tests/unit/configs_value_test.py and tests/unit/metric_logger_abort_test.py.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

add configs tests

modify only when training loss is inf or nan

move values to bool

update configs tests

place the checks after write metrics for platforms

fix lines

fix lines

update description

fix types and checks

fix whitespaces

fix whitespace

fix whitespaces again
config = pyconfig.initialize(argv)
self.assertEqual(config.tokenizer_type, "tiktoken")

def test_abort_on_nan_loss_defaults_and_cli_override(self):
@gobbleturk (Collaborator) commented Mar 18, 2026

I don't think these tests are necessary - this is generic to our config setup, not specific to these values, we don't need individual tests for every value (the other tests provide great coverage in your new file!)

@Steboss (Contributor, Author) replied:

Removed :)

eval_interval: -1 # the number of train steps between eval steps
eval_steps: -1 # run this number of steps for eval; recommend setting this to prevent errors from running out of eval data
target_eval_loss: 0. # early stop once reaching target eval_loss
abort_on_nan_loss: True # Check for NaN and abort if found in training loss
Collaborator commented:

will check with team about these defaults - we may prefer default to false

Collaborator commented:

fail fast - leave it as True!

@Steboss (Contributor, Author) commented Mar 19, 2026

@gobbleturk thanks for reviewing this. I removed the tests from configs_value_test.py. Please, let me know if you need other modifications :)
