
Add abort_on_nan_loss and abort_on_inf_loss options #3440

Open
Steboss wants to merge 2 commits into AI-Hypercomputer:main from Steboss:main

Conversation


@Steboss Steboss commented Mar 18, 2026

Description

We occasionally ran into the following problem:

0: I0122 21:20:55.494193 140737350402816 metric_logger.py:180] completed step: 30, seconds: 0.449, TFLOP/s/device: 1877.176, Tokens/s/device: 36470.971, total_weights: 131072, loss: nan

While a model training run kept going, the loss was NaN.
This should be flagged as an error, and the training job should be stopped.

For this reason, this PR introduces abort_on_nan_loss and abort_on_inf_loss to check the training loss for NaN/Inf. Both options are booleans, and the check runs after all metrics have been written with the write_metrics function. Two unit tests have been added: one in tests/unit/configs_value_test.py to check CLI overrides of the values, and one in tests/unit/metric_logger_abort_test.py to verify that the implementation works correctly.
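A minimal sketch of the check described above (names such as check_training_loss are hypothetical illustrations of the PR's logic, not the actual MaxText code in metric_logger.py):

```python
import math

def check_training_loss(loss, abort_on_nan_loss=True, abort_on_inf_loss=True):
  """Raise if the scalar training loss is NaN/Inf and the matching flag is set.

  Illustrative only: the flag names follow this PR's description, but the
  surrounding code (metric_logger.py, write_metrics) is not reproduced here.
  """
  loss = float(loss)
  if abort_on_nan_loss and math.isnan(loss):
    raise ValueError(f"Training loss is NaN ({loss}); aborting because abort_on_nan_loss=True.")
  if abort_on_inf_loss and math.isinf(loss):
    raise ValueError(f"Training loss is Inf ({loss}); aborting because abort_on_inf_loss=True.")
```

Running the check right after metrics are written means the offending step (e.g. "completed step: 30 ... loss: nan") is still logged before the job aborts.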

Future improvements:

  • These two options could become lists, so that we can also monitor NaN or Inf in gradients, parameters, or other variables.

Tests

The changes were tested through the unit tests added in tests/unit/configs_value_test.py and tests/unit/metric_logger_abort_test.py.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

add configs tests

modify only when training loss is inf or nan

move values to bool

update configs tests

place the checks after write metrics for platforms

fix lines

fix lines

update description

fix types and checks

fix whitespaces

fix whitespace

fix whitespaces again
config = pyconfig.initialize(argv)
self.assertEqual(config.tokenizer_type, "tiktoken")

def test_abort_on_nan_loss_defaults_and_cli_override(self):
@gobbleturk (Collaborator) commented Mar 18, 2026

I don't think these tests are necessary - this is generic to our config setup, not specific to these values, we don't need individual tests for every value (the other tests provide great coverage in your new file!)

@Steboss (Contributor, Author) replied:

Removed :)

eval_interval: -1 # the number of train steps between eval steps
eval_steps: -1 # run this number of steps for eval; recommend setting this to prevent errors from running out of eval data
target_eval_loss: 0. # early stop once reaching target eval_loss
abort_on_nan_loss: True # Check for NaN and abort if found in training loss
Collaborator commented:

will check with team about these defaults - we may prefer default to false

Collaborator commented:

fail fast - leave it as True!

@Steboss (Contributor, Author) commented Mar 19, 2026

@gobbleturk thanks for reviewing this. I removed the tests from configs_value_test.py. Please, let me know if you need other modifications :)
