
Fix mixed precision call in group norm sharded. #1380

Closed
coreyjadams wants to merge 4 commits into NVIDIA:main from coreyjadams:hotfix-sharded-group-norm

Conversation

@coreyjadams
Collaborator

Also fix a math error in how variances are combined across GPUs.
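The fix replaces averaging of per-GPU variances with combining raw moments, since Var(X) = E[X²] − E[X]². A minimal single-process sketch (the helper name is hypothetical, not the PhysicsNeMo API):

```python
import torch

def combine_shard_variance(shards):
    # Hypothetical helper: combine per-shard statistics into the global
    # (biased) variance via Var(X) = E[X^2] - E[X]^2. In the sharded
    # setting, (count, sum, sum_sq) would travel in one fused all-reduce.
    count = sum(s.numel() for s in shards)
    total = sum(s.sum() for s in shards)
    total_sq = sum((s * s).sum() for s in shards)
    mean = total / count
    return total_sq / count - mean * mean

x = torch.randn(8, 16)
shards = x.chunk(4, dim=0)  # stand-in for the per-GPU partitions

# Matches the variance of the full, unsharded tensor; a naive mean of the
# per-shard variances would miss the spread of the shard means.
assert torch.allclose(combine_shard_variance(shards), x.var(unbiased=False), atol=1e-5)
```

By the law of total variance, averaging local variances drops the variance-of-the-shard-means term, which is exactly the error being corrected here.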

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

Comment on lines +124 to +128
if weight is not None:
weight = weight.to(input.dtype)
if bias is not None:
bias = bias.to(input.dtype)

Collaborator Author
This fixes a mixed precision crash.

This PR also adapts to upstream torch changes in the Python dispatch behavior and DTensor:

- adding more layers to handle select
- add more reliable handling of casting DTensor to ShardTensor.  In particular,
  the focus is on making sure we maintain proper autograd graphs.
- switch to a first-principles implementation of group norm.  It's more stable
  and simpler, and while it might be a little slower, the upcoming torch.compile
  work can address that.
- add a dedicated view handler at functional and dispatch level.  It's necessary
  at this point to wrap our own view implementation due to the differences with DTensor.
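The autograd-preserving promotion mentioned above can be sketched with a custom `torch.autograd.Function`: the forward rewraps the local tensor and the backward passes the gradient straight through, so the cast is transparent to autograd. This is a simplified stand-in, not the actual ShardTensor promotion code:

```python
import torch

class _Promote(torch.autograd.Function):
    """Hypothetical sketch of an autograd-preserving promotion."""

    @staticmethod
    def forward(ctx, local):
        # The real code would wrap `local` into a ShardTensor; a clone
        # stands in for the rewrap here.
        return local.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Identity gradient: the promotion does not alter values, so
        # gradients flow through unchanged.
        return grad_out

x = torch.randn(4, requires_grad=True)
y = _Promote.apply(x).sum()
y.backward()
assert torch.allclose(x.grad, torch.ones(4))
```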
@coreyjadams coreyjadams marked this pull request as ready for review February 11, 2026 21:28
@greptile-apps
Contributor

greptile-apps Bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR fixes two critical bugs in the sharded group normalization implementation and adds comprehensive AMP testing.

Key Changes:

  • Fixed variance calculation bug in normalization_patches.py: The old code incorrectly computed variance by inverting local rstd values and averaging them across GPUs (global_var = (1.0 / (rstd**2)) - eps). This is mathematically incorrect because you cannot average variances computed from different data partitions. The new implementation correctly computes global variance using Var(X) = E[X²] - E[X]² by reducing sums and sum-of-squares across GPUs.

  • Fixed mixed precision handling: Added explicit dtype casting for weight and bias parameters to match input dtype (lines 114-118), and added dtype conversion for gradients in backward pass (lines 210-212). This prevents dtype mismatches when running with AMP.

  • Complete rewrite of group normalization: Replaced reliance on aten.native_group_norm / aten.native_group_norm_backward with a from-scratch implementation that properly handles distributed statistics, reducing all-reduce calls from 3 in forward (mean, variance, separate reductions) to 1 (fused sum and sum-of-squares).

  • Added comprehensive AMP testing: Extended test suite with amp parameter to validate mixed precision behavior works correctly with the fixes.
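The dtype casting in the second bullet matters because under autocast the activations may run in half precision while module parameters stay float32. A minimal sketch of the pattern (the function name is hypothetical, not the PR's actual code):

```python
import torch

def group_norm_affine(x, weight, bias):
    # Cast parameters to the activation dtype, mirroring the PR's fix:
    # under AMP, x may be float16 while weight/bias remain float32.
    if weight is not None:
        weight = weight.to(x.dtype)
    if bias is not None:
        bias = bias.to(x.dtype)
    return x * weight + bias

x = torch.randn(4, 8).to(torch.float16)  # half-precision activations
w = torch.ones(8)                        # float32 parameters
b = torch.zeros(8)

out = group_norm_affine(x, w, b)
assert out.dtype == torch.float16  # result stays in the activation dtype
```

Without the cast, mixing dtypes either promotes the result back to float32 or, in fused kernels that require matching dtypes, raises an error.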

Additional Changes:

  • Added DTensor conversion utilities and autograd-preserving promotion functions in shard_tensor.py
  • New view_ops.py module with proper view/reshape operations for ShardTensor
  • Updated version check to handle torch 2.10.0a (alpha releases)

Important Files Changed

Filename — Overview
physicsnemo/domain_parallel/shard_utils/normalization_patches.py — Complete rewrite implementing group normalization from first principles; fixes critical variance calculation bug and adds proper mixed precision dtype handling
test/domain_parallel/ops/test_normalization.py — Added AMP testing parameters to validate mixed precision behavior in layer norm and group norm tests
test/domain_parallel/ops/utils.py — Added AMP support to numerical_shard_tensor_check with autocast context wrapping forward and backward passes
physicsnemo/domain_parallel/shard_tensor.py — Added DTensor conversion helpers and autograd-preserving promotion functions to support improved ShardTensor/DTensor interoperability

Contributor

@greptile-apps greptile-apps Bot left a comment


12 files reviewed, no comments



-if check_version_spec("torch", "2.10.0"):
+if check_version_spec("torch", "2.10.0a"):
Collaborator Author

This is to get pre-release versions. Fixes #1394
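`check_version_spec` is PhysicsNeMo's internal helper, but the underlying PEP 440 behavior can be illustrated with the `packaging` library, assuming the helper follows standard specifier semantics:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

alpha = Version("2.10.0a0")  # e.g. a torch alpha/nightly build

# A pre-release sorts *below* the final release, so ">=2.10.0" rejects it:
assert alpha < Version("2.10.0")
assert alpha not in SpecifierSet(">=2.10.0")

# Lowering the floor to the alpha admits pre-release builds (a specifier
# containing a pre-release version implicitly enables pre-release matching):
assert alpha in SpecifierSet(">=2.10.0a")
```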

@coreyjadams
Collaborator Author

This is getting broken up into smaller PRs for easier review.

@coreyjadams coreyjadams deleted the hotfix-sharded-group-norm branch March 5, 2026 15:07