Add NVTE_KEEP_BACKWARD_UNQUANTIZED #2644
Conversation
Greptile Summary
This PR adds an opt-in path to run FP8/FP4 quantized forward while keeping the backward pass in higher precision, controlled primarily through a new NVTE_KEEP_BACKWARD_UNQUANTIZED environment variable and a recipe-level quantize_backward flag. In the ops fuser stack, forward fusion functions are migrated to a unified forward interface.
Confidence Score: 3/5
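As a quick illustration of how this opt-in path would be used, here is a hedged sketch based on the public transformer_engine.pytorch API rather than code from this PR; the choice of Float8CurrentScaling as the recipe and the assumption that setting the env var at process start is sufficient are mine.

```python
import os

# Assumption: the env var is consulted when modules decide what to save for
# backward, so setting it before the model runs should be sufficient.
os.environ["NVTE_KEEP_BACKWARD_UNQUANTIZED"] = "1"

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling

layer = te.Linear(1024, 1024).cuda()
x = torch.randn(32, 1024, device="cuda", requires_grad=True)

# Example recipe; whether a given recipe allows skipping backward quantization
# is recipe-dependent (see the DelayedScaling discussion below).
recipe = Float8CurrentScaling()
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)       # forward GEMM runs quantized

y.sum().backward()     # wgrad/dgrad use the saved high-precision tensors
```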
Sequence Diagram

```mermaid
sequenceDiagram
    participant U as User code
    participant A as te.autocast/FP8GlobalStateManager
    participant M as TE Module (Linear/Grouped/LNLinear)
    participant F as OperationFuser (fusible ops)
    participant UB as UserbuffersForwardLinear
    participant BL as BasicLinear._functional_forward
    U->>A: enter autocast(recipe)
    A-->>M: fp8 enabled + recipe.quantize_backward
    U->>M: forward(inp)
    M->>M: prepare_forward(inp)
    Note over M: keep_backward_unquantized = fp8 && !recipe.quantize_backward
    alt BasicLinear / non-UB fused paths
        M->>F: run pipeline
        F->>BL: _functional_forward(..., keep_backward_unquantized)
        BL-->>F: output, x_local, w
        F->>F: save_for_backward(saved_input/saved_weight)
        Note over F: if keep_backward_unquantized, save high-precision tensors
    else Userbuffers fused forward selected
        M->>F: run pipeline
        F->>UB: fuser_forward(...)
        UB-->>F: output, x_local, w
        F->>F: save_for_backward(x_local, w)
        Note over UB,F: UB path does not incorporate keep_backward_unquantized
    end
    U-->>M: loss.backward()
    M->>F: autograd backward
    alt keep_backward_unquantized honored
        F->>BL: backward uses saved high-precision tensors
        Note over BL: quantized backward disabled
    else UB path
        F->>UB: backward expects quantized tensors / quantized compute
    end
    U->>A: exit autocast
```
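The Note in the diagram is the crux: a single boolean derived from the autocast state decides whether backward sees quantized or high-precision tensors. Below is a minimal sketch of that decision and of how it selects what gets saved; the helper name and signature are illustrative, not TE's internal API.

```python
def choose_saved_tensors(fp8_enabled, recipe, input_hp, input_q, weight_hp, weight_q):
    """Pick what to stash for backward, mirroring the Note in the diagram.

    `*_hp` are high-precision tensors, `*_q` their quantized counterparts;
    this helper is a sketch, not the actual TE code path.
    """
    quantize_backward = getattr(recipe, "quantize_backward", True) if recipe is not None else True
    keep_backward_unquantized = fp8_enabled and not quantize_backward

    if keep_backward_unquantized:
        # Backward GEMMs (wgrad/dgrad) will run in high precision.
        return input_hp, weight_hp
    # Default: reuse the quantized tensors produced for the forward GEMM.
    return input_q, weight_q
```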
I'll work on potential unit test breakage.
```python
ln_out_return = None
if return_layernorm_output or return_layernorm_output_gathered:
    ln_out_return = ln_out
ln_out_hp = ln_out if keep_backward_unquantized else None
```
Storing both `ln_out` (quantized) and `ln_out_hp` (high precision) doubles the memory footprint for this activation. Verify that this overhead is acceptable for your target models, especially when training with large batch sizes or long sequences.
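One rough way to act on this is to compare peak CUDA memory for a training step with and without the flag. The harness below is only a sketch; `make_model` and the assumption that the env var takes effect when the module saves tensors for backward are hypothetical.

```python
import os
import torch
import transformer_engine.pytorch as te


def peak_training_step_memory(make_model, batch, recipe, keep_backward_unquantized):
    """Hypothetical harness: report peak allocated CUDA memory (MiB) for one step."""
    # Assumption: the env var is re-read when the module saves tensors for backward.
    os.environ["NVTE_KEEP_BACKWARD_UNQUANTIZED"] = "1" if keep_backward_unquantized else "0"
    model = make_model().cuda()
    torch.cuda.reset_peak_memory_stats()
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        out = model(batch)
    out.sum().backward()
    return torch.cuda.max_memory_allocated() / 2**20
```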
```python
if enabled or calibrating:
    _validate_recipe_quantization_flags(fp8_recipe)
    quantize_forward = getattr(fp8_recipe, "quantize_forward", True)
    effective_enabled = enabled and quantize_forward
```
I am not very sure whether we should disable FP8 when quantize_forward is false.
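For reference, the following is just a reading of the excerpt above, not a statement of intended design: with `effective_enabled = enabled and quantize_forward`, setting `quantize_forward=False` makes the autocast behave as if FP8 were off for the forward pass.

```python
def effective_fp8_enabled(enabled: bool, quantize_forward: bool) -> bool:
    # Mirrors `effective_enabled = enabled and quantize_forward`:
    #   enabled=True,  quantize_forward=True  -> True   (quantized forward)
    #   enabled=True,  quantize_forward=False -> False  (autocast entered, forward stays high precision)
    #   enabled=False, any quantize_forward   -> False
    return enabled and quantize_forward
```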
```diff
 # Check if FP8 is enabled
 fp8_enabled = FP8GlobalStateManager.is_fp8_enabled()
 quantize_forward = fp8_enabled and self._quantize_forward
-quantize_backward = fp8_enabled and self._quantize_backward
+quantize_backward = (
```
Recipe None crash
`FP8GlobalStateManager.get_fp8_recipe()` can be None (e.g., if FP8 is enabled but no recipe was set), so `...get_fp8_recipe().quantize_backward` will raise an AttributeError. This makes Quantize.forward() crash in that configuration; please guard for None (e.g., default to quantized backward when the recipe is unset, or raise an explicit error with a clearer message).
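A sketch of the guard being requested; the helper and its default-to-quantized fallback when no recipe is set are assumptions, not necessarily the fix the PR will adopt.

```python
def resolve_quantize_backward(fp8_enabled: bool, module_flag: bool, recipe) -> bool:
    """Hypothetical helper mirroring the guarded logic the comment asks for.

    When no recipe is registered (recipe is None), fall back to the
    pre-existing behavior (quantized backward) instead of dereferencing None.
    """
    return (
        fp8_enabled
        and module_flag
        and (recipe is None or getattr(recipe, "quantize_backward", True))
    )
```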
```python
assert not (
    not self.quantize_forward and self.quantize_backward
), "Invalid recipe configuration: quantize_backward=True requires quantize_forward=True."
assert self.quantize_backward, "Delayed scaling does not support quantize_backward=False."
```
This assertion prevents using NVTE_KEEP_BACKWARD_UNQUANTIZED=1 with the DelayedScaling recipe: when the env var is set, quantize_backward becomes False, so the assert fails and blocks the entire feature for this recipe type.
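If DelayedScaling genuinely cannot run an unquantized backward, one softer option is to warn and ignore the flag (or raise an error that names the env var) instead of asserting. The sketch below is written under that assumption and is not what the PR currently implements.

```python
import warnings


def resolve_delayed_scaling_quantize_backward(quantize_forward: bool, quantize_backward: bool) -> bool:
    """Hypothetical replacement for the bare asserts in the excerpt above."""
    if not quantize_forward and quantize_backward:
        raise ValueError(
            "Invalid recipe configuration: quantize_backward=True requires quantize_forward=True."
        )
    if not quantize_backward:
        warnings.warn(
            "DelayedScaling does not support an unquantized backward; "
            "ignoring NVTE_KEEP_BACKWARD_UNQUANTIZED for this recipe.",
            UserWarning,
        )
        return True  # effective quantize_backward
    return quantize_backward
```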
```python
assert (
    not keep_backward_unquantized
), "NVTE_KEEP_BACKWARD_UNQUANTIZED is not implemented in LayerNormMLP"
```
This assertion makes LayerNormMLP crash immediately when NVTE_KEEP_BACKWARD_UNQUANTIZED=1 is set. If this module doesn't support the feature, either implement it or handle the case more gracefully with a clear error message before reaching this point.
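One way to fail more gracefully is an explicit check with an actionable message before the forward ever reaches the assert; the helper below is only a sketch of that shape, not LayerNormMLP's actual code.

```python
import os


def check_keep_backward_unquantized_supported(module_name: str) -> None:
    """Hypothetical early check for modules that don't implement the feature yet."""
    if os.environ.get("NVTE_KEEP_BACKWARD_UNQUANTIZED", "0") == "1":
        raise NotImplementedError(
            f"NVTE_KEEP_BACKWARD_UNQUANTIZED=1 is not supported by {module_name}; "
            "unset the variable or use a module that implements unquantized backward."
        )
```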
```diff
 # Save state for backward pass
 if ctx.requires_grad:
+    saved_input = input_ if keep_backward_unquantized else x_local
+    saved_weight = self.weight if keep_backward_unquantized else w
     if is_cpu_offload_enabled():
-        mark_activation_offload(x_local)
-    ctx.save_for_backward(x_local, w)
-    ctx.with_quantized_compute = with_quantized_compute
+        mark_activation_offload(saved_input)
+    ctx.save_for_backward(saved_input, saved_weight)
```
Unnecessary saved tensors
The forward path now saves saved_input/saved_weight whenever ctx.requires_grad is true, even when weight_requires_grad or input_requires_grad is false. In cases like frozen weights (common for finetuning) or when only one side needs gradients, this saves extra tensors and can materially increase activation memory. The prior `if not weight_requires_grad: saved_input = None` / `if not input_requires_grad: saved_weight = None` logic avoided that.
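A sketch of how the old per-tensor gating could be combined with the new high-precision selection; the names mirror the excerpt, and whether this is the intended fix is an assumption.

```python
def select_saved_tensors(
    requires_grad: bool,
    weight_requires_grad: bool,
    input_requires_grad: bool,
    keep_backward_unquantized: bool,
    input_hp, input_q, weight_hp, weight_q,
):
    """Hypothetical helper: only save what backward actually needs.

    wgrad needs the (possibly high-precision) input; dgrad needs the weight.
    """
    saved_input = None
    saved_weight = None
    if requires_grad and weight_requires_grad:
        saved_input = input_hp if keep_backward_unquantized else input_q
    if requires_grad and input_requires_grad:
        saved_weight = weight_hp if keep_backward_unquantized else weight_q
    return saved_input, saved_weight
```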
Description
Add an `NVTE_KEEP_BACKWARD_UNQUANTIZED` env var for quantized fprop + high-precision wgrad & dgrad.
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: