[All] Remove legacy max512 backend #2949
Greptile Summary

This PR removes the legacy max512 fused attention backend.
Confidence Score: 5/5

The removal is well-scoped: the arbitrary-seqlen backend is a confirmed superset of max512's capabilities, so no previously-supported workload should silently fall back to `NVTE_No_Backend`. The core deletion (kernel, header, dispatch, and bindings) is complete and internally consistent across C++, PyTorch, and JAX.

The remaining findings are documentation-only: the `FusedAttention` docstring misstates the FP8 backend's sequence-length constraint, and the `NVTE_FUSED_ATTN_BACKEND` env-var docs still list value 1 as a meaningful F16 override when the code no longer reads it. Neither affects runtime correctness. The `FusedAttention` class docstring in `backends.py` and the `NVTE_FUSED_ATTN_BACKEND` entry in `docs/envvars.rst` both carry inaccuracies introduced by this PR that are worth a second look before merging.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nvte_get_fused_attn_backend] --> B{dtype?}
    B -->|FP8| C[Evaluate FP8 conditions]
    B -->|FP16 / BF16| D[Evaluate flag_arb conditions]
    C --> E{FP8 conditions met?}
    E -->|Yes| F[NVTE_FP8]
    E -->|No| G[NVTE_No_Backend]
    D --> H{flag_arb?}
    H -->|Yes| I[NVTE_F16_arbitrary_seqlen]
    H -->|No| G
    style F fill:#4CAF50,color:#fff
    style I fill:#2196F3,color:#fff
    style G fill:#f44336,color:#fff
    subgraph REMOVED["Removed (this PR)"]
        R1["flag_m512 evaluation - seqlen <= 512, head_dim = 64, MHA only"]
        R2["NVTE_F16_max512_seqlen backend"]
        R3["NVTE_FUSED_ATTN_BACKEND env-var override for F16 path"]
    end
```
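To make the post-removal flow concrete, here is a minimal C++ sketch of the selection logic in the flowchart. It is illustrative only: the enum values are assumed to mirror the public header, while `select_backend` and its boolean parameters are stand-ins, since the real `nvte_get_fused_attn_backend` derives these conditions from the full set of attention parameters (dtype, layout, bias, mask, dropout, sequence lengths, head dim, and so on).

```cpp
// Illustrative sketch only: condenses the flowchart above into a toy function.
// The enum is assumed to mirror the public header after this PR; the booleans
// stand in for the real condition checks inside nvte_get_fused_attn_backend.
enum NVTE_Fused_Attn_Backend {
  NVTE_No_Backend = -1,
  // NVTE_F16_max512_seqlen = 0  (removed by this PR)
  NVTE_F16_arbitrary_seqlen = 1,
  NVTE_FP8 = 2,
};

NVTE_Fused_Attn_Backend select_backend(bool is_fp8, bool fp8_conditions_met,
                                       bool flag_arb) {
  if (is_fp8) {
    // FP8 path: delayed scaling, current scaling, MXFP8.
    return fp8_conditions_met ? NVTE_FP8 : NVTE_No_Backend;
  }
  // FP16/BF16 path: with flag_m512 gone, flag_arb is the only gate left.
  return flag_arb ? NVTE_F16_arbitrary_seqlen : NVTE_No_Backend;
}
```

With `flag_m512` gone, the F16 path has a single gate, which is what allows the max512 dispatch code and the `NVTE_FUSED_ATTN_BACKEND` override for the F16 path to be deleted.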
Reviews (5): Last reviewed commit: "remove sub-backend 0 from header docstri..."
Description
Fused attention (cuDNN attention) currently has 3 sub-backends:
- `f16_max512`: FP16/BF16, max_seq_len <= 512, head_dim = 64, MHA only
- `f16_arbitrary`: FP16/BF16, max_seq_len = any, head_dim <= 256, MHA/MQA/GQA/MLA, and
- `fp8`: FP8 delayed scaling, FP8 current scaling, MXFP8.

`f16_max512` was implemented using a much older cuDNN interface, which will be removed in the next cudnn-frontend release. This PR removes the `f16_max512` sub-backend and routes all BF16/FP16 attention calculations to the `f16_arbitrary` sub-backend, which covers all `max512` features.

Type of change
Changes
- Remove the max512 kernel (`fused_attn_f16_max512_seqlen.cu`) and its header
- Remove `NVTE_F16_max512_seqlen` from the `NVTE_Fused_Attn_Backend` enum (existing values `NVTE_F16_arbitrary_seqlen = 1` and `NVTE_FP8 = 2` are unchanged; see the guard sketch after the checklist)
- Remove the `flag_m512` computation, backend selection logic, and fwd/bwd dispatch for max512 in `fused_attn.cpp`
- Update the `FusedAttnBackend` dict, RNG workspace sizing, and docstrings
- Update `DotProductAttention` backend selection (env var override for `post_scale_bias`, sliding window filter, bias shape filter)
- The `FusedAttention` class docstring is removed and to be replaced with a more complete and accurate version in a follow-up

Checklist:
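One point from the changes list above worth making explicit: the removal is only non-breaking for callers that pass backend IDs across the C API as plain integers if the surviving enumerators keep their numbers. A hypothetical compile-time guard, not part of this PR and again assuming the enum mirrors the public header, could pin that down:

```cpp
// Hypothetical guard, not part of this PR: pins the surviving backend IDs to
// their pre-removal values so callers that pass raw ints stay compatible.
enum NVTE_Fused_Attn_Backend {
  NVTE_No_Backend = -1,
  // NVTE_F16_max512_seqlen = 0 was removed; the gap at 0 is intentional.
  NVTE_F16_arbitrary_seqlen = 1,
  NVTE_FP8 = 2,
};

static_assert(NVTE_F16_arbitrary_seqlen == 1,
              "arbitrary-seqlen backend ID must remain 1");
static_assert(NVTE_FP8 == 2, "FP8 backend ID must remain 2");
```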