[All] Remove legacy max512 backend #2949
Greptile Summary

This PR removes the legacy max512 fused attention backend.
Confidence Score: 5/5

The removal is well-scoped: the arbitrary-seqlen backend is a confirmed superset of max512's capabilities, so no previously-supported workload should silently fall back to `NVTE_No_Backend`. The core deletion (kernel, header, dispatch, and bindings) is complete and internally consistent across C++, PyTorch, and JAX.

The remaining findings are documentation-only: the `FusedAttention` docstring misstates the FP8 backend's sequence-length constraint, and the `NVTE_FUSED_ATTN_BACKEND` env-var docs still list value 1 as a meaningful F16 override when the code no longer reads it. Neither affects runtime correctness. The `FusedAttention` class docstring in `backends.py` and the `NVTE_FUSED_ATTN_BACKEND` entry in `docs/envvars.rst` both carry inaccuracies introduced by this PR that are worth a second look before merging.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nvte_get_fused_attn_backend] --> B{dtype?}
    B -->|FP8| C[Evaluate FP8 conditions]
    B -->|FP16 / BF16| D[Evaluate flag_arb conditions]
    C --> E{FP8 conditions met?}
    E -->|Yes| F[NVTE_FP8]
    E -->|No| G[NVTE_No_Backend]
    D --> H{flag_arb?}
    H -->|Yes| I[NVTE_F16_arbitrary_seqlen]
    H -->|No| G
    style F fill:#4CAF50,color:#fff
    style I fill:#2196F3,color:#fff
    style G fill:#f44336,color:#fff
    subgraph REMOVED["Removed (this PR)"]
        R1["flag_m512 evaluation - seqlen <= 512, head_dim = 64, MHA only"]
        R2["NVTE_F16_max512_seqlen backend"]
        R3["NVTE_FUSED_ATTN_BACKEND env-var override for F16 path"]
    end
```
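To make the post-removal flow concrete, here is a minimal C++ sketch of the selection logic in the flowchart. It is illustrative only: the enum values are assumed to mirror the public header, while `select_backend` and its boolean parameters are stand-ins, since the real `nvte_get_fused_attn_backend` derives these conditions from the full set of attention parameters (dtype, layout, bias, mask, dropout, sequence lengths, head dim, and so on).

```cpp
// Illustrative sketch only: condenses the flowchart above into a toy function.
// The enum is assumed to mirror the public header after this PR; the booleans
// stand in for the real condition checks inside nvte_get_fused_attn_backend.
enum NVTE_Fused_Attn_Backend {
  NVTE_No_Backend = -1,
  // NVTE_F16_max512_seqlen = 0  (removed by this PR)
  NVTE_F16_arbitrary_seqlen = 1,
  NVTE_FP8 = 2,
};

NVTE_Fused_Attn_Backend select_backend(bool is_fp8, bool fp8_conditions_met,
                                       bool flag_arb) {
  if (is_fp8) {
    // FP8 path: delayed scaling, current scaling, MXFP8.
    return fp8_conditions_met ? NVTE_FP8 : NVTE_No_Backend;
  }
  // FP16/BF16 path: with flag_m512 gone, flag_arb is the only gate left.
  return flag_arb ? NVTE_F16_arbitrary_seqlen : NVTE_No_Backend;
}
```

With `flag_m512` gone, the F16 path has a single gate, which is what allows the max512 dispatch code and the `NVTE_FUSED_ATTN_BACKEND` override for the F16 path to be deleted.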
Reviews (5): Last reviewed commit: "remove sub-backend 0 from header docstri..."
Description
Fused attention (cuDNN attention) currently has 3 sub-backends:
- `f16_max512`: FP16/BF16, max_seq_len <= 512, head_dim = 64, MHA only
- `f16_arbitrary`: FP16/BF16, max_seq_len = any, head_dim <= 256, MHA/MQA/GQA/MLA, and
- `fp8`: FP8 delayed scaling, FP8 current scaling, MXFP8.

`f16_max512` was implemented using a much older cuDNN interface, which will be removed in the next cudnn-frontend release. This PR removes the `f16_max512` sub-backend and routes all BF16/FP16 attention calculations to the `f16_arbitrary` sub-backend, which covers all `max512` features.

Type of change
Changes
- Remove the max512 kernel (`fused_attn_f16_max512_seqlen.cu`) and its header
- Remove `NVTE_F16_max512_seqlen` from the `NVTE_Fused_Attn_Backend` enum (existing values `NVTE_F16_arbitrary_seqlen = 1` and `NVTE_FP8 = 2` are unchanged; see the guard sketch after the checklist)
- Remove the `flag_m512` computation, backend selection logic, and fwd/bwd dispatch for max512 in `fused_attn.cpp`
- Update the `FusedAttnBackend` dict, RNG workspace sizing, and docstrings
- Update `DotProductAttention` backend selection (env var override for `post_scale_bias`, sliding window filter, bias shape filter)
- The `FusedAttention` class docstring is removed and to be replaced with a more complete and accurate version in a follow-up

Checklist:
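One point from the changes list above worth making explicit: the removal is only non-breaking for callers that pass backend IDs across the C API as plain integers if the surviving enumerators keep their numbers. A hypothetical compile-time guard, not part of this PR and again assuming the enum mirrors the public header, could pin that down:

```cpp
// Hypothetical guard, not part of this PR: pins the surviving backend IDs to
// their pre-removal values so callers that pass raw ints stay compatible.
enum NVTE_Fused_Attn_Backend {
  NVTE_No_Backend = -1,
  // NVTE_F16_max512_seqlen = 0 was removed; the gap at 0 is intentional.
  NVTE_F16_arbitrary_seqlen = 1,
  NVTE_FP8 = 2,
};

static_assert(NVTE_F16_arbitrary_seqlen == 1,
              "arbitrary-seqlen backend ID must remain 1");
static_assert(NVTE_FP8 == 2, "FP8 backend ID must remain 2");
```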