[Common] Use specialized unfused MXFP8 cast kernels by default #2958

Open
Oleg-Goncharov wants to merge 2 commits into NVIDIA:main from Oleg-Goncharov:pr_fast_default_mxfp8_kernels
Conversation

@Oleg-Goncharov
Collaborator

Description

This PR enables the fast unfused MXFP8 cast kernels by default.

Previously, these kernels were gated behind an environment variable and therefore were not used unless explicitly enabled. This change makes the specialized cast-only path the default behavior.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Removed environment variable

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR makes the specialized unfused MXFP8 cast kernels the default by removing the ENABLE_CAST_ONLY environment variable gate and having all supported hasSpec specializations unconditionally return true. The COLWISE scaling path is explicitly excluded from the specialized route (no COLWISE-only kernel exists), and the now-dead case ScalingType::COLWISE branch with its orphaned NVTE_WARN is cleaned up.

  • specialized/quantize_mxfp8.cuh: Deletes is_cast_only_enabled() and replaces the four hasSpec specialization bodies (fp16/bf16 → fp8e5m2/fp8e4m3, no dbias/dact/act) with return true;.
  • quantize_mxfp8.cuh: Adds && (scaling_type != ScalingType::COLWISE) to the outer specialized-path guard and removes the previously unreachable COLWISE case from the inner switch.

Confidence Score: 5/5

Safe to merge — the change is a straightforward promotion of already-tested specialized kernels from opt-in to default.

The specialized kernels were previously exercisable via an env var, so the code paths themselves are not new. The COLWISE exclusion guard is logically correct (no COLWISE specialization exists), and removing the dead switch branch eliminates a misleading warning. No behavioral change occurs for COLWISE or GEMM-swizzled-scales paths, which continue to use the original kernels unchanged.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/cast/mxfp8/specialized/quantize_mxfp8.cuh | Removes the is_cast_only_enabled() env-var gate and replaces all hasSpec specialization return values with unconditional true, enabling the specialized cast kernels by default. |
| transformer_engine/common/cast/mxfp8/quantize_mxfp8.cuh | Adds a scaling_type != ScalingType::COLWISE guard to the specialized-path condition and removes the now-unreachable COLWISE case (with its dangling NVTE_WARN) from the inner switch. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[quantize called] --> B{hasSpec AND\nnot GEMM swizzled\nnot COLWISE?}
    B -- YES --> C{scaling_type?}
    B -- NO --> G[Generic kernel path]
    C -- ROWWISE --> D[specialized cast-only kernel\nCastTraits rowwise=true colwise=false]
    C -- BIDIMENSIONAL --> E[specialized cast-only kernel\nCastTraits rowwise=true colwise=true]
    G --> H{scaling_type?}
    H -- ROWWISE --> I[Original ROWWISE kernel]
    H -- COLWISE --> J[Original COLWISE kernel]
    H -- BIDIMENSIONAL --> K[Original BIDIMENSIONAL kernel]
```

Reviews (2) — last reviewed commit: "Removed dead code"

ksivaman previously approved these changes May 5, 2026
Member

@ksivaman left a comment
LGTM

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
@Oleg-Goncharov
Collaborator Author

/te-ci

2 participants