[SM120] Emit grouped NVFP4 (ue4m3) block-scaled GEMM in cutlass_library by waynehacking8 · Pull Request #3346 · NVIDIA/cutlass

waynehacking8 · 2026-06-23T11:30:19Z

The SM120 fp4 block-scaled generator skipped grouped NVFP4 (ue4m3) kernels behind a stale
"not yet available" comment, even though to_grouped_schedule() already maps the Nvf4 Sm120
tags onto the PtrArray tags and example 79d_blackwell_geforce_nvfp4_grouped_gemm.cu runs that
exact kernel. Because the grouped tags collapse to PtrArray*, is_nvf4(kernel_schedule) is
False in the grouped branch, so that branch must key on the scale-factor (SF) type instead
of the schedule tag. This PR adds the is_grouped && ue4m3 clause and corrects the stale comment.
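
The reasoning above can be illustrated with a small, self-contained toy model. It is not the cutlass_library generator code: the enum members, the `should_emit_*` helpers, and the simplified bodies of `is_nvf4` and `to_grouped_schedule` are hypothetical stand-ins that only mimic the behavior described in this PR (grouped tags collapsing to a PtrArray tag, and the filter therefore needing to look at the SF type).

```python
from enum import Enum, auto


class Schedule(Enum):
    # Hypothetical stand-ins for the SM120 kernel schedule tags.
    NVF4_SM120 = auto()
    PTR_ARRAY_SM120 = auto()


class SFType(Enum):
    # Hypothetical stand-ins for scale-factor element types.
    UE4M3 = auto()   # the NVFP4 scale-factor type named in this PR
    OTHER = auto()   # any non-NVFP4 scale-factor type


def is_nvf4(schedule):
    # Simplified: only the Nvf4-specific tag counts as NVFP4.
    return schedule is Schedule.NVF4_SM120


def to_grouped_schedule(schedule, grouped):
    # Simplified: in the grouped path the Nvf4 tag collapses to PtrArray.
    return Schedule.PTR_ARRAY_SM120 if grouped else schedule


def should_emit_before(schedule, sf_type, grouped):
    # Assumed pre-fix behavior: ue4m3 kernels are gated on the schedule tag only.
    if sf_type is SFType.UE4M3:
        return is_nvf4(schedule)
    return True


def should_emit_after(schedule, sf_type, grouped):
    # Assumed post-fix behavior: the grouped branch keys on the SF type.
    if sf_type is SFType.UE4M3:
        return is_nvf4(schedule) or grouped
    return True


grouped_tag = to_grouped_schedule(Schedule.NVF4_SM120, grouped=True)
print(is_nvf4(grouped_tag))                                     # tag no longer identifies NVFP4
print(should_emit_before(grouped_tag, SFType.UE4M3, True))      # grouped NVFP4 skipped
print(should_emit_after(grouped_tag, SFType.UE4M3, True))       # grouped NVFP4 emitted
```

In this sketch the non-grouped path is unchanged by the fix; only the combination of a grouped kernel and a ue4m3 scale factor flips from skipped to emitted.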

Emits 14 grouped NVFP4 SM120 kernels that were previously absent (verified by driving the
grouped generator path: 0 -> 14). The kernel template is the one proven by example 79d.

Fixes #3343.

Test plan / limits

Verified emission 0 -> 14 by driving GenerateSM120_TensorOp_fp4_UMMA_gemm_with_block_scaled(..., gemm_kind=GemmKind.GroupedBlockScaledUniversal3x).
Verified the kernel template via example 79d on RTX PRO 6000 (SM120, CUDA 13): both schedules report Disposition: Passed, at 303 and 253 TFLOPS.
I did not run a full cutlass_profiler build of all 14 emitted tile variants locally (heavy). Please run internal CI to confirm that every emitted tile compiles. Some SM120 tiles may hit Stages>=2 constraints; if so, they can be pruned in the same clause.

The SM120 fp4 block-scaled generator skipped grouped NVFP4 (ue4m3) kernels behind a stale "not yet available" comment, although to_grouped_schedule() already maps the Nvf4 Sm120 tags onto the PtrArray tags and example 79d (79d_blackwell_geforce_nvfp4_grouped_gemm.cu) runs that exact kernel. Because the grouped tags collapse to PtrArray*, is_nvf4(kernel_schedule) is False in the grouped branch, so it must key on the SF type instead.

Emits 14 grouped NVFP4 SM120 kernels that were previously absent (verified by driving the grouped generator path: 0 -> 14). The kernel template is the one proven by example 79d (303 TFLOPS on RTX PRO 6000, SM120, CUDA 13).

Fixes NVIDIA#3343.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: WEI CHENG CHIU <waynehacking8@gmail.com>

waynehacking8 wants to merge 1 commit into NVIDIA:main from waynehacking8:sm120-grouped-nvf4-gen
