[SM120] Emit grouped NVFP4 (ue4m3) block-scaled GEMM in cutlass_library #3346
Open
waynehacking8 wants to merge 1 commit into
The SM120 fp4 block-scaled generator skipped grouped NVFP4 (ue4m3) kernels behind a stale "not yet available" comment, although to_grouped_schedule() already maps the Nvf4 Sm120 tags onto the PtrArray tags and example 79d (79d_blackwell_geforce_nvfp4_grouped_gemm.cu) runs that exact kernel. Because the grouped tags collapse to PtrArray*, is_nvf4(kernel_schedule) is False in the grouped branch, so it must key on the SF type instead.

Emits 14 grouped NVFP4 SM120 kernels that were previously absent (verified by driving the grouped generator path: 0 -> 14). The kernel template is the one proven by example 79d (303 TFLOPS on RTX PRO 6000, SM120, CUDA 13).

Fixes NVIDIA#3343.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: WEI CHENG CHIU <waynehacking8@gmail.com>
The SM120 fp4 block-scaled generator skipped grouped NVFP4 (ue4m3) kernels behind a stale
"not yet available" comment, even though
to_grouped_schedule() already maps the Nvf4 Sm120 tags onto the PtrArray tags and example
79d_blackwell_geforce_nvfp4_grouped_gemm.cu runs that exact kernel. Because the grouped tags collapse to
PtrArray*, is_nvf4(kernel_schedule) is False in the grouped branch, so it must key on the SF type. This adds the is_grouped && ue4m3 clause and corrects the stale comment.

Emits 14 grouped NVFP4 SM120 kernels that were previously absent (verified by driving the
grouped generator path: 0 -> 14). The kernel template is the one proven by example 79d.

Fixes #3343.
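To illustrate why the grouped branch has to key on the scale-factor type, here is a minimal toy model of the check described above. It is a sketch only: `Schedule`, `ScaleFactorType`, `to_grouped`, and `recognized_as_nvfp4` are hypothetical stand-ins written for this example, not the actual cutlass_library enums or generator code, and the real condition in the patch may be structured differently.

```python
from enum import Enum, auto


class Schedule(Enum):
    """Hypothetical stand-ins for kernel schedule tags (not the real cutlass_library names)."""
    NVF4_SM120 = auto()
    PTR_ARRAY_SM120 = auto()


class ScaleFactorType(Enum):
    """Scale-factor element types; ue4m3 is the one used by NVFP4."""
    UE4M3 = auto()
    UE8M0 = auto()


def to_grouped(schedule: Schedule) -> Schedule:
    # Assumption for this sketch: every grouped schedule collapses to one PtrArray tag,
    # so the NVFP4-specific information is no longer visible in the tag.
    return Schedule.PTR_ARRAY_SM120


def is_nvf4(schedule: Schedule) -> bool:
    # A tag-based check: it can only succeed while the tag still says "Nvf4".
    return schedule is Schedule.NVF4_SM120


def recognized_as_nvfp4(schedule: Schedule, sf_type: ScaleFactorType,
                        is_grouped: bool, with_fix: bool) -> bool:
    tag = to_grouped(schedule) if is_grouped else schedule
    if is_nvf4(tag):
        return True
    # The added clause: in the grouped branch, fall back to the scale-factor type.
    return with_fix and is_grouped and sf_type is ScaleFactorType.UE4M3


if __name__ == "__main__":
    args = (Schedule.NVF4_SM120, ScaleFactorType.UE4M3)
    print("grouped, without SF clause:", recognized_as_nvfp4(*args, is_grouped=True, with_fix=False))
    print("grouped, with SF clause:   ", recognized_as_nvfp4(*args, is_grouped=True, with_fix=True))
```

In this model the non-grouped path is unaffected by the extra clause, while the grouped path is recognized only once the ue4m3 check is present.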
Test plan / limits
- GenerateSM120_TensorOp_fp4_UMMA_gemm_with_block_scaled(..., gemm_kind=GemmKind.GroupedBlockScaledUniversal3x).
- Disposition: Passed, 303 / 253 TFLOPS.
- cutlass_profiler build of all 14 emitted tile variants locally (heavy). Please run internal CI to confirm every emitted tile compiles; some SM120 tiles may hit Stages>=2 constraints, and if so they can be pruned in the same clause.