* create NVFP4Quantizer on the TE C++ side; modify mxfp8_quantize / cast_mxfp8_2D_kernel for NVFP4 generalization; temporarily hijack the MXFP8 torch side to call into NVFP4 quantization (will revert); see the quantization sketch after this list
* generalize dequantize_mxfp8_kernel to dequantize_mxnv_kernel
* create the NVFP4 extension interface, not yet fully enabled; MXFP8 trainability restored
* create the NVFP4BlockScaling and NVFP4BlockScalingRecipeState classes
* subclassing (a hierarchy sketch follows this list):
  - NVFP4TensorBase(MXFP8TensorBase)
  - NVFP4Quantizer(MXFP8Quantizer)
  - NVFP4Tensor(MXFP8Tensor)
* forward pass functional; backward raises an exception because cuBLASLt NVFP4 allows only the TN layout
* …max normal of e4m3 range
* motivation: cuBLASLt NVFP4 matmul is currently TN-only, so this recipe uses a TN NVFP4 forward and an NN/NT MXFP8 backward, avoiding a tensor relayout that can be costly to materialize for large models; piggyback on NVFP4Quantizer for the shadow MXFP8Quantizer needed by the backward pass (see the recipe sketch after this list)
* remove redundant quantization; step time improves
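For context on what the generalized cast/dequantize kernels compute, here is a minimal Python sketch of NVFP4 block scaling: 16-element blocks, E2M1 (FP4) element values, and per-block E4M3 scales clamped to the E4M3 max normal (448), which is what the "max normal of e4m3 range" commit refers to. The function names and the exact scale derivation are illustrative assumptions, not the TE kernel code.

```python
import torch

FP4_E2M1_MAX = 6.0  # largest E2M1 magnitude; grid is {0, .5, 1, 1.5, 2, 3, 4, 6}
E4M3_MAX = 448.0    # max normal of the E4M3 range
BLOCK = 16          # NVFP4 scaling-block length (MXFP8 uses 32)

def nvfp4_quantize_ref(x: torch.Tensor):
    """Reference-only sketch; assumes x.numel() is divisible by BLOCK."""
    blocks = x.float().reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=1, keepdim=True)
    # Per-block decode scale so the block amax lands on the FP4 max normal,
    # clamped into the representable E4M3 range and stored as E4M3.
    scale = (amax / FP4_E2M1_MAX).clamp(min=2.0**-126, max=E4M3_MAX)
    scale = scale.to(torch.float8_e4m3fn).float()
    # Element quantization shown as a clamp; real FP4 also rounds each
    # value to the nearest point on the E2M1 grid.
    q = (blocks / scale).clamp(-FP4_E2M1_MAX, FP4_E2M1_MAX)
    return q, scale

def nvfp4_dequantize_ref(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Mirrors what a dequantize_mxnv-style kernel undoes.
    return (q * scale).reshape(-1)
```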
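The subclassing commit suggests a hierarchy like the one below. The class names come from the commit messages; the stub bases, the constructor, and the `block_len` knob are assumptions added so the sketch stands alone, not TE's actual code.

```python
# Stand-in bases so this sketch is self-contained; in TE these are the real
# MXFP8 classes named in the commits.
class MXFP8TensorBase: ...
class MXFP8Quantizer:
    def __init__(self, *args, **kwargs): ...
class MXFP8Tensor(MXFP8TensorBase): ...

class NVFP4TensorBase(MXFP8TensorBase):
    """Reuses MXFP8's rowwise/columnwise data and scale-inverse storage."""

class NVFP4Quantizer(MXFP8Quantizer):
    """Dispatches casts to the NVFP4-generalized quantize kernel."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.block_len = 16  # hypothetical knob: NVFP4 blocks vs MXFP8's 32

class NVFP4Tensor(MXFP8Tensor):
    """User-facing tensor; inherits MXFP8Tensor's torch integration."""
```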
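The "TN NVFP4 forward, NN/NT MXFP8 backward" recipe can be pictured with the shape-level sketch below. `q4` and `q8` stand in for NVFP4Quantizer and the piggybacked shadow MXFP8Quantizer, plain matmuls replace the cuBLASLt calls, and the identity quantizers in the usage lines exist only to show that the shapes compose; none of this is TE's real API.

```python
import torch

def hybrid_linear_forward(x, w, q4, q8):
    # Forward GEMM in NVFP4: x @ w.T is the TN-layout case that
    # cuBLASLt NVFP4 currently supports, so no relayout is needed here.
    y = q4(x) @ q4(w).t()
    # Piggyback on the forward pass to also produce the MXFP8 copies the
    # backward pass will consume, instead of quantizing again later.
    ctx = (q8(x), q8(w))
    return y, ctx

def hybrid_linear_backward(g, ctx, q8):
    x8, w8 = ctx
    g8 = q8(g)
    dgrad = g8 @ w8      # NN-layout MXFP8 matmul
    wgrad = g8.t() @ x8  # NT-layout MXFP8 matmul
    return dgrad, wgrad

# Identity "quantizers" stand in for the real ones so the sketch runs
# and the shapes can be checked.
x, w = torch.randn(8, 4), torch.randn(2, 4)
y, ctx = hybrid_linear_forward(x, w, q4=lambda t: t, q8=lambda t: t)
dgrad, wgrad = hybrid_linear_backward(torch.randn(8, 2), ctx, q8=lambda t: t)
```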
This PR introduces a proof-of-concept implementation of NVFP4 forward + MXFP8 backward training.
The work is intentionally scoped as a local PoC and serves as a foundation for subsequent iterations.
Motivations
Implementation challenge in full NVFP4 training: The initial goal was end-to-end NVFP4 (forward + backward). However, NVFP4 matmuls in cuBLASLt currently support TN-only layouts, which would require an additional transpose kernel for the backward pass. We defer this to subsequent work. (NVIDIA’s official release v2.8 already has full NVFP4 support.)
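The layout problem can be made concrete with a shape-level view of the three GEMMs in a linear layer. The mapping of each matmul to a cuBLASLt layout below follows the usual TN/NN/NT convention and is illustrative; nothing here is taken from this PR's code.

```python
import torch

M, K, N = 8, 4, 2      # tokens, in-features, out-features
x = torch.randn(M, K)  # activations
w = torch.randn(N, K)  # weight
g = torch.randn(M, N)  # gradient w.r.t. the output

y     = x @ w.t()      # forward GEMM:   TN, the layout cuBLASLt NVFP4 supports
dgrad = g @ w          # backward dgrad: NN
wgrad = g.t() @ x      # backward wgrad: NT; running these two as TN would
                       # require materializing transposed operands
```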
Use-case 1: More efficient than NVFP4-QAT. MXFP8 backward is substantially more efficient than NVFP4-QAT pipelines, which still rely on 16-/32-bit backward passes.
Use-case 2: Practicality of full NVFP4 training
While NVFP4 training has advanced significantly, it still requires several supporting techniques: (1) Hadamard transforms, (2) selective higher-precision layers, and (3) switching back to higher precision for the last fraction of training, as also seen in the recent MLPerf v5.1 NVFP4 submission. MXFP8 backward can therefore be valuable, either for last-mile convergence or from the get-go.
Quick Summary of the Implementation:
Use-case studies here.