[PyTorch] GroupedTensor integration
#2600
base: main
Conversation
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Greptile Summary

This PR integrates the GroupedTensor class from the core library into the PyTorch bindings and the GroupedLinear module.

Key Changes
- Add PyTorch bindings for the GroupedTensor class.
- Integrate GroupedTensor into GroupedLinear so that its weight parameters are contiguous.
- Expose a grouped_quantize API that returns a quantized GroupedTensor consumable by grouped GEMMs.

Implementation Notes

The implementation allocates all weight data in a single contiguous buffer, then creates individual parameter views that share the underlying storage. This improves memory locality and enables future optimizations like grouped GEMMs (#2502).

Confidence Score: 4/5
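As a rough illustration of this layout (a minimal sketch with made-up sizes, not the PR's code), the per-GEMM weights can be plain views into one flat buffer:

```python
import torch

# Hypothetical illustration of contiguous weight storage with per-GEMM views;
# sizes and names are invented for the example.
num_gemms, out_features, in_features = 4, 256, 128
per_weight = out_features * in_features

# One flat buffer holds all weights back to back.
buf = torch.empty(num_gemms * per_weight, dtype=torch.bfloat16)

# Each "weight" is a view that shares the buffer's storage.
weights = [
    buf[i * per_weight : (i + 1) * per_weight].view(out_features, in_features)
    for i in range(num_gemms)
]

# All views alias the same underlying storage, so memory stays contiguous.
assert all(
    w.untyped_storage().data_ptr() == buf.untyped_storage().data_ptr() for w in weights
)
```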
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant GroupedLinear
    participant GroupedTensor
    participant Quantizer
    participant Storage

    Note over User,Storage: Initialization Phase
    User->>GroupedLinear: __init__(num_gemms, in_features, out_features)
    GroupedLinear->>GroupedLinear: register_parameter(weight0...weightN)
    GroupedLinear->>GroupedLinear: reset_parameters()
    GroupedLinear->>GroupedLinear: make_grouped_weights()

    Note over GroupedLinear,Storage: Weight Consolidation
    GroupedLinear->>Quantizer: _get_weight_quantizers()
    Quantizer-->>GroupedLinear: [quantizer0...quantizerN]
    GroupedLinear->>GroupedTensor: make_grouped_tensor(num_tensors, shapes, quantizers)

    Note over GroupedTensor,Storage: Allocate Contiguous Storage
    GroupedTensor->>GroupedTensor: analyze shape patterns
    GroupedTensor->>GroupedTensor: calculate logical_shape, offsets
    GroupedTensor->>Storage: allocate contiguous buffers (data, scale_inv, etc)
    GroupedTensor->>GroupedTensor: split_into_quantized_tensors()
    GroupedTensor-->>GroupedLinear: grouped_weights with quantized_tensors

    Note over GroupedLinear: Copy & Re-register Weights
    loop for each weight i
        GroupedLinear->>GroupedTensor: quantized_tensors[i].copy_(weights[i])
        GroupedLinear->>GroupedLinear: register_parameter(weightI, quantized_tensors[i])
    end

    Note over User,Storage: Forward Pass
    User->>GroupedLinear: forward(inp, m_splits)
    GroupedLinear->>GroupedLinear: _get_weight_tensors()
    GroupedLinear->>GroupedLinear: prepare quantizers
    GroupedLinear->>GroupedLinear: _GroupedLinear.apply()
    Note over GroupedLinear: All weights share contiguous storage
    GroupedLinear->>GroupedLinear: general_grouped_gemm(weights, inputs)
    GroupedLinear-->>User: output tensor
```
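As a concrete example of the forward path in the diagram, here is a hedged usage sketch of GroupedLinear (the constructor and forward arguments follow the diagram; the sizes and bias flag are arbitrary, and exact defaults may differ):

```python
import torch
import transformer_engine.pytorch as te

num_gemms, in_features, out_features = 4, 1024, 1024
layer = te.GroupedLinear(num_gemms, in_features, out_features, bias=False).cuda()

# m_splits assigns a row count to each of the num_gemms weights;
# the splits must sum to the number of input rows.
m_splits = [128, 256, 64, 64]
inp = torch.randn(sum(m_splits), in_features, device="cuda")

out = layer(inp, m_splits)
print(out.shape)  # torch.Size([512, 1024])
```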
9 files reviewed, 2 comments
```python
# TODO(ksivamani): Verify correctness of copy for all recipes.
with torch.no_grad():
    for i in range(self.num_gemms):
        grouped_weights.quantized_tensors[i].copy_(weights[i])
```
style: check that the copy operation works correctly for all quantization recipes (FP8, MXFP8, NVFP4, block scaling). The TODO comment on line 771 acknowledges this needs verification.
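One possible shape for that verification (purely a sketch, not code from the PR; it assumes each quantized tensor exposes a dequantize() method and uses a hypothetical verify_grouped_copy helper):

```python
import torch

def verify_grouped_copy(grouped_weights, reference_weights, atol=1e-1):
    """Hypothetical check: after copying into the grouped storage, dequantize
    each slot and compare against the original high-precision weight.

    The tolerance is deliberately loose because the values have been quantized.
    """
    with torch.no_grad():
        for i, ref in enumerate(reference_weights):
            q = grouped_weights.quantized_tensors[i]
            restored = q.dequantize() if hasattr(q, "dequantize") else q
            torch.testing.assert_close(restored.float(), ref.float(), atol=atol, rtol=0)
```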
```python
# TODO(ksivaman): (Do we need multiple quantizers?)
# Current implementation assumes all tensors have different quantizer
# instances but effectively the same quantizer.
rowwise_usage = quantizers[0].rowwise_usage if not no_quantization else True
columnwise_usage = quantizers[0].columnwise_usage if not no_quantization else False
```
style: check that all quantizers in the group are compatible. The comment acknowledges uncertainty about whether multiple quantizers are needed, but the implementation assumes they are "effectively the same"; mixed quantization schemes could cause issues.
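One way to make that explicit (an assumed helper, not part of the PR) is a guard that rejects mixed quantizer types or usage flags before quantizers[0] is treated as representative of the group:

```python
def check_quantizers_compatible(quantizers) -> None:
    """Illustrative guard: all quantizers in a group must be effectively identical.

    The rowwise/columnwise attribute names mirror the usage flags referenced in the
    snippet above; everything else here is an assumption for the example.
    """
    if not quantizers or quantizers[0] is None:
        return  # no quantization
    first = quantizers[0]
    for i, q in enumerate(quantizers[1:], start=1):
        if type(q) is not type(first):
            raise ValueError(
                f"quantizer {i} has type {type(q).__name__}, expected {type(first).__name__}"
            )
        if (q.rowwise_usage, q.columnwise_usage) != (first.rowwise_usage, first.columnwise_usage):
            raise ValueError(f"quantizer {i} has mismatched rowwise/columnwise usage flags")
```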
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Zhongbo Zhu <zhongboz@nvidia.com>
2b7ea40 to 40c619e
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
tests/pytorch/test_grouped_tensor.py
Outdated
```python
shape = [(512, 512) for _ in range(num_tensors)]
quantizers = make_quantizers(quantization, num_tensors, shape)

grouped_tensor = GroupedTensor.make_grouped_tensor(
```
This is not graph safe for MOE activation inputs. Maybe we should make a make_grouped_tensor_graph_safe API?
```python
def __init__(
    self,
    num_tensors: int,
    shape: List[Tuple[int, int]],
```
We cannot do this; shape is supposed to be a tensor on device, and we should have a tiny kernel that computes the offsets.
Edit:
It's okay for weights where shapes are static, but for MoE activations this needs to be changed.
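For the MoE-activation case, a hedged sketch of what such a device-side computation could look like (the helper name and signature are assumptions): a prefix sum over on-device m_splits produces the offsets without any host synchronization.

```python
import torch

def device_offsets(m_splits: torch.Tensor, hidden_size: int) -> torch.Tensor:
    """Compute per-tensor element offsets on device from per-expert row counts.

    m_splits: integer tensor of shape (num_tensors,) living on the GPU.
    Returns offsets[i] = number of elements before tensor i, also on the GPU,
    so no device-to-host copy (and no CUDA-graph break) is needed.
    """
    elements = m_splits.to(torch.int64) * hidden_size
    # Exclusive prefix sum: [0, e0, e0 + e1, ...]
    return torch.cumsum(torch.cat([elements.new_zeros(1), elements[:-1]]), dim=0)
```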
```python
no_quantization = quantizers is None or len(quantizers) == 0 or quantizers[0] is None

# TODO(ksivaman): (Do we need multiple quantizers?)
```
I think we shouldn't have multiple quantizers. Multiple quantizers only make sense for recipes like delayed scaling, where the quantizer holds amax values. For blockwise current-scaling recipes like MXFP8 and NVFP4, the quantizer only holds configuration (for example, whether we are doing RHT), so a single quantizer is enough, and it's better for reducing CPU overhead since pybind doesn't need to convert multiple objects.
Having separate quantizers also doesn't make sense for CUDA graphs: multiple quantizers implicitly assume splitting the input, which in turn assumes the host knows the input shape for each expert, and that breaks CUDA graph capture.
Agree. One customer has reviewed this PR and commented that having multiple quantizers would introduce CPU overheads when calling quantizer.update_usage/set_usage/... if the number of tensors is very large, for example, in FSDP.
```python
total_scale_elements += scale_elements
if i < num_tensors - 1:
    scale_inv_offsets.append(total_scale_elements)
scale_inv = torch.empty(total_scale_elements, dtype=torch.uint8, device=device)
```
The number of scale elements cannot be calculated by summing the per-expert split elements for activation inputs where m_splits is now on-device. Instead, we should assume that the 128x128-aligned shape has been satisfied (otherwise no grouped quantize is possible and we fall back), and then directly get the number of scale elements as M * N / 16 (NVFP4), where M is the total number of tokens across all experts. Because you cannot know the per-expert M, you have to compute the total this way and assume padding has already been satisfied.
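A small worked example of the suggested calculation, assuming NVFP4's one scale per 16 elements and that padding/alignment is already satisfied (all numbers are made up):

```python
# Hedged arithmetic sketch: total NVFP4 scale elements from the total token
# count, without knowing per-expert splits on the host.
total_tokens = 8192          # M: sum of tokens over all experts (assumed padded/aligned)
hidden_size = 4096           # N
scale_block = 16             # one scale per 16 elements along the row (NVFP4)

total_scale_elements = total_tokens * hidden_size // scale_block
print(total_scale_elements)  # 8192 * 4096 / 16 = 2_097_152
```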
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
```python
if defer_init:
    return

weights = [getattr(self, f"weight{i}") for i in range(self.num_gemms)]
```
How about the case where we do FP8 param gather? Each weight in this case is already a Float8Tensor.
```cpp
/*! \enum Float8BlockScaleTensorFormat
 *  \brief Data format for an FP8 block-scaled tensor
 */
enum class Float8BlockScaleTensorFormat {
```
This flag has been deprecated. Also, for a grouped tensor the format is always GEMM_READY because there is no TP.
* changes for pytorch extension; but everything seems to be broken, probably unrelated to my changes
  Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* fix the issues
  Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* comment nvte API since Oleg's PR is not merged
  Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* test for all cases
  Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* tensor attributes should be set later
  Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
---------
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
for more information, see https://pre-commit.ci
```python
quantized_tensors = self.split_into_quantized_tensors()
for i in range(self.num_tensors):
    self.quantizer.update_quantized(tensors[i], quantized_tensors[i], noop_flag=noop_flag)
```
Shouldn't this be calling the group_quantize from the C side? Why is it still in a loop of num_tensors iterations? Thanks.
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
```cpp
GetTransformerEngineDType(tensor_offsets.scalar_type()),
getTensorShape(tensor_offsets));
}
```
Could we add the with_gemm_swizzled_scales support here too please?
```python
def split_into_quantized_tensors(
    self,
) -> List[Union[QuantizedTensorStorage, torch.Tensor]]:
```
Could this function be extended to support any number n of tensors that satisfies "num_tensors is divisible by n"? This would help with reshaping efforts, for example, in attention.
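The requested extension could look roughly like this assumed helper, which regroups the flat data into n chunks when num_tensors is divisible by n:

```python
from typing import List

import torch

def split_into_n_groups(data: torch.Tensor, num_tensors: int, n: int) -> List[torch.Tensor]:
    """Illustrative only: regroup a flat buffer holding `num_tensors` same-sized
    tensors into `n` views of `num_tensors // n` consecutive tensors each."""
    if num_tensors % n != 0:
        raise ValueError(f"num_tensors={num_tensors} is not divisible by n={n}")
    # torch.chunk returns equal-sized views when the split is exact.
    return list(torch.chunk(data, n, dim=0))
```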
```python
from .nvfp4_tensor_storage import NVFP4TensorStorage


class GroupedTensor:
```
Should this be GroupedTensor or GroupedTensorStorage, like with the other low-precision tensor types?
| f"GroupedTensor(num_tensors={self.num_tensors}, " | ||
| f"shape={self.shape}, " | ||
| f"logical_shape={self.logical_shape}, " | ||
| f"dtype={self.get_dtype()})" |
Should we add the quantizer information here?
```python
- logical_shape provides the conceptual 2D interpretation
- All data is stored on device in contiguous layout

Note: This structure is used only for combined storage of multiple tensors with the same dtype and scaling mode.
```
This seems to be a duplicate of the first sentence in the docstring.
```python
# Offsets for indexing into contiguous 1D layout (OPTIONAL - not needed if all_same_shape())
# tensor_offsets[i] = element offset to start of tensor i (cumulative sum of numel for tensors 0..i-1)
# Usage: tensor_i_ptr = data.data_ptr() + tensor_offsets[i] * element_size
# If None and all_same_shape(): offset[i] = i * M * N (where M, N are common dimensions)
```
At first glance, these two tensors are confusing. Maybe we could explain the difference as one being the offset on the element level, one at the byte level?
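To illustrate the suggested distinction (a toy example, not the PR's code): element-level offsets index the typed buffer, and byte-level offsets are just those offsets scaled by the element size.

```python
import torch

# Illustrative only: one flat buffer holding two tensors of 6 and 10 elements.
data = torch.arange(16, dtype=torch.float16)
numels = [6, 10]

# Element-level offsets: cumulative numel of the preceding tensors.
tensor_offsets = [0, 6]

# Byte-level offsets: element offset scaled by the element size (2 bytes for fp16).
byte_offsets = [off * data.element_size() for off in tensor_offsets]  # [0, 12]

# Tensor 1 recovered as a view from its element offset; its pointer matches the byte offset.
t1 = data[tensor_offsets[1] : tensor_offsets[1] + numels[1]]
assert t1.data_ptr() == data.data_ptr() + byte_offsets[1]
```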
```python
@staticmethod
def make_grouped_tensor_with_shapes(
    num_tensors: int,
    shape: List[Tuple[int, int]],
```
Could this be named "shapes" please?
```python
    dtype: Optional[torch.dtype] = None,
) -> GroupedTensor:
    """
    Create a GroupedTensor for storing multiple weight tensors of the same shape.
```
Why "weight" tensors?
```python
            )
            result.append(tensor_data)
    else:
        raise RuntimeError("GroupedTensor has no data to split")
```
The if/elif/else code seems to be the same for both the "offsets is None" and "offsets is not None" cases, so maybe we can combine the two? Similarly, can the "data" part of the split for the "no_quantization" and "quantization" paths be combined, so there's only one "for i in range(self.num_tensors)" loop?
```python
]


def make_quantizer(quantization: str, num_tensors: int, shape: List[Tuple[int, int]]) -> Quantizer:
```
We don't need "num_tensors" as an argument here anymore, I think, because we assume all tensors in the group use the same kind of quantizer.
```python
    Quantize the GroupedTensor inplace.
    """

    quantized_tensors = self.split_into_quantized_tensors()
```
Should "quantized_tensors" here be "self.quantized_tensors"?
for more information, see https://pre-commit.ci
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Description
#2388 introduced the GroupedTensor class in the core library. This PR partly integrates this functionality into the PyTorch bindings.

Type of change

Changes
- Add PyTorch support for the GroupedTensor class.
- Integrate GroupedTensor into GroupedLinear such that the parameters are contiguous.
- Expose a grouped_quantize API to Python, similar to split_quantize, which returns a quantized GroupedTensor that can be directly consumed by the GEMMs ([common] Add support for cuBLASLt GEMM for GroupedTensor #2502).

Checklist: