
feat(quant): support tensorwise fp8 w8a8 (--quant_type fp8w8a8-pt) #1366

Open
sufubao wants to merge 1 commit into ModelTC:stream_fc from sufubao:support_fp8_pt

sufubao commented Jun 22, 2026

Collaborator

What

Adds a tensorwise (per-tensor) FP8 W8A8 quantization method, selectable via --quant_type fp8w8a8-pt (alias vllm-fp8w8a8-pt).

How

New FP8w8a8PerTensorQuantizationMethod in lightllm/common/quantization/w8a8.py, subclassing the existing FP8w8a8QuantizationMethod and overriding only what differs for per-tensor quantization:

  • quantize: scaled_fp8_quant(..., use_per_token_if_dynamic=False) → a single scalar weight scale, broadcast across the sub-weight's output channels.
  • apply: scaled_fp8_quant(input, ..., use_per_token_if_dynamic=False) → a single per-tensor activation scale, then cutlass_scaled_mm.

The weight scale is kept in the inherited per-channel storage layout (scalar broadcast across channels). This lets fused weights (qkv / gate_up) each get their own correct per-tensor scale while the cutlass per-channel weight path consumes them unchanged; per-tensor activation combines fine with per-channel weight scales in cutlass. _create_weight is reused from the parent.
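The broadcast layout described above can be illustrated with a minimal, dependency-free sketch. It is not the code from this PR: the helper names (per_tensor_scale, fused_channel_scales) and the toy weights are hypothetical, and it assumes the usual absmax convention scale = max|w| / 448, where 448 is the largest finite value of the FP8 E4M3 format. It only shows how each sub-weight of a fused weight gets its own scalar scale, repeated once per output channel.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_tensor_scale(weight):
    """Return one scalar scale for a whole 2-D weight (a list of rows).

    Assumes a non-zero weight; a real implementation would guard against amax == 0.
    """
    amax = max(abs(v) for row in weight for v in row)
    return amax / FP8_E4M3_MAX


def fused_channel_scales(sub_weights):
    """Broadcast each sub-weight's scalar scale across its own output channels.

    sub_weights: 2-D weights shaped [out_channels, in_features], stacked along
    the output dimension as in a fused qkv or gate_up weight.
    """
    scales = []
    for w in sub_weights:
        s = per_tensor_scale(w)
        scales.extend([s] * len(w))  # one entry per output channel (row)
    return scales


# Hypothetical toy weights: q has 2 output channels, k and v have 1 each.
q = [[0.5, -448.0], [1.0, 2.0]]
k = [[-224.0, 3.0]]
v = [[0.0, 112.0]]

channel_scales = fused_channel_scales([q, k, v])
print(channel_scales)  # [1.0, 1.0, 0.5, 0.25]
```

The resulting vector has the shape a per-channel kernel expects, yet every channel within a given sub-weight carries the same value, which is what makes the quantization per-tensor in effect.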

Also documents the new value in the --quant_type CLI help.

Verification

  • Both fp8w8a8-pt and vllm-fp8w8a8-pt resolve to the new class in QUANTMETHODS.
  • black --line-length=120 and flake8 (project config) pass.
  • End-to-end GEMM and accuracy tests have not been run yet (they require a GPU and vllm at launch, the same dependency as the existing fp8w8a8 method).

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new quantization method, FP8w8a8PerTensorQuantizationMethod (vllm-fp8w8a8-pt), which implements tensorwise (per-tensor) dynamic FP8 W8A8 quantization. It also updates the command-line interface help text to document this new option. The review feedback suggests replacing .view(-1) with .flatten() when extracting the scalar from weight_scale to safely handle potential 0-dimensional tensors and avoid compatibility errors.


Comment thread lightllm/common/quantization/w8a8.py Outdated
weight.cuda(self.device_id_), scale=None, use_per_token_if_dynamic=False
)
output.weight.copy_(qweight)
output.weight_scale.fill_(weight_scale.view(-1)[0])


medium

Using .view(-1) on a 0-dimensional (scalar) tensor in PyTorch can raise a RuntimeError (e.g., view size is not compatible with input tensor size and stride). Since weight_scale is a per-tensor scale and might be returned as a 0-dimensional tensor, it is safer to use .flatten()[0] or .reshape(-1)[0] which gracefully handles 0-dimensional tensors without triggering host-device synchronization.

Suggested change
- output.weight_scale.fill_(weight_scale.view(-1)[0])
+ output.weight_scale.fill_(weight_scale.flatten()[0])

Add TritonFP8w8a8PerTensorQuantizationMethod (triton-fp8w8a8-pertensor/-pt)
with deferred staging for fused multi-split and per-expert weights. Support
per-tensor weight scale in the grouped MoE matmul (WEIGHT_SCALE_PER_TENSOR)
and the per-token scaled_mm kernel (B_SCALE_IS_TENSOR).