feat(quant): support tensorwise fp8 w8a8 (--quant_type fp8w8a8-pt) by sufubao · Pull Request #1366 · ModelTC/LightLLM

sufubao · 2026-06-22T08:15:17Z

What

Adds a tensorwise (per-tensor) FP8 W8A8 quantization method, selectable via --quant_type fp8w8a8-pt (alias vllm-fp8w8a8-pt).

How

New FP8w8a8PerTensorQuantizationMethod in lightllm/common/quantization/w8a8.py, subclassing the existing FP8w8a8QuantizationMethod and overriding only what differs for per-tensor:

quantize: scaled_fp8_quant(..., use_per_token_if_dynamic=False) → a single scalar weight scale, broadcast across the sub-weight's output channels.
apply: scaled_fp8_quant(input, ..., use_per_token_if_dynamic=False) → a single per-tensor activation scale, then cutlass_scaled_mm.
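As a rough illustration of what "per-tensor" means in the two steps above, here is a minimal pure-Python sketch. It does not call vllm or torch. The helper names (`per_tensor_scale`, `quantize_per_tensor`, `scaled_mm`) are hypothetical, the FP8 E4M3 maximum of 448 is an assumption about the target dtype, and rounding onto the FP8 grid is left out so that only the scaling logic is visible.

```python
# Finite maximum of the FP8 E4M3 (e4m3fn) format; assumed to be the target dtype.
FP8_E4M3_MAX = 448.0


def per_tensor_scale(matrix):
    """Return one scalar scale for the whole matrix: amax / FP8_E4M3_MAX."""
    amax = max(abs(v) for row in matrix for v in row)
    # Guard against an all-zero tensor so later divisions stay defined.
    return max(amax, 1e-12) / FP8_E4M3_MAX


def quantize_per_tensor(matrix):
    """Divide every element by the same scale and clamp to the FP8 range.

    Rounding to representable FP8 values is deliberately omitted.
    """
    scale = per_tensor_scale(matrix)
    q = [[min(max(v / scale, -FP8_E4M3_MAX), FP8_E4M3_MAX) for v in row] for row in matrix]
    return q, scale


def scaled_mm(qa, a_scale, qw, w_scales):
    """out[i][j] = a_scale * w_scales[j] * sum_k qa[i][k] * qw[j][k].

    qa is (tokens x in), qw is (out x in); w_scales holds one scale per output channel.
    """
    return [
        [a_scale * w_scales[j] * sum(x * y for x, y in zip(a_row, w_row)) for j, w_row in enumerate(qw)]
        for a_row in qa
    ]


activations = [[1.0, -2.0], [0.5, 4.0]]
weight = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]]  # 3 output channels, 2 inputs

qa, a_scale = quantize_per_tensor(activations)  # single activation scale
qw, w_scale = quantize_per_tensor(weight)       # single weight scale
# Broadcast the scalar weight scale across the output channels.
out = scaled_mm(qa, a_scale, qw, [w_scale] * len(weight))
```

Because rounding is skipped, `out` matches the plain matrix product up to floating-point error; with real FP8 rounding the result would only be approximate.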

The weight scale is kept in the inherited per-channel storage layout (scalar broadcast across channels). This lets fused weights (qkv / gate_up) each get their own correct per-tensor scale while the cutlass per-channel weight path consumes them unchanged; per-tensor activation combines fine with per-channel weight scales in cutlass. _create_weight is reused from the parent.
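The layout described above can be sketched in pure Python as follows. This is an illustration only: `tensor_scale` and `fused_channel_scales` are hypothetical helpers, the example weights are made up, and 448 is the assumed FP8 E4M3 maximum. Each sub-weight of a fused weight gets its own scalar scale, repeated once per output channel, so a consumer that expects one scale per channel needs no changes.

```python
FP8_E4M3_MAX = 448.0  # assumed finite max of the FP8 E4M3 (e4m3fn) format


def tensor_scale(sub_weight):
    """One scalar scale for a whole sub-weight (a list of output-channel rows)."""
    amax = max(abs(v) for row in sub_weight for v in row)
    return max(amax, 1e-12) / FP8_E4M3_MAX


def fused_channel_scales(sub_weights):
    """Build a per-channel scale vector for sub-weights stacked along the output dim.

    Each sub-weight contributes its own scalar, repeated once per output channel.
    """
    scales = []
    for sub in sub_weights:
        scales.extend([tensor_scale(sub)] * len(sub))
    return scales


# Hypothetical fused qkv weight: q has 2 output channels, k and v have 1 each.
q = [[448.0, 1.0], [2.0, 3.0]]
k = [[0.5, -224.0]]
v = [[112.0, 0.0]]

scales = fused_channel_scales([q, k, v])
print(scales)  # [1.0, 1.0, 0.5, 0.25]
```

The two q channels share one scale, while k and v each keep their own. A single scale over the whole fused tensor would instead force k and v to use q's larger range.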

Also documents the new value in the --quant_type CLI help.

Verification

Both fp8w8a8-pt and vllm-fp8w8a8-pt resolve to the new class in QUANTMETHODS.
black --line-length=120 and flake8 (project config) pass.
End-to-end GEMM/accuracy not yet run (requires GPU + vllm at launch, same dependency as existing fp8w8a8).
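The alias check in the first verification item can be pictured with a small registry sketch. It is not LightLLM's actual QUANTMETHODS implementation: `QuantMethodRegistry`, its `register`/`get` methods, and `PerTensorMethod` are hypothetical stand-ins that only show how two names can resolve to the same class.

```python
class QuantMethodRegistry:
    """Hypothetical name -> class registry; not LightLLM's real code."""

    def __init__(self):
        self._methods = {}

    def register(self, names):
        """Decorator that registers one class under one or more names."""
        if isinstance(names, str):
            names = [names]

        def decorator(cls):
            for name in names:
                if name in self._methods:
                    raise ValueError(f"duplicate quant method name: {name}")
                self._methods[name] = cls
            return cls

        return decorator

    def get(self, name):
        """Look up the class registered under `name` (KeyError if unknown)."""
        return self._methods[name]


REGISTRY = QuantMethodRegistry()  # stand-in for QUANTMETHODS


@REGISTRY.register(["vllm-fp8w8a8-pt", "fp8w8a8-pt"])
class PerTensorMethod:  # stand-in for FP8w8a8PerTensorQuantizationMethod
    pass


# Both the short name and the alias resolve to the same class object.
assert REGISTRY.get("fp8w8a8-pt") is REGISTRY.get("vllm-fp8w8a8-pt")
```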

gemini-code-assist

Code Review

This pull request introduces a new quantization method, FP8w8a8PerTensorQuantizationMethod (vllm-fp8w8a8-pt), which implements tensorwise (per-tensor) dynamic FP8 W8A8 quantization. It also updates the command-line interface help text to document this new option. The review feedback suggests replacing .view(-1) with .flatten() when extracting the scalar from weight_scale to safely handle potential 0-dimensional tensors and avoid compatibility errors.

gemini-code-assist · 2026-06-22T08:16:45Z

+            weight.cuda(self.device_id_), scale=None, use_per_token_if_dynamic=False
+        )
+        output.weight.copy_(qweight)
+        output.weight_scale.fill_(weight_scale.view(-1)[0])


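In PyTorch, .view(-1) requires the tensor's size and stride to be compatible with the requested shape and raises a RuntimeError otherwise (e.g., view size is not compatible with input tensor's size and stride), which happens for some non-contiguous tensors. A 0-dimensional (scalar) tensor is contiguous, so .view(-1) does work on it, but the layout of the weight_scale returned here is not guaranteed. It is therefore safer to use .flatten()[0] or .reshape(-1)[0]: both handle 0-dimensional and non-contiguous tensors (copying if needed), and the indexed value stays on the device, so no host-device synchronization is triggered.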

Suggested change

-        output.weight_scale.fill_(weight_scale.view(-1)[0])
+        output.weight_scale.fill_(weight_scale.flatten()[0])

Add TritonFP8w8a8PerTensorQuantizationMethod (triton-fp8w8a8-pertensor/-pt) with deferred staging for fused multi-split and per-expert weights. Support per-tensor weight scale in the grouped MoE matmul (WEIGHT_SCALE_PER_TENSOR) and the per-token scaled_mm kernel (B_SCALE_IS_TENSOR).

gemini-code-assist Bot reviewed Jun 22, 2026

sufubao force-pushed the support_fp8_pt branch from 1da1d20 to 408a840 Compare June 22, 2026 10:04

sufubao wants to merge 1 commit into ModelTC:stream_fc from sufubao:support_fp8_pt
