Commit b2641ae

Author: niushengxiao
Commit message: feat: fp8kv support
1 parent f5ee4c3 commit b2641ae

17 files changed: 1952 additions & 1592 deletions

File tree

docs/CN/source/index.rst

Lines changed: 1 addition & 0 deletions

@@ -49,6 +49,7 @@ Lightllm integrates the strengths of many open-source solutions, including but not limited to FasterTran
     :caption: Deployment Tutorials

     DeepSeek R1 Deployment <tutorial/deepseek_deployment>
+    FP8 KV Quantization and Calibration <tutorial/fp8_kv_quantization>
     Multi-Level Cache Deployment <tutorial/multi_level_cache_deployment>
     Multimodal Deployment <tutorial/multimodal>
     Reward Model Deployment <tutorial/reward_model>
Lines changed: 149 additions & 0 deletions

@@ -0,0 +1,149 @@

.. _tutorial/fp8_kv_quantization_cn:

FP8 KV Quantization and Calibration Guide
=========================================

This chapter walks through the complete FP8 KV quantization workflow in LightLLM, including:

- Exporting a calibration file (``--export_fp8kv_calibration``)
- Running inference with the calibration file (``fp8kv``)
- Quantization-granularity differences between the FA3 and FlashInfer backends
- Common errors and troubleshooting tips

Overview
--------

LightLLM's FP8 KV quantization uses an offline calibration scheme:

1. First run export mode, which tracks the maximum absolute value of the KV entries and writes ``kv_cache_calib.json``.
2. Then load that file in inference mode, quantizing KV by scale into ``float8_e4m3fn`` storage.
Backend and Quantization Granularity
------------------------------------

Current behavior:

- ``fa3``: uses ``per_head`` (an independent scale per head)
- ``flashinfer``: uses ``per_tensor`` (one scalar scale each for K and V)

Calibration files are therefore backend-specific:

- A ``per_head`` calibration file produced with ``fa3`` is meant for ``fa3`` inference.
- A ``per_tensor`` calibration file produced with ``flashinfer`` is meant for ``flashinfer`` inference.

Mixing calibration files exported by different backends is not recommended.
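In practice the difference between the two granularities is just the shape of the scale table. The layout below is a hypothetical illustration of what each granularity implies for a 32-layer, 8-head model, not the exact structure LightLLM stores internally:

```python
num_layers, num_head = 32, 8  # hypothetical model dimensions

# per_head (fa3): one scale per layer, per K/V tensor, per head
per_head_scales = [[[1.0] * num_head for _kv in range(2)] for _l in range(num_layers)]

# per_tensor (flashinfer): one scalar per layer for K and one for V
per_tensor_scales = [[1.0, 1.0] for _l in range(num_layers)]

# per_head carries num_head times more entries per layer than per_tensor,
# which is why the two file formats cannot be swapped between backends.
print(len(per_head_scales[0][0]), len(per_tensor_scales[0]))
```

A ``per_head`` table cannot be collapsed into a ``per_tensor`` one (or vice versa) without re-running calibration, hence the no-mixing rule above.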
Step 1: Export the Calibration File
-----------------------------------

Export-mode example (FA3):

.. code-block:: console

    $ python -m lightllm.server.api_server \
        --model_dir /path/to/model \
        --export_fp8kv_calibration \
        --llm_prefill_att_backend fa3 \
        --llm_decode_att_backend fa3 \
        --disable_cudagraph

Export-mode example (FlashInfer):

.. code-block:: console

    $ python -m lightllm.server.api_server \
        --model_dir /path/to/model \
        --export_fp8kv_calibration \
        --llm_prefill_att_backend flashinfer \
        --llm_decode_att_backend flashinfer \
        --disable_cudagraph

Notes:

- With ``--export_fp8kv_calibration`` set, KV statistics are collected while the server runs.
- Once calibration completes, ``kv_cache_calib.json`` is written to the current working directory.
- Export mode requires ``--disable_cudagraph``, and ``--llm_kv_type`` must remain ``None``.
- The repository ships calibration files for common models under ``test/advanced_config/``; use them directly or as references.

Calibrating with random data via benchmark_qps.py
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Besides live traffic, you can use ``test/benchmark/service/benchmark_qps.py`` to generate random requests for calibration.

- By default, a calibration result is written after roughly 4000 accumulated inferences.
- In practice, run the following command twice to cover the statistics range more reliably.

Example command:

.. code-block:: console

    $ python test/benchmark/service/benchmark_qps.py --url http://127.0.0.1:8000/generate_stream --tokenizer_path ../Qwen3-30B-A3B --input_len 1000 --output_len 2000 --input_qps 10 --input_num 200 --range_ratio 0.9
Step 2: Start FP8 Inference with the Calibration File
-----------------------------------------------------

Inference-mode example (FA3):

.. code-block:: console

    $ python -m lightllm.server.api_server \
        --model_dir /path/to/model \
        --llm_kv_type fp8kv \
        --llm_prefill_att_backend fa3 \
        --llm_decode_att_backend fa3 \
        --kv_quant_calibration_config_path /path/to/kv_cache_calib.json

Inference-mode example (FlashInfer):

.. code-block:: console

    $ python -m lightllm.server.api_server \
        --model_dir /path/to/model \
        --llm_kv_type fp8kv \
        --llm_prefill_att_backend flashinfer \
        --llm_decode_att_backend flashinfer \
        --kv_quant_calibration_config_path /path/to/kv_cache_calib.json

Notes:

- ``fp8kv`` mode requires ``--kv_quant_calibration_config_path``.
- Keep the attention backend at inference time consistent with the one used for calibration export.

Calibration File Format
-----------------------

The main fields of the exported ``kv_cache_calib.json`` are:

- ``quant_type``: ``per_head`` or ``per_tensor``
- ``num_layers``: number of layers
- ``num_head``: total number of heads
- ``scales_shape``: shape of the scale tensor
- ``scales``: the actual scale values
- ``qmin`` / ``qmax``: FP8 range parameters

When the calibration file is loaded, the model architecture, layer count, head count, and quantization type are all validated for a match.

Multi-GPU Note
--------------

In multi-GPU (TP) setups, the system automatically slices out the scales for the heads local to the current rank.
You still only need to provide a single full ``kv_cache_calib.json``.

Common Issues
-------------

1. Startup error saying ``--kv_quant_calibration_config_path`` is required

   You passed ``--llm_kv_type fp8kv`` without supplying a calibration file path.

2. Startup error requiring ``--disable_cudagraph``

   You passed ``--export_fp8kv_calibration``; this mode must run with cudagraph disabled.

3. ``quant_type not match`` error

   Usually the backend and the calibration-file type disagree, e.g. running ``flashinfer`` with a ``per_head`` file.

4. Abnormal output quality after switching backends

   Re-export the calibration file for the target backend; do not reuse files across backends.

docs/EN/source/index.rst

Lines changed: 1 addition & 0 deletions

@@ -48,6 +48,7 @@ Documentation List
     :caption: Deployment Tutorials

     DeepSeek R1 Deployment <tutorial/deepseek_deployment>
+    FP8 KV Quantization and Calibration <tutorial/fp8_kv_quantization>
     Multi-Level Cache Deployment <tutorial/multi_level_cache_deployment>
     Multimodal Deployment <tutorial/multimodal>
     Reward Model Deployment <tutorial/reward_model>
Lines changed: 149 additions & 0 deletions

@@ -0,0 +1,149 @@

.. _tutorial/fp8_kv_quantization_en:

FP8 KV Quantization and Calibration Guide
=========================================

This chapter describes the end-to-end FP8 KV quantization workflow in LightLLM, including:

- Exporting calibration data (``--export_fp8kv_calibration``)
- Running inference with calibration data (``fp8kv``)
- Quantization-granularity differences between FA3 and FlashInfer
- Common errors and troubleshooting

Overview
--------

LightLLM uses an offline calibration flow for FP8 KV quantization:

1. Run export mode to collect KV statistics and produce ``kv_cache_calib.json``.
2. Run inference mode with that file, quantizing KV into ``float8_e4m3fn`` storage.

Backend and Quantization Granularity
------------------------------------

Current behavior:

- ``fa3``: ``per_head`` scales (an independent scale per head)
- ``flashinfer``: ``per_tensor`` scales (one scalar for K and one scalar for V)

Calibration files are backend-dependent:

- ``per_head`` files exported with ``fa3`` should be used with ``fa3`` inference.
- ``per_tensor`` files exported with ``flashinfer`` should be used with ``flashinfer`` inference.

Avoid mixing calibration files across different backends.

Step 1: Export the Calibration File
-----------------------------------

Export-mode example (FA3):

.. code-block:: console

    $ python -m lightllm.server.api_server \
        --model_dir /path/to/model \
        --export_fp8kv_calibration \
        --llm_prefill_att_backend fa3 \
        --llm_decode_att_backend fa3 \
        --disable_cudagraph

Export-mode example (FlashInfer):

.. code-block:: console

    $ python -m lightllm.server.api_server \
        --model_dir /path/to/model \
        --export_fp8kv_calibration \
        --llm_prefill_att_backend flashinfer \
        --llm_decode_att_backend flashinfer \
        --disable_cudagraph

Notes:

- Setting ``--export_fp8kv_calibration`` collects KV statistics at runtime.
- After calibration completes, ``kv_cache_calib.json`` is written to the current working directory.
- Export mode requires ``--disable_cudagraph``, and ``--llm_kv_type`` should remain ``None``.
- The repository already provides calibration files for common models under ``test/advanced_config/``; they can be used directly or as references.

Using benchmark_qps.py for random-data calibration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Besides online traffic, you can use ``test/benchmark/service/benchmark_qps.py`` to generate random requests for calibration.

- By default, one calibration result is exported after around 4000 inferences have accumulated.
- In practice, you can run the following command twice to improve coverage stability.

Example command:

.. code-block:: console

    $ python test/benchmark/service/benchmark_qps.py --url http://127.0.0.1:8000/generate_stream --tokenizer_path ../Qwen3-30B-A3B --input_len 1000 --output_len 2000 --input_qps 10 --input_num 200 --range_ratio 0.9

Step 2: Start FP8 Inference with Calibration
--------------------------------------------

Inference-mode example (FA3):

.. code-block:: console

    $ python -m lightllm.server.api_server \
        --model_dir /path/to/model \
        --llm_kv_type fp8kv \
        --llm_prefill_att_backend fa3 \
        --llm_decode_att_backend fa3 \
        --kv_quant_calibration_config_path /path/to/kv_cache_calib.json

Inference-mode example (FlashInfer):

.. code-block:: console

    $ python -m lightllm.server.api_server \
        --model_dir /path/to/model \
        --llm_kv_type fp8kv \
        --llm_prefill_att_backend flashinfer \
        --llm_decode_att_backend flashinfer \
        --kv_quant_calibration_config_path /path/to/kv_cache_calib.json

Notes:

- ``fp8kv`` requires ``--kv_quant_calibration_config_path``.
- Keep the inference backend consistent with the backend used during calibration export.

Calibration File Schema
-----------------------

Key fields in ``kv_cache_calib.json``:

- ``quant_type``: ``per_head`` or ``per_tensor``
- ``num_layers``: number of layers
- ``num_head``: total number of heads
- ``scales_shape``: shape of the scale tensor
- ``scales``: actual scale values
- ``qmin`` / ``qmax``: FP8 numeric range parameters

At load time, LightLLM validates the model architecture, layer count, head count, and quantization type.
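A loader honoring this schema can be sketched as follows. The field names come from the list above, but the function and its validation logic are a simplified illustration, not LightLLM's actual loader:

```python
import json
import os
import tempfile

def load_kv_calibration(path, model_layers, model_heads, backend_quant_type):
    """Load kv_cache_calib.json and check it matches the running model
    (hypothetical helper mirroring the validation described above)."""
    with open(path) as f:
        calib = json.load(f)
    if calib["quant_type"] != backend_quant_type:
        raise ValueError(f"quant_type not match: {calib['quant_type']!r}")
    if calib["num_layers"] != model_layers or calib["num_head"] != model_heads:
        raise ValueError("calibration file does not match model shape")
    return calib

# Minimal hand-written example file for a toy 2-layer, 4-head model.
demo = {
    "quant_type": "per_head",
    "num_layers": 2,
    "num_head": 4,
    "scales_shape": [2, 2, 4],
    "scales": [[[0.01] * 4, [0.02] * 4], [[0.01] * 4, [0.02] * 4]],
    "qmin": -448.0,
    "qmax": 448.0,
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)
    path = f.name
calib = load_kv_calibration(path, model_layers=2, model_heads=4,
                            backend_quant_type="per_head")
print(calib["quant_type"], calib["qmax"])
os.remove(path)
```

Passing ``backend_quant_type="per_tensor"`` against this file would raise the ``quant_type not match`` error described in the Common Issues section.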
Multi-GPU Note
--------------

In multi-GPU (TP) setups, LightLLM automatically slices the global scales down to each rank's local heads.
You only need to provide one full ``kv_cache_calib.json`` file.
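The slicing rule can be illustrated with a minimal sketch, assuming heads are partitioned contiguously and evenly across ranks (a common TP layout; LightLLM's actual partitioning code may differ):

```python
def local_scales(global_scales, tp_rank, tp_world_size):
    # Slice one layer's per-head scale row down to the heads owned by this rank.
    num_head = len(global_scales)
    assert num_head % tp_world_size == 0, "head count must divide evenly across ranks"
    n = num_head // tp_world_size
    return global_scales[tp_rank * n:(tp_rank + 1) * n]

# 8 global heads split across 4 TP ranks -> 2 heads (and 2 scales) per rank
scales = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
print(local_scales(scales, tp_rank=1, tp_world_size=4))  # [0.3, 0.4]
```

Because each rank derives its slice from the same full table, the single exported file stays valid across different TP degrees as long as the head count divides evenly.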
Common Issues
-------------

1. Error saying ``--kv_quant_calibration_config_path`` is required

   You are using ``--llm_kv_type fp8kv`` without a calibration file path.

2. Error saying ``--disable_cudagraph`` is required

   You are using ``--export_fp8kv_calibration``; this mode requires cudagraph to be disabled.

3. ``quant_type not match`` error

   Usually caused by a backend/file mismatch (for example, using a ``per_head`` file with ``flashinfer``).

4. Abnormal quality after switching backends

   Re-export calibration with the target backend instead of reusing files across backends.

lightllm/common/basemodel/attention/create_utils.py

Lines changed: 4 additions & 0 deletions

@@ -36,6 +36,10 @@
         "fa3": Fp8Fa3AttBackend,
         "flashinfer": Fp8FlashInferAttBackend,
     },
+    "fp8kv": {
+        "fa3": Fp8Fa3AttBackend,
+        "flashinfer": Fp8FlashInferAttBackend,
+    },
 }

 mla_data_type_to_backend = {

lightllm/common/basemodel/attention/fa3/fp8.py

Lines changed: 8 additions & 6 deletions

@@ -89,19 +89,21 @@ def _fp8_prefill_att(
     ) -> torch.Tensor:
         self.backend: Fp8Fa3AttBackend = self.backend  # for typing

+        q_head_num = q.shape[1]
+        q_head_dim = q.shape[2]
+        k_head_num = k.shape[1]
         q, q_scale = q_per_head_fp8_quant(
-            q,
+            q.reshape(q.shape[0], k_head_num, -1),
             self.infer_state.b_seq_len,
             self.cu_seqlens_q,
-            self.mid_token_batch_ids,
+            token_batch_ids=self.mid_token_batch_ids,
         )
-        k_head_num = k.shape[1]
         k_head_dim = k.shape[2]
         cache_k = k.view(-1, 1, k_head_num, k_head_dim).view(torch.float8_e4m3fn)
         cache_v = v.view(-1, 1, k_head_num, k_head_dim).view(torch.float8_e4m3fn)
         layer_index = self.backend._find_layer_index(k=cache_k, v=cache_v, att_state=self)
         o = flash_attn_with_kvcache(
-            q=q,
+            q=q.reshape(-1, q_head_num, q_head_dim),
             k_cache=cache_k,
             v_cache=cache_v,
             page_table=self.page_table,
@@ -200,9 +202,9 @@ def _fp8_decode_att(
         layer_index = self.backend._find_layer_index(k=cache_k, v=cache_v, att_state=self)

         q_head_num = q.shape[1]
-        q, q_scale = scaled_fp8_quant(q.view(q.shape[0] * k_head_num, -1), use_per_token_if_dynamic=True)
+        q, q_scale = scaled_fp8_quant(q.reshape(q.shape[0] * k_head_num, -1), use_per_token_if_dynamic=True)
         o = flash_attn_with_kvcache(
-            q=q.view(-1, q_head_num, k_head_dim),
+            q=q.reshape(-1, q_head_num, k_head_dim),
             k_cache=cache_k,
             v_cache=cache_v,
             page_table=self.page_table,

lightllm/common/basemodel/attention/flashinfer/fp8.py

Lines changed: 2 additions & 2 deletions

@@ -20,7 +20,7 @@ def create_att_decode_state(self, infer_state) -> "Fp8FlashInferDecodeAttState":

 @dataclasses.dataclass
 class Fp8FlashInferPrefillAttState(FlashInferPrefillAttState):
-    offline_scales: torch.Tensor = None
+    offline_scales: list = None

     def init_state(self):
         super().init_state()
@@ -68,7 +68,7 @@ def _fp8_prefill_att(

 @dataclasses.dataclass
 class Fp8FlashInferDecodeAttState(FlashInferDecodeAttState):
-    offline_scales: torch.Tensor = None
+    offline_scales: list = None

     def init_state(self):
         super().init_state()
