[Other] Refactor dynamic cache quant test #7092
Wanglongzhi2001 wants to merge 3 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
Pull request overview
This PR refactors the unit-test structure for dynamic KV cache quantization (dynamic C8 / C16) on FlashAttentionBackend and FlashMaskAttentionBackend, and introduces an extensible quantization-config registry so that more quantization types can be added later.
Changes:
- Use QuantConfig + QUANT_CONFIGS to manage the test configuration and cache layout of each cache quantization type in one place.
- Replace the previous "routing assertion" style mock tests with: a mock smoke test, a mock diff test (C8 vs C16), and an optional real-GPU forward test.
- Trim/remove test blocks unrelated to this file's topic (e.g. the softmax -inf fix and the kernel-config mapping tests).
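The registry pattern described above might look like the following sketch. The field names, the `block_wise_fp8` cache layout, and the `cache_base_index` helper are assumptions for illustration, not the PR's actual code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantConfig:
    """Hypothetical per-quant-type test configuration."""

    name: str
    cache_dtype: str       # dtype of the KV cache tensors
    caches_per_layer: int  # 2 = (K, V); 4 = (K, V, K_scale, V_scale)
    is_dynamic: bool


# Registry keyed by quant type name; a new quant type is one new entry.
QUANT_CONFIGS = {
    "none": QuantConfig("none", "bfloat16", caches_per_layer=2, is_dynamic=False),
    "block_wise_fp8": QuantConfig("block_wise_fp8", "uint8", caches_per_layer=4, is_dynamic=True),
}


def cache_base_index(config: QuantConfig, layer_id: int) -> int:
    # A layer's first cache slot is layer_id scaled by the layout stride,
    # so dynamic (4-slot) and plain (2-slot) layouts index differently.
    return config.caches_per_layer * layer_id
```

Under this layout, a `layer_id=1` test with `block_wise_fp8` would start reading at slot 4, while an unquantized cache would start at slot 2.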
```python
def _run_forward_mocked(backend, module_path, quant_config, layer_id=0, return_tensor=None, qkv_inputs=None):
    """Run forward_mixed with mocked external ops, return the result.

    Args:
        backend: The attention backend instance.
        module_path: Module path for patching ops.
        quant_config: QuantConfig to use.
        layer_id: Layer ID for the dummy layer.
        return_tensor: If provided, mock append_attention to return this tensor.
        qkv_inputs: Optional (q, k, v, qkv) tuple. Generated if not provided.
    """
    backend.attention_metadata = DummyMetadata()
    layer = DummyLayer(layer_id=layer_id, quant_config=quant_config)
    caches = make_caches(quant_config, layer_id=layer_id)
    fm = DummyForwardMeta(caches=caches, max_len_val=0)

    if qkv_inputs is None:
        q, k, v, qkv = make_qkv_inputs()
    else:
        q, k, v, qkv = qkv_inputs

    if return_tensor is None:
        return_tensor = paddle.zeros([BATCH_SIZE, ATTN_OUTPUT_DIM], dtype="bfloat16")

    with patch(f"{module_path}.append_attention", return_value=return_tensor):
        with patch(f"{module_path}.get_block_shape_and_split_kv_block"):
            result = backend.forward_mixed(
                q=q,
                k=k,
                v=v,
                qkv=qkv,
                compressed_kv=None,
                k_pe=None,
                layer=layer,
                forward_meta=fm,
            )
    return result
```
The current mock-based tests only verify that forward_mixed "does not error / produces identical output"; they no longer check that cache and scale routing is correct under dynamic quant (e.g. that block_wise_fp8 indexes caches at `4 * layer_id`, while the non-dynamic path indexes at `2 * layer_id` and reads scales from layer.cache_*_scale). The core behavior these tests were meant to protect against regressions is therefore uncovered. Consider restoring assertions on the call arguments of append_attention (and of gqa_rope_write_cache on the prefill path) in _run_forward_mocked or in the smoke/diff cases, and re-adding cases such as layer_id=1.
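A minimal sketch of the kind of call-argument assertion suggested above. A plain-Python stand-in replaces the real backend and append_attention (the stride-4 vs stride-2 indexing mirrors the comment; all names here are illustrative, not the real API):

```python
from unittest.mock import MagicMock


def forward_mixed_stub(caches, layer_id, cache_quant_type, append_attention):
    # Stand-in for the real forward_mixed: dynamic block_wise_fp8 stores
    # (K, V, K_scale, V_scale) per layer -> stride 4; the plain layout
    # stores (K, V) per layer -> stride 2.
    base = 4 * layer_id if cache_quant_type == "block_wise_fp8" else 2 * layer_id
    return append_attention(key_cache=caches[base], value_cache=caches[base + 1])


caches = [f"cache_{i}" for i in range(8)]  # dummy cache slots
mock_append = MagicMock(return_value="out")

forward_mixed_stub(caches, layer_id=1, cache_quant_type="block_wise_fp8",
                   append_attention=mock_append)

# Assert the routing, not just "no error": layer 1 under dynamic quant
# must be handed cache slots 4 and 5.
assert mock_append.call_args.kwargs["key_cache"] == "cache_4"
assert mock_append.call_args.kwargs["value_cache"] == "cache_5"
```

The same pattern applies to the mocked gqa_rope_write_cache on the prefill path: inspect `call_args` on the patch object instead of only checking the returned tensor.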
```python
@unittest.skipIf(not _HAS_GPU, "No GPU available")
@unittest.skipIf(_IMPORT_ERROR is not None, f"Cannot import backends: {_IMPORT_ERROR}")
class TestBackendForwardGPU(unittest.TestCase):
    """GPU-based tests: real forward_mixed calls on GPU hardware."""

    def _gpu_smoke_test(self, backend_class, module_path, quant_config_name):
        """Test that forward_mixed runs on GPU without error."""
        config = QUANT_CONFIGS[quant_config_name]
        backend = create_backend(backend_class, module_path)
        backend.attention_metadata = DummyMetadata()

        max_block_num = BATCH_SIZE
        caches = _make_gpu_caches(config, max_block_num=max_block_num)
        layer = DummyLayer(layer_id=0, quant_config=config)
        fm = _make_gpu_forward_meta(caches, seq_len=1)
        q, k, v, qkv = make_qkv_inputs()

        result = backend.forward_mixed(
            q=q,
            k=k,
            v=v,
            qkv=qkv,
            compressed_kv=None,
            k_pe=None,
            layer=layer,
            forward_meta=fm,
        )
```
The GPU cases only check paddle.is_compiled_with_cuda() / device_count, without an explicit paddle.set_device("gpu"). In an environment whose default device is still CPU, the qkv/caches created here may all live on CPU, so forward_mixed either hits a place mismatch or silently runs a CPU kernel, defeating the purpose of "real GPU coverage". Consider setting the device in this TestCase's setUp/setUpClass and restoring it in tearDown, so that both the tensors and the ops execute on the GPU.
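The save/restore fixture suggested above could look like this sketch. So that the pattern is runnable here without paddle, `get_device`/`set_device` are plain-Python stand-ins; in the real test file they would be `paddle.get_device()` / `paddle.set_device()`:

```python
import unittest

# Stand-in device registry (assumption, for illustration only).
_DEVICE = {"current": "cpu"}


def get_device():
    return _DEVICE["current"]


def set_device(name):
    _DEVICE["current"] = name


class TestBackendForwardGPU(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Remember the suite's current device, then force GPU for this class.
        cls._prev_device = get_device()
        set_device("gpu")  # paddle.set_device("gpu") in the real file

    @classmethod
    def tearDownClass(cls):
        # Restore whatever device the rest of the suite was using.
        set_device(cls._prev_device)

    def test_device_is_gpu(self):
        self.assertEqual(get_device(), "gpu")
```

Restoring in tearDownClass keeps the forced device from leaking into other test files run in the same process.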
```python
# ---------------------------------------------------------------------------
# Part 3: GPU-based tests (require real GPU)
# ---------------------------------------------------------------------------

FLASH_MASK_MODULE = "fastdeploy.model_executor.layers.attention.flash_mask_attn_backend"
_HAS_GPU = paddle.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0
```
This refactor deleted tests such as TestSoftmaxInfinityHandling and TestAppendCacheKVC8KernelConfig, but no identically named or equivalent cases exist elsewhere under tests/, which may regress coverage of the softmax -inf fix and of the append_cache_kv_c8 config mapping. If these tests are still valuable, consider migrating them to a more appropriate test file instead of removing them outright, or state in the PR description why they were removed and what now covers those paths.
```python
# Backend registry for parameterized tests
BACKENDS = [
    ("flash_attn", FlashAttentionBackend, FLASH_ATTN_MODULE),
    ("flash_mask", FlashMaskAttentionBackend, FLASH_MASK_MODULE),
]
```
The BACKENDS variable is unused in this file (and referenced nowhere else in the repository); with flake8 enabled it will trigger F841 and fail pre-commit/CI. Either delete the variable, or use it to parameterize the smoke/diff/GPU cases (e.g. by iterating over it with subTest or a for loop).
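A hypothetical sketch of the subTest parameterization suggested above. DummyFlashAttn/DummyFlashMask and the module-path strings stand in for the real backend classes and paths:

```python
import unittest


# Stand-ins for the real backend classes (assumptions, for illustration).
class DummyFlashAttn:
    pass


class DummyFlashMask:
    pass


BACKENDS = [
    ("flash_attn", DummyFlashAttn, "pkg.flash_attn_backend"),
    ("flash_mask", DummyFlashMask, "pkg.flash_mask_attn_backend"),
]


class TestSmokeParameterized(unittest.TestCase):
    def test_smoke_all_backends(self):
        # One test body, one subTest per registered backend: BACKENDS is
        # actually consumed, and each backend's failure is reported separately.
        for name, backend_class, module_path in BACKENDS:
            with self.subTest(backend=name):
                backend = backend_class()
                self.assertIsInstance(backend, backend_class)
```

With this, adding a backend to the registry automatically extends the smoke/diff/GPU coverage without new test methods.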
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##           develop    #7092   +/- ##
==========================================
  Coverage         ?   73.62%
==========================================
  Files            ?      402
  Lines            ?    56432
  Branches         ?     8903
==========================================
  Hits             ?    41549
  Misses           ?    11950
  Partials         ?     2933

Flags with carried forward coverage won't be shown. View full report in Codecov.
Motivation
Refactor dynamic cache quant test
Modifications
Refactor dynamic cache quant test
Usage or Command
Accuracy Tests
Checklist
- Choose at least one PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If submitting to the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.