[Other] Refactor dynamic cache quant test #7092
Wanglongzhi2001 wants to merge 3 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
Pull request overview
This PR refactors the unit-test structure for dynamic KV cache quantization (dynamic C8 / C16) on FlashAttentionBackend and FlashMaskAttentionBackend, and introduces an extensible quantization-config registry so that more quantization types can be added later.
Changes:
- Use QuantConfig + QUANT_CONFIGS to manage the test configuration and cache layout of each cache quantization type in one place.
- Replace the previous "routing assertion" style mock tests with: a mock smoke test, a mock diff test (C8 vs C16), and an optional real-GPU forward test.
- Trim/remove test blocks unrelated to this file's topic (e.g. the softmax -inf fix and the kernel-config mapping tests).
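The registry pattern described above might look like the following sketch. The field names, the `block_wise_fp8` cache layout, and the `cache_base_index` helper are assumptions for illustration, not the PR's actual code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantConfig:
    """Hypothetical per-quant-type test configuration."""

    name: str
    cache_dtype: str       # dtype of the KV cache tensors
    caches_per_layer: int  # 2 = (K, V); 4 = (K, V, K_scale, V_scale)
    is_dynamic: bool


# Registry keyed by quant type name; a new quant type is one new entry.
QUANT_CONFIGS = {
    "none": QuantConfig("none", "bfloat16", caches_per_layer=2, is_dynamic=False),
    "block_wise_fp8": QuantConfig("block_wise_fp8", "uint8", caches_per_layer=4, is_dynamic=True),
}


def cache_base_index(config: QuantConfig, layer_id: int) -> int:
    # A layer's first cache slot is layer_id scaled by the layout stride,
    # so dynamic (4-slot) and plain (2-slot) layouts index differently.
    return config.caches_per_layer * layer_id
```

Under this layout, a `layer_id=1` test with `block_wise_fp8` would start reading at slot 4, while an unquantized cache would start at slot 2.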
```python
def _run_forward_mocked(backend, module_path, quant_config, layer_id=0, return_tensor=None, qkv_inputs=None):
    """Run forward_mixed with mocked external ops, return the result.

    Args:
        backend: The attention backend instance.
        module_path: Module path for patching ops.
        quant_config: QuantConfig to use.
        layer_id: Layer ID for the dummy layer.
        return_tensor: If provided, mock append_attention to return this tensor.
        qkv_inputs: Optional (q, k, v, qkv) tuple. Generated if not provided.
    """
    backend.attention_metadata = DummyMetadata()
    layer = DummyLayer(layer_id=layer_id, quant_config=quant_config)
    caches = make_caches(quant_config, layer_id=layer_id)
    fm = DummyForwardMeta(caches=caches, max_len_val=0)

    if qkv_inputs is None:
        q, k, v, qkv = make_qkv_inputs()
    else:
        q, k, v, qkv = qkv_inputs

    if return_tensor is None:
        return_tensor = paddle.zeros([BATCH_SIZE, ATTN_OUTPUT_DIM], dtype="bfloat16")

    with patch(f"{module_path}.append_attention", return_value=return_tensor):
        with patch(f"{module_path}.get_block_shape_and_split_kv_block"):
            result = backend.forward_mixed(
                q=q,
                k=k,
                v=v,
                qkv=qkv,
                compressed_kv=None,
                k_pe=None,
                layer=layer,
                forward_meta=fm,
            )
    return result
```
The current mock-based tests only verify that forward_mixed "does not error / produces identical output"; they no longer check that cache and scale routing is correct under dynamic quant (e.g. that block_wise_fp8 indexes caches at `4 * layer_id`, while the non-dynamic path indexes at `2 * layer_id` and reads scales from layer.cache_*_scale). The core behavior these tests were meant to protect against regressions is therefore uncovered. Consider restoring assertions on the call arguments of append_attention (and of gqa_rope_write_cache on the prefill path) in _run_forward_mocked or in the smoke/diff cases, and re-adding cases such as layer_id=1.
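A minimal sketch of the kind of call-argument assertion suggested above. A plain-Python stand-in replaces the real backend and append_attention (the stride-4 vs stride-2 indexing mirrors the comment; all names here are illustrative, not the real API):

```python
from unittest.mock import MagicMock


def forward_mixed_stub(caches, layer_id, cache_quant_type, append_attention):
    # Stand-in for the real forward_mixed: dynamic block_wise_fp8 stores
    # (K, V, K_scale, V_scale) per layer -> stride 4; the plain layout
    # stores (K, V) per layer -> stride 2.
    base = 4 * layer_id if cache_quant_type == "block_wise_fp8" else 2 * layer_id
    return append_attention(key_cache=caches[base], value_cache=caches[base + 1])


caches = [f"cache_{i}" for i in range(8)]  # dummy cache slots
mock_append = MagicMock(return_value="out")

forward_mixed_stub(caches, layer_id=1, cache_quant_type="block_wise_fp8",
                   append_attention=mock_append)

# Assert the routing, not just "no error": layer 1 under dynamic quant
# must be handed cache slots 4 and 5.
assert mock_append.call_args.kwargs["key_cache"] == "cache_4"
assert mock_append.call_args.kwargs["value_cache"] == "cache_5"
```

The same pattern applies to the mocked gqa_rope_write_cache on the prefill path: inspect `call_args` on the patch object instead of only checking the returned tensor.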
```python
@unittest.skipIf(not _HAS_GPU, "No GPU available")
@unittest.skipIf(_IMPORT_ERROR is not None, f"Cannot import backends: {_IMPORT_ERROR}")
class TestBackendForwardGPU(unittest.TestCase):
    """GPU-based tests: real forward_mixed calls on GPU hardware."""

    def _gpu_smoke_test(self, backend_class, module_path, quant_config_name):
        """Test that forward_mixed runs on GPU without error."""
        config = QUANT_CONFIGS[quant_config_name]
        backend = create_backend(backend_class, module_path)
        backend.attention_metadata = DummyMetadata()

        max_block_num = BATCH_SIZE
        caches = _make_gpu_caches(config, max_block_num=max_block_num)
        layer = DummyLayer(layer_id=0, quant_config=config)
        fm = _make_gpu_forward_meta(caches, seq_len=1)
        q, k, v, qkv = make_qkv_inputs()

        result = backend.forward_mixed(
            q=q,
            k=k,
            v=v,
            qkv=qkv,
            compressed_kv=None,
            k_pe=None,
            layer=layer,
            forward_meta=fm,
        )
```
The GPU cases only check paddle.is_compiled_with_cuda() / device_count, without an explicit paddle.set_device("gpu"). In an environment whose default device is still CPU, the qkv/caches created here may all live on CPU, so forward_mixed either hits a place mismatch or silently runs a CPU kernel, defeating the purpose of "real GPU coverage". Consider setting the device in this TestCase's setUp/setUpClass and restoring it in tearDown, so that both the tensors and the ops execute on the GPU.
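The save/restore fixture suggested above could look like this sketch. So that the pattern is runnable here without paddle, `get_device`/`set_device` are plain-Python stand-ins; in the real test file they would be `paddle.get_device()` / `paddle.set_device()`:

```python
import unittest

# Stand-in device registry (assumption, for illustration only).
_DEVICE = {"current": "cpu"}


def get_device():
    return _DEVICE["current"]


def set_device(name):
    _DEVICE["current"] = name


class TestBackendForwardGPU(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Remember the suite's current device, then force GPU for this class.
        cls._prev_device = get_device()
        set_device("gpu")  # paddle.set_device("gpu") in the real file

    @classmethod
    def tearDownClass(cls):
        # Restore whatever device the rest of the suite was using.
        set_device(cls._prev_device)

    def test_device_is_gpu(self):
        self.assertEqual(get_device(), "gpu")
```

Restoring in tearDownClass keeps the forced device from leaking into other test files run in the same process.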
```python
# ---------------------------------------------------------------------------
# Part 3: GPU-based tests (require real GPU)
# ---------------------------------------------------------------------------

FLASH_MASK_MODULE = "fastdeploy.model_executor.layers.attention.flash_mask_attn_backend"
_HAS_GPU = paddle.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0
```
This refactor deleted tests such as TestSoftmaxInfinityHandling and TestAppendCacheKVC8KernelConfig, but no identically named or equivalent cases exist elsewhere under tests/, which may regress coverage of the softmax -inf fix and of the append_cache_kv_c8 config mapping. If these tests are still valuable, consider migrating them to a more appropriate test file instead of removing them outright, or state in the PR description why they were removed and what now covers those paths.
```python
# Backend registry for parameterized tests
BACKENDS = [
    ("flash_attn", FlashAttentionBackend, FLASH_ATTN_MODULE),
    ("flash_mask", FlashMaskAttentionBackend, FLASH_MASK_MODULE),
]
```
The BACKENDS variable is unused in this file (and referenced nowhere else in the repository); with flake8 enabled it will trigger F841 and fail pre-commit/CI. Either delete the variable, or use it to parameterize the smoke/diff/GPU cases (e.g. by iterating over it with subTest or a for loop).
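A hypothetical sketch of the subTest parameterization suggested above. DummyFlashAttn/DummyFlashMask and the module-path strings stand in for the real backend classes and paths:

```python
import unittest


# Stand-ins for the real backend classes (assumptions, for illustration).
class DummyFlashAttn:
    pass


class DummyFlashMask:
    pass


BACKENDS = [
    ("flash_attn", DummyFlashAttn, "pkg.flash_attn_backend"),
    ("flash_mask", DummyFlashMask, "pkg.flash_mask_attn_backend"),
]


class TestSmokeParameterized(unittest.TestCase):
    def test_smoke_all_backends(self):
        # One test body, one subTest per registered backend: BACKENDS is
        # actually consumed, and each backend's failure is reported separately.
        for name, backend_class, module_path in BACKENDS:
            with self.subTest(backend=name):
                backend = backend_class()
                self.assertIsInstance(backend, backend_class)
```

With this, adding a backend to the registry automatically extends the smoke/diff/GPU coverage without new test methods.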
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##           develop    #7092   +/- ##
==========================================
  Coverage         ?   73.62%
==========================================
  Files            ?      402
  Lines            ?    56432
  Branches         ?     8903
==========================================
  Hits             ?    41549
  Misses           ?    11950
  Partials         ?     2933

Flags with carried forward coverage won't be shown. View full report in Codecov.
Motivation
Refactor dynamic cache quant test
Modifications
Refactor dynamic cache quant test
Usage or Command
Accuracy Tests
Checklist
- Choose at least one PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If submitting to the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.