
[Other] Refactor dynamic cache quant test#7092

Open
Wanglongzhi2001 wants to merge 3 commits into PaddlePaddle:develop from
Wanglongzhi2001:refactor_dy_c8_test

Conversation

@Wanglongzhi2001
Collaborator

Motivation

Refactor dynamic cache quant test

Modifications

Refactor dynamic cache quant test

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests, or explain in this PR why none are needed.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings March 30, 2026 14:00
@paddle-bot

paddle-bot bot commented Mar 30, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the unit-test structure for dynamic KV cache quantization (dynamic C8 / C16) on FlashAttentionBackend and FlashMaskAttentionBackend, and introduces an extensible quant-config registry so that additional quantization types can be added later.

Changes:

  • Introduce QuantConfig + QUANT_CONFIGS to centrally manage the test configuration and cache layout for the different cache quantization types.
  • Replace the previous routing-assertion-style mock tests with: a mock smoke test, a mock diff test (C8 vs C16), and an optional real-GPU forward test.
  • Trim/remove test blocks unrelated to this file's focus (e.g., the softmax -inf fix and the kernel-config mapping tests).
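The QuantConfig + QUANT_CONFIGS registry itself is not shown in the diff excerpts above; a minimal sketch of what such a pairing could look like follows. The field names, keys, and the `cache_index` helper are assumptions for illustration, not the PR's actual definitions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantConfig:
    """Per-quant-type test configuration (hypothetical fields)."""
    name: str                 # cache quant type string, e.g. "cache_int8"
    cache_dtype: str          # dtype of the KV cache tensors
    caches_per_layer: int     # how many cache tensors each layer owns
    has_scales: bool          # whether scale tensors accompany the cache

# Registry keyed by quant-type name; tests look configs up by key,
# so adding a new quantization type is a one-line entry.
QUANT_CONFIGS = {
    "none": QuantConfig("none", "bfloat16", 2, False),
    "cache_int8": QuantConfig("cache_int8", "uint8", 2, True),
    "block_wise_fp8": QuantConfig("block_wise_fp8", "float8_e4m3fn", 4, True),
}

def cache_index(config: QuantConfig, layer_id: int) -> int:
    """Index of the first cache tensor owned by a layer under this config."""
    return config.caches_per_layer * layer_id
```

Keeping the layout (how many tensors per layer, whether scales exist) inside the config is what lets helpers like `make_caches(quant_config, layer_id=...)` stay generic across quant types.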

Comment on lines +271 to +307
def _run_forward_mocked(backend, module_path, quant_config, layer_id=0, return_tensor=None, qkv_inputs=None):
    """Run forward_mixed with mocked external ops, return the result.

    Args:
        backend: The attention backend instance.
        module_path: Module path for patching ops.
        quant_config: QuantConfig to use.
        layer_id: Layer ID for the dummy layer.
        return_tensor: If provided, mock append_attention to return this tensor.
        qkv_inputs: Optional (q, k, v, qkv) tuple. Generated if not provided.
    """
    backend.attention_metadata = DummyMetadata()
    layer = DummyLayer(layer_id=layer_id, quant_config=quant_config)
    caches = make_caches(quant_config, layer_id=layer_id)
    fm = DummyForwardMeta(caches=caches, max_len_val=0)

    if qkv_inputs is None:
        q, k, v, qkv = make_qkv_inputs()
    else:
        q, k, v, qkv = qkv_inputs

    if return_tensor is None:
        return_tensor = paddle.zeros([BATCH_SIZE, ATTN_OUTPUT_DIM], dtype="bfloat16")

    with patch(f"{module_path}.append_attention", return_value=return_tensor):
        with patch(f"{module_path}.get_block_shape_and_split_kv_block"):
            result = backend.forward_mixed(
                q=q,
                k=k,
                v=v,
                qkv=qkv,
                compressed_kv=None,
                k_pe=None,
                layer=layer,
                forward_meta=fm,
            )
    return result

Copilot AI Mar 30, 2026


The current mock-based tests only verify that forward_mixed "does not error / outputs match", but no longer check that cache and scale routing is correct under dynamic quant (e.g., whether block_wise_fp8 indexes the caches by 4 * layer_id, and whether non-dynamic types index by 2 * layer_id and read scales from layer.cache_*_scale). That leaves the core behavior these tests were meant to guard against regressions uncovered. Consider restoring assertions on the call arguments of append_attention (and of gqa_rope_write_cache on the prefill path) in _run_forward_mocked or in the smoke/diff cases, and re-adding cases such as layer_id=1.
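One way to restore that kind of coverage, sketched with stand-ins: `run_attention` below is a toy stand-in for the backend's routing logic, and the 4-vs-2 stride arithmetic is modeled on the review comment, not copied from the backend's real code. The point is the pattern of asserting on the mock's recorded call arguments:

```python
from unittest.mock import MagicMock

def run_attention(append_attention, caches, layer_id, dynamic):
    """Toy stand-in for the backend's routing: pick this layer's cache
    tensors, then hand them to append_attention."""
    stride = 4 if dynamic else 2        # block_wise_fp8 owns 4 tensors/layer
    base = stride * layer_id
    k_cache, v_cache = caches[base], caches[base + 1]
    return append_attention(k_cache=k_cache, v_cache=v_cache)

# Route through a mock, then assert on its call arguments -- this is
# what catches an indexing regression that a smoke test would miss.
append_attention = MagicMock(return_value="out")
caches = [f"cache_{i}" for i in range(8)]

run_attention(append_attention, caches, layer_id=1, dynamic=True)
kwargs = append_attention.call_args.kwargs
assert kwargs["k_cache"] == "cache_4"   # 4 * layer_id under dynamic quant
assert kwargs["v_cache"] == "cache_5"
```

In the real test file the same `call_args` inspection would apply to the `patch(f"{module_path}.append_attention", ...)` mock already present in `_run_forward_mocked`.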

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +430 to 456
@unittest.skipIf(not _HAS_GPU, "No GPU available")
@unittest.skipIf(_IMPORT_ERROR is not None, f"Cannot import backends: {_IMPORT_ERROR}")
class TestBackendForwardGPU(unittest.TestCase):
    """GPU-based tests: real forward_mixed calls on GPU hardware."""

    def _gpu_smoke_test(self, backend_class, module_path, quant_config_name):
        """Test that forward_mixed runs on GPU without error."""
        config = QUANT_CONFIGS[quant_config_name]
        backend = create_backend(backend_class, module_path)
        backend.attention_metadata = DummyMetadata()

        max_block_num = BATCH_SIZE
        caches = _make_gpu_caches(config, max_block_num=max_block_num)
        layer = DummyLayer(layer_id=0, quant_config=config)
        fm = _make_gpu_forward_meta(caches, seq_len=1)
        q, k, v, qkv = make_qkv_inputs()

        result = backend.forward_mixed(
            q=q,
            k=k,
            v=v,
            qkv=qkv,
            compressed_kv=None,
            k_pe=None,
            layer=layer,
            forward_meta=fm,
        )

Copilot AI Mar 30, 2026


The GPU cases only check paddle.is_compiled_with_cuda()/device_count, without an explicit paddle.set_device("gpu"). In an environment whose default device is still CPU, the qkv/caches created here may all live on CPU, so forward_mixed either hits a place mismatch or silently runs CPU kernels, defeating the point of "real GPU coverage". Consider setting the device in the TestCase's setUp/setUpClass and restoring it in tearDown, so that tensors and ops are guaranteed to execute on the GPU.
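The save/set/restore pattern the comment asks for could look like the sketch below. `get_device`/`set_device` here are plain stand-ins for `paddle.get_device()` and `paddle.set_device(...)` so the sketch runs anywhere; in the real test file the paddle calls would replace them:

```python
import unittest

# Stand-in for paddle's process-global device state.
_DEVICE = "cpu"

def get_device():
    return _DEVICE

def set_device(name):
    global _DEVICE
    _DEVICE = name

class TestBackendForwardGPU(unittest.TestCase):
    """Sketch: pin the device for the whole TestCase, restore it after."""

    @classmethod
    def setUpClass(cls):
        cls._saved_device = get_device()   # remember the caller's device
        set_device("gpu")                  # all tensors/ops now target GPU

    @classmethod
    def tearDownClass(cls):
        set_device(cls._saved_device)      # don't leak state to other tests

    def test_device_is_pinned(self):
        self.assertEqual(get_device(), "gpu")

# Run the case once so the save/restore round-trip is observable.
result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestBackendForwardGPU)
)
```

Doing this at class level (setUpClass/tearDownClass) rather than per test keeps the device switch out of each test body while still restoring the original device for whatever runs next in the suite.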

Comment on lines 386 to +390
# ---------------------------------------------------------------------------
# Part 3: GPU-based tests (require real GPU)
# ---------------------------------------------------------------------------

FLASH_MASK_MODULE = "fastdeploy.model_executor.layers.attention.flash_mask_attn_backend"
_HAS_GPU = paddle.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0
Copy link


This refactor deletes tests such as TestSoftmaxInfinityHandling and TestAppendCacheKVC8KernelConfig, but no identically named or equivalent cases exist elsewhere under tests/, so coverage of the softmax -inf fix and the append_cache_kv_c8 config mapping may regress. If these tests are still valuable, consider migrating them to a more appropriate test file rather than removing them outright, or explain in the PR description why they were deleted and what now covers them.

Comment on lines +93 to +99
# Backend registry for parameterized tests
BACKENDS = [
("flash_attn", FlashAttentionBackend, FLASH_ATTN_MODULE),
("flash_mask", FlashMaskAttentionBackend, FLASH_MASK_MODULE),
]




The BACKENDS variable is unused in this file (and referenced nowhere else in the repo); with flake8 enabled it will trigger F841 and fail pre-commit/CI. Consider deleting it, or using it to parameterize the smoke/diff/GPU cases (e.g., iterating with subTest in a for loop).

Suggested change
# Backend registry for parameterized tests
BACKENDS = [
("flash_attn", FlashAttentionBackend, FLASH_ATTN_MODULE),
("flash_mask", FlashMaskAttentionBackend, FLASH_MASK_MODULE),
]

@codecov-commenter

codecov-commenter commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@6727df8). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7092   +/-   ##
==========================================
  Coverage           ?   73.62%           
==========================================
  Files              ?      402           
  Lines              ?    56432           
  Branches           ?     8903           
==========================================
  Hits               ?    41549           
  Misses             ?    11950           
  Partials           ?     2933           
Flag | Coverage Δ
GPU  | 73.62% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
