[Cherry-Pick][KVCache][BugFix] Buffer early layer0 cache signals(#7872) #7896

Open
kevincheng2 wants to merge 1 commit into PaddlePaddle:release/2.6 from kevincheng2:cp/cache-messager-layer0-signal-release-2.6-20260522

@kevincheng2
Collaborator

Motivation

Cherry-pick #7872 to release/2.6.

Layer0 cache completion signals can arrive before the cache task is registered. Without buffering, the prefill cache send thread may miss these early layer0 signals.

Modifications

  • Add pending layer0 signal buffering in CacheMessager.
  • Recover buffered layer0 signals when cache task info is registered.
  • Drop stale pending signals when cache tasks finish.
  • Add unit tests for early layer0 signal buffering and recovery.
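
The buffering lifecycle listed above can be sketched roughly as follows. This is a minimal illustration, not the code in cache_messager.py: the class name `Layer0SignalBuffer`, its methods, and the `delivered` list are hypothetical, and a single lock stands in for whatever synchronization the real CacheMessager uses.

```python
import threading


class Layer0SignalBuffer:
    """Hypothetical stand-in for the pending layer0 buffering idea."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}     # request_id -> registered task info
        self._pending = {}   # request_id -> layer0 payload that arrived early
        self.delivered = []  # (request_id, payload) pairs handed to the sender

    def on_layer0_signal(self, request_id, payload):
        """Deliver the signal if the task is known, otherwise buffer it."""
        with self._lock:
            if request_id in self._tasks:
                self.delivered.append((request_id, payload))
            else:
                self._pending[request_id] = payload

    def register_task(self, request_id, task_info):
        """Register a task and recover any signal that arrived before it."""
        with self._lock:
            self._tasks[request_id] = task_info
            payload = self._pending.pop(request_id, None)
            if payload is not None:
                self.delivered.append((request_id, payload))

    def finish_task(self, request_id):
        """Forget the task and drop any stale pending signal for it."""
        with self._lock:
            self._tasks.pop(request_id, None)
            self._pending.pop(request_id, None)
```

In this sketch, a signal that arrives before `register_task` is held in `_pending` and only reaches `delivered` once the task is registered, while `finish_task` ensures a leftover entry cannot be replayed against a later task with the same id.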

Usage or Command

No usage change.

Targeted test attempted:

python -m pytest tests/cache_manager/test_cache_messager.py -q
/root/paddlejob/inference-public/chengyanfu/.venv/py310/bin/python -m pytest tests/cache_manager/test_cache_messager.py -q

Both commands were stopped by the current shell environment with exit code 147 before test completion.

Accuracy Tests

Not applicable. This PR does not change model outputs.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 22, 2026

Thanks for your contribution!

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-22 14:45:56

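📋 Review Summary

PR overview: Cherry-picks #7872 to fix an issue in CacheMessagerV1 where layer0 cache completion signals that arrive before task registration are lost.
Change scope: fastdeploy/cache_manager/cache_messager.py, tests/cache_manager/test_cache_messager.py
Impact tag: [KVCache]

Issues

| Level | File | Summary |
| 📝 PR convention | Title | The Cherry-Pick title contains two official tags, [KVCache][BugFix]; the convention allows only one |

Only a PR convention issue; no blocking bugs.

📝 PR Convention Check

The title [Cherry-Pick][KVCache][BugFix] Buffer early layer0 cache signals(#7872) contains two official tags, whereas checklist §D1 specifies the Cherry-Pick PR title format as [Cherry-Pick][Tag] title description(#original PR number), which allows only one feature tag.

Suggested title (can be copied directly):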

  • [Cherry-Pick][BugFix] Buffer early layer0 cache signals(#7872)

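The PR description is structurally complete: all sections required by §D2 (## Motivation, ## Modifications, ## Usage or Command, ## Accuracy Tests, ## Checklist) have substantive content, and the Checklist selections are reasonable. No changes are needed.

Overall Assessment

The implementation approach is clear: a pending_layer0_signals dictionary buffers layer0 signals that arrive early; when a task is registered, the buffered signals are recovered atomically and their token counts are validated; when a task finishes, stale signals are cleaned up. The lock order (engine_cache_task_thread_lock → pending_layer0_signal_lock) is consistent across all three code paths, so there is no deadlock risk. Three new unit tests cover buffered recovery, dropping of invalid signals, and stale cleanup. Code quality is good; only the double-tag title needs to be changed to follow the convention, and this does not block merging.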
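
To illustrate the lock-ordering point, here is a rough sketch in which every path takes engine_cache_task_thread_lock first and pending_layer0_signal_lock second. Only the two lock names and the pending_layer0_signals dictionary come from the review; the functions `buffer_or_apply` and `register`, the `cache_tasks` dictionary, and the specific validity rule (a buffered token count must be positive and no larger than the task's total) are assumptions made for the example, not the PR's actual logic.

```python
import threading

# Lock names taken from the review; everything else is hypothetical.
engine_cache_task_thread_lock = threading.Lock()
pending_layer0_signal_lock = threading.Lock()

cache_tasks = {}             # request_id -> total token count of the task
pending_layer0_signals = {}  # request_id -> token count of an early signal


def buffer_or_apply(request_id, token_num):
    """Signal path: outer task lock first, then the pending-signal lock."""
    with engine_cache_task_thread_lock:
        if request_id in cache_tasks:
            return "applied"
        with pending_layer0_signal_lock:
            pending_layer0_signals[request_id] = token_num
        return "buffered"


def register(request_id, total_tokens):
    """Registration path: same lock order, recovery and check in one step."""
    with engine_cache_task_thread_lock:
        cache_tasks[request_id] = total_tokens
        with pending_layer0_signal_lock:
            token_num = pending_layer0_signals.pop(request_id, None)
            if token_num is None:
                return "no_pending"
            # Assumed validity rule: count must fit within the task.
            if not 0 < token_num <= total_tokens:
                return "dropped_invalid"
            return "recovered"
```

Because both functions acquire the locks in the same order, two threads can never each hold one lock while waiting for the other, which is the property the review credits for the absence of deadlock.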

@PaddlePaddle-bot

PaddlePaddle-bot commented May 22, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-23 01:53:08

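This CI report was generated from the following code (updated every 30 minutes):


1 Task Overview

All required tasks have passed, so approval is recommended. Three optional tasks are currently failing; they do not block merging and are listed for reference only.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| 36(0) | 36 | 33 | 3 | 0 | 0 | 0 |

2 Task Status Summary

Note on the Log column: failed tasks link directly to the CI log; for running tasks the Job link is assembled manually.

2.1 Required tasks: 10/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| ✅ | The remaining 10 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 23/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 15m40s | Job | - |
| ❌ | Trigger Jenkins for PR | 1m1s | Job | - |
| ❌ | CI_HPU | 1h6m | Job | - |
| ✅ | The remaining 23 optional tasks passed | - | - | - |

3 Failure Details (required only)

No required tasks failed.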

@codecov-commenter

Codecov Report

❌ Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@b562b8d). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| fastdeploy/cache_manager/cache_messager.py | 97.50% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7896   +/-   ##
==============================================
  Coverage               ?   72.39%           
==============================================
  Files                  ?      381           
  Lines                  ?    54252           
  Branches               ?     8480           
==============================================
  Hits                   ?    39276           
  Misses                 ?    12216           
  Partials               ?     2760           
| Flag | Coverage Δ |
| GPU | 72.39% <97.50%> (?) |

