[Scheduler] Defer block recycling to accelerate LRU node freeing #7885
liyonghua0910 wants to merge 2 commits into
Thanks for your contribution!
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7885 +/- ##
==========================================
Coverage ? 63.58%
==========================================
Files ? 462
Lines ? 64487
Branches ? 9882
==========================================
Hits ? 41007
Misses ? 20704
Partials ? 2776
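CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
9/10 Required tasks currently pass; 1 Required task is still failing:
2 Task status summary
Log column notes: failed tasks link directly to their logs; running tasks link to the Job.
2.1 Required tasks: 9/10 passed
2.2 Optional tasks: 29/32 passed
3 Failure details (Required only)
Approval: manual approval required (confidence: high)
Root cause summary: this Job requires manual Approval; CI continues only after the approval is granted.
Fix suggestion summary: obtain manual approval, then wait for CI to continue automatically or rerun as needed.
Key logs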
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-22 11:00:55
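📋 Review summary
PR overview: moves the per-iteration recycle_gpu_blocks calls in the LRU eviction loop to a single batch call after the loop ends, to reduce heap operation overhead, and also fixes the order in which a child node is disconnected from its parent.
Change scope: fastdeploy/cache_manager/prefix_cache_manager.py
Impact tag: [KVCache]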
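The deferred-recycling idea in the overview can be sketched as follows. This is a minimal illustration, assuming the free-block pool is a min-heap; `BlockPool`, `evict_eager`, and `evict_deferred` are hypothetical names and not FastDeploy's actual API.

```python
import heapq


class BlockPool:
    """Hypothetical free-block pool backed by a min-heap (not FastDeploy's class)."""

    def __init__(self):
        self.free_heap = []
        self.recycle_calls = 0

    def recycle(self, block_ids):
        # One call may return many blocks; each id is pushed onto the heap.
        self.recycle_calls += 1
        for block_id in block_ids:
            heapq.heappush(self.free_heap, block_id)


def evict_eager(pool, nodes):
    # Recycle inside the loop: one pool call per evicted node.
    for blocks in nodes:
        pool.recycle(blocks)


def evict_deferred(pool, nodes):
    # Collect inside the loop, then recycle once after it.
    pending = []
    for blocks in nodes:
        pending.extend(blocks)
    if pending:
        pool.recycle(pending)


# Each inner list stands for the blocks owned by one evicted node.
nodes = [[3, 7], [1], [5, 2]]
eager, deferred = BlockPool(), BlockPool()
evict_eager(eager, nodes)
evict_deferred(deferred, nodes)
```

Both variants leave the same block ids in the pool; the deferred one reaches the pool once instead of once per node, which is where any per-call overhead would be saved.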
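Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | prefix_cache_manager.py:1360 | The defer_recycle=True path does not clear node.reverved_dec_block_ids, risking double recycling |
| 🔴 Bug | prefix_cache_manager.py:1360 | The node.cache_status = CacheStatus.CPU assignment was removed, so the node status is not updated after eviction |
| ❓ Question | prefix_cache_manager.py:1488 | When the parent is already in the heap, continue skips it; could pushing the parent onto the heap a second time cause a genuine duplicate free? |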
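📝 PR convention check
The PR title uses the [Scheduler] tag, but the only changed file, fastdeploy/cache_manager/prefix_cache_manager.py, belongs to the cache_manager/ module. According to the impact table in architecture.md, fastdeploy/cache_manager/ maps to the [KVCache] tag. Consider correcting the title tag.
Suggested title (can be copied directly):
[KVCache] Defer block recycling to accelerate LRU node freeing
Suggested PR description (can be copied directly; must reproduce the full structure of the checklist §D2 template):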
## Motivation
In the LRU eviction loop of `free_block_ids_async`, each iteration calls `recycle_gpu_blocks` individually, which causes frequent heap operations and slows down the overall freeing process. This PR defers block recycling to a single batch call after the loop completes.
## Modifications
- Defer `recycle_gpu_blocks` calls inside the LRU freeing loop to a single batch call after the loop, reducing the overhead of repeated heap operations.
- Add `defer_recycle` parameter to `_handle_free_gpu_node_without_cpu` to support deferred block recycling.
- Fix the LRU leaf node freeing logic: disconnect the child node from its parent first, then check whether the parent should be added to the LRU heap, avoiding duplicate freeing.
- Add warning logs to help diagnose duplicate node issues in the LRU heap.
## Usage or Command
No additional configuration required. The optimization takes effect automatically.
## Accuracy Tests
Only affects KV Cache block recycling timing, no impact on model output accuracy.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
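- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
Deferring recycling into a single batch is a reasonable optimization, but the PR removes the node.cache_status update and the node.reverved_dec_block_ids = [] reset without fully assessing the side effects. This leaves a P0 risk of stale node state and double block recycling; please fix and resubmit.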
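The double-recycling risk raised in this review can be illustrated with a small sketch. It assumes a simplified node type; `Node`, `collect_without_clearing`, and `collect_and_clear` are hypothetical names for illustration only, and only the field name `reverved_dec_block_ids` comes from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cache tree node."""
    block_id: int
    reverved_dec_block_ids: list = field(default_factory=list)


def collect_without_clearing(node):
    # Returns the reserved ids but leaves the reference on the node.
    return list(node.reverved_dec_block_ids)


def collect_and_clear(node):
    # Snapshot first, then clear, so a later path finds nothing to recycle.
    blocks = list(node.reverved_dec_block_ids)
    node.reverved_dec_block_ids = []
    return blocks


# Two code paths touch the same node before the batch recycle runs.
stale = Node(block_id=10, reverved_dec_block_ids=[11, 12])
pending_stale = collect_without_clearing(stale)
pending_stale += collect_without_clearing(stale)

safe = Node(block_id=20, reverved_dec_block_ids=[21, 22])
pending_safe = collect_and_clear(safe)
pending_safe += collect_and_clear(safe)
```

In the first case the pending list contains every reserved id twice, so a later batch recycle would return the same blocks to the pool twice; clearing at collection time keeps each id in the list once.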
        )
        return []
    else:
        return blocks_to_recycle
🔴 Bug: the defer_recycle=True path does not clear node.reverved_dec_block_ids, risking double recycling
The original code had node.reverved_dec_block_ids = [] in _handle_free_gpu_node_without_cpu, which ensured the reference was cleared once the node was freed. The PR removes this logic. With defer_recycle=True the caller receives the block list and recycles it later, but the node's reverved_dec_block_ids field still holds the old value. If the node is accessed again through another path (for example free_nodes_directly or the swap path) before the batch recycle completes, the same block may be recycled twice.
Suggested fix: clear the field before the return in the defer_recycle=True branch:
    blocks_to_recycle = list(node.reverved_dec_block_ids) + [node.block_id]
    node.reverved_dec_block_ids = []  # clear immediately to prevent double recycling
    if not defer_recycle:
        ...
    else:
        return blocks_to_recycle

        )
        return []
    else:
        return blocks_to_recycle
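🔴 Bug: node.cache_status is not updated in _handle_free_gpu_node_without_cpu
The original code set node.cache_status = CacheStatus.CPU at function entry, marking the node as evicted. The PR deletes that line entirely, so neither path (defer_recycle=True or False) updates cache_status. The node is removed from node_map but its cache_status remains GPU. If external code reaches the node through another reference (for example the node.cache_status.value == CacheStatus.GPU.value check in recv_data_transfer_result on the PD disaggregation transfer path), it will wrongly treat the node as still resident on the GPU.
Suggested fix: restore the status update at the start of the function (or switch to a more suitable evicted-state enum value):
    node.cache_status = CacheStatus.CPU  # or add a new EVICTED status

    logger.warning(
        f"Node {parent.node_id} is already in gpu lru leaf heap, duplicated node free may occured!"
    )
    continue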
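❓ Question: when continue skips the parent because it is already in the heap, total_gpu_free_count is not incremented; could this make the loop exit early?
When parent in self.gpu_lru_leaf_set holds, continue is executed and this iteration does not increment total_gpu_free_count again (the count for the leaf being evicted was already incremented by 1 above), which is logically correct. However, the warning log says "duplicated node free may occured": if the parent had already been popped from the heap and processed (removed from node_map), adding it to the heap again would cause a genuine duplicate free.
Please confirm: before the parent is pushed into gpu_lru_leaf_heap, should the code first check whether it has already been removed from node_map (that is, whether it has already gone through _handle_free_gpu_node_without_cpu)? If it has been removed, skip the push.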
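The check proposed in this question could look roughly like the sketch below. It assumes the LRU leaf heap stores `(last_used, node_id)` tuples and that `node_map` is keyed by node id; `maybe_push_parent` and its parameters are hypothetical and do not reflect the real signatures in prefix_cache_manager.py.

```python
import heapq


def maybe_push_parent(parent_id, last_used, node_map, leaf_heap, leaf_set):
    """Push a parent onto the LRU leaf heap only if it is still live and not queued."""
    if parent_id not in node_map:
        # Already evicted: pushing it again would lead to a second free.
        return False
    if parent_id in leaf_set:
        # Already queued: a second entry would be popped twice.
        return False
    heapq.heappush(leaf_heap, (last_used, parent_id))
    leaf_set.add(parent_id)
    return True


node_map = {1: "live node"}  # node 2 was evicted earlier and is absent
leaf_heap, leaf_set = [], set()

first = maybe_push_parent(1, 5.0, node_map, leaf_heap, leaf_set)
second = maybe_push_parent(1, 5.0, node_map, leaf_heap, leaf_set)
evicted = maybe_push_parent(2, 3.0, node_map, leaf_heap, leaf_set)
```

Only the first call adds an entry; the repeated push and the push of an already evicted node are both rejected, so each node can be popped and freed at most once.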
## Motivation
In the LRU eviction loop of `free_block_ids_async`, each iteration calls `recycle_gpu_blocks` individually, which causes frequent heap operations and slows down the overall freeing process. This PR defers block recycling to a single batch call after the loop completes.
## Modifications
- Defer `recycle_gpu_blocks` calls inside the LRU freeing loop to a single batch call after the loop, reducing the overhead of repeated heap operations.
- Add `defer_recycle` parameter to `_handle_free_gpu_node_without_cpu` to support deferred block recycling.
## Usage or Command
No additional configuration required. The optimization takes effect automatically.
## Accuracy Tests
Only affects KV Cache block recycling timing, no impact on model output accuracy.
## Checklist
- Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- Format your code, run `pre-commit` before commit.
- Add unit tests. Please write the reason in this PR if no unit tests.
- Provide accuracy results.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.