
[Scheduler] Defer block recycling to accelerate LRU node freeing #7885

Open
liyonghua0910 wants to merge 2 commits into PaddlePaddle:develop from liyonghua0910:develop+20260521_free_blocks

Conversation

liyonghua0910 (Collaborator) commented May 21, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

In the LRU eviction loop of free_block_ids_async, each iteration calls recycle_gpu_blocks individually, which causes frequent heap operations and slows down the overall freeing process. This PR defers block recycling to a single batch call after the loop completes.

Modifications

  • Defer recycle_gpu_blocks calls inside the LRU freeing loop to a single batch call after the loop, reducing the overhead of repeated heap operations.
  • Add defer_recycle parameter to _handle_free_gpu_node_without_cpu to support deferred block recycling.
  • Fix the LRU leaf node freeing logic: disconnect the child node from its parent first, then check whether the parent should be added to the LRU heap, avoiding duplicate freeing.
  • Add warning logs to help diagnose duplicate node issues in the LRU heap.
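
The deferral described above can be sketched as follows. This is a minimal illustration, not the PR's code: `BlockPool`, `evict_eagerly`, and `evict_deferred` are hypothetical names, and the free pool is assumed to be a min-heap of block IDs. The sketch only shows that both strategies leave the pool with the same blocks while the deferred one makes a single recycle call; how much time that saves depends on what the real `recycle_gpu_blocks` does per call.

```python
import heapq


class BlockPool:
    """Hypothetical free-block pool; not FastDeploy's actual class."""

    def __init__(self):
        self.free_heap = []      # assumed: min-heap of free block IDs
        self.recycle_calls = 0   # counts how often recycle is invoked

    def recycle_gpu_blocks(self, block_ids):
        self.recycle_calls += 1
        for block_id in block_ids:
            heapq.heappush(self.free_heap, block_id)


def evict_eagerly(pool, victims):
    # One recycle call per evicted node (the behaviour before the PR).
    for blocks in victims:
        pool.recycle_gpu_blocks(blocks)


def evict_deferred(pool, victims):
    # Collect blocks during the loop, recycle once afterwards.
    pending = []
    for blocks in victims:
        pending.extend(blocks)
    pool.recycle_gpu_blocks(pending)


victims = [[7, 3], [9], [1, 4]]  # blocks held by three evicted nodes
eager, deferred = BlockPool(), BlockPool()
evict_eagerly(eager, victims)
evict_deferred(deferred, victims)
```

Until the batch call runs, the collected blocks are not yet available to allocators, so any code that reads the free pool inside the loop would see a smaller pool than before.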

Usage or Command

No additional configuration required. The optimization takes effect automatically.

Accuracy Tests

This change only affects the timing of KV Cache block recycling; it has no impact on model output accuracy.

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot Bot commented May 21, 2026

Thanks for your contribution!

liyonghua0910 changed the title from "[KVCache] Defer block recycling to accelerate LRU node freeing" to "[Scheduler] Defer block recycling to accelerate LRU node freeing" on May 21, 2026
PaddlePaddle-bot

This comment was marked as outdated.

codecov-commenter commented May 21, 2026

Codecov Report

❌ Patch coverage is 68.18182% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@8080a25). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/cache_manager/prefix_cache_manager.py | 68.18% | 4 Missing and 3 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7885   +/-   ##
==========================================
  Coverage           ?   63.58%           
==========================================
  Files              ?      462           
  Lines              ?    64487           
  Branches           ?     9882           
==========================================
  Hits               ?    41007           
  Misses             ?    20704           
  Partials           ?     2776           

| Flag | Coverage Δ |
|---|---|
| GPU | 72.71% <68.18%> (?) |
| XPU | 7.11% <0.00%> (?) |


PaddlePaddle-bot commented May 21, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-23 01:53:10

This CI report was generated from the following code (refreshed every 30 minutes):


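1 Task overview

9 of 10 Required tasks currently pass; 1 Required task, Approval, is still failing. That failure reflects a pending manual approval, not a failed code test; the code-related Required tests (including Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage) have passed. Another 3 Optional tasks failed; they are listed for reference only.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 42(0) | 42 | 38 | 4 | 0 | 0 | 0 |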

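2 Task status summary

Note on the Log column: failed tasks link directly to their logs; running tasks link to the Job.

2.1 Required tasks: 9/10 passed

Required tasks block merging; failures should be addressed first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| Failed | Approval | 6s | Approval required: manual approval not completed | Grant the manual approval | Job | - |
| Passed | Remaining 9 required tasks | - | - | - | - | - |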

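2.2 Optional tasks: 29/32 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| Failed | Run iluvatar Tests / run_iluvatar_cases | 1m27s | Job | - |
| Failed | CI_HPU | 1h4m | Job | - |
| Failed | Trigger Jenkins for PR | 7m34s | Job | - |
| Passed | Remaining 29 optional tasks | - | - | - |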

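3 Failure details (Required only)

Approval: manual approval required (confidence: high)

Root cause summary

This Job requires manual Approval; CI continues only after the approval is granted.

Suggested fix summary

Grant the manual approval; once it is approved, wait for CI to continue automatically or rerun as needed.

Key logs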

  • Job
  • Process completed with exit code 6.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-22 11:00:55

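📋 Review summary

PR overview: Defers the per-iteration recycle_gpu_blocks calls in the LRU eviction loop to a single batch call after the loop ends, reducing heap-operation overhead, and fixes the order in which a child node is disconnected from its parent.
Change scope: fastdeploy/cache_manager/prefix_cache_manager.py
Impact tag: [KVCache]

Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | prefix_cache_manager.py:1360 | The defer_recycle=True path does not clear node.reverved_dec_block_ids, risking double recycling |
| 🔴 Bug | prefix_cache_manager.py:1360 | The node.cache_status = CacheStatus.CPU assignment was removed, so the node's status is not updated after eviction |
| ❓ Question | prefix_cache_manager.py:1488 | When the parent is already in the heap, continue skips it; could pushing the parent into the heap a second time trigger a real duplicate free? |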

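📝 PR convention check

The PR title uses the [Scheduler] tag, but the only changed file, fastdeploy/cache_manager/prefix_cache_manager.py, belongs to the cache_manager/ module. According to the impact table in architecture.md, fastdeploy/cache_manager/ maps to the [KVCache] tag. Consider correcting the title tag.

Suggested title (can be copied directly):

  • [KVCache] Defer block recycling to accelerate LRU node freeing

Suggested PR description (can be copied directly; it must reproduce the full structure of the checklist §D2 template):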

## Motivation
In the LRU eviction loop of `free_block_ids_async`, each iteration calls `recycle_gpu_blocks` individually, which causes frequent heap operations and slows down the overall freeing process. This PR defers block recycling to a single batch call after the loop completes.

## Modifications
- Defer `recycle_gpu_blocks` calls inside the LRU freeing loop to a single batch call after the loop, reducing the overhead of repeated heap operations.
- Add `defer_recycle` parameter to `_handle_free_gpu_node_without_cpu` to support deferred block recycling.
- Fix the LRU leaf node freeing logic: disconnect the child node from its parent first, then check whether the parent should be added to the LRU heap, avoiding duplicate freeing.
- Add warning logs to help diagnose duplicate node issues in the LRU heap.

## Usage or Command
No additional configuration required. The optimization takes effect automatically.

## Accuracy Tests
Only affects KV Cache block recycling timing, no impact on model output accuracy.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

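Overall assessment

Deferring to a batch recycle is a reasonable optimization, but the PR removes the node.cache_status update and the node.reverved_dec_block_ids = [] reset without fully assessing the side effects. This leaves a P0-level risk of stale node state and double block recycling. Please fix these issues and resubmit.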

)
return []
else:
return blocks_to_recycle

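🔴 Bug: the defer_recycle=True path does not clear node.reverved_dec_block_ids, risking double recycling

The original code contained node.reverved_dec_block_ids = [] in _handle_free_gpu_node_without_cpu, which ensured the reference was cleared once the node was freed. This PR removes that logic. When defer_recycle=True, the caller receives the block list and recycles it later, but the node's reverved_dec_block_ids field still holds the old values. If the node is accessed again through another path (for example free_nodes_directly or the swap path) before the batch recycle completes, the same block may be recycled twice.

Suggested fix: clear the field before the return in the defer_recycle=True branch:

blocks_to_recycle = list(node.reverved_dec_block_ids) + [node.block_id]
node.reverved_dec_block_ids = []  # clear immediately to prevent double recycling
if not defer_recycle:
    ...
else:
    return blocks_to_recycle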
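
The snapshot-then-clear pattern in this suggestion can be checked in isolation. The sketch below is an assumption-laden stand-in: `Node` and `take_blocks_for_recycle` are hypothetical, and only the field names `block_id` and `reverved_dec_block_ids` come from the review. It shows that a second pass over the same node no longer returns the reserved decode blocks.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a prefix-tree node."""
    block_id: int
    reverved_dec_block_ids: list = field(default_factory=list)


def take_blocks_for_recycle(node):
    # Copy the reserved IDs first, then clear the field so that a later
    # visit to the same node cannot hand out the same reserved blocks.
    blocks = list(node.reverved_dec_block_ids) + [node.block_id]
    node.reverved_dec_block_ids = []
    return blocks


node = Node(block_id=5, reverved_dec_block_ids=[11, 12])
first = take_blocks_for_recycle(node)
second = take_blocks_for_recycle(node)
```

Note that `second` still contains the node's own `block_id`, so clearing the reserved list alone does not protect the main block; that still relies on the node being unreachable (for example, removed from `node_map`) after the first pass.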

)
return []
else:
return blocks_to_recycle

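🔴 Bug: node.cache_status is not updated in _handle_free_gpu_node_without_cpu

The original code set node.cache_status = CacheStatus.CPU at function entry (marking the node as evicted). This PR deletes that line entirely, so neither path (defer_recycle=True/False) updates cache_status. The node has been removed from node_map but its cache_status is still GPU. If external code (such as the node.cache_status.value == CacheStatus.GPU.value check in recv_data_transfer_result on the PD disaggregation transfer path) reaches the node through another reference, it will wrongly treat the node as still being in GPU state.

Suggested fix: restore the status update at the start of the function (or switch to a more suitable evicted-state enum value):

node.cache_status = CacheStatus.CPU  # or add a new EVICTED status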
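
The stale-reference concern can be reproduced with a small model. Everything here is hypothetical: this `CacheStatus` enum, its member values, the `Node` class, and `evict` are illustrative and are not FastDeploy's definitions. The point is only that a caller holding an old reference sees the eviction if, and only if, the status is updated on the node itself.

```python
from enum import Enum


class CacheStatus(Enum):
    """Hypothetical enum; the real one may have other members and values."""
    GPU = 0
    CPU = 1


class Node:
    def __init__(self):
        self.cache_status = CacheStatus.GPU


def evict(node, node_map, node_id):
    # Update the status on the node object before dropping it from the map,
    # so holders of other references can still tell it was evicted.
    node.cache_status = CacheStatus.CPU
    node_map.pop(node_id, None)


node = Node()
node_map = {42: node}
stale_ref = node  # e.g. a reference kept by a transfer path
evict(node, node_map, 42)
still_looks_resident = stale_ref.cache_status.value == CacheStatus.GPU.value
```

Without the assignment inside `evict`, `still_looks_resident` would be `True` even though the node is gone from `node_map`.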

logger.warning(
f"Node {parent.node_id} is already in gpu lru leaf heap, duplicated node free may occured!"
)
continue

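❓ Question: when continue skips because the parent is already in the heap, total_gpu_free_count is not incremented. Could this cause the loop to exit early?

When parent in self.gpu_lru_leaf_set holds, continue is executed and this iteration does not increment total_gpu_free_count again (the count for the leaf being evicted was already incremented by 1 above), so the logic is correct. However, the warning log says "duplicated node free may occured": if the parent was already popped from the heap and processed (removed from node_map), pushing it into the heap again would cause a genuine duplicate free.

Please confirm: before the parent is pushed into gpu_lru_leaf_heap, should there be a check for whether it has already been removed from node_map (that is, whether it has already gone through _handle_free_gpu_node_without_cpu)? If it has been removed, skip pushing it.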
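
One possible shape for such a guard is sketched below. The function `maybe_push_parent` and its parameters are hypothetical; the real heap entries, keys, and containers in `prefix_cache_manager.py` may differ. The sketch assumes the heap stores `(last_used, node_id)` tuples and that a companion set mirrors heap membership.

```python
import heapq


def maybe_push_parent(parent_id, last_used, node_map, leaf_set, leaf_heap):
    """Push a parent onto the LRU leaf heap only if it is still alive
    and not already queued. Returns True when the push happened."""
    if parent_id not in node_map:
        # Already freed earlier; re-queuing it would free it twice.
        return False
    if parent_id in leaf_set:
        # Already waiting in the heap; avoid a duplicate entry.
        return False
    heapq.heappush(leaf_heap, (last_used, parent_id))
    leaf_set.add(parent_id)
    return True


node_map = {1: "parent-node"}  # node 2 is assumed to be freed already
leaf_set, leaf_heap = set(), []
pushed_first = maybe_push_parent(1, 10, node_map, leaf_set, leaf_heap)
pushed_again = maybe_push_parent(1, 10, node_map, leaf_set, leaf_heap)
pushed_freed = maybe_push_parent(2, 5, node_map, leaf_set, leaf_heap)
```

Checking `node_map` first keeps the two failure modes distinguishable, which would also let the warning log say which one occurred.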
