
[PD Disaggregation] Write the cache of preempted req to storage and refine PD Disaggregation #7107

Merged
Jiang-Jia-Jun merged 3 commits into PaddlePaddle:develop from juncaipeng:process_preempted_req
Apr 1, 2026

Conversation

@juncaipeng (Collaborator) commented Mar 31, 2026

Motivation

  • Improve handling of preempted requests: when cache pooling is enabled, write the preempted request's cache to storage.
  • When a request on the P (prefill) instance asks the D (decode) instance for blocks, have the D instance reserve block ids for running requests.
  • Fix a bug where prompt_token_ids in the request was modified in place when writing out the cache.

Modifications

See the inline review comments.

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings March 31, 2026 07:55
paddle-bot (bot) commented Mar 31, 2026

Thanks for your contribution!

Copilot AI (Contributor) left a comment

Pull request overview

This PR targets the PD Disaggregation / v1 scheduling path. When a preemption occurs, the request's KV cache (optionally including output tokens) is written to the storage backend so it can be reused/restored later; the PR also makes small adjustments to the scheduler's resource checks and cache allocation strategy to reduce the risk of hangs.

Changes:

  • Adds an environment variable switch that controls whether a preempted request triggers a cache write to storage.
  • In ResourceManagerV1._trigger_preempt(), writes the preempted request's cache to storage according to its role (P/D).
  • Adds a parameter to PrefixCacheManager.can_allocate_gpu_blocks() to control whether it actively tries to free up GPU blocks, and fixes how token_ids is constructed on the write-to-storage path.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
fastdeploy/envs.py Adds the environment variable switch for writing a preempted request's cache to storage.
fastdeploy/engine/sched/resource_manager_v1.py Adds the storage write on the preempt path, and adjusts the prefill-threshold function signature and some resource checks.
fastdeploy/engine/common_engine.py Lowers the log level for decode resource-allocation failures from error to warning.
fastdeploy/cache_manager/prefix_cache_manager.py Extends the GPU-block allocatability check; fixes the token_ids concatenation when writing to storage so the original list is not mutated in place.

 gpu_recv_block_ids = []
 match_cpu_blocks_num = len(match_cpu_block_ids)
-if self.can_allocate_gpu_blocks(num_blocks=match_cpu_blocks_num):
+if self.can_allocate_gpu_blocks(num_blocks=match_cpu_blocks_num, try_free_gpu_blocks=False):

juncaipeng (Collaborator, Author): Avoid a possible deadlock that could hang the process.
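One plausible reading of the try_free_gpu_blocks=False change is that the caller sits on a transfer path that eviction itself may wait on, so the check should fail fast instead of trying to reclaim blocks. The sketch below illustrates that idea only; the class and method bodies are hypothetical, not the real PrefixCacheManager:

```python
import threading


class BlockPoolSketch:
    """Hypothetical sketch of a can_allocate_gpu_blocks() flag; the names
    mirror the PR but the implementation is illustrative only."""

    def __init__(self, num_free_blocks):
        self.num_free_blocks = num_free_blocks
        self.lock = threading.Lock()

    def _try_free_gpu_blocks(self, need):
        # Eviction may wait on in-flight transfers; calling it from the
        # transfer path itself is what risks the hang.
        return False  # placeholder: assume nothing could be reclaimed

    def can_allocate_gpu_blocks(self, num_blocks, try_free_gpu_blocks=True):
        with self.lock:
            if self.num_free_blocks >= num_blocks:
                return True
            if try_free_gpu_blocks:
                # Normal path: allowed to try active reclamation.
                return self._try_free_gpu_blocks(num_blocks)
            # Transfer path: fail fast rather than wait for eviction.
            return False


pool = BlockPoolSketch(num_free_blocks=4)
print(pool.can_allocate_gpu_blocks(3, try_free_gpu_blocks=False))  # True
print(pool.can_allocate_gpu_blocks(8, try_free_gpu_blocks=False))  # False
```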

 task_id=req_id,
 keys=no_match_block_keys,
-token_ids=input_token_ids,
+token_ids=input_token_ids if self.kvcache_storage_backend == "attention_store" else None,

juncaipeng (Collaborator, Author): Reduce the amount of data transferred across processes.


 if self.config.cache_config.enable_output_caching:
-    token_ids += request.output_token_ids
+    input_token_ids = token_ids + request.output_token_ids

juncaipeng (Collaborator, Author): Fix the bug that modified prompt_token_ids in the request.
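The bug fixed here is a standard Python aliasing pitfall: `+=` on a list mutates it in place, while `+` builds a new list. A minimal reproduction, using a hypothetical stand-in for the request object (not FastDeploy's actual class):

```python
class Request:
    """Hypothetical stand-in; only the two token-id lists matter here."""

    def __init__(self, prompt_token_ids, output_token_ids):
        self.prompt_token_ids = prompt_token_ids
        self.output_token_ids = output_token_ids


# Buggy pattern: token_ids aliases prompt_token_ids, so += extends the
# request's prompt list in place.
buggy = Request(prompt_token_ids=[1, 2, 3], output_token_ids=[4, 5])
token_ids = buggy.prompt_token_ids
token_ids += buggy.output_token_ids
print(buggy.prompt_token_ids)   # [1, 2, 3, 4, 5]  <- prompt corrupted

# Fixed pattern: `+` builds a new list and leaves the request untouched.
fixed = Request(prompt_token_ids=[1, 2, 3], output_token_ids=[4, 5])
token_ids = fixed.prompt_token_ids
input_token_ids = token_ids + fixed.output_token_ids
print(fixed.prompt_token_ids)   # [1, 2, 3]
print(input_token_ids)          # [1, 2, 3, 4, 5]
```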

 self._free_blocks(preempted_req)
 llm_logger.info(f"Preemption is triggered! Preempted request id: {preempted_req.request_id}")
 else:
+if envs.FD_SAVE_OUTPUT_CACHE_FOR_PREEMPTED_REQUEST:

juncaipeng (Collaborator, Author): Write the rescheduled request's cache out to storage.

 if self.available_batch() == 0:
     return False
-if not self.cache_manager.can_allocate_gpu_blocks(need_prealloc_prefill_blocks):
+total_need_blocks = self._get_can_schedule_prefill_threshold_block(need_prealloc_prefill_blocks)

juncaipeng (Collaborator, Author): When a request on the P instance asks the D instance for blocks, have the D instance reserve block ids for running requests.
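The reservation idea can be sketched as a threshold check that folds in headroom for already-running decode requests before granting a prefill instance's preallocation. All names and the per-request reservation size below are hypothetical, not FastDeploy's actual API:

```python
def can_schedule_prefill(need_prealloc_prefill_blocks, free_blocks,
                         running_requests, reserve_blocks_per_running_req=1):
    """Hypothetical version of the threshold check: on the decode
    instance, keep some blocks in reserve for requests that are already
    running before accepting a prefill instance's block request."""
    reserved = len(running_requests) * reserve_blocks_per_running_req
    total_need_blocks = need_prealloc_prefill_blocks + reserved
    return free_blocks >= total_need_blocks


# 10 free blocks, 3 running requests each reserving 1 block:
print(can_schedule_prefill(6, free_blocks=10, running_requests=["a", "b", "c"]))  # True
print(can_schedule_prefill(8, free_blocks=10, running_requests=["a", "b", "c"]))  # False
```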

@juncaipeng juncaipeng changed the title from "[PD Disaggregation] Write the cache of preempted req to storage" to "[PD Disaggregation] Write the cache of preempted req to storage and refine PD Disaggregation" on Mar 31, 2026
rainyfly previously approved these changes Mar 31, 2026
codecov-commenter commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 57.14286% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@25d64ef). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/cache_manager/prefix_cache_manager.py 50.00% 5 Missing and 1 partial ⚠️
fastdeploy/engine/sched/resource_manager_v1.py 66.66% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7107   +/-   ##
==========================================
  Coverage           ?   73.93%           
==========================================
  Files              ?      402           
  Lines              ?    56582           
  Branches           ?     8945           
==========================================
  Hits               ?    41833           
  Misses             ?    11811           
  Partials           ?     2938           
Flag Coverage Δ
GPU 73.93% <57.14%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Comment on lines +369 to 372
if envs.FD_SAVE_OUTPUT_CACHE_FOR_PREEMPTED_REQUEST:
if self.config.cache_config.kvcache_storage_backend:
self.cache_manager.write_cache_to_storage_decode(preempted_req)
self._free_blocks(preempted_req)

Copilot AI Mar 31, 2026


Here, write_cache_to_storage_decode() is called synchronously while holding the ResourceManager lock; that call waits for an acknowledgment from the cache_transfer thread (is_sync=True), which may block the scheduling thread and amplify tail latency on the preemption path. Consider moving the write-back out of the lock (or submitting it to a thread pool to run asynchronously), while making sure the corresponding GPU blocks are not recycled before the write-back completes (e.g. defer _free_blocks, or hold a snapshot of the needed block_ids in the write-back task).
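The restructuring the review suggests, snapshot under the lock, write back off-thread, free only after the write completes, can be sketched as below. Every name here (Req, write_cache_to_storage_decode, free_blocks, preempt) is a hypothetical stand-in for illustration, not FastDeploy's real code:

```python
from concurrent.futures import ThreadPoolExecutor
import threading


class Req:
    """Hypothetical request carrying the GPU block ids it owns."""

    def __init__(self, request_id, block_tables):
        self.request_id = request_id
        self.block_tables = block_tables


events = []                                  # records ordering for the demo
lock = threading.Lock()                      # stands in for the scheduler lock
executor = ThreadPoolExecutor(max_workers=1)


def write_cache_to_storage_decode(req, block_ids):
    # Stand-in for the slow, synchronous storage write.
    events.append(("write", req.request_id, tuple(block_ids)))


def free_blocks(req):
    # Stand-in for recycling the request's GPU blocks.
    events.append(("free", req.request_id))


def preempt(req):
    # Snapshot the block ids while still holding the scheduler lock ...
    with lock:
        snapshot = list(req.block_tables)

    # ... but run the storage write on a worker thread, and only recycle
    # the GPU blocks once the write-back has completed.
    def write_back_then_free():
        write_cache_to_storage_decode(req, snapshot)
        with lock:
            free_blocks(req)  # safe: cache is already persisted

    return executor.submit(write_back_then_free)


fut = preempt(Req("req-0", [3, 7]))
fut.result()
print(events)  # [('write', 'req-0', (3, 7)), ('free', 'req-0')]
```

The key ordering property is that "free" can never precede "write", because both run inside the same submitted task.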

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit af51fc4 into PaddlePaddle:develop Apr 1, 2026
38 of 42 checks passed
mattheliu pushed a commit to mattheliu/FastDeploy that referenced this pull request Apr 1, 2026
…efine PD Disaggregation (PaddlePaddle#7107)

* Write the cache of preempted req to storage

* up

* fix

5 participants