[PD Disaggregation] Write the cache of preempted req to storage and refine PD Disaggregation#7107
Conversation
Thanks for your contribution!
Pull request overview
This PR targets the PD Disaggregation / v1 scheduling path: when a preemption occurs, the request's KV cache (optionally including output tokens) is written to the storage backend so it can be reused/restored later. It also makes small adjustments to the scheduling resource checks and the cache allocation strategy to reduce the risk of stalls.
Changes:
- Add an environment variable switch controlling whether a preempted request's cache is written to storage.
- In `ResourceManagerV1._trigger_preempt()`, write the preempted request's cache to storage according to its role (P/D).
- Add a parameter to `PrefixCacheManager.can_allocate_gpu_blocks()` to control whether to actively try to free up GPU blocks, and fix how `token_ids` is constructed when writing to storage.
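The preempt-and-persist flow described above can be sketched as follows. This is a simplified stand-in, not FastDeploy's real implementation: only the idea of `write_cache_to_storage_decode` and the save-cache switch come from the PR; the classes and argument names here are illustrative.

```python
class FakeCacheManager:
    """Illustrative stand-in for PrefixCacheManager's storage write-back."""

    def __init__(self):
        self.written = []

    def write_cache_to_storage_decode(self, req):
        # Record which request's KV cache was persisted.
        self.written.append(req["request_id"])


def trigger_preempt(preempted_req, cache_manager, save_cache_enabled, storage_backend):
    # Persist the preempted request's KV cache before its blocks are freed,
    # but only when both the env switch and a storage backend are configured.
    if save_cache_enabled and storage_backend:
        cache_manager.write_cache_to_storage_decode(preempted_req)
    # Blocks are recycled only after the write-back call has returned.
    preempted_req["block_ids"] = []


mgr = FakeCacheManager()
req = {"request_id": "req-1", "block_ids": [3, 7]}
trigger_preempt(req, mgr, save_cache_enabled=True, storage_backend="attention_store")
```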
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/envs.py | Adds the environment variable switch for writing a preempted request's cache to storage. |
| fastdeploy/engine/sched/resource_manager_v1.py | Adds the write-to-storage behavior on the preemption path, and adjusts the prefill-threshold function signature and some resource checks. |
| fastdeploy/engine/common_engine.py | Lowers the log level for decode resource-allocation failures from error to warning. |
| fastdeploy/cache_manager/prefix_cache_manager.py | Extends the GPU-block allocatability check; fixes the token_ids concatenation when writing to storage so the original list is not modified in place. |
```diff
  gpu_recv_block_ids = []
  match_cpu_blocks_num = len(match_cpu_block_ids)
- if self.can_allocate_gpu_blocks(num_blocks=match_cpu_blocks_num):
+ if self.can_allocate_gpu_blocks(num_blocks=match_cpu_blocks_num, try_free_gpu_blocks=False):
```
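The semantics of the new `try_free_gpu_blocks` parameter can be modeled roughly as below. This is a hypothetical sketch, not FastDeploy's actual implementation; `free_blocks` and `reclaimable_blocks` are illustrative inputs standing in for the manager's internal state.

```python
def can_allocate_gpu_blocks(free_blocks, reclaimable_blocks, num_blocks,
                            try_free_gpu_blocks=True):
    # With try_free_gpu_blocks=False, only currently free blocks count.
    # Otherwise, reclaimable blocks (e.g. evictable cached blocks) may also
    # be counted toward satisfying the request.
    if free_blocks >= num_blocks:
        return True
    if try_free_gpu_blocks:
        return free_blocks + reclaimable_blocks >= num_blocks
    return False
```

Passing `try_free_gpu_blocks=False` on this path makes the check conservative: it refuses rather than evicting other cached blocks to make room.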
```diff
      task_id=req_id,
      keys=no_match_block_keys,
-     token_ids=input_token_ids,
+     token_ids=input_token_ids if self.kvcache_storage_backend == "attention_store" else None,
```
```diff
  if self.config.cache_config.enable_output_caching:
-     token_ids += request.output_token_ids
+     input_token_ids = token_ids + request.output_token_ids
```
Fixes a bug where the request's `prompt_token_ids` was modified in place.
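The fix above swaps an in-place `+=` for list concatenation with `+`. A minimal, self-contained demonstration of why that matters when the local variable aliases the request's list (plain Python, illustrative names):

```python
prompt_token_ids = [1, 2, 3]
output_token_ids = [4, 5]

# Fixed pattern: `+` builds a new list, the aliased prompt list is untouched.
token_ids = prompt_token_ids                 # alias, not a copy
input_token_ids = token_ids + output_token_ids
assert prompt_token_ids == [1, 2, 3]

# Buggy pattern: `+=` extends the list in place through the alias,
# so the "prompt" silently gains the output tokens.
buggy = prompt_token_ids
buggy += output_token_ids
assert prompt_token_ids == [1, 2, 3, 4, 5]
```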
```diff
      self._free_blocks(preempted_req)
      llm_logger.info(f"Preemption is triggered! Preempted request id: {preempted_req.request_id}")
  else:
+     if envs.FD_SAVE_OUTPUT_CACHE_FOR_PREEMPTED_REQUEST:
```
Write the cache of rescheduled requests out to storage.
```diff
  if self.available_batch() == 0:
      return False
- if not self.cache_manager.can_allocate_gpu_blocks(need_prealloc_prefill_blocks):
+ total_need_blocks = self._get_can_schedule_prefill_threshold_block(need_prealloc_prefill_blocks)
```
When a request on the P instance asks the D instance for blocks, the D instance should account for block ids reserved for its already-running requests.
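One way to model that reservation, as a hedged sketch: the D instance admits a prefill-block request only if enough free blocks remain after setting some aside for running (decoding) requests. All names here are illustrative; in particular `reserve_blocks_per_req` is an assumed knob, not a FastDeploy parameter.

```python
def can_schedule_prefill(free_blocks, need_prealloc_prefill_blocks,
                         num_running_reqs, reserve_blocks_per_req):
    # Total demand = prefill preallocation + headroom reserved for
    # requests that are already running on the D instance.
    total_need_blocks = (need_prealloc_prefill_blocks
                         + num_running_reqs * reserve_blocks_per_req)
    return free_blocks >= total_need_blocks
```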
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #7107   +/- ##
========================================
  Coverage         ?   73.93%
========================================
  Files            ?      402
  Lines            ?    56582
  Branches         ?     8945
========================================
  Hits             ?    41833
  Misses           ?    11811
  Partials         ?     2938
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
```diff
+ if envs.FD_SAVE_OUTPUT_CACHE_FOR_PREEMPTED_REQUEST:
+     if self.config.cache_config.kvcache_storage_backend:
+         self.cache_manager.write_cache_to_storage_decode(preempted_req)
  self._free_blocks(preempted_req)
```
Here `write_cache_to_storage_decode()` is called synchronously while the ResourceManager lock is held; the call waits for an acknowledgement from the cache_transfer thread (`is_sync=True`), which can block the scheduling thread and amplify tail latency on the preemption path. Consider moving the write-back out of the lock (or submitting it to a thread pool for asynchronous execution), while making sure the corresponding GPU blocks are not recycled before the write-back completes (for example, defer `_free_blocks`, or have the write-back task hold a snapshot of the required block_ids).
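The suggested restructuring can be sketched like this: snapshot the block ids under the lock, submit the synchronous write-back to a worker pool, and free the blocks only in the task's completion path. All class and method names here are illustrative stand-ins, not FastDeploy's real API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeCacheManager:
    def __init__(self):
        self.written = []

    def write_cache_to_storage_decode(self, req, block_ids):
        # Simulated synchronous write-back (is_sync=True in the real code).
        self.written.append((req["request_id"], tuple(block_ids)))


class Scheduler:
    def __init__(self, cache_manager):
        self.lock = threading.Lock()
        self.cache_manager = cache_manager
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.freed = []

    def preempt(self, req):
        # Hold the lock only long enough to snapshot the block ids,
        # keeping them alive for the background write-back task.
        with self.lock:
            block_snapshot = list(req["block_ids"])
        # The slow, blocking write-back runs outside the scheduler lock.
        return self.pool.submit(self._write_back_and_free, req, block_snapshot)

    def _write_back_and_free(self, req, block_snapshot):
        self.cache_manager.write_cache_to_storage_decode(req, block_snapshot)
        # Recycle the blocks only after the write-back has completed.
        with self.lock:
            self.freed.extend(block_snapshot)


mgr = FakeCacheManager()
sched = Scheduler(mgr)
sched.preempt({"request_id": "r1", "block_ids": [2, 5]}).result()
```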
…efine PD Disaggregation (PaddlePaddle#7107) * Write the cache of preempted req to storage * up * fix
Motivation
Modifications
See the review comments above.
Usage or Command
Accuracy Tests
Checklist
- Add at least one tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch PR, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.