[Cherry-Pick][Optimization][Speculative Decoding] opt mtp logprob (#7883) #7884
Sunny-bot1 wants to merge 2 commits into
Thanks for your contribution!
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.6 #7884 +/- ##
==============================================
Coverage ? 72.23%
==============================================
Files ? 381
Lines ? 54233
Branches ? 8473
==============================================
Hits ? 39175
Misses ? 12295
Partials ? 2763
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-22 15:03:13
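📋 Review Summary
PR overview: Performance optimization for MTP + logprob (top_logprobs:0). Metadata is packed and transferred according to the actual number of top_logprobs, which avoids filling unused topk slots, and the Python side pre-batches .tolist() calls to reduce hot-path overhead. Measured improvement: 10%.
Scope of changes: custom_ops/gpu_ops/speculate_decoding/ (3 C++ files), fastdeploy/worker/gpu_model_runner.py, fastdeploy/output/token_processor.py
Impact tags: [Speculative Decoding] [OP] [DataProcessor]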
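The Python-side part of the change (converting once before the loop instead of on every iteration) can be sketched as follows. This is a minimal illustration, not the code in token_processor.py: the standard-library `array` type stands in for the real tensors, and the function names `convert_per_step` and `convert_batched` are hypothetical.

```python
from array import array  # stdlib stand-in for a tensor; the real code works on tensors


def convert_per_step(token_ids, scores, num_tokens, topk):
    """Pattern before the change: one pair of .tolist() calls per token."""
    out = []
    for i in range(num_tokens):
        ids = token_ids[i * topk:(i + 1) * topk].tolist()
        vals = scores[i * topk:(i + 1) * topk].tolist()
        out.append((ids, vals))
    return out


def convert_batched(token_ids, scores, num_tokens, topk):
    """Pattern after the change: convert once, then slice plain Python lists."""
    all_ids = token_ids.tolist()
    all_vals = scores.tolist()
    return [
        (all_ids[i * topk:(i + 1) * topk], all_vals[i * topk:(i + 1) * topk])
        for i in range(num_tokens)
    ]


# Hypothetical data: 4 tokens, stored flat.
token_ids = array("q", [11, 12, 13, 14])
scores = array("d", [-0.1, -0.2, -0.3, -0.4])
```

Both functions return the same rows; the batched variant only moves the conversion out of the loop, so any fixed cost of a conversion call is paid once rather than once per token.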
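Issues
| Level | File | Summary |
|---|---|---|
| ❓ Question | mtp_save_first_token_with_topk.cc:125 | Is message_flag guaranteed to fit in the low 8 bits? The OR operation could cause bit pollution |
| 🟡 Suggestion | gpu_model_runner.py | A6: this changes a common code path; do the ModelRunners for other hardware need the same update? |
| 🟡 Suggestion | tests/ | B2 + A3: no new unit tests cover the bit-packing protocol of the C++ spec-decode operators |
| 📝 PR conventions | PR body | Modifications / Usage or Command / Accuracy Tests are empty; no Checklist item is checked |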
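📝 PR Convention Check
The title contains two official tags ([Optimization] + [Speculative Decoding]); by convention each PR title should contain only one official tag. The ## Modifications, ## Usage or Command, and ## Accuracy Tests sections have no actual content, and no Checklist item is checked.
Suggested title (can be copied directly):
[Cherry-Pick][Speculative Decoding] opt mtp logprob top_logprobs=0 perf +10% (#7883)
Suggested PR description (can be copied directly; it must replicate the full structure of the checklist §D2 template):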
## Motivation
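MTP + logprob (top_logprobs:0) performance improves by 10%: message_flag (low 8 bits) and max_num_logprobs (high 16 bits) are packed into meta[1], so the C++ loops write and read only the topk columns actually requested and avoid filling unused slots; the Python side pre-batches .tolist() to reduce per-call conversion overhead on the hot path.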
## Modifications
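- `mtp_save_first_token_with_topk.cc`: packs `message_flag` (low 8 bits) and `max_num_logprobs` (high 16 bits) into `meta[1]`; the loop writes only `max_num_logprobs` columns, and the index stride changes from `SPEC_LOGPROB_K+1` to `max_num_logprobs`
- `speculate_save_output_with_topk.cc`: same packing logic; removes the else branch that padded with `-1/0.0`
- `speculate_get_output_with_topk.cc`: unpacks `actual_topk` from `meta[1]`; the inner loop now uses `actual_topk` as its upper bound
- `gpu_model_runner.py`: removes the restriction that hardcoded `max_logprobs` to 20 under speculative_decoding and uses the actually requested value instead
- `token_processor.py`: slices the tokens/scores tensors to `actual_topk` columns; the hot-path `.tolist()` calls become a single batched conversion before the loop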
## Usage or Command
N/A
## Accuracy Tests
N/A (pure performance optimization; does not affect model output)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
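- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The optimization approach is clear. The bit-packing protocol is consistent across all three sites, the sender (two .cc files) and the receiver (C++ + Python), and the performance benefit is well defined. The main concerns are that the assumed bit range of message_flag should be protected explicitly, and that the change to the common path in gpu_model_runner.py needs confirmation that the Runners for other hardware are unaffected.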
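The sender/receiver agreement described above can be sketched in a few lines. This is a simplified model, not the operators' actual code: `send`, `receive`, and the flat `payload` list are hypothetical, and `FIXED_STRIDE = 21` is only an assumed placeholder for the old `SPEC_LOGPROB_K + 1` stride, whose real value is not given here. The packing expression mirrors the one in the PR and assumes message_flag fits in 8 bits and at least one column is sent.

```python
FIXED_STRIDE = 21  # hypothetical stand-in for the old SPEC_LOGPROB_K + 1 stride


def send(message_flag, rows):
    """Sender side: write only the columns that exist, and record the
    column count in the same integer as the flag (meta[1] in the PR)."""
    num_cols = len(rows[0])  # plays the role of max_num_logprobs; assumed >= 1
    meta1 = message_flag | (num_cols << 8)  # assumes 0 <= message_flag < 256
    payload = [tok for row in rows for tok in row]  # stride is num_cols
    return meta1, payload


def receive(meta1, payload):
    """Receiver side: unpack both fields, then read rows with the same stride."""
    message_flag = meta1 & 0xFF
    actual_topk = (meta1 >> 8) & 0xFFFF
    rows = [payload[i:i + actual_topk] for i in range(0, len(payload), actual_topk)]
    return message_flag, rows


# Hypothetical example: 3 tokens, one column each.
rows = [[11], [12], [13]]
meta1, payload = send(3, rows)
```

With one column per token the payload holds 3 entries, whereas a fixed stride of `FIXED_STRIDE` would reserve `3 * FIXED_STRIDE` entries. The receiver can only slice correctly if it uses the stride the sender recorded, which is why all three sites have to agree on the layout of meta[1].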
// Pack message_flag (low 8 bits) and max_num_logprobs (high 16 bits) into
// meta[1]. Receiver unpacks both to avoid reading unused topk slots.
int max_num_logprobs = logprob_token_ids.shape()[1];
msg_sed.meta[1] = message_flag | (max_num_logprobs << 8);
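❓ Question: Is message_flag guaranteed to use only the low 8 bits (< 256)?
Current packing: msg_sed.meta[1] = message_flag | (max_num_logprobs << 8), with the low 8 bits for message_flag and the high 16 bits for max_num_logprobs. If message_flag is ever assigned a value above 255, the bits of the two fields pollute each other, and the Python side unpacks a wrong mtype and actual_topk.
Suggestion: protect the packing explicitly with an assertion or a mask:
assert((message_flag & 0xFF) == message_flag); // or
msg_sed.meta[1] = (message_flag & 0xFF) | (max_num_logprobs << 8);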
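To make the concern concrete, the following sketch reproduces the packing arithmetic in Python and shows what the receiver would decode when the flag exceeds 8 bits. The function names are hypothetical, and the receiver-side unpacking (`& 0xFF` for the flag, `>> 8` for the count) is an assumption based on the field layout described in the comment above.

```python
def pack_unguarded(message_flag, max_num_logprobs):
    """Same expression as the current packing, with no range check."""
    return message_flag | (max_num_logprobs << 8)


def pack_guarded(message_flag, max_num_logprobs):
    """Variant that rejects a flag that does not fit in the low 8 bits."""
    if (message_flag & 0xFF) != message_flag:
        raise ValueError("message_flag does not fit in 8 bits")
    return message_flag | (max_num_logprobs << 8)


def unpack(meta1):
    """Assumed receiver-side decoding: (mtype, actual_topk)."""
    return meta1 & 0xFF, (meta1 >> 8) & 0xFFFF


# In-range flag: both fields round-trip.
ok = unpack(pack_unguarded(3, 5))

# Out-of-range flag 0x103 with 4 columns: bit 8 of the flag lands in the
# count field, so the receiver sees flag 3 and count 5 instead of 4.
polluted = unpack(pack_unguarded(0x103, 4))
```

Masking the flag with `& 0xFF` would keep the count intact but silently truncate the flag, while an assertion surfaces the violated assumption; which of the two suits this code path is a judgment call for the authors.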
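CI report generated from the following code (updated every 30 minutes):
1 Task overview
8/10 Required tasks passed; 2 failed Required tasks currently block the merge:
2 Task status summary
Note on the log column: failed tasks link directly to their logs; failures of optional tasks do not block the merge and are for reference only.
2.1 Required tasks: 8/10 passed
2.2 Optional tasks: 23/27 passed
3 Failure details (required only)
Approval: manual approval required (confidence: high)
This job requires manual Approval; CI continues only after the approval is completed.
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: test failure (confidence: high)
Failed cases:
Root cause details: Code context check:
Key logs: Fix suggestions:
Fix suggestion summary: pack meta according to the new protocol and validate the logprobs baseline
Cache note: this Required main test failure hit the cached historical analysis, so the full log was not downloaded again; no miss job needs to be written back to the cache in this round.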
Motivation
MTP + logprob (top_logprobs:0) performance improved by 10%
Modifications
Usage or Command
Accuracy Tests
Checklist
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.