
[Cherry-Pick] [Optimization] TopP=1.0 using _random_sample (#7892) and Triton SamplerBackend (#7639) #7910

Open
ckl117 wants to merge 7 commits into PaddlePaddle:release/2.6 from ckl117:26_topp1

ckl117 (Collaborator) commented May 25, 2026

Motivation

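Support the Triton sampling backend (#7639).
Optimize top_p=1.0 sampling (#7892).
Add the environment variable FD_ENABLE_TOP_P_ONE_OPT=1 to make RL validation easier. It will be removed once validation passes.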
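A pure-Python reference of top-p (nucleus) filtering shows why the top_p=1.0 fast path is safe. With top_p=1.0, the filter keeps every token and renormalizes by a total that is 1 up to rounding. It therefore returns the original distribution, so sampling directly from the unfiltered probabilities is equivalent.

This is a minimal sketch under that assumption. `top_p_filter` and `sample_token` are hypothetical names. They are not FastDeploy's `_random_sample` or the Triton kernel.

```python
import random
from typing import List


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Reference nucleus filter: keep the highest-probability tokens until their
    cumulative mass reaches top_p, zero out the rest, then renormalize.
    Assumes probs is non-negative with a positive sum."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [0.0] * len(probs)
    mass = 0.0
    for i in order:
        kept[i] = probs[i]
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(kept)
    return [p / total for p in kept]


def sample_token(probs: List[float], top_p: float, rng: random.Random) -> int:
    """Hypothetical sampler with a top_p == 1.0 fast path."""
    if top_p >= 1.0:
        # Fast path: filtering would keep every token, so skip it entirely.
        weights = probs
    else:
        weights = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


probs = [0.5, 0.25, 0.125, 0.125]
print(top_p_filter(probs, 1.0))  # [0.5, 0.25, 0.125, 0.125]: a no-op at top_p=1.0
print(top_p_filter(probs, 0.7))  # only the top two tokens survive, about [0.667, 0.333, 0.0, 0.0]
```

In a batched sampler the fast path only applies to requests whose top_p is 1.0. Whether the real implementation decides this per request or per batch is not stated in this PR.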

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot commented May 25, 2026

Thanks for your contribution!

PaddlePaddle-bot commented May 25, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-25 19:42:44

This CI report was generated from the following code (refreshed every 30 minutes):


1 Job overview

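Not all required jobs have passed, so merging is not recommended yet. Failed required jobs: 2. Pending required jobs: 0 (0 running, 0 queued). The main blockers are the unit-test coverage gate failure and the Approval job, which is awaiting manual approval.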

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Queued | Skipped |
|---|---|---|---|---|---|---|
| 37 (0) | 37 | 31 | 5 | 0 | 1 | 0 |

2 Job status summary

Log column: failed jobs link directly to their logs; running jobs link to the Job page.

2.1 Required jobs: 8/10 passed

Required jobs block merging, so their failures should be handled first.

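| Status | Job | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | 1h13m | PR issue: diff coverage is only 20%, below the 80% threshold | Add sampling/Triton coverage or exempt the JIT kernels | Job | - |
| ❌ | Approval | 19s | Approval required | Complete the manual approval | Job | - |
| ✅ | The other 8 required jobs passed | - | - | - | - | - |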

2.2 Optional jobs: 23/27 passed

Optional jobs do not block merging; their failures are for reference only.

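| Status | Job | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 15m48s | Job | - |
| ❌ | Check PR Template | 21s | Job | - |
| ❌ | Trigger Jenkins for PR | 19s | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The other 23 optional jobs passed | - | - | - |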

3 Failure details (required jobs only)

Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: coverage gate (confidence: high)

  • Status: ❌ Failed
  • Error type: coverage gate
  • Confidence: high
  • Root cause summary: diff coverage is only 20%, below the 80% threshold
  • Analyzer: ci_analyze_unittest_fastdeploy

Failed test cases: none. The log shows All tests passed; the failure occurred in the coverage-threshold check.

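Root cause details:
This PR modifies or adds sampling-related code (e.g. top_k_top_p_triton.py, sampler.py, top_k_top_p_sampling.py, input_batch.py). However, the diff coverage report shows total coverage of only 20%, below the 80% required by CI.

In particular, fastdeploy/model_executor/layers/sample/ops/top_k_top_p_triton.py is only 11.49% covered. Large Triton JIT kernel sections (such as _update_min_larger_stats_topk_topp_kernel) count toward diff coverage. Meanwhile, the existing tests/layers/test_triton_sampler.py mainly exercises a few Python-wrapper paths, which produces 508 violations.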

Key log lines:

```
All tests passed
Coverage generation failed (exit code 9)
GPU Patch Coverage Details:
total_percent_covered: 20
total_num_violations: 508
fastdeploy/model_executor/layers/sample/ops/top_k_top_p_triton.py: 11.49%
fastdeploy/model_executor/layers/sample/ops/top_k_top_p_sampling.py: 70.0% (violations: 40,45,46)
fastdeploy/worker/input_batch.py: 66.67% (violation: 552)
```

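Suggested fixes:

  1. In tests/layers/test_triton_sampler.py, add coverage for:
     • the triton branch and the top_p=1.0 optimization branch in sampler.py;
     • the ImportError/dispatch branches in top_k_top_p_sampling.py;
     • the top_p_list reset logic in input_batch.py::reset_share_inputs.
  2. Some Triton JIT kernel regions in fastdeploy/model_executor/layers/sample/ops/top_k_top_p_triton.py cannot actually be traced by Python coverage (e.g. the @triton.jit helpers after line 92 and _topk_topp_kernel after line 115). Add executable GPU coverage for them. If such kernel source is not suited to Python diff-coverage accounting, add a coverage exclusion/exemption following project conventions instead of letting untraceable kernel lines count toward the 80% gate.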

Fix summary: add sampling/Triton coverage or exempt the JIT kernels.
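Suggested fix 1 asks for tests of the env-gated dispatch branches. Below is a minimal standard-library sketch of such a test.

`choose_sampling_path` is a hypothetical stand-in for the real branch in sampler.py. The rule it encodes is an assumption, not confirmed by the source: take the random-sample path only when FD_ENABLE_TOP_P_ONE_OPT=1 and every entry in top_p_list is 1.0. A real test in tests/layers/test_triton_sampler.py would call the actual sampler instead.

```python
import os
import unittest
from typing import List, Optional
from unittest import mock


def choose_sampling_path(top_p_list: Optional[List[float]]) -> str:
    """Hypothetical dispatcher (assumed rule): use the plain random-sample path
    only when the opt-in flag is set and every request has top_p == 1.0."""
    opt_enabled = os.getenv("FD_ENABLE_TOP_P_ONE_OPT", "0") == "1"
    if opt_enabled and top_p_list and all(p == 1.0 for p in top_p_list):
        return "random_sample"
    return "top_p_sampling"


class TopPOneDispatchTest(unittest.TestCase):
    def test_flag_on_all_ones_takes_fast_path(self):
        with mock.patch.dict(os.environ, {"FD_ENABLE_TOP_P_ONE_OPT": "1"}):
            self.assertEqual(choose_sampling_path([1.0, 1.0]), "random_sample")

    def test_flag_on_mixed_batch_falls_back(self):
        with mock.patch.dict(os.environ, {"FD_ENABLE_TOP_P_ONE_OPT": "1"}):
            self.assertEqual(choose_sampling_path([1.0, 0.9]), "top_p_sampling")

    def test_flag_off_never_takes_fast_path(self):
        with mock.patch.dict(os.environ, {"FD_ENABLE_TOP_P_ONE_OPT": "0"}):
            self.assertEqual(choose_sampling_path([1.0]), "top_p_sampling")

    def test_missing_list_falls_back(self):
        # Mirrors the case where top_p_list was reset (e.g. by reset_share_inputs).
        with mock.patch.dict(os.environ, {"FD_ENABLE_TOP_P_ONE_OPT": "1"}):
            self.assertEqual(choose_sampling_path(None), "top_p_sampling")
```

`mock.patch.dict` restores `os.environ` after each `with` block, so the cases do not leak the flag into one another. Run the tests with `python -m unittest` or pytest.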

Related changes: fastdeploy/model_executor/layers/sample/ops/top_k_top_p_triton.py, fastdeploy/model_executor/layers/sample/sampler.py, fastdeploy/model_executor/layers/sample/ops/top_k_top_p_sampling.py, fastdeploy/worker/input_batch.py, tests/layers/test_triton_sampler.py

Link: view log

Approval: manual approval required (confidence: high)

This job requires manual approval; CI continues only after approval is granted.

codecov-commenter commented May 25, 2026

Codecov Report

❌ Patch coverage is 18.11024% with 520 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@85399db).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...l_executor/layers/sample/ops/top_k_top_p_triton.py | 10.95% | 493 Missing and 3 partials ⚠️ |
| fastdeploy/model_executor/layers/sample/sampler.py | 69.35% | 11 Missing and 8 partials ⚠️ |
| ...executor/layers/sample/ops/top_k_top_p_sampling.py | 60.00% | 3 Missing and 1 partial ⚠️ |
| fastdeploy/worker/input_batch.py | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@              Coverage Diff               @@
##             release/2.6    #7910   +/-   ##
==============================================
  Coverage               ?   71.83%           
==============================================
  Files                  ?      383           
  Lines                  ?    55071           
  Branches               ?     8620           
==============================================
  Hits                   ?    39560           
  Misses                 ?    12732           
  Partials               ?     2779           
```
| Flag | Coverage Δ |
|---|---|
| GPU | 71.83% <18.11%> (?) |

Flags with carried forward coverage won't be shown.


PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-25 16:44:39

📋 Review summary

PR overview: cherry-picks two sampling optimizations, a new Triton sampling backend and a top_p=1.0 fast-path optimization
Scope of changes: model_executor/layers/sample/, worker/gpu_model_runner.py, envs.py
Affected tags: [OP] [Optimization]

Issues

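| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | meta_data.py:46 | top_p_list uses a bare list type with no element type annotation |
| ❓ Question | top_k_top_p_sampling.py:108 | The topp_seed CPU→GPU copy was removed; confirm that the caller guarantees it is on the GPU |
| 🟡 Suggestion | gpu_model_runner.py | A6: a common code path was changed; please confirm whether other hardware ModelRunners need the same change |
| 📝 PR conventions | PR description | The Modifications, Usage or Command, and Accuracy Tests sections are empty |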

📝 PR convention check

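The PR title has an extra space between [Cherry-Pick] and [Optimization]. The Cherry-Pick convention is [Cherry-Pick][Tag], with no space between the tags. In addition, the Modifications, Usage or Command, and Accuracy Tests sections are all empty.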

Suggested title (ready to copy):

  • [Cherry-Pick][Optimization] TopP=1.0 using _random_sample and Triton SamplerBackend (#7892, #7639)

Suggested PR description (ready to copy):

## Motivation
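1. Support a Triton sampling backend (`FD_SAMPLING_CLASS=triton`), with a high-performance joint Top-K + Top-P sampling kernel based on the Qrita algorithm (#7639).
2. Optimize the top_p=1.0 sampling path: when `FD_ENABLE_TOP_P_ONE_OPT=1`, sampling goes straight to random_sample and skips top-p filtering, reducing latency in RL validation scenarios (#7892).
3. Add the `FD_ENABLE_TOP_P_ONE_OPT` environment variable; it will be removed once RL validation passes.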

## Modifications
- `fastdeploy/model_executor/layers/sample/ops/top_k_top_p_triton.py`: adds the Triton Top-K/Top-P joint sampling kernel, based on the pivot-based truncation algorithm from the Qrita paper
- `fastdeploy/model_executor/layers/sample/ops/top_k_top_p_sampling.py`: extracts the shared `dispatch_top_k_renorm_probs` function and simplifies how `topp_seed` is passed
- `fastdeploy/model_executor/layers/sample/sampler.py`: integrates the triton sampling backend branch and the top_p=1.0 optimization path
- `fastdeploy/model_executor/layers/sample/meta_data.py`: adds a `top_p_list` field for use by the triton path
- `fastdeploy/worker/gpu_model_runner.py` / `input_batch.py`: pass top_p_list through to the sampling layer
- `fastdeploy/envs.py`: adds the `FD_ENABLE_TOP_P_ONE_OPT` environment variable
- `tests/layers/test_triton_sampler.py`: adds unit tests for the triton sampling operators

## Usage or Command
```bash
# Enable the Triton sampling backend
FD_SAMPLING_CLASS=triton python -m fastdeploy.entrypoints.openai.api_server ...

# Enable the top_p=1.0 optimization (RL validation scenarios)
FD_ENABLE_TOP_P_ONE_OPT=1 python -m fastdeploy.entrypoints.openai.api_server ...
```

## Accuracy Tests
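N/A. Sampling output is stochastic, so there is no deterministic accuracy comparison. Distributional equivalence between the triton backend and the base backend is covered by `tests/layers/test_triton_sampler.py`.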

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment

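The code is clearly structured overall. The Triton kernel follows the Qrita paper and its logic is complete, and extracting dispatch_top_k_renorm_probs improves code reuse. This can be merged once the topp_seed device guarantee and the multi-hardware sync question are confirmed.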


Review comment on meta_data.py:46:

```python
top_p: paddle.Tensor
top_p_list: Optional[list] = None
# only GPU used
```
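🟡 Suggestion: the top_p_list field is annotated with a bare list type. This is inconsistent with the adjacent field (top_p: paddle.Tensor). It also carries no element-type information, so IDEs and type checkers cannot infer the element type.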

Suggested change:

```python
top_p_list: Optional[List[float]] = None
```

(Also import List at the top of the file if it is not already imported.)

Review comment on top_k_top_p_sampling.py:108 (diff context):

```python
topp_seed_device = paddle.empty(shape=topp_seed.shape, dtype=topp_seed.dtype)
topp_seed_device.copy_(topp_seed, False)
_, ids = paddle.tensor.top_p_sampling(
x,
```
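❓ Question: the original code explicitly copied topp_seed from CPU to GPU (paddle.empty(...).copy_(topp_seed, False)) before passing it to paddle.tensor.top_p_sampling. The new code passes the original topp_seed directly.

If the caller already guarantees that topp_seed is on the GPU, this simplification is correct. If it can still be on the CPU (e.g. when it is built on the CPU side in input_batch), it will cause a device-mismatch error at runtime.

Please confirm that topp_seed is guaranteed to be on the GPU, or add a comment explaining why it is.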
