[XPU] Add TP broadcast after sampling in XPU model runner. #7096
Open
Jiajun-Ji wants to merge 1 commit into PaddlePaddle:develop from
Conversation
…onsistent results across ranks.
Thanks for your contribution!
Collaborator
/skip-ci ci_iluvatar
Contributor
Pull request overview
This PR addresses inconsistent generation results in the XPU + Tensor Parallel (TP) setting, where each rank samples independently. It adds a broadcast within the TP group after sampling, so every TP rank adopts rank 0's sampling result, guaranteeing consistency across ranks.
Changes:
- In the non-speculative decoding path, broadcast sampled_token_ids within the TP group after sampling.
- In the speculative decoding path, broadcast accept_tokens/accept_num/step_idx/stop_flags within the TP group after sampling.
Comment on lines +1588 to +1593
if self.parallel_config.tensor_parallel_size > 1:
    paddle.distributed.broadcast(
        sampler_output.sampled_token_ids,
        self.parallel_config.data_parallel_rank * self.parallel_config.tensor_parallel_size,
        group=self.parallel_config.tp_group,
    )
The expression computing src (the root rank) is repeated in several places here, and the speculative branch below repeats it again. Consider saving it in a local variable first (e.g. tp_src_rank = data_parallel_rank * tensor_parallel_size) and passing that to broadcast, to avoid the maintenance risk of copy-paste.
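The refactor suggested above might look like the following minimal sketch; the helper name `tp_src_rank` is illustrative and not part of the PR:

```python
# Sketch of the reviewer's suggestion: compute the TP broadcast source rank
# once and reuse it at every broadcast call site. The rank layout assumed
# here (one contiguous TP group per data-parallel rank) matches the
# expression used in the diff.

def tp_src_rank(data_parallel_rank: int, tensor_parallel_size: int) -> int:
    """Global rank of the root (first) rank in this DP rank's TP group."""
    return data_parallel_rank * tensor_parallel_size

# Example: with tensor_parallel_size=4, DP rank 2 owns global ranks 8..11,
# so the broadcast root of its TP group is global rank 8.
```

The value could then be passed as the `src` argument of each `paddle.distributed.broadcast` call in both the normal and speculative branches.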
Comment on lines +1601 to +1606
if self.parallel_config.tensor_parallel_size > 1:
    paddle.distributed.broadcast(
        self.share_inputs["accept_tokens"],
        self.parallel_config.data_parallel_rank * self.parallel_config.tensor_parallel_size,
        group=self.parallel_config.tp_group,
    )
The speculative branch calls broadcast several times in a row, each time repeating the same src computation. Consider reusing a single tp_src_rank variable, and looping over a list of keys (accept_tokens/accept_num/step_idx/stop_flags) to broadcast these tensors, reducing the chance of missing one when fields are added or changed later.
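A hedged sketch of the loop-over-keys idea; the broadcast function is passed in as a parameter so the sketch stays runtime-agnostic (in the PR it would be `paddle.distributed.broadcast`), and the helper name is illustrative:

```python
# Sketch: broadcast all speculative-decoding tensors in one loop, so a newly
# added field only needs to be appended to the key list. `broadcast` stands
# in for paddle.distributed.broadcast; `share_inputs` is the model runner's
# dict of tensors.

SPEC_SYNC_KEYS = ("accept_tokens", "accept_num", "step_idx", "stop_flags")

def sync_speculative_outputs(share_inputs, broadcast, src, group=None):
    for key in SPEC_SYNC_KEYS:
        broadcast(share_inputs[key], src, group=group)
```

This keeps the four broadcasts in one place and makes the source-rank expression appear only once at the call site.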
Motivation
In TP, each rank may produce different sampling results due to independent random sampling.
This PR adds a broadcast operation after sampling in the XPU model runner to synchronize the sampled tokens from rank 0.
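The motivation can be illustrated with a toy, single-process sketch (no real distributed runtime; per-rank RNG states stand in for each rank's independent sampler, and the vocabulary size is made up):

```python
import random

# Each TP rank samples with its own RNG state, so sampled token ids can
# diverge across ranks; adopting rank 0's result restores consistency.

def sample_token(rank: int) -> int:
    rng = random.Random(rank)    # per-rank RNG -> potentially divergent draws
    return rng.randrange(32000)  # illustrative vocabulary size

ranks = range(4)
before = [sample_token(r) for r in ranks]  # may differ from rank to rank
after = [before[0]] * len(before)          # "broadcast" rank 0's sample
```

After the broadcast step, every rank holds rank 0's token, which is what the added `paddle.distributed.broadcast` call achieves in the TP group.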
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add a tag to the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.