Make pp_stream wait on attn_backward_dx #10984
Open
lshpku wants to merge 1 commit into PaddlePaddle:dsv3_dev
This Pull Request is stale because it has been open for 60 days with no activity.
PR types
Bug fixes
PR changes
Models
Description
Make pp_stream wait on attn_backward_dx. This fixes the slow loss decrease seen when overlap_p2p_comm is enabled.
The figure below shows the wait relationships before and after the fix.

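To be honest, I don't know why adding this wait is enough. I bisected the problem down to PP(F), tried adding a wait there, and the loss went back to normal. My guess is that it is related to cross-stream memory allocation: in unit tests I found that Paddle's cross-stream memory allocation is unsafe in some cases. The model does not appear to use any unsafe pattern, but it is hard to be sure, so I would rather be conservative.
This has some performance impact because PP(F) is pushed later, so this PR still needs improvement.
Under normal conditions, with the single-machine configuration (29 Decoder + 1 MTP) and 200 steps, the loss should drop to 7.3. Before this PR, with overlap_p2p_comm enabled, the loss only dropped to 8.7. With this PR, it drops to 7.3 whether overlap_p2p_comm is enabled or not.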
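To make the performance note above concrete, here is a minimal sketch of how an added cross-stream wait pushes PP(F) later. It is a toy timeline model in plain Python, not Paddle code and not the change in this PR. The `Op` type, the `schedule` and `build` helpers, the `calc_stream` name, and the durations are all hypothetical; the sketch also assumes attn_backward_dx runs on a different stream from pp_stream.

```python
from collections import namedtuple

# Toy model only: names other than pp_stream, attn_backward_dx and PP(F)
# are hypothetical, and the durations are made-up illustrative numbers.
Op = namedtuple("Op", ["name", "stream", "duration", "waits_on"])


def schedule(ops):
    """Return {op name: (start, end)} for ops issued in list order.

    Ops on the same stream run back to back. An op additionally waits for
    the end of every op named in waits_on, like a cross-stream event wait.
    """
    stream_free = {}  # stream name -> time at which the stream is idle
    times = {}        # op name -> (start, end)
    for op in ops:
        start = stream_free.get(op.stream, 0)
        for dep in op.waits_on:
            start = max(start, times[dep][1])
        end = start + op.duration
        times[op.name] = (start, end)
        stream_free[op.stream] = end
    return times


def build(with_wait):
    """Build the two-op timeline, with or without the extra wait."""
    deps = ("attn_backward_dx",) if with_wait else ()
    return [
        Op("attn_backward_dx", "calc_stream", 4, ()),  # assumed stream name
        Op("PP(F)", "pp_stream", 2, deps),
    ]


before = schedule(build(with_wait=False))
after = schedule(build(with_wait=True))
print("PP(F) without the wait:", before["PP(F)"])
print("PP(F) with the wait:", after["PP(F)"])
```

With these made-up durations, PP(F) overlaps attn_backward_dx when there is no wait, and starts only after attn_backward_dx finishes once the wait is added. The overlap that is lost is the performance cost mentioned above.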