vulkan: when using transfer queue for async copies, sync on event_wait to avoid race #25229
Open
0cc4m wants to merge 1 commit into
Conversation
Contributor
Thanks @0cc4m, this looks like the right direction! One heads-up: the new sync only fires in
Contributor
Author
Whoever uses the async copy commands needs to use either ggml_backend_synchronize or events to make sure the copies are done by the time it wants to use the data, and also that the read is done before it writes to the buffer again. That is done for the MoE expert upload as well. I don't see the problem you mean.
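To make the contract described above concrete, here is a minimal sketch in plain C++. It is not ggml or Vulkan code: `copy_async` and `upload_twice` are hypothetical names, and the `std::future` only stands in for whatever the caller uses to wait (a synchronize call or an event). It shows the two waits the comment asks for: one before the source buffer is rewritten, one before the destination is used.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical async copy: returns immediately while the copy runs on another thread.
// The returned future plays the role of a synchronize/event handle (an assumption,
// not the ggml API).
std::future<void> copy_async(const std::vector<float> & src, std::vector<float> & dst) {
    return std::async(std::launch::async, [&src, &dst] { dst = src; });
}

float upload_twice() {
    std::vector<float> staging = {1.0f, 2.0f, 3.0f};
    std::vector<float> device_a, device_b;

    std::future<void> done = copy_async(staging, device_a);
    done.wait();   // the read of staging must finish before staging is rewritten

    staging = {4.0f, 5.0f, 6.0f};
    done = copy_async(staging, device_b);
    done.wait();   // the copy must finish before device_b is used

    return device_a[0] + device_b[0];
}
```

Dropping the first `wait()` would let the second assignment to `staging` overlap with the first copy still reading it, which is the kind of race the caller is responsible for preventing.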
Overview
When async_use_transfer_queue is set (for AMD RDNA GPUs), the async queue did not yet wait for events. On RADV this didn't cause issues, but it could be the source of the issue reported for AMD Windows devices. I can't reproduce it, so this is a guess, but I have verified that it does not regress performance or cause incoherent output on Linux.

This is an attempt to fix the issue reported in #25195, and could be an alternative to #25196 depending on performance. @liminfei-amd please check if this resolves the race condition. It was the only issue I could find with the transfer queue use.
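For readers unfamiliar with the pattern, the idea of one queue waiting on an event recorded by another can be sketched as below. This is a simplified model using threads and a condition variable, not the Vulkan backend code from this PR: `Event` and `copy_after_event` are hypothetical names, and the two threads merely stand in for a compute queue and a transfer queue.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a backend event, signaled once by the recording queue.
struct Event {
    std::mutex m;
    std::condition_variable cv;
    bool signaled = false;

    void record() {   // called by the producing queue once its work is done
        { std::lock_guard<std::mutex> lock(m); signaled = true; }
        cv.notify_all();
    }
    void wait() {     // called by the consuming queue before it touches the data
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return signaled; });
    }
};

// The "compute queue" fills src and records the event.
// The "transfer queue" waits on the event, then copies src into dst.
std::vector<int> copy_after_event(int n) {
    std::vector<int> src(n, 0), dst(n, -1);
    Event ready;

    std::thread compute([&] {
        for (int i = 0; i < n; ++i) src[i] = i * i;
        ready.record();
    });
    std::thread transfer([&] {
        ready.wait();   // without this wait, the copy could read src mid-write
        dst = src;
    });

    compute.join();
    transfer.join();
    return dst;
}
```

In this model, removing `ready.wait()` makes the outcome depend on thread timing, which may be why such a race shows up on one driver and not another.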
Requirements