
vulkan: when using transfer queue for async copies, sync on event_wait to avoid race#25229

Open
0cc4m wants to merge 1 commit into master from 0cc4m/vulkan-event-async-transfer-queue-sync

0cc4m commented Jul 2, 2026

Contributor

Overview

When async_use_transfer_queue is set (for AMD RDNA GPUs), the async queue did not yet wait for events. On RADV this didn't cause problems, but it could be the source of the issue reported on AMD Windows devices. I can't reproduce that issue, so this is a guess, but I have verified that the change does not regress performance or cause incoherent output on Linux.

This is an attempt to fix the issue reported in #25195 and could be an alternative to #25196, depending on performance. @liminfei-amd, please check whether this resolves the race condition. It was the only issue I could find with the transfer queue use.
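To illustrate the kind of race described above, here is a minimal, deterministic toy model. It is only a sketch: `Timeline`, `Submission`, `Queue`, `run` and `simulate` are hypothetical names made up for this example and are not taken from the ggml Vulkan backend, and the real backend relies on Vulkan synchronization rather than a plain counter. The model has a compute queue that writes a value and then signals an event, and a transfer queue that copies that value. The scheduler deliberately lets the transfer queue run first whenever it is not blocked.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Toy stand-in for an event: a monotonically increasing counter.
struct Timeline { std::uint64_t value = 0; };

// One unit of queued work, with an optional wait and an optional signal.
struct Submission {
    const Timeline* wait_on;     // nullptr means "do not wait for anything"
    std::uint64_t   wait_value;  // runnable once wait_on->value >= wait_value
    std::function<void()> work;
    Timeline*       signal;      // nullptr means "signal nothing"
    std::uint64_t   signal_value;
};

struct Queue { std::deque<Submission> pending; };

// Executes submissions in order within each queue. Earlier queues in the
// list always get to run first when they are not blocked on a wait.
void run(const std::vector<Queue*>& queues) {
    bool progress = true;
    while (progress) {
        progress = false;
        for (Queue* q : queues) {
            if (q->pending.empty()) continue;
            Submission& s = q->pending.front();
            if (s.wait_on && s.wait_on->value < s.wait_value) continue;  // blocked
            s.work();
            if (s.signal) s.signal->value = s.signal_value;
            q->pending.pop_front();
            progress = true;
            break;  // restart the scan from the highest-priority queue
        }
    }
}

// Returns what the transfer queue copied out of the compute result.
int simulate(bool transfer_waits_for_event) {
    Timeline event;
    int src = 0;   // written by the compute queue
    int dst = -1;  // filled by the transfer queue
    Queue compute, transfer;

    // Compute produces the data, then records the event.
    compute.pending.push_back({nullptr, 0, [&] { src = 42; }, &event, 1});
    // Transfer copies the data, optionally waiting for the event first.
    transfer.pending.push_back({transfer_waits_for_event ? &event : nullptr, 1,
                                [&] { dst = src; }, nullptr, 0});

    run({&transfer, &compute});  // adversarial order: transfer goes first
    assert(compute.pending.empty() && transfer.pending.empty());
    return dst;
}
```

In this model `simulate(false)` returns the stale value 0 because the copy runs before the producer, while `simulate(true)` returns 42 because the copy is held back until the event is signaled. Whether the Windows report has the same cause is, as stated above, still a guess.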

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, Claude wrote the code; I reviewed and tested it.

0cc4m requested a review from a team as a code owner July 2, 2026 09:10
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Jul 2, 2026
@liminfei-amd

Contributor

Thanks @0cc4m, this looks like the right direction! One heads-up: the new sync only fires in event_wait, which requires pipeline parallelism (n_copies > 1), so it won't cover the single-GPU --n-cpu-moe path from #25195. Extending the same submit-level sync to the direct set_tensor_async uploads would close that gap. Happy to help!

0cc4m commented Jul 3, 2026

Contributor Author

Whoever uses the async copy commands needs to use either ggml_backend_synchronize or events to make sure the copies are done by the time it wants to use the data, and also that the read is done before it writes to the buffer again. That is done for the MoE expert upload as well. I don't see the problem you mean.
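As a rough illustration of the second half of that contract (the read must be finished before the source buffer is written again), here is a small toy model. It is a sketch under stated assumptions: `ToyBackend`, `copy_async`, `synchronize` and `upload_two_values` are hypothetical names for this example only and do not describe how ggml or the Vulkan backend implements async copies. The toy backend simply defers every copy until `synchronize()` is called.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical stand-in for a backend with a deferred copy queue.
struct ToyBackend {
    std::vector<std::function<void()>> pending;

    // Records the copy. No data moves until synchronize() runs,
    // so src must stay unchanged until then.
    void copy_async(int* dst, const int* src, std::size_t count) {
        pending.push_back([dst, src, count] {
            std::memcpy(dst, src, count * sizeof(int));
        });
    }

    // Executes all recorded copies in submission order.
    void synchronize() {
        for (auto& op : pending) op();
        pending.clear();
    }
};

// Uploads 1 and then 2 through the same staging variable and returns the
// two destination values packed as device[0] * 10 + device[1].
int upload_two_values(bool sync_between) {
    ToyBackend backend;
    int staging = 1;
    int device[2] = {0, 0};

    backend.copy_async(&device[0], &staging, 1);
    if (sync_between) backend.synchronize();  // wait before reusing staging

    staging = 2;  // the caller writes to the source buffer again
    backend.copy_async(&device[1], &staging, 1);
    backend.synchronize();

    assert(backend.pending.empty());
    return device[0] * 10 + device[1];
}
```

With the intermediate synchronization, `upload_two_values(true)` yields 12, meaning both values arrived intact. Without it, `upload_two_values(false)` yields 22, because the first copy only ran after the staging value had already been overwritten.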

