Fix producer-consumer locking mechanism in Example 17 using atomic_add and memory fences#267
Fix producer-consumer locking mechanism in Example 17 using atomic_add and memory fences#267
Conversation
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
neoblizz
left a comment
There was a problem hiding this comment.
[Iris] [7/8] Validating...
[Iris] [6/8] Max absolute difference: 346.0
[Iris] [3/8] Max absolute difference: 346.0
[Iris] [2/8] Max absolute difference: 346.0
[Iris] [5/8] Max absolute difference: 346.0
[Iris] [4/8] Max absolute difference: 346.0
[Iris] [0/8] Max absolute difference: 346.0
[Iris] [7/8] Max absolute difference: 346.0
[Iris] [1/8] Max absolute difference: 346.0
[Iris] [5/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [0/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [6/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [1/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [2/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [3/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [7/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [5/8] Final C validation failed.
[Iris] [4/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [0/8] Final C validation failed.
[Iris] [2/8] Final C validation failed.
[Iris] [1/8] Final C validation failed.
[Iris] [6/8] Final C validation failed.
[Iris] [3/8] Final C validation failed.
[Iris] [7/8] Final C validation failed.
[Iris] [4/8] Final C validation failed.
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
|
Fixed in commit cf0601a. The issue was that I incorrectly had the producer skip incrementing on itself ( |
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a producer-consumer locking mechanism bug in Example 17's one-shot all-reduce GEMM implementation. The issue was with the synchronization pattern where the atomic operations weren't properly coordinating between ranks.
Key changes:
- Replaced atomic exchange with atomic addition for producer signaling
- Updated consumer to wait for correct count from all ranks including self
- Fixed synchronization logic to match expected behavior described in the issue
neoblizz
left a comment
There was a problem hiding this comment.
@copilot Same problem, it seems like some tiles are not even being produced. Is there something wrong with the distribution code? Or maybe something else entirely?
[Iris] [3/8] Validating...
[Iris] [1/8] Max absolute difference: 346.0
[Iris] [6/8] Max absolute difference: 346.0
[Iris] [4/8] Max absolute difference: 346.0
[Iris] [2/8] Max absolute difference: 346.0
[Iris] [0/8] Max absolute difference: 346.0
[Iris] [7/8] Max absolute difference: 346.0
[Iris] [5/8] Max absolute difference: 346.0
[Iris] [3/8] Max absolute difference: 346.0
[Iris] [1/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [5/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [4/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [3/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [2/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [1/8] Final C validation failed.
[Iris] [0/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [7/8] Mismatch at index (0, 0): C=0.0, expected=192.0
[Iris] [6/8] Mismatch at index (0, 0): C=0.0, expected=192.0
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
I've simplified the consumer wait loop in commit 1bca3e7 to avoid resetting |
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Fixed in commit f8d32e1. The issue was missing memory fences - |
Fixed the producer-consumer locking mechanism in Example 17 by addressing both synchronization and memory ordering issues.
Problem
The original implementation had two critical bugs:
atomic_xchgwhich overwrote signals from other ranks instead of counting themtl.debug_barrier()which only synchronizes threads but doesn't guarantee memory visibility across GPUsThis caused validation failures with
C=0.0values where data was expected, as consumers would read stale or uninitialized data fromlocal_Ceven after synchronization appeared successful.Solution
Producer Changes:
atomic_xchgwithatomic_addto properly count signals from each ranktl.atomic_fence(sem="release", scope="sys")after writinglocal_Cto ensure writes are visible before signalingtl.debug_barrier()which provided insufficient memory ordering guaranteesConsumer Changes:
tile_ready >= (world_size - 1)usingatomic_caswith acquire semanticstl.atomic_fence(sem="acquire", scope="gpu")after checking local lock to ensure local producer's writes are visibleatomic_casas atomic read operation (compare=0, value=0) to avoid unnecessary resetsImplementation Details
The fix establishes proper happens-before relationships between producer and consumer:
The atomic fences with release/acquire semantics ensure that:
local_Cwrites are visible to consumers before they proceedTesting
Fixes #266
Original prompt
Fixes #266
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.