Skip to content

Cleanup atomics and fix deadlock in DP bucket_can_pool() #1127

Closed
bratpiorka wants to merge 0 commit intooneapi-src:mainfrom
bratpiorka:rrudnick_fix_dp_atomic
Closed

Cleanup atomics and fix deadlock in DP bucket_can_pool() #1127
bratpiorka wants to merge 0 commit intooneapi-src:mainfrom
bratpiorka:rrudnick_fix_dp_atomic

Conversation

@bratpiorka
Copy link
Copy Markdown
Contributor

@bratpiorka bratpiorka commented Feb 21, 2025

Cleanup atomics and replace while(true) loop in Disjoint Pool bucket_can_pool() with a pair of atomic add/sub.

fix for #1125 and #1115

This PR is required by #1143

@bratpiorka bratpiorka force-pushed the rrudnick_fix_dp_atomic branch 5 times, most recently from c570a42 to 7bd9fd3 Compare February 24, 2025 09:29
@vinser52
Copy link
Copy Markdown

@bratpiorka we have the following issue in the backlog: #886.
Might it be related to the issue you are trying to fix?

@bratpiorka bratpiorka force-pushed the rrudnick_fix_dp_atomic branch 24 times, most recently from 6861168 to b294cc6 Compare February 25, 2025 13:36
@bratpiorka bratpiorka force-pushed the rrudnick_fix_dp_atomic branch 7 times, most recently from 5a3a327 to 1fb932d Compare February 27, 2025 20:26
@bratpiorka
Copy link
Copy Markdown
Contributor Author

@ldorau @igchor @vinser52 please re-review

@bratpiorka
Copy link
Copy Markdown
Contributor Author

clang ThreadSanitizer still detects data race between:

  1. Atomic write of size 8 at 0x7fb2f98bd6d8 by thread T15 (mutexes: write M0):
    critnib_remove() at /home/runner/work/unified-memory-framework/unified-memory-framework/src/critnib/critnib.c:450
    utils_atomic_store_release_ptr((void **)&n->child[slice_index(key, n->shift)], NULL);

and

  1. Previous read of size 8 at 0x7fb2f98bd6d8 by thread T16:
    find_predecessor at /home/runner/work/unified-memory-framework/unified-memory-framework/src/critnib/critnib.c:526
    if (n->child[nib]) {

https://github.com/ldorau/unified-memory-framework/actions/runs/13562899026/job/37909488408

this should be fixed in the newest version of this PR

#include <atomic>
#define _Atomic(X) std::atomic<X>

using std::memory_order_acq_rel;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need C++ code here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code was indirectly included in c++ tests..
moved to test/base.hpp

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bratpiorka but C++ code is still there and in the test/base.hpp only Copyright is changed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tried moving it to base.hpp, but that broke the building of examples. I also attempted to remove the __cplusplus code entirely, but it's not that simple. So, in fact, I'm not adding new C++ code here; I'm extending the existing code. Further refactoring should be done in a separate PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tried moving it to base.hpp, but that broke the building of examples.
Why? Our examples should not use any code either from umf internals, nor tests

I also attempted to remove the __cplusplus code entirely, but it's not that simple. So, in fact, I'm not adding new C++ code here; I'm extending the existing code. Further refactoring should be done in a separate PR.
You are adding extra using, and stil we do not know why they are needed.

@ldorau
Copy link
Copy Markdown
Contributor

ldorau commented Feb 28, 2025

This PR is required by #1143

@vinser52
Copy link
Copy Markdown

This PR is required by #998 as well.

Comment on lines 656 to 659
for (nib = 0; nib <= NIB; nib++) {
if (n->child[nib]) {
struct critnib_node *m;
utils_atomic_load_acquire_ptr((void **)&n->child[nib], (void **)&m);
if (m) {
break;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a rationale for this change. Was old code wrong / not safe?

Copy link
Copy Markdown
Contributor

@ldorau ldorau Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldorau sanitizer might report false positives. Are we sure it is a real issue?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was needed only by #1143. After adding a lock to #1143 it is not required any more.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which lock? What about scalability?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lock: #1143 (comment)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can anyone explain rationale why this change is needed? I need explanation why from logical point of view, of this code. That it suppress/fixes an sanitizer bug do not explain anything. From my perspective this change is like doing random changes until sanitizer is fine.

Copy link
Copy Markdown
Contributor Author

@bratpiorka bratpiorka Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed
@ldorau please add this change to #1143 as it is needed (or not?) there

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed @ldorau please add this change to #1143 as it is needed (or not?) there

OK

Comment on lines +166 to +158
ASSERT_IS_ALIGNED((uintptr_t)desired, 8);
ASSERT_IS_ALIGNED((uintptr_t)expected, 8);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that this ptrs has to by 8b aligned?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange64

The variables for this function must be aligned on a 64-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are dereferencing this pointers, so it do not matter what alignment, it has here, as anyway it their value will be copied to the register/stack before call.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you 100% sure that the compiler will copy these arguments and not use them directly? By the way, I really don't understand why we're discussing asserts here. In my opinion, there can never be too many asserts in the code.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are dereferencing this pointers, so it do not matter what alignment, it has here, as anyway it their value will be copied to the register/stack before call.

As Rafal mentioned it is a requirement of the corresponding Windows API.

Furthermore, we should consider underlying HW implementation. AFAIK x86 lock instruction does not require the memory location to be aligned but there is a performance penalty if the memory location is split across multiple cache lines: in that case, instead of locking the cache line for exclusive access ht memory bus should be locked. So it is better to have aligned memory.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guys, this function, takes second and third argument as a Value. So this do not matter PTR is aligned or not it do not matter at all, as function will get a copy values instead of the address.

LONG64 InterlockedCompareExchange64(
  [in, out] LONG64 volatile *Destination,
  [in]      LONG64          ExChange,
  [in]      LONG64          Comperand
);```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, sorry, I did not saw that your comment was not about destination. Of coarse, only destination should be aligned.

@bratpiorka bratpiorka marked this pull request as draft February 28, 2025 14:14
@bratpiorka bratpiorka force-pushed the rrudnick_fix_dp_atomic branch from 436aac8 to 7cba692 Compare February 28, 2025 14:16
@bratpiorka bratpiorka force-pushed the rrudnick_fix_dp_atomic branch from 7cba692 to ca4a8f1 Compare February 28, 2025 17:10
ASSERT_IS_ALIGNED((uintptr_t)ptr, 8);
ASSERT_IS_ALIGNED((uintptr_t)out, 8);
__atomic_load(ptr, out, memory_order_acquire);
//utils_annotate_acquire(ptr);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it commented?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@bratpiorka
Copy link
Copy Markdown
Contributor Author

@lplewa please re-review

@bratpiorka
Copy link
Copy Markdown
Contributor Author

continue in #1151

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants