[libcu++] Use stream's context in PSTL#9219
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughSummary by CodeRabbit
important: WalkthroughAcquire policy stream and call __pstl_ensure_current_ctx_for(__policy) at __par_impl entry across CUDA PSTL dispatchers; remove later stream re-declarations so kernels, temporary storage and async memcpys use the policy's device/stream. ChangesPSTL stream device context enforcement across CUDA algorithms
Assessment against linked issues
Possibly related PRs
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
libcudacxx/include/cuda/std/__pstl/cuda/sort.h (1)
150-151:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant:
__merge_sort_implwas not given the context guard, so the merge-sort path still launches on the process-current device instead of the stream's device — exactly the bug this PR fixes.__radix_sort_implgot__ctx(lines 90-91) but this path did not. Also__streamshould beconsthere (it isn't mutated;.get()/.sync()are const).- const auto __count = ::cuda::std::distance(__first, __last); - auto __stream = ::cuda::__call_or(::cuda::get_stream, ::cuda::stream_ref{::cudaStream_t{}}, __policy); + const auto __stream = ::cuda::__call_or(::cuda::get_stream, ::cuda::stream_ref{::cudaStream_t{}}, __policy); + const auto __ctx = ::cuda::std::execution::__pstl_ensure_current_ctx_for(__policy); + + const auto __count = ::cuda::std::distance(__first, __last);
🧹 Nitpick comments (1)
libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h (1)
78-78: 💤 Low valuesuggestion: Error string and
static_assert(Line 126) referencemerge, but this isadjacent_difference. Misleading during debugging; rename to the correct algorithm.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cea57cf8-1a55-4f7c-92e3-7a3fc64b7b47
📒 Files selected for processing (26)
libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.hlibcudacxx/include/cuda/std/__pstl/cuda/common.hlibcudacxx/include/cuda/std/__pstl/cuda/copy_if.hlibcudacxx/include/cuda/std/__pstl/cuda/copy_n.hlibcudacxx/include/cuda/std/__pstl/cuda/exclusive_scan.hlibcudacxx/include/cuda/std/__pstl/cuda/find_if.hlibcudacxx/include/cuda/std/__pstl/cuda/for_each_n.hlibcudacxx/include/cuda/std/__pstl/cuda/generate_n.hlibcudacxx/include/cuda/std/__pstl/cuda/inclusive_scan.hlibcudacxx/include/cuda/std/__pstl/cuda/max_element.hlibcudacxx/include/cuda/std/__pstl/cuda/merge.hlibcudacxx/include/cuda/std/__pstl/cuda/min_element.hlibcudacxx/include/cuda/std/__pstl/cuda/partition.hlibcudacxx/include/cuda/std/__pstl/cuda/partition_copy.hlibcudacxx/include/cuda/std/__pstl/cuda/reduce.hlibcudacxx/include/cuda/std/__pstl/cuda/remove_if.hlibcudacxx/include/cuda/std/__pstl/cuda/rotate.hlibcudacxx/include/cuda/std/__pstl/cuda/rotate_copy.hlibcudacxx/include/cuda/std/__pstl/cuda/shift_left.hlibcudacxx/include/cuda/std/__pstl/cuda/shift_right.hlibcudacxx/include/cuda/std/__pstl/cuda/sort.hlibcudacxx/include/cuda/std/__pstl/cuda/stable_partition.hlibcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.hlibcudacxx/include/cuda/std/__pstl/cuda/transform.hlibcudacxx/include/cuda/std/__pstl/cuda/transform_reduce.hlibcudacxx/include/cuda/std/__pstl/cuda/unique.h
| _BinaryOp __binary_op) | ||
| { | ||
| auto __count = ::cuda::std::distance(__first, __last); | ||
| const auto __stream = ::cuda::__call_or(::cuda::get_stream, ::cuda::stream_ref{cudaStream_t{}}, __policy); |
There was a problem hiding this comment.
Concern: Using the null stream as a default seems like a deviation from our policy of not using the null stream.
There was a problem hiding this comment.
I don't think there is a better candidate to be honest, I think its either no default or null as default
There was a problem hiding this comment.
I agree, however, I believe we had a discussion in the past about this and decided to go with null stream. I don't remember the exact reason. @miscco do you remember?
There was a problem hiding this comment.
We had a discussion about this, but the problem is that I need to follow what CUB does, otherwise if no stream is provided CUB will internally use a different stream
| else | ||
| { | ||
| // If no stream was specified, we use the device 0. | ||
| return __ensure_current_context{device_ref{0}}; |
There was a problem hiding this comment.
I think if no stream is specified and we are using the NULL stream, we should use whatever is the current device
There was a problem hiding this comment.
I thought so, but inside the temp memory allocation helper, we always allocate memory on device 0. I don't know what's correct then.
If we always used the current device, it would be easier to implement
There was a problem hiding this comment.
I am 100% with @pciolkosz here that we should aim to use the current device.
There was a problem hiding this comment.
This is a bug I think. We should also create the memory pool on the current device if no memory resource or stream was provided by the user. @davebayer please update this in this PR as well.
| { // no stream no memory resource, assume device 0 | ||
| { | ||
| // If no stream was specified, we use the device 0. | ||
| return ::cuda::device_default_memory_pool(0); |
There was a problem hiding this comment.
Ahh, I see its pre-existing, I think this might also want to use the current device
There was a problem hiding this comment.
Actually, I'm not sure about that. @miscco what do you think?
There was a problem hiding this comment.
Use the current device here. I think using device 0 is a bug here.
There was a problem hiding this comment.
Yeah we should use the current device
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
libcudacxx/include/cuda/std/__pstl/cuda/sort.h (1)
147-163:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: __merge_sort_impl still launches
DeviceMergeSort::SortKeyswithout__pstl_ensure_current_ctx_for(__policy), so the merge-sort branch can still run under the wrong device context when policy stream device differs from current device. Add the same context guard at function entry in this branch (next to stream acquisition) before the CUB launch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 87b5b756-2a28-4090-935a-e599872e30f2
📒 Files selected for processing (25)
libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.hlibcudacxx/include/cuda/std/__pstl/cuda/copy_if.hlibcudacxx/include/cuda/std/__pstl/cuda/copy_n.hlibcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.hlibcudacxx/include/cuda/std/__pstl/cuda/exclusive_scan.hlibcudacxx/include/cuda/std/__pstl/cuda/find_if.hlibcudacxx/include/cuda/std/__pstl/cuda/for_each_n.hlibcudacxx/include/cuda/std/__pstl/cuda/generate_n.hlibcudacxx/include/cuda/std/__pstl/cuda/inclusive_scan.hlibcudacxx/include/cuda/std/__pstl/cuda/max_element.hlibcudacxx/include/cuda/std/__pstl/cuda/merge.hlibcudacxx/include/cuda/std/__pstl/cuda/min_element.hlibcudacxx/include/cuda/std/__pstl/cuda/partition.hlibcudacxx/include/cuda/std/__pstl/cuda/partition_copy.hlibcudacxx/include/cuda/std/__pstl/cuda/reduce.hlibcudacxx/include/cuda/std/__pstl/cuda/remove_if.hlibcudacxx/include/cuda/std/__pstl/cuda/rotate.hlibcudacxx/include/cuda/std/__pstl/cuda/rotate_copy.hlibcudacxx/include/cuda/std/__pstl/cuda/shift_left.hlibcudacxx/include/cuda/std/__pstl/cuda/shift_right.hlibcudacxx/include/cuda/std/__pstl/cuda/sort.hlibcudacxx/include/cuda/std/__pstl/cuda/stable_partition.hlibcudacxx/include/cuda/std/__pstl/cuda/transform.hlibcudacxx/include/cuda/std/__pstl/cuda/transform_reduce.hlibcudacxx/include/cuda/std/__pstl/cuda/unique.h
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 140074fb-cd22-4205-8c0d-a8c100b88313
📒 Files selected for processing (2)
libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.hlibcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h
|
Can you still run a benchmark for one of the algorithms and measure whether there is a performance impact? Otherwise the PR LGTM. |
🥳 CI Workflow Results🟩 Finished in 1h 17m: Pass: 100%/115 | Total: 1d 17h | Max: 57m 04s | Hits: 75%/446956See results here. |
|
Tested on ['baseline.json', 'modified.json'] base[0] NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Summary
|
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin branch/3.4.x
git worktree add -d .worktree/backport-9219-to-branch/3.4.x origin/branch/3.4.x
cd .worktree/backport-9219-to-branch/3.4.x
git switch --create backport-9219-to-branch/3.4.x
git cherry-pick -x 2f7cb8b5a363da93ebef37e93ff9c4eab70edce4 |
Fixes #9212.