[Backport 3.4] Backport PSTL fixes by davebayer · Pull Request #9256 · NVIDIA/cccl

davebayer · 2026-06-04T11:49:08Z

Batch of PSTL backports to 3.4.x:

…NVIDIA#9216) (cherry picked from commit 2a82ae1)

…lgorithms (NVIDIA#9214) (cherry picked from commit 4f5bc7c)

(cherry picked from commit cfe7e26)

* [libcu++] Use stream's context in PSTL
* Address review comments
* Actually use the right name
* Morning coffee
* fixes
* fix

---------

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

(cherry picked from commit 2f7cb8b)

coderabbitai · 2026-06-04T11:59:42Z

📝 Walkthrough

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed CUDA execution context handling in parallel algorithms so that the correct device context is consistently established during kernel operations.
- Improved CUDA stream management and selection reliability across multiple PSTL implementations.
Tests
- Updated CUDA execution policy tests to reflect improved stream handling behavior.
Chores
- Added compiler compatibility flag for NVCC 12.0 with GCC host compilers.

Walkthrough

This PR introduces the `__pstl_ensure_current_ctx_for` utility to enforce the correct CUDA execution context in PSTL operations. It systematically updates 20+ CUDA algorithm backends to acquire the stream and establish the context early, eliminates duplicate stream re-fetching, replaces `cudaStreamPerThread` defaults with `cudaStream_t{}`, queries the current device instead of hardcoding device 0, and refactors exception handling.
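
The context-selection pattern described in the walkthrough can be illustrated with a small host-only sketch. Everything in it is hypothetical and exists only for illustration: `mock_rt`, `mock_stream`, `mock_policy`, `ensure_current_device`, `select_device` and `run_algorithm` are stand-ins, not the libcu++ types or the CUDA runtime API. It also assumes that the guard restores the previously current device on destruction, which the summary does not state.

```cpp
#include <cassert>

// Hypothetical stand-in for the runtime's notion of a "current device".
namespace mock_rt {
inline int& current_device() {
  static int device = 0;
  return device;
}
} // namespace mock_rt

// Hypothetical stream: it only remembers the device it belongs to.
struct mock_stream { int device; };

// Hypothetical execution policy: it may or may not carry a stream.
struct mock_policy { const mock_stream* stream; };

// RAII guard: makes `target` current and (by assumption) restores the
// previously current device when it goes out of scope.
class ensure_current_device {
public:
  explicit ensure_current_device(int target)
      : previous_(mock_rt::current_device()) {
    mock_rt::current_device() = target;
  }
  ~ensure_current_device() { mock_rt::current_device() = previous_; }
  ensure_current_device(const ensure_current_device&) = delete;
  ensure_current_device& operator=(const ensure_current_device&) = delete;

private:
  int previous_;
};

// Prefer the stream's device when the policy provides a stream; otherwise
// fall back to whatever device is current right now (not a hardcoded 0).
inline int select_device(const mock_policy& policy) {
  return policy.stream != nullptr ? policy.stream->device
                                  : mock_rt::current_device();
}

// Stand-in for an algorithm backend: the guard is created first, then the
// "launch" reports which device was current while it ran.
inline int run_algorithm(const mock_policy& policy) {
  ensure_current_device guard(select_device(policy));
  return mock_rt::current_device();
}

// Usage sketch, assuming device 0 is current on entry.
inline void demo_context_guard() {
  mock_stream s{2};
  assert(run_algorithm(mock_policy{&s}) == 2);      // stream's device wins
  assert(mock_rt::current_device() == 0);           // restored afterwards
  assert(run_algorithm(mock_policy{nullptr}) == 0); // falls back to current
}
```

Creating the guard at the top of the backend function is what makes later steps, such as temporary storage allocation and the launch itself, see the same device.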

Changes

CUDA Context and Stream Management

| Layer / File(s) | Summary |
|---|---|
| New ensure_current_context utility and header `libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h` | Introduces `__pstl_ensure_current_ctx_for` template that returns an RAII `__ensure_current_context` object; selects stream-derived context if policy provides one via `get_stream`, otherwise queries current device via `cudaGetDevice` and wraps in device reference. |
| Temporary storage device and memory pool management `libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h` | Queries current device dynamically instead of using hardcoded device 0 for default memory pool; removes `noexcept` from `__get_memory_resource_or`; changes default stream from `cudaStreamPerThread` to `cudaStream_t{}`. |
| PSTL algorithm implementations: stream and context setup `libcudacxx/include/cuda/std/__pstl/cuda/{adjacent_difference,copy_if,copy_n,exclusive_scan,find_if,for_each_n,generate_n,inclusive_scan,max_element,merge,min_element,partition,partition_copy,reduce,remove_if,rotate,rotate_copy,shift_left,shift_right,transform,transform_reduce,unique,stable_partition}.h` | Systematically includes `ensure_current_context.h`, moves stream acquisition early in `__par_impl`, calls `__pstl_ensure_current_ctx_for(__policy)` to establish context, and eliminates redundant later stream re-acquisition; all backends now follow consistent early-initialization pattern. |
| Exception handling refactoring `libcudacxx/include/cuda/std/__pstl/cuda/{sort,copy_if}.h` | sort.h replaces C++ try/catch with `_CCCL_TRY`/`_CCCL_CATCH` macros, mapping `cudaErrorMemoryAllocation` to `std::bad_alloc` and rethrowing other errors; copy_if.h uses `_CCCL_RETHROW` instead of raw throw for non-allocation CUDA errors. |
| Test and build config updates `libcudacxx/test/libcudacxx/cuda/execution/execution_policy/{get_stream,get_memory_resource}.pass.cpp`, `libcudacxx/test/utils/libcudacxx/test/config.py` | Execution policy tests change baseline stream from `cudaStreamPerThread` to `cudaStream_t{}`; build config adds nvcc 12.0 + gcc warning suppression via `-Xcompiler -Wno-attributes`. |
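
The error mapping summarized for `sort.h` and `copy_if.h` (allocation failures become `std::bad_alloc`, everything else is rethrown unchanged) might look roughly like the sketch below. It uses plain `try`/`catch` rather than the `_CCCL_TRY`/`_CCCL_CATCH` macros, whose definitions are not shown in this PR, and all names in it (`mock_status`, `mock_cuda_error`, `translate_errors`, `classify`) are hypothetical stand-ins rather than CCCL types.

```cpp
#include <cassert>
#include <new>
#include <stdexcept>

// Hypothetical stand-in for a CUDA status code.
enum class mock_status { memory_allocation, invalid_value };

// Hypothetical stand-in for the exception type that carries a CUDA status.
struct mock_cuda_error : std::runtime_error {
  explicit mock_cuda_error(mock_status s)
      : std::runtime_error("mock CUDA error"), status(s) {}
  mock_status status;
};

// Runs `fn`; an allocation failure is translated to std::bad_alloc, and any
// other mock CUDA error propagates unchanged via a bare `throw;`.
template <class Fn>
void translate_errors(Fn&& fn) {
  try {
    fn();
  } catch (const mock_cuda_error& e) {
    if (e.status == mock_status::memory_allocation) {
      throw std::bad_alloc();
    }
    throw;
  }
}

// Helper for the usage sketch: reports which exception, if any, escaped.
enum class outcome { ok, bad_alloc, other_cuda_error };

template <class Fn>
outcome classify(Fn&& fn) {
  try {
    translate_errors(fn);
    return outcome::ok;
  } catch (const std::bad_alloc&) {
    return outcome::bad_alloc;
  } catch (const mock_cuda_error&) {
    return outcome::other_cuda_error;
  }
}

// Usage sketch covering the three paths.
inline void demo_error_translation() {
  const outcome none = classify([] {});
  const outcome alloc = classify(
      [] { throw mock_cuda_error(mock_status::memory_allocation); });
  const outcome other = classify(
      [] { throw mock_cuda_error(mock_status::invalid_value); });
  assert(none == outcome::ok);
  assert(alloc == outcome::bad_alloc);
  assert(other == outcome::other_cuda_error);
}
```

Translating only the allocation case lets callers handle out-of-memory the same way they would for a host algorithm, while other failures keep their original type.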

Possibly related PRs

NVIDIA/cccl#9220: Exception handling macro standardization in PSTL backends.
NVIDIA/cccl#9214: Stream selection changes from cudaStreamPerThread to cudaStream_t{} across PSTL CUDA backends.
NVIDIA/cccl#9219: Use stream's context in PSTL algorithm implementations with matching ensure_current_context integration.

Suggested labels

backport branch/3.4.x

Suggested reviewers

fbusato
pciolkosz

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp:17:10: fatal error: 'cuda/functional' file not found
17 | #include <cuda/functional>
| ^~~~~~~~~~~~~~~~~
1 error generated.

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp:17:10: fatal error: 'cuda/functional' file not found
17 | #include <cuda/functional>
| ^~~~~~~~~~~~~~~~~
1 error generated.

coderabbitai

🧹 Nitpick comments (1)

libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h (1)

39-41: ⚡ Quick win

suggestion: Fully qualify get_stream_t and get_stream from the global namespace.

Line 39 and Line 41 rely on unqualified lookup inside cuda::std::execution; switch to ::cuda::get_stream_t and ::cuda::get_stream to match project rules and avoid accidental shadowing.
As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g., ::cuda::ceil_div."
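
The shadowing risk behind this suggestion can be shown with a generic example. The namespaces and functions below (`lib`, `lib::detail`, `get_value`) are hypothetical and unrelated to the actual `get_stream` declarations; the sketch only demonstrates how unqualified lookup picks the nearest enclosing declaration, while a name qualified from the global namespace does not.

```cpp
#include <cassert>

namespace lib {
// The function a caller intends to reach.
inline int get_value() { return 1; }

namespace detail {
// A same-named function in an inner namespace hides lib::get_value for
// unqualified lookup performed inside lib::detail.
inline int get_value() { return 2; }

// Unqualified call: lookup stops at lib::detail::get_value.
inline int unqualified() { return get_value(); }

// Fully qualified call: always resolves to lib::get_value.
inline int qualified() { return ::lib::get_value(); }
} // namespace detail
} // namespace lib

// Usage sketch.
inline void demo_lookup() {
  assert(lib::detail::unqualified() == 2);
  assert(lib::detail::qualified() == 1);
}
```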

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 576e227 and ab8cc8d.

📒 Files selected for processing (29)

libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h
libcudacxx/include/cuda/std/__pstl/cuda/copy_if.h
libcudacxx/include/cuda/std/__pstl/cuda/copy_n.h
libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h
libcudacxx/include/cuda/std/__pstl/cuda/exclusive_scan.h
libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
libcudacxx/include/cuda/std/__pstl/cuda/for_each_n.h
libcudacxx/include/cuda/std/__pstl/cuda/generate_n.h
libcudacxx/include/cuda/std/__pstl/cuda/inclusive_scan.h
libcudacxx/include/cuda/std/__pstl/cuda/max_element.h
libcudacxx/include/cuda/std/__pstl/cuda/merge.h
libcudacxx/include/cuda/std/__pstl/cuda/min_element.h
libcudacxx/include/cuda/std/__pstl/cuda/partition.h
libcudacxx/include/cuda/std/__pstl/cuda/partition_copy.h
libcudacxx/include/cuda/std/__pstl/cuda/reduce.h
libcudacxx/include/cuda/std/__pstl/cuda/remove_if.h
libcudacxx/include/cuda/std/__pstl/cuda/rotate.h
libcudacxx/include/cuda/std/__pstl/cuda/rotate_copy.h
libcudacxx/include/cuda/std/__pstl/cuda/shift_left.h
libcudacxx/include/cuda/std/__pstl/cuda/shift_right.h
libcudacxx/include/cuda/std/__pstl/cuda/sort.h
libcudacxx/include/cuda/std/__pstl/cuda/stable_partition.h
libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h
libcudacxx/include/cuda/std/__pstl/cuda/transform.h
libcudacxx/include/cuda/std/__pstl/cuda/transform_reduce.h
libcudacxx/include/cuda/std/__pstl/cuda/unique.h
libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp
libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp
libcudacxx/test/utils/libcudacxx/test/config.py

github-actions · 2026-06-04T13:19:40Z

🥳 CI Workflow Results

🟩 Finished in 1h 28m: Pass: 100%/113 | Total: 2d 02h | Max: 1h 04m | Hits: 75%/439700

davebayer and others added 4 commits June 4, 2026 13:45

[libcu++] Suppress -Wattributes in lit tests with nvcc 12.0 and gcc (NVIDIA#9216) (cherry picked from commit 2a82ae1)

a22c7e5

[libcu++] Replace cudaStreamPerThread with cudaStream{} in PSTL algorithms (NVIDIA#9214) (cherry picked from commit 4f5bc7c)

4797b39

[libcu++] Fix use of exception keywords (NVIDIA#9220) (cherry picked from commit cfe7e26)

b5b5659

davebayer requested a review from a team as a code owner June 4, 2026 11:49

davebayer requested a review from fbusato June 4, 2026 11:49

github-project-automation Bot added this to CCCL Jun 4, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026

This was referenced Jun 4, 2026

[Backport branch/3.4.x] [libcu++] Fix use of exception keywords #9228

Closed

[Backport branch/3.4.x] [libcu++] Replace cudaStreamPerThread with cudaStream{} in PSTL #9217

Closed

coderabbitai Bot reviewed Jun 4, 2026

miscco approved these changes Jun 4, 2026

wmaxey merged commit 70434dd into NVIDIA:branch/3.4.x Jun 4, 2026
133 of 136 checks passed

wmaxey merged 4 commits into NVIDIA:branch/3.4.x from davebayer:backport_pstl_fixes
