
[Backport 3.4] Backport PSTL fixes #9256

Merged
wmaxey merged 4 commits into NVIDIA:branch/3.4.x from davebayer:backport_pstl_fixes on Jun 4, 2026

davebayer and others added 4 commits June 4, 2026 13:45
* [libcu++] Use stream's context in PSTL

* Address review comments

* Actually use the right name

* Morning coffee

* fixes

* fix

---------

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
(cherry picked from commit 2f7cb8b)

coderabbitai Bot commented Jun 4, 2026


📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed CUDA execution context handling in parallel algorithms to ensure that the correct device context is consistently established during kernel operations.
    • Improved CUDA stream management and selection reliability across multiple PSTL implementations.
  • Tests

    • Updated CUDA execution policy tests to reflect improved stream handling behavior.
  • Chores

    • Added compiler compatibility flag for NVCC 12.0 with GCC host compilers.
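
The stream-handling fix above swaps one default stream handle for another. As a rough illustration, and not the library's code, the sketch below models the two handles without the CUDA headers. It assumes that `stream_t` mirrors how `cudaStream_t` is declared (a pointer to an opaque struct) and that `per_thread_stream()` mirrors the special handle value behind `cudaStreamPerThread`; both names, and `is_null_stream`, are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>

struct CUstream_st;            // opaque type, only ever used through a pointer
using stream_t = CUstream_st*; // hypothetical mirror of cudaStream_t

// Hypothetical mirror of cudaStreamPerThread: a reserved, non-null handle value.
inline stream_t per_thread_stream()
{
  return reinterpret_cast<stream_t>(std::uintptr_t{0x2});
}

// A value-initialized handle is the null pointer, i.e. the null stream.
inline bool is_null_stream(stream_t s)
{
  return s == nullptr;
}

inline void demo()
{
  assert(is_null_stream(stream_t{}));            // new default: null stream
  assert(!is_null_stream(per_thread_stream()));  // old default: distinct handle
}
```

The point of the sketch is only that the two defaults are different handle values, which is why the tests that compare against the baseline stream had to change.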

Walkthrough

This PR introduces the __pstl_ensure_current_ctx_for utility, which enforces the correct CUDA execution context in PSTL operations. It updates 20+ CUDA algorithm backends to acquire the stream and establish the context early, eliminates duplicate stream re-fetching, replaces the cudaStreamPerThread default with cudaStream_t{}, queries the current device instead of hardcoding device 0, and refactors exception handling.
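
The RAII pattern described in the walkthrough can be sketched in plain C++. This is a simplified model under stated assumptions, not the code from ensure_current_context.h: the `mock` namespace stands in for the CUDA runtime's current-device state, and `ensure_current_device`, `policy`, `device_for`, and `run_under` are hypothetical names invented for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the runtime's "current device"; real code would
// query and switch the device or context through the CUDA APIs instead.
namespace mock {
inline int& current_device()
{
  thread_local int device = 0;
  return device;
}
} // namespace mock

// RAII guard: makes `target` current on construction, restores on destruction.
class ensure_current_device
{
public:
  explicit ensure_current_device(int target)
      : previous_(mock::current_device())
  {
    mock::current_device() = target;
  }
  ~ensure_current_device()
  {
    mock::current_device() = previous_;
  }
  ensure_current_device(const ensure_current_device&)            = delete;
  ensure_current_device& operator=(const ensure_current_device&) = delete;

private:
  int previous_;
};

// Hypothetical policy: may or may not carry a stream bound to some device.
struct policy
{
  bool has_stream;
  int stream_device;
};

// Selection rule from the walkthrough: prefer the stream's device,
// otherwise fall back to whatever device is current right now.
inline int device_for(const policy& p)
{
  return p.has_stream ? p.stream_device : mock::current_device();
}

// Returns the device that "work" launched under this policy would observe.
inline int run_under(const policy& p)
{
  ensure_current_device guard{device_for(p)};
  return mock::current_device();
}
```

In this model, `run_under(policy{true, 2})` observes device 2 while the guard is alive, and the previously current device is restored once the guard goes out of scope.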

Changes

CUDA Context and Stream Management

| Layer | File(s) | Summary |
|---|---|---|
| New ensure_current_context utility and header | libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h | Introduces __pstl_ensure_current_ctx_for template that returns an RAII __ensure_current_context object; selects stream-derived context if policy provides one via get_stream, otherwise queries current device via cudaGetDevice and wraps in device reference. |
| Temporary storage device and memory pool management | libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h | Queries current device dynamically instead of using hardcoded device 0 for default memory pool; removes noexcept from __get_memory_resource_or; changes default stream from cudaStreamPerThread to cudaStream_t{}. |
| PSTL algorithm implementations: stream and context setup | libcudacxx/include/cuda/std/__pstl/cuda/{adjacent_difference,copy_if,copy_n,exclusive_scan,find_if,for_each_n,generate_n,inclusive_scan,max_element,merge,min_element,partition,partition_copy,reduce,remove_if,rotate,rotate_copy,shift_left,shift_right,transform,transform_reduce,unique,stable_partition}.h | Systematically includes ensure_current_context.h, moves stream acquisition early in __par_impl, calls __pstl_ensure_current_ctx_for(__policy) to establish context, and eliminates redundant later stream re-acquisition; all backends now follow consistent early-initialization pattern. |
| Exception handling refactoring | libcudacxx/include/cuda/std/__pstl/cuda/{sort,copy_if}.h | sort.h replaces C++ try/catch with _CCCL_TRY/_CCCL_CATCH macros, mapping cudaErrorMemoryAllocation to std::bad_alloc and rethrowing other errors; copy_if.h uses _CCCL_RETHROW instead of raw throw for non-allocation CUDA errors. |
| Test and build config updates | libcudacxx/test/libcudacxx/cuda/execution/execution_policy/{get_stream,get_memory_resource}.pass.cpp, libcudacxx/test/utils/libcudacxx/test/config.py | Execution policy tests change baseline stream from cudaStreamPerThread to cudaStream_t{}; build config adds nvcc 12.0 + gcc warning suppression via -Xcompiler -Wno-attributes. |
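
The error mapping summarized for sort.h and copy_if.h (allocation failures become std::bad_alloc, everything else is rethrown) can be illustrated with ordinary try/catch. This sketch does not use the _CCCL_TRY/_CCCL_CATCH macros or any CUDA types: `mock_error`, `mock_cuda_error`, `translate_errors`, and `classify` are hypothetical names assumed only for this example.

```cpp
#include <cassert>
#include <new>
#include <stdexcept>

// Hypothetical stand-ins for cudaError_t values and a CUDA exception type.
enum class mock_error
{
  success,
  memory_allocation,
  invalid_value
};

struct mock_cuda_error : std::runtime_error
{
  mock_error code;
  explicit mock_cuda_error(mock_error c)
      : std::runtime_error("mock CUDA error")
      , code(c)
  {}
};

// Runs f(); an allocation failure is converted to std::bad_alloc,
// any other mock CUDA error is rethrown unchanged.
template <class F>
void translate_errors(F&& f)
{
  try
  {
    f();
  }
  catch (const mock_cuda_error& e)
  {
    if (e.code == mock_error::memory_allocation)
    {
      throw std::bad_alloc{};
    }
    throw; // rethrow the original exception object
  }
}

// Reports what escapes: 0 = nothing, 1 = std::bad_alloc, 2 = mock_cuda_error.
inline int classify(mock_error injected)
{
  try
  {
    translate_errors([&] {
      if (injected != mock_error::success)
      {
        throw mock_cuda_error{injected};
      }
    });
    return 0;
  }
  catch (const std::bad_alloc&)
  {
    return 1;
  }
  catch (const mock_cuda_error&)
  {
    return 2;
  }
}
```

Under these assumptions, callers that already handle std::bad_alloc from host-side algorithms see the same exception type when device memory runs out, while other failures keep their original type.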

Possibly related PRs

  • NVIDIA/cccl#9220: Exception handling macro standardization in PSTL backends.
  • NVIDIA/cccl#9214: Stream selection changes from cudaStreamPerThread to cudaStream_t{} across PSTL CUDA backends.
  • NVIDIA/cccl#9219: Use stream's context in PSTL algorithm implementations with matching ensure_current_context integration.

Suggested labels

backport branch/3.4.x

Suggested reviewers

  • fbusato
  • pciolkosz

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp:17:10: fatal error: 'cuda/functional' file not found
17 | #include <cuda/functional>
| ^~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/ab8cc8da130a2faab6ee636c996f62d9621bc16b-592ecba06aa173d3/tmp/clang_command_.tmp.3b1ee5.txt
++Contents of '/tmp/coderabbit-infer/ab8cc8da130a2faab6ee636c996f62d9621bc16b-592ecba06aa173d3/tmp/clang_command_.tmp.3b1ee5.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"

... [truncated 1214 characters] ...

l/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/592ecba06aa173d3/file.o" "-x" "c++"
"libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp"
"-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp:17:10: fatal error: 'cuda/functional' file not found
17 | #include <cuda/functional>
| ^~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/ab8cc8da130a2faab6ee636c996f62d9621bc16b-dcd2ecffd9fe1481/tmp/clang_command_.tmp.f26b08.txt
++Contents of '/tmp/coderabbit-infer/ab8cc8da130a2faab6ee636c996f62d9621bc16b-dcd2ecffd9fe1481/tmp/clang_command_.tmp.f26b08.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-

... [truncated 1187 characters] ...

/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/dcd2ecffd9fe1481/file.o" "-x" "c++"
"libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp"
"-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"




coderabbitai Bot left a comment


🧹 Nitpick comments (1)
libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h (1)

39-41: ⚡ Quick win

suggestion: Fully qualify get_stream_t and get_stream from the global namespace.

Line 39 and Line 41 rely on unqualified lookup inside cuda::std::execution; switch to ::cuda::get_stream_t and ::cuda::get_stream to match project rules and avoid accidental shadowing.
As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g., ::cuda::ceil_div."
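
The shadowing risk behind this suggestion can be reproduced in a few lines of standard C++. The namespaces `lib` and `lib::exec` and the functions in them are hypothetical and only stand in for an outer namespace and a nested one; they are not the names used in the header under review.

```cpp
#include <cassert>

namespace lib {
// The free function the author intends to call.
inline int get_stream()
{
  return 1;
}

namespace exec {
// An unrelated function with the same name in the nested namespace.
inline int get_stream()
{
  return 2;
}

// Unqualified lookup stops at the innermost scope that declares the name,
// so this resolves to lib::exec::get_stream.
inline int unqualified()
{
  return get_stream();
}

// Qualification from the global namespace always selects lib::get_stream.
inline int qualified()
{
  return ::lib::get_stream();
}
} // namespace exec
} // namespace lib
```

Here `lib::exec::unqualified()` returns 2 and `lib::exec::qualified()` returns 1, which is the kind of silent divergence the fully qualified form rules out.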


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 94b27f12-f257-407e-a313-1fd766494a86

📥 Commits

Reviewing files that changed from the base of the PR and between 576e227 and ab8cc8d.

📒 Files selected for processing (29)
  • libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h
  • libcudacxx/include/cuda/std/__pstl/cuda/copy_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/copy_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h
  • libcudacxx/include/cuda/std/__pstl/cuda/exclusive_scan.h
  • libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/for_each_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/generate_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/inclusive_scan.h
  • libcudacxx/include/cuda/std/__pstl/cuda/max_element.h
  • libcudacxx/include/cuda/std/__pstl/cuda/merge.h
  • libcudacxx/include/cuda/std/__pstl/cuda/min_element.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition_copy.h
  • libcudacxx/include/cuda/std/__pstl/cuda/reduce.h
  • libcudacxx/include/cuda/std/__pstl/cuda/remove_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/rotate.h
  • libcudacxx/include/cuda/std/__pstl/cuda/rotate_copy.h
  • libcudacxx/include/cuda/std/__pstl/cuda/shift_left.h
  • libcudacxx/include/cuda/std/__pstl/cuda/shift_right.h
  • libcudacxx/include/cuda/std/__pstl/cuda/sort.h
  • libcudacxx/include/cuda/std/__pstl/cuda/stable_partition.h
  • libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h
  • libcudacxx/include/cuda/std/__pstl/cuda/transform.h
  • libcudacxx/include/cuda/std/__pstl/cuda/transform_reduce.h
  • libcudacxx/include/cuda/std/__pstl/cuda/unique.h
  • libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp
  • libcudacxx/test/utils/libcudacxx/test/config.py


github-actions Bot commented Jun 4, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 28m: Pass: 100%/113 | Total: 2d 02h | Max: 1h 04m | Hits: 75%/439700

See results here.

wmaxey merged commit 70434dd into NVIDIA:branch/3.4.x on Jun 4, 2026
133 of 136 checks passed