Skip to content

Allow public tuning of cub::DeviceScan::*ByKey#9215

Open
bernhardmgruber wants to merge 12 commits into
NVIDIA:mainfrom
bernhardmgruber:scan_by_key_tune_public
Open

Allow public tuning of cub::DeviceScan::*ByKey#9215
bernhardmgruber wants to merge 12 commits into
NVIDIA:mainfrom
bernhardmgruber:scan_by_key_tune_public

Conversation

@bernhardmgruber
Copy link
Copy Markdown
Contributor

@bernhardmgruber bernhardmgruber commented Jun 2, 2026

Fixes: #8852

New public entities

Entity Members
cub::ScanByKeyPolicy threads_per_block, items_per_thread, load_algorithm, load_modifier, store_algorithm, scan_algorithm, lookback_delay

@bernhardmgruber bernhardmgruber requested review from a team as code owners June 2, 2026 12:15
@bernhardmgruber bernhardmgruber requested a review from gonidelis June 2, 2026 12:15
@bernhardmgruber bernhardmgruber requested a review from jrhemstad June 2, 2026 12:15
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 2, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 2, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Introduced public ScanPolicy, ScanAlgorithm, ScanLookbackPolicy, and ScanWarpspeedPolicy types for improved scan operation tuning and configuration.
    • Added public ScanByKeyPolicy type for scan-by-key operations.
    • Exposed public LookbackDelayPolicy and LookbackDelayAlgorithm types for delay constructor configuration.
    • Enhanced environment-based API support for policy-based tuning in scan and scan-by-key algorithms.

Walkthrough

This PR refactors internal delay-constructor and scan policy types to public APIs. Renames delay_constructor_kindLookbackDelayAlgorithm, delay_constructor_policyLookbackDelayPolicy. Introduces public ScanPolicy, ScanLookbackPolicy, ScanWarpspeedPolicy, ScanAlgorithm. Renames scan_by_key_policyScanByKeyPolicy. Updates all tuning, dispatch, and kernel code to use new types.

Changes

Policy Type Refactoring and API Introduction

Layer / File(s) Summary
Core delay/lookback policy types
cub/cub/detail/delay_constructor.cuh
delay_constructor_kind replaced by LookbackDelayAlgorithm enum; delay_constructor_policy replaced by LookbackDelayPolicy struct with kind: LookbackDelayAlgorithm field; delay_constructor_policy_from_type helper renamed to lookback_delay_policy_from_type with updated specializations.
Public scan policy types and selection
cub/cub/device/dispatch/tuning/tuning_scan.cuh
New public ScanAlgorithm enum (lookback/warpspeed), ScanLookbackPolicy struct, ScanWarpspeedPolicy struct, ScanPolicy aggregate type; updated policy selectors and warpspeed tuning functions to return/accept new types; lookback tuning paths updated to use LookbackDelayPolicy{LookbackDelayAlgorithm::..., ...} initialization.
Public scan-by-key policy
cub/cub/device/dispatch/tuning/tuning_scan_by_key.cuh
Renamed scan_by_key_policyScanByKeyPolicy; renamed delay_constructor field → lookback_delay; updated convert_policy(), policy_selector_from_hub(), policy selectors to return ScanByKeyPolicy and use lookback_delay_policy_from_type helper; added C++20 concepts constraint assertion.

Tuning Policy Updates

Layer / File(s) Summary
Batch memcpy delay policy
cub/cub/device/dispatch/tuning/tuning_batch_memcpy.cuh
Updated small_buffer_policy::buff_delay_constructor and block_delay_constructor field types from delay_constructor_policy to LookbackDelayPolicy.
Radix sort scan integration
cub/cub/device/dispatch/tuning/tuning_radix_sort.cuh
Updated radix_sort_policy::scan member type to ScanPolicy; removed legacy detail::scan type aliases; convert_policy() now constructs ScanPolicy{ScanAlgorithm::lookback, ScanLookbackPolicy{...}, ...} with lookback_delay_policy_from_type.
Reduce by key delay policy
cub/cub/device/dispatch/tuning/tuning_reduce_by_key.cuh
Updated reduce_by_key_policy::delay_constructor field type to LookbackDelayPolicy; all compute-capability-specific tuning branches updated to construct LookbackDelayPolicy{LookbackDelayAlgorithm::..., ...}.
RLE encode tuning
cub/cub/device/dispatch/tuning/tuning_rle_encode.cuh
Updated all SM100/SM90/SM86 policy selection branches to use LookbackDelayAlgorithm enum values and policy construction with delay/latency parameters.
RLE non-trivial runs
cub/cub/device/dispatch/tuning/tuning_rle_non_trivial_runs.cuh
Updated rle_non_trivial_runs_policy::delay_constructor field type to LookbackDelayPolicy; all SM10.x/SM9.x/SM8.x tuning branches updated to use LookbackDelayAlgorithm values.
Select-if tuning
cub/cub/device/dispatch/tuning/tuning_select_if.cuh
Updated select_if_policy::delay_constructor field type to LookbackDelayPolicy; all SM80/SM90/SM100 tuning branches updated to use LookbackDelayPolicy{LookbackDelayAlgorithm::..., ...} construction across all nested (offset_size, input_size) cases.
Three-way partition tuning
cub/cub/device/dispatch/tuning/tuning_three_way_partition.cuh
Updated three_way_partition_policy::delay_constructor field type to LookbackDelayPolicy; all compute-capability tuning branches (≥10.0, ≥9.0, ≥8.0) updated to use LookbackDelayAlgorithm values.
Unique by key tuning
cub/cub/device/dispatch/tuning/tuning_unique_by_key.cuh
Updated unique_by_key_policy::delay_constructor field type to LookbackDelayPolicy; all SM-specific tuning branches and convert_policy() helper updated to use lookback_delay_policy_from_type and LookbackDelayAlgorithm values.

Dispatch Layer Updates

Layer / File(s) Summary
Scan dispatch
cub/cub/device/dispatch/dispatch_scan.cuh
convert_policy() updated to return ScanPolicy with ScanAlgorithm::lookback; dispatch paths updated to use ScanWarpspeedPolicy/ScanLookbackPolicy local variables; algorithm branch condition updated to use ScanAlgorithm::warpspeed.
Scan-by-key dispatch
cub/cub/device/dispatch/dispatch_scan_by_key.cuh
DispatchScanByKey::__invoke updated to accept ScanByKeyPolicy; kernel policy constant type renamed to ScanByKeyPolicy; kernel delay-constructor configuration updated to use policy.lookback_delay.* fields; C++20 concepts constraint added when enabled.
Radix sort dispatch
cub/cub/device/dispatch/dispatch_radix_sort.cuh
PassConfig::__init_pass_config scan_policy parameter type updated from detail::radix_sort::scan_policy to generic ScanPolicy template type.
RLE/select-if/three-way partition dispatch
cub/cub/device/dispatch/dispatch_rle.cuh, dispatch_select_if.cuh, dispatch_three_way_partition.cuh
Updated policy construction to use lookback_delay_policy_from_type instead of delay_constructor_policy_from_type.
DeviceScan API
cub/cub/device/device_scan.cuh
Added tuning documentation section; updated policy selector type resolution in scan_by_key_impl to query against ScanByKeyPolicy instead of detail::scan_by_key::scan_by_key_policy.

Kernel Implementation Updates

Layer / File(s) Summary
Scan kernel
cub/cub/device/dispatch/kernels/kernel_scan.cuh
Updated kernel policies to ScanPolicy type; warpspeed selection logic updated to compare against ScanAlgorithm::warpspeed; lookback path updated to use ScanLookbackPolicy and policy.lookback_delay.* field access.
Scan warpspeed kernel
cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
Renamed scan_warpspeed_policy to ScanWarpspeedPolicy throughout: function signatures, template static members, and cached policy constants.
Radix sort kernel
cub/cub/device/dispatch/kernels/kernel_radix_sort.cuh
Updated scan policy type to ScanPolicy; added assertion that algorithm equals ScanAlgorithm::lookback; updated AgentScanPolicy instantiation to use policy.lookback_delay.* fields.

Tests and Supporting Updates

Layer / File(s) Summary
Scan policy tests
cub/test/catch2_test_device_scan_env.cu
Updated scan_tuning functor to return public cub::ScanPolicy type; added compile-time validation test for ScanPolicy aggregate/semiregular traits, designated initialization, and equality checks (GCC-gated).
Scan env-based API test
cub/test/catch2_test_device_scan_env_api.cu
Added C++20 tuning example with ScanPolicySelector functor using cuda::execution::tune() for ExclusiveSum; included header and test execution.
Scan-by-key policy tests
cub/test/catch2_test_device_scan_by_key_env.cu
Updated scan_by_key_tuning return type to public cub::ScanByKeyPolicy; added compile-time validation test for semiregular/aggregate traits and designated initialization (GCC-gated).
Scan-by-key env-based API test
cub/test/catch2_test_device_scan_by_key_env_api.cu
Added C++20 tuning example with ScanByKeyPolicySelector functor using cuda::execution::tune() for ExclusiveSumByKey; included test execution and output validation.
Benchmark policies, NVBench helper, C API
cub/benchmarks/bench/..., nvbench_helper/look_back_helper.cuh, c/parallel/src/*
Updated benchmark policy selectors to return public ScanPolicy/ScanByKeyPolicy; updated NVBench helper to use public LookbackDelayAlgorithm/LookbackDelayPolicy; updated C API NVRTC kernel sources to import public delay policy types; updated Thrust test to use ScanAlgorithm::warpspeed.

Assessment against linked issues

Objective Addressed Explanation
Productize cub::DeviceScan::*ByKey tuning API [#8852]

Possibly related PRs

  • NVIDIA/cccl#9169: Modifies scan warpspeed policy plumbing and staging refactors that overlap with this PR's warpspeed policy type refactoring.

Suggested labels

backport branch/3.4.x

Suggested reviewers

  • fbusato
  • miscco
  • gevtushenko

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
cub/cub/device/dispatch/tuning/tuning_scan.cuh (1)

130-141: 💤 Low value

suggestion: Missing [[nodiscard]] on operator== and operator!=.

Other policy structs in this PR (ScanLookbackPolicy, ScanPolicy, ScanByKeyPolicy, LookbackDelayPolicy) have [[nodiscard]] on their comparison operators. Add it here for consistency.

-  _CCCL_HOST_DEVICE_API constexpr friend bool operator==(const ScanWarpspeedPolicy& lhs, const ScanWarpspeedPolicy& rhs)
+  [[nodiscard]] _CCCL_HOST_DEVICE_API constexpr friend bool operator==(const ScanWarpspeedPolicy& lhs, const ScanWarpspeedPolicy& rhs)
   {
     return lhs.num_reduce_and_scan_warps == rhs.num_reduce_and_scan_warps
         && lhs.look_ahead_items_per_thread == rhs.look_ahead_items_per_thread
         && lhs.items_per_thread == rhs.items_per_thread && lhs.lookahead_stages == rhs.lookahead_stages
         && lhs.block_idx_stages == rhs.block_idx_stages;
   }

-  _CCCL_HOST_DEVICE_API constexpr friend bool operator!=(const ScanWarpspeedPolicy& lhs, const ScanWarpspeedPolicy& rhs)
+  [[nodiscard]] _CCCL_HOST_DEVICE_API constexpr friend bool operator!=(const ScanWarpspeedPolicy& lhs, const ScanWarpspeedPolicy& rhs)
   {
     return !(lhs == rhs);
   }
cub/test/catch2_test_device_scan_by_key_env.cu (1)

134-143: 💤 Low value

suggestion: Consider adding [[nodiscard]] and noexcept to the operator(). As per coding guidelines, most functions with non-void return type should use [[nodiscard]], and functions that don't throw should use noexcept. While this is test code, consistency helps maintain best practices.

cub/cub/device/dispatch/kernels/kernel_radix_sort.cuh (1)

154-163: ⚡ Quick win

suggestion: The type alias ScanPolicy shadows the public cub::ScanPolicy type used at line 151, creating confusion within this function scope. For consistency with the pattern in kernel_scan.cuh:223, rename the alias to ScanPolicyT or AgentScanPolicyT to avoid shadowing.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fea37719-5a6b-4be1-918c-a245d8e327e7

📥 Commits

Reviewing files that changed from the base of the PR and between 9a8f7f6 and b969773.

📒 Files selected for processing (33)
  • c/parallel/src/radix_sort.cu
  • c/parallel/src/scan.cu
  • c/parallel/src/unique_by_key.cu
  • cub/benchmarks/bench/copy/memcpy.cu
  • cub/benchmarks/bench/scan/exclusive/by_key.cu
  • cub/benchmarks/bench/scan/policy_selector.h
  • cub/cub/detail/delay_constructor.cuh
  • cub/cub/device/device_scan.cuh
  • cub/cub/device/dispatch/dispatch_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_rle.cuh
  • cub/cub/device/dispatch/dispatch_scan.cuh
  • cub/cub/device/dispatch/dispatch_scan_by_key.cuh
  • cub/cub/device/dispatch/dispatch_select_if.cuh
  • cub/cub/device/dispatch/dispatch_three_way_partition.cuh
  • cub/cub/device/dispatch/kernels/kernel_radix_sort.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
  • cub/cub/device/dispatch/tuning/tuning_batch_memcpy.cuh
  • cub/cub/device/dispatch/tuning/tuning_radix_sort.cuh
  • cub/cub/device/dispatch/tuning/tuning_reduce_by_key.cuh
  • cub/cub/device/dispatch/tuning/tuning_rle_encode.cuh
  • cub/cub/device/dispatch/tuning/tuning_rle_non_trivial_runs.cuh
  • cub/cub/device/dispatch/tuning/tuning_scan.cuh
  • cub/cub/device/dispatch/tuning/tuning_scan_by_key.cuh
  • cub/cub/device/dispatch/tuning/tuning_select_if.cuh
  • cub/cub/device/dispatch/tuning/tuning_three_way_partition.cuh
  • cub/cub/device/dispatch/tuning/tuning_unique_by_key.cuh
  • cub/test/catch2_test_device_scan_by_key_env.cu
  • cub/test/catch2_test_device_scan_by_key_env_api.cu
  • cub/test/catch2_test_device_scan_env.cu
  • cub/test/catch2_test_device_scan_env_api.cu
  • nvbench_helper/nvbench_helper/look_back_helper.cuh
  • thrust/testing/scan.cu

.load_modifier = cub::LOAD_DEFAULT,
.store_algorithm = cub::BLOCK_STORE_WARP_TRANSPOSE,
.scan_algorithm = cub::BLOCK_SCAN_WARP_SCANS,
.lookback_delay = cub::LookbackDelayPolicy{LookbackDelayAlgorithm::fixed_delay, 832, 1165}},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

critical: Missing namespace qualification. LookbackDelayAlgorithm::fixed_delay should be cub::LookbackDelayAlgorithm::fixed_delay to match the fully-qualified style used elsewhere in this initializer (lines 428, 433-436).

Fix namespace qualification
-        .lookback_delay    = cub::LookbackDelayPolicy{LookbackDelayAlgorithm::fixed_delay, 832, 1165}},
+        .lookback_delay    = cub::LookbackDelayPolicy{cub::LookbackDelayAlgorithm::fixed_delay, 832, 1165}},
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
.lookback_delay = cub::LookbackDelayPolicy{LookbackDelayAlgorithm::fixed_delay, 832, 1165}},
.lookback_delay = cub::LookbackDelayPolicy{cub::LookbackDelayAlgorithm::fixed_delay, 832, 1165}},

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

😬 CI Workflow Results

🟥 Finished in 2h 17m: Pass: 36%/351 | Total: 7d 18h | Max: 2h 16m | Hits: 35%/509618

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Productize the cub::DeviceScan::*ByKey tuning API

1 participant