Skip to content

[Optimization]: Reduce branching when possible in casting.hpp#117

Open
zacharyvincze wants to merge 15 commits intoROCm:developfrom
zacharyvincze:zv/optimization/optimize-casting-performance
Open

[Optimization]: Reduce branching when possible in casting.hpp#117
zacharyvincze wants to merge 15 commits intoROCm:developfrom
zacharyvincze:zv/optimization/optimize-casting-performance

Conversation

@zacharyvincze
Copy link
Copy Markdown
Contributor

Details

  • Removes branching where possible to the casting helper functions seen in casting.hpp. Aims to reduce divergence on GPU kernel implementations.
  • Includes fixes to some float -> integer saturation casts, especially for 32/64-bit integer cases that are not represented exactly as 32-bit floats.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the core casting helpers to reduce branching (especially for GPU code paths) and adjusts saturation behavior for some float→integer conversions, alongside adding a small test and extending supported type traits.

Changes:

  • Refactors ScalarSaturateCast / ScalarRangeCast logic in casting.hpp to use more branchless/min-max based clamping and special-case small integer widths.
  • Extends type traits support to include long/ulong vectorized types.
  • Adds a new C++ test covering basic SaturateCast behavior and a few limit/vector cases.
  • Adjusts the GPU block dimensions for the Composite operator kernel launch.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
include/core/detail/casting.hpp Refactors saturate/range cast implementations to reduce branching and adjust clamping/rounding logic.
include/core/detail/type_traits.hpp Adds long / ulong to the type-traits macro set.
tests/roccv/cpp/src/tests/core/detail/test_saturate_cast.cpp Introduces a basic unit test for SaturateCast, including a couple of vectorized casts.
src/op_composite.cpp Changes GPU kernel launch block dimensions for the composite operator.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +41 to +54
if constexpr (sizeof(T) <= 2) {
// 8/16 bit integer cases. These can be represented exactly in floating point.
#ifdef __HIP_DEVICE_COMPILE__
return static_cast<T>(rintf(fminf(fmaxf(v, minVal), maxVal)));
#else
return static_cast<T>(std::round(std::clamp(v, minVal, maxVal)));
#endif
} else {
// 32/64 bit integer cases.
#ifdef __HIP_DEVICE_COMPILE__
U rounded = rintf(v);
#else
U rounded = std::round(v);
#endif
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The device-side float->integer paths use float-specific intrinsics (rintf, fminf/fmaxf, __saturatef, __float2int_rn) even though U is any floating-point type. If U is double, this will downcast to float and can change rounding/saturation behavior. Consider either constraining these branches to U == float (static_assert / if constexpr) or adding double-correct implementations (rint, fmin/fmax, __double2int_rn, etc.).

Copilot uses AI. Check for mistakes.
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 54.54545% with 15 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
include/core/detail/casting.hpp 54.55% 12 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #117      +/-   ##
===========================================
+ Coverage    73.51%   73.59%   +0.08%     
===========================================
  Files           77       77              
  Lines         2956     2969      +13     
  Branches       640      635       -5     
===========================================
+ Hits          2173     2185      +12     
+ Misses         338      337       -1     
- Partials       445      447       +2     
Files with missing lines Coverage Δ
include/core/detail/type_traits.hpp 87.50% <ø> (ø)
include/core/detail/casting.hpp 78.26% <54.55%> (+2.31%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@simonCatBot
Copy link
Copy Markdown

Review: [Optimization] Reduce branching in casting.hpp

Kernel optimization focusing on GPU divergence:

Changes:

  • Branch reduction in casting helper functions
  • Fixes for float->integer saturation casts (32/64-bit cases)
  • 4 files changed, +129/-33 lines

Assessment: Needs Review - Performance optimization.

Reducing branching in GPU kernels is always good for warp efficiency. The fixes for 32/64-bit integer saturation casts sound important - precision issues in type conversion can be subtle bugs.

Would benefit from:

  1. Performance benchmarks showing divergence reduction
  2. Verification that precision is maintained for edge cases
  3. Review of the saturation logic changes

Solid optimization PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci:precheckin enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants