feat: add runtime batch_bool mask overloads for load_masked/store_masked #1332
DiamonDinoia wants to merge 2 commits into xtensor-stack:master from
Conversation
force-pushed from 7484c4b to d5f21c7
Could you split the head / tail part into another PR? This one is already quite dense...
// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};
to make it worse, building a mask is not always a single operation depending on the target...
Addressed in [latest push] — switched the common-arch fallback to use mask.get(i) directly instead of materialising mask.mask() once and shifting per lane. The .mask() call is now gone from the hot path on architectures that fall back to common, so the per-target cost of building the bit mask no longer matters here.
// (AVX2 32/64-bit, AVX-512, SVE, RVV) override this with a single
// intrinsic that suppresses inactive-lane reads in hardware.
constexpr std::size_t size = batch<T, A>::size;
alignas(A::alignment()) std::array<T, size> buffer {};
this array assignment forces everything to zero, while some stores are not needed, and the compiler is not able to optimize this away in the generic case
Addressed — dropped the value-init {} on the buffer; every lane is now written unconditionally in one pass: buffer[i] = mask.get(i) ? mem[i] : T(0);. No double-write on inactive lanes anymore.
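For reference, a minimal sketch of what the reworked common-arch fallback could look like under those assumptions (names taken from the snippet quoted above; not the verbatim patch):

```cpp
// Sketch only: common-arch fallback for a runtime-masked load, assuming the
// names shown in the diff context above (batch, batch_bool, convert, requires_arch).
template <class A, class T, class Mode>
XSIMD_INLINE batch<T, A> load_masked(T const* mem, batch_bool<T, A> const& mask,
                                     convert<T>, Mode, requires_arch<common>) noexcept
{
    constexpr std::size_t size = batch<T, A>::size;
    alignas(A::alignment()) std::array<T, size> buffer; // no value-init: every lane is written below
    for (std::size_t i = 0; i < size; ++i)
        buffer[i] = mask.get(i) ? mem[i] : T(0); // one pass, no mask.mask() packing
    return batch<T, A>::load_aligned(buffer.data());
}
```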
XSIMD_INLINE std::enable_if_t<std::is_integral<T>::value && (sizeof(T) == 4 || sizeof(T) == 8), batch<T, A>>
load_masked(T const* mem, batch_bool<T, A> mask, convert<T>, Mode, requires_arch<avx2>) noexcept
{
    using int_t = std::conditional_t<sizeof(T) == 4, int32_t, long long>;
why long long and not int64_t? There's no guarantee that sizeof(long long) == 8
long long is what the Intel intrinsic _mm256_maskload_epi64 takes (long long const*). Using int64_t* would force a reinterpret_cast at every call site. Added static_assert(sizeof(long long) == 8, ...) next to the helpers so the assumption is pinned at compile time — the dispatcher uses the pointer-width to pick the right intrinsic.
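A rough sketch of how that assumption could be pinned next to the helpers (the helper shape is assumed from the discussion above; the real patch may differ):

```cpp
namespace detail
{
    // Assumed sketch: the overload set picks the intrinsic from the pointer type,
    // and the static_asserts pin the width assumptions at compile time.
    static_assert(sizeof(int) == 4, "_mm256_maskload_epi32 takes int const*");
    static_assert(sizeof(long long) == 8, "_mm256_maskload_epi64 takes long long const*");

    XSIMD_INLINE __m256i maskload(int const* mem, __m256i mask) noexcept
    {
        return _mm256_maskload_epi32(mem, mask);
    }
    XSIMD_INLINE __m256i maskload(long long const* mem, __m256i mask) noexcept
    {
        return _mm256_maskload_epi64(mem, mask);
    }
}
```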
}
else
{
    _mm256_maskstore_epi64(reinterpret_cast<long long*>(mem), __m256i(mask), __m256i(src));
ok, I guess that's a constraint of the Intel intrinsic, at least static_assert that sizeof(long long) == 8 and sizeof(int) == 4 if you're using this to distinguish between the two?
Done — added static_assert(sizeof(int) == 4, ...) and static_assert(sizeof(long long) == 8, ...) next to the detail::maskload/maskstore helpers. Refactored both the constant-mask and runtime-mask AVX2 paths to share these helpers so the size assumption lives in one place. The runtime store now goes through detail::maskstore(reinterpret_cast<int_t*>(mem), ...) like the load already did, removing the open-coded XSIMD_IF_CONSTEXPR(sizeof(T) == 4) branch.
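Concretely, the runtime-mask store body then reduces to something like the following (a sketch mirroring the load overload quoted earlier; the exact signature and argument order are assumed, not verbatim):

```cpp
// Sketch: runtime-mask AVX2 store routed through the shared detail helper,
// instead of open-coding the sizeof(T) == 4 branch at the call site.
template <class A, class T, class Mode>
XSIMD_INLINE std::enable_if_t<std::is_integral<T>::value && (sizeof(T) == 4 || sizeof(T) == 8), void>
store_masked(T* mem, batch<T, A> const& src, batch_bool<T, A> const& mask, Mode, requires_arch<avx2>) noexcept
{
    using int_t = std::conditional_t<sizeof(T) == 4, int32_t, long long>;
    detail::maskstore(reinterpret_cast<int_t*>(mem), __m256i(mask), __m256i(src));
}
```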
// constructs a 128-bit chunk predicate (svdupq_b{8,16,32,64}), which
// is replication-based and does not correctly express a per-lane
// mask on SVE wider than 128 bits — going through ``as_batch_bool``
// gives the right predicate for every vector width. ``int32``/
Do you know if the pmask approach would be faster? If so, we could still if constexpr its usage when the SVE size allows it.
Good question — pmask (svdupq_b{8,16,32,64}) is replication-based: it builds a 128-bit chunk and replicates it across the SVE register, so it only expresses a per-lane mask correctly when the SVE vector length is exactly 128 bits. For 256/512/1024/2048-bit SVE it would silently produce the wrong predicate. The current path through as_batch_bool is correct for every VL and lowers to a single cmpne against the integer-domain mask. We could in principle gate pmask on __ARM_FEATURE_SVE_BITS == 128, but on my qemu build that case lowers to the same ptrue + cmpne pair, so it would just be conditional code without a measured win. Happy to add the gate if you have a benchmark that shows a delta on a 128-bit-VL system.
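For illustration, the VL-agnostic way to get the per-lane predicate is a compare against zero in the integer domain (an assumed sketch of the idea, not xsimd's actual internals):

```cpp
#include <arm_sve.h>

// Assumed illustration: lower a per-lane integer mask (0 / -1 in each 32-bit lane)
// to an SVE predicate. ptrue + cmpne works at any vector length, whereas
// svdupq_b32 only replicates a fixed 128-bit pattern across the register.
static inline svbool_t lanes_to_predicate(svint32_t mask_reg)
{
    return svcmpne_n_s32(svptrue_b32(), mask_reg, 0);
}
```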
* so partial loads across a page boundary are safe. \c stream_mode is not
* supported.
*
* \warning Runtime-mask loads carry a significant performance penalty on
I don't think we should go into details here:
- it's difficult to maintain this kind of documentation (what about newly added architectures)
- it's already the case for other operations, and we don't specify it there.
I think it's important to communicate that info, but until we have an automated way to do so, better not just throw documentation at it.
Agreed — collapsed the four runtime-mask doxygen blocks down to one short paragraph each (one for load, one for store), with \overload on the unaligned variant. Removed the per-architecture rundown of which targets do/do not have native maskload, since that information rots quickly as the project picks up new arches.
static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
              "supported load mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();
I'm unsure we want that extra call to mask(), which may be costly, plus the extra tests... if masking is supported, is it beneficial? If it's not, we're already slow...
Agreed — dropped the bits == 0 / bits == full_mask early-out in both batch::load(ptr, batch_bool, mode) and batch::store(ptr, batch_bool, mode). The runtime-mask member now just forwards straight to kernel::load_masked / kernel::store_masked. Targets with native predicated instructions (AVX2/AVX-512/SVE/RVV) absorb the all-zero / all-one mask via the hardware predicate, and on the common scalar fallback the per-lane loop handles those cases for free. The extra mask.mask() call and the two compares are gone.
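A rough sketch of what the load member now boils down to (the kernel-side argument order follows the diff hunks above but is assumed, not checked against the final patch):

```cpp
// Sketch: runtime-mask load member forwards straight to the kernel layer,
// with no mask.mask() call and no all-zero / all-full compares.
template <class T, class A>
template <class Mode>
XSIMD_INLINE batch<T, A> batch<T, A>::load(T const* mem, batch_bool<T, A> const& mask, Mode) noexcept
{
    return kernel::load_masked<A>(mem, mask, kernel::convert<T> {}, Mode {}, A {});
}
```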
static_assert(std::is_same<Mode, aligned_mode>::value || std::is_same<Mode, unaligned_mode>::value,
              "supported store mode");
constexpr uint64_t full_mask = details::full_mask(size);
const auto bits = mask.mask();
Same fix — early-out and static_assert are gone from batch::store(T*, batch_bool, Mode); it now forwards directly to kernel::store_masked.
force-pushed from d5f21c7 to 665925b
…1332 review

- common: drop zero-init buffer + mask.mask() pack; use mask.get(i) directly
- batch::load/store(batch_bool, Mode): drop bits==0/full early-out, forward to kernel (native arches absorb the all-zero/all-one mask in hardware)
- avx2: pin sizeof(int)==4 / sizeof(long long)==8 next to detail::maskload helpers; runtime store routes through detail::maskstore symmetrically
- avx2_128: introduce detail::maskload_128 / maskstore_128; constant- and runtime-mask paths share them; fix stale convert<double>/avx_128 dispatch tags on the int64/uint64 constant-mask overloads
- xsimd_api.hpp + data_transfer.rst: shorten runtime-mask docs (drop the per-architecture rundown that would rot as new arches land)
- test_load_store: collapse run_*_mask_pattern / run_*_runtime_mask_pattern into one helper parameterized on a MaskKind policy; drop first_N/last_N patterns (covered by load_head/load_tail in the follow-up branch)

Codegen on AVX2/AVX2-128 verified: each runtime-mask load/store reduces to a single vpmaskmov{d,q}; the early-out removal eliminates an extra vmovmskps + test + cmp + branch tail.
Adds runtime batch_bool mask overloads of xsimd::load_masked and xsimd::store_masked across AVX, AVX2, AVX-512, SSE, SVE, RVV, and NEON; the generic common-path fallback is collapsed to a whole-vector select. SVE compile-time masked load/store is forwarded through the runtime path so the per-lane predicate is correct on SVE wider than 128 bits. Adds arch-specific runtime-mask overloads of load_masked / store_masked for the avx_128 and avx2_128 arches so they inherit the hardware predicated load/store path on x86.

Squashed from:
- b57a766 feat: add runtime batch_bool mask overloads for load_masked/store_masked
- d5f21c7 feat: add runtime batch_bool mask overloads for avx_128 / avx2_128
force-pushed from 7d7bbc3 to d240cef
…lpers

Shorten verbose comments around masked load/store paths, drop the sizeof(int)/sizeof(long long) static_asserts (intrinsic boundaries now reinterpret_cast at the call site), and collapse the four maskload_128/maskstore_128 detail overloads into two XSIMD_IF_CONSTEXPR-dispatched templates. Public surface unchanged.
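A sketch of what the collapsed pair could look like (shape, naming, and cast placement assumed; not the verbatim patch):

```cpp
namespace detail
{
    // Assumed sketch: one load and one store template, dispatched on element width
    // with XSIMD_IF_CONSTEXPR, replacing the four per-type overloads.
    template <class T>
    XSIMD_INLINE __m128i maskload_128(T const* mem, __m128i mask) noexcept
    {
        XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
        {
            return _mm_maskload_epi32(reinterpret_cast<int const*>(mem), mask);
        }
        else
        {
            return _mm_maskload_epi64(reinterpret_cast<long long const*>(mem), mask);
        }
    }

    template <class T>
    XSIMD_INLINE void maskstore_128(T* mem, __m128i mask, __m128i src) noexcept
    {
        XSIMD_IF_CONSTEXPR(sizeof(T) == 4)
        {
            _mm_maskstore_epi32(reinterpret_cast<int*>(mem), mask, src);
        }
        else
        {
            _mm_maskstore_epi64(reinterpret_cast<long long*>(mem), mask, src);
        }
    }
}
```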
Add runtime-mask overloads of xsimd::load_masked and xsimd::store_masked across AVX2, AVX-512, SSE, SVE, RVV, and NEON. The generic common-path fallback is collapsed to a whole-vector select, and the unaligned page-cross fast path is dropped since the underlying intrinsics suppress faults on masked-off lanes regardless of alignment.