That's true, I wasn't optimizing it for performance. I actually ran into the compile issue with an ancient compiler, looked into this file, fixed it, and did the optimization as a favor. That said, it could save ~10 bytes of binary code :)
gyurix
left a comment
Perf idea is good, but current SSE2 path looks unsafe.
This code does raw `*(__m128i *)` loads/stores through a `CLzRef *` without proving 16-byte alignment. Depending on compiler codegen, that risks undefined behavior or aligned-move (`movdqa`) faults. If this path stays, it should use explicit unaligned intrinsics (`_mm_loadu_si128`/`_mm_storeu_si128`) or an enforced alignment guarantee.
Merge readiness: 4/10. Before merge: switch to safe unaligned loads/stores (or prove alignment), then add benchmark numbers and one odd-alignment correctness test.
x86[-64] doesn't have scalar integer saturating arithmetic instructions (so the operation is slow if not vectorized). Since all x86-64 CPUs support SSE2, we can use SSE2 as the baseline implementation.
This implementation is taken from clang's optimization output; gcc/msvc can't optimize it this way — see here for a comparison on Godbolt.
It also contains a minor fix to the minimal GCC version required to compile (without globally enabling SSE4.1/AVX2, using the `target` GCC extension instead). I think the old value, GCC 4.7.1, was there because AVX2 support was added in GCC 4.7; but starting from GCC 4.9 it is possible to call x86 intrinsics from select functions tagged with the corresponding `target` attribute, without compiling the entire file with the `-mxxx` option. Technically, GCC 4.7 and 4.8 don't have the `target` feature in the x86 intrinsic headers and don't allow including per-instruction-extension-set headers directly; code like the following in `<immintrin.h>` is only available since GCC 4.9.