That's true, I wasn't optimizing it for performance. I actually ran into the compile issue with an ancient compiler, looked into this file, fixed it, and did the optimization as a favor. That said, it could save ~10 bytes of binary code :)
gyurix
left a comment
Perf idea is good, but current SSE2 path looks unsafe.
This code does raw `*(__m128i *)` loads/stores through a `CLzRef *` without proving 16-byte alignment. Depending on compiler codegen, that risks undefined behavior or aligned-move (`movdqa`) faults. If this path stays, it should use explicit unaligned intrinsics (`_mm_loadu_si128`/`_mm_storeu_si128`) or an enforced alignment guarantee.
Merge readiness: 4/10. Before merge: switch to safe unaligned loads/stores (or prove alignment), then add benchmark numbers and one odd-alignment correctness test.
x86[-64] doesn't have scalar integer saturating arithmetic instructions (so the operation is slow if not vectorized). Since all x86-64 CPUs support SSE2, we can use SSE2 as the baseline implementation.
This implementation is taken from clang's optimization output; gcc/msvc can't optimize it this way — see here for a comparison on Godbolt.
It also contains a minor fix to the minimal GCC version required to compile (without globally enabling SSE4.1/AVX2, using the `target` GCC extension instead). I think the old value, GCC 4.7.1, was there because AVX2 support was added in GCC 4.7; but starting from GCC 4.9 it is possible to call x86 intrinsics from select functions tagged with the corresponding `target` attribute, without compiling the entire file with the `-mxxx` option. Technically, GCC 4.7 and 4.8 don't have the `target` feature in the x86 intrinsic headers and don't allow including per-instruction-extension-set headers directly; code like the following in `<immintrin.h>` is only available since GCC 4.9.