Skip to content

(Performance) Optimized x86 and generic q1_0(_g128) dot#10

Open
pl752 wants to merge 2 commits intoPrismML-Eng:masterfrom
pl752:perf/q1_0_g128_no_nofma
Open

(Performance) Optimized x86 and generic q1_0(_g128) dot#10
pl752 wants to merge 2 commits intoPrismML-Eng:masterfrom
pl752:perf/q1_0_g128_no_nofma

Conversation

@pl752
Copy link
Copy Markdown

@pl752 pl752 commented Apr 3, 2026

Hello
This is yet another PR about the fix of the truncation and optimization of the cpu inference.

In this case I have:

  • Replaced a ton of bit-masking operations, removed redundant float multiplication and unrolled the hot inner loop with constant masks for accumulation with signs in arch-agnostic fallback
  • Introduced paths for filling the gap between default fallback and AVX-512 capable CPUs
  • Performed tests to make sure that optimizations don't have effect on precision/correctness
  • Performed various experinments (most yielded worse performance) including:
  • brancless variant of unroll
  • various register and superscalar pipeline pressure options (AVX 2 uses doubled accumulation flow)
  • AVX-512 VNNI
  • explicitly precomputed masks for SIMD

Note that this PR is built on top of the #3 by @jordankzf, who implemented AVX-512 workflow

Benchmarks were performed with:

  • CPU: AMD Ryzen 5 7640HS (at 65w)
  • WSL vm
  • LPDDR5 @ 6400MT JEDEC
  • Model: Bonsai-1.7B.gguf (Q1_0_g128)
  • Threads: 6
Flow pp 512 t/s tg 128 t/s Speedup Notes
Initial* 1.59 0.85 1.0x / 1.0x Slow
Scalar 9.57 7.06 6.0x / 8.3x Explicit byte-oriented unroll
SSSE3 26.13 19.51 16.5x / 22.9x 128-bit specialization
AVX 34.99 27.31 22.1x / 32.1x Mixed-width specialization
AVX2 + FMA 80.02 51.46 50.4x / 60.5x 256-bit specialization
AVX512BW 97.16 60.88 61.3x / 71.5x Leverages new SIMD extensions**
  • * extrapolated from pp 32 / tg 16: 1.659 t/s pp and 0.862 t/s tg, as I was impatient.
  • ** new SIMD instruction kinds improve performance even on AMD Zen4 implementation of AVX-512, which uses 256 bit pipeline twice instead of implementing full 512 bit one

I would appreciate your feedback

@github-actions github-actions bot added the ggml label Apr 3, 2026
@khosravipasha
Copy link
Copy Markdown
Collaborator

khosravipasha commented Apr 4, 2026

Thanks this looks great, nice write up.
I guess could you also run corretness checks, similar to what we did here: #8
see KL divergence between packed vs unpacked model (should be close to 0).

I am not too familiar with SIMD/AVX stuff, what CPUs does this support:
I know there is some different between AVX512BW, AVX2, SSSE3, AVX, is this for different CPU archetictures?

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 4, 2026

@khosravipasha You are welcome :)
These are various generations of x86 simd instructions (mostly backward compatible, aka AVX can SSE, etc.):

  • SSE family is 128 bit simd introduced with Pentium III (fp32 only); then SSE2 followed up with (fp64 and int), then SSE3 with some additional utility instructions;
  • SSSE3 is most interesting in 128 bit case, as it provides instructions for shuffling/expansion of bit mask, sign assign ops and int dot product used in our and other implementations there (this set is available since Core 2 generation and AMD Bobcat, so it covers essentially all realistic x86 targets (except for some shenanigans);
  • AVX is 256 bit SIMD for fp32/fp64 introduced with Sandy bridge (eg core i7-2xxx) or AMD Bulldozer (eg fx-8100), lacks int ops though, still can be used for accum paired with SSSE3 instructions;
  • AVX2 adds most of the missing 256 bit instructions, plus it is paired with FMA(3) instruction introduction (except for VIA C4650), providing fused multiply-add; introduced in Haswell or AMD Zen, in our case used with most modern-ish processors completing the gap bridging between legacy and SoTA cpus;
  • AVX512 extends AVX2 ops to 512 bit SIMD, plus it has numerous extensions including AVX512BW with support for int8 and int16 ops (used in Q8_0 sign expansion to int16)
  • Scalar fallback is mostly for non-x86 cpus and comformity, there I got rid of bit mask calculations, int to float conversion with multiplication and unrolled inner loop with constant masks to minimize computations and increase pipeline pressure

As for perplexity, I have performed run for single 64 token wikitext-2-test chunk with 1.7B model

Flow PPL ln(PPL(Q)/PPL(base)) KL Divergence Δp RMS Same top p
Scalar 17.1988 ± 9.6330 -0.00000 ± -nan -0.00000 ± 0.00000 0.001 ± 0.000 % 100.000 ± 0.000 %
SSSE3 17.2739 ± 9.7094 0.00435 ± 0.00450 0.00024 ± 0.00004 0.218 ± 0.038 % 100.000 ± 0.000 %
AVX 17.2402 ± 9.6760 0.00240 ± 0.00352 0.00025 ± 0.00005 0.362 ± 0.062 % 90.323 ± 5.398 %
AVX2 17.2321 ± 9.6740 0.00193 ± 0.00398 0.00023 ± 0.00004 0.379 ± 0.067 % 96.774 ± 3.226 %
AVX-512 17.2463 ± 9.6895 0.00275 ± 0.00298 0.00023 ± 0.00005 0.279 ± 0.070 % 93.548 ± 4.485 %

I will perform more runs

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 4, 2026

I have run 5 chunks of 512 tokens, looks better, I think, will run 100 chunks:

Flow Mean PPL(Q) Mean ln(PPL(Q)/PPL(base)) Mean KLD RMS Δp Same top p
Scalar baseline 20.943033 +/- 2.071658 0.000000 0.000000 0.000 % 100.000 %
SSSE3 21.076136 +/- 2.102022 0.006335 +/- 0.004656 0.000267 +/- 0.000009 0.386 +/- 0.017 % 99.059 +/- 0.271 %
AVX 21.081167 +/- 2.102227 0.006574 +/- 0.004686 0.000285 +/- 0.000011 0.404 +/- 0.019 % 99.451 +/- 0.207 %
AVX2 21.087163 +/- 2.103328 0.006858 +/- 0.004650 0.000282 +/- 0.000012 0.418 +/- 0.027 % 99.529 +/- 0.192 %
AVX-512BW 21.095567 +/- 2.103673 0.007257 +/- 0.004635 0.000279 +/- 0.000010 0.399 +/- 0.019 % 99.294 +/- 0.235 %

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 4, 2026

I am somewhat in doubt now, it seems something around the effect of comparing cpu to cuda, or something inbetween fp32->fp16 and fp32->q8_0, maybe it is from using smaller model
Note: I am comparing between my implementations, I think I need to use fp16 as a baseline first

@khosravipasha
Copy link
Copy Markdown
Collaborator

khosravipasha commented Apr 4, 2026

@pl752 Awesome thanks for the explnations.
And for the KL's look pretty good, being close to 0 is good. The rest is numerical noise probably since also llama.cpp side they convert logits to fp16 to save time (I only run few chunks myself too), llama.cpp tool was designed to see how good their quantizations are, for us the weights are equivalent packed and unpacked so having KL close to zero for few chunks is good enough (this is mostly to test the kernels and not the quantization itself)

https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity

Yeah I used running the model in fp16 as the baeslines using these https://huggingface.co/collections/prism-ml/bonsai-auxiliary

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 4, 2026

Okay, don't forget to thank the user from which I've hijacked AVX-512 implementation

@khosravipasha
Copy link
Copy Markdown
Collaborator

khosravipasha commented Apr 4, 2026

@pl752 good idea, which one was it? We can tag them here,
Right now only sending PR to llama.cpp with generic cpu to finalize the naming, formatting, etc.

After that's merged, then can all send a PR together with everyone that contributed tagged in main llama.cpp maybe.

Note that there will be some naming changes (in summary Q1_0_g128 is renamed to Q1_0, and original Q1_0 will be deleted). Should not affect running the current models.

ggml-org#21273

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 4, 2026

Note that this PR is built on top of the #3 by @jordankzf, who implemented AVX-512 workflow

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 5, 2026

Performed additional 5x512 run against unpacked gguf

Flow Mean PPL(Q) Mean ln(PPL(Q)/PPL(base)) Mean KLD RMS Δp Same top p
Scalar 21.082185 +/- 2.102340 0.005412 +/- 0.004643 0.000213 +/- 0.000008 0.334 +/- 0.017 % 99.451 +/- 0.207 %
SSSE3 21.076136 +/- 2.102022 0.005125 +/- 0.004661 0.000220 +/- 0.000008 0.341 +/- 0.016 % 99.137 +/- 0.259 %
AVX 21.081167 +/- 2.102227 0.005364 +/- 0.004690 0.000235 +/- 0.000010 0.362 +/- 0.017 % 99.373 +/- 0.221 %
AVX2 21.087163 +/- 2.103328 0.005649 +/- 0.004643 0.000216 +/- 0.000009 0.377 +/- 0.023 % 99.608 +/- 0.175 %
AVX-512BW 21.095567 +/- 2.103673 0.006047 +/- 0.004636 0.000222 +/- 0.000009 0.365 +/- 0.020 % 99.059 +/- 0.271 %

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 5, 2026

UPD: I have reviewed how I was interleaving instructions when testing various register pressure options and found issues resulting in register spilling, so I just relied on the compiler doing its job properly and simply unrolled inner loop with individual accumulators for SSSE3 (as the compiler already did pretty well for other flows); I have also tried the same thing for AVX-512, but it did result in tiny performance regression. It had almost no effect on perplexity.

Effects on performance, (baseline has drifted due to using -t10 instead of -t6):

flow run baseline updated delta
SSSE3 pp512 33.38 t/s 39.18 t/s +17.36%
SSSE3 tg128 24.61 t/s 29.24 t/s +18.81%

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR focuses on improving CPU inference throughput by optimizing the q1_0 / q1_0_g128 dot-product kernels against q8_0, reducing bit-twiddling overhead in portable fallbacks and introducing additional optimized x86 SIMD execution paths.

Changes:

  • Reworked generic fallbacks to process packed sign bits in a byte-oriented way (4 × 8-value groups per 32-element sub-block), eliminating per-element bit index arithmetic.
  • Implemented x86-specialized kernels for ggml_vec_dot_q1_0_q8_0 and ggml_vec_dot_q1_0_g128_q8_0 with multiple SIMD paths (SSSE3 / AVX / AVX2 / AVX-512BW) plus scalar byte-oriented fallback.
  • Added small SSSE3 helpers to expand packed sign bits into byte masks and to reduce vector accumulators.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
ggml/src/ggml-cpu/quants.c Optimizes portable q1_0 and q1_0_g128 generic dot fallbacks by switching to explicit byte-oriented sign decoding and removing per-element bit math.
ggml/src/ggml-cpu/arch/x86/quants.c Replaces x86 dispatch to generic kernels with specialized SIMD implementations across AVX-512BW/AVX2/AVX/SSSE3, keeping a byte-oriented scalar fallback.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@zcattacz
Copy link
Copy Markdown

zcattacz commented Apr 6, 2026

I tested the AVX2 impl, slightly faster then #7 (see the est full test time) but slower than xor+sub. Maybe the reported 0.00022 KLD is arch related (Tigerlake and Broadwell are both Intel CPU). I have tried several impl on Broadwell, all hit the same KLD after the first few chunks, thus later there's little point to run the full test just to confirm the KLD.

PR10 with mul_sum_i8_pairs_float

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 150.86 seconds per pass - ETA 1 hours 2.85 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9557 ±    3.1807      -0.00019 ±    0.00239       0.00019 ±    0.00002     0.376 ±  0.047 %    99.608 ±  0.392 %
   2      20.1986 ±    3.4363       0.01428 ±    0.01146       0.00020 ±    0.00001     0.346 ±  0.030 %    99.608 ±  0.277 %
   3      20.8582 ±    2.7888       0.00944 ±    0.00766       0.00021 ±    0.00001     0.375 ±  0.025 %    99.216 ±  0.319 %
   4      21.2096 ±    2.3896       0.00684 ±    0.00577       0.00022 ±    0.00001     0.385 ±  0.026 %    99.412 ±  0.240 %
   5      21.0872 ±    2.1033       0.00566 ±    0.00464       0.00022 ±    0.00001     0.376 ±  0.023 %    99.529 ±  0.192 %
   6      21.2932 ±    1.9099       0.00549 ±    0.00390       0.00021 ±    0.00001     0.362 ±  0.020 %    99.477 ±  0.184 %
   7      21.4337 ±    1.7665       0.00508 ±    0.00335       0.00021 ±    0.00001     0.365 ±  0.020 %    99.440 ±  0.177 %
   8      23.1788 ±    1.8031       0.00527 ±    0.00297       0.00021 ±    0.00001     0.364 ±  0.018 %    99.412 ±  0.169 %
   9      24.6955 ±    1.8365       0.00752 ±    0.00337       0.00022 ±    0.00001     0.355 ±  0.017 %    99.390 ±  0.163 %
  10      25.4214 ±    1.7879       0.00672 ±    0.00303       0.00022 ±    0.00001     0.353 ±  0.015 %    99.294 ±  0.166 %
  11      26.0683 ±    1.7516       0.00617 ±    0.00276       0.00022 ±    0.00001     0.354 ±  0.014 %    99.287 ±  0.159 %
  12      26.5272 ±    1.7091       0.00582 ±    0.00254       0.00022 ±    0.00001     0.351 ±  0.013 %    99.346 ±  0.146 %
xor+sub (like PR4)

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 115.23 seconds per pass - ETA 48.00 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9528 ±    3.1791      -0.00040 ±    0.00223       0.00019 ±    0.00002     0.382 ±  0.053 %    99.608 ±  0.392 %
   2      20.1970 ±    3.4355       0.01420 ±    0.01145       0.00019 ±    0.00001     0.343 ±  0.033 %    99.608 ±  0.277 %
   3      20.8596 ±    2.7888       0.00950 ±    0.00765       0.00021 ±    0.00001     0.351 ±  0.026 %    99.346 ±  0.292 %
   4      21.2115 ±    2.3896       0.00693 ±    0.00576       0.00022 ±    0.00001     0.369 ±  0.025 %    99.510 ±  0.219 %
   5      21.0887 ±    2.1034       0.00573 ±    0.00463       0.00022 ±    0.00001     0.363 ±  0.022 %    99.608 ±  0.175 %
   6      21.2944 ±    1.9099       0.00555 ±    0.00389       0.00021 ±    0.00001     0.351 ±  0.019 %    99.542 ±  0.173 %
   7      21.4348 ±    1.7665       0.00513 ±    0.00334       0.00021 ±    0.00001     0.355 ±  0.020 %    99.496 ±  0.168 %
PR7 with _mm256_shuffle_epi8

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
kl_divergence: computing over 100 chunks, n_ctx=512, batch_size=2048, n_seq=4
kl_divergence: 186.99 seconds per pass - ETA 1 hours 17.90 minutes

chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1      13.9733 ±    3.1846       0.00107 ±    0.00236       0.00020 ±    0.00002     0.402 ±  0.048 %    99.608 ±  0.392 %
   2      20.2038 ±    3.4373       0.01454 ±    0.01146       0.00022 ±    0.00002     0.375 ±  0.029 %    99.608 ±  0.277 %
   3      20.8431 ±    2.7865       0.00871 ±    0.00766       0.00023 ±    0.00001     0.387 ±  0.026 %    98.693 ±  0.411 %
   4      21.1827 ±    2.3859       0.00558 ±    0.00577       0.00023 ±    0.00001     0.378 ±  0.022 %    99.020 ±  0.309 %
   5      21.0675 ±    2.1012       0.00473 ±    0.00465       0.00022 ±    0.00001     0.379 ±  0.019 %    99.137 ±  0.259 %
   6      21.2662 ±    1.9072       0.00422 ±    0.00390       0.00022 ±    0.00001     0.381 ±  0.018 %    99.085 ±  0.244 %
   7      21.4126 ±    1.7643       0.00409 ±    0.00335       0.00022 ±    0.00001     0.374 ±  0.016 %    98.992 ±  0.237 %

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 6, 2026

@zcattacz Thank you for the hint, it worked at least for at least AVX2, I will revise my current kernels and post updates

@zcattacz
Copy link
Copy Markdown

zcattacz commented Apr 6, 2026

@pl752 , oh. my bad, I misread your KLD. Are they all tested on AMD. it's also around 0.00022. The xor+sub is adapted from PR4. If you are after speed, please give it a try. You can find the code I tested for AVX2 from my comment in #7. Even the shadowed variable gives it a 5%~10% boost. I also tested double accumulator impl, but it didn't give any edge. The compiler seems to be doing some magic here.
I did a similar SSSE3 test (code also in #7) for KLD, I didn't save the result, but since it's so slow on i5, not gonna to do it again. iirc, the KLD is also ~0.00022.

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 6, 2026

@zcattacz They all tested on AMD Ryzen 5 7640HS (Zen 4)

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 6, 2026

UPD2: Okay, I have applied the advice and it resulted in positive performance changes and no significant perplexity changes. However I removed the AVX512 branch now relying on compiler taking advantage of some of the register and instruction layout changes, as I failed to achieve meaningful performance increase past current AVX2 flow on my Zen4, moreover AVX2 is variant is slightly faster even. Somebody with Zen5 or modern Intel Xeon should take a look and experiment.

Performance changes (t=10)

flow run baseline updated delta
SSSE3 pp512 58.19 t/s 72.79 t/s +25.10%
SSSE3 tg128 37.19 t/s 45.07 t/s +21.19%
AVX pp512 59.72 t/s 75.17 t/s +25.87%
AVX tg128 38.06 t/s 46.55 t/s +22.30%
AVX2 pp512 95.60 t/s 119.14 t/s +24.64%
AVX2 tg128 56.61 t/s 69.32 t/s +22.45%
AVX512 pp512 121.25 t/s 124.16 t/s +2.40%
AVX512 tg128 67.66 t/s 70.73 t/s +4.52%

Also what has happened to SSSE3 performance, its baseline seemingly increased out of nowhere? Turns out rebooting sometimes significantly increases performance

@zcattacz
Copy link
Copy Markdown

zcattacz commented Apr 6, 2026

Hi @pl752, nice improvement. If you want to squeeze a bit more juice, pls try the code in #7 (comment) , it's simpler but the compiler makes the single accumulator impl even faster than the double accumulator (gives another 2~3tps for free on an i5).

loop + double accumulator (current)
bin/cpu/llama-bench -m models/gguf/1.7B/Bonsai-1.7B.gguf -p 512 -n 128
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 1.7B Q1_0                | 231.13 MiB |     1.72 B | CPU        |       2 |           pp512 |         16.98 ± 0.23 |
| qwen3 1.7B Q1_0                | 231.13 MiB |     1.72 B | CPU        |       2 |           tg128 |         13.20 ± 0.60 |


no loop/if + single accumulator (from above comment)
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 1.7B Q1_0                | 231.13 MiB |     1.72 B | CPU        |       2 |           pp512 |         18.40 ± 1.04 |
| qwen3 1.7B Q1_0                | 231.13 MiB |     1.72 B | CPU        |       2 |           tg128 |         13.79 ± 0.74 |

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 6, 2026

That's interesting, two accumulators gave better performance before the xor+sub change, now it's other way around; baselines has drifted once more though

flow run baseline updated delta
AVX2 pp512 122.04 t/s 130.11 t/s +6.61%
AVX2 tg128 70.84 t/s 73.29 t/s +3.46%
AVX-512 pp512 129.46 t/s 136.65 t/s +5.55%
AVX-512 tg128 74.25 t/s 77.21 t/s +3.99%

Couldn't confirm KLD changes thhough, @zcattacz , check that I haven't made a mistake there

@khosravipasha
Copy link
Copy Markdown
Collaborator

Good new our first CPU PR just got merged int llama.cpp master branch now, if you are still working on this please rebase with PrismML's master (just pulled the main llama.cpp)

Changes: Q1_0_g128 naming is gone now, the original Q1_0 with group size 32 was deleted and Q1_0_g128 was renamed to Q1_0 now by default has group size 128.

https://github.com/PrismML-Eng/llama.cpp/tree/master

This one only has generic cpu (slow), and ARM NEON path, planning to gather the best x86 kernels from here and to send a PR there (and tag all the contributers).

@zcattacz
Copy link
Copy Markdown

zcattacz commented Apr 7, 2026

@pl752, yeah, I also tested double accumulator with a full unrolled version, but the speed is still a marginal net loss. Looks like FMA is not the bottleneck.

Could you try this and see if it works better for you?
The benchmark shows no overhead from the inline func but I'm not sure if it's worth it. On Broadwell _mm256_sign_epi8 leads to 1tps loss, if later architectures are more efficient, we should expect more than 1tps gain, right?

// AVX2 optimization helper
static inline __m256i avx2_apply_sign_helper(const __m256i * y, const __m256i * sm, const __m256i * ones_8)
{
#if defined(__CLWB__) || defined(__SGX__) || defined(__AVX512F__) 
    // Skylake and Zen3 onwards
    // This is a bottle neck on older arch like Haswell / Broadwell, 1TPS drop.
    return _mm256_sign_epi8(*y, _mm256_or_si256(*sm, *ones_8));
#else
    return _mm256_sub_epi8(_mm256_xor_si256(*y, *sm), *sm);
#endif
}

void ggml_vec_dot_q1_0_g128_q8_0(int n, float * GGML_RESTRICT s, size_t bs, const void * GGML_RESTRICT vx, size_t bx, const void * GGML_RESTRICT vy, size_t by, int nrc) {
    const int qk = QK1_0_g128;
    const int nb = n / qk;

    assert(n % qk == 0);
    assert(nrc == 1);
    UNUSED(nrc);
    UNUSED(bx);
    UNUSED(by);
    UNUSED(bs);

    const block_q1_0_g128 * GGML_RESTRICT x = vx;
    const block_q8_0 * GGML_RESTRICT y = vy;

#if defined(__AVX2__)
    const __m256i ones_8 = _mm256_set1_epi8(1);
    const __m256i ones_16 = _mm256_set1_epi16(1);
    const __m256i byte_shuf = _mm256_setr_epi8(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3);
    const __m256i bit_masks = _mm256_setr_epi8(1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128,1,2,4,8,16,32,64,(char)-128);
    const __m256i zero = _mm256_setzero_si256();
    __m256 acc = _mm256_setzero_ps();



    for (int ib = 0; ib < nb; ++ib) {
        const float d0 = GGML_CPU_FP16_TO_FP32(x[ib].d);
        const uint32_t *qs32 = (const uint32_t *)x[ib].qs;
        const block_q8_0 *y_ptr = &y[ib * 4];
        // y is deliberately left shadowed for a measurable performance gain
        __m256 acc_block;
        {
            const __m256i y = _mm256_loadu_si256((const __m256i *)y_ptr[0].qs);
            const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[0]), byte_shuf), bit_masks), zero);
            const __m256i sy = avx2_apply_sign_helper(&y, &sm, &ones_8);
            const __m256i s32 = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16);
            acc_block = _mm256_mul_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[0].d)), _mm256_cvtepi32_ps(s32));
        }
    #define Q1_AVX2_BLOCK(K) \
        { \
            const __m256i y = _mm256_loadu_si256((const __m256i *)y_ptr[K].qs); \
            const __m256i sm = _mm256_cmpeq_epi8(_mm256_and_si256(_mm256_shuffle_epi8(_mm256_set1_epi32((int)qs32[K]), byte_shuf), bit_masks), zero); \
            const __m256i sy = avx2_apply_sign_helper(&y, &sm, &ones_8); \
            const __m256i s32 = _mm256_madd_epi16(_mm256_maddubs_epi16(ones_8, sy), ones_16); \
            acc_block = _mm256_fmadd_ps(_mm256_set1_ps(GGML_CPU_FP16_TO_FP32(y_ptr[K].d)), _mm256_cvtepi32_ps(s32), acc_block); \
        }
        Q1_AVX2_BLOCK(1) Q1_AVX2_BLOCK(2) Q1_AVX2_BLOCK(3)
    #undef Q1_AVX2_BLOCK
        acc = _mm256_fmadd_ps(_mm256_set1_ps(d0), acc_block, acc);
    }
    {
        const __m128 h = _mm_add_ps(_mm256_extractf128_ps(acc, 0), _mm256_extractf128_ps(acc, 1));
        const __m128 q = _mm_add_ps(h, _mm_movehl_ps(h, h));
        *s = _mm_cvtss_f32(_mm_add_ss(q, _mm_movehdup_ps(q)));
    }

Since it's slow, I only run the full perplexity test in the initial tuning. Can't find the numbers, but I recall my Max KLD was way higher than what was reported in PR7. It drops down after the initial FMA is replaced with simple MUL.
I'm not quite sure why. AI said AxB+0 might introduce more rounding error. Anyway it just made a difference.

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 7, 2026

That (sign alt) was few percent slower unfortunately in my case too

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 7, 2026

Brought the code to uniform structure, insignificant changes in ASM for AVX, no measurable changes in perplexity or performance, will prepare for rebase

@pl752 pl752 changed the base branch from prism to master April 7, 2026 06:51
@pl752 pl752 force-pushed the perf/q1_0_g128_no_nofma branch from b793ed1 to 195593b Compare April 7, 2026 06:53
@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 7, 2026

@khosravipasha Okay, rebase is complete, awaiting review (squashed changes, lost info about one of the commit authors (jordankzf), no code from his PR is left though)

@khosravipasha
Copy link
Copy Markdown
Collaborator

@pl752 awesome thanks, now our CPU PR which has generic and NEON implementation (ggml-org#21273) is merged in main llama.cpp and Metal is being merged soon.
Should we write a polished PR draft and send it to main llama.cpp?

I can help with end to end testing and assuring correctness and writing the PR description but don't know much about AVX / AVX512. Maybe you can send the PR and tag people that contributed?

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 7, 2026

I think yes, we can write a draft and then send it to main tree. However I am going to sleep currently, so I will help later. I have been testing perplexity all the way through to avoid breaking things, but additional tests won't hurt. Then benchmarks for the scalar and SIMD branches need to be redone and cleaned up summary added. In my current implementation SSSE3 is for most of cpus, AVX helps with fp32 accum part, but difference in performance is questionable, AVX 2 handles modern-ish cpus as the most performant way, and AVX-512 specific branch was discarded (due to me failing to obtain any improvements over AVX2 with AVX-512 flag set on my Zen4), but it is still mentioned in my latest benchmarks, as compiler still produces more optimized code for AVX2 with AVX-512 enabled due to AVX-512 providing 32 SIMD registers instead of 16 (aside from fact that their max length extends to 512 bit, which isn't used there) allowing some additional freedom during applying O3 opts.

@pl752
Copy link
Copy Markdown
Author

pl752 commented Apr 7, 2026

So I can try creating PR draft myself tommorow and tag everybody from discussion and remained not my own code if any is left, then we will look into the next steps, the code itself seems to be pretty clean.

@khosravipasha
Copy link
Copy Markdown
Collaborator

khosravipasha commented Apr 8, 2026

Thanks @pl752 of course take your time, sleep is more important :D

Closed the other CPU PRs. We can collectively send a PR to main llama.cpp when this branch is ready.

Tagging people that helped in other PRs (let me know if I missed any)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants