fix: Q1_0_g128 x86 CPU kernel — float truncation + AVX2 vectorization #7

wildcattrio wants to merge 1 commit into PrismML-Eng:prism from …
Conversation
The Q1_0_g128 x86 kernel has two bugs causing gibberish output at 0.25 tok/s on Intel CPUs:

1. Float-to-int truncation: the per-block accumulator was `int`, truncating `d1 * sumi_block` (float * int → float → int). Each Q8_0 block's scale factor was rounded to 0 or ±1, destroying the output. Fix: a `float block_sum` accumulator.
2. No SIMD: the x86 path was scalar-only while ARM NEON had full vectorization. Added AVX2 using the same broadcast/shuffle/cmpeq pattern from the existing Q1_0 kernel plus `mul_sum_i8_pairs_float`.

Results on i5-1135G7 with Bonsai 8B:

| Build | Speed | Output |
|---|---|---|
| Before (MSVC) | 0.25 tok/s | gibberish |
| Bug fix only | 3.7 tok/s | correct |
| Bug fix + AVX2 | 6.9 tok/s | correct |

Both the x86-specific kernel (`arch/x86/quants.c`) and the generic fallback (`quants.c`) are fixed.
This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 path; I will merge one of the correct AVX ones. Could you run the KL divergence tests described here: #8
Running 8B on an i5 box with this PR, I get consistent … After swapping it out I get consistent … The alternative code below on an i5 Broadwell gives: …

The 0.0002 KLD seems persistent on AVX2 across different basic implementations.
The SSE path goes from 0.1 tps to 0.7~0.9 tps on an N2840 Atom. AI suggested that meaningful acceleration for 1bitnet on CPUs lacking AVX instructions could only be achieved by implementing the dot product as …
**KL Divergence Results — x86 AVX2 (PR #7)**

Ran the KL divergence tests from PR #8 on the AVX2 kernel fix from this PR.

Hardware: Intel i5-1135G7 (Tiger Lake), 32GB RAM, Windows 11.
Build:
System info:

**Test setup**

**x86 AVX2 Divergences**

**Comparison with PR #8 reference (ARM NEON / generic scalar)**
The AVX2 kernel shows measurably higher divergence than the NEON/scalar reference. The likely cause is floating-point operation ordering: our AVX2 path pre-multiplies …

Note: @zcattacz's XOR+SUB approach posted above uses the two-level accumulation pattern (…

Output quality is still good despite the divergence: text generation is coherent and the PPL difference is only 0.057 (24.09 vs 24.04).
Hi @wildcattrio, I updated the implementation and here is the combined result. You were right: xor+sub gives slightly better KLD with good tps. I also tried other implementations for tps; the best were on par, but this is the simplest.
Pull request overview
Fixes incorrect output and improves performance for the Q1_0_g128 × Q8_0 x86 CPU dot-product kernel by correcting float accumulation and adding an AVX2 vectorized implementation aligned with existing bit-expansion patterns in the x86 quant kernels.
Changes:
- Fix float-to-int truncation by switching per-block accumulation to `float` in the generic kernel and x86 scalar fallback.
- Add an AVX2 implementation for `ggml_vec_dot_q1_0_g128_q8_0` using broadcast/shuffle/bit-test expansion and `mul_sum_i8_pairs_float()`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `ggml/src/ggml-cpu/quants.c` | Fixes scalar generic accumulation type to prevent truncation and incorrect results. |
| `ggml/src/ggml-cpu/arch/x86/quants.c` | Adds AVX2 vectorized path and fixes scalar fallback accumulation type for x86. |
```c
const __m256i qy = _mm256_loadu_si256((const __m256i *)yb->qs);

// Get 4 bytes of bits for this Q8_0 block
const uint32_t bits32 = *(const uint32_t *)&x[ib].qs[k * 4];
```
`bits32` is loaded via a `uint32_t *` cast from `x[ib].qs` (`*(const uint32_t *)&x[ib].qs[k * 4]`), which can violate strict-aliasing rules and may be unaligned. Prefer copying into a local `uint32_t` with `memcpy` (similar to `bytes_from_bits_32()` earlier in this file) to safely preserve the bit pattern under optimization.
```diff
-const uint32_t bits32 = *(const uint32_t *)&x[ib].qs[k * 4];
+uint32_t bits32;
+memcpy(&bits32, &x[ib].qs[k * 4], sizeof(bits32));
```
Good news: our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase onto PrismML's master (it just pulled in the main llama.cpp). Changes: the Q1_0_g128 naming is gone; the original Q1_0 with group size 32 was deleted and Q1_0_g128 was renamed to Q1_0, which now has group size 128 by default. https://github.com/PrismML-Eng/llama.cpp/tree/master That tree only has the generic CPU path (slow) and the ARM NEON path; I plan to gather the best x86 kernels from here and send a PR there (and tag all the contributors).
There are a lot of CPU PRs; I plan to gather them all into one and then send it to the main llama.cpp.
Summary
The Q1_0_g128 x86 CPU kernel produces gibberish output at 0.25 tok/s on Intel CPUs. Two bugs:

**Bug 1: Float-to-int truncation (causes gibberish)**

The per-block accumulator is `int`, but `d1 * sumi_block` produces a `float`. The implicit cast truncates every Q8_0 block's scale factor to 0 or ±1, destroying the output.

**Bug 2: No SIMD (causes 0.25 tok/s)**

The x86 kernel is scalar-only while the ARM NEON version has full vectorization. Added AVX2 using the same `broadcast → shuffle → cmpeq → mul_sum_i8_pairs_float` pattern from the existing `ggml_vec_dot_q1_0_q8_0` kernel.

**Results (i5-1135G7, 32GB, Bonsai 8B)**

| Build | Speed | Output |
|---|---|---|
| Before (MSVC) | 0.25 tok/s | gibberish: `, with is it. and the. and the.... in.........` |
| Bug fix only | 3.7 tok/s | correct |
| Bug fix + AVX2 | 6.9 tok/s | correct |

**Files changed**

- `ggml/src/ggml-cpu/arch/x86/quants.c` — AVX2 kernel + scalar fix
- `ggml/src/ggml-cpu/quants.c` — generic scalar fallback fix

**Test plan**