Fixes for CPU backend + instructions for targeting AMD GPUs #5

Closed
philtomson wants to merge 3 commits into PrismML-Eng:prism from philtomson:prism

Conversation

@philtomson

This PR fixes a CPU kernel bug where a float scale was converted to an int (0.4 -> 0), which broke the CPU backend for the Bonsai 1-bit models. It also adds AVX2 and AVX-512 support, and documents in README.md how to compile with ROCm for AMD GPUs.

philtomson and others added 2 commits April 2, 2026 08:08
Fix float-truncation bug in ggml_vec_dot_q1_0_g128_q8_0: the Q8_0
scale factor d1 was accumulated into an int, silently truncating it
to zero for most blocks and producing garbage output on CPU.

Add AVX2 kernel using the same shuffle/bit-mask expansion as q1_0,
processing one 32-element Q8_0 sub-block per iteration (~4x speedup).

Add AVX-512 hybrid kernel using _mm256_mask_blend_epi8 (AVX-512BW+VL)
to expand weight bits in a single instruction, with _mm256_dpbusd_epi32
(AVX-512VNNI) for the dot product (~5.7x speedup on 1.7B model).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Document how to build with GGML_HIP=ON targeting gfx1151 (Radeon 8060S /
Ryzen AI MAX+), including the Docker-based workflow for systems where the
system ROCm installation is partial.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the ggml label Apr 2, 2026
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 code and will merge one of the correct AVX implementations.

Could you run the KL divergence tests described here: #8

The AMD one looks great; I did not realize you could build the CUDA version for AMD. How are the speeds?

@philtomson
Author

philtomson commented Apr 3, 2026

I'm getting about 55 tok/sec with the 8B model on my Framework desktop PC (Strix Halo). That's about half of what I've seen someone report getting with a 4090 - probably about right given the slower memory bandwidth since this whole model can easily fit in a 4090.

@philtomson
Author

philtomson commented Apr 3, 2026

As for the KL divergence numbers:

KL Divergence Results: Bonsai-1.7B on wikitext-2-raw (100 chunks, ctx=512)

Setup: F16 reference logits compared against Q1_0_g128 and Q1_0 quantizations.

Perplexity

| Model | PPL | PPL ratio vs F16 |
| --- | --- | --- |
| F16 (reference) | 24.203 | 1.000× |
| Q1_0_g128 | 24.262 | 1.0024× |
| Q1_0 | 24.220 | 1.0007× |

KL Divergence vs F16

| Metric | Q1_0_g128 | Q1_0 |
| --- | --- | --- |
| Mean KLD | 0.000645 | 0.000844 |
| Median KLD | 0.000500 | 0.000680 |
| Maximum KLD | 0.013948 | 0.020355 |
| 99th percentile KLD | 0.003021 | 0.003804 |
| Mean Δp | 0.002% | -0.049% |
| Max Δp | 7.80% | 6.62% |

Numbers seem higher than your NEON numbers.


Copilot AI left a comment


Pull request overview

This PR addresses correctness and performance in the CPU quantized dot-product path needed for Bonsai 1-bit (Q1_0_g128) models, and adds user-facing build/run instructions for AMD GPUs via ROCm/HIP.

Changes:

  • Fixes CPU generic Q1_0_g128 × Q8_0 dot-product accumulation to avoid unintended float→int truncation.
  • Adds x86 SIMD implementations for Q1_0_g128 × Q8_0 using AVX2 and an AVX-512 (BW/VL/VNNI) fast path.
  • Documents ROCm/HIP build and Docker usage in the main README.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| README.md | Adds ROCm/HIP requirements, build, run, and Docker instructions for AMD GPUs. |
| ggml/src/ggml-cpu/quants.c | Fixes generic CPU dot-product logic for Q1_0_g128 by using floating-point accumulation. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Introduces AVX2 and AVX-512 (VNNI) optimized dot-product implementations plus a scalar fallback. |


- README: reword "This branch" → "The HIP backend"; loosen ROCm version
  requirement to 6.x+ with note that gfx1151 rocBLAS kernels require 7.2;
  add link to docs/build.md; use --device /dev/dri instead of hardcoded
  card/render nodes in Docker example
- x86/quants.c: replace strict-aliasing UB (uint32_t* cast of uint8_t*
  buffer) with memcpy in both AVX-512 and AVX2 qbits32 loads

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@khosravipasha
Collaborator

Good news: our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase onto PrismML's master (we just pulled in the main llama.cpp).

Changes: the Q1_0_g128 naming is gone now; the original Q1_0 with group size 32 was deleted, and Q1_0_g128 was renamed to Q1_0, which now defaults to group size 128.

https://github.com/PrismML-Eng/llama.cpp/tree/master

@khosravipasha
Collaborator

There are a lot of CPU PRs; I'm planning to gather them all into one and then send it to the main llama.cpp. I'm going to close this one and mention the people who helped in a thread there; if you think your solution is better, please comment there:
#10

On the AMD side, we have now added prebuilt binaries to our llama.cpp fork: https://github.com/PrismML-Eng/llama.cpp/releases/tag/prism-b8201-ba7e817
They will be added to the Bonsai-demo folder soon:
PrismML-Eng/Bonsai-demo#28

