Fixes for CPU backend + instructions for targeting AMD GPUs #5

Closed
philtomson wants to merge 3 commits into PrismML-Eng:prism from philtomson:prism

Conversation

@philtomson

This PR fixes a CPU kernel bug where a float scale was converted to an int (0.4 -> 0), which broke the CPU backend for the Bonsai 1-bit models. It also adds AVX2 and AVX-512 support, and documents in README.md how to compile with ROCm for AMD GPUs.

philtomson and others added 2 commits April 2, 2026 08:08
Fix float-truncation bug in ggml_vec_dot_q1_0_g128_q8_0: the Q8_0
scale factor d1 was accumulated into an int, silently truncating it
to zero for most blocks and producing garbage output on CPU.

Add AVX2 kernel using the same shuffle/bit-mask expansion as q1_0,
processing one 32-element Q8_0 sub-block per iteration (~4x speedup).

Add AVX-512 hybrid kernel using _mm256_mask_blend_epi8 (AVX-512BW+VL)
to expand weight bits in a single instruction, with _mm256_dpbusd_epi32
(AVX-512VNNI) for the dot product (~5.7x speedup on 1.7B model).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Document how to build with GGML_HIP=ON targeting gfx1151 (Radeon 8060S /
Ryzen AI MAX+), including the Docker-based workflow for systems where the
system ROCm installation is partial.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the ggml label Apr 2, 2026
@khosravipasha
Collaborator

This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 code and will merge one of the correct AVX implementations.

Could you run the KL divergence tests described here: #8

The AMD one looks great; I did not realize you could build the CUDA version for AMD. How are the speeds?

@philtomson
Author

philtomson commented Apr 3, 2026

I'm getting about 55 tok/sec with the 8B model on my Framework desktop PC (Strix Halo). That's about half of what I've seen someone report getting with a 4090 - probably about right given the slower memory bandwidth since this whole model can easily fit in a 4090.

@philtomson
Author

philtomson commented Apr 3, 2026

As for the KL divergence numbers:

KL Divergence Results: Bonsai-1.7B on wikitext-2-raw (100 chunks, ctx=512)

Setup: F16 reference logits compared against Q1_0_g128 and Q1_0 quantizations.

Perplexity

| Model | PPL | PPL ratio vs F16 |
| --- | --- | --- |
| F16 (reference) | 24.203 | 1.000× |
| Q1_0_g128 | 24.262 | 1.0024× |
| Q1_0 | 24.220 | 1.0007× |

KL Divergence vs F16

| Metric | Q1_0_g128 | Q1_0 |
| --- | --- | --- |
| Mean KLD | 0.000645 | 0.000844 |
| Median KLD | 0.000500 | 0.000680 |
| Maximum KLD | 0.013948 | 0.020355 |
| 99th percentile KLD | 0.003021 | 0.003804 |
| Mean Δp | 0.002% | -0.049% |
| Max Δp | 7.80% | 6.62% |

Numbers seem higher than your NEON numbers.


Copilot AI left a comment


Pull request overview

This PR addresses correctness and performance in the CPU quantized dot-product path needed for Bonsai 1-bit (Q1_0_g128) models, and adds user-facing build/run instructions for AMD GPUs via ROCm/HIP.

Changes:

  • Fixes CPU generic Q1_0_g128 × Q8_0 dot-product accumulation to avoid unintended float→int truncation.
  • Adds x86 SIMD implementations for Q1_0_g128 × Q8_0 using AVX2 and an AVX-512 (BW/VL/VNNI) fast path.
  • Documents ROCm/HIP build and Docker usage in the main README.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| README.md | Adds ROCm/HIP requirements, build, run, and Docker instructions for AMD GPUs. |
| ggml/src/ggml-cpu/quants.c | Fixes generic CPU dot-product logic for Q1_0_g128 by using floating-point accumulation. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Introduces AVX2 and AVX-512 (VNNI) optimized dot-product implementations plus a scalar fallback. |


- README: reword "This branch" → "The HIP backend"; loosen ROCm version
  requirement to 6.x+ with note that gfx1151 rocBLAS kernels require 7.2;
  add link to docs/build.md; use --device /dev/dri instead of hardcoded
  card/render nodes in Docker example
- x86/quants.c: replace strict-aliasing UB (uint32_t* cast of uint8_t*
  buffer) with memcpy in both AVX-512 and AVX2 qbits32 loads

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@khosravipasha
Collaborator

Good news: our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase onto PrismML's master (we just pulled in the main llama.cpp).

Changes: the Q1_0_g128 naming is gone now; the original Q1_0 with group size 32 was deleted, and Q1_0_g128 was renamed to Q1_0, which now defaults to group size 128.

https://github.com/PrismML-Eng/llama.cpp/tree/master

@khosravipasha
Collaborator

There are a lot of CPU PRs; I'm planning to gather them all into one and then send it to the main llama.cpp. I'm going to close this one and mention the people who helped in a thread there; if you think your solution is better, please comment there:
#10

On the AMD side, we have now added prebuilt binaries to our llama.cpp fork: https://github.com/PrismML-Eng/llama.cpp/releases/tag/prism-b8201-ba7e817
They will be added to the Bonsai-demo folder soon:
PrismML-Eng/Bonsai-demo#28

