Fixes for CPU backend + instructions for targeting AMD GPUs #5
philtomson wants to merge 3 commits into PrismML-Eng:prism
Conversation
- Fix float-truncation bug in ggml_vec_dot_q1_0_g128_q8_0: the Q8_0 scale factor d1 was accumulated into an int, silently truncating it to zero for most blocks and producing garbage output on CPU.
- Add AVX2 kernel using the same shuffle/bit-mask expansion as q1_0, processing one 32-element Q8_0 sub-block per iteration (~4x speedup).
- Add AVX-512 hybrid kernel using _mm256_mask_blend_epi8 (AVX-512BW+VL) to expand weight bits in a single instruction, with _mm256_dpbusd_epi32 (AVX-512VNNI) for the dot product (~5.7x speedup on 1.7B model).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Document how to build with GGML_HIP=ON targeting gfx1151 (Radeon 8060S / Ryzen AI MAX+), including the Docker-based workflow for systems where the system ROCm installation is partial. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
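The documented build can be sketched as follows; everything beyond `GGML_HIP=ON` and the `gfx1151` target is standard llama.cpp CMake convention assumed here, not quoted from the README:

```shell
# Sketch of a ROCm/HIP build for gfx1151 (Radeon 8060S / Ryzen AI MAX+).
# Assumes a working ROCm toolchain; flag names follow upstream llama.cpp docs.
cmake -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
```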
This looks great, thanks. There were a few CPU kernel fixes and I did not see them until I pushed my changes. For now I removed the buggy x86 kernel and will merge one of the correct AVX ones. Could you run the KL divergence tests described here: #8 The AMD one looks great; I did not realize you could build the CUDA version for AMD. How are the speeds?
I'm getting about 55 tok/sec with the 8B model on my Framework desktop PC (Strix Halo). That's about half of what I've seen someone report with a 4090, which is probably about right given the slower memory bandwidth, since the whole model easily fits in a 4090.
As for the KL divergence numbers:

KL Divergence Results: Bonsai-1.7B on wikitext-2-raw (100 chunks, ctx=512)

Setup: F16 reference logits compared against the Q1_0_g128 and Q1_0 quantizations, reporting perplexity and KL divergence vs F16.

Numbers seem higher than your NEON numbers.
Pull request overview
This PR addresses correctness and performance in the CPU quantized dot-product path needed for Bonsai 1-bit (Q1_0_g128) models, and adds user-facing build/run instructions for AMD GPUs via ROCm/HIP.
Changes:
- Fixes CPU generic Q1_0_g128 × Q8_0 dot-product accumulation to avoid unintended float→int truncation.
- Adds x86 SIMD implementations for Q1_0_g128 × Q8_0 using AVX2 and an AVX-512 (BW/VL/VNNI) fast path.
- Documents ROCm/HIP build and Docker usage in the main README.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| README.md | Adds ROCm/HIP requirements, build, run, and Docker instructions for AMD GPUs. |
| ggml/src/ggml-cpu/quants.c | Fixes generic CPU dot-product logic for Q1_0_g128 by using floating-point accumulation. |
| ggml/src/ggml-cpu/arch/x86/quants.c | Introduces AVX2 and AVX-512(VNNI) optimized dot-product implementations plus scalar fallback. |
- README: reword "This branch" → "The HIP backend"; loosen the ROCm version requirement to 6.x+ with a note that gfx1151 rocBLAS kernels require 7.2; add a link to docs/build.md; use --device /dev/dri instead of hardcoded card/render nodes in the Docker example.
- x86/quants.c: replace strict-aliasing UB (uint32_t* cast of a uint8_t* buffer) with memcpy in both the AVX-512 and AVX2 qbits32 loads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Good news: our first CPU PR just got merged into the llama.cpp master branch. If you are still working on this, please rebase onto PrismML's master (we just pulled in the main llama.cpp). Changes: the Q1_0_g128 naming is gone; the original Q1_0 with group size 32 was deleted, and Q1_0_g128 was renamed to Q1_0, which now defaults to group size 128.
There are a lot of CPU PRs; we are planning to gather them all into one and then send it to the main llama.cpp. On the AMD side, we have now added prebuilt binaries to our llama.cpp fork: https://github.com/PrismML-Eng/llama.cpp/releases/tag/prism-b8201-ba7e817
This PR fixes a CPU kernel bug where a float was implicitly converted to an int (0.4 -> 0), which broke the CPU backend for the Bonsai 1-bit models. Support was added for AVX2 and AVX-512. It also adds information to the README.md on compiling with ROCm for AMD GPUs.