First-time setup log for building llama.cpp with TurboQuant KV cache compression on Windows 11 with an AMD RX 9070 XT (gfx1201, RDNA 4).
Status: WIP. RDNA 4 HIP support is bleeding-edge. Expect rough edges.
| Tool | Version | Install Method | Notes |
|---|---|---|---|
| Git | 2.x | Pre-installed | C:\Program Files\Git |
| Python | 3.10 | Windows Store | For turboquant_plus Python prototype |
| VS 2022 Build Tools | v143 | winget install Microsoft.VisualStudio.2022.BuildTools | REQUIRED — VS 2019 won't work (see Gotcha #1) |
| CMake | 4.3.1 | winget install Kitware.CMake | Installs to C:\Program Files\CMake\bin |
| Ninja | latest | pip install ninja or winget install Ninja-build.Ninja | pip version is more reliable for PATH |
| HIP SDK for Windows | 7.1 | Manual download from AMD | ~1.6GB, installs to C:\Program Files\AMD\ROCm\7.1 |
```
# CMake
winget install Kitware.CMake

# Ninja (via pip — more reliable than winget for PATH)
pip install ninja

# VS 2022 Build Tools (MUST be 2022, not 2019)
winget install Microsoft.VisualStudio.2022.BuildTools --override "--quiet --wait --norestart --add Microsoft.VisualStudio.Workload.VCTools --add Microsoft.VisualStudio.Component.VC.Tools.x86.x64 --add Microsoft.VisualStudio.Component.Windows11SDK.22621"
```

Download the HIP SDK from AMD: https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html (~1.6GB). Install to the default location (C:\Program Files\AMD\ROCm\7.1).
Verify:

```
set PATH=C:\Program Files\AMD\ROCm\7.1\bin;%PATH%
hipcc --version
hipinfo
```

```
cd C:\models

# Python research prototype
git clone https://github.com/TheTom/turboquant_plus.git
cd turboquant_plus && python -m venv .venv

# On Windows (Git Bash): source .venv/Scripts/activate (NOT bin/activate)
source .venv/Scripts/activate
pip install -e ".[dev]"
python -m pytest tests/ -v   # 542+ pass, ~9 platform-specific failures OK
```
```
# llama.cpp TurboQuant fork (with HIP/ROCm support)
git clone --branch feature/turboquant-kv-cache \
    https://github.com/TheTom/llama-cpp-turboquant.git llama-cpp-tq
```

See the gotchas below. Key changes needed:

- `ggml/src/ggml-hip/CMakeLists.txt` — add `-xhip -include __clang_hip_runtime_wrapper.h`
- `ggml/src/ggml-cuda/vendors/hip.h` — add `<algorithm>` + `using std::min/max`
- `ggml/src/ggml-turbo-quant.c` — add `M_PI` define
- `ggml/src/ggml-cuda/turbo-innerq.cu` — add `dllexport` on Windows
- `src/llama-kv-cache.cpp` — add `dllimport` on Windows
- `ggml/src/ggml-cuda/fattn-tile.cu` — guard D >= 576 for HIP
From an x64 Native Tools Command Prompt for VS 2022:

```
cd C:\models\llama-cpp-tq
set PATH=C:\Program Files\AMD\ROCm\7.1\bin;C:\Program Files\CMake\bin;%PATH%
set HIP_PATH=C:\Program Files\AMD\ROCm\7.1

cmake -S . -B build -G Ninja ^
    -DGPU_TARGETS=gfx1201 ^
    -DGGML_HIP=ON ^
    -DGGML_CUDA_FA_ALL_QUANTS=ON ^
    -DCMAKE_C_COMPILER=clang ^
    -DCMAKE_CXX_COMPILER=clang++ ^
    -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release
```

```
:: Qwen2.5-7B Q4_K_M — use asymmetric K/V (recommended for Q4_K_M)
build\bin\llama-cli.exe ^
    -m C:\models\qwen2.5-7b-instruct-q4_k_m.gguf ^
    -ngl 99 -c 2048 -fa on ^
    --cache-type-k q8_0 --cache-type-v turbo4 ^
    -n 100 -p "Hello, I am a language model running on"
```

Output:

```
Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 16304 MiB
```
Qwen2.5-7B-Instruct Q4_K_M, -ctk q8_0 -ctv turbo4, -c 512
Prompt: 412.0 t/s
Generation: 52.3 t/s
gfx1201 detected natively — no HSA_OVERRIDE_GFX_VERSION needed!
The common/ library's jinja template support pulls in `<functional>`; VS 2019's v14.28 MSVC headers implement it with a `reinterpret_cast` that casts away qualifiers. HIP SDK's clang is stricter than MSVC's cl.exe and rejects the VS 2019 header code:

```
functional:1259:16: error: reinterpret_cast from 'const void *' to '...' casts away qualifiers
```

Fix: Install VS 2022 Build Tools with the winget command above.
MSVC's <corecrt_math.h> declares fabsf, fmaxf, expf, logf, powf as
__inline host-only functions. HIP's clang respects these attributes and refuses to
call them from __device__ code.
On Linux this works because glibc's math headers don't have host/device annotations.
On Windows, the HIP device math wrappers in __clang_hip_math.h are only included
when compiling with -xhip flag — but llama.cpp's Windows HIP build compiles .cu
files as CXX (C++), not HIP language.
Fix: Add -xhip -include __clang_hip_runtime_wrapper.h to the compile flags for
all .cu files on Windows:
```cmake
# In ggml/src/ggml-hip/CMakeLists.txt
if (WIN32)
    set(HIP_WIN_FLAGS "-xhip -include __clang_hip_runtime_wrapper.h")
    set_source_files_properties(${GGML_SOURCES_ROCM} PROPERTIES COMPILE_FLAGS "${HIP_WIN_FLAGS}")
endif()
```

__clang_cuda_complex_builtins.h (included via hip_runtime.h) uses bare min/max, which aren't defined in device scope on Windows.
Fix: Include <algorithm> and using std::min; using std::max; in vendors/hip.h
before <hip/hip_runtime.h>.
The turbo-quant C code uses M_PI which MSVC only defines if _USE_MATH_DEFINES
is set before <cmath>.
Fix: Add #ifndef M_PI / #define M_PI ... after includes in ggml-turbo-quant.c.
llama.cpp builds as multiple DLLs on Windows (ggml-base.dll, ggml-cpu.dll,
ggml-hip.dll, llama.dll). Global variables and functions shared across DLLs need
__declspec(dllexport) on the definition side and __declspec(dllimport) on the
consumer side, with extern "C" linkage to avoid C++ name mangling.
The TurboQuant fork has four cross-DLL symbols that need this treatment:

- `turbo3_cpu_wht_group_size` (ggml-base → ggml-cpu)
- `g_innerq_scale_inv_host[]`, `turbo_innerq_needs_tensor_update()`, `turbo_innerq_mark_tensor_updated()` (ggml-hip → llama)
Fix: Create proper API functions with GGML_API decoration, or add
platform-specific dllexport/dllimport guards.
`winget install Ninja-build.Ninja` adds ninja to the Windows PATH, but Git Bash doesn't pick it up until you open a new terminal.
Fix: `pip install ninja` provides a Python-managed ninja that's always on PATH.
The HIP CMakeLists excludes fattn-tile instances for D=576 and D=640 (exceed 65536
byte local memory limit). But the dispatch code in fattn-tile.cu still references
them, causing linker errors.
Fix: Guard the dispatch cases with #ifdef GGML_USE_HIP → GGML_ABORT(...).
When setting HIP_PATH in cmd.exe, beware of trailing spaces:

```
set HIP_PATH=C:\Program Files\AMD\ROCm\7.1
```

The hipconfig script concatenates paths using HIP_PATH, and a trailing space causes "version file not found" warnings. Harmless but confusing.
HIP SDK 7.1 (clang 21) includes gfx1201 support. No need for HSA_OVERRIDE_GFX_VERSION.
The GPU is detected correctly by hipinfo:

```
Name:           AMD Radeon RX 9070 XT
warpSize:       32
totalGlobalMem: 15.92 GB
```
The TurboQuant HIP implementation reuses CUDA .cu files compiled through the HIP
backend (standard AMD approach). Key files:
- `ggml/src/ggml-cuda/turbo-quant.cuh` — quantize/dequantize kernels
- `ggml/src/ggml-cuda/turbo-wht.cu` — Walsh-Hadamard Transform
- `ggml/src/ggml-cuda/turbo-innerq.cu` — per-channel equalization
- `ggml/src/ggml-cuda/template-instances/fattn-vec-instance-turbo*.cu` — FlashAttention
- `ggml/src/ggml-hip/CMakeLists.txt` — HIP build configuration