Add KQuant quantization mode #3588
Open
asher wants to merge 1 commit into
Proposed changes
Add QuantizationMode::KQuant, a new quantized matmul family that reads K-quant wire-format data directly in Metal and CPU kernels, enabling inference from K-quant quantized weights without any Python-side layout transform. Supported codecs: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0.
Motivation
K-quant formats are block quantization schemes with hierarchical sub-block scales, originally developed in the GGML ecosystem and now the dominant format for on-device LLM inference. They use importance-matrix calibration and mixed-precision codec assignment (different layers get different codecs based on sensitivity) to achieve significantly better quality than uniform affine quantization at comparable bits-per-weight.
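To make "hierarchical sub-block scales" concrete, here is a minimal sketch of two-level dequantization. It assumes a 256-weight super-block split into 8 sub-blocks of 32, with one float scale pair per super-block and small integer scale/min values per sub-block, in the style of the 4-bit K-quant codec. `ToySuperBlock` and `dequantize_one` are hypothetical names for illustration only: the struct stores everything unpacked, so it is not the wire layout and not the loader code in this PR.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical, unpacked stand-in for a K-quant super-block.
// The real formats pack these fields into bytes; this struct does not.
struct ToySuperBlock {
    static constexpr int kSubBlocks = 8;      // sub-blocks per super-block
    static constexpr int kSubBlockSize = 32;  // weights per sub-block

    float d;     // super-block scale applied to the integer sub-block scales
    float dmin;  // super-block scale applied to the integer sub-block mins
    std::array<std::uint8_t, kSubBlocks> scales;             // per-sub-block scales
    std::array<std::uint8_t, kSubBlocks> mins;               // per-sub-block mins
    std::array<std::uint8_t, kSubBlocks * kSubBlockSize> q;  // quantized values, one per byte
};

// Reconstruct weight i: the effective scale and offset are each a product of
// a super-block float and a sub-block integer, which is the "hierarchy".
inline float dequantize_one(const ToySuperBlock& b, int i) {
    assert(i >= 0 && i < ToySuperBlock::kSubBlocks * ToySuperBlock::kSubBlockSize);
    const int j = i / ToySuperBlock::kSubBlockSize;  // which sub-block
    const float scale = b.d * static_cast<float>(b.scales[j]);
    const float offset = b.dmin * static_cast<float>(b.mins[j]);
    return scale * static_cast<float>(b.q[i]) - offset;
}
```

Storing only small integers per sub-block and two floats per super-block is what keeps the metadata overhead low relative to one float scale per 32 weights.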
At 4-5 bits per weight (bpw), K-quant recipes deliver 2-3x lower mean KL divergence (KLD) than the best available MLX-native affine recipes. There is currently no MLX-native quantization method that matches K-quant quality at K-quant byte budgets.
Design
Follows the same architectural pattern as the existing mxfp4/mxfp8/nvfp4 family (fp_quantized.h): per-codec loader structs handle wire-format decoding, while shared _impl template functions (kq_qmm_t_impl, kq_qmm_n_impl, kq_qmv_fast_impl, etc.) implement the matmul algorithms. Each codec defines a KqBlockLoader with load_unsafe() / load_safe() / next() that decodes packed bytes directly in Metal, with no Python-side layout transform or preprocessing step. Per codec, six kernel variants are instantiated across three dtypes (float32, float16, bfloat16): qmv_fast, qmv, qvm, qmm_t, qmm_n, and gather_qmm. C++ dispatch constructs the kernel name from the codec string (e.g. kq_q4_k_qmm_t_float16_t). Adding a new codec requires only one loader struct, one line in the codec registry, and instantiation macro calls.
The CPU backend includes full dequantize and matmul implementations for all codecs (for use in CI and as a reference path only).
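As a rough illustration of the kernel-name construction described above, the sketch below joins a codec string, a kernel variant, and a type suffix into a name of the form shown in the example. `kq_kernel_name` is a hypothetical helper, not the function used in this PR, and the suffixes used for float32 and bfloat16 are not stated here, so the caller supplies the suffix.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds "kq_<codec>_<variant>_<type suffix>".
// Only the float16 suffix ("float16_t") is taken from the example in the text;
// any other suffix passed in is the caller's assumption.
inline std::string kq_kernel_name(const std::string& codec,
                                  const std::string& variant,
                                  const std::string& type_suffix) {
    assert(!codec.empty() && !variant.empty() && !type_suffix.empty());
    return "kq_" + codec + "_" + variant + "_" + type_suffix;
}
```

For example, kq_kernel_name("q4_k", "qmm_t", "float16_t") yields kq_q4_k_qmm_t_float16_t, matching the name quoted above.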
Quality: Mean KLD vs MLX affine quantization
Mean per-token KL divergence from unquantized bf16 teacher (lower = better).
Protocol: wikitext-103-raw-v1, 512 samples x 512 tokens, scored on positions [256, 512), top-K=32768 reconstruction. Evaluation tools: https://github.com/asher/mlx-quant-tools
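For readers unfamiliar with the metric, here is a minimal sketch of per-token KL divergence from teacher logits to student logits, averaged over scored positions. It computes the divergence over the full vocabulary and does not model the top-K reconstruction step of the protocol; `log_softmax`, `token_kld`, and `mean_kld` are hypothetical names, and the actual evaluation code lives in the linked tools repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable log-softmax over one position's logits.
inline std::vector<double> log_softmax(const std::vector<double>& logits) {
    assert(!logits.empty());
    const double m = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (double x : logits) sum += std::exp(x - m);
    const double lse = m + std::log(sum);
    std::vector<double> out;
    out.reserve(logits.size());
    for (double x : logits) out.push_back(x - lse);
    return out;
}

// KL(teacher || student) for a single token position, in nats.
inline double token_kld(const std::vector<double>& teacher_logits,
                        const std::vector<double>& student_logits) {
    assert(teacher_logits.size() == student_logits.size());
    const std::vector<double> lp = log_softmax(teacher_logits);
    const std::vector<double> lq = log_softmax(student_logits);
    double kl = 0.0;
    for (std::size_t i = 0; i < lp.size(); ++i) {
        kl += std::exp(lp[i]) * (lp[i] - lq[i]);
    }
    return kl;
}

// Mean over all scored positions (e.g. positions [256, 512) of each sample).
inline double mean_kld(const std::vector<std::vector<double>>& teacher,
                       const std::vector<std::vector<double>>& student) {
    assert(!teacher.empty() && teacher.size() == student.size());
    double total = 0.0;
    for (std::size_t t = 0; t < teacher.size(); ++t) {
        total += token_kld(teacher[t], student[t]);
    }
    return total / static_cast<double>(teacher.size());
}
```

Identical logits give a divergence of zero, and any mismatch gives a positive value, which is why lower is better.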
Qwen3.6-27B
gemma-4-E2B-it
mlx-community quants are standard mlx_lm.convert affine quantization. oQ quants are affine with per-tensor bit boosts via oMLX, a local MLX inference server.
Performance: MLX kquant vs llama.cpp Metal
All benchmarks on Apple M5 Max (40-core GPU, 128 GB, 14"), median of trailing samples (first run dropped), no warmup, flash-attention disabled on both sides (-fa 0). Bit-exact same weights used by both engines.
MLX kquant kernels are 1.15-1.88x faster on prefill (up to 3.47x on small compute-bound models) and 1.12-1.77x faster on decode than llama.cpp Metal, enabling inference from K-quant recipes that deliver 2-3x better quantization quality than MLX-native affine at comparable bits-per-weight.
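The aggregation rule stated above (drop the first run, take the median of the rest) can be sketched as follows. `median_of_trailing` is a hypothetical helper for illustration; it is not taken from the benchmarking tools, and it makes no assumption about whether the samples are timings or throughput figures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Drops the first sample (the cold run) and returns the median of the rest.
// Takes the vector by value so the caller's data is left untouched.
inline double median_of_trailing(std::vector<double> samples) {
    assert(samples.size() >= 2);      // need at least one sample after the drop
    samples.erase(samples.begin());   // discard the first run
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    if (n % 2 == 1) {
        return samples[n / 2];
    }
    return 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}
```

Dropping the first run stands in for a warmup pass, and the median keeps a single outlier among the remaining runs from skewing the reported figure.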
Integration testing
This PR covers the core framework (kernel + dispatch + C++ plumbing). Downstream integration has been extensively validated against mlx-lm and mlx-vlm forks that are ready to PR once this lands:
Model families tested: Qwen3, Qwen3.5, Qwen3.5-MoE, Gemma 4 (2B, 4B), Mistral 3, Nemotron-Cascade, Nemotron-H-MoE.
Compatibility
No changes to existing affine or FP quantization paths; the new enum value and mode string are purely additive. The CPU backend is fully functional (CI / Linux). The CUDA backend throws a not-yet-implemented (NYI) error.
Test plan
Follow-up PRs (not in this PR)
Reproduction
Benchmarking and KLD evaluation tools:
https://github.com/asher/mlx-quant-tools