Add KQuant quantization mode by asher · Pull Request #3588 · ml-explore/mlx

asher · 2026-05-25T19:58:47Z

Proposed changes

Add QuantizationMode::KQuant, a new quantized matmul family that reads K-quant wire-format data directly in Metal and CPU kernels. This enables inference from K-quant weights without any Python-side layout transform. Supported codecs: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0.
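For orientation, the block geometry of the listed codecs can be tabulated as below. The sizes are those of the GGML reference block layouts. That this PR's wire format uses exactly these sizes is an assumption based on the statement that the kernels read the wire format directly. `GGML_BLOCKS` and `bits_per_weight` are illustrative names, not part of MLX.

```python
# (weights per block, bytes per block) for the GGML reference layouts.
# Assumption: the kquant wire format in this PR matches these sizes.
GGML_BLOCKS = {
    "Q4_0": (32, 18),    # fp16 scale + 16 bytes of packed 4-bit values
    "Q4_1": (32, 20),    # fp16 scale + fp16 min + 16 bytes
    "Q5_0": (32, 22),    # fp16 scale + 4 bytes of high bits + 16 bytes
    "Q5_1": (32, 24),    # fp16 scale + fp16 min + 4 bytes + 16 bytes
    "Q8_0": (32, 34),    # fp16 scale + 32 int8 values
    "Q2_K": (256, 84),
    "Q3_K": (256, 110),
    "Q4_K": (256, 144),
    "Q5_K": (256, 176),
    "Q6_K": (256, 210),
}


def bits_per_weight(codec: str) -> float:
    """Storage cost of a single codec in bits per weight."""
    weights, nbytes = GGML_BLOCKS[codec]
    return 8 * nbytes / weights


for name in GGML_BLOCKS:
    print(f"{name}: {bits_per_weight(name):.4f} bpw")
```

A recipe mixes several codecs across tensors, so its model-level bpw generally differs from the per-codec figure printed here.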

Motivation

K-quant formats are block quantization schemes with hierarchical sub-block scales, originally developed in the GGML ecosystem and now the dominant format for on-device LLM inference. They use importance-matrix calibration and mixed-precision codec assignment (different layers get different codecs based on sensitivity) to achieve significantly better quality than uniform affine quantization at comparable bits-per-weight.

At 4-5 bpw, K-quant recipes deliver roughly 2-3x lower mean KLD than the best available MLX-native affine recipes. No MLX-native quantization method currently matches K-quant quality at K-quant byte budgets.

Design

Follows the same architectural pattern as the existing mxfp4/mxfp8/nvfp4 family (fp_quantized.h): per-codec loader structs handle wire-format decoding, while shared _impl template functions (kq_qmm_t_impl, kq_qmm_n_impl, kq_qmv_fast_impl, etc.) implement the matmul algorithms. Each codec defines a KqBlockLoader with load_unsafe()/load_safe()/next() that decodes packed bytes directly in Metal with no Python-side layout transform or preprocessing step. Per codec, six kernel variants are instantiated across three dtypes (float32, float16, bfloat16): qmv_fast, qmv, qvm, qmm_t, qmm_n, and gather_qmm. C++ dispatch constructs the kernel name from the codec string (e.g. kq_q4_k_qmm_t_float16_t). Adding a new codec requires only one loader struct, one line in the codec registry, and instantiation macro calls.
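The name-based dispatch described above can be sketched in a few lines. The pattern is inferred from the single example name in this description (kq_q4_k_qmm_t_float16_t). The type suffixes for float32 and bfloat16 are not stated here, so the sketch takes the suffix as a caller-supplied argument. `kernel_name` and `instantiations_per_codec` are hypothetical helpers, not functions from the PR.

```python
from itertools import product

# The six kernel variants and three dtypes named in the design section.
VARIANTS = ("qmv_fast", "qmv", "qvm", "qmm_t", "qmm_n", "gather_qmm")
DTYPES = ("float32", "float16", "bfloat16")


def kernel_name(codec: str, variant: str, type_suffix: str) -> str:
    """Build a kernel name following the pattern of the one documented example.

    type_suffix is passed in by the caller because only "float16_t" is
    confirmed by the PR text; other suffixes would be guesses.
    """
    return f"kq_{codec.lower()}_{variant}_{type_suffix}"


def instantiations_per_codec() -> int:
    """Number of (variant, dtype) combinations instantiated for one codec."""
    return len(list(product(VARIANTS, DTYPES)))


print(kernel_name("Q4_K", "qmm_t", "float16_t"))
print(instantiations_per_codec())
```

With six variants and three dtypes, each codec accounts for 18 instantiations, which is why adding a codec is mostly a matter of one loader plus macro calls.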

CPU backend includes full dequantize and matmul implementations for all codecs (for use in CI and as a reference path only).
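To illustrate what a reference dequantize-then-matmul path does, here is a minimal NumPy sketch for Q4_0, the simplest of the listed codecs. It follows the GGML reference layout: each 18-byte block holds a little-endian fp16 scale followed by 16 bytes, where the low nibbles carry elements 0-15 and the high nibbles carry elements 16-31. The function names are hypothetical, and this is not the PR's CPU implementation.

```python
import numpy as np

Q4_0_WEIGHTS = 32  # weights per block
Q4_0_BYTES = 18    # 2-byte fp16 scale + 16 bytes of packed nibbles


def dequantize_q4_0(raw: bytes) -> np.ndarray:
    """Decode a buffer of whole Q4_0 blocks into float32 weights."""
    buf = np.frombuffer(raw, dtype=np.uint8)
    if buf.size == 0 or buf.size % Q4_0_BYTES:
        raise ValueError("buffer must contain a whole number of Q4_0 blocks")
    blocks = buf.reshape(-1, Q4_0_BYTES)
    # Bytes 0-1 of each block: fp16 scale, widened to float32.
    scale = blocks[:, :2].copy().view("<f2").astype(np.float32)
    packed = blocks[:, 2:]
    # 4-bit values are stored with an offset of 8.
    low = (packed & 0x0F).astype(np.int8) - 8
    high = (packed >> 4).astype(np.int8) - 8
    q = np.concatenate([low, high], axis=1).astype(np.float32)
    return (q * scale).reshape(-1)


def reference_qmm_t(x: np.ndarray, raw_weights: bytes, out_features: int) -> np.ndarray:
    """Compute x @ W.T with W dequantized from Q4_0 (rows are output features)."""
    w = dequantize_q4_0(raw_weights).reshape(out_features, -1)
    return x.astype(np.float32) @ w.T
```

A path like this is slow but easy to audit, which suits a CI reference.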

Quality: Mean KLD vs MLX affine quantization

Mean per-token KL divergence from unquantized bf16 teacher (lower = better).

Protocol: wikitext-103-raw-v1, 512 samples x 512 tokens, scored on positions [256, 512), top-K=32768 reconstruction. Evaluation tools: https://github.com/asher/mlx-quant-tools
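The metric itself can be sketched as follows: per-token KL(teacher || student) in nats, averaged over the scored positions. This is a simplified illustration, not the code from the linked evaluation tools. It assumes full-vocabulary logits for both models and omits the top-K reconstruction step, and the KL direction shown is an assumption consistent with "divergence from the teacher".

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kld(teacher_logits, student_logits, start=256, stop=512) -> float:
    """Mean KL(teacher || student) in nats over positions [start, stop).

    Inputs have shape (..., positions, vocab).
    """
    t = log_softmax(np.asarray(teacher_logits, dtype=np.float64)[..., start:stop, :])
    s = log_softmax(np.asarray(student_logits, dtype=np.float64)[..., start:stop, :])
    per_token = (np.exp(t) * (t - s)).sum(axis=-1)
    return float(per_token.mean())
```

Scoring only the second half of each 512-token sample means every scored token has at least 256 tokens of context.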

Qwen3.6-27B

Recipe	bpw	Mean KLD (nats)	vs MLX-native
mlx-community 4-bit	4.69	0.0577	baseline
q4_k_m (imat)	4.88	0.0208	2.8x better at +0.2 bpw
mlx-community 5-bit	5.68	0.0214	baseline
q5_k_m (imat)	5.73	0.0096	2.2x better at same bpw
q6_k_xl (imat)	7.44	0.0047	--

gemma-4-E2B-it

Recipe	bpw	Mean KLD (nats)	vs MLX-native
oQ4	5.76	1.2254	baseline
q4_k_m (imat)	5.77	0.6657	1.8x better at same bpw
oQ5	6.67	0.4979	baseline
q5_k_xl (imat)	6.67	0.2316	2.2x better at same bpw
oQ6	7.58	0.2322	baseline
q6_k_xl (imat)	7.83	0.0890	2.6x better at +0.3 bpw

mlx-community quants are standard mlx_lm.convert affine quantization.
oQ quants are affine with per-tensor bit boosts via oMLX, a local MLX inference server.

Performance: MLX kquant vs llama.cpp Metal

All benchmarks were run on an Apple M5 Max (40-core GPU, 128 GB, 14"). Each figure is the median of trailing samples (first run dropped), with no warmup and flash-attention disabled on both sides (-fa 0). Both engines use bit-identical weights.

Model	Params	Prefill speedup (MLX / llama.cpp t/s)	Decode speedup (MLX / llama.cpp t/s)
gemma-4-E2B-it Q6_K_XL	2B	3.47x (17170 / 4947 t/s)	1.23x (149 / 121 t/s)
gemma-4-E4B-it Q6_K	4B	1.88x (5605 / 2976 t/s)	1.12x (86 / 77 t/s)
Nemotron-Cascade-8B Q4_K_L	8B	1.52x (2307 / 1513 t/s)	1.28x (91 / 71 t/s)
Ministral-3-14B Q4_K_L	14B	1.15x (1316 / 1140 t/s)	1.18x (56 / 47 t/s)
Qwen3.6-27B Q4_K	27B	1.18x (631 / 532 t/s)	1.29x (31 / 24 t/s)
Qwen3.6-35B-A3B Q4_K_XL	35B MoE	1.37x (2691 / 1957 t/s)	1.24x (101 / 82 t/s)
Nemotron-3-Super-120B-A12B Q5_K_S	120B MoE	1.44x (785 / 546 t/s)	1.77x (49 / 28 t/s)

MLX kquant kernels are 1.15-1.88x faster than llama.cpp Metal on prefill (up to 3.47x on small compute-bound models) and 1.12-1.77x faster on decode. This enables inference from K-quant recipes that deliver roughly 2-3x lower mean KLD than MLX-native affine quantization at comparable bits-per-weight.
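The speedup figures in the table follow from the stated aggregation: drop the first run, take the median of the rest, and divide the MLX throughput by the llama.cpp throughput. A minimal sketch, with hypothetical helper names and made-up per-run samples (only the medians are reported above):

```python
from statistics import median


def trailing_median(samples):
    """Median of all samples after the first, which is dropped."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to drop the first run")
    return median(samples[1:])


def speedup(mlx_tps, llama_cpp_tps):
    """Ratio of trailing-median throughputs (tokens per second)."""
    return trailing_median(mlx_tps) / trailing_median(llama_cpp_tps)


# Illustrative per-run samples whose trailing medians match the first table row.
mlx_runs = [15000.0, 17100.0, 17170.0, 17200.0]
llama_runs = [4500.0, 4940.0, 4947.0, 4950.0]
print(f"{speedup(mlx_runs, llama_runs):.2f}x")
```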

Integration testing

This PR covers the core framework (kernel + dispatch + C++ plumbing). Downstream integration has been extensively validated against mlx-lm and mlx-vlm forks that are ready to PR once this lands:

Quantization: created MLX-native kquant checkpoints using K-quant recipes with importance-matrix calibration; verified bit-exact output match against reference dequantization
KLD scoring: measured mean KLD on MLX-produced kquant checkpoints (results above); confirmed identical KLD whether loaded from GGUF or from MLX-native safetensors
LoRA fine-tuning: trained and evaluated LoRA adapters on kquant base models via mlx-lm
Long-context chat: multi-turn sessions across Qwen3, Qwen3.5, Gemma 4, Mistral 3, and Nemotron architectures (dense + MoE)
End-to-end serving: validated via oMLX (a local MLX inference server) with the full fork stack
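A bit-exact check, as mentioned in the list above, is stricter than a tolerance-based comparison. The sketch below shows one way to express it in NumPy. `bit_exact` is a hypothetical helper, not a function from the PR or from mlx-lm.

```python
import numpy as np


def bit_exact(a: np.ndarray, b: np.ndarray) -> bool:
    """True only if both arrays have the same dtype, shape, and raw bytes."""
    if a.dtype != b.dtype or a.shape != b.shape:
        return False
    return np.ascontiguousarray(a).tobytes() == np.ascontiguousarray(b).tobytes()


pos = np.array([0.0], dtype=np.float32)
neg = np.array([-0.0], dtype=np.float32)
print(np.allclose(pos, neg))  # tolerance-based: treats +0.0 and -0.0 as equal
print(bit_exact(pos, neg))    # byte-level: the sign bit differs
```

Comparing raw bytes catches differences such as signed zeros that a tolerance-based comparison would let through.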

Model families tested: Qwen3, Qwen3.5, Qwen3.5-MoE, Gemma 4 (2B, 4B), Mistral 3, Nemotron-Cascade, Nemotron-H-MoE.

Compatibility

No changes to existing affine or FP quantization paths; the new enum value and mode string are purely additive. The CPU backend is fully functional (CI / Linux). The CUDA backend throws a not-yet-implemented (NYI) error.

Test plan

python/tests/test_kquant.py: unit tests for all 10 codecs x 3 dtypes, including partial-block rejection and edge cases
Bit-exact correctness vs reference dequantization
Performance regression: all codecs faster than llama.cpp Metal on M-series
Quality regression: KLD matches expected values from reference implementation
CI (existing test suite passes)
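Partial-block rejection, mentioned in the first item above, amounts to refusing a reduction dimension that is not a whole number of blocks. The sketch below shows the idea using block sizes of 256 weights for the K codecs and 32 for the others, as in the GGML reference layouts. The helper name and the choice of ValueError are assumptions, not taken from test_kquant.py.

```python
def blocks_per_row(k: int, block_weights: int) -> int:
    """Return the number of blocks in a row of k weights, rejecting partial blocks."""
    if block_weights <= 0:
        raise ValueError("block_weights must be positive")
    if k <= 0 or k % block_weights != 0:
        raise ValueError(
            f"row length {k} is not a positive multiple of the block size {block_weights}"
        )
    return k // block_weights


print(blocks_per_row(4096, 256))  # a K-codec row split into 256-weight blocks
print(blocks_per_row(4096, 32))   # the same row split into 32-weight blocks
```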

Follow-up PRs (not in this PR)

mlx-lm: QuantizedLinear(mode="kquant") support, kquant checkpoint save/load, quantization CLI (branch)

Reproduction

Benchmarking and KLD evaluation tools:
https://github.com/asher/mlx-quant-tools

Add KQuant quantization mode with 10 codec encode/decode kernels

83703c6

asher wants to merge 1 commit into ml-explore:main from asher:kquant
