
Add KQuant quantization mode #3588

Open: asher wants to merge 1 commit into ml-explore:main from asher:kquant

asher commented May 25, 2026

Proposed changes

Add QuantizationMode::KQuant, a new quantized matmul family that reads K-quant wire-format data directly in Metal and CPU kernels, enabling inference from K-quant quantized weights without any Python-side layout transform. Supported codecs: Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0.

Motivation

K-quant formats are block quantization schemes with hierarchical sub-block scales, originally developed in the GGML ecosystem and now the dominant format for on-device LLM inference. They use importance-matrix calibration and mixed-precision codec assignment (different layers get different codecs based on sensitivity) to achieve significantly better quality than uniform affine quantization at comparable bits-per-weight.

At 4-5 bpw, K-quant recipes deliver 2-3x lower mean KLD vs the best available MLX-native affine recipes. There is currently no MLX-native quantization method that matches K-quant quality at K-quant byte budgets.

Design

Follows the same architectural pattern as the existing mxfp4/mxfp8/nvfp4 family (fp_quantized.h): per-codec loader structs handle wire-format decoding, while shared _impl template functions (kq_qmm_t_impl, kq_qmm_n_impl, kq_qmv_fast_impl, etc.) implement the matmul algorithms. Each codec defines a KqBlockLoader with load_unsafe()/load_safe()/next() that decodes packed bytes directly in Metal with no Python-side layout transform or preprocessing step. Per codec, six kernel variants are instantiated across three dtypes (float32, float16, bfloat16): qmv_fast, qmv, qvm, qmm_t, qmm_n, and gather_qmm. C++ dispatch constructs the kernel name from the codec string (e.g. kq_q4_k_qmm_t_float16_t). Adding a new codec requires only one loader struct, one line in the codec registry, and instantiation macro calls.
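
The naming scheme described above can be sketched in a few lines of Python. This is only an illustration, not the PR's C++ dispatch code: `kernel_name`, `CODECS`, `VARIANTS`, and `DTYPE_SUFFIX` are hypothetical names, and only the `float16_t` suffix is confirmed by the example in the text; the `float` and `bfloat16_t` suffixes are assumptions.

```python
# Hypothetical sketch of the kernel-name scheme; not the PR's C++ code.
CODECS = ("q2_k", "q3_k", "q4_k", "q5_k", "q6_k",
          "q4_0", "q4_1", "q5_0", "q5_1", "q8_0")
VARIANTS = ("qmv_fast", "qmv", "qvm", "qmm_t", "qmm_n", "gather_qmm")

# Only "float16_t" appears in the PR text; the other two suffixes are assumed.
DTYPE_SUFFIX = {
    "float32": "float",
    "float16": "float16_t",
    "bfloat16": "bfloat16_t",
}


def kernel_name(codec, variant, dtype):
    """Build a kernel name of the form kq_<codec>_<variant>_<dtype suffix>."""
    codec = codec.lower()
    if codec not in CODECS:
        raise ValueError(f"unknown codec: {codec}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown kernel variant: {variant}")
    if dtype not in DTYPE_SUFFIX:
        raise ValueError(f"unsupported dtype: {dtype}")
    return f"kq_{codec}_{variant}_{DTYPE_SUFFIX[dtype]}"


# Every codec x variant x dtype combination gets its own name.
ALL_KERNELS = [kernel_name(c, v, d)
               for c in CODECS for v in VARIANTS for d in DTYPE_SUFFIX]
```

Under these assumptions, `kernel_name("Q4_K", "qmm_t", "float16")` yields the example name from the text, and the ten codecs, six variants, and three dtypes give 180 distinct names.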

CPU backend includes full dequantize and matmul implementations for all codecs (for use in CI and as a reference path only).
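
To make "reference dequantization" concrete, here is a minimal pure-Python sketch for the simplest codec, Q4_0. It assumes the standard GGML block layout (32 weights per block: one little-endian fp16 scale followed by 16 bytes of packed 4-bit values, low nibbles first, with an offset of 8). The function name `dequantize_q4_0` is hypothetical, and the PR's CPU path may be organized quite differently.

```python
import struct

QK4_0 = 32              # weights per Q4_0 block
Q4_0_BLOCK_BYTES = 18   # 2-byte fp16 scale + 16 bytes of packed nibbles


def dequantize_q4_0(data):
    """Decode Q4_0 blocks (assumed GGML layout) into a flat list of floats."""
    if len(data) % Q4_0_BLOCK_BYTES != 0:
        raise ValueError("input ends in a partial Q4_0 block")
    out = []
    for off in range(0, len(data), Q4_0_BLOCK_BYTES):
        # Per-block scale, stored as little-endian IEEE half precision.
        (d,) = struct.unpack_from("<e", data, off)
        qs = data[off + 2: off + Q4_0_BLOCK_BYTES]
        # Low nibbles hold weights 0..15, high nibbles hold weights 16..31.
        low = [(b & 0x0F) - 8 for b in qs]
        high = [(b >> 4) - 8 for b in qs]
        out.extend(d * q for q in low + high)
    return out
```

The K-quant codecs add a second level of sub-block scales on top of this idea, so their decoders are longer, but the per-block structure is the same.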

Quality: Mean KLD vs MLX affine quantization

Mean per-token KL divergence from unquantized bf16 teacher (lower = better).

Protocol: wikitext-103-raw-v1, 512 samples x 512 tokens, scored on positions [256, 512), top-K=32768 reconstruction. Evaluation tools: https://github.com/asher/mlx-quant-tools
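
For readers unfamiliar with the metric, the following is a minimal sketch of mean per-token KL divergence over a scored window, assuming full-vocabulary logits are available for both models. It does not reproduce the top-K=32768 reconstruction step, and the function names are hypothetical; the linked evaluation tools are the authoritative implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kld(teacher_logits, student_logits):
    """KL(teacher || student) in nats for a single token position."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def mean_kld(teacher, student, start=256, end=512):
    """Average token_kld over positions [start, end) of one sample.

    teacher and student are lists of per-position logit lists.
    """
    scores = [token_kld(t, s)
              for t, s in zip(teacher[start:end], student[start:end])]
    return sum(scores) / len(scores)
```

Scoring only the second half of each 512-token sample means every scored position has at least 256 tokens of context.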

Qwen3.6-27B

| Recipe | bpw | Mean KLD (nats) | vs MLX-native |
|---|---|---|---|
| mlx-community 4-bit | 4.69 | 0.0577 | baseline |
| q4_k_m (imat) | 4.88 | 0.0208 | 2.8x better at +0.2 bpw |
| mlx-community 5-bit | 5.68 | 0.0214 | baseline |
| q5_k_m (imat) | 5.73 | 0.0096 | 2.2x better at same bpw |
| q6_k_xl (imat) | 7.44 | 0.0047 | -- |

gemma-4-E2B-it

| Recipe | bpw | Mean KLD (nats) | vs MLX-native |
|---|---|---|---|
| oQ4 | 5.76 | 1.2254 | baseline |
| q4_k_m (imat) | 5.77 | 0.6657 | 1.8x better at same bpw |
| oQ5 | 6.67 | 0.4979 | baseline |
| q5_k_xl (imat) | 6.67 | 0.2316 | 2.2x better at same bpw |
| oQ6 | 7.58 | 0.2322 | baseline |
| q6_k_xl (imat) | 7.83 | 0.0890 | 2.6x better at +0.3 bpw |

mlx-community quants are standard mlx_lm.convert affine quantization.
oQ quants are affine with per-tensor bit boosts via oMLX, a local MLX inference server.

Performance: MLX kquant vs llama.cpp Metal

All benchmarks were run on an Apple M5 Max (40-core GPU, 128 GB, 14"), reporting the median of trailing samples (first run dropped), with no warmup and flash-attention disabled on both sides (-fa 0). Both engines use bit-identical weights.

| Model | Params | Prefill speedup (MLX / llama.cpp t/s) | Decode speedup (MLX / llama.cpp t/s) |
|---|---|---|---|
| gemma-4-E2B-it Q6_K_XL | 2B | 3.47x (17170 / 4947) | 1.23x (149 / 121) |
| gemma-4-E4B-it Q6_K | 4B | 1.88x (5605 / 2976) | 1.12x (86 / 77) |
| Nemotron-Cascade-8B Q4_K_L | 8B | 1.52x (2307 / 1513) | 1.28x (91 / 71) |
| Ministral-3-14B Q4_K_L | 14B | 1.15x (1316 / 1140) | 1.18x (56 / 47) |
| Qwen3.6-27B Q4_K | 27B | 1.18x (631 / 532) | 1.29x (31 / 24) |
| Qwen3.6-35B-A3B Q4_K_XL | 35B MoE | 1.37x (2691 / 1957) | 1.24x (101 / 82) |
| Nemotron-3-Super-120B-A12B Q5_K_S | 120B MoE | 1.44x (785 / 546) | 1.77x (49 / 28) |

MLX kquant kernels are 1.15-1.88x faster on prefill (up to 3.47x on small compute-bound models) and 1.12-1.77x faster on decode than llama.cpp Metal. This enables inference from K-quant recipes that deliver roughly 2-3x lower mean KLD (1.8-2.8x in the quality tables above) than MLX-native affine quantization at comparable bits-per-weight.
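
The aggregation described in the benchmark protocol (drop the first run, take the median of the rest, then compare engines) can be sketched as follows. The helper names are hypothetical and the sample lists are placeholders for measured tokens-per-second values; the actual benchmark scripts live in the linked tools repository.

```python
import statistics


def trailing_median(samples_tps, drop=1):
    """Median throughput after discarding the first `drop` runs."""
    if len(samples_tps) <= drop:
        raise ValueError("need at least one sample after dropping the first run")
    return statistics.median(samples_tps[drop:])


def speedup(mlx_samples_tps, llama_samples_tps):
    """Ratio of MLX to llama.cpp median throughput (>1 means MLX is faster)."""
    return trailing_median(mlx_samples_tps) / trailing_median(llama_samples_tps)
```

Dropping the first run stands in for a warmup pass, since it typically absorbs one-time costs such as kernel compilation.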

Integration testing

This PR covers the core framework (kernel + dispatch + C++ plumbing). Downstream integration has been extensively validated against mlx-lm and mlx-vlm forks that are ready to PR once this lands:

  • Quantization: created MLX-native kquant checkpoints using K-quant recipes with importance-matrix calibration; verified bit-exact output match against reference dequantization
  • KLD scoring: measured mean KLD on MLX-produced kquant checkpoints (results above); confirmed identical KLD whether loaded from GGUF or from MLX-native safetensors
  • LoRA fine-tuning: trained and evaluated LoRA adapters on kquant base models via mlx-lm
  • Long-context chat: multi-turn sessions across Qwen3, Qwen3.5, Gemma 4, Mistral 3, and Nemotron architectures (dense + MoE)
  • End-to-end serving: validated via oMLX (a local MLX inference server) with the full fork stack

Model families tested: Qwen3, Qwen3.5, Qwen3.5-MoE, Gemma 4 (2B, 4B), Mistral 3, Nemotron-Cascade, Nemotron-H-MoE.

Compatibility

No changes to existing affine or FP quantization paths; the new enum value and mode string are purely additive. The CPU backend is fully functional (CI / Linux). The CUDA backend is not yet implemented and throws an NYI error.

Test plan

  • python/tests/test_kquant.py — unit tests for all 5 codecs x 3 dtypes, including partial-block rejection and edge cases
  • Bit-exact correctness vs reference dequantization
  • Performance regression: all codecs faster than llama.cpp Metal on M-series
  • Quality regression: KLD matches expected values from reference implementation
  • CI (existing test suite passes)
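
As an illustration of what partial-block rejection involves, the sketch below checks a packed weight row against per-codec block geometry. The table lists weights-per-block and bytes-per-block for the standard GGML layouts as I understand them; the PR's codec registry may be structured differently, and `KQ_BLOCKS`, `check_packed_row`, and `bits_per_weight` are hypothetical names.

```python
# codec -> (weights per block, bytes per block), assumed from GGML layouts.
KQ_BLOCKS = {
    "q4_0": (32, 18), "q4_1": (32, 20),
    "q5_0": (32, 22), "q5_1": (32, 24),
    "q8_0": (32, 34),
    "q2_k": (256, 84), "q3_k": (256, 110),
    "q4_k": (256, 144), "q5_k": (256, 176),
    "q6_k": (256, 210),
}


def check_packed_row(codec, in_features, row_bytes):
    """Validate one packed weight row and return its expected byte size."""
    weights, nbytes = KQ_BLOCKS[codec]
    if in_features % weights != 0:
        raise ValueError(
            f"{codec}: {in_features} weights do not fill whole "
            f"{weights}-weight blocks")
    expected = (in_features // weights) * nbytes
    if row_bytes != expected:
        raise ValueError(
            f"{codec}: expected {expected} bytes per row, got {row_bytes}")
    return expected


def bits_per_weight(codec):
    """Storage cost of the codec alone, ignoring mixed-codec recipes."""
    weights, nbytes = KQ_BLOCKS[codec]
    return nbytes * 8 / weights
```

For example, a 4096-wide Q4_K row occupies 16 blocks of 144 bytes, and `bits_per_weight("q4_k")` comes out at 4.5. Recipe-level bpw figures such as those in the quality tables are higher because recipes mix codecs across layers.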

Follow-up PRs (not in this PR)

  • mlx-lm: QuantizedLinear(mode="kquant") support, kquant checkpoint save/load, quantization CLI (branch)

Reproduction

Benchmarking and KLD evaluation tools:
https://github.com/asher/mlx-quant-tools
