feat: fused Metal q4 inference for MLX 4-bit models #82

Open
cjchanh wants to merge 3 commits into evilsocket:main from cjchanh:q4-metal-patchset

Conversation

cjchanh commented Apr 13, 2026

What

This patch enables Cake to serve MLX 4-bit quantized models that upstream main cannot load. On M5 Max, Qwen2.5-7B-Instruct-4bit loaded at 9.5 GiB and generated 10 tokens at 56.71 tok/s in a bounded API run.

How

  • q4_matvec_f16 and q4_matmul_tiled_f16 MSL kernels that read packed 4-bit weights and dequantize on the fly during matmul
  • QuantizedLinear layer type that stores packed U32 weights + F16 scales/biases on Metal without expansion
  • MetalMlxBackend VarBuilder that auto-detects MLX 4-bit format and keeps weights packed
  • MLP and Attention layers use polymorphic LinearWeight (Dense/Quantized)
  • Non-quantized tensors (embeddings, norms, lm_head) fall through to standard F16 dequantization
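The dequantization step the fused kernels fold into the matmul can be sketched on the CPU. This is a reference sketch, not the shipped MSL kernel: the group size (64), the least-significant-nibble-first packing order, and the affine mapping `w = scale * q + bias` are assumptions about the MLX 4-bit layout, and `dequant_row_q4` is a hypothetical helper name.

```rust
// CPU reference for the dequantize-on-the-fly step the Metal kernels fuse
// into the matmul: 8 nibbles per packed u32 word, with a per-group scale
// and bias applied as w = scale * q + bias. Group size and nibble order
// are illustrative assumptions.
const GROUP_SIZE: usize = 64;

fn dequant_row_q4(packed: &[u32], scales: &[f32], biases: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for (i, &word) in packed.iter().enumerate() {
        for nibble in 0..8 {
            // Extract the 4-bit quantized value, least-significant nibble first.
            let q = ((word >> (4 * nibble)) & 0xF) as f32;
            let group = (i * 8 + nibble) / GROUP_SIZE;
            out.push(scales[group] * q + biases[group]);
        }
    }
    out
}

fn main() {
    // One group of 64 weights: 8 u32 words, scale 0.5, bias -1.0.
    let packed = vec![0x7654_3210u32; 8];
    let w = dequant_row_q4(&packed, &[0.5], &[-1.0]);
    assert_eq!(w.len(), 64);
    assert_eq!(w[0], -1.0); // nibble 0 is 0 -> 0.5 * 0 - 1.0
    assert_eq!(w[7], 2.5);  // nibble 7 is 7 -> 0.5 * 7 - 1.0
    println!("ok");
}
```

The fused kernels never materialize the `out` buffer; folding this unpacking into the matmul inner loop is what keeps the packed weights resident on Metal without expansion.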

Tested

Tested on Apple Silicon (M5 Max, M1 Air) and iPad Air M3; benchmark details are in the follow-up comment.
Files

  • cake-core/src/backends/metal/ops.msl — MSL kernels
  • cake-core/src/backends/metal/mod.rs — kernel dispatch + CustomOp
  • cake-core/src/backends/mod.rs — trait method
  • cake-core/src/utils/quantized_linear.rs — QuantizedWeight + LinearWeight
  • cake-core/src/utils/mlx_quant.rs — MLX detection
  • cake-core/src/utils/gptq.rs — MetalMlxBackend
  • cake-core/src/utils/mod.rs — auto-detection wiring
  • cake-core/src/models/common/mlp.rs — quantized MLP
  • cake-core/src/models/common/attention.rs — quantized attention
  • cake-core/tests/unit_tests/test_quantization.rs — q4 validation tests

cjchanh commented Apr 14, 2026

Additional benchmark context for this PR.

Benchmark results

M5 Max (128 GB), single-device

  • model: mlx-community/Qwen2.5-7B-Instruct-4bit
  • throughput: 56.71 tok/s (10 tokens in a bounded API request)
  • loaded memory: 9.5 GiB
  • upstream main: cannot load this checkpoint (shape mismatch for model.embed_tokens.weight, expected [152064, 3584], got [152064, 448])
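The shape mismatch above is also what makes auto-detection possible: with 8 nibbles packed per u32 word, a quantized [rows, cols] matrix is stored as [rows, cols / 8], and 3584 / 8 = 448. A minimal sketch of such a check follows; the `Dtype` enum and the `is_mlx_q4` name are illustrative, not the crate's actual API.

```rust
// Shape-based heuristic for MLX 4-bit tensors: the stored width is 1/8 of
// the expected width because each u32 word packs eight 4-bit values.
// Dtype and is_mlx_q4 are illustrative names, not cake-core's real types.
#[derive(PartialEq)]
enum Dtype {
    U32,
    F16,
}

fn is_mlx_q4(dtype: &Dtype, stored_cols: usize, expected_cols: usize) -> bool {
    *dtype == Dtype::U32 && expected_cols % 8 == 0 && stored_cols * 8 == expected_cols
}

fn main() {
    // Qwen2.5-7B embed_tokens: expected [152064, 3584], stored [152064, 448].
    assert!(is_mlx_q4(&Dtype::U32, 448, 3584));
    // A plain f16 tensor of full width is not flagged.
    assert!(!is_mlx_q4(&Dtype::F16, 3584, 3584));
    println!("ok");
}
```

Upstream main fails here because it compares the stored shape against the config's full-width shape; a detection pass like this lets the loader route the tensor to the packed path instead.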

M5 Max + iPad Air M3, distributed (0.5B)

  • model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
  • iPad worker discovered via zero-config discovery
  • iPad assigned model.layers.0
  • throughput: 55.44 tok/s

M1 Air + iPad Air M3, distributed (0.5B)

  • model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
  • iPad worker assigned model.layers.0 via manual topology
  • throughput: 29.17 tok/s on a bounded request
  • master memory at generation: 854.7 MiB

Tested on Apple Silicon (M5 Max, M1 Air) and iPad Air M3.

This is the key distinction of the patch: it does not just optimize an existing path. It enables Cake to serve MLX 4-bit checkpoints that upstream main cannot load, and it does so at usable throughput on Apple Silicon.

cjchanh commented Apr 30, 2026

This PR is still relevant from my side, and I am still maintaining the branch.
I realize this is a larger review than the other pending patches because it includes the fused Metal q4 path and validation coverage.
If review size is the blocker, I can split it into smaller PRs around format detection, packed-weight storage, Metal kernels, and model integration.
I can also respond to review feedback on the current branch if that is easier.

cjchanh added 2 commits April 30, 2026 16:16
…set branch)

Adds 4 portfolio-wide patterns (release/, release/evidence/autopilot_*/,
__pycache__/, CRAFT_GATE_RESULT.json) to .gitignore. Mirrors the same
sweep applied on cake/main this session (commit 786006d). Branch-scope
to avoid cross-branch divergence.
…row resolution)

Mobile workers receiving a single-file `.safetensors` model previously
got the FULL file regardless of layer assignment. On 4 GiB single-file
models (Qwen2.5-7B-Instruct-4bit) this exceeded iPad jetsam budgets and
crashed with `early eof`. Same root cause as PR evilsocket#83 against cake/main,
but applied here on q4-metal-patchset (PR evilsocket#82's source branch) since
PR evilsocket#83's branch (`fix/single-file-layer-filter` at ee01115) has API
drift against current upstream and isn't cleanly rebasable.

Changes:
  * cake-core/src/utils/split.rs:
    - extract `reduce_for_layers(&Index, &[String])` from the worker-
      specific `reduce_for_worker` (more general, layer-list-driven)
    - introduce `ReducedModelBundle { index_json, safetensors }` for
      the reduced-bundle return type
    - add `build_reduced_single_file_bundle(model_path, layers)` that
      reads the safetensors header, filters tensors by layer prefixes,
      and emits a minimal safetensors blob + matching index.json

  * cake-core/src/cake/sharding/mod.rs:
    - replace the single-file fallback (which pushed the full model
      regardless of layer) with the reduced-bundle path
    - generalize `inline_files: HashMap<String, Vec<u8>>` so both the
      indexed and single-file paths can stream multiple inline blobs
      (index + reduced safetensors)
    - import `HashMap` (already had `HashSet`)
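The tensor selection at the core of the reduced-bundle path can be sketched as a name filter over the safetensors header. The function below is a hypothetical simplification of that step: a real implementation must also decide which shared tensors (embeddings, norms, lm_head) each worker still needs.

```rust
// Keep only tensors whose names start with one of the assigned layer
// prefixes. A sketch of the filtering inside the reduced-bundle build,
// not cake-core's actual function.
fn filter_tensor_names<'a>(names: &'a [&'a str], layers: &[&str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| layers.iter().any(|layer| name.starts_with(layer)))
        .collect()
}

fn main() {
    let names = [
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.1.self_attn.q_proj.weight",
        "model.embed_tokens.weight",
    ];
    // A worker assigned only model.layers.0 receives only that layer's tensors.
    let kept = filter_tensor_names(&names, &["model.layers.0."]);
    assert_eq!(kept, ["model.layers.0.self_attn.q_proj.weight"]);
    println!("ok");
}
```

Shipping only the matching tensors (plus a rewritten index.json) is what keeps a 4 GiB single-file checkpoint inside an iPad's jetsam budget.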

Test coverage and benchmark updates pair with this in the existing
q4-metal-patchset commits.

Closes spec 199's cake-q4-branch NEEDS_OPERATOR_DECISION row with
disposition: COMMIT (intentional q4 follow-up; preserves PR evilsocket#82
contribution path; commit stays local until operator authorizes
fork push).

Spec: 199-triage-dirty-trees-across-active-portfolio (cake-q4-branch row)
SOP: ~/Documents/Centennial/SOPs/CDS_Stuck_Spec_Triage_SOP_v1.md §3.A v1.1
Triage report: ~/ai/evidence/spec-096-triage-20260430/TRIAGE_REPORT.md