feat: fused Metal q4 inference for MLX 4-bit models #82

Open
cjchanh wants to merge 3 commits into evilsocket:main from cjchanh:q4-metal-patchset

Conversation

cjchanh commented Apr 13, 2026

What

This patch enables Cake to serve MLX 4-bit quantized models that upstream main cannot load. On M5 Max, Qwen2.5-7B-Instruct-4bit loaded at 9.5 GiB and generated 10 tokens at 56.71 tok/s in a bounded API run.

How

  • q4_matvec_f16 and q4_matmul_tiled_f16 MSL kernels that read packed 4-bit weights and dequantize on the fly during matmul
  • QuantizedLinear layer type that stores packed U32 weights + F16 scales/biases on Metal without expansion
  • MetalMlxBackend VarBuilder that auto-detects MLX 4-bit format and keeps weights packed
  • MLP and Attention layers use polymorphic LinearWeight (Dense/Quantized)
  • Non-quantized tensors (embeddings, norms, lm_head) fall through to standard F16 dequantization
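The dequantization step the fused kernels fold into the matmul can be sketched on the CPU. This is a reference sketch, not the shipped MSL kernel: the group size (64), the least-significant-nibble-first packing order, and the affine mapping `w = scale * q + bias` are assumptions about the MLX 4-bit layout, and `dequant_row_q4` is a hypothetical helper name.

```rust
// CPU reference for the dequantize-on-the-fly step the Metal kernels fuse
// into the matmul: 8 nibbles per packed u32 word, with a per-group scale
// and bias applied as w = scale * q + bias. Group size and nibble order
// are illustrative assumptions.
const GROUP_SIZE: usize = 64;

fn dequant_row_q4(packed: &[u32], scales: &[f32], biases: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for (i, &word) in packed.iter().enumerate() {
        for nibble in 0..8 {
            // Extract the 4-bit quantized value, least-significant nibble first.
            let q = ((word >> (4 * nibble)) & 0xF) as f32;
            let group = (i * 8 + nibble) / GROUP_SIZE;
            out.push(scales[group] * q + biases[group]);
        }
    }
    out
}

fn main() {
    // One group of 64 weights: 8 u32 words, scale 0.5, bias -1.0.
    let packed = vec![0x7654_3210u32; 8];
    let w = dequant_row_q4(&packed, &[0.5], &[-1.0]);
    assert_eq!(w.len(), 64);
    assert_eq!(w[0], -1.0); // nibble 0 is 0 -> 0.5 * 0 - 1.0
    assert_eq!(w[7], 2.5);  // nibble 7 is 7 -> 0.5 * 7 - 1.0
    println!("ok");
}
```

The fused kernels never materialize the `out` buffer; folding this unpacking into the matmul inner loop is what keeps the packed weights resident on Metal without expansion.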

Tested

Tested on Apple Silicon (M5 Max, M1 Air) and iPad Air M3; benchmark details are in the follow-up comment.
Files

  • cake-core/src/backends/metal/ops.msl — MSL kernels
  • cake-core/src/backends/metal/mod.rs — kernel dispatch + CustomOp
  • cake-core/src/backends/mod.rs — trait method
  • cake-core/src/utils/quantized_linear.rs — QuantizedWeight + LinearWeight
  • cake-core/src/utils/mlx_quant.rs — MLX detection
  • cake-core/src/utils/gptq.rs — MetalMlxBackend
  • cake-core/src/utils/mod.rs — auto-detection wiring
  • cake-core/src/models/common/mlp.rs — quantized MLP
  • cake-core/src/models/common/attention.rs — quantized attention
  • cake-core/tests/unit_tests/test_quantization.rs — q4 validation tests

cjchanh commented Apr 14, 2026

Additional benchmark context for this PR.

Benchmark results

M5 Max (128 GB), single-device

  • model: mlx-community/Qwen2.5-7B-Instruct-4bit
  • throughput: 56.71 tok/s (10 tokens in a bounded API request)
  • loaded memory: 9.5 GiB
  • upstream main: cannot load this checkpoint (shape mismatch for model.embed_tokens.weight, expected [152064, 3584], got [152064, 448])
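The shape mismatch above is also what makes auto-detection possible: with 8 nibbles packed per u32 word, a quantized [rows, cols] matrix is stored as [rows, cols / 8], and 3584 / 8 = 448. A minimal sketch of such a check follows; the `Dtype` enum and the `is_mlx_q4` name are illustrative, not the crate's actual API.

```rust
// Shape-based heuristic for MLX 4-bit tensors: the stored width is 1/8 of
// the expected width because each u32 word packs eight 4-bit values.
// Dtype and is_mlx_q4 are illustrative names, not cake-core's real types.
#[derive(PartialEq)]
enum Dtype {
    U32,
    F16,
}

fn is_mlx_q4(dtype: &Dtype, stored_cols: usize, expected_cols: usize) -> bool {
    *dtype == Dtype::U32 && expected_cols % 8 == 0 && stored_cols * 8 == expected_cols
}

fn main() {
    // Qwen2.5-7B embed_tokens: expected [152064, 3584], stored [152064, 448].
    assert!(is_mlx_q4(&Dtype::U32, 448, 3584));
    // A plain f16 tensor of full width is not flagged.
    assert!(!is_mlx_q4(&Dtype::F16, 3584, 3584));
    println!("ok");
}
```

Upstream main fails here because it compares the stored shape against the config's full-width shape; a detection pass like this lets the loader route the tensor to the packed path instead.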

M5 Max + iPad Air M3, distributed (0.5B)

  • model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
  • iPad worker discovered via zero-config discovery
  • iPad assigned model.layers.0
  • throughput: 55.44 tok/s

M1 Air + iPad Air M3, distributed (0.5B)

  • model: mlx-community/Qwen2.5-0.5B-Instruct-4bit
  • iPad worker assigned model.layers.0 via manual topology
  • throughput: 29.17 tok/s on a bounded request
  • master memory at generation: 854.7 MiB

Tested on Apple Silicon (M5 Max, M1 Air) and iPad Air M3.

This is the key distinction of the patch: it does not just optimize an existing path. It enables Cake to serve MLX 4-bit checkpoints that upstream main cannot load, and it does so at usable throughput on Apple Silicon.

cjchanh commented Apr 30, 2026

This PR is still relevant from my side, and I am still maintaining the branch.
I realize this is a larger review than the other pending patches because it includes the fused Metal q4 path and validation coverage.
If review size is the blocker, I can split it into smaller PRs around format detection, packed-weight storage, Metal kernels, and model integration.
I can also respond to review feedback on the current branch if that is easier.

cjchanh added 2 commits April 30, 2026 16:16
…set branch)

Adds 4 portfolio-wide patterns (release/, release/evidence/autopilot_*/,
__pycache__/, CRAFT_GATE_RESULT.json) to .gitignore. Mirrors the same
sweep applied on cake/main this session (commit 786006d). Branch-scope
to avoid cross-branch divergence.
…row resolution)

Mobile workers receiving a single-file `.safetensors` model previously
got the FULL file regardless of layer assignment. On 4 GiB single-file
models (Qwen2.5-7B-Instruct-4bit) this exceeded iPad jetsam budgets and
crashed with `early eof`. Same root cause as PR evilsocket#83 against cake/main,
but applied here on q4-metal-patchset (PR evilsocket#82's source branch) since
PR evilsocket#83's branch (`fix/single-file-layer-filter` at ee01115) has API
drift against current upstream and isn't cleanly rebasable.

Changes:
  * cake-core/src/utils/split.rs:
    - extract `reduce_for_layers(&Index, &[String])` from the worker-
      specific `reduce_for_worker` (more general, layer-list-driven)
    - introduce `ReducedModelBundle { index_json, safetensors }` for
      the reduced-bundle return type
    - add `build_reduced_single_file_bundle(model_path, layers)` that
      reads the safetensors header, filters tensors by layer prefixes,
      and emits a minimal safetensors blob + matching index.json

  * cake-core/src/cake/sharding/mod.rs:
    - replace the single-file fallback (which pushed the full model
      regardless of layer) with the reduced-bundle path
    - generalize `inline_files: HashMap<String, Vec<u8>>` so both the
      indexed and single-file paths can stream multiple inline blobs
      (index + reduced safetensors)
    - import `HashMap` (already had `HashSet`)
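The tensor selection at the core of the reduced-bundle path can be sketched as a name filter over the safetensors header. The function below is a hypothetical simplification of that step: a real implementation must also decide which shared tensors (embeddings, norms, lm_head) each worker still needs.

```rust
// Keep only tensors whose names start with one of the assigned layer
// prefixes. A sketch of the filtering inside the reduced-bundle build,
// not cake-core's actual function.
fn filter_tensor_names<'a>(names: &'a [&'a str], layers: &[&str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| layers.iter().any(|layer| name.starts_with(layer)))
        .collect()
}

fn main() {
    let names = [
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.1.self_attn.q_proj.weight",
        "model.embed_tokens.weight",
    ];
    // A worker assigned only model.layers.0 receives only that layer's tensors.
    let kept = filter_tensor_names(&names, &["model.layers.0."]);
    assert_eq!(kept, ["model.layers.0.self_attn.q_proj.weight"]);
    println!("ok");
}
```

Shipping only the matching tensors (plus a rewritten index.json) is what keeps a 4 GiB single-file checkpoint inside an iPad's jetsam budget.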

Test coverage and benchmark updates pair with this in the existing
q4-metal-patchset commits.

Closes spec 199's cake-q4-branch NEEDS_OPERATOR_DECISION row with
disposition: COMMIT (intentional q4 follow-up; preserves PR evilsocket#82
contribution path; commit stays local until operator authorizes
fork push).

Spec: 199-triage-dirty-trees-across-active-portfolio (cake-q4-branch row)
SOP: ~/Documents/Centennial/SOPs/CDS_Stuck_Spec_Triage_SOP_v1.md §3.A v1.1
Triage report: ~/ai/evidence/spec-096-triage-20260430/TRIAGE_REPORT.md