feat: fused Metal q4 inference for MLX 4-bit models #82
cjchanh wants to merge 3 commits into evilsocket:main from q4-metal-patchset
Conversation
Additional benchmark context for this PR. Benchmark results:
- M5 Max (128 GB), single-device
- M5 Max + iPad Air M3, distributed (0.5B)
- M1 Air + iPad Air M3, distributed (0.5B)
Tested on Apple Silicon. This is the key distinction for the patch: it does not just optimize an existing path. It enables Cake to serve MLX 4-bit checkpoints that upstream main cannot load.
This PR is still relevant from my side, and I am still maintaining the branch.
…set branch)

Adds 4 portfolio-wide patterns (release/, release/evidence/autopilot_*/, __pycache__/, CRAFT_GATE_RESULT.json) to .gitignore. Mirrors the same sweep applied on cake/main this session (commit 786006d). Branch-scoped to avoid cross-branch divergence.
…row resolution)

Mobile workers receiving a single-file `.safetensors` model previously got the FULL file regardless of layer assignment. On 4 GiB single-file models (Qwen2.5-7B-Instruct-4bit) this exceeded iPad jetsam budgets and crashed with `early eof`. Same root cause as PR evilsocket#83 against cake/main, but applied here on q4-metal-patchset (PR evilsocket#82's source branch) since PR evilsocket#83's branch (`fix/single-file-layer-filter` at ee01115) has API drift against current upstream and isn't cleanly rebasable.

Changes:
* cake-core/src/utils/split.rs:
  - extract `reduce_for_layers(&Index, &[String])` from the worker-specific `reduce_for_worker` (more general, layer-list-driven)
  - introduce `ReducedModelBundle { index_json, safetensors }` for the reduced-bundle return type
  - add `build_reduced_single_file_bundle(model_path, layers)` that reads the safetensors header, filters tensors by layer prefixes, and emits a minimal safetensors blob + matching index.json
* cake-core/src/cake/sharding/mod.rs:
  - replace the single-file fallback (which pushed the full model regardless of layer) with the reduced-bundle path
  - generalize `inline_files: HashMap<String, Vec<u8>>` so both the indexed and single-file paths can stream multiple inline blobs (index + reduced safetensors)
  - import `HashMap` (already had `HashSet`)

Test coverage and benchmark updates pair with this in the existing q4-metal-patchset commits.

Closes spec 199's cake-q4-branch NEEDS_OPERATOR_DECISION row with disposition: COMMIT (intentional q4 follow-up; preserves PR evilsocket#82 contribution path; commit stays local until operator authorizes fork push).

Spec: 199-triage-dirty-trees-across-active-portfolio (cake-q4-branch row)
SOP: ~/Documents/Centennial/SOPs/CDS_Stuck_Spec_Triage_SOP_v1.md §3.A v1.1
Triage report: ~/ai/evidence/spec-096-triage-20260430/TRIAGE_REPORT.md
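The core of the commit above is re-packing a single-file safetensors model down to only the tensors a worker's layers need. A minimal sketch of that idea, using the safetensors layout (8-byte little-endian header length, JSON header with `data_offsets`, then raw data); `TensorEntry` and `build_reduced_blob` are illustrative names, not the actual cake-core API:

```rust
// Hypothetical in-memory tensor record; the real code reads these
// from the safetensors header of the model file.
struct TensorEntry {
    name: String,
    dtype: &'static str,
    shape: Vec<usize>,
    data: Vec<u8>,
}

fn build_reduced_blob(tensors: &[TensorEntry], prefixes: &[&str]) -> Vec<u8> {
    // Keep only tensors belonging to the assigned layer prefixes.
    let kept: Vec<&TensorEntry> = tensors
        .iter()
        .filter(|t| prefixes.iter().any(|p| t.name.starts_with(p)))
        .collect();

    // Rebuild contiguous data_offsets for the reduced data section.
    let mut header = String::from("{");
    let mut offset = 0usize;
    for (i, t) in kept.iter().enumerate() {
        let end = offset + t.data.len();
        let shape: Vec<String> = t.shape.iter().map(|d| d.to_string()).collect();
        header.push_str(&format!(
            "{}\"{}\":{{\"dtype\":\"{}\",\"shape\":[{}],\"data_offsets\":[{},{}]}}",
            if i > 0 { "," } else { "" },
            t.name, t.dtype, shape.join(","), offset, end
        ));
        offset = end;
    }
    header.push('}');

    // safetensors layout: u64 LE header length, JSON header, then raw data.
    let mut blob = Vec::new();
    blob.extend_from_slice(&(header.len() as u64).to_le_bytes());
    blob.extend_from_slice(header.as_bytes());
    for t in &kept {
        blob.extend_from_slice(&t.data);
    }
    blob
}
```

A worker assigned only `model.layers.0` then receives a blob a fraction of the full file's size, which is what keeps the iPad under its jetsam budget.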
What
This patch enables Cake to serve MLX 4-bit quantized models that upstream main cannot load. On M5 Max, Qwen2.5-7B-Instruct-4bit loaded at 9.5 GiB and generated 10 tokens at 56.71 tok/s in a bounded API run.
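As a rough CPU reference for what the fused Metal kernels compute per weight: MLX 4-bit checkpoints commonly pack 8 weights per u32 with one scale and bias per group, dequantized as `w = s * q + b`. The nibble order (least-significant first) and group size here are assumptions to check against the checkpoint, not guaranteed by this PR:

```rust
// CPU sketch of on-the-fly q4 dequantization. Assumed layout: 8 weights
// per u32, least-significant nibble first, affine per-group params.
fn dequant_q4(packed: &[u32], scales: &[f32], biases: &[f32], group_size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for (i, &word) in packed.iter().enumerate() {
        for j in 0..8 {
            let q = ((word >> (4 * j)) & 0xF) as f32; // unpack one 4-bit weight
            let g = (i * 8 + j) / group_size;         // group index for scale/bias
            out.push(scales[g] * q + biases[g]);      // w = s * q + b (MLX affine)
        }
    }
    out
}
```

A fused matvec accumulates `dequant * x` inline, so the full f16 weight matrix is never materialized; that is why the 7B model fits in 9.5 GiB.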
How
- `q4_matvec_f16` and `q4_matmul_tiled_f16` MSL kernels that read packed 4-bit weights and dequantize on-the-fly during matmul
- `QuantizedLinear` layer type that stores packed U32 weights + F16 scales/biases on Metal without expansion
- `MetalMlxBackend` VarBuilder that auto-detects MLX 4-bit format and keeps weights packed
- `LinearWeight` (Dense/Quantized)
Tested
- `cargo test -p cake-core --features metal`
- `cargo clippy -p cake-core -p cake-cli --features metal`
Files
- `cake-core/src/backends/metal/ops.msl` — MSL kernels
- `cake-core/src/backends/metal/mod.rs` — kernel dispatch + CustomOp
- `cake-core/src/backends/mod.rs` — trait method
- `cake-core/src/utils/quantized_linear.rs` — QuantizedWeight + LinearWeight
- `cake-core/src/utils/mlx_quant.rs` — MLX detection
- `cake-core/src/utils/gptq.rs` — MetalMlxBackend
- `cake-core/src/utils/mod.rs` — auto-detection wiring
- `cake-core/src/models/common/mlp.rs` — quantized MLP
- `cake-core/src/models/common/attention.rs` — quantized attention
- `cake-core/tests/unit_tests/test_quantization.rs` — q4 validation tests
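The Dense/Quantized split mentioned under How can be pictured roughly as below. Field names and the dequant details are illustrative assumptions, not the actual cake-core definitions (in particular, a single scale/bias per output row is a simplification of per-group parameters):

```rust
// Sketch of a two-variant linear weight: dense layers keep a float matrix,
// quantized layers keep packed u32 words plus scales/biases and fuse
// dequantization into the dot product, mirroring the Metal kernels on-GPU.
enum LinearWeight {
    Dense { w: Vec<Vec<f32>> },
    Quantized { packed: Vec<Vec<u32>>, scales: Vec<f32>, biases: Vec<f32> },
}

impl LinearWeight {
    fn matvec(&self, x: &[f32]) -> Vec<f32> {
        match self {
            LinearWeight::Dense { w } => w
                .iter()
                .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
                .collect(),
            LinearWeight::Quantized { packed, scales, biases } => packed
                .iter()
                .enumerate()
                .map(|(r, row)| {
                    // Unpack 8 weights per u32 and dequantize inside the
                    // accumulation; no expanded weight matrix is ever built.
                    let mut acc = 0.0f32;
                    for (i, &word) in row.iter().enumerate() {
                        for j in 0..8 {
                            let q = ((word >> (4 * j)) & 0xF) as f32;
                            acc += (scales[r] * q + biases[r]) * x[i * 8 + j];
                        }
                    }
                    acc
                })
                .collect(),
        }
    }
}
```

Callers like the MLP and attention modules then dispatch on the enum rather than assuming a dense matrix, which is how the quantized path threads through without touching every call site.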