fix: filter single-file safetensors by assigned layers before push #83

Open
cjchanh wants to merge 1 commit into evilsocket:main from cjchanh:fix/single-file-layer-filter

cjchanh commented Apr 14, 2026

Problem

When a Cake master distributes a single-file safetensors model to a worker, it pushes the entire file regardless of how many layers the worker is assigned. For Qwen2.5-7B-Instruct-4bit (4 GiB single file), an iPad worker with a 3 GiB jetsam budget receives the full 4 GiB, exceeds memory, and crashes with early eof.

The indexed model path (model.safetensors.index.json present) already filters correctly via weight_map. The single-file fallback at sharding/mod.rs unconditionally adds model.safetensors to the push list.

Fix

For single-file models with assigned layers, the push path now:

  1. Reads only the safetensors header to enumerate tensor names
  2. Filters tensors by assigned layer prefixes (same starts_with logic as the indexed path)
  3. Calls extract_layer_tensors to build a minimal safetensors blob containing only the needed tensors
  4. Pushes the reduced blob instead of the full file

Backward compatible: if layers is empty (no specific assignment), the full file is still pushed. If no tensors match assigned layers, falls back to full push with a warning.
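
The header read and the push decision described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code from this PR: `read_header_json`, `PushPlan`, and `plan_push` are hypothetical names, and matching on the layer name plus a trailing `.` is an assumption made here so that `model.layers.1` does not also match `model.layers.10`. The only format fact relied on is that a safetensors file starts with an 8-byte little-endian header length followed by that many bytes of JSON.

```rust
use std::io::{self, Read};

/// Reads only the safetensors header: an 8-byte little-endian length
/// followed by that many bytes of JSON. Tensor data is never read.
fn read_header_json<R: Read>(mut reader: R) -> io::Result<String> {
    let mut len_bytes = [0u8; 8];
    reader.read_exact(&mut len_bytes)?;
    let header_len = u64::from_le_bytes(len_bytes) as usize;
    let mut header = vec![0u8; header_len];
    reader.read_exact(&mut header)?;
    String::from_utf8(header).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Hypothetical outcome of the push decision (not a type from the PR).
#[derive(Debug, PartialEq)]
enum PushPlan {
    /// Push the whole file (no assignment, or nothing matched).
    FullFile,
    /// Push a reduced blob containing only these tensors.
    Reduced(Vec<String>),
}

/// Decides what to push for a single-file model.
/// Assumption: a tensor belongs to a layer when its name starts with
/// the layer name followed by a dot.
fn plan_push(tensor_names: &[&str], layers: &[&str]) -> PushPlan {
    if layers.is_empty() {
        // No specific assignment: keep the old behavior.
        return PushPlan::FullFile;
    }
    let selected: Vec<String> = tensor_names
        .iter()
        .filter(|name| {
            layers
                .iter()
                .any(|layer| name.starts_with(&format!("{layer}.")))
        })
        .map(|name| name.to_string())
        .collect();
    if selected.is_empty() {
        eprintln!("warning: no tensors matched assigned layers; pushing full file");
        PushPlan::FullFile
    } else {
        PushPlan::Reduced(selected)
    }
}
```

In this sketch, tensor names would come from parsing the JSON returned by `read_header_json` (for example with a JSON library); that parsing step is left out to keep the example dependency-free.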

Results

Tested with M5 Max master + iPad Air M3 worker, Qwen2.5-7B-Instruct-4bit:

| Metric | Before | After |
|---|---|---|
| Push size | 4 GiB (full model) | 250.1 MiB (52 tensors, 2 layers) |
| iPad RSS | jetsam kill | 1.4 GiB (under 3 GiB limit) |
| Result | crash (early eof) | coherent output at 17.21 tok/s |

Test plan

  • cargo test -p cake-core --lib — 641 tests pass (638 existing + 3 new)
  • cargo test -p cake-core --test unit — 235 tests pass
  • cargo clippy — zero new warnings
  • Integration: M5 master + iPad Air M3, 2 layers of 7B-4bit, verified 250.1 MiB push, 1.4 GiB RSS, correct inference
  • Extended inference: longer generation to verify sustained correctness across distributed layers

New unit tests

  • extract_layer_tensors_single_file_filters_correctly — 4 tensors in, request 2, verify only 2 in output with correct data bytes
  • extract_layer_tensors_single_file_all_layers — request all tensors, verify all present with correct total size
  • extract_layer_tensors_single_file_missing_tensor_errors — request nonexistent tensor, verify error
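
To make the behavior these tests check more concrete, here is a rough sketch of building a reduced safetensors blob from a subset of tensors. It is not the PR's `extract_layer_tensors`; `Tensor` and `build_reduced_blob` are hypothetical names, tensors are held in memory for simplicity, and the sketch skips header alignment padding and JSON escaping of tensor names. It assumes the standard layout: 8-byte little-endian header length, JSON header with `dtype`, `shape`, and `data_offsets` relative to the start of the data section, then the raw tensor bytes.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory view of one tensor: dtype tag, shape, raw bytes.
struct Tensor {
    dtype: &'static str,
    shape: Vec<usize>,
    data: Vec<u8>,
}

/// Builds a safetensors blob that contains only the `wanted` tensors.
/// Returns an error if any requested tensor is missing.
fn build_reduced_blob(
    tensors: &BTreeMap<String, Tensor>,
    wanted: &[&str],
) -> Result<Vec<u8>, String> {
    let mut entries = Vec::new();
    let mut data = Vec::new();
    for name in wanted {
        let t = tensors
            .get(*name)
            .ok_or_else(|| format!("tensor not found: {name}"))?;
        // Offsets are relative to the start of the data section.
        let start = data.len();
        data.extend_from_slice(&t.data);
        let shape: Vec<String> = t.shape.iter().map(|d| d.to_string()).collect();
        entries.push(format!(
            "\"{}\":{{\"dtype\":\"{}\",\"shape\":[{}],\"data_offsets\":[{},{}]}}",
            name,
            t.dtype,
            shape.join(","),
            start,
            data.len()
        ));
    }
    let header = format!("{{{}}}", entries.join(","));
    let mut blob = Vec::with_capacity(8 + header.len() + data.len());
    blob.extend_from_slice(&(header.len() as u64).to_le_bytes());
    blob.extend_from_slice(header.as_bytes());
    blob.extend_from_slice(&data);
    Ok(blob)
}
```

Requesting two of four tensors yields a blob whose header lists only those two names and whose data section is exactly their bytes in request order; requesting an unknown name returns `Err`.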

When a worker is assigned a subset of layers from a single-file
safetensors model, extract only the needed tensors instead of pushing
the entire file. For Qwen2.5-7B-4bit (4 GiB), a 2-layer iPad worker
now receives 250 MiB instead of 4 GiB — staying well under the 3 GiB
iOS jetsam limit.

The indexed model path already filtered correctly via weight_map.
This extends the same extraction to the single-file fallback by:
- Reading the safetensors header to enumerate tensor names
- Filtering by assigned layer prefixes
- Calling extract_layer_tensors to build a minimal blob
- Falling back to full push when layers is empty (backward compat)

Verified: M5 master + iPad Air M3 worker, 2 layers, 250.1 MiB push,
1.4 GiB RSS, coherent output at 17.21 tok/s.

cjchanh commented Apr 30, 2026

This fix is still relevant from my side. I attempted a conflict-only rebase against current main, but recent upstream changes (PR #84's iOS TCP retry refactor and adjacent commits) introduce API drift beyond a simple merge: the Strategy::assign_layers trait signature changed (7→8 params), the Message::DeviceInfoRequest variant was removed, and the BUILD_HASH constant moved. Rebasing ee01115 onto current main produces 16 compile errors. Rather than force-push a broken build, I'm leaving this PR in CONFLICTING state. I'm happy to either redo this as a fresh PR against current main (cherry-picking only the minimal safetensors-filter logic) or close this in favor of that; let me know which you'd prefer.

cjchanh added a commit to cjchanh/cake that referenced this pull request May 1, 2026
…row resolution)

Mobile workers receiving a single-file `.safetensors` model previously
got the FULL file regardless of layer assignment. On 4 GiB single-file
models (Qwen2.5-7B-Instruct-4bit) this exceeded iPad jetsam budgets and
crashed with `early eof`. Same root cause as PR evilsocket#83 against cake/main,
but applied here on q4-metal-patchset (PR evilsocket#82's source branch) since
PR evilsocket#83's branch (`fix/single-file-layer-filter` at ee01115) has API
drift against current upstream and isn't cleanly rebasable.

Changes:
  * cake-core/src/utils/split.rs:
    - extract `reduce_for_layers(&Index, &[String])` from the worker-
      specific `reduce_for_worker` (more general, layer-list-driven)
    - introduce `ReducedModelBundle { index_json, safetensors }` for
      the reduced-bundle return type
    - add `build_reduced_single_file_bundle(model_path, layers)` that
      reads the safetensors header, filters tensors by layer prefixes,
      and emits a minimal safetensors blob + matching index.json

  * cake-core/src/cake/sharding/mod.rs:
    - replace the single-file fallback (which pushed the full model
      regardless of layer) with the reduced-bundle path
    - generalize `inline_files: HashMap<String, Vec<u8>>` so both the
      indexed and single-file paths can stream multiple inline blobs
      (index + reduced safetensors)
    - import `HashMap` (already had `HashSet`)
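
As an illustration of how a reduced bundle might feed a multi-file `inline_files` map, consider the sketch below. Only the names `ReducedModelBundle`, `index_json`, `safetensors`, and the `HashMap<String, Vec<u8>>` type come from the commit message; the field types, `build_index_json`, `into_inline_files`, and the file-name keys are assumptions made for this example.

```rust
use std::collections::HashMap;

/// Field names follow the commit message; the byte-vector types are assumed.
struct ReducedModelBundle {
    index_json: Vec<u8>,
    safetensors: Vec<u8>,
}

/// Hypothetical helper: writes an index whose weight_map points every
/// kept tensor at the single reduced safetensors file.
/// Tensor names are assumed not to need JSON escaping.
fn build_index_json(tensor_names: &[&str], safetensors_name: &str) -> Vec<u8> {
    let entries: Vec<String> = tensor_names
        .iter()
        .map(|n| format!("\"{n}\":\"{safetensors_name}\""))
        .collect();
    format!(
        "{{\"metadata\":{{}},\"weight_map\":{{{}}}}}",
        entries.join(",")
    )
    .into_bytes()
}

impl ReducedModelBundle {
    /// Hypothetical conversion into a file-name -> bytes map, so the index
    /// and the reduced weights can be streamed as two inline blobs.
    fn into_inline_files(self, safetensors_name: &str) -> HashMap<String, Vec<u8>> {
        let mut files = HashMap::new();
        files.insert("model.safetensors.index.json".to_string(), self.index_json);
        files.insert(safetensors_name.to_string(), self.safetensors);
        files
    }
}
```

Shipping a matching index alongside the reduced weights would let the receiving side load the bundle through the same indexed path it already uses, which may be why the commit pairs the two.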

Test coverage and benchmark updates pair with this in the existing
q4-metal-patchset commits.

Closes spec 199's cake-q4-branch NEEDS_OPERATOR_DECISION row with
disposition: COMMIT (intentional q4 follow-up; preserves PR evilsocket#82
contribution path; commit stays local until operator authorizes
fork push).

Spec: 199-triage-dirty-trees-across-active-portfolio (cake-q4-branch row)
SOP: ~/Documents/Centennial/SOPs/CDS_Stuck_Spec_Triage_SOP_v1.md §3.A v1.1
Triage report: ~/ai/evidence/spec-096-triage-20260430/TRIAGE_REPORT.md