[Superseded by #317] fix: distributed CUDA span cover by hexxyan · Pull Request #316 · antirez/ds4

hexxyan · 2026-05-31T23:00:56Z

Summary

Distributed CUDA workers with --layers 21:output should only access the pages covering their assigned layer spans (~78 GiB), not the entire GGUF mmap (~150 GiB for Q4). Currently all direct/full-file fallback paths assume the full model is accessible, causing OOM on 128 GB DGX Spark workers.

Root Cause

ds4_gpu_set_model_map_spans() set g_model_registered_size to the full file size. Every fallback path (cuda_model_range_ptr, cuda_model_range_is_cached, cuda_model_direct_fallback_ptr, cuda_model_copy_chunked, Q2 tail expansion) used this to assume full-model access.

For Q2 this "works", but it still registers 80.76 GiB when the worker's slice is only 41 GiB. For Q4, the 150 GiB full file exceeds the 128 GB worker memory.

Fix

New state: span cover offset/end (not just a flag)

g_model_span_cover_offset  // page-aligned start of span cover
g_model_span_cover_end     // page-aligned end of span cover
g_model_span_only          // 1 when initialized from _spans
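As a rough illustration of how a page-aligned cover could be derived from the assigned spans, here is a minimal host-side sketch. The `model_span` struct, the `span_cover_compute` name, and the explicit `page_size` parameter are assumptions made for this example, not the code in this PR. The cover is a single contiguous range, so any gaps between spans are included in it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical span descriptor: byte range [offset, offset + bytes) in the GGUF file. */
typedef struct {
    uint64_t offset;
    uint64_t bytes;
} model_span;

/* Compute the smallest page-aligned range [*cover_offset, *cover_end) that
 * contains every non-empty span. page_size must be a power of two.
 * Returns 1 on success, 0 when there is no usable span or on overflow. */
static int span_cover_compute(const model_span *spans, size_t n_spans,
                              uint64_t page_size,
                              uint64_t *cover_offset, uint64_t *cover_end) {
    if (page_size == 0 || (page_size & (page_size - 1)) != 0) return 0;

    uint64_t lo = UINT64_MAX, hi = 0;
    int found = 0;
    for (size_t i = 0; i < n_spans; i++) {
        if (spans[i].bytes == 0) continue;                            /* skip empty spans */
        if (spans[i].bytes > UINT64_MAX - spans[i].offset) return 0;  /* offset + bytes overflows */
        uint64_t end = spans[i].offset + spans[i].bytes;
        if (spans[i].offset < lo) lo = spans[i].offset;
        if (end > hi) hi = end;
        found = 1;
    }
    if (!found) return 0;

    uint64_t mask = page_size - 1;
    if (hi > UINT64_MAX - mask) return 0;   /* rounding up would overflow */
    *cover_offset = lo & ~mask;             /* round start down to a page boundary */
    *cover_end = (hi + mask) & ~mask;       /* round end up to a page boundary */
    return 1;
}
```

With 4096-byte pages, spans at `[5000, 5100)` and `[12000, 12300)` give the cover `[4096, 16384)`.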

New helper: `cuda_model_range_in_span_cover(offset, bytes)`

Returns true iff [offset, offset+bytes) falls within [cover_offset, cover_end). When not in span-only mode, always returns true. Includes overflow guard.
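The check described above can be sketched as follows. This is an illustrative version only: the `span_cover_state` struct stands in for the three globals listed earlier, and the function name and signature are assumptions rather than the PR's actual code.

```c
#include <stdint.h>

/* Hypothetical state mirroring g_model_span_cover_offset / _end / g_model_span_only. */
typedef struct {
    uint64_t cover_offset;  /* page-aligned start of the cover */
    uint64_t cover_end;     /* page-aligned end of the cover (exclusive) */
    int span_only;          /* 1 when initialized from spans */
} span_cover_state;

/* Returns 1 iff [offset, offset + bytes) lies within [cover_offset, cover_end).
 * Outside span-only mode every range is accepted. */
static int range_in_span_cover(const span_cover_state *st,
                               uint64_t offset, uint64_t bytes) {
    if (!st->span_only) return 1;
    if (bytes > UINT64_MAX - offset) return 0;  /* overflow guard: offset + bytes would wrap */
    return offset >= st->cover_offset && offset + bytes <= st->cover_end;
}
```

The overflow guard runs before the addition, so a huge `bytes` value cannot wrap around and appear to fall inside the cover.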

All guarded paths

| Function | What was guarded |
| --- | --- |
| `cuda_model_range_ptr` | `g_model_device_owned`/`g_model_registered` fallback, `g_model_hmm_direct` fallback, `DS4_CUDA_DIRECT_MODEL` fallback |
| `cuda_model_range_is_cached` | Early return when `registered`/`hmm_direct`; now also checks span cover |
| `cuda_model_direct_fallback_ptr` | Now takes `bytes` param and checks span cover (called from fd cache path) |
| `cuda_model_copy_chunked` | Early return when `registered`; now skips when `span_only` |
| `cuda_model_range_register_mapped` | Q2 tail expansion heuristic; now skips when `span_only` |
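To show the general shape of such a guard, here is a sketch of a direct-pointer fallback that takes a `bytes` argument and refuses ranges outside the cover. The `model_map` struct, the function name, and the choice to signal refusal with `NULL` are all assumptions for illustration; the real fallback paths in `ds4_cuda.cu` may be structured differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the mapped model plus the span-cover state. */
typedef struct {
    const uint8_t *host_base;  /* start of the mmap'd model file */
    uint64_t cover_offset;     /* page-aligned start of the cover */
    uint64_t cover_end;        /* page-aligned end of the cover (exclusive) */
    int span_only;             /* 1 when initialized from spans */
} model_map;

/* Returns a direct pointer into the mapping, or NULL when the requested range
 * is outside the span cover, so the caller has to take a different path. */
static const uint8_t *direct_fallback_ptr(const model_map *m,
                                          uint64_t offset, uint64_t bytes) {
    if (m->span_only) {
        if (bytes > UINT64_MAX - offset) return NULL;  /* overflow guard */
        if (offset < m->cover_offset || offset + bytes > m->cover_end) return NULL;
    }
    return m->host_base + offset;
}
```

Passing `bytes` matters because a range can start inside the cover and still run past its end.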

Cleanup

Both cuda_model_set_host_map and ds4_gpu_cleanup reset all span-only state.

New log output

ds4: CUDA span cover 78.21 GiB of 150.00 GiB model (48 spans, offset 72.00 GiB)
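For reference, a line in this format could be produced from the cover bounds along these lines. The helper name and parameters are hypothetical; only the message layout is taken from the example above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format the span-cover log line into buf. Returns the snprintf result:
 * the number of characters the full message needs, excluding the terminator. */
static int format_span_cover_log(char *buf, size_t cap,
                                 uint64_t cover_offset, uint64_t cover_end,
                                 uint64_t model_size, unsigned n_spans) {
    const double gib = 1024.0 * 1024.0 * 1024.0;  /* bytes per GiB */
    return snprintf(buf, cap,
                    "ds4: CUDA span cover %.2f GiB of %.2f GiB model "
                    "(%u spans, offset %.2f GiB)",
                    (double)(cover_end - cover_offset) / gib,
                    (double)model_size / gib,
                    n_spans,
                    (double)cover_offset / gib);
}
```

The cover size is reported as `cover_end - cover_offset`, so it includes any gaps between the individual spans.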

Testing

- CUDA compile verified: `nvcc -arch=sm_90` compiles `ds4_cuda.cu` cleanly (zero warnings, zero errors, 5.3 MB ELF object) via the `nvidia/cuda:13.0.0-devel-ubuntu24.04` Docker image
- Non-CUDA build is unaffected
- Still needs: GB10/Q4 distributed e2e validation with `--layers` to confirm the fix resolves the OOM

Refs: #293

…ted CUDA workers

When a distributed CUDA worker receives --layers 21:output, it should only access the pages covering its assigned layer spans (~78 GiB), not the entire GGUF mmap (~150 GiB for Q4).

Root cause: ds4_gpu_set_model_map_spans() set g_model_registered_size to the full file size. All fallback paths (cuda_model_range_ptr, cuda_model_range_is_cached, cuda_model_direct_fallback_ptr, cuda_model_copy_chunked, Q2 tail expansion) assumed the full model was accessible, causing OOM on 128 GB DGX Spark workers where the Q4 GGUF exceeds available memory.

Fix:
- Add cuda_model_compute_span_cover() to compute minimal page-aligned cover of all assigned spans
- Store cover as g_model_span_cover_offset/end (NOT g_model_registered_size)
- Add cuda_model_range_in_span_cover() helper that checks [offset, offset+bytes) falls within the cover, with overflow guard
- Guard ALL direct/full-file fallback paths with this helper: cuda_model_range_ptr, cuda_model_range_is_cached, cuda_model_direct_fallback_ptr (which also now takes a bytes parameter), cuda_model_copy_chunked, and the Q2 tail expansion heuristic
- Reset span-only state in both cuda_model_set_host_map and ds4_gpu_cleanup

New log output when active:

ds4: CUDA span cover 78.21 GiB of 150.00 GiB model (48 spans, offset 72.00 GiB)

CUDA compile verified (nvcc sm_90); GB10/Q4 e2e still needed.

Refs: antirez#293

hexxyan · 2026-05-31T23:39:58Z

Superseded by a clean PR (force-push history cleanup).

hexxyan · 2026-05-31T23:41:32Z

Superseded by #317 (clean branch, same code).

hexxyan force-pushed the fix/distributed-cuda-slice-registration branch from 8c80371 to 9c5d910 May 31, 2026 23:09

hexxyan changed the title ~~fix: distributed CUDA worker only registers layer span cover (refs #293)~~ fix: limit direct/full-map fallbacks to layer span cover for distributed CUDA workers (refs #293) May 31, 2026

hexxyan force-pushed the fix/distributed-cuda-slice-registration branch 2 times, most recently from 121cd07 to 06a1821 May 31, 2026 23:25

hexxyan mentioned this pull request May 31, 2026

Distributed CUDA worker host-registers the whole GGUF, not just its layer slice → Q4 OOM on 128GB DGX Spark #293

Open

hexxyan force-pushed the fix/distributed-cuda-slice-registration branch from 06a1821 to 7c8d28e May 31, 2026 23:37

hexxyan closed this May 31, 2026

hexxyan deleted the fix/distributed-cuda-slice-registration branch May 31, 2026 23:40

hexxyan changed the title ~~fix: limit direct/full-map fallbacks to layer span cover for distributed CUDA workers (refs #293)~~ [Superseded by #317] fix: distributed CUDA span cover May 31, 2026

hexxyan wants to merge 1 commit into antirez:main from hexxyan:fix/distributed-cuda-slice-registration
