[Superseded by #317] fix: distributed CUDA span cover #316

Closed
hexxyan wants to merge 1 commit into antirez:main from hexxyan:fix/distributed-cuda-slice-registration

@hexxyan hexxyan commented May 31, 2026

Summary

Distributed CUDA workers with --layers 21:output should only access the pages covering their assigned layer spans (~78 GiB), not the entire GGUF mmap (~150 GiB for Q4). Currently all direct/full-file fallback paths assume the full model is accessible, causing OOM on 128 GB DGX Spark workers.

Root Cause

ds4_gpu_set_model_map_spans() set g_model_registered_size to the full file size. Every fallback path (cuda_model_range_ptr, cuda_model_range_is_cached, cuda_model_direct_fallback_ptr, cuda_model_copy_chunked, Q2 tail expansion) used this to assume full-model access.

For Q2 this "works" only because the file happens to fit: the worker still registered the full 80.76 GiB even though its slice was only 41 GiB. For Q4, the 150 GiB full file exceeds the worker's 128 GB of memory.
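The size comparison above can be checked with a few lines of arithmetic. This is only a sketch: it assumes "128 GB" means decimal gigabytes and ignores memory used by the OS and other processes, so the real headroom is smaller. The helper names `bytes_to_gib` and `fits_in_memory` are hypothetical and not part of ds4.

```c
#include <stdint.h>

/* One GiB in bytes, as a double so the division below is floating point. */
#define GIB (1024.0 * 1024.0 * 1024.0)

/* Convert a byte count to GiB. */
double bytes_to_gib(uint64_t bytes)
{
    return (double)bytes / GIB;
}

/* Return 1 if a mapping of need_gib GiB fits in mem_bytes of memory.
 * Assumption: all of mem_bytes is usable, which is optimistic. */
int fits_in_memory(double need_gib, uint64_t mem_bytes)
{
    return need_gib <= bytes_to_gib(mem_bytes);
}
```

Under these assumptions, 128 GB is roughly 119.2 GiB, so a ~78 GiB span cover fits while the 150 GiB Q4 file does not.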

Fix

New state: span cover offset/end (not just a flag)

g_model_span_cover_offset  // page-aligned start of span cover
g_model_span_cover_end     // page-aligned end of span cover
g_model_span_only          // 1 when initialized from _spans

New helper: cuda_model_range_in_span_cover(offset, bytes)

Returns true iff [offset, offset+bytes) falls within [cover_offset, cover_end). When not in span-only mode, always returns true. Includes overflow guard.
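A minimal sketch of how such a check might look, written from the description above rather than copied from the patch. The globals mirror the names listed in this PR, but their types (`uint64_t`, `int`) and the exact form of the overflow guard are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the state named in this PR; the types are assumed. */
uint64_t g_model_span_cover_offset = 0; /* page-aligned start of span cover */
uint64_t g_model_span_cover_end    = 0; /* page-aligned end (exclusive) */
int      g_model_span_only         = 0; /* 1 when initialized from _spans */

/* True iff [offset, offset + bytes) lies within [cover_offset, cover_end). */
bool cuda_model_range_in_span_cover(uint64_t offset, uint64_t bytes)
{
    if (!g_model_span_only)
        return true;                  /* full-model mode: no restriction */
    if (bytes > UINT64_MAX - offset)
        return false;                 /* offset + bytes would wrap around */
    return offset >= g_model_span_cover_offset &&
           offset + bytes <= g_model_span_cover_end;
}
```

Checking `bytes > UINT64_MAX - offset` before adding avoids the wraparound that would otherwise let a huge `bytes` value pass the upper-bound comparison.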

All guarded paths

| Function | What was guarded |
| --- | --- |
| cuda_model_range_ptr | g_model_device_owned/g_model_registered fallback, g_model_hmm_direct fallback, DS4_CUDA_DIRECT_MODEL fallback |
| cuda_model_range_is_cached | Early return when registered/hmm_direct; now also checks span cover |
| cuda_model_direct_fallback_ptr | Now takes a bytes param and checks span cover (called from fd cache path) |
| cuda_model_copy_chunked | Early return when registered; now skipped when span_only |
| cuda_model_range_register_mapped | Q2 tail expansion heuristic; now skipped when span_only |

Cleanup

Both cuda_model_set_host_map and ds4_gpu_cleanup reset all span-only state.

New log output

ds4: CUDA span cover 78.21 GiB of 150.00 GiB model (48 spans, offset 72.00 GiB)

Testing

  • CUDA compile verified: nvcc -arch=sm_90 compiles ds4_cuda.cu cleanly (zero warnings, zero errors, 5.3 MB ELF object) via nvidia/cuda:13.0.0-devel-ubuntu24.04 Docker image
  • Non-CUDA build is unaffected
  • Still needs: GB10/Q4 distributed e2e validation with --layers to confirm the fix resolves the OOM

Refs: #293

@hexxyan hexxyan force-pushed the fix/distributed-cuda-slice-registration branch from 8c80371 to 9c5d910 Compare May 31, 2026 23:09
@hexxyan hexxyan changed the title fix: distributed CUDA worker only registers layer span cover (refs #293) fix: limit direct/full-map fallbacks to layer span cover for distributed CUDA workers (refs #293) May 31, 2026
@hexxyan hexxyan force-pushed the fix/distributed-cuda-slice-registration branch 2 times, most recently from 121cd07 to 06a1821 Compare May 31, 2026 23:25
…ted CUDA workers

When a distributed CUDA worker receives --layers 21:output, it should
only access the pages covering its assigned layer spans (~78 GiB),
not the entire GGUF mmap (~150 GiB for Q4).

Root cause: ds4_gpu_set_model_map_spans() set g_model_registered_size
to the full file size.  All fallback paths (cuda_model_range_ptr,
cuda_model_range_is_cached, cuda_model_direct_fallback_ptr,
cuda_model_copy_chunked, Q2 tail expansion) assumed the full model
was accessible, causing OOM on 128 GB DGX Spark workers where the
Q4 GGUF exceeds available memory.

Fix:
- Add cuda_model_compute_span_cover() to compute minimal page-aligned
  cover of all assigned spans
- Store cover as g_model_span_cover_offset/end (NOT g_model_registered_size)
- Add cuda_model_range_in_span_cover() helper that checks [offset, offset+bytes)
  falls within the cover, with overflow guard
- Guard ALL direct/full-file fallback paths with this helper:
  cuda_model_range_ptr, cuda_model_range_is_cached,
  cuda_model_direct_fallback_ptr (which also now takes a bytes parameter),
  cuda_model_copy_chunked, and the Q2 tail expansion heuristic
- Reset span-only state in both cuda_model_set_host_map and ds4_gpu_cleanup
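The cover computation described in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: the `model_span` struct, the `compute_span_cover` name and signature, and the power-of-two page size are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical span descriptor; the real layout in ds4 may differ. */
typedef struct {
    uint64_t offset; /* byte offset into the model file */
    uint64_t bytes;  /* span length in bytes */
} model_span;

/* Compute the smallest page-aligned range [*cover_offset, *cover_end)
 * containing every span. Returns 0 on success, -1 on empty input,
 * a page size that is not a power of two, or arithmetic overflow. */
int compute_span_cover(const model_span *spans, size_t n, uint64_t page_size,
                       uint64_t *cover_offset, uint64_t *cover_end)
{
    if (spans == NULL || n == 0 || page_size == 0 ||
        (page_size & (page_size - 1)) != 0)
        return -1;

    uint64_t lo = UINT64_MAX, hi = 0;
    for (size_t i = 0; i < n; i++) {
        if (spans[i].bytes > UINT64_MAX - spans[i].offset)
            return -1;                     /* span end would wrap */
        uint64_t end = spans[i].offset + spans[i].bytes;
        if (spans[i].offset < lo) lo = spans[i].offset;
        if (end > hi) hi = end;
    }

    uint64_t mask = page_size - 1;
    if (hi > UINT64_MAX - mask)
        return -1;                         /* rounding up would wrap */
    *cover_offset = lo & ~mask;            /* round start down to a page */
    *cover_end = (hi + mask) & ~mask;      /* round end up to a page */
    return 0;
}
```

Note that a single contiguous cover also includes any gaps between spans, so it can be larger than the sum of the span sizes.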

New log output when active:
  ds4: CUDA span cover 78.21 GiB of 150.00 GiB model (48 spans, offset 72.00 GiB)

CUDA compile verified (nvcc sm_90); GB10/Q4 e2e still needed.

Refs: antirez#293
@hexxyan hexxyan force-pushed the fix/distributed-cuda-slice-registration branch from 06a1821 to 7c8d28e Compare May 31, 2026 23:37
hexxyan commented May 31, 2026

Superseded by a clean PR (force-push history cleanup).

@hexxyan hexxyan closed this May 31, 2026
@hexxyan hexxyan deleted the fix/distributed-cuda-slice-registration branch May 31, 2026 23:40
@hexxyan hexxyan changed the title fix: limit direct/full-map fallbacks to layer span cover for distributed CUDA workers (refs #293) [Superseded by #317] fix: distributed CUDA span cover May 31, 2026
hexxyan commented May 31, 2026

Superseded by #317 (clean branch, same code).
