[Superseded by #317] fix: distributed CUDA span cover #316
Closed
hexxyan wants to merge 1 commit into
…ted CUDA workers

When a distributed CUDA worker receives --layers 21:output, it should only access the pages covering its assigned layer spans (~78 GiB), not the entire GGUF mmap (~150 GiB for Q4).

Root cause: ds4_gpu_set_model_map_spans() set g_model_registered_size to the full file size. All fallback paths (cuda_model_range_ptr, cuda_model_range_is_cached, cuda_model_direct_fallback_ptr, cuda_model_copy_chunked, Q2 tail expansion) assumed the full model was accessible, causing OOM on 128 GB DGX Spark workers where the Q4 GGUF exceeds available memory.

Fix:
- Add cuda_model_compute_span_cover() to compute minimal page-aligned cover of all assigned spans
- Store cover as g_model_span_cover_offset/end (NOT g_model_registered_size)
- Add cuda_model_range_in_span_cover() helper that checks [offset, offset+bytes) falls within the cover, with overflow guard
- Guard ALL direct/full-file fallback paths with this helper: cuda_model_range_ptr, cuda_model_range_is_cached, cuda_model_direct_fallback_ptr (which also now takes a bytes parameter), cuda_model_copy_chunked, and the Q2 tail expansion heuristic
- Reset span-only state in both cuda_model_set_host_map and ds4_gpu_cleanup

New log output when active:
  ds4: CUDA span cover 78.21 GiB of 150.00 GiB model (48 spans, offset 72.00 GiB)

CUDA compile verified (nvcc sm_90); GB10/Q4 e2e still needed.

Refs: antirez#293
Superseded by a clean PR (force-push history cleanup).
Superseded by #317 (clean branch, same code).
Summary
Distributed CUDA workers with `--layers 21:output` should only access the pages covering their assigned layer spans (~78 GiB), not the entire GGUF mmap (~150 GiB for Q4). Currently all direct/full-file fallback paths assume the full model is accessible, causing OOM on 128 GB DGX Spark workers.
Root Cause
`ds4_gpu_set_model_map_spans()` set `g_model_registered_size` to the full file size. Every fallback path (`cuda_model_range_ptr`, `cuda_model_range_is_cached`, `cuda_model_direct_fallback_ptr`, `cuda_model_copy_chunked`, Q2 tail expansion) used this to assume full-model access.
For Q2 this happened to work, but it still registered 80.76 GiB when the slice was only 41 GiB. For Q4, the 150 GiB full file exceeds the 128 GB worker memory.
Fix
New state: span cover offset/end (not just a flag)
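As a rough illustration of the span cover state, the sketch below computes a page-aligned range that contains every assigned span. It is not the code from this PR: `model_span`, `span_cover`, `compute_span_cover`, the power-of-two `page_size` argument, and the clamp to `file_size` are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical span: the byte range [offset, offset + bytes) in the model file. */
typedef struct {
    uint64_t offset;
    uint64_t bytes;
} model_span;

/* Hypothetical cover: the half-open, page-aligned range [offset, end). */
typedef struct {
    uint64_t offset;
    uint64_t end;
} span_cover;

/* Compute the smallest page-aligned range that contains every span.
 * Assumes page_size is a power of two. The end is clamped to file_size.
 * Returns false when there is no non-empty span, or when a span wraps
 * around or extends past the end of the file. */
static bool compute_span_cover(const model_span *spans, size_t n,
                               uint64_t page_size, uint64_t file_size,
                               span_cover *out)
{
    uint64_t lo = UINT64_MAX;
    uint64_t hi = 0;

    for (size_t i = 0; i < n; i++) {
        if (spans[i].bytes == 0) continue;            /* empty spans add nothing */
        if (spans[i].bytes > UINT64_MAX - spans[i].offset) return false;
        uint64_t end = spans[i].offset + spans[i].bytes;
        if (end > file_size) return false;
        if (spans[i].offset < lo) lo = spans[i].offset;
        if (end > hi) hi = end;
    }
    if (lo >= hi) return false;                       /* nothing to cover */

    uint64_t mask = page_size - 1;
    uint64_t aligned_end = (hi > UINT64_MAX - mask) ? file_size
                                                    : ((hi + mask) & ~mask);
    if (aligned_end > file_size) aligned_end = file_size;

    out->offset = lo & ~mask;                         /* round start down */
    out->end = aligned_end;                           /* round end up, clamped */
    return true;
}
```

With 4 KiB pages, spans at `[12388, 12588)` and `[40960, 45960)` yield the cover `[12288, 49152)`. Storing both ends, rather than a single flag, is what lets later checks reject offsets that fall before the worker's slice as well as after it.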
New helper: `cuda_model_range_in_span_cover(offset, bytes)`
Returns `true` iff `[offset, offset+bytes)` falls within `[cover_offset, cover_end)`. When not in span-only mode, always returns `true`. Includes an overflow guard.
All guarded paths
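The range check, and the way a fallback path might consult it, can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `span_cover_state`, `range_in_span_cover`, and `direct_fallback_ptr` are hypothetical stand-ins for the globals and functions named in this description.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state: the cover is only enforced in span-only mode. */
typedef struct {
    bool     span_only;
    uint64_t cover_offset;
    uint64_t cover_end;
} span_cover_state;

/* True iff [offset, offset + bytes) lies inside [cover_offset, cover_end).
 * Outside span-only mode every range is accepted. */
static bool range_in_span_cover(const span_cover_state *s,
                                uint64_t offset, uint64_t bytes)
{
    if (!s->span_only) return true;
    if (offset < s->cover_offset || offset > s->cover_end) return false;
    /* Overflow guard: compare bytes with the room left in the cover
     * instead of computing offset + bytes, which could wrap around. */
    return bytes <= s->cover_end - offset;
}

/* Hypothetical guarded fallback: hand out a direct host pointer only when
 * the whole requested range is inside the cover. */
static const uint8_t *direct_fallback_ptr(const span_cover_state *s,
                                          const uint8_t *host_base,
                                          uint64_t offset, uint64_t bytes)
{
    if (!range_in_span_cover(s, offset, bytes)) return NULL;
    return host_base + offset;
}
```

Passing `bytes` into the fallback matters because a range can start inside the cover and still run past its end; checking the offset alone would miss that case.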
| Function | Guarded path |
| --- | --- |
| `cuda_model_range_ptr` | `g_model_device_owned`/`g_model_registered` fallback, `g_model_hmm_direct` fallback, `DS4_CUDA_DIRECT_MODEL` fallback |
| `cuda_model_range_is_cached` | `registered`/`hmm_direct`; now also checks span cover |
| `cuda_model_direct_fallback_ptr` | `bytes` param and checks span cover (called from fd cache path) |
| `cuda_model_copy_chunked` | `registered`; now skips when `span_only` |
| `cuda_model_range_register_mapped` | `span_only` |
Cleanup
Both `cuda_model_set_host_map` and `ds4_gpu_cleanup` reset all span-only state.
New log output
ds4: CUDA span cover 78.21 GiB of 150.00 GiB model (48 spans, offset 72.00 GiB)
Testing
- `nvcc -arch=sm_90` compiles `ds4_cuda.cu` cleanly (zero warnings, zero errors, 5.3 MB ELF object) via the `nvidia/cuda:13.0.0-devel-ubuntu24.04` Docker image
- Still needed: GB10/Q4 e2e run with `--layers` to confirm the fix resolves the OOM
Refs: #293