fix: skip cudaHostRegister when DS4_CUDA_COPY_MODEL_CHUNKED is set #320

Open
kyuz0 wants to merge 1 commit into antirez:rocm from kyuz0:rocm

@kyuz0 kyuz0 commented Jun 1, 2026

Fixes a bug where DS4_CUDA_COPY_MODEL_CHUNKED would still cause host RAM exhaustion during model load on APUs.

Context:
Previously, even with DS4_CUDA_COPY_MODEL_CHUNKED=1 set, ds4_gpu_set_model_map unconditionally called cudaHostRegister on the entire memory-mapped model. cudaHostRegister page-locks the memory, so the chunked copy's subsequent posix_madvise(DONTNEED) calls could not release the RAM.

On AMD APUs such as Ryzen AI Max "Strix Halo" (unified memory), this pinned 80 GB+ of system RAM, completely defeating the purpose of the chunked copy and causing an immediate OOM.

This patch makes ds4_gpu_set_model_map return early when chunking is enabled, so the chunked copy loop can page in and discard system RAM as intended.
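For readers skimming the thread, here is a minimal sketch of the early-exit pattern described above. It is not the actual diff. The names `chunked_copy_enabled`, `set_model_map_sketch`, `register_fn`, and `fake_register` are hypothetical, the registration call is replaced by a counting stand-in so the snippet builds without a CUDA/HIP toolchain, and the rule for what counts as "set" (any non-empty value other than "0") is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: any non-empty value other than "0" enables chunked copy.
 * The real parsing in the project may differ. */
static bool chunked_copy_enabled(void) {
    const char *v = getenv("DS4_CUDA_COPY_MODEL_CHUNKED");
    return v != NULL && v[0] != '\0' && !(v[0] == '0' && v[1] == '\0');
}

/* Hypothetical callback type standing in for the cudaHostRegister call.
 * Returns 0 on success, non-zero on failure. */
typedef int (*register_fn)(void *ptr, size_t size);

/* Counting stand-in used only to observe whether registration was attempted. */
static int fake_register_calls = 0;
static int fake_register(void *ptr, size_t size) {
    (void)ptr;
    (void)size;
    fake_register_calls++;
    return 0;
}

/* Sketch of the patched control flow:
 *   returns 0  if registration was skipped (chunked copy enabled),
 *   returns 1  if the mapping was registered,
 *   returns -1 if registration failed. */
static int set_model_map_sketch(void *map, size_t size, register_fn reg) {
    if (chunked_copy_enabled()) {
        /* Leave the mapping unpinned so the chunked copy loop can
         * release pages after each chunk has been copied. */
        return 0;
    }
    return reg(map, size) == 0 ? 1 : -1;
}
```

With the variable set, the stand-in is never called and the mapping stays unpinned. With it unset, the original register-everything path still runs.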

When DS4_CUDA_COPY_MODEL_CHUNKED is set, skip cudaHostRegister.
Registering a large memory map prevents posix_madvise(DONTNEED)
from freeing pages during the chunked copy, leading to catastrophic
system RAM exhaustion on APUs with unified memory.