ggml, server: add ggml_backend_dev_reset() for sleep mode #25271
Conversation
|
I'm testing it on my server |
|
@ngxson Still having held
ROCM 6.3.3 w/ llama-server logs (important part is to know it entered sleep mode)
ROCM logs after sleep was initiated
To confirm/sanity check myself: |
|
@DEV-DUFORD unfortunately I have no experience working with hip/rocm, so I cannot help much. would appreciate it if you can dig deeper to see if
on CUDA, it seems to work fine for me. upon entering sleep, the |
|
@DEV-DUFORD btw, can you try with a single device to narrow down whether the problem is related to multi-gpu? |
Yeah, will do! I'm still worried this is a
Also just confirmed, it's happening for me on a single card run as well as multi-gpu |
|
I can confirm that during sleep: there is no longer a CUDA context. |
|
I hate github UI |
|
I think it's like a #define cudaDeviceReset musaDeviceReset missing in vendors/musa.h
I check |
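A minimal sketch of the aliasing idea in the comment above, assuming the vendor header works by mapping CUDA names onto the vendor runtime with macros. The `musaDeviceReset` stub, the `musaError_t` typedef, and the `reset_device` helper here are hypothetical stand-ins so the snippet compiles without the MUSA SDK; the real function would come from the vendor's own headers.

```cpp
#include <cassert>

// Hypothetical stand-in for the vendor runtime call. In a real build this
// declaration would come from the MUSA SDK, not from this file.
typedef int musaError_t;
static int g_reset_calls = 0;
static musaError_t musaDeviceReset() {
    ++g_reset_calls;
    return 0; // 0 is used here as "success" for the stub
}

// The alias the comment suggests is missing from vendors/musa.h.
#define cudaDeviceReset musaDeviceReset

// Illustrative backend code written against the CUDA name. With the alias in
// place it resolves to the vendor function at preprocessing time.
static int reset_device() {
    return cudaDeviceReset();
}
```

Without the `#define`, the call inside `reset_device` would fail to compile on a build that only has the vendor runtime, which matches the kind of breakage the comment describes.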
|
@ngxson Doing some additional digging I got to this:
It seems like, on
So I don't really know if there's a path forward on HIP devices to truly clear the VRAM on sleep, hmmmmmm. 🤔 |
|
Worst case we could add an opt-in --sleep-exit: nothing dirty in llama-server, just an escape hatch for stubborn runtimes when the bug is not on our side. |
|
the main reason I don't want exit-on-sleep is:
so exiting the process is really the last-ditch resolution. I think for now, the better steps we can take are:
cc'ing @IMbackK if you have any thoughts on this issue |
Overview
Add ggml_backend_dev_reset() that is mapped to cudaDeviceReset / hipDeviceReset
only CUDA & hip are handled; OTHER BACKENDS ARE UNIMPLEMENTED, please skip reviewing those --> feel free to push a PR to add this feature to other backends
Tested on CUDA:
llama-server -hf unsloth/Qwen3.5-4B-MTP-GGUF:Q4_K_M --sleep-idle-seconds 5
Requirements
nullptr stub for other backends
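As a rough illustration of the "nullptr stub" requirement, the sketch below shows one way an optional reset callback could be dispatched. It is not the PR's actual code: `device_iface`, `device_reset`, `fake_gpu_reset`, and the two device instances are hypothetical names, and the GPU callback only stands in for where a backend would call cudaDeviceReset / hipDeviceReset.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical device interface: each backend fills in the callbacks it supports.
struct device_iface {
    const char * name;
    bool (*reset)(void); // nullptr when the backend has no reset implementation
};

static int g_resets = 0;

// Stand-in for a CUDA/HIP implementation; a real one would call
// cudaDeviceReset() or hipDeviceReset() here.
static bool fake_gpu_reset(void) {
    ++g_resets;
    return true;
}

// Generic entry point: returns false (and does nothing) when the backend
// left the callback as nullptr.
static bool device_reset(const device_iface & dev) {
    if (dev.reset == nullptr) {
        std::fprintf(stderr, "%s: reset not implemented, skipping\n", dev.name);
        return false;
    }
    return dev.reset();
}

static const device_iface gpu_dev = { "gpu", fake_gpu_reset };
static const device_iface cpu_dev = { "cpu", nullptr };
```

With this shape, a server entering sleep mode could loop over all devices and call `device_reset` on each one; backends with a `nullptr` callback are skipped rather than crashing, which leaves room for follow-up PRs to fill them in.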