
ggml, server: add ggml_backend_dev_reset() for sleep mode#25271

Open
ngxson wants to merge 6 commits into master from xsn/ggml_backend_dev_reset

Conversation

@ngxson

@ngxson ngxson commented Jul 3, 2026

Collaborator

Overview

Add ggml_backend_dev_reset() that is mapped to cudaDeviceReset / hipDeviceReset

Only CUDA and HIP are handled; OTHER BACKENDS ARE UNIMPLEMENTED, so please disregard them when reviewing. Feel free to push a PR to add this feature to other backends.

Tested on CUDA: llama-server -hf unsloth/Qwen3.5-4B-MTP-GGUF:Q4_K_M --sleep-idle-seconds 5

  • Before this change: nvtop reports ~158MB in use in sleep mode
  • With this change: nvtop reports 0 memory usage in sleep mode
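
To illustrate the shape of the change, here is a minimal sketch of an optional per-device reset hook with a nullptr stub for backends that do not implement it. All names below (`device_iface`, `dev_reset`, `fake_cuda_reset`, the two device objects) are hypothetical stand-ins and are not taken from the patch; the fake reset only mimics what a call to cudaDeviceReset / hipDeviceReset would do in the real backend.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for a ggml-style device interface.
struct device_iface {
    const char * name;
    bool (*reset)();   // nullptr when the backend has no reset support
};

// Pretend driver context size in MB (the ~158MB figure reported above).
static int g_fake_ctx_mb = 158;

// Stand-in for a backend that can tear down its context; the real
// CUDA/HIP backend would call cudaDeviceReset()/hipDeviceReset() here.
static bool fake_cuda_reset() {
    g_fake_ctx_mb = 0;
    return true;
}

// Returns true only if the backend implements reset and it succeeded.
static bool dev_reset(const device_iface & dev) {
    assert(dev.name != nullptr);  // every device must at least be named
    if (dev.reset == nullptr) {
        return false;             // unimplemented backend: nothing is freed
    }
    return dev.reset();
}

static const device_iface dev_cuda  = { "CUDA0",   fake_cuda_reset };
static const device_iface dev_other = { "Vulkan0", nullptr };
```

With this layout, a caller entering sleep can loop over all devices and call `dev_reset` on each; backends with a nullptr stub simply report `false` and keep whatever context they hold.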

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: core changes are written by me, AI used to add nullptr stub for other backends

@ngxson ngxson requested review from a team, IMbackK, JohannesGaessler and ggerganov as code owners July 3, 2026 15:41
@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend server ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs OpenCL Issues specific to the OpenCL backend IBM zDNN issues specific to IBM zDNN Accelerator Hexagon CUDA Related to the CUDA backend AMD ZenDNN Issues related to the AMD ZenDNN backend OpenVINO WebGPU labels Jul 3, 2026
@ngxson ngxson removed the WebGPU label Jul 3, 2026
@ServeurpersoCom

Contributor

I'm testing it on my server

@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs OpenCL Issues specific to the OpenCL backend IBM zDNN issues specific to IBM zDNN Accelerator Hexagon AMD ZenDNN Issues related to the AMD ZenDNN backend OpenVINO WebGPU labels Jul 3, 2026
@DEV-DUFORD

DEV-DUFORD commented Jul 3, 2026

Contributor

@ngxson I'm still seeing HIP resources held after sleep is initiated. Attached some logs to help w/ triage:

ROCM 6.3.3 w/ gfx900

llama-server logs (important part is to know it entered sleep mode)
nerd-dell@nerd-dell:~/llama.cpp$ cd ~/llama.cpp/ && GGML_CUDA_ALLREDUCE=none ./build/bin/llama-server --models-preset ~/gguf_store/config.ini --host 0.0.0.0 --port 8080 --models-max 1 --webui-mcp-proxy -t 4 --no-mmap
0.00.049.756 I cmn  common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.051.314 I srv   load_models: Loaded 0 cached model presets
0.00.052.552 I srv   load_models: Loaded 12 custom model presets from /home/nerd-dell/gguf_store/config.ini
0.00.052.856 I srv    operator(): Available models (12) (*: custom preset)
0.00.052.858 I srv    operator():   * Mellum2-12B-A2.5B-Thinking-Q8_0
0.00.052.858 I srv    operator():   * Nex-N2-mini-Q5_K_L
0.00.052.859 I srv    operator():   * Qwen3.5-9B-Q4_K_M
0.00.052.859 I srv    operator():   * Qwen3.6-27B-Q6_K
0.00.052.859 I srv    operator():   * Qwen3.6-27B-Q6_K_L
0.00.052.860 I srv    operator():   * Qwen3.6-35B-A3B-Q4_K_M
0.00.052.860 I srv    operator():   * Qwen3.6-35B-A3B-Q5_K_L
0.00.052.860 I srv    operator():   * Qwen3.6-35B-A3B-Q8_0
0.00.052.861 I srv    operator():   * default
0.00.052.861 I srv    operator():   * gemma-4-12b-it-Q8_0
0.00.052.861 I srv    operator():   * gemma-4-26B-A4B-it-Q8_0
0.00.052.862 I srv    operator():   * ornith-1.0-35b-Q5_K_L
0.00.053.034 W srv  llama_server: -----------------
0.00.053.035 W srv  llama_server: CORS proxy is enabled, do not expose server to untrusted environments
0.00.053.035 W srv  llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
0.00.053.036 W srv  llama_server: -----------------
0.00.053.043 I srv  llama_server: starting server in router mode. models will be automatically loaded on-demand
0.00.054.295 I srv  llama_server: listening on http://0.0.0.0:8080
0.00.054.297 W srv  llama_server: NOTE: router mode is experimental
0.00.054.298 W srv  llama_server:       it is not recommended to use this mode in untrusted environments
0.04.651.812 I srv  proxy_reques: proxying POST request to http://100.109.39.14:7722/mcp
0.04.663.799 I srv  proxy_reques: proxying POST request to http://100.109.39.14:7722/mcp
0.04.689.748 I srv  proxy_reques: proxying GET request to http://100.109.39.14:7722/mcp
0.04.691.141 I srv  proxy_reques: proxying POST request to http://100.109.39.14:7722/mcp
0.06.656.981 I srv          load: spawning server instance with name=ornith-1.0-35b-Q5_K_L on port 38957
0.06.657.043 I srv          load: spawning server instance with args:
0.06.657.044 I srv          load:   /home/nerd-dell/llama.cpp/build/bin/llama-server
0.06.657.045 I srv          load:   --host
0.06.657.046 I srv          load:   127.0.0.1
0.06.657.047 I srv          load:   --min-p
0.06.657.048 I srv          load:   0
0.06.657.049 I srv          load:   --no-mmap
0.06.657.050 I srv          load:   --port
0.06.657.050 I srv          load:   38957
0.06.657.051 I srv          load:   --presence-penalty
0.06.657.052 I srv          load:   1.5
0.06.657.053 I srv          load:   --sleep-idle-seconds
0.06.657.054 I srv          load:   5
0.06.657.055 I srv          load:   --temperature
0.06.657.056 I srv          load:   0.6
0.06.657.057 I srv          load:   --top-k
0.06.657.057 I srv          load:   20
0.06.657.058 I srv          load:   --top-p
0.06.657.059 I srv          load:   0.95
0.06.657.060 I srv          load:   --webui-mcp-proxy
0.06.657.061 I srv          load:   --alias
0.06.657.065 I srv          load:   ornith-1.0-35b-Q5_K_L
0.06.657.066 I srv          load:   --batch-size
0.06.657.072 I srv          load:   2048
0.06.657.076 I srv          load:   --ctx-size
0.06.657.077 I srv          load:   131072
0.06.657.077 I srv          load:   --flash-attn
0.06.657.078 I srv          load:   true
0.06.657.079 I srv          load:   --model
0.06.657.080 I srv          load:   /home/nerd-dell/gguf_store/deepreinforce-ai_Ornith-1.0-35B-Q5_K_L.gguf
0.06.657.081 I srv          load:   --parallel
0.06.657.082 I srv          load:   1
0.06.657.082 I srv          load:   --reasoning
0.06.657.083 I srv          load:   on
0.06.657.084 I srv          load:   --split-mode
0.06.657.085 I srv          load:   layer
0.06.657.085 I srv          load:   --threads
0.06.657.086 I srv          load:   4
0.06.657.087 I srv          load:   --ubatch-size
0.06.657.088 I srv          load:   512
[38957] 0.00.051.308 I cmn  common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
[38957] 0.00.051.973 W srv  llama_server: -----------------
[38957] 0.00.052.073 W srv  llama_server: CORS proxy is enabled, do not expose server to untrusted environments
[38957] 0.00.052.126 W srv  llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
[38957] 0.00.052.174 W srv  llama_server: -----------------
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.0}}
[38957] 0.00.053.598 I srv    load_model: loading model '/home/nerd-dell/gguf_store/deepreinforce-ai_Ornith-1.0-35B-Q5_K_L.gguf'
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.0}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.04217761009931564}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.09136982262134552}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.13931185007095337}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.1797194480895996}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.2262534499168396}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.2662183940410614}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.31179380416870117}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.35986006259918213}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.4002314507961273}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.4396010637283325}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.48717305064201355}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.5177286267280579}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.5425279140472412}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.5727134346961975}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.5959804058074951}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.6206554174423218}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.645082950592041}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.6771293878555298}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.7076850533485413}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.7310762405395508}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.7606538534164429}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.7850803732872009}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.815967321395874}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.8407665491104126}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.8651594519615173}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.8899587988853455}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.914758026599884}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.9395572543144226}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.9712387919425964}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":1.0}}
[38957] 0.23.037.140 I srv    load_model: initializing, n_slots = 1, n_ctx_slot = 131072, kv_unified = 'false'
[38957] 0.23.107.440 I srv          init: chat template supports preserving reasoning, consider enabling it via --reasoning-preserve
[38957] 0.23.107.478 I srv  llama_server: model loaded
[38957] 0.23.107.482 I srv  llama_server: listening on http://127.0.0.1:38957
[38957] cmd_child_to_router:state:{"state":"ready","payload":{"id":"ornith-1.0-35b-Q5_K_L","aliases":["ornith-1.0-35b-Q5_K_L"],"tags":[],"object":"model","created":1783094689,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx":131072,"n_ctx_train":262144,"n_embd":2048,"n_params":34660610688,"size":25320344064,"ftype":"Q5_K - Medium"}}}
[38957] 0.28.109.362 I que    start_loop: entering sleeping state
[38957] cmd_child_to_router:state:{"state":"sleeping","payload":null}
[38957] 0.28.109.597 I srv  handle_sleep: server is entering sleeping state
ROCM logs after sleep was initiated
nerd-dell@nerd-dell:~/llama.cpp$ rocm-smi --showmeminfo vram


============================ ROCm System Management Interface ============================
================================== Memory Usage (Bytes) ==================================
GPU[0]          : VRAM Total Memory (B): 8573157376
GPU[0]          : VRAM Total Used Memory (B): 376471552
GPU[1]          : VRAM Total Memory (B): 8573157376
GPU[1]          : VRAM Total Used Memory (B): 376496128
GPU[2]          : VRAM Total Memory (B): 8573157376
GPU[2]          : VRAM Total Used Memory (B): 376492032
GPU[3]          : VRAM Total Memory (B): 8573157376
GPU[3]          : VRAM Total Used Memory (B): 376483840
==========================================================================================
================================== End of ROCm SMI Log ===================================
nerd-dell@nerd-dell:~/llama.cpp$ rocm-smi --showpids


============================ ROCm System Management Interface ============================
===================================== KFD Processes ======================================
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
1322779 llama-server    0       0               0               0           
1322886 llama-server    4       1470791680      0               0           
==========================================================================================
================================== End of ROCm SMI Log ===================================
nerd-dell@nerd-dell:~/llama.cpp$ rocm-smi --showmeminfo vram --csv | awk -F, 'NR>1 && $1{printf "%s: %.0f MB used / %.0f MB total\n",$1,$3/1048576,$2/1048576}'
card0: 359 MB used / 8176 MB total
card1: 359 MB used / 8176 MB total
card2: 359 MB used / 8176 MB total
card3: 359 MB used / 8176 MB total

To confirm/sanity check myself:

nerd-dell@nerd-dell:~/llama.cpp$ git status
On branch xsn-ggml_backend_dev_reset
nothing to commit, working tree clean

@ngxson

ngxson commented Jul 3, 2026

Collaborator Author

@DEV-DUFORD unfortunately I have no experience working with hip/rocm, so I cannot help much. would appreciate if you can dig deeper to see if hipDeviceReset really works or if it's buggy

on CUDA, it seems to work fine for me. upon entering sleep, the llama-server completely disappears from nvtop

@ngxson

ngxson commented Jul 3, 2026

Collaborator Author

@DEV-DUFORD btw, can you try with one single device to narrow down if the problem is related to multi-gpu ?

@DEV-DUFORD

Contributor

> @DEV-DUFORD unfortunately I have no experience working with hip/rocm, so I cannot help much. would appreciate if you can dig deeper to see if hipDeviceReset really works or if it's buggy
>
> on CUDA, it seems to work fine for me. upon entering sleep, the llama-server completely disappears from nvtop

Yeah, will do! I'm still worried this is a HIP-level bug: in some isolated tests, the cards refused to clear fully even after I called hipDeviceReset :/

Also just confirmed, it's happening for me on a single card run as well as multi-gpu

@ServeurpersoCom

Contributor

I can confirm that during sleep there is no longer a CUDA context.

@ServeurpersoCom ServeurpersoCom left a comment

Contributor

This new behavior is better, and wakeup doesn't seem to take any longer
(The CI still needs to be checked)

Copilot stopped work on behalf of ngxson due to an error July 3, 2026 16:54
@ngxson ngxson closed this Jul 3, 2026
@ngxson

ngxson commented Jul 3, 2026

Collaborator Author

I hate github UI

@ngxson ngxson reopened this Jul 3, 2026
@ServeurpersoCom

ServeurpersoCom commented Jul 3, 2026

Contributor

I think it's something like a missing #define cudaDeviceReset musaDeviceReset in vendors/musa.h; I'll check

diff --git a/ggml/src/ggml-cuda/vendors/musa.h b/ggml/src/ggml-cuda/vendors/musa.h
index 6d725c7ec1..038d22f53e 100644
--- a/ggml/src/ggml-cuda/vendors/musa.h
+++ b/ggml/src/ggml-cuda/vendors/musa.h
@@ -43,6 +43,7 @@
 #define cudaDeviceEnablePeerAccess musaDeviceEnablePeerAccess
 #define cudaDeviceGetPCIBusId musaDeviceGetPCIBusId
 #define cudaDeviceProp musaDeviceProp
+#define cudaDeviceReset musaDeviceReset
 #define cudaDeviceSynchronize musaDeviceSynchronize
 #define cudaError_t musaError_t
 #define cudaErrorMemoryAllocation musaErrorMemoryAllocation
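
For readers unfamiliar with the vendor headers, the one-line fix above relies on macro aliasing: shared backend code is written once against the CUDA spellings, and a vendor header maps each spelling onto the vendor's runtime. The sketch below shows the mechanism in isolation. The `musa*` definitions here are fake stand-ins written for this example (not the real MUSA SDK), and `backend_reset_device` is a hypothetical name.

```cpp
#include <cassert>

// Stand-in for the vendor runtime (hypothetical, NOT the real MUSA SDK).
typedef int musaError_t;
static const musaError_t musaSuccess = 0;
static int g_reset_calls = 0;
static musaError_t musaDeviceReset() {
    ++g_reset_calls;          // record that the vendor function was reached
    return musaSuccess;
}

// What a vendor header does: map CUDA spellings onto the vendor's names.
#define cudaError_t     musaError_t
#define cudaSuccess     musaSuccess
#define cudaDeviceReset musaDeviceReset

// Shared backend code, written once against the CUDA names.
// Without the cudaDeviceReset alias above, this would fail to compile
// on the vendor build with an undeclared-identifier error.
static bool backend_reset_device() {
    cudaError_t err = cudaDeviceReset();
    return err == cudaSuccess;
}
```

A missing alias therefore shows up as a build failure only on that vendor's CI job, which matches the kind of breakage being guessed at here.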

@DEV-DUFORD

Contributor

@ngxson Doing some additional digging I got to this:

It seems that, on HIP, the Tensile kernel libraries are loaded on the first GEMM, and AMD exposes nothing short of killing the process that will clear them from the GPUs (I tried a few tricks: hipDeviceReset, hipFree, hipHostFree, all to no avail). We also can't use hipModuleUnload, as those .hsaco libs are handled internally, sad. :(

So I don't really know if there's a path forward on HIP devices to truly clear the VRAM on sleep, hmmmmmm. 🤔

@ServeurpersoCom

ServeurpersoCom commented Jul 3, 2026

Contributor

Worst case we could add an opt-in --sleep-exit: nothing dirty in llama-server, just an escape hatch for stubborn runtimes when the bug is not on our side.
The best move is to write a small minimal repro binary and open an issue on AMD's side https://github.com/ROCm/rocm-libraries/issues
If I had an AMD card here, one agentic loop and the repro binary would write itself: about 50 lines, hipMemGetInfo before/after hipDeviceReset, done...
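
A rough sketch of what such a repro could look like is below. It is untested on real hardware and makes several assumptions: device 0 is used, the 256 MiB allocation size is arbitrary, the `__HIPCC__` / `__HIP__` macros are used to detect a HIP build, and the GEMM that triggers the kernel-library load is deliberately left out (it would need rocBLAS or hipBLAS). The names `run_repro`, `print_used`, `to_mib` and `HIP_CHECK` are made up for this example; only the `hip*` calls are real HIP runtime API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Convert bytes to MiB for printing.
static double to_mib(size_t bytes) { return (double) bytes / (1024.0 * 1024.0); }

#if defined(__HIPCC__) || defined(__HIP__)
#include <hip/hip_runtime.h>

// Bail out of the enclosing int-returning function on any HIP error.
#define HIP_CHECK(call) do { hipError_t e_ = (call); if (e_ != hipSuccess) { \
    std::fprintf(stderr, "%s failed: %s\n", #call, hipGetErrorString(e_)); \
    return 1; } } while (0)

// Print device-wide used memory as seen by the HIP runtime.
static int print_used(const char * tag) {
    size_t free_b = 0, total_b = 0;
    HIP_CHECK(hipMemGetInfo(&free_b, &total_b));
    std::printf("%-22s used: %.0f MiB of %.0f MiB\n",
                tag, to_mib(total_b - free_b), to_mib(total_b));
    return 0;
}

int run_repro() {
    HIP_CHECK(hipSetDevice(0));
    if (print_used("after context init")) return 1;

    void * buf = nullptr;
    HIP_CHECK(hipMalloc(&buf, 256u * 1024u * 1024u));
    // A rocBLAS/hipBLAS GEMM would go here to trigger the kernel-library
    // load discussed above; it is omitted from this sketch.
    if (print_used("after hipMalloc")) return 1;

    HIP_CHECK(hipFree(buf));
    HIP_CHECK(hipDeviceReset());
    if (print_used("after hipDeviceReset")) return 1;
    return 0;
}
#else
// Fallback so the file still builds with a plain C++ compiler.
int run_repro() {
    std::puts("built without a HIP compiler: nothing to measure");
    return 0;
}
#endif
```

Calling `run_repro()` from `main` and building with hipcc would give the three readings. One caveat: querying memory after the reset may itself re-initialize the runtime, so an external `rocm-smi --showmeminfo vram` reading taken while the process is paused after the reset is probably the more telling number.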

@ngxson

ngxson commented Jul 3, 2026

Collaborator Author

the main reasons why I don't want exit-on-sleep are:

  • it resets the server's metrics
  • it no longer means "sleep", i.e. not the same "sleep" as what vllm has
  • it will make partial sleep (i.e. custom unloading of weights / kv / etc) tricky in the future

so exiting the process is really the last-ditch resolution. I think for now, the better steps that we can take are:

  1. write an issue to upstream ROCm as @ServeurpersoCom suggested
  2. in the meantime, you can have a wrapper around llama-server that automatically triggers a model unload when it goes to sleep - IMO this is something easy that an AI agent can write in less than 10 minutes
  3. while the upstream issue is not fixed, we could also investigate whether we can extract the raw pointer to Tensile or dlclose it from ggml
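
As a sketch of the wrapper idea in step 2: the core of such a wrapper is watching the server's log stream for the sleep message and firing a hook. The marker string below is copied from the `handle_sleep` log line quoted earlier in this thread; everything else (`watch_for_sleep`, the callback, reading from a stream) is an assumption made for illustration, and the actual unload request is left out because its endpoint is not shown in this conversation.

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <sstream>
#include <string>

// Marker taken from the "handle_sleep" log line quoted earlier in the thread.
// The shorter "entering sleeping state" would also match the start_loop line
// and fire twice per sleep.
static const char * const k_sleep_marker = "server is entering sleeping state";

// Reads log lines from `in` until EOF and calls `on_sleep` once per line
// containing the marker. Returns how many times the hook fired.
static int watch_for_sleep(std::istream & in, const std::function<void()> & on_sleep) {
    assert(on_sleep);  // an empty callback would throw when invoked
    int fired = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(k_sleep_marker) != std::string::npos) {
            on_sleep();
            ++fired;
        }
    }
    return fired;
}
```

In practice the wrapper would read the server's output (for example via `std::cin` when the logs are piped in) and the hook would ask the router to unload the model or restart the child process. Matching on log text is fragile, so a state notification from the server itself would be a sturdier trigger if one is available.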

cc'ing @IMbackK if you have any thoughts on this issue


Labels

AMD ZenDNN Issues related to the AMD ZenDNN backend Apple Metal https://en.wikipedia.org/wiki/Metal_(API) Ascend NPU issues specific to Ascend NPUs CUDA Related to the CUDA backend ggml changes relating to the ggml tensor library for machine learning Hexagon IBM zDNN issues specific to IBM zDNN Accelerator OpenCL Issues specific to the OpenCL backend OpenVINO server SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Vulkan Issues specific to the Vulkan backend WebGPU


3 participants