ggml, server: add ggml_backend_dev_reset() for sleep mode by ngxson · Pull Request #25271 · ggml-org/llama.cpp

ngxson · 2026-07-03T15:41:45Z

Overview

Add ggml_backend_dev_reset(), which maps to cudaDeviceReset / hipDeviceReset.

Only CUDA and HIP are handled; OTHER BACKENDS ARE UNIMPLEMENTED, so please skip them when reviewing. Feel free to push a PR to add this feature to other backends.

Tested on CUDA: llama-server -hf unsloth/Qwen3.5-4B-MTP-GGUF:Q4_K_M --sleep-idle-seconds 5

Before this change: nvtop reports ~158MB in use in sleep mode
With this change: nvtop reports 0 memory usage in sleep mode
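
For a scriptable version of the same before/after check, per-process GPU memory can be read from nvidia-smi rather than nvtop. The sketch below is an illustration only and is not part of this PR: it assumes nvidia-smi is on PATH and that its `--query-compute-apps=pid,used_memory` report (values in MiB with `nounits`) is an acceptable stand-in for what nvtop shows. The helper names `parse_compute_apps` and `query_compute_apps` are hypothetical.

```python
import csv
import io
import subprocess


def parse_compute_apps(csv_text: str) -> dict[int, int]:
    """Map PID -> GPU memory in MiB, summed across GPUs, from nvidia-smi CSV output."""
    usage: dict[int, int] = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank lines
        pid, used = row[0].strip(), row[1].strip()
        if not pid.isdigit() or not used.isdigit():
            continue  # skip rows whose values are not plain numbers
        usage[int(pid)] = usage.get(int(pid), 0) + int(used)
    return usage


def query_compute_apps() -> dict[int, int]:
    """Run nvidia-smi and parse its per-process memory report."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_compute_apps(out)
```

A process that no longer holds a CUDA context should simply be absent from the returned mapping, which would match the "disappears from nvtop" observation.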

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: the core changes were written by me; AI was used to add the nullptr stubs for other backends

ServeurpersoCom · 2026-07-03T15:45:37Z

I'm testing it on my server

DEV-DUFORD · 2026-07-03T16:12:37Z

@ngxson HIP resources are still held after sleep is initiated. I've attached some logs to help with triage:

ROCm 6.3.3 with gfx900

llama-server logs (important part is to know it entered sleep mode)

nerd-dell@nerd-dell:~/llama.cpp$ cd ~/llama.cpp/ && GGML_CUDA_ALLREDUCE=none ./build/bin/llama-server --models-preset ~/gguf_store/config.ini --host 0.0.0.0 --port 8080 --models-max 1 --webui-mcp-proxy -t 4 --no-mmap
0.00.049.756 I cmn  common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.051.314 I srv   load_models: Loaded 0 cached model presets
0.00.052.552 I srv   load_models: Loaded 12 custom model presets from /home/nerd-dell/gguf_store/config.ini
0.00.052.856 I srv    operator(): Available models (12) (*: custom preset)
0.00.052.858 I srv    operator():   * Mellum2-12B-A2.5B-Thinking-Q8_0
0.00.052.858 I srv    operator():   * Nex-N2-mini-Q5_K_L
0.00.052.859 I srv    operator():   * Qwen3.5-9B-Q4_K_M
0.00.052.859 I srv    operator():   * Qwen3.6-27B-Q6_K
0.00.052.859 I srv    operator():   * Qwen3.6-27B-Q6_K_L
0.00.052.860 I srv    operator():   * Qwen3.6-35B-A3B-Q4_K_M
0.00.052.860 I srv    operator():   * Qwen3.6-35B-A3B-Q5_K_L
0.00.052.860 I srv    operator():   * Qwen3.6-35B-A3B-Q8_0
0.00.052.861 I srv    operator():   * default
0.00.052.861 I srv    operator():   * gemma-4-12b-it-Q8_0
0.00.052.861 I srv    operator():   * gemma-4-26B-A4B-it-Q8_0
0.00.052.862 I srv    operator():   * ornith-1.0-35b-Q5_K_L
0.00.053.034 W srv  llama_server: -----------------
0.00.053.035 W srv  llama_server: CORS proxy is enabled, do not expose server to untrusted environments
0.00.053.035 W srv  llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
0.00.053.036 W srv  llama_server: -----------------
0.00.053.043 I srv  llama_server: starting server in router mode. models will be automatically loaded on-demand
0.00.054.295 I srv  llama_server: listening on http://0.0.0.0:8080
0.00.054.297 W srv  llama_server: NOTE: router mode is experimental
0.00.054.298 W srv  llama_server:       it is not recommended to use this mode in untrusted environments
0.04.651.812 I srv  proxy_reques: proxying POST request to http://100.109.39.14:7722/mcp
0.04.663.799 I srv  proxy_reques: proxying POST request to http://100.109.39.14:7722/mcp
0.04.689.748 I srv  proxy_reques: proxying GET request to http://100.109.39.14:7722/mcp
0.04.691.141 I srv  proxy_reques: proxying POST request to http://100.109.39.14:7722/mcp
0.06.656.981 I srv          load: spawning server instance with name=ornith-1.0-35b-Q5_K_L on port 38957
0.06.657.043 I srv          load: spawning server instance with args:
0.06.657.044 I srv          load:   /home/nerd-dell/llama.cpp/build/bin/llama-server
0.06.657.045 I srv          load:   --host
0.06.657.046 I srv          load:   127.0.0.1
0.06.657.047 I srv          load:   --min-p
0.06.657.048 I srv          load:   0
0.06.657.049 I srv          load:   --no-mmap
0.06.657.050 I srv          load:   --port
0.06.657.050 I srv          load:   38957
0.06.657.051 I srv          load:   --presence-penalty
0.06.657.052 I srv          load:   1.5
0.06.657.053 I srv          load:   --sleep-idle-seconds
0.06.657.054 I srv          load:   5
0.06.657.055 I srv          load:   --temperature
0.06.657.056 I srv          load:   0.6
0.06.657.057 I srv          load:   --top-k
0.06.657.057 I srv          load:   20
0.06.657.058 I srv          load:   --top-p
0.06.657.059 I srv          load:   0.95
0.06.657.060 I srv          load:   --webui-mcp-proxy
0.06.657.061 I srv          load:   --alias
0.06.657.065 I srv          load:   ornith-1.0-35b-Q5_K_L
0.06.657.066 I srv          load:   --batch-size
0.06.657.072 I srv          load:   2048
0.06.657.076 I srv          load:   --ctx-size
0.06.657.077 I srv          load:   131072
0.06.657.077 I srv          load:   --flash-attn
0.06.657.078 I srv          load:   true
0.06.657.079 I srv          load:   --model
0.06.657.080 I srv          load:   /home/nerd-dell/gguf_store/deepreinforce-ai_Ornith-1.0-35B-Q5_K_L.gguf
0.06.657.081 I srv          load:   --parallel
0.06.657.082 I srv          load:   1
0.06.657.082 I srv          load:   --reasoning
0.06.657.083 I srv          load:   on
0.06.657.084 I srv          load:   --split-mode
0.06.657.085 I srv          load:   layer
0.06.657.085 I srv          load:   --threads
0.06.657.086 I srv          load:   4
0.06.657.087 I srv          load:   --ubatch-size
0.06.657.088 I srv          load:   512
[38957] 0.00.051.308 I cmn  common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
[38957] 0.00.051.973 W srv  llama_server: -----------------
[38957] 0.00.052.073 W srv  llama_server: CORS proxy is enabled, do not expose server to untrusted environments
[38957] 0.00.052.126 W srv  llama_server: This feature is EXPERIMENTAL and may be removed or changed in future versions
[38957] 0.00.052.174 W srv  llama_server: -----------------
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.0}}
[38957] 0.00.053.598 I srv    load_model: loading model '/home/nerd-dell/gguf_store/deepreinforce-ai_Ornith-1.0-35B-Q5_K_L.gguf'
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.0}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.04217761009931564}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.09136982262134552}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.13931185007095337}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.1797194480895996}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.2262534499168396}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.2662183940410614}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.31179380416870117}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.35986006259918213}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.4002314507961273}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.4396010637283325}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.48717305064201355}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.5177286267280579}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.5425279140472412}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.5727134346961975}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.5959804058074951}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.6206554174423218}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.645082950592041}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.6771293878555298}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.7076850533485413}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.7310762405395508}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.7606538534164429}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.7850803732872009}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.815967321395874}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.8407665491104126}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.8651594519615173}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.8899587988853455}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.914758026599884}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.9395572543144226}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":0.9712387919425964}}
[38957] cmd_child_to_router:state:{"state":"loading","payload":{"stages":["text_model"],"current":"text_model","value":1.0}}
[38957] 0.23.037.140 I srv    load_model: initializing, n_slots = 1, n_ctx_slot = 131072, kv_unified = 'false'
[38957] 0.23.107.440 I srv          init: chat template supports preserving reasoning, consider enabling it via --reasoning-preserve
[38957] 0.23.107.478 I srv  llama_server: model loaded
[38957] 0.23.107.482 I srv  llama_server: listening on http://127.0.0.1:38957
[38957] cmd_child_to_router:state:{"state":"ready","payload":{"id":"ornith-1.0-35b-Q5_K_L","aliases":["ornith-1.0-35b-Q5_K_L"],"tags":[],"object":"model","created":1783094689,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx":131072,"n_ctx_train":262144,"n_embd":2048,"n_params":34660610688,"size":25320344064,"ftype":"Q5_K - Medium"}}}
[38957] 0.28.109.362 I que    start_loop: entering sleeping state
[38957] cmd_child_to_router:state:{"state":"sleeping","payload":null}
[38957] 0.28.109.597 I srv  handle_sleep: server is entering sleeping state

ROCM logs after sleep was initiated

nerd-dell@nerd-dell:~/llama.cpp$ rocm-smi --showmeminfo vram


============================ ROCm System Management Interface ============================
================================== Memory Usage (Bytes) ==================================
GPU[0]          : VRAM Total Memory (B): 8573157376
GPU[0]          : VRAM Total Used Memory (B): 376471552
GPU[1]          : VRAM Total Memory (B): 8573157376
GPU[1]          : VRAM Total Used Memory (B): 376496128
GPU[2]          : VRAM Total Memory (B): 8573157376
GPU[2]          : VRAM Total Used Memory (B): 376492032
GPU[3]          : VRAM Total Memory (B): 8573157376
GPU[3]          : VRAM Total Used Memory (B): 376483840
==========================================================================================
================================== End of ROCm SMI Log ===================================
nerd-dell@nerd-dell:~/llama.cpp$ rocm-smi --showpids


============================ ROCm System Management Interface ============================
===================================== KFD Processes ======================================
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
1322779 llama-server    0       0               0               0           
1322886 llama-server    4       1470791680      0               0           
==========================================================================================
================================== End of ROCm SMI Log ===================================
nerd-dell@nerd-dell:~/llama.cpp$ rocm-smi --showmeminfo vram --csv | awk -F, 'NR>1 && $1{printf "%s: %.0f MB used / %.0f MB total\n",$1,$3/1048576,$2/1048576}'
card0: 359 MB used / 8176 MB total
card1: 359 MB used / 8176 MB total
card2: 359 MB used / 8176 MB total
card3: 359 MB used / 8176 MB total
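
The awk one-liner above can also be written as a small Python helper, which is easier to reuse in a test script. This is a sketch under one assumption: the CSV columns are device, total bytes, used bytes, in that order, as implied by the `$1`, `$2`, `$3` usage in the awk command. The function name `summarize_vram` is hypothetical.

```python
import csv
import io

MIB = 1048576  # same divisor as the awk one-liner


def summarize_vram(csv_text: str) -> list[str]:
    """Format each data row as '<device>: <used> MB used / <total> MB total'."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Ignore the header and blank lines: keep only rows with numeric byte counts.
        if len(row) < 3 or not row[1].strip().isdigit() or not row[2].strip().isdigit():
            continue
        device, total, used = row[0], int(row[1]), int(row[2])
        lines.append(f"{device}: {used / MIB:.0f} MB used / {total / MIB:.0f} MB total")
    return lines
```

Feeding it the output of `rocm-smi --showmeminfo vram --csv` should reproduce the per-card lines shown above.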

To confirm/sanity check myself:

nerd-dell@nerd-dell:~/llama.cpp$ git status
On branch xsn-ggml_backend_dev_reset
nothing to commit, working tree clean

ngxson · 2026-07-03T16:20:03Z

@DEV-DUFORD unfortunately I have no experience working with hip/rocm, so I cannot help much. would appreciate if you can dig deeper to see if hipDeviceReset really works or if it's buggy

on CUDA, it seems to work fine for me. upon entering sleep, the llama-server completely disappears from nvtop

ngxson · 2026-07-03T16:26:15Z

@DEV-DUFORD btw, can you try with a single device to narrow down whether the problem is related to multi-GPU?

DEV-DUFORD · 2026-07-03T16:35:05Z

> @DEV-DUFORD unfortunately I have no experience working with hip/rocm, so I cannot help much. would appreciate if you can dig deeper to see if hipDeviceReset really works or if it's buggy
>
> on CUDA, it seems to work fine for me. upon entering sleep, the llama-server completely disappears from nvtop

Yeah, will do! I'm still worried this is a HIP-level bug: in some isolated tests, the cards refused to clear fully after I called hipDeviceReset :/

Also just confirmed, it's happening for me on a single card run as well as multi-gpu

ServeurpersoCom · 2026-07-03T16:42:02Z

I can confirm that during sleep: there is no longer a CUDA context.

ServeurpersoCom

This new behavior is better, and wakeup does not seem to take any longer.
(The CI still needs to be checked.)

ngxson · 2026-07-03T16:55:27Z

I hate github UI

ServeurpersoCom · 2026-07-03T16:56:45Z

I think a #define cudaDeviceReset musaDeviceReset is missing in vendors/musa.h. Let me check.

diff --git a/ggml/src/ggml-cuda/vendors/musa.h b/ggml/src/ggml-cuda/vendors/musa.h
index 6d725c7ec1..038d22f53e 100644
--- a/ggml/src/ggml-cuda/vendors/musa.h
+++ b/ggml/src/ggml-cuda/vendors/musa.h
@@ -43,6 +43,7 @@
 #define cudaDeviceEnablePeerAccess musaDeviceEnablePeerAccess
 #define cudaDeviceGetPCIBusId musaDeviceGetPCIBusId
 #define cudaDeviceProp musaDeviceProp
+#define cudaDeviceReset musaDeviceReset
 #define cudaDeviceSynchronize musaDeviceSynchronize
 #define cudaError_t musaError_t
 #define cudaErrorMemoryAllocation musaErrorMemoryAllocation

DEV-DUFORD · 2026-07-03T17:58:11Z

@ngxson Doing some additional digging I got to this:

It seems that on HIP the Tensile kernel libraries are loaded on the first GEMM, and AMD exposes nothing short of killing the process that will clear them from the GPUs (I tried a few tricks: hipDeviceReset, hipFree, hipHostFree, all to no avail). We also can't use hipModuleUnload, since those .hsaco libs are handled internally. Sad. :(

So I don't really know if there's a path forward on HIP devices to truly clear the VRAM on sleep, hmmmmmm. 🤔

ServeurpersoCom · 2026-07-03T18:46:33Z

Worst case we could add an opt-in --sleep-exit: nothing dirty in llama-server, just an escape hatch for stubborn runtimes when the bug is not on our side.
The best move is to write a small minimal repro binary and open an issue on AMD's side https://github.com/ROCm/rocm-libraries/issues
If I had an AMD card here, one agentic loop and the repro binary would write itself: about 50 lines, hipMemGetInfo before/after hipDeviceReset, done...
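
The measurement logic of such a repro can be sketched independently of HIP. The snippet below is an illustration and not the proposed binary: `measure_reset` and its three callables are hypothetical names. In a real repro, `free_bytes` would wrap hipMemGetInfo, `do_work` would run a single GEMM (the point at which the Tensile libraries are reportedly loaded), and `reset` would call hipDeviceReset.

```python
from typing import Callable


def measure_reset(
    free_bytes: Callable[[], int],
    do_work: Callable[[], None],
    reset: Callable[[], None],
) -> dict[str, int]:
    """Record free device memory before work, after work, and after a reset."""
    baseline = free_bytes()
    do_work()
    after_work = free_bytes()
    reset()
    after_reset = free_bytes()
    return {
        "baseline": baseline,
        "after_work": after_work,
        "after_reset": after_reset,
        # Positive means memory taken during the work was not given back.
        "not_released": baseline - after_reset,
    }
```

Note that free memory is a device-wide figure, so other processes using the same GPU during the run would skew the result; a dedicated, otherwise idle card makes the numbers easier to interpret.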

ngxson · 2026-07-03T20:27:32Z

The main reasons I don't want exit-on-sleep are:

it resets the server's metrics
it no longer means "sleep", i.e. it is not the same "sleep" as what vLLM has
it will make partial sleep (custom unloading of weights, KV cache, etc.) tricky in the future

So exiting the process is really the last-ditch resolution. I think for now, the better steps we can take are:

open an issue upstream with ROCm, as @ServeurpersoCom suggested
in the meantime, you can have a wrapper around llama-server that automatically triggers a model unload when it goes to sleep. IMO this is something easy that an AI agent can write in less than 10 minutes
while the upstream issue is not fixed, we could also investigate whether we can extract the raw pointer to Tensile or dlclose it from ggml
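
One possible starting point for the wrapper idea is to watch the router's log output for the state messages that appear in the logs earlier in this thread. This is a minimal sketch, assuming the `[port] cmd_child_to_router:state:{...}` line format stays as shown there; the names `STATE_LINE` and `sleeping_instances` are hypothetical, and the action taken on sleep (for example asking the router to unload the model) is deliberately left to the caller.

```python
import json
import re
from typing import Iterable, Iterator

# Matches router-mode child lines such as:
#   [38957] cmd_child_to_router:state:{"state":"sleeping","payload":null}
STATE_LINE = re.compile(r"^\[(\d+)\] cmd_child_to_router:state:(\{.*\})\s*$")


def sleeping_instances(lines: Iterable[str]) -> Iterator[int]:
    """Yield the bracketed instance port each time a child reports the sleeping state."""
    for line in lines:
        match = STATE_LINE.match(line)
        if match is None:
            continue
        try:
            message = json.loads(match.group(2))
        except json.JSONDecodeError:
            continue  # ignore truncated or interleaved lines
        if message.get("state") == "sleeping":
            yield int(match.group(1))
```

A wrapper could start llama-server with `subprocess.Popen(..., stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)`, pass `proc.stdout` to `sleeping_instances`, and trigger the unload for each port it yields.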

cc'ing @IMbackK if you have any thoughts on this issue

ngxson added 3 commits July 3, 2026 17:25

add ggml_backend_dev_reset() for sleep mode

893b45a

add hipDeviceReset()

2acfc61

unimpl for other backend

602ceac

ngxson requested review from a team, IMbackK, JohannesGaessler and ggerganov as code owners July 3, 2026 15:41

ngxson removed the WebGPU label Jul 3, 2026

ngxson added 2 commits July 3, 2026 17:47

use CUDA_CHECK

8fb4063

nits

74dc167

ServeurpersoCom approved these changes Jul 3, 2026


Copilot started work on behalf of ngxson July 3, 2026 16:54 View session

Copilot stopped work on behalf of ngxson due to an error July 3, 2026 16:54
The session was cancelled by the user.

ngxson closed this Jul 3, 2026

ngxson reopened this Jul 3, 2026

musaDeviceReset

ff9e767

ngxson wants to merge 6 commits into master from xsn/ggml_backend_dev_reset

3 participants
