Hardware: RTX 5080 (16GB, compute capability 12.0), Ryzen 9800X3D, Windows.
Model: TheDrummer Artemis-31B-v1h-Q4_K_M (Gemma 4 31B finetune).
Trigger: --quantkv q8_0 causes a crash in cublasSgemm_v2 during prompt processing on 1.113.x when the prompt is longer than roughly 3K tokens. Removing --quantkv (so the KV cache defaults to f16) makes the issue disappear.
Versions tested:
1.111.2 with q8_0 → works
1.113.2 with q8_0 → crashes (see below)
1.113.2 without quantkv (f16) → works
Crash:
Processing Prompt [BATCH] (4096 / 30609 tokens)
CUDA error: an unsupported value or parameter was passed to the function
current device: 0, in function ggml_cuda_op_mul_mat_cublas at ggml-cuda.cu:1775
cublasSgemm_v2(...)
The release notes for 1.113 mention "Fixed q5_1 kv type not using the GPU correctly in CUDA". That suggests recent work on KV cache quantization may have regressed q8_0, possibly only on Blackwell GPUs (I have only tested on this one card).
Other flags: --splitmode layer, SWA, FA auto-enabled, 32K context, batch 512.
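In case it helps with reproduction, here is a rough sketch of a script that sends prompts of increasing length to a running instance. It assumes the server is listening on the default http://localhost:5001 and uses the /api/v1/generate endpoint; the helper names (build_payload, send) are mine, and the "one word is about one token" estimate is only an approximation that depends on the tokenizer.

```python
import json
import urllib.request

# Assumption: KoboldCpp is running locally on its default port.
API_URL = "http://localhost:5001/api/v1/generate"


def build_payload(approx_tokens, max_context=32768):
    """Build a generate request whose prompt is roughly `approx_tokens` long.

    Assumption: one short word maps to about one token; the real count
    depends on the model's tokenizer.
    """
    prompt = " ".join(["hello"] * approx_tokens)
    return {
        "prompt": prompt,
        "max_length": 8,  # only prompt processing matters here
        "max_context_length": max_context,
    }


def send(payload, url=API_URL, timeout=600):
    """POST the payload as JSON and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Step through prompt sizes around the suspected ~3K threshold.
    for n in (2000, 3000, 4000, 8000):
        print(n, "approx tokens ->", end=" ", flush=True)
        send(build_payload(n))
        print("ok")
```

Since the prompts share a prefix, cached context may reduce how much is actually reprocessed on later requests, so restarting the server between sizes probably gives a cleaner result. When the crash hits, the server process dies and the request fails with a connection error.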