Possible regression in 1.113.x: cublasSgemm_v2 crash with --quantkv q8_0 on RTX 5080 #2222

Hardware: RTX 5080 (16GB, compute capability 12.0), Ryzen 9800X3D, Windows.
Model: TheDrummer Artemis-31B-v1h-Q4_K_M (Gemma 4 31B finetune).
Trigger: with --quantkv q8_0, 1.113.x crashes in cublasSgemm_v2 during prompt processing once the prompt exceeds roughly 3K tokens. Removing --quantkv (so the KV cache defaults to f16) makes the crash disappear.
Versions tested:

1.111.2 with q8_0 → works
1.113.2 with q8_0 → crashes (see below)
1.113.2 without quantkv (f16) → works

Crash:
Processing Prompt [BATCH] (4096 / 30609 tokens)CUDA error: an unsupported value or parameter was passed to the function
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at ggml-cuda.cu:1775
  cublasSgemm_v2(...)

Release notes for 1.113 mention "Fixed q5_1 kv type not using the GPU correctly in CUDA", which suggests that recent work on KV cache quantization may have regressed q8_0, possibly only on Blackwell GPUs.
Other settings: --splitmode layer, SWA (sliding window attention), FA (flash attention) auto-enabled, 32K context, batch size 512.
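
For anyone triaging similar reports, here is a minimal sketch of pulling the failing location out of crash output like the one above, so logs from different versions can be compared quickly. The helper name `parse_cuda_failure` and the `CudaFailure` type are made up for illustration and are not part of any project; the regular expressions assume the log keeps the same layout as the excerpt in this report.

```python
import re
from typing import NamedTuple, Optional


class CudaFailure(NamedTuple):
    """Fields recovered from a CUDA error report (hypothetical helper type)."""
    message: str
    device: int
    function: str
    source_file: str
    line: int


# Assumption: the error text follows "CUDA error: " and runs to the end of the line.
_ERROR_RE = re.compile(r"CUDA error: (?P<message>.+)")
# Assumption: the location line looks like
# "current device: 0, in function <name> at <file>:<line>".
_LOCATION_RE = re.compile(
    r"current device: (?P<device>\d+), in function (?P<function>\w+) "
    r"at (?P<file>[\w.\-/\\]+):(?P<line>\d+)"
)


def parse_cuda_failure(log: str) -> Optional[CudaFailure]:
    """Return the parsed failure, or None if the log has no CUDA error report."""
    error = _ERROR_RE.search(log)
    location = _LOCATION_RE.search(log)
    if error is None or location is None:
        return None
    return CudaFailure(
        message=error.group("message").strip(),
        device=int(location.group("device")),
        function=location.group("function"),
        source_file=location.group("file"),
        line=int(location.group("line")),
    )


# The crash excerpt from this report, copied verbatim.
SAMPLE_LOG = """Processing Prompt [BATCH] (4096 / 30609 tokens)CUDA error: an unsupported value or parameter was passed to the function
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at ggml-cuda.cu:1775
  cublasSgemm_v2(...)
"""

failure = parse_cuda_failure(SAMPLE_LOG)
```

Running the same helper over logs from 1.111.2 and 1.113.2 would show whether the failing function and source line are identical across runs, which helps when narrowing down the regression.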
