Possible regression in 1.113.x: cublasSgemm_v2 crash with --quantkv q8_0 on RTX 5080 #2222

@diesalher

Description

Hardware: RTX 5080 (16GB, compute capability 12.0), Ryzen 9800X3D, Windows.
Model: TheDrummer Artemis-31B-v1h-Q4_K_M (Gemma 4 31B finetune).
Trigger: on 1.113.x, --quantkv q8_0 causes a crash in cublasSgemm_v2 during prompt processing once the prompt is longer than roughly 3K tokens. Removing --quantkv (so the KV cache defaults to f16) makes the crash disappear.
Versions tested:

1.111.2 with q8_0 → works
1.113.2 with q8_0 → crashes (see below)
1.113.2 without quantkv (f16) → works
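
Since 1.111.2 works and 1.113.2 crashes with the same flags, one way to narrow this down would be to bisect the releases in between. Below is a minimal sketch of that in Python. The names first_bad and is_bad are hypothetical helpers, not part of the project, and the list of intermediate releases has to be filled in by whoever runs it. It assumes the list is ordered oldest to newest, starts with a known-good build, ends with a known-bad build, and that the crash persists once introduced.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() returns True.

    Assumptions (not verified by this function):
    - versions is ordered oldest to newest and has at least two entries,
    - versions[0] is known good and versions[-1] is known bad,
    - once the failure appears it stays in every later version.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # failure already present at mid, look earlier
        else:
            lo = mid  # mid still works, look later
    return versions[hi]


if __name__ == "__main__":
    # Only the two endpoints come from this report; any releases between
    # them would need to be inserted here before running the bisection.
    tested = ["1.111.2", "1.113.2"]

    def is_bad(version: str) -> bool:
        # Hypothetical manual check: run the build with --quantkv q8_0 and
        # a prompt longer than ~3K tokens, then answer whether it crashed.
        answer = input(f"Does {version} crash with q8_0? [y/n] ")
        return answer.strip().lower().startswith("y")

    print("First crashing version:", first_bad(tested, is_bad))
```

With only the two endpoints listed, the function returns 1.113.2 without asking anything. Each extra release added between them costs at most one more manual test per halving step.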

Crash:
Processing Prompt [BATCH] (4096 / 30609 tokens)
CUDA error: an unsupported value or parameter was passed to the function
current device: 0, in function ggml_cuda_op_mul_mat_cublas at ggml-cuda.cu:1775
cublasSgemm_v2(...)
Release notes for 1.113 mention "Fixed q5_1 kv type not using the GPU correctly in CUDA". That suggests recent work on KV cache quantization, which may have regressed q8_0, possibly only on Blackwell GPUs (I have not tested other architectures).
Other flags: --splitmode layer, SWA, FA auto-enabled, 32K context, batch 512.
