Qwen3.5 slow with CUDA on 1.109.2 and Win11 #2014

@dombach

Describe the Issue
Using Qwen3.5 models via CUDA on version 1.109.2 is significantly slower than on version 1.108.2 (Windows 11). Performance via Vulkan, however, remains fast and is, strangely, even faster than CUDA.

Additional Information:
Windows 11, Ryzen 8700G, Nvidia RTX3090
All benchmarks translate the same text, roughly 5000 tokens long. PP is prompt processing speed and TG is token generation speed.

Mistral Small 24B behaves as expected: Vulkan is a bit slower than CUDA, and CUDA speed is unchanged between versions.
1.108.2 CUDA
PP 1713 TG 40

1.109.2 CUDA
PP 1758 TG 40
1.109.2 Vulkan
PP 1593 TG 37

Qwen3.5-35B-A3B-Q4_K_M
1.108.2 CUDA
PP 2200 TG 76
1.108.2 Vulkan
PP 2541 TG 88

1.109.2 CUDA
PP 2131 TG 51 (TG is much lower)
1.109.2 Vulkan
PP 2534 TG 87

Qwen3.5-9B-Q5_K_M
1.108.2 CUDA
PP 2992 TG 67

1.109.2 CUDA
PP 3173 TG 52 (TG is much lower)
1.109.2 Vulkan
PP 2895 TG 70

I get similar results on a second computer (Windows 11, Ryzen 5900X, Nvidia RTX3090).
Token generation in particular is slower with the Qwen3.5 models on CUDA in the 1.109 versions.
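
To put a number on the regression, here is a small sketch that computes the relative change in CUDA TG between 1.108.2 and 1.109.2 from the figures above. It assumes the TG values are tokens per second; the names `tg_cuda` and `pct_change` are made up for this illustration and are not part of any tool.

```python
# CUDA TG figures copied from the benchmarks above: (1.108.2, 1.109.2).
# Assumption: the values are tokens per second.
tg_cuda = {
    "Mistral Small 24B": (40, 40),
    "Qwen3.5-35B-A3B-Q4_K_M": (76, 51),
    "Qwen3.5-9B-Q5_K_M": (67, 52),
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


changes = {model: pct_change(old, new) for model, (old, new) in tg_cuda.items()}

for model, change in changes.items():
    print(f"{model}: {change:+.1f}%")

# Prints:
# Mistral Small 24B: +0.0%
# Qwen3.5-35B-A3B-Q4_K_M: -32.9%
# Qwen3.5-9B-Q5_K_M: -22.4%
```

So CUDA token generation drops by roughly a third on the 35B model and by a bit more than a fifth on the 9B model, while Mistral Small 24B is unchanged.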