Describe the Issue
Using Qwen3.5 models via CUDA on version 1.109.2 is significantly slower than on version 1.108.2 (Windows 11). Performance via Vulkan, however, remains fast, and is strangely even faster than CUDA.
Additional Information:
Windows 11, Ryzen 8700G, NVIDIA RTX 3090
All results were measured by translating the same text of roughly 5,000 tokens (PP = prompt processing, TG = token generation, in tokens/s).
Mistral Small 24B (behaves as expected: Vulkan is slightly slower than CUDA):
1.108.2 CUDA:   PP 1713, TG 40
1.109.2 CUDA:   PP 1758, TG 40
1.109.2 Vulkan: PP 1593, TG 37
Qwen3.5-35B-A3B-Q4_K_M:
1.108.2 CUDA:   PP 2200, TG 76
1.108.2 Vulkan: PP 2541, TG 88
1.109.2 CUDA:   PP 2131, TG 51 (TG is much lower)
1.109.2 Vulkan: PP 2534, TG 87
Qwen3.5-9B-Q5_K_M:
1.108.2 CUDA:   PP 2992, TG 67
1.109.2 CUDA:   PP 3173, TG 52 (TG is much lower)
1.109.2 Vulkan: PP 2895, TG 70
I see similar results on a second computer (Windows 11, Ryzen 5900X, NVIDIA RTX 3090). Token generation in particular is slower with the Qwen3.5 models on CUDA in the 1.109 builds, while prompt processing is largely unaffected.
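To put a number on the regression, the CUDA TG figures reported above can be compared directly between the two versions; a minimal sketch using only the values from this report:

```python
# Quantify the CUDA token-generation (TG) drop between 1.108.2 and
# 1.109.2, using the benchmark numbers reported in this issue.
def pct_drop(old: float, new: float) -> float:
    """Percentage decrease from old to new, rounded to one decimal."""
    return round((old - new) / old * 100, 1)

# Qwen3.5-35B-A3B-Q4_K_M, CUDA: TG 76 (1.108.2) -> 51 (1.109.2)
print(pct_drop(76, 51))  # 32.9 (about a third slower)

# Qwen3.5-9B-Q5_K_M, CUDA: TG 67 (1.108.2) -> 52 (1.109.2)
print(pct_drop(67, 52))  # 22.4

# For comparison, Mistral Small 24B CUDA TG is unchanged (40 -> 40),
# and Qwen3.5 Vulkan TG barely moves (88 -> 87, 70 vs 67).
```

So the regression is specific to Qwen3.5 on the CUDA backend, at roughly 20-33% lower TG, rather than a general slowdown.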