Describe the Issue
Notable regression in decode (token generation) performance.
Observed with Vulkan on Windows with FlashAttention enabled.
Additional Information:
1.101.1:
Processing Prompt [BLAS] (1948 / 1948 tokens)
Generating (100 / 100 tokens)
[21:54:43] CtxLimit:2048/2048, Amt:100/100, Init:0.64s, Process:11.53s (169.02T/s), Generate:7.72s (12.96T/s), Total:19.24s
Benchmark Completed - v1.101.1 Results:
======
Flags: NoAVX2=False Threads=8 HighPriority=False Cuda_Args=None Tensor_Split=None BlasThreads=8 BatchSize=512 FlashAttention=True KvCache=0
Timestamp: 2026-01-20 20:54:43.101811+00:00
Backend: koboldcpp_vulkan.dll
Layers: 999
Model: openai_gpt-oss-20b-MXFP4
MaxCtx: 2048
GenAmount: 100
-----
ProcessingTime: 11.525s
ProcessingSpeed: 169.02T/s
GenerationTime: 7.719s
GenerationSpeed: 12.96T/s
TotalTime: 19.244s
Output: 1 1 1 1
-----
1.106.2:
Processing Prompt [BATCH] (1948 / 1948 tokens)
Generating (100 / 100 tokens)
[21:56:38] CtxLimit:2048/2048, Amt:100/100, Init:0.68s, Process:17.02s (114.48T/s), Generate:25.39s (3.94T/s), Total:42.41s
Benchmark Completed - v1.106.2 Results:
======
Flags: NoAVX2=False Threads=8 HighPriority=False Cuda_Args=None Tensor_Split=None BlasThreads=8 BatchSize=512 FlashAttention=True KvCache=0
Timestamp: 2026-01-20 20:56:38.645364+00:00
Backend: koboldcpp_vulkan.dll
Layers: 999
Model: openai_gpt-oss-20b-MXFP4
MaxCtx: 2048
GenAmount: 100
-----
ProcessingTime: 17.016s
ProcessingSpeed: 114.48T/s
GenerationTime: 25.394s
GenerationSpeed: 3.94T/s
TotalTime: 42.410s
Output: 1 1 1 1
-----
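To quantify the regression from the two runs above:
Generation: 12.96 T/s -> 3.94 T/s, i.e. 12.96 / 3.94 ≈ 3.3x slower
Processing: 169.02 T/s -> 114.48 T/s, i.e. 169.02 / 114.48 ≈ 1.5x slower
Total: 19.24s -> 42.41s, i.e. ≈ 2.2x longer end to end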
Last working release: 1.101.1. I have observed the same regression on other models (dense or MoE, small or large) and on other machines (this benchmark is on Intel Arc; I have also seen it on AMD Strix Halo). I seem to remember the issue has been observable since 1.102, but right now the benchmark on 1.102 and 1.103 crashes my computer.
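All numbers in this report come from the built-in benchmark. An invocation along these lines should reproduce the FlashAttention runs (flag spellings as I recall them from --help; the model filename is a placeholder for the local GGUF):
koboldcpp.exe --model openai_gpt-oss-20b-MXFP4.gguf --usevulkan --gpulayers 999 --contextsize 2048 --threads 8 --blasthreads 8 --blasbatchsize 512 --flashattention --benchmark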
For reference, this is 1.106.2 without FA (token generation performance is the same as with FlashAttention enabled, but context prefill performance is better):
Processing Prompt [BATCH] (1948 / 1948 tokens)
Generating (100 / 100 tokens)
[22:01:17] CtxLimit:2048/2048, Amt:100/100, Init:0.66s, Process:7.38s (264.14T/s), Generate:27.27s (3.67T/s), Total:34.65s
Benchmark Completed - v1.106.2 Results:
======
Flags: NoAVX2=False Threads=8 HighPriority=False Cuda_Args=None Tensor_Split=None BlasThreads=8 BatchSize=512 FlashAttention=False KvCache=0
Timestamp: 2026-01-20 21:01:17.980829+00:00
Backend: koboldcpp_vulkan.dll
Layers: 999
Model: openai_gpt-oss-20b-MXFP4
MaxCtx: 2048
GenAmount: 100
-----
ProcessingTime: 7.375s
ProcessingSpeed: 264.14T/s
GenerationTime: 27.272s
GenerationSpeed: 3.67T/s
TotalTime: 34.647s
Output: 1 1 1 1
-----
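Comparing the two 1.106.2 runs: generation is roughly the same with and without FA (3.94 vs 3.67 T/s, both far below 1.101.1's 12.96 T/s), while prefill is about 264.14 / 114.48 ≈ 2.3x faster with FA disabled.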