Describe the Issue
Notable regression in decode (token generation) performance.
Observed with Vulkan on Windows with FlashAttention enabled.
Additional Information:
1.101.1:
Processing Prompt [BLAS] (1948 / 1948 tokens)
Generating (100 / 100 tokens)
[21:54:43] CtxLimit:2048/2048, Amt:100/100, Init:0.64s, Process:11.53s (169.02T/s), Generate:7.72s (12.96T/s), Total:19.24s
Benchmark Completed - v1.101.1 Results:
======
Flags: NoAVX2=False Threads=8 HighPriority=False Cuda_Args=None Tensor_Split=None BlasThreads=8 BatchSize=512 FlashAttention=True KvCache=0
Timestamp: 2026-01-20 20:54:43.101811+00:00
Backend: koboldcpp_vulkan.dll
Layers: 999
Model: openai_gpt-oss-20b-MXFP4
MaxCtx: 2048
GenAmount: 100
-----
ProcessingTime: 11.525s
ProcessingSpeed: 169.02T/s
GenerationTime: 7.719s
GenerationSpeed: 12.96T/s
TotalTime: 19.244s
Output: 1 1 1 1
-----
1.106.2:
Processing Prompt [BATCH] (1948 / 1948 tokens)
Generating (100 / 100 tokens)
[21:56:38] CtxLimit:2048/2048, Amt:100/100, Init:0.68s, Process:17.02s (114.48T/s), Generate:25.39s (3.94T/s), Total:42.41s
Benchmark Completed - v1.106.2 Results:
======
Flags: NoAVX2=False Threads=8 HighPriority=False Cuda_Args=None Tensor_Split=None BlasThreads=8 BatchSize=512 FlashAttention=True KvCache=0
Timestamp: 2026-01-20 20:56:38.645364+00:00
Backend: koboldcpp_vulkan.dll
Layers: 999
Model: openai_gpt-oss-20b-MXFP4
MaxCtx: 2048
GenAmount: 100
-----
ProcessingTime: 17.016s
ProcessingSpeed: 114.48T/s
GenerationTime: 25.394s
GenerationSpeed: 3.94T/s
TotalTime: 42.410s
Output: 1 1 1 1
-----
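To quantify the regression from the two runs above:
Generation: 12.96 T/s -> 3.94 T/s, i.e. 12.96 / 3.94 ≈ 3.3x slower
Processing: 169.02 T/s -> 114.48 T/s, i.e. 169.02 / 114.48 ≈ 1.5x slower
Total: 19.24s -> 42.41s, i.e. ≈ 2.2x longer end to end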
Last working release: 1.101.1. I have observed the same regression on other models (dense or MoE, small or large) and on other machines (this benchmark is on Intel Arc; I have also seen it on AMD Strix Halo). I seem to remember the issue has been observable since 1.102, but right now the benchmark on 1.102 and 1.103 crashes my computer.
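All numbers in this report come from the built-in benchmark. An invocation along these lines should reproduce the FlashAttention runs (flag spellings as I recall them from --help; the model filename is a placeholder for the local GGUF):
koboldcpp.exe --model openai_gpt-oss-20b-MXFP4.gguf --usevulkan --gpulayers 999 --contextsize 2048 --threads 8 --blasthreads 8 --blasbatchsize 512 --flashattention --benchmark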
For reference, this is 1.106.2 without FA (token generation performance is the same as with FlashAttention enabled, but context prefill performance is better):
Processing Prompt [BATCH] (1948 / 1948 tokens)
Generating (100 / 100 tokens)
[22:01:17] CtxLimit:2048/2048, Amt:100/100, Init:0.66s, Process:7.38s (264.14T/s), Generate:27.27s (3.67T/s), Total:34.65s
Benchmark Completed - v1.106.2 Results:
======
Flags: NoAVX2=False Threads=8 HighPriority=False Cuda_Args=None Tensor_Split=None BlasThreads=8 BatchSize=512 FlashAttention=False KvCache=0
Timestamp: 2026-01-20 21:01:17.980829+00:00
Backend: koboldcpp_vulkan.dll
Layers: 999
Model: openai_gpt-oss-20b-MXFP4
MaxCtx: 2048
GenAmount: 100
-----
ProcessingTime: 7.375s
ProcessingSpeed: 264.14T/s
GenerationTime: 27.272s
GenerationSpeed: 3.67T/s
TotalTime: 34.647s
Output: 1 1 1 1
-----
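Comparing the two 1.106.2 runs: generation is roughly the same with and without FA (3.94 vs 3.67 T/s, both far below 1.101.1's 12.96 T/s), while prefill is about 264.14 / 114.48 ≈ 2.3x faster with FA disabled.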