Kmoren/add penalties cu backend by kmorennv · Pull Request #25262 · ggml-org/llama.cpp

kmorennv · 2026-07-03T08:13:42Z

Overview & Motivation

This PR migrates penalty sampling (repeat, frequency, and presence) from the CPU to the GPU backend.

The Problem: Previously, penalty sampling was CPU-only, which forced all subsequent samplers in the chain to execute on the CPU as well.
The Solution: Moving penalties to the backend keeps the whole sampler chain on the GPU. This is especially relevant for recent models such as Qwen 3.5/3.6, whose model card explicitly advises using penalty sampling: https://huggingface.co/Qwen/Qwen3.5-35B-A3B

Performance Impact

Moving penalty sampling to the backend yields a noticeable boost in token generation speed:

| OS / GPU | Max Speedup | Key Models Benefiting |
| --- | --- | --- |
| Linux (RTX 4000 SFF) | +7.59% | `gpt-oss-20b`, `Qwen3.6-35B` |
| Windows 11 (RTX 6000 Pro) | +19.40% | `gpt-oss-20b`, `Qwen3.6-35B` |

Detailed performance

| OS-CTK-Driver | GPU | Model | Without BS (tok/s) | With BS (tok/s) | Speedup |
| --- | --- | --- | --- | --- | --- |
| Linux-13.2-595.58.03 | RTX 4000 SFF | gpt-oss-20b-mxfp4 | 81.01 | 87.16 | +7.59% |
| Linux-13.2-595.58.03 | RTX 4000 SFF | Qwen3.6-27B UD-Q4_K_XL | 13.62 | 13.84 | +1.63% |
| Linux-13.2-595.58.03 | RTX 4000 SFF | Qwen3.6-35B-A3B UD-Q4_K_M | 57.87 | 61.22 | +5.80% |
| Win11-13.2-596.36 | RTX 6000 Pro | gpt-oss-20b-mxfp4 | 305.73 | 365.06 | +19.40% |
| Win11-13.2-596.36 | RTX 6000 Pro | Qwen3.6-27B Q4_K_M | 71.84 | 75.34 | +4.86% |
| Win11-13.2-596.36 | RTX 6000 Pro | Qwen3.6-35B-A3B Q4_K_M | 237.70 | 275.66 | +15.97% |

CTK: CUDA Toolkit
BS: backend sampling

Command used to run benchmark:

./build/bin/llama-server -m /gguf/gpt-oss-20b-mxfp4.gguf --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -dio --port 8033 -np 1 -b 4096 -ub 4096 --repeat-penalty 1.1 --presence-penalty 0 -bs

Core Implementation Steps

1. Backend Integration & Fallback

llama_sampler_penalties now inherits from llama_sampler_backend.
Initialization includes a capability check: if the selected backend lacks the required ggml operations, the sampler falls back to the CPU to preserve compatibility.
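
The fallback decision described above can be pictured with a small standalone sketch. It does not use the ggml API: the operation names are plain strings and `choose_penalties_device` is a hypothetical helper invented for illustration. The only point it makes is that the backend path is taken when every required operation is available, and the CPU path otherwise.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Where the penalties sampler will run (hypothetical enum for this sketch).
enum class sampler_device { cpu, backend };

// Hypothetical helper: pick the backend only if it supports every operation
// the penalties graph would need; otherwise fall back to the CPU.
sampler_device choose_penalties_device(const std::set<std::string> & supported_ops,
                                       const std::vector<std::string> & required_ops) {
    for (const std::string & op : required_ops) {
        if (supported_ops.count(op) == 0) {
            return sampler_device::cpu; // one missing op is enough to fall back
        }
    }
    return sampler_device::backend;
}
```

Doing the check once at initialization, rather than per token, keeps the decision out of the sampling hot path.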

2. State Management & Clone Fix

Token history (prev ring buffer and token_count) is still maintained on the CPU via the normal accept() call, but the actual logit transformation is offloaded to the GPU.
Bug Fix: State cloning now copies both the ring buffer and token_count. Previously the clone started with an empty token_count, which delayed penalty application.
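
To make the clone issue concrete, here is a minimal sketch of CPU-side history state, assuming a fixed-capacity ring buffer plus a per-token occurrence map. The `penalty_history` struct and its members are hypothetical stand-ins, not the actual llama.cpp types. The point is that a clone must carry the counts along with the buffer, otherwise the copy sees no repeated tokens until its window refills.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using token_id = std::int32_t;

// Hypothetical stand-in for the sampler's CPU-side history state.
struct penalty_history {
    std::vector<token_id>             ring;        // fixed-capacity ring buffer of recent tokens
    std::size_t                       head = 0;    // next write position
    std::size_t                       size = 0;    // number of valid entries
    std::unordered_map<token_id, int> token_count; // occurrences inside the window

    explicit penalty_history(std::size_t capacity) : ring(capacity) {
        assert(capacity > 0);
    }

    // Record a sampled token, evicting the oldest one when the window is full.
    void accept(token_id tok) {
        if (size == ring.size()) {
            // When full, the slot at `head` holds the oldest token.
            auto it = token_count.find(ring[head]);
            if (--it->second == 0) {
                token_count.erase(it);
            }
        } else {
            ++size;
        }
        ring[head] = tok;
        head = (head + 1) % ring.size();
        ++token_count[tok];
    }

    // Copy the ring buffer and the counts together so the clone penalizes
    // the same tokens as the original from its first step.
    penalty_history clone() const { return *this; }
};
```

With a capacity of 3, accepting tokens 1, 2, 1, 3 leaves the window holding 2, 1, 3, and a clone taken at that point reports one occurrence of each.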

3. Sparse Graph Optimization

Instead of modifying the entire vocabulary tensor, the backend graph uses a sparse approach. It only gathers, transforms, and scatters logits for the tokens present in the recent history window, leaving the rest untouched.
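
The gather, transform, scatter idea can be sketched on the CPU as follows. This is an illustration under stated assumptions, not the backend graph from this PR: `penalty_params`, `penalize`, and `apply_penalties_sparse` are hypothetical names, and the penalty formula (scale by the repeat penalty depending on sign, then subtract frequency and presence terms) is my understanding of the existing CPU penalties sampler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using token_id = std::int32_t;

struct penalty_params {
    float repeat;  // repeat penalty (1.0 disables it)
    float freq;    // subtracted once per occurrence
    float present; // subtracted once if the token occurred at all
};

// Penalize one logit given how often its token appears in the history window.
inline float penalize(float logit, int count, const penalty_params & p) {
    logit = logit <= 0.0f ? logit * p.repeat : logit / p.repeat;
    return logit - (float(count) * p.freq + (count > 0 ? p.present : 0.0f));
}

// Touch only the logits of tokens present in the history; the rest of the
// vocabulary is never read or written.
void apply_penalties_sparse(std::vector<float> & logits,
                            const std::unordered_map<token_id, int> & token_count,
                            const penalty_params & p) {
    assert(p.repeat > 0.0f);

    // 1. gather: collect indices, logits and counts of the history tokens
    std::vector<std::size_t> idx;
    std::vector<float>       vals;
    std::vector<int>         counts;
    for (const auto & kv : token_count) {
        if (kv.first < 0 || std::size_t(kv.first) >= logits.size()) {
            continue; // ignore ids outside the vocabulary
        }
        idx.push_back(std::size_t(kv.first));
        vals.push_back(logits[idx.back()]);
        counts.push_back(kv.second);
    }

    // 2. transform: apply the penalties to the small gathered array
    for (std::size_t i = 0; i < vals.size(); ++i) {
        vals[i] = penalize(vals[i], counts[i], p);
    }

    // 3. scatter: write the penalized values back to their original slots
    for (std::size_t i = 0; i < idx.size(); ++i) {
        logits[idx[i]] = vals[i];
    }
}
```

The work scales with the number of distinct tokens in the window rather than with the vocabulary size, which is the motivation for the sparse form.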

4. Support After Top-K / Top-P

Initially, the backend assumed full-vocabulary logits. The implementation now detects if a prior sampler reduced the vocabulary (like Top-K).
It creates a comparison matrix to align recent history IDs with the reduced candidate IDs, applying penalties directly to the truncated logits array.
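
A rough CPU sketch of the comparison-matrix alignment is shown below. All names (`apply_penalties_truncated`, `penalty_params`) are hypothetical, and the penalty formula is again assumed to match the CPU sampler. The matrix `match` has one row per surviving candidate and one column per history token; multiplying it by the history counts yields a per-candidate count without needing the full vocabulary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using token_id = std::int32_t;

struct penalty_params {
    float repeat;  // repeat penalty (1.0 disables it)
    float freq;    // subtracted once per occurrence
    float present; // subtracted once if the token occurred at all
};

// cand_ids / cand_logits: candidates left after an earlier sampler (e.g. Top-K).
// hist_ids / hist_counts: distinct history tokens and their occurrence counts.
void apply_penalties_truncated(const std::vector<token_id> & cand_ids,
                               std::vector<float> & cand_logits,
                               const std::vector<token_id> & hist_ids,
                               const std::vector<int> & hist_counts,
                               const penalty_params & p) {
    assert(cand_ids.size() == cand_logits.size());
    assert(hist_ids.size() == hist_counts.size());
    const std::size_t n_cand = cand_ids.size();
    const std::size_t n_hist = hist_ids.size();

    // match[i * n_hist + j] == 1 when candidate i is history token j
    std::vector<float> match(n_cand * n_hist, 0.0f);
    for (std::size_t i = 0; i < n_cand; ++i) {
        for (std::size_t j = 0; j < n_hist; ++j) {
            match[i * n_hist + j] = cand_ids[i] == hist_ids[j] ? 1.0f : 0.0f;
        }
    }

    for (std::size_t i = 0; i < n_cand; ++i) {
        // count for candidate i = row i of match times the history counts
        float count = 0.0f;
        for (std::size_t j = 0; j < n_hist; ++j) {
            count += match[i * n_hist + j] * float(hist_counts[j]);
        }
        if (count > 0.0f) {
            // A select is used instead of blending with a 0/1 mask, because
            // 0 * -inf is NaN under IEEE 754 and masked logits may be -inf.
            float v = cand_logits[i];
            v = v <= 0.0f ? v * p.repeat : v / p.repeat;
            cand_logits[i] = v - (count * p.freq + p.present);
        }
    }
}
```

For candidates `{7, 3, 9}` and history `{3, 7, 5}`, only the first two candidates match a history token, so the third logit is left untouched.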

5. Testing

Added comprehensive backend-vs-CPU verification tests covering combined penalties, repeated prompt tokens, and a larger penalty window (80 tokens).
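
A backend-vs-CPU check of this kind typically boils down to an element-wise comparison with a tolerance. The sketch below is a hypothetical helper, not the `compare_penalties_logits` function from this PR, and its handling of special values is an assumption: masked `-Inf` entries must agree exactly, and any NaN counts as a failure.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when two logit arrays agree within `tol`.
bool logits_match(const std::vector<float> & backend,
                  const std::vector<float> & cpu,
                  float tol) {
    assert(tol >= 0.0f);
    if (backend.size() != cpu.size()) {
        return false;
    }
    for (std::size_t i = 0; i < backend.size(); ++i) {
        const float a = backend[i];
        const float b = cpu[i];
        if (std::isnan(a) || std::isnan(b)) {
            return false; // NaN is never an acceptable result
        }
        if (std::isinf(a) || std::isinf(b)) {
            if (a != b) {
                return false; // masked entries must stay masked on both sides
            }
            continue; // skip the subtraction: inf - inf would be NaN
        }
        if (std::fabs(a - b) > tol) {
            return false;
        }
    }
    return true;
}
```

Treating infinities separately matters because subtracting two equal infinities yields NaN, which would make a naive `fabs(a - b) <= tol` test reject identical masked logits.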

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, paired with codex.

- Set default value for penalty_last_n based on model context if not specified.
- Ensure penalty_last_n and n_prev are non-negative.
- Update llama_sampler_penalties structure to inherit from llama_sampler_backend and add backend input handling for penalties.
- Implement backend initialization and application logic for penalties, including frequency and presence adjustments.

- Introduced `accept_prompt` and `unique_prompt_tokens` functions to handle prompt acceptance and token uniqueness.
- Implemented `compare_penalties_logits` to compare logits from backend and CPU samplers with penalties.
- Added `test_backend_penalties_sampling` to validate backend penalties with various configurations.
- Enhanced the test suite for better coverage of penalty handling in sampling.


ggml-gh-bot · 2026-07-03T08:17:51Z

Hi @kmorennv, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

kmorennv · 2026-07-03T08:36:47Z

I think that’s a bot error, since I only have a single open PR.

pwilkin · 2026-07-03T08:46:05Z

@kmorennv Nope :)

#21673

pwilkin · 2026-07-03T08:48:55Z

(just for clarity's sake, this is just an informational message and it's not hard-enforced anywhere and since that's just a draft that doesn't really matter, but technically it's not an error)

kmorennv · 2026-07-03T08:49:32Z

@pwilkin thanks, I filtered for this but only for open --> clear, it was not shown...

kmorennv added 7 commits June 16, 2026 17:47

sampling: add support for top-k penalties in backend sampling

af9cf9c

sampling: add fix to ensure stable numerical results. Preserve masked logits as -Inf and no longer generate NaN.

3ffeaea

sampling: enhance penalty comparison tests with masking penalties logic

888f264

add comments on padding

1834d9f

sampling: add comments on modifications

38ad11c

kmorennv requested review from a team and ggerganov as code owners July 3, 2026 08:13

github-actions Bot added the testing Everything test related label Jul 3, 2026

kmorennv wants to merge 7 commits into ggml-org:master from kmorennv:kmoren/add_penalties_cu_backend
