Kmoren/add penalties cu backend #25262

Open
kmorennv wants to merge 7 commits into ggml-org:master from kmorennv:kmoren/add_penalties_cu_backend

@kmorennv kmorennv commented Jul 3, 2026

Overview & Motivation

This PR moves penalty sampling (repeat, frequency, and presence) from the CPU to the GPU backend.

  • The Problem: Previously, penalty sampling was CPU-only, which forced all subsequent samplers in the chain to execute on the CPU as well.
  • The Solution: Moving penalties to the backend allows the whole sampler chain to stay on the GPU. This is particularly relevant for modern models like Qwen 3.5/3.6, whose authors directly advise using penalty sampling (see https://huggingface.co/Qwen/Qwen3.5-35B-A3B).

Performance Impact

Moving penalty sampling to the backend speeds up token generation by up to +19.40% in the benchmarks below:

| OS / GPU | Max Speedup | Key Models Benefiting |
| --- | --- | --- |
| Linux (RTX 4000 SFF) | +7.59% | gpt-oss-20b, Qwen3.6-35B |
| Windows 11 (RTX 6000 Pro) | +19.40% | gpt-oss-20b, Qwen3.6-35B |

Detailed performance

| OS-CTK-Driver | GPU | Model | Without BS (tok/sec) | With BS (tok/sec) | Speedup |
| --- | --- | --- | --- | --- | --- |
| Linux-13.2-595.58.03 | RTX 4000 SFF | gpt-oss-20b-mxfp4 | 81.01 | 87.16 | +7.59% |
| Linux-13.2-595.58.03 | RTX 4000 SFF | Qwen3.6-27B UD-Q4_K_XL | 13.62 | 13.84 | +1.63% |
| Linux-13.2-595.58.03 | RTX 4000 SFF | Qwen3.6-35B-A3B UD-Q4_K_M | 57.87 | 61.22 | +5.80% |
| Win11-13.2-596.36 | RTX 6000 Pro | gpt-oss-20b-mxfp4 | 305.73 | 365.06 | +19.40% |
| Win11-13.2-596.36 | RTX 6000 Pro | Qwen3.6-27B Q4_K_M | 71.84 | 75.34 | +4.86% |
| Win11-13.2-596.36 | RTX 6000 Pro | Qwen3.6-35B-A3B Q4_K_M | 237.70 | 275.66 | +15.97% |

CTK: CUDA Toolkit
BS: backend sampling

Command used to run benchmark:

./build/bin/llama-server -m /gguf/gpt-oss-20b-mxfp4.gguf --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -dio --port 8033 -np 1 -b 4096 -ub 4096 --repeat-penalty 1.1 --presence-penalty 0 -bs

Core Implementation Steps

1. Backend Integration & Fallback

  • llama_sampler_penalties now inherits from llama_sampler_backend.
  • Initialization includes a capability check: if the selected backend lacks the required ggml operations, the sampler falls back to the CPU to preserve compatibility.
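
To illustrate the fallback rule described above, here is a minimal, self-contained sketch. It is not the PR's code: `backend_caps`, `sampler_device`, `choose_device`, and the op names are hypothetical stand-ins, and the real check presumably queries the ggml backend rather than a set of strings.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a backend: the set of op names it can execute.
struct backend_caps {
    std::string name;
    std::set<std::string> supported_ops;
};

enum class sampler_device { CPU, BACKEND };

// Returns BACKEND only if every required op is supported; otherwise CPU.
sampler_device choose_device(const backend_caps & backend,
                             const std::vector<std::string> & required_ops) {
    for (const auto & op : required_ops) {
        if (backend.supported_ops.count(op) == 0) {
            return sampler_device::CPU; // missing op: keep the CPU path
        }
    }
    return sampler_device::BACKEND;
}
```

Making the decision once at initialization means the sampler never builds a graph that the backend cannot run.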

2. State Management & Clone Fix

  • Token history (prev ring buffer and token_count) is still maintained on the CPU via the normal accept() call, but the actual logit transformation is offloaded to the GPU.
  • Bug Fix: Fixed state cloning by ensuring both the ring buffer and token_count are copied. Previously, an empty cloned token count delayed penalty applications.
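
The clone fix can be pictured with a small sketch. The `penalty_state` struct below is a hypothetical simplification, not the layout used in llama.cpp; it only shows why the ring buffer and the counts have to be copied together.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using token_t = int32_t;

// Hypothetical penalty state: a fixed-size window of recent tokens plus
// a per-token occurrence count for that window.
struct penalty_state {
    std::vector<token_t> prev;  // ring buffer storage
    size_t head = 0;            // next write position (oldest entry when full)
    size_t size = 0;            // number of valid entries
    std::unordered_map<token_t, int> token_count;

    explicit penalty_state(size_t window) : prev(window) {
        assert(window > 0);
    }

    void accept(token_t tok) {
        if (size == prev.size()) {
            // Window full: evict the oldest token, which sits at head.
            const token_t old = prev[head];
            if (--token_count[old] == 0) {
                token_count.erase(old);
            }
        } else {
            ++size;
        }
        prev[head] = tok;
        head = (head + 1) % prev.size();
        ++token_count[tok];
    }

    // Copies the ring buffer AND the counts, so the clone penalizes
    // the same tokens as the original from its first step.
    penalty_state clone() const { return *this; }
};
```

If `clone()` copied only `prev`, the clone's `token_count` would start empty and penalties would be skipped until tokens were accepted again, which matches the delay described in the bug fix.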

3. Sparse Graph Optimization

  • Instead of modifying the entire vocabulary tensor, the backend graph uses a sparse approach. It only gathers, transforms, and scatters logits for the tokens present in the recent history window, leaving the rest untouched.
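
The gather, transform, scatter pattern can be sketched on the CPU as follows. This is an illustration, not the backend graph itself: `penalty_params` and `apply_penalties_sparse` are hypothetical names, and the formula is an assumption (the PR text does not spell it out). It uses the common convention of scaling by the repeat penalty, then subtracting the frequency and presence terms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct penalty_params {
    float repeat  = 1.0f; // 1.0 leaves logits unscaled
    float freq    = 0.0f; // subtracted once per occurrence
    float present = 0.0f; // subtracted once if the token occurred at all
};

// Touches only the logits of tokens found in the history window.
void apply_penalties_sparse(std::vector<float> & logits,
                            const std::unordered_map<int32_t, int> & token_count,
                            const penalty_params & p) {
    // Gather: pull out the logits of the tokens seen in the window.
    std::vector<int32_t> ids;
    std::vector<float>   vals;
    std::vector<int>     counts;
    for (const auto & kv : token_count) {
        assert(kv.first >= 0 && (size_t) kv.first < logits.size());
        ids.push_back(kv.first);
        vals.push_back(logits[kv.first]);
        counts.push_back(kv.second);
    }
    // Transform: apply repeat, frequency, and presence penalties.
    for (size_t i = 0; i < vals.size(); ++i) {
        vals[i] = vals[i] <= 0.0f ? vals[i] * p.repeat : vals[i] / p.repeat;
        vals[i] -= counts[i] * p.freq + (counts[i] > 0 ? p.present : 0.0f);
    }
    // Scatter: write the penalized values back; other logits stay untouched.
    for (size_t i = 0; i < ids.size(); ++i) {
        logits[ids[i]] = vals[i];
    }
}
```

The work scales with the number of distinct tokens in the window rather than with the vocabulary size, which is the point of the sparse approach.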

4. Support After Top-K / Top-P

  • Initially, the backend assumed full-vocabulary logits. The implementation now detects if a prior sampler reduced the vocabulary (like Top-K).
  • It creates a comparison matrix to align recent history IDs with the reduced candidate IDs, applying penalties directly to the truncated logits array.
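
A rough CPU sketch of the comparison-matrix idea follows. The function name, argument layout, and penalty formula are assumptions for illustration; the actual implementation builds the equivalent with ggml tensor ops.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// cand_ids[j] is the vocabulary id of the j-th surviving candidate and
// cand_logits[j] its logit; history holds the recent token ids.
void apply_penalties_truncated(const std::vector<int32_t> & cand_ids,
                               std::vector<float> & cand_logits,
                               const std::vector<int32_t> & history,
                               float repeat, float freq, float present) {
    assert(cand_ids.size() == cand_logits.size());
    const size_t n_hist = history.size();
    const size_t n_cand = cand_ids.size();

    // Comparison matrix: match[i * n_cand + j] is 1 when history[i] == cand_ids[j].
    std::vector<float> match(n_hist * n_cand, 0.0f);
    for (size_t i = 0; i < n_hist; ++i) {
        for (size_t j = 0; j < n_cand; ++j) {
            if (history[i] == cand_ids[j]) {
                match[i * n_cand + j] = 1.0f;
            }
        }
    }

    // Column sums give each candidate's occurrence count in the history.
    for (size_t j = 0; j < n_cand; ++j) {
        float count = 0.0f;
        for (size_t i = 0; i < n_hist; ++i) {
            count += match[i * n_cand + j];
        }
        if (count == 0.0f) {
            continue; // candidate never appeared: leave its logit alone
        }
        float & l = cand_logits[j];
        l = l <= 0.0f ? l * repeat : l / repeat;
        l -= count * freq + present;
    }
}
```

History tokens that were already filtered out by Top-K or Top-P match no column, so they are simply ignored.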

5. Testing

  • Added comprehensive backend-vs-CPU verification tests covering combined penalties, repeated prompt tokens, and a larger penalty window (80 tokens).
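
A backend-vs-CPU check of this kind usually boils down to an element-wise comparison with a tolerance. The helper below is a hypothetical sketch of that idea, not the `compare_penalties_logits` function from the PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first logit pair that differs by more than tol,
// or -1 if the two vectors agree everywhere.
long first_mismatch(const std::vector<float> & cpu,
                    const std::vector<float> & backend,
                    float tol) {
    assert(cpu.size() == backend.size());
    for (size_t i = 0; i < cpu.size(); ++i) {
        if (cpu[i] == backend[i]) {
            continue; // exact match, including equal infinities
        }
        if (std::fabs(cpu[i] - backend[i]) > tol) {
            return (long) i;
        }
    }
    return -1;
}
```

A small tolerance is needed because the GPU and CPU paths may round floating-point operations differently.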

Requirements

kmorennv added 7 commits June 16, 2026 17:47
- Set default value for penalty_last_n based on model context if not specified.
- Ensure penalty_last_n and n_prev are non-negative.
- Update llama_sampler_penalties structure to inherit from llama_sampler_backend and add backend input handling for penalties.
- Implement backend initialization and application logic for penalties, including frequency and presence adjustments.
- Introduced `accept_prompt` and `unique_prompt_tokens` functions to handle prompt acceptance and token uniqueness.
- Implemented `compare_penalties_logits` to compare logits from backend and CPU samplers with penalties.
- Added `test_backend_penalties_sampling` to validate backend penalties with various configurations.
- Enhanced the test suite for better coverage of penalty handling in sampling.
@kmorennv kmorennv requested review from a team and ggerganov as code owners July 3, 2026 08:13
@github-actions github-actions Bot added the testing Everything test related label Jul 3, 2026
ggml-gh-bot Bot commented Jul 3, 2026

Hi @kmorennv, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

kmorennv commented Jul 3, 2026

Author

I think that’s a bot error, since I only have a single open PR.

pwilkin commented Jul 3, 2026

Member

@kmorennv Nope :)

#21673

pwilkin commented Jul 3, 2026

Member

(Just for clarity's sake: this is only an informational message and it's not hard-enforced anywhere. Since the other PR is just a draft, it doesn't really matter, but technically it's not an error.)

kmorennv commented Jul 3, 2026

Author

@pwilkin thanks, I filtered for this but only for open --> clear, it was not shown...
