Kmoren/add penalties cu backend#25262
Open
kmorennv wants to merge 7 commits into
- Set default value for penalty_last_n based on model context if not specified.
- Ensure penalty_last_n and n_prev are non-negative.
- Update llama_sampler_penalties structure to inherit from llama_sampler_backend and add backend input handling for penalties.
- Implement backend initialization and application logic for penalties, including frequency and presence adjustments.

- Introduced `accept_prompt` and `unique_prompt_tokens` functions to handle prompt acceptance and token uniqueness.
- Implemented `compare_penalties_logits` to compare logits from backend and CPU samplers with penalties.
- Added `test_backend_penalties_sampling` to validate backend penalties with various configurations.
- Enhanced the test suite for better coverage of penalty handling in sampling.
… logits as -Inf and no longer generate NaN.
Hi @kmorennv, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Author
I think that’s a bot error, since I only have a single open PR.
Member
(Just for clarity's sake: this is an informational message that isn't hard-enforced anywhere, and since this is just a draft it doesn't really matter, but technically it's not an error.)
Author
@pwilkin thanks, that's clear now. I did filter for this, but only for open PRs, so it was not shown.
Overview & Motivation
This PR migrates penalties sampling (repeat, frequency, and presence) from the CPU to the GPU backend.
Performance Impact
Moving penalty sampling to the backend yields a noticeable boost in token generation speed:
[Figure: token generation performance for gpt-oss-20b and Qwen3.6-35B]
Performance details
CTK: CUDA Toolkit
BS: backend sampling
Command used to run benchmark:
./build/bin/llama-server -m /gguf/gpt-oss-20b-mxfp4.gguf --temp 1.0 --top-k 0 --top-p 1.0 --min-p 0.0 -dio --port 8033 -np 1 -b 4096 -ub 4096 --repeat-penalty 1.1 --presence-penalty 0 -bs
Core Implementation Steps
1. Backend Integration & Fallback
- `llama_sampler_penalties` now inherits from `llama_sampler_backend`.
- … `ggml` operations, it falls back to the CPU to preserve compatibility.
2. State Management & Clone Fix
- … (`prev` ring buffer and `token_count`) is still maintained on the CPU via the normal `accept()` call, but the actual logit transformation is offloaded to the GPU.
- … `token_count` are copied. Previously, an empty cloned token count delayed penalty applications.
3. Sparse Graph Optimization
4. Support After Top-K / Top-P
5. Testing
Requirements