feat(skip-softmax): Add skip-softmax support for KV-blocked attention #1009

Draft
vbaddi wants to merge 1 commit into quic:main from vbaddi:feat/enable-skip-softmax

@vbaddi vbaddi commented May 25, 2026

Summary

Adds BLASS-style skip-softmax support on top of KV-blocked attention.

Changes

  • Added skip_softmax knobs to the attention blocking config.
  • Wired skip-softmax through the KV, QKV, HQKV, BHQKV, and KV-MLA blocked paths.
  • Added an online-softmax block-skip predicate whose effective threshold shrinks as the context grows: lambda_eff = skip_softmax_scale / ctx_len
  • Added tests for threshold selection, skip-mask behavior, and accumulator preservation.
  • Added example script supporting:
    • plain non-blocked baseline
    • KV-blocked baseline
    • KV-blocked + skip-softmax
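
To make the block-skip predicate concrete, here is a minimal pure-Python sketch for a single query. It is not the code in this PR. The function name `blocked_attention`, the scalar values, and the exact skip rule (skip a block when `exp(block_max - running_max) < lambda_eff`, checked in the log domain) are assumptions for illustration. Only the formula `lambda_eff = skip_softmax_scale / ctx_len` and the idea of a minimum number of kept blocks come from the description above.

```python
import math

def blocked_attention(scores, values, num_kv_blocks,
                      skip_softmax_scale=None, min_keep_blocks=1):
    """Online softmax over KV blocks for one query, with optional block skipping.

    scores: attention logits (q . k), one per key; values: one scalar per key.
    Returns (output, indices of skipped blocks). Hypothetical helper, not the PR's API.
    """
    ctx_len = len(scores)
    block = ctx_len // num_kv_blocks  # assumes ctx_len divides evenly
    # Work with log(lambda_eff) so large score gaps cannot overflow exp().
    if skip_softmax_scale:
        log_lam = math.log(skip_softmax_scale / ctx_len)
    else:
        log_lam = -math.inf  # skipping disabled: predicate is never true
    m, l, acc = -math.inf, 0.0, 0.0  # running max, denominator, weighted sum
    kept, skipped = 0, []
    for b in range(num_kv_blocks):
        s = scores[b * block:(b + 1) * block]
        v = values[b * block:(b + 1) * block]
        block_max = max(s)
        # Assumed predicate: the block's largest weight relative to the
        # running max falls below lambda_eff, so its contribution is negligible.
        if kept >= min_keep_blocks and block_max - m < log_lam:
            skipped.append(b)
            continue  # accumulators m, l, acc are left untouched
        new_m = max(m, block_max)
        rescale = math.exp(m - new_m)  # 0.0 for the first kept block
        p = [math.exp(x - new_m) for x in s]
        l = l * rescale + sum(p)
        acc = acc * rescale + sum(pi * vi for pi, vi in zip(p, v))
        m = new_m
        kept += 1
    return acc / l, skipped

scores = [5.0, 5.0, -20.0, -20.0, 4.0, 4.0, -30.0, -30.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
dense, _ = blocked_attention(scores, values, num_kv_blocks=4)
sparse, skipped = blocked_attention(scores, values, num_kv_blocks=4,
                                    skip_softmax_scale=1.0)
print(skipped, abs(dense - sparse))
```

In this toy input, blocks 1 and 3 sit far below the running max and are skipped, while block 2 is close enough to be kept. The skipped blocks leave the running accumulators unchanged, which is the property the accumulator-preservation tests are described as checking.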

Example

KV-Blocking baseline:

python examples/text_generation/skip_softmax_kv_blocking_inference.py \
    --model-name meta-llama/Llama-3.2-1B \
    --prompt "Can you generate a detailed case study for Mumbai city?" \
    --prefill-seq-len 1 \
    --ctx-len 65536 \
    --generation-len 64000 \
    --num-kv-blocks 8 \
    --no-skip-softmax \
    --num-cores 16 \
    --aic-hw-version ai100 \
    --device-group '[0,1,2,3]'

KV-Blocking + skip-softmax:

python examples/text_generation/skip_softmax_kv_blocking_inference.py \
    --model-name meta-llama/Llama-3.2-1B \
    --prompt "Can you generate a detailed case study for Mumbai city?" \
    --prefill-seq-len 1 \
    --ctx-len 65536 \
    --generation-len 64000 \
    --num-kv-blocks 8 \
    --skip-softmax \
    --skip-softmax-decode-scale 16.0 \
    --skip-softmax-prefill-scale 1.0 \
    --skip-softmax-min-keep-blocks 1 \
    --num-cores 16 \
    --aic-hw-version ai100 \
    --device-group '[0,1,2,3]'
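
For a sense of scale, the snippet below evaluates the threshold formula for the flags in this command. It assumes that the decode and prefill scales are each substituted for `skip_softmax_scale` in `lambda_eff = skip_softmax_scale / ctx_len`, which the description does not state explicitly. `effective_threshold` is a hypothetical helper name.

```python
def effective_threshold(skip_softmax_scale, ctx_len):
    # Formula from the Changes list; the function name is illustrative only.
    return skip_softmax_scale / ctx_len

ctx_len = 65536
decode_lambda = effective_threshold(16.0, ctx_len)   # --skip-softmax-decode-scale
prefill_lambda = effective_threshold(1.0, ctx_len)   # --skip-softmax-prefill-scale

print(f"decode  lambda_eff = {decode_lambda:.3e}")   # 2.441e-04
print(f"prefill lambda_eff = {prefill_lambda:.3e}")  # 1.526e-05
```

Under that assumption, the decode threshold is 16 times the prefill threshold, so decode steps would skip more aggressively than prefill at the same context length.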

Validation

python -m pytest -q tests/test_skip_softmax_kv_blocking.py
Result:
7 passed

cc: @anujgupt-github @kdulla

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
@vbaddi vbaddi added the enhancement New feature or request label May 25, 2026
@vbaddi vbaddi marked this pull request as draft May 25, 2026 09:27