Fix Gemma4 KeyError sliding_attention issue #1839
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Pull request overview
Fixes Gemma4 block-wise quantization failures on Transformers >= 5.6 by ensuring all Gemma4 decoder layers share the same shared_kv_states dict during replay, preventing KeyError: 'sliding_attention' when KV state is produced by one layer and consumed by another.
Changes:
- Introduces _shared_kv_states_global_ref and attaches a single shared dict to every Gemma4 layer during _attach_gemma4_rotary_emb().
- Updates _prepare_gemma4_replay_inputs() to fall back to that shared dict (instead of allocating a new {}) when shared_kv_states is not provided.
- Adds a small helper (_get_gemma4_shared_kv_states_global) to retrieve the shared dict from a layer.
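The three changes above can be sketched roughly as follows. This is a minimal illustration, not the actual auto-round code: only the attribute name _shared_kv_states_global_ref comes from the PR, while the function names, the SimpleNamespace stand-ins for decoder layers, and the exact fallback order are assumptions made for the example.

```python
from types import SimpleNamespace

# Attribute name taken from the PR text; everything else here is hypothetical.
REF_ATTR = "_shared_kv_states_global_ref"


def attach_shared_kv_states(layers):
    """Attach one dict object to every layer (stand-in for the attach step)."""
    shared = {}
    for layer in layers:
        setattr(layer, REF_ATTR, shared)
    return shared


def get_shared_kv_states_global(layer):
    """Return the shared dict stored on a layer, or None if none was attached."""
    return getattr(layer, REF_ATTR, None)


def resolve_shared_kv_states(layer, shared_kv_states=None):
    """Prefer the caller's dict, then the layer's shared dict, then a fresh {}."""
    if shared_kv_states is not None:
        return shared_kv_states
    default = get_shared_kv_states_global(layer)
    return default if default is not None else {}


# Toy layers standing in for Gemma4 decoder layers.
layers = [SimpleNamespace(name=f"layer{i}") for i in range(3)]
attach_shared_kv_states(layers)

# Every layer now resolves to the same dict object, not to equal copies.
same_object = resolve_shared_kv_states(layers[0]) is resolve_shared_kv_states(layers[2])
```

The point of the sketch is object identity: a write made while replaying one layer is visible when a later layer is replayed, because both resolve to the same dict.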
    head_dim = getattr(attn, "head_dim", None)

    if attn is not None and hasattr(attn, "store_full_length_kv") and shared_kv_states is None:
        shared_kv_states = default_shared_kv_states if default_shared_kv_states is not None else {}
        shared_kv_states = (
            default_shared_kv_states
Local tests pass:
1. gemma-4-E4B-it quantize
   CUDA_VISIBLE_DEVICES=6 auto-round --model_name /mnt/disk2/lvl/gemma-4-E4B-it --bits 4 --iters 0 --tasks lambada_openai
   evaluation running time=122s
2. gemma-4-31B-it quantize
   CUDA_VISIBLE_DEVICES=4,6 auto-round /mnt/disk3/lvl/gemma-4-31B-it --device_map "auto" --data_type "int" --group_size 128 --batch_size 1 --nsamples 2 --seqlen 512 --iters 1 --to_quant_block_names 'model.language_model.layers' --output_dir gemma-4-31B-it-INT8-AutoRound --scheme W8A16 --dataset NeelNanda/pile-10k --format "auto_round:auto_gptq"
   2026-05-21 14:21:25 INFO main.py L652: start to quantize /mnt/disk3/lvl/gemma-4-31B-it
3. gemma-4-31B-it RAM check
   CUDA_VISIBLE_DEVICES=4,6 auto-round /mnt/disk3/lvl/gemma-4-31B-it --device_map "auto" --data_type "int" --group_size 128 --batch_size 2 --nsamples 512 --seqlen 2048 --iters 2000 --to_quant_block_names 'model.language_model.layers' --output_dir gemma-4-31B-it-INT8-AutoRound --scheme W8A16 --dataset NeelNanda/pile-10k --format "auto_round:auto_gptq"
   2026-05-21 14:40:21 INFO device.py L1840: 'peak_ram': 72.32GB, 'peak_vram': {'0': 36.9GB, '1': 6.6GB}
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Description
For transformers >= 5.6, _attach_gemma4_rotary_emb() enabled per-layer special replay but did not create a shared shared_kv_states dict. Each layer got its own empty {}, so the KV cache written by the first sliding_attention layer was invisible to the full_attention layers that tried to read it, which caused the KeyError.
Fix
_attach_gemma4_rotary_emb() now creates one shared dict and attaches it as _shared_kv_states_global_ref to every layer.
_prepare_gemma4_replay_inputs() falls back to that shared dict instead of a new empty {}.
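The failure mode can be reproduced with a toy model of block-wise replay. The ToyLayer class and replay_blockwise function below are hypothetical and only mimic the producer/consumer relationship described above; they are not the Transformers or auto-round implementation.

```python
class ToyLayer:
    """Hypothetical stand-in for a decoder layer that shares KV state."""

    def __init__(self, produces_kv):
        self.produces_kv = produces_kv

    def replay(self, shared_kv_states):
        if self.produces_kv:
            # Producer: publish its KV under the "sliding_attention" key.
            shared_kv_states["sliding_attention"] = "kv-from-producer"
        # Producer and consumer both read the entry back from the dict.
        return shared_kv_states["sliding_attention"]


def replay_blockwise(layers, shared=None):
    """Replay layers one at a time, as block-wise quantization does.

    With shared=None each layer gets a fresh {} (the old behaviour);
    otherwise every layer receives the same dict (the fixed behaviour).
    """
    outputs = []
    for layer in layers:
        kv_states = shared if shared is not None else {}
        outputs.append(layer.replay(kv_states))
    return outputs


layers = [ToyLayer(produces_kv=True), ToyLayer(produces_kv=False)]

# Old behaviour: the consumer sees an empty dict and raises KeyError.
try:
    replay_blockwise(layers)
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

# Fixed behaviour: one dict for all layers, so the consumer finds the entry.
outputs = replay_blockwise(layers, shared={})
```

In the first call the consumer fails on the missing "sliding_attention" key, which matches the reported error. In the second call both layers return the producer's value because they operate on one dict.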
Type of Change
Bug fix
Related Issues
#1837
Checklist Before Submitting
/azp run Unit-Test-CUDA-AutoRound.