Fix Gemma4 KeyError sliding_attention issue #1839

Open
lvliang-intel wants to merge 3 commits into main from lvl/fix_sliding_attention_key_error

@lvliang-intel
Contributor

Description

For transformers >= 5.6, _attach_gemma4_rotary_emb() enabled per-layer special replay but did not create a single shared_kv_states dict for all layers to share. Each layer received its own empty {}, so the KV cache written by the first sliding_attention layer was invisible to the full_attention layers that tried to read it, which raised a KeyError.

Fix

_attach_gemma4_rotary_emb() now creates one shared dict and attaches it as _shared_kv_states_global_ref to every layer.
_prepare_gemma4_replay_inputs() falls back to that shared dict instead of a new empty {}.
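
The difference between per-layer dicts and one shared dict can be illustrated with a toy sketch. It does not use the real Gemma4 or AutoRound classes: ToyLayer and replay are hypothetical stand-ins, and only the key name sliding_attention and the shared_kv_states argument name come from this PR.

```python
class ToyLayer:
    """Hypothetical stand-in for a decoder layer (not the real Gemma4 class)."""

    def __init__(self, writes_kv):
        self.writes_kv = writes_kv

    def forward(self, shared_kv_states):
        if self.writes_kv:
            # Writer layer: stores its KV under the "sliding_attention" key.
            shared_kv_states["sliding_attention"] = "kv-from-writer"
        # Reader layers expect the key to already be present.
        return shared_kv_states["sliding_attention"]


def replay(layers, shared_kv_states=None):
    """Replay layers one by one, as block-wise quantization does."""
    outputs = []
    for layer in layers:
        # Buggy path: a fresh {} per layer when no shared dict is supplied.
        kv = shared_kv_states if shared_kv_states is not None else {}
        outputs.append(layer.forward(kv))
    return outputs


layers = [ToyLayer(writes_kv=True), ToyLayer(writes_kv=False)]

# Without a shared dict, the second layer cannot see the first layer's KV.
try:
    replay(layers)
    buggy_result = "ok"
except KeyError as exc:
    buggy_result = f"KeyError: {exc}"

# With one dict passed to every layer, the read succeeds.
fixed_result = replay(layers, shared_kv_states={})

print(buggy_result)
print(fixed_result)
```

In this sketch the first call fails in the reader layer because its private dict is empty, while the second call succeeds because both layers mutate the same object.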

Type of Change

Bug fix

Related Issues

#1837

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Copilot AI review requested due to automatic review settings May 21, 2026 06:47
Contributor

Copilot AI left a comment

Pull request overview

Fixes Gemma4 block-wise quantization failures on Transformers >= 5.6 by ensuring all Gemma4 decoder layers share the same shared_kv_states dict during replay, preventing KeyError: 'sliding_attention' when KV state is produced by one layer and consumed by another.

Changes:

  • Introduces _shared_kv_states_global_ref and attaches a single shared dict to every Gemma4 layer during _attach_gemma4_rotary_emb().
  • Updates _prepare_gemma4_replay_inputs() to fall back to that shared dict (instead of allocating a new {}) when shared_kv_states is not provided.
  • Adds a small helper (_get_gemma4_shared_kv_states_global) to retrieve the shared dict from a layer.
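
A minimal sketch of the attach-and-fall-back pattern described in these bullets, under stated assumptions: the attribute name _shared_kv_states_global_ref is taken from this PR, but the function names, signatures, and ToyLayer class below are hypothetical and are not the code in auto_round/special_model_handler.py.

```python
_GLOBAL_REF_ATTR = "_shared_kv_states_global_ref"  # attribute name from the PR


class ToyLayer:
    """Hypothetical stand-in for a Gemma4 decoder layer."""


def attach_shared_kv_states(layers):
    """Create one dict and attach the same object to every layer."""
    shared = {}
    for layer in layers:
        setattr(layer, _GLOBAL_REF_ATTR, shared)
    return shared


def get_shared_kv_states_global(layer):
    """Return the attached shared dict, or None if nothing was attached."""
    return getattr(layer, _GLOBAL_REF_ATTR, None)


def resolve_shared_kv_states(layer, shared_kv_states=None):
    """Prefer the caller's dict, then the attached one, then a new {}."""
    if shared_kv_states is not None:
        return shared_kv_states
    default = get_shared_kv_states_global(layer)
    return default if default is not None else {}


layers = [ToyLayer(), ToyLayer()]
shared = attach_shared_kv_states(layers)

# Both layers resolve to the very same dict object.
same_object = resolve_shared_kv_states(layers[0]) is resolve_shared_kv_states(layers[1])

# An explicitly supplied dict still takes priority over the attached one.
explicit = {"sliding_attention": "caller-provided"}
uses_explicit = resolve_shared_kv_states(layers[0], explicit) is explicit

# A layer with nothing attached still gets an empty dict as a last resort.
unattached_default = resolve_shared_kv_states(ToyLayer())
```

The identity checks matter more than equality here: two empty dicts compare equal, but only a single shared object lets a write in one layer become visible to another.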

Comment thread auto_round/special_model_handler.py Outdated
Comment on lines +160 to +164
head_dim = getattr(attn, "head_dim", None)

if attn is not None and hasattr(attn, "store_full_length_kv") and shared_kv_states is None:
shared_kv_states = default_shared_kv_states if default_shared_kv_states is not None else {}
shared_kv_states = (
default_shared_kv_states
@lvliang-intel
Contributor Author

Local tests pass:

1. gemma-4-E4B-it quantize

CUDA_VISIBLE_DEVICES=6 auto-round --model_name /mnt/disk2/lvl/gemma-4-E4B-it --bits 4 --iters 0 --tasks lambada_openai
2026-05-21 14:48:55 INFO main.py L652: start to quantize /mnt/disk2/lvl/gemma-4-E4B-it
2026-05-21 14:48:55 INFO config.py L45: enable_opt_rtn is turned on, set --disable_opt_rtn for higher speed at the cost of accuracy.
2026-05-21 14:48:55 WARNING logging.py L340: Using MLLM mode for multimodal model (new architecture).
Loading weights: 100%|█████████████████████████████████████████| 2076/2076 [00:00<00:00, 10924.96it/s]
2026-05-21 14:49:30 WARNING logging.py L340: some layers are skipped quantization (shape not divisible by 32):
[transformers] loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
2026-05-21 14:49:30 WARNING special_model_handler.py L359: Applying a monkey patch to Gemma4 to reduce RAM usage. This patch has only been validated with limited Transformers versions. Proceed with caution.
2026-05-21 14:49:30 INFO base.py L655: 'enable_torch_compile' is set to False by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2026-05-21 14:49:32 INFO data_driven.py L1080: start to compute imatrix
2026-05-21 14:49:32 INFO calib_dataset.py L977: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
2026-05-21 14:50:15 INFO mllm.py L83: Using MLLM template: gemma4
2026-05-21 14:50:15 INFO calib_dataset.py L977: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
2026-05-21 14:50:50 WARNING logging.py L340: Please note that 'shared_kv_states' key is not currently used in quantization fine-tuning.
Quantizing model.language_model.layers.0: 0%| | 0/42 [00:00<?, ?it/s]2026-05-21 14:51:23 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.05GB
Quantizing model.language_model.layers.1: 2%|▌ | 1/42 [00:05<03:45, 5.51s/it]2026-05-21 14:51:27 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.2: 5%|█▏ | 2/42 [00:10<03:18, 4.96s/it]2026-05-21 14:51:32 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.3: 7%|█▋ | 3/42 [00:14<03:07, 4.80s/it]2026-05-21 14:51:37 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.4: 10%|██▎ | 4/42 [00:19<03:04, 4.86s/it]2026-05-21 14:51:41 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.5: 12%|██▊ | 5/42 [00:24<02:55, 4.74s/it]2026-05-21 14:51:46 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.6: 14%|███▍ | 6/42 [00:28<02:49, 4.71s/it]2026-05-21 14:51:51 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.7: 17%|████ | 7/42 [00:33<02:45, 4.72s/it]2026-05-21 14:51:56 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.8: 19%|████▌ | 8/42 [00:38<02:41, 4.74s/it]2026-05-21 14:52:01 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.9: 21%|█████▏ | 9/42 [00:43<02:38, 4.80s/it]2026-05-21 14:52:05 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.10: 24%|█████▏ | 10/42 [00:48<02:33, 4.80s/it]2026-05-21 14:52:10 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.11: 26%|█████▊ | 11/42 [00:52<02:26, 4.71s/it]2026-05-21 14:52:14 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.12: 29%|██████▎ | 12/42 [00:57<02:20, 4.70s/it]2026-05-21 14:52:19 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.13: 31%|██████▊ | 13/42 [01:02<02:17, 4.73s/it]2026-05-21 14:52:24 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.14: 33%|███████▎ | 14/42 [01:06<02:12, 4.73s/it]2026-05-21 14:52:29 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.15: 36%|███████▊ | 15/42 [01:11<02:07, 4.72s/it]2026-05-21 14:52:33 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.16: 38%|████████▍ | 16/42 [01:16<02:02, 4.71s/it]2026-05-21 14:52:38 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.17: 40%|████████▉ | 17/42 [01:20<01:58, 4.72s/it]2026-05-21 14:52:43 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.18: 43%|█████████▍ | 18/42 [01:25<01:53, 4.72s/it]2026-05-21 14:52:48 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.19: 45%|█████████▉ | 19/42 [01:30<01:48, 4.74s/it]2026-05-21 14:52:52 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.20: 48%|██████████▍ | 20/42 [01:35<01:43, 4.73s/it]2026-05-21 14:52:57 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.21: 50%|███████████ | 21/42 [01:39<01:40, 4.77s/it]2026-05-21 14:53:03 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.22: 52%|███████████▌ | 22/42 [01:45<01:40, 5.02s/it]2026-05-21 14:53:09 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.23: 55%|████████████ | 23/42 [01:52<01:43, 5.46s/it]2026-05-21 14:53:14 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.24: 57%|████████████▌ | 24/42 [01:56<01:34, 5.25s/it]2026-05-21 14:53:19 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.25: 60%|█████████████ | 25/42 [02:01<01:26, 5.09s/it]2026-05-21 14:53:24 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.26: 62%|█████████████▌ | 26/42 [02:06<01:19, 4.98s/it]2026-05-21 14:53:28 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.27: 64%|██████████████▏ | 27/42 [02:10<01:13, 4.90s/it]2026-05-21 14:53:33 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.28: 67%|██████████████▋ | 28/42 [02:15<01:08, 4.87s/it]2026-05-21 14:53:38 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.29: 69%|███████████████▏ | 29/42 [02:20<01:02, 4.82s/it]2026-05-21 14:53:42 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.30: 71%|███████████████▋ | 30/42 [02:25<00:57, 4.79s/it]2026-05-21 14:53:47 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.31: 74%|████████████████▏ | 31/42 [02:29<00:52, 4.77s/it]2026-05-21 14:53:52 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.32: 76%|████████████████▊ | 32/42 [02:34<00:47, 4.76s/it]2026-05-21 14:53:57 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.33: 79%|█████████████████▎ | 33/42 [02:39<00:42, 4.74s/it]2026-05-21 14:54:01 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.34: 81%|█████████████████▊ | 34/42 [02:44<00:37, 4.74s/it]2026-05-21 14:54:06 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.35: 83%|██████████████████▎ | 35/42 [02:48<00:33, 4.73s/it]2026-05-21 14:54:11 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.36: 86%|██████████████████▊ | 36/42 [02:53<00:28, 4.74s/it]2026-05-21 14:54:15 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.37: 88%|███████████████████▍ | 37/42 [02:58<00:23, 4.71s/it]2026-05-21 14:54:20 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.38: 90%|███████████████████▉ | 38/42 [03:02<00:18, 4.69s/it]2026-05-21 14:54:25 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.39: 93%|████████████████████▍ | 39/42 [03:07<00:14, 4.68s/it]2026-05-21 14:54:29 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.40: 95%|████████████████████▉ | 40/42 [03:12<00:09, 4.66s/it]2026-05-21 14:54:34 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.41: 98%|█████████████████████▍| 41/42 [03:16<00:04, 4.65s/it]2026-05-21 14:54:39 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
Quantizing model.language_model.layers.41: 100%|██████████████████████| 42/42 [03:21<00:00, 4.80s/it]
2026-05-21 14:54:56 INFO shard_writer.py L324: model has been saved to ./tmp_autoround/gemma-4-E4B-it-w4g128/
2026-05-21 14:54:57 INFO device.py L1840: 'peak_ram': 13.61GB, 'peak_vram': 7.95GB
2026-05-21 14:54:57 INFO evaluation.py L457: Using lm-eval version 0.4.11.dev0
2026-05-21 14:54:57 WARNING evaluation.py L379: hf-multimodal models does not support auto currently, reset eval_bs to 16
Detected kernel version 5.4.292, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/mnt/disk1/lvl/conda_envs/artest/lib/python3.11/site-packages/transformers/quantizers/auto.py:262: UserWarning: You passed quantization_config or equivalent parameters to from_pretrained but the model you're loading already has a quantization_config attribute. The quantization_config from the model will be used.However, loading attributes (e.g. ['backend']) will be overwritten with the one you passed to from_pretrained. The rest will be ignored.
warnings.warn(warning_msg)
2026-05-21 14:55:01 WARNING backend.py L1176: Better backend is found, please install all the following requirements to enable it.
2026-05-21 14:55:01 WARNING backend.py L1176: pip install -v "gptqmodel>=2.0" --no-build-isolation
Loading weights: 100%|██████████████████████████████████████████| 2760/2760 [00:01<00:00, 1442.33it/s]
100%|████████████████████████████████████████████████████████████| 5153/5153 [00:10<00:00, 489.65it/s]
Running loglikelihood requests: 100%|█████████████████████████████| 5153/5153 [00:51<00:00, 99.38it/s]
bootstrapping for stddev: perplexity
100%|███████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 65.97it/s]

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| lambada_openai | 1 | none | 0 | acc | 0.1582 | ± 0.0051 |
| | | none | 0 | perplexity | 28298.3349 | ± 2929.2261 |

evaluation running time=122s

2. gemma-4-31B-it quantize

CUDA_VISIBLE_DEVICES=4,6 auto-round /mnt/disk3/lvl/gemma-4-31B-it --device_map "auto" --data_type "int" --group_size 128 --batch_size 1 --nsamples 2 --seqlen 512 --iters 1 --to_quant_block_names 'model.language_model.layers' --output_dir gemma-4-31B-it-INT8-AutoRound --scheme W8A16 --dataset NeelNanda/pile-10k --format "auto_round:auto_gptq"

2026-05-21 14:21:25 INFO main.py L652: start to quantize /mnt/disk3/lvl/gemma-4-31B-it
2026-05-21 14:21:25 WARNING logging.py L340: Using MLLM mode for multimodal model (new architecture).
Loading weights: 100%|█████████████████████████████████████████| 1188/1188 [00:00<00:00, 10923.77it/s]
2026-05-21 14:23:42 WARNING logging.py L340: some layers are skipped quantization (shape not divisible by 32): model.vision_tower.encoder.layers.[0-26].mlp.down_proj.linear, model.vision_tower.encoder.layers.[0-26].mlp.gate_proj.linear, model.vision_tower.encoder.layers.[0-26].mlp.up_proj.linear
[transformers] loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
2026-05-21 14:23:42 WARNING special_model_handler.py L359: Applying a monkey patch to Gemma4 to reduce RAM usage. This patch has only been validated with limited Transformers versions. Proceed with caution.
2026-05-21 14:23:43 INFO base.py L655: 'enable_torch_compile' is set to False by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2026-05-21 14:23:43 INFO data_driven.py L662: start to cache block inputs
2026-05-21 14:23:43 INFO mllm.py L83: Using MLLM template: gemma4
2026-05-21 14:23:43 INFO calib_dataset.py L977: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
2026-05-21 14:24:26 WARNING logging.py L340: Please note that 'shared_kv_states' key is not currently used in quantization fine-tuning.
2026-05-21 14:24:27 INFO data_driven.py L685: caching done
Quantizing model.language_model.layers.0: 0%| | 0/60 [00:00<?, ?it/s]/mnt/disk1/lvl/conda_envs/artest/lib/python3.11/site-packages/torch/autograd/graph.py:869: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:124.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
quantized 7/7 layers in the block, loss iter 0: 0.000012 -> iter 0: 0.000012
2026-05-21 14:24:31 INFO device.py L1840: 'peak_ram': 58.9GB, 'peak_vram': {'0': 5.66GB, '1': 5.47GB}
Quantizing model.language_model.layers.1: 2%|▍ | 1/60 [00:04<04:32, 4.62s/it]quantized 7/7 layers in the block, loss iter 0: 0.000037 -> iter 0: 0.000037
2026-05-21 14:24:33 INFO device.py L1840: 'peak_ram': 59.4GB, 'peak_vram': {'0': 5.68GB, '1': 5.47GB}
Quantizing model.language_model.layers.2: 3%|▊ | 2/60 [00:06<03:05, 3.19s/it]quantized 7/7 layers in the block, loss iter 0: 0.000040 -> iter 0: 0.000040
2026-05-21 14:24:36 INFO device.py L1840: 'peak_ram': 59.89GB, 'peak_vram': {'0': 5.68GB, '1': 5.47GB}
Quantizing model.language_model.layers.3: 5%|█▏ | 3/60 [00:10<03:05, 3.25s/it]quantized 7/7 layers in the block, loss iter 0: 0.000050 -> iter 0: 0.000050
2026-05-21 14:24:38 INFO device.py L1840: 'peak_ram': 60.33GB, 'peak_vram': {'0': 5.68GB, '1': 5.47GB}
Quantizing model.language_model.layers.4: 7%|█▌ | 4/60 [00:13<03:09, 3.39s/it]quantized 7/7 layers in the block, loss iter 0: 0.000055 -> iter 0: 0.000055
2026-05-21 14:24:42 INFO device.py L1840: 'peak_ram': 60.82GB, 'peak_vram': {'0': 5.68GB, '1': 5.47GB}
Quantizing model.language_model.layers.5: 8%|██ | 5/60 [00:15<02:38, 2.88s/it]quantized 6/6 layers in the block, loss iter 0: 0.000077 -> iter 0: 0.000077
2026-05-21 14:24:46 INFO device.py L1840: 'peak_ram': 61.44GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.6: 10%|██▍ | 6/60 [00:19<02:52, 3.19s/it]quantized 7/7 layers in the block, loss iter 0: 0.000089 -> iter 0: 0.000089
2026-05-21 14:24:48 INFO device.py L1840: 'peak_ram': 61.87GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.7: 12%|██▊ | 7/60 [00:23<03:00, 3.40s/it]quantized 7/7 layers in the block, loss iter 0: 0.000115 -> iter 0: 0.000115
2026-05-21 14:24:51 INFO device.py L1840: 'peak_ram': 62.35GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.8: 13%|███▏ | 8/60 [00:25<02:37, 3.03s/it]quantized 7/7 layers in the block, loss iter 0: 0.000139 -> iter 0: 0.000139
2026-05-21 14:24:55 INFO device.py L1840: 'peak_ram': 62.79GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.9: 15%|███▌ | 9/60 [00:33<03:51, 4.54s/it]quantized 7/7 layers in the block, loss iter 0: 0.000169 -> iter 0: 0.000169
2026-05-21 14:25:02 INFO device.py L1840: 'peak_ram': 62.79GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.10: 17%|███▋ | 10/60 [00:37<03:32, 4.25s/it]quantized 7/7 layers in the block, loss iter 0: 0.000189 -> iter 0: 0.000189
2026-05-21 14:25:05 INFO device.py L1840: 'peak_ram': 62.79GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.11: 18%|████ | 11/60 [00:39<02:55, 3.59s/it]quantized 6/6 layers in the block, loss iter 0: 0.000217 -> iter 0: 0.000217
2026-05-21 14:25:09 INFO device.py L1840: 'peak_ram': 62.79GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.12: 20%|████▍ | 12/60 [00:42<02:55, 3.66s/it]quantized 7/7 layers in the block, loss iter 0: 0.000235 -> iter 0: 0.000235
2026-05-21 14:25:11 INFO device.py L1840: 'peak_ram': 62.79GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.13: 22%|████▊ | 13/60 [00:46<02:52, 3.68s/it]quantized 7/7 layers in the block, loss iter 0: 0.000250 -> iter 0: 0.000250
2026-05-21 14:25:15 INFO device.py L1840: 'peak_ram': 62.79GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.14: 23%|█████▏ | 14/60 [00:48<02:28, 3.24s/it]quantized 7/7 layers in the block, loss iter 0: 0.000256 -> iter 0: 0.000256
2026-05-21 14:25:19 INFO device.py L1840: 'peak_ram': 62.79GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.15: 25%|█████▌ | 15/60 [00:52<02:34, 3.43s/it]quantized 7/7 layers in the block, loss iter 0: 0.000285 -> iter 0: 0.000285
2026-05-21 14:25:21 INFO device.py L1840: 'peak_ram': 62.79GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.16: 27%|█████▊ | 16/60 [00:56<02:32, 3.46s/it]quantized 7/7 layers in the block, loss iter 0: 0.000298 -> iter 0: 0.000298
2026-05-21 14:25:24 INFO device.py L1840: 'peak_ram': 62.79GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.17: 28%|██████▏ | 17/60 [00:58<02:12, 3.07s/it]quantized 6/6 layers in the block, loss iter 0: 0.000363 -> iter 0: 0.000363
2026-05-21 14:25:28 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.18: 30%|██████▌ | 18/60 [01:06<03:10, 4.53s/it]quantized 7/7 layers in the block, loss iter 0: 0.000358 -> iter 0: 0.000358
2026-05-21 14:25:34 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.19: 32%|██████▉ | 19/60 [01:08<02:35, 3.79s/it]quantized 7/7 layers in the block, loss iter 0: 0.000404 -> iter 0: 0.000404
2026-05-21 14:25:38 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.20: 33%|███████▎ | 20/60 [01:12<02:29, 3.73s/it]quantized 7/7 layers in the block, loss iter 0: 0.000462 -> iter 0: 0.000462
2026-05-21 14:25:40 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.21: 35%|███████▋ | 21/60 [01:15<02:25, 3.74s/it]quantized 7/7 layers in the block, loss iter 0: 0.000564 -> iter 0: 0.000564
2026-05-21 14:25:44 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.22: 37%|████████ | 22/60 [01:18<02:04, 3.28s/it]quantized 7/7 layers in the block, loss iter 0: 0.000683 -> iter 0: 0.000683
2026-05-21 14:25:47 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.23: 38%|████████▍ | 23/60 [01:21<02:03, 3.34s/it]quantized 6/6 layers in the block, loss iter 0: 0.001340 -> iter 0: 0.001340
2026-05-21 14:25:51 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.24: 40%|████████▊ | 24/60 [01:25<02:05, 3.47s/it]quantized 7/7 layers in the block, loss iter 0: 0.001552 -> iter 0: 0.001552
2026-05-21 14:25:53 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.25: 42%|█████████▏ | 25/60 [01:27<01:48, 3.09s/it]quantized 7/7 layers in the block, loss iter 0: 0.002417 -> iter 0: 0.002417
2026-05-21 14:25:57 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.26: 43%|█████████▌ | 26/60 [01:35<02:31, 4.47s/it]quantized 7/7 layers in the block, loss iter 0: 0.006087 -> iter 0: 0.006087
2026-05-21 14:26:03 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.27: 45%|█████████▉ | 27/60 [01:39<02:22, 4.31s/it]quantized 7/7 layers in the block, loss iter 0: 0.002219 -> iter 0: 0.002219
2026-05-21 14:26:07 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.28: 47%|██████████▎ | 28/60 [01:41<01:57, 3.68s/it]quantized 7/7 layers in the block, loss iter 0: 0.018925 -> iter 0: 0.018925
2026-05-21 14:26:11 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.29: 48%|██████████▋ | 29/60 [01:44<01:48, 3.49s/it]quantized 6/6 layers in the block, loss iter 0: 0.025021 -> iter 0: 0.025021
2026-05-21 14:26:12 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.30: 50%|███████████ | 30/60 [01:45<01:26, 2.88s/it]quantized 7/7 layers in the block, loss iter 0: 0.003304 -> iter 0: 0.003304
2026-05-21 14:26:15 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.31: 52%|███████████▎ | 31/60 [01:49<01:28, 3.04s/it]quantized 7/7 layers in the block, loss iter 0: 0.019993 -> iter 0: 0.019993
2026-05-21 14:26:17 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.32: 53%|███████████▋ | 32/60 [01:51<01:16, 2.75s/it]quantized 7/7 layers in the block, loss iter 0: 0.004367 -> iter 0: 0.004367
2026-05-21 14:26:21 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.33: 55%|████████████ | 33/60 [01:54<01:20, 2.98s/it]quantized 7/7 layers in the block, loss iter 0: 0.004880 -> iter 0: 0.004880
2026-05-21 14:26:23 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.34: 57%|████████████▍ | 34/60 [01:58<01:23, 3.19s/it]quantized 7/7 layers in the block, loss iter 0: 0.005355 -> iter 0: 0.005355
2026-05-21 14:26:27 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.35: 58%|████████████▊ | 35/60 [02:05<01:48, 4.34s/it]quantized 6/6 layers in the block, loss iter 0: 0.042357 -> iter 0: 0.042357
2026-05-21 14:26:35 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.36: 60%|█████████████▏ | 36/60 [02:08<01:37, 4.05s/it]quantized 7/7 layers in the block, loss iter 0: 0.010940 -> iter 0: 0.010940
2026-05-21 14:26:37 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.37: 62%|█████████████▌ | 37/60 [02:12<01:31, 3.99s/it]quantized 7/7 layers in the block, loss iter 0: 0.058233 -> iter 0: 0.058233
2026-05-21 14:26:41 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.38: 63%|█████████████▉ | 38/60 [02:14<01:11, 3.26s/it]quantized 7/7 layers in the block, loss iter 0: 0.013178 -> iter 0: 0.013178
2026-05-21 14:26:44 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.39: 65%|██████████████▎ | 39/60 [02:18<01:12, 3.46s/it]quantized 7/7 layers in the block, loss iter 0: 0.049228 -> iter 0: 0.049228
2026-05-21 14:26:46 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.40: 67%|██████████████▋ | 40/60 [02:21<01:08, 3.44s/it]quantized 7/7 layers in the block, loss iter 0: 0.015693 -> iter 0: 0.015693
2026-05-21 14:26:50 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.41: 68%|███████████████ | 41/60 [02:23<00:58, 3.06s/it]quantized 6/6 layers in the block, loss iter 0: 0.020955 -> iter 0: 0.020955
2026-05-21 14:26:54 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.42: 70%|███████████████▍ | 42/60 [02:27<00:59, 3.29s/it]quantized 7/7 layers in the block, loss iter 0: 0.065412 -> iter 0: 0.065412
2026-05-21 14:26:56 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.43: 72%|███████████████▊ | 43/60 [02:35<01:17, 4.58s/it]quantized 7/7 layers in the block, loss iter 0: 0.053742 -> iter 0: 0.053742
2026-05-21 14:27:03 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.44: 73%|████████████████▏ | 44/60 [02:37<01:01, 3.86s/it]quantized 7/7 layers in the block, loss iter 0: 0.042936 -> iter 0: 0.042936
2026-05-21 14:27:05 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.45: 75%|████████████████▌ | 45/60 [02:41<00:58, 3.87s/it]quantized 7/7 layers in the block, loss iter 0: 0.038064 -> iter 0: 0.038064
2026-05-21 14:27:09 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.46: 77%|████████████████▊ | 46/60 [02:43<00:47, 3.37s/it]quantized 7/7 layers in the block, loss iter 0: 0.013084 -> iter 0: 0.013084
2026-05-21 14:27:13 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.47: 78%|█████████████████▏ | 47/60 [02:46<00:44, 3.42s/it]quantized 6/6 layers in the block, loss iter 0: 0.051475 -> iter 0: 0.051475
2026-05-21 14:27:17 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.48: 80%|█████████████████▌ | 48/60 [02:51<00:43, 3.60s/it]quantized 7/7 layers in the block, loss iter 0: 0.051462 -> iter 0: 0.051462
2026-05-21 14:27:19 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.49: 82%|█████████████████▉ | 49/60 [02:52<00:32, 3.00s/it]quantized 7/7 layers in the block, loss iter 0: 0.042956 -> iter 0: 0.042956
2026-05-21 14:27:22 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.50: 83%|██████████████████▎ | 50/60 [02:55<00:31, 3.10s/it]quantized 7/7 layers in the block, loss iter 0: 0.013168 -> iter 0: 0.013168
2026-05-21 14:27:24 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.51: 85%|██████████████████▋ | 51/60 [02:58<00:25, 2.83s/it]quantized 7/7 layers in the block, loss iter 0: 0.013259 -> iter 0: 0.013259
2026-05-21 14:27:27 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.52: 87%|███████████████████ | 52/60 [03:06<00:34, 4.36s/it]quantized 7/7 layers in the block, loss iter 0: 0.011236 -> iter 0: 0.011236
2026-05-21 14:27:36 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.53: 88%|███████████████████▍ | 53/60 [03:10<00:29, 4.24s/it]quantized 6/6 layers in the block, loss iter 0: 0.018349 -> iter 0: 0.018349
2026-05-21 14:27:38 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.54: 90%|███████████████████▊ | 54/60 [03:12<00:21, 3.62s/it]quantized 7/7 layers in the block, loss iter 0: 0.038695 -> iter 0: 0.038695
2026-05-21 14:27:42 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.55: 92%|████████████████████▏ | 55/60 [03:14<00:16, 3.35s/it]quantized 7/7 layers in the block, loss iter 0: 0.023505 -> iter 0: 0.023505
2026-05-21 14:27:43 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.56: 93%|████████████████████▌ | 56/60 [03:18<00:13, 3.48s/it]quantized 7/7 layers in the block, loss iter 0: 0.023436 -> iter 0: 0.023436
2026-05-21 14:27:47 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.57: 95%|████████████████████▉ | 57/60 [03:20<00:08, 2.85s/it]quantized 7/7 layers in the block, loss iter 0: 0.042688 -> iter 0: 0.042688
2026-05-21 14:27:48 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.58: 97%|█████████████████████▎| 58/60 [03:23<00:06, 3.01s/it]quantized 7/7 layers in the block, loss iter 0: 0.023143 -> iter 0: 0.023143
2026-05-21 14:27:52 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing model.language_model.layers.59: 98%|█████████████████████▋| 59/60 [03:25<00:02, 2.77s/it]quantized 6/6 layers in the block, loss iter 0: 0.000289 -> iter 0: 0.000289
2026-05-21 14:27:56 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
Quantizing done: 100%|████████████████████████████████████████████████| 60/60 [03:34<00:00, 3.57s/it]
2026-05-21 14:28:01 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}
2026-05-21 14:28:06 INFO shard_writer.py L324: model has been saved to gemma-4-31B-it-INT8-AutoRound/gemma-4-31B-it-w8g128/
2026-05-21 14:28:07 INFO data_driven.py L750: quantization tuning time 219.62608313560486
2026-05-21 14:28:07 INFO data_driven.py L769: Summary: quantized 410/602 in the model, unquantized layers: lm_head, model.embed_vision.embedding_projection, model.vision_tower.encoder.layers.[0-26].mlp.down_proj.linear, model.vision_tower.encoder.layers.[0-26].mlp.gate_proj.linear, model.vision_tower.encoder.layers.[0-26].mlp.up_proj.linear, model.vision_tower.encoder.layers.[0-26].self_attn.k_proj.linear, model.vision_tower.encoder.layers.[0-26].self_attn.o_proj.linear, model.vision_tower.encoder.layers.[0-26].self_attn.q_proj.linear, model.vision_tower.encoder.layers.[0-26].self_attn.v_proj.linear, model.vision_tower.patch_embedder.input_proj
2026-05-21 14:28:07 INFO device.py L1840: 'peak_ram': 63.24GB, 'peak_vram': {'0': 6.52GB, '1': 5.65GB}

3 gemma-4-31B-it RAM check

CUDA_VISIBLE_DEVICES=4,6 auto-round /mnt/disk3/lvl/gemma-4-31B-it --device_map "auto" --data_type "int" --group_size 128 --batch_size 2 --nsamples 512 --seqlen 2048 --iters 2000 --to_quant_block_names 'model.language_model.layers' --output_dir gemma-4-31B-it-INT8-AutoRound --scheme W8A16 --dataset NeelNanda/pile-10k --format "auto_round:auto_gptq"
2026-05-21 14:29:49 INFO main.py L652: start to quantize /mnt/disk3/lvl/gemma-4-31B-it
2026-05-21 14:29:49 WARNING logging.py L340: Using MLLM mode for multimodal model (new architecture).
Loading weights: 100%|█████████████████████████████████████████| 1188/1188 [00:00<00:00, 10537.45it/s]
2026-05-21 14:32:33 WARNING logging.py L340: some layers are skipped quantization (shape not divisible by 32): model.vision_tower.encoder.layers.[0-26].mlp.down_proj.linear, model.vision_tower.encoder.layers.[0-26].mlp.gate_proj.linear, model.vision_tower.encoder.layers.[0-26].mlp.up_proj.linear
[transformers] loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
2026-05-21 14:32:33 WARNING special_model_handler.py L359: Applying a monkey patch to Gemma4 to reduce RAM usage. This patch has only been validated with limited Transformers versions. Proceed with caution.
2026-05-21 14:32:33 INFO base.py L655: 'enable_torch_compile' is set to False by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2026-05-21 14:32:33 INFO data_driven.py L662: start to cache block inputs
2026-05-21 14:32:33 INFO mllm.py L83: Using MLLM template: gemma4
2026-05-21 14:32:33 INFO calib_dataset.py L977: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
2026-05-21 14:33:17 WARNING logging.py L340: Please note that 'shared_kv_states' key is not currently used in quantization fine-tuning.
2026-05-21 14:33:36 INFO data_driven.py L685: caching done
Quantizing model.language_model.layers.0: 0%| | 0/60 [00:04<?, ?it/s]/mnt/disk1/lvl/conda_envs/artest/lib/python3.11/site-packages/torch/autograd/graph.py:869: UserWarning: Memory Efficient attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:900.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
quantized 7/7 layers in the block, loss iter 0: 0.000011 -> iter 1948: 0.000008

2026-05-21 14:40:21 INFO device.py L1840: 'peak_ram': 72.32GB, 'peak_vram': {'0': 36.9GB, '1': 6.6GB}
Quantizing model.language_model.layers.1: 2%|▎ | 1/60 [06:46<6:39:20, 406.11s/it]
quantized 7/7 layers in the block, loss iter 0: 0.000028 -> iter 648: 0.000024
2026-05-21 14:47:01 INFO device.py L1840: 'peak_ram': 72.32GB, 'peak_vram': {'0': 36.9GB, '1': 6.6GB}
Quantizing model.language_model.layers.2: 3%|▋ | 2/60 [13:27<6:30:08, 403.59s/it]

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@lvliang-intel requested review from n1ck-guo and xin3he May 21, 2026 07:05