Reduce VRAM usage of quantizing VLM models by lvliang-intel · Pull Request #1777 · intel/auto-round

lvliang-intel · 2026-05-04T12:08:03Z

Description

Fixes issue #1744 by reducing VRAM usage when quantizing VLM models. The figures below were measured with the following command:

CUDA_VISIBLE_DEVICES=0 python -m auto_round /home/lvl/models/Qwen3-VL-8B-Instruct --bits 4 --group_size 128 --dataset "pile-10k" --nsamples 4 --seqlen 512 --iters 2 --output_dir ./tmp_vlm_quant

Metric	Before optimization	After optimization	Delta
Peak RAM	18.28 GB	18.52 GB	+0.24 GB
Peak VRAM	16.56 GB	5.18 GB	-11.38 GB
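
To put the table in relative terms, here is a small sketch that recomputes the deltas and the percentage change from the reported figures. The numbers are copied from the table; the `summarize` helper is a hypothetical name used only for illustration and is not part of auto-round.

```python
# Peak memory figures copied from the table above, in GB.
before = {"Peak RAM": 18.28, "Peak VRAM": 16.56}
after = {"Peak RAM": 18.52, "Peak VRAM": 5.18}


def summarize(before, after):
    """Return {metric: (delta_gb, percent_change)}, each rounded to two decimals."""
    out = {}
    for name, old in before.items():
        new = after[name]
        delta = new - old
        out[name] = (round(delta, 2), round(100.0 * delta / old, 2))
    return out


summary = summarize(before, after)
for name, (delta, pct) in summary.items():
    print(f"{name}: {delta:+.2f} GB ({pct:+.2f}%)")
```

On these figures the VRAM saving is roughly two thirds of the original peak, while host RAM grows by a little over one percent.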

Type of Change

Bug fix

Related Issues

#1744

Checklist Before Submitting

My code has been tested locally.
Documentation has been updated as needed.
New or updated tests are included where applicable.
The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Copilot

Pull request overview

This PR targets issue #1744 by enabling a lower-VRAM calibration/caching path when quantizing VLM/MLLM models (notably in the new-architecture compressor), and adds a CUDA test intended to validate reduced peak VRAM for a text-only MLLM calibration flow.

Changes:

Allow compressors_new CPU caching path for MLLM models by removing the explicit MLLM exclusion in try_cache_inter_data_gpucpu.
Add a CUDA test that runs MLLM quantization with a text dataset and asserts peak VRAM stays low.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`auto_round/compressors_new/calib.py`	Removes the MLLM-specific guard so CPU caching can apply to MLLM in the new compressor path.
`test/test_cuda/models/test_mllm.py`	Adds a low-VRAM regression test for MLLM quantization with a list-of-strings dataset using `memory_monitor`.
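
The regression test described above follows a common pattern: run the workload, record its peak memory, and assert that the peak stays under a budget. The sketch below illustrates that pattern only. It does not use the project's `memory_monitor` (whose interface is not shown in this PR page); instead it swaps in the standard-library `tracemalloc`, which tracks Python heap allocations on the host. The names `peak_bytes`, `fake_calibration`, and `BUDGET` are hypothetical.

```python
import tracemalloc


def peak_bytes(fn, *args, **kwargs):
    """Run fn and return (result, peak traced Python heap bytes during the call)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def fake_calibration(n_bytes):
    # Hypothetical stand-in for a quantization run: allocate one buffer.
    buf = bytearray(n_bytes)
    return len(buf)


BUDGET = 8 * 1024 * 1024  # hypothetical 8 MiB budget for this toy workload
size, peak = peak_bytes(fake_calibration, 4 * 1024 * 1024)

# The regression check: fail loudly if the peak exceeds the budget.
assert peak < BUDGET, f"peak {peak} bytes exceeds budget {BUDGET} bytes"
```

For GPU memory, the equivalent measurement in PyTorch would typically call `torch.cuda.reset_peak_memory_stats()` before the run and read `torch.cuda.max_memory_allocated()` afterwards; whether the actual test does this is an assumption.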

lvliang-intel · 2026-05-04T15:06:40Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-05-04T15:06:51Z

Azure Pipelines successfully started running 1 pipeline(s).

lvliang-intel · 2026-05-05T14:41:38Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-05-05T14:41:51Z

Azure Pipelines successfully started running 1 pipeline(s).

lvliang-intel · 2026-05-13T06:33:37Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-05-13T06:33:46Z

Azure Pipelines successfully started running 1 pipeline(s).

lvliang-intel · 2026-05-19T01:57:56Z

/azp run Unit-Test-CUDA-AutoRound

azure-pipelines · 2026-05-19T01:58:06Z

Azure Pipelines successfully started running 1 pipeline(s).

Reduce VRAM usage of quantizing VLM models

01c27b2

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Copilot AI review requested due to automatic review settings May 4, 2026 12:08

Copilot started reviewing on behalf of lvliang-intel May 4, 2026 12:08

[pre-commit.ci] auto fixes from pre-commit.com hooks

cc5079a

for more information, see https://pre-commit.ci

Copilot AI reviewed May 4, 2026

Comment thread test/test_cuda/models/test_mllm.py

Comment thread auto_round/compressors_new/calib.py Outdated

Comment thread test/test_cuda/models/test_mllm.py

wenhuach21 mentioned this pull request May 8, 2026

[Bug]: TypeError: Gemma4TextDecoderLayer.forward() got multiple values for argument 'hidden_states' #1783

Closed

wenhuach21 and others added 5 commits May 8, 2026 17:38

Merge branch 'main' into lvl/fix_vlm_large_vram

02c57a8

Merge branch 'main' into lvl/fix_vlm_large_vram

7d412d3

Merge branch 'main' into lvl/fix_vlm_large_vram

3e666b9

Merge branch 'main' into lvl/fix_vlm_large_vram

8ee0f18

Merge branch 'main' into lvl/fix_vlm_large_vram

f1b5fa8

chensuyue added this to the 0.13.0 milestone May 14, 2026

lvliang-intel added 3 commits May 14, 2026 21:24

Merge branch 'main' of https://github.com/intel/auto-round into lvl/f…

92a8762

…ix_vlm_large_vram Signed-off-by: lvliang-intel <liang1.lv@intel.com>

refactor for new arch

ae936a7

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Merge branch 'main' into lvl/fix_vlm_large_vram

425e558

n1ck-guo approved these changes May 20, 2026

lvliang-intel merged commit 64a5817 into main May 20, 2026
46 checks passed

lvliang-intel deleted the lvl/fix_vlm_large_vram branch May 20, 2026 07:56

lvliang-intel merged 10 commits into main from lvl/fix_vlm_large_vram

5 participants