Reduce VRAM usage of quantizing VLM models #1777

Merged
lvliang-intel merged 10 commits into main from lvl/fix_vlm_large_vram on May 20, 2026
@lvliang-intel
Contributor

@lvliang-intel lvliang-intel commented May 4, 2026

Description

Fixes issue #1744: reduces VRAM usage when quantizing VLM models.

CUDA_VISIBLE_DEVICES=0 python -m auto_round /home/lvl/models/Qwen3-VL-8B-Instruct --bits 4 --group_size 128 --dataset "pile-10k" --nsamples 4 --seqlen 512 --iters 2 --output_dir ./tmp_vlm_quant

| Metric | Before Optimization | After Optimization | Delta |
|---|---|---|---|
| Peak RAM | 18.28 GB | 18.52 GB | +0.24 GB |
| Peak VRAM | 16.56 GB | 5.18 GB | -11.38 GB |
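
The Delta column is the after-minus-before difference. As a quick sanity check, a small Python sketch can recompute it, along with the relative change, from the figures in the table above. The names `metrics` and `delta_and_percent` are illustrative only and do not come from the PR.

```python
# Figures copied from the table above, in GB: (before, after).
metrics = {
    "Peak RAM": (18.28, 18.52),
    "Peak VRAM": (16.56, 5.18),
}


def delta_and_percent(before, after):
    """Return (absolute change in GB, relative change in percent)."""
    delta = after - before
    return round(delta, 2), round(100.0 * delta / before, 1)


for name, (before, after) in metrics.items():
    delta, pct = delta_and_percent(before, after)
    print(f"{name}: {delta:+.2f} GB ({pct:+.1f}%)")
```

The relative figure is useful when comparing runs on models of different sizes, where an absolute delta alone says little.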

Type of Change

Bug fix

Related Issues

#1744

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Copilot AI review requested due to automatic review settings May 4, 2026 12:08
Contributor

Copilot AI left a comment

Pull request overview

This PR targets issue #1744 by enabling a lower-VRAM calibration/caching path when quantizing VLM/MLLM models (notably in the new-architecture compressor), and adds a CUDA test intended to validate reduced peak VRAM for a text-only MLLM calibration flow.

Changes:

  • Allow compressors_new CPU caching path for MLLM models by removing the explicit MLLM exclusion in try_cache_inter_data_gpucpu.
  • Add a CUDA test that runs MLLM quantization with a text dataset and asserts peak VRAM stays low.
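
The second change asserts a ceiling on peak VRAM. A minimal sketch of that kind of regression check is shown below, assuming PyTorch's CUDA allocator statistics as the measurement. This is not the PR's test: the PR uses `memory_monitor`, whose interface is not shown here, and the helper names `measure_peak_vram_gib` and `assert_peak_vram_below` and any threshold passed to them are hypothetical. Note that `torch.cuda.max_memory_allocated` only counts memory allocated through PyTorch, so it may read lower than a driver-level monitor.

```python
try:
    import torch  # optional: the measurement needs PyTorch built with CUDA
except ImportError:
    torch = None

GIB = 1024 ** 3


def measure_peak_vram_gib(fn, device=0):
    """Run fn() and return the peak CUDA memory allocated by PyTorch, in GiB.

    Returns None when PyTorch or a CUDA device is unavailable, so the
    caller can skip the check on CPU-only machines.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)  # start a fresh peak window
    fn()
    torch.cuda.synchronize(device)  # wait for queued kernels to finish
    return torch.cuda.max_memory_allocated(device) / GIB


def assert_peak_vram_below(fn, limit_gib):
    """Hypothetical regression check: fail if fn() peaks at or above limit_gib."""
    peak = measure_peak_vram_gib(fn)
    if peak is None:
        return None  # nothing to check without CUDA
    assert peak < limit_gib, f"peak VRAM {peak:.2f} GiB exceeds {limit_gib} GiB"
    return peak
```

In a real test, `fn` would wrap the quantization call, and the limit would be chosen with some headroom above the measured post-fix peak so that the test catches a return to whole-model GPU residency without being flaky.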

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| auto_round/compressors_new/calib.py | Removes the MLLM-specific guard so CPU caching can apply to MLLM in the new compressor path. |
| test/test_cuda/models/test_mllm.py | Adds a low-VRAM regression test for MLLM quantization with a list-of-strings dataset using memory_monitor. |

Comment thread test/test_cuda/models/test_mllm.py
Comment thread auto_round/compressors_new/calib.py Outdated
Comment thread test/test_cuda/models/test_mllm.py
@lvliang-intel
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lvliang-intel
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lvliang-intel
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@chensuyue chensuyue added this to the 0.13.0 milestone May 14, 2026
…ix_vlm_large_vram

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lvliang-intel lvliang-intel merged commit 64a5817 into main May 20, 2026
46 checks passed
@lvliang-intel lvliang-intel deleted the lvl/fix_vlm_large_vram branch May 20, 2026 07:56