Integrate vLLM loading support into AutoRound new architecture by changwangss · Pull Request #1804 · intel/auto-round

changwangss · 2026-05-12T14:41:52Z

Description

The previous PR, #1775, implemented the vLLM model-initialization feature on the old architecture.

This PR adds end-to-end vLLM feature support in the new AutoRound architecture by introducing vLLM-aware routing and loading paths, wiring --use_vllm_loading through CLI and context initialization, adding VllmMixin for vLLM-specific calibration behavior, and extending model utilities with vLLM detection/loading helpers while keeping non-vLLM flows unchanged.

Usage

With --enable_opt_rtn:

CUDA_VISIBLE_DEVICES=1 VLLM_ENABLE_V1_MULTIPROCESSING=0 auto-round --model /dataset/Qwen3-30B-A3B-Instruct-2507/ --iters 0 --enable_opt_rtn --format llm_compressor --use_vllm_loading --output_dir vllm_model_1 --scheme NVFP4 --nsamples 4

With --disable_opt_rtn:

CUDA_VISIBLE_DEVICES=1 VLLM_ENABLE_V1_MULTIPROCESSING=0 auto-round --model /dataset/Qwen3-30B-A3B-Instruct-2507/ --iters 0 --disable_opt_rtn --use_vllm_loading --format llm_compressor --output_dir vllm_qmodel_2 --scheme NVFP4 --nsamples 4

Test

Generation script:

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    sampling_params = SamplingParams(
        temperature=0.7,
        max_tokens=32,
    )

    llm = LLM(
        model="vllm_qmodel_2/Qwen3-30B-A3B-Instruct-2507-nvfp-w4g16/",
        tensor_parallel_size=1,
        enforce_eager=True,
        #cpu_offload_gb=100,
        max_model_len=8192,
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
    )

    prompts = ["Hello, my name is"]

    outputs = llm.generate(prompts, sampling_params)
    for idx, output in enumerate(outputs):
        generated_text = output.outputs[0].text
        print(prompts[idx], generated_text)

Output with vLLM loading:

Hello, my name is  Kofi. I am preparing for a job interview, and I've been given a list of questions to answer in English. Please help me answer these questions naturally

Output with transformers loading:

Hello, my name is  Kofi. I am preparing for a job interview, and I've been given a list of possible questions to prepare for. One of them is: "Tell

Type of Change

Bug fix

Related Issues

Fixes or relates to #

Checklist Before Submitting

My code has been tested locally.
Documentation has been updated as needed.
New or updated tests are included where applicable.
The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

Copilot

Pull request overview

This PR integrates an experimental vLLM-based loading/calibration path into AutoRound’s new architecture by adding vLLM model detection/loading utilities, routing to a vLLM-specific compressor mixin, and exposing --use_vllm_loading through the CLI and context initialization.

Changes:

Added vLLM install check + vLLM model detection/loading helpers in model utilities.
Wired use_vllm_loading through new-arch compressor/context initialization and added vLLM-aware routing.
Introduced VllmMixin to perform vLLM-specific calibration behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
auto_round/utils/model.py	Adds vLLM availability check, vLLM loader helper, and vLLM model detection.
auto_round/context/model.py	Routes model initialization through the vLLM loading path and stores the vLLM engine handle.
auto_round/compressors_new/vllm_mixin.py	New mixin to implement vLLM-specific calibration behavior and constraints.
auto_round/compressors_new/entry.py	Adds vLLM model type routing and mixin selection in the new architecture.
auto_round/compressors_new/base.py	Threads `use_vllm_loading` into `ModelContext` creation.
auto_round/main.py	Exposes `--use_vllm_loading` and forwards it into AutoRound construction.

changwangss · 2026-05-13T01:23:43Z

@copilot I’ve addressed the feedback and updated the code. Could you please review it again?

Copilot · 2026-05-13T01:25:53Z

I re-reviewed the updates and confirmed the previously raised items are addressed in commit 30258de. I did not find any additional changes needed in this follow-up pass.

changwangss · 2026-05-13T01:39:19Z

All the previous issues have been resolved. Could you please click "Resolve conversation" on this page to indicate your approval?

Copilot · 2026-05-13T01:40:30Z

I can’t directly click GitHub’s “Resolve conversation” control from this environment. The fixes are already in 30258de; please resolve the thread in the PR UI.

wenhuach21 · 2026-05-13T01:50:08Z

+    if model_or_path is None:
+        return False
+    if isinstance(model_or_path, torch.nn.Module):
+        return True


I think this one is incorrect: an HF model is also a subclass of torch.nn.Module.

Fixed in c6d4034: is_vllm_model() no longer treats generic torch.nn.Module as vLLM.

wenhuach21 · 2026-05-13T01:50:43Z

+    if isinstance(model_or_path, torch.nn.Module):
+        return True
+    if "vllm" not in str(type(model_or_path)).lower():
+        return False


I don't quite understand this condition

Updated in c6d4034: the condition now uses a clearer type/module hint plus llm_engine presence check before nested vLLM-path probing.
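
The revised check described in the two threads above can be sketched roughly as follows. This is a minimal illustration, not the code from c6d4034: the function name looks_like_vllm_object and its exact rules are assumptions, and the nested vLLM-path probing mentioned in the reply is left out.

```python
def looks_like_vllm_object(obj) -> bool:
    """Heuristic sketch: is `obj` a vLLM engine wrapper rather than an HF model?

    Hypothetical helper for illustration only; not AutoRound's is_vllm_model().
    """
    if obj is None or isinstance(obj, str):
        # Missing inputs and path strings are not vLLM objects.
        return False
    # Type/module hint: the class must be defined inside the `vllm` package.
    module_name = getattr(type(obj), "__module__", "") or ""
    defined_in_vllm = module_name == "vllm" or module_name.startswith("vllm.")
    # Presence check: the object must also expose an `llm_engine` attribute.
    return defined_in_vllm and hasattr(obj, "llm_engine")
```

Because the module hint is required, a plain torch.nn.Module such as an HF model is rejected even if it happens to carry an llm_engine attribute.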

wenhuach21 · 2026-05-13T01:53:49Z


        mixin = DiffusionMixin
+    elif model_type == "vllm":
+        from auto_round.compressors_new.vllm_mixin import VllmMixin


The vLLM path only supports RTN/Opt-RTN mode, so we'd better add a check.

Added in c6d4034: vLLM loading now validates RTN/Opt-RTN mode and raises when iters>0.
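
A guard of the kind requested here might look like the sketch below. The function name check_vllm_loading_mode, the exception type, and the message are assumptions for illustration; the thread only confirms that the real check raises when iters > 0.

```python
def check_vllm_loading_mode(use_vllm_loading: bool, iters: int) -> None:
    """Reject tuning configurations that the vLLM loading path cannot serve.

    Hypothetical helper: RTN and Opt-RTN both run with iters == 0, so any
    positive iteration count is treated as unsupported when vLLM loading is on.
    """
    if use_vllm_loading and iters > 0:
        raise ValueError(
            f"vLLM loading only supports RTN/Opt-RTN mode (iters=0), got iters={iters}."
        )
```

Failing early at configuration time avoids loading a large model through vLLM only to discover that the requested tuning mode cannot run.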

wenhuach21 · 2026-05-13T01:55:30Z

Thanks for this feature, it’s really helpful. I noticed that you already tested Qwen3-MoE, but it would be even better to test Qwen3.6-MoE as well, since it is a VLM and also includes MTP.

wenhuach21 · 2026-05-13T01:56:55Z

Besides, if VLLM_ENABLE_V1_MULTIPROCESSING is required, should we set it directly in the code?

wenhuach21 · 2026-05-13T02:01:11Z

            "--enable_torch_compile", action="store_true", help="Enable PyTorch compilation for faster execution. "
        )
+        basic.add_argument(
+            "--use_vllm_loading",


We generally use the enable_xxx naming convention for boolean variables. Please sync with the other reviewers to confirm whether this name is appropriate or whether there is a better one.

Updated in c6d4034: CLI now supports --enable_vllm_loading (with --use_vllm_loading kept as alias) to align boolean naming conventions.
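
One way to keep the old flag as an alias is to register both option strings on a single argparse action, as in this sketch. The parser name, the dest, and the help text are assumptions; only the two flag names come from the thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser showing one boolean option with two spellings."""
    parser = argparse.ArgumentParser(prog="auto-round-sketch")
    parser.add_argument(
        "--enable_vllm_loading",  # preferred name, follows the enable_xxx convention
        "--use_vllm_loading",     # older spelling kept as an alias
        dest="enable_vllm_loading",
        action="store_true",
        help="Load the model through vLLM (illustrative help text).",
    )
    return parser
```

Both spellings set the same attribute, so downstream code only has to read args.enable_vllm_loading.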

wenhuach21 · 2026-05-13T02:03:44Z

+        else:
+            self.dataloader = self.dataset
+
+        sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1)


Although I don’t think these parameters have any effect during tuning, I would still recommend using t=1.0.

Adjusted in c6d4034: vLLM calibration sampling now uses temperature=1.0.

xin3he · 2026-05-13T02:14:55Z

+    assert model_utils.is_vllm_model(fake_vllm_obj) is False
+
+
+def test_vllm_load_model_raises_when_vllm_not_installed(monkeypatch):


It would be best to add a model test within the test_cuda folder to ensure that this feature is robust and reliable.

Added in c6d4034: a CUDA integration test (test_vllm_loading_requires_rtn_mode) under test/test_cuda/integrations/test_vllm.py to guard this vLLM path.

yiliu30 · 2026-05-13T10:54:46Z

+                total_cnt += step
+                pbar.update(step)
+                if total_cnt >= nsamples:
+                    break


Should we clean up the vLLM instance and release the resources manually?

Addressed in c6d4034: after vLLM calibration we now clear the retained vLLM handle (model_context.llm = None) and trigger GC to release resources earlier.
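
The cleanup step described in this reply can be illustrated with a stand-in context object. ModelContextStub and release_vllm_handle are hypothetical names; the sketch only shows the pattern of dropping the retained reference and then running the garbage collector.

```python
import gc


class ModelContextStub:
    """Stand-in for a context object that retains a vLLM engine handle."""

    def __init__(self, llm):
        self.llm = llm


def release_vllm_handle(model_context) -> None:
    """Drop the retained engine reference and collect garbage right away."""
    model_context.llm = None  # remove the last long-lived reference to the engine
    gc.collect()              # reclaim reference cycles without waiting for the next GC pass
```

Clearing the attribute only frees the engine if no other reference to it survives, so callers holding their own handle would need to drop that one as well.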

yiliu30 · 2026-05-13T10:55:57Z

+    from vllm import LLM
+
+    if isinstance(pretrained_model_name_or_path, str):
+        os.environ.setdefault("VLLM_ENABLE_V1_MULTIPROCESSING", "0")


Please export it before importing vLLM.

Done in c6d4034: VLLM_ENABLE_V1_MULTIPROCESSING is now exported via os.environ.setdefault(...) before importing vLLM.
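
The ordering requested here, with the environment variable exported first and the import second, can be sketched as below. The helper names prepare_vllm_env and import_vllm_llm are assumptions; os.environ.setdefault and the variable name come from the diff above.

```python
import importlib.util
import os


def prepare_vllm_env(env=os.environ) -> str:
    """Set the multiprocessing default without overriding a user-provided value."""
    env.setdefault("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
    return env["VLLM_ENABLE_V1_MULTIPROCESSING"]


def import_vllm_llm():
    """Export the variable first, then import vLLM lazily."""
    prepare_vllm_env()
    if importlib.util.find_spec("vllm") is None:
        raise ImportError("vLLM is not installed; install it to use vLLM loading.")
    from vllm import LLM  # imported only after the environment is prepared

    return LLM
```

Using setdefault means a value exported on the command line, as in the usage commands in the description, still takes precedence over the in-code default.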

changwangss · 2026-05-14T03:06:00Z

@copilot resolve the merge conflicts in this pull request

Co-authored-by: changwangss <24910407+changwangss@users.noreply.github.com>

Copilot · 2026-05-14T03:11:27Z

besides, if VLLM_ENABLE_V1_MULTIPROCESSING is required, do we need to set it directly in the code?

Implemented in c6d4034: the code now sets VLLM_ENABLE_V1_MULTIPROCESSING with setdefault before vLLM import, so users don’t need to set it manually in normal flow.

@copilot resolve the merge conflicts in this pull request

Merge conflicts are resolved in c6d4034 (merged origin/main into this branch and fixed the vLLM-related conflicts).

feat(vllm): integrate use_vllm_loading into new AutoRound architecture with CLI/context/model routing and VllmMixin

ebd1e40

Signed-off-by: changwangss <chang1.wang@intel.com>

Copilot AI review requested due to automatic review settings May 12, 2026 14:41

Copilot started reviewing on behalf of changwangss May 12, 2026 14:42

[pre-commit.ci] auto fixes from pre-commit.com hooks

2e9f3e3

for more information, see https://pre-commit.ci

Copilot AI reviewed May 12, 2026

changwangss and others added 2 commits May 12, 2026 22:58

Potential fix for pull request finding

ba1c835

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

b4a86eb

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

changwangss mentioned this pull request May 13, 2026

explore how a vLLM model can be used as input for quantization. #1119

Open

honor detected model type for use_vllm_loading, harden vLLM fallback for object inputs, and add vLLM utils unit tests

30258de

Signed-off-by: changwangss <chang1.wang@intel.com>

Copilot started work on behalf of changwangss May 13, 2026 01:23

Copilot finished work on behalf of changwangss May 13, 2026 01:26

Copilot started work on behalf of changwangss May 13, 2026 01:39

Copilot finished work on behalf of changwangss May 13, 2026 01:41

changwangss requested review from n1ck-guo and wenhuach21 May 13, 2026 01:41

wenhuach21 reviewed May 13, 2026

wenhuach21 requested a review from xin3he May 13, 2026 01:55

wenhuach21 approved these changes May 13, 2026

wenhuach21 requested a review from yiliu30 May 13, 2026 01:56

wenhuach21 reviewed May 13, 2026

xin3he reviewed May 13, 2026

chensuyue added this to the 0.13.0 milestone May 13, 2026

yiliu30 reviewed May 13, 2026

yiliu30 approved these changes May 13, 2026

Copilot started work on behalf of changwangss May 14, 2026 03:06

Merge origin/main and address vLLM review follow-ups

c6d4034

Co-authored-by: changwangss <24910407+changwangss@users.noreply.github.com>

Copilot finished work on behalf of changwangss May 14, 2026 03:12

changwangss closed this May 20, 2026
