Integrate vLLM loading support into AutoRound new architecture#1804

Closed. changwangss wants to merge 6 commits into main from wangchang/vllm


@changwangss changwangss commented May 12, 2026

Description

task #1119

The previous PR, #1775, implemented the vLLM model-initialization feature on the old architecture.

This PR adds end-to-end vLLM support to the new AutoRound architecture. It introduces vLLM-aware routing and loading paths, wires --use_vllm_loading through the CLI and context initialization, adds VllmMixin for vLLM-specific calibration behavior, and extends the model utilities with vLLM detection/loading helpers, while keeping non-vLLM flows unchanged.

Usage

With --enable_opt_rtn:

CUDA_VISIBLE_DEVICES=1 VLLM_ENABLE_V1_MULTIPROCESSING=0 auto-round --model /dataset/Qwen3-30B-A3B-Instruct-2507/ --iters 0 --enable_opt_rtn --format llm_compressor --use_vllm_loading --output_dir vllm_model_1 --scheme NVFP4  --nsamples 4

With --disable_opt_rtn:

CUDA_VISIBLE_DEVICES=1 VLLM_ENABLE_V1_MULTIPROCESSING=0 auto-round --model /dataset/Qwen3-30B-A3B-Instruct-2507/ --iters 0 --disable_opt_rtn  --use_vllm_loading --format llm_compressor --output_dir vllm_qmodel_2 --scheme NVFP4  --nsamples 4

Test

Generation script:

from vllm import LLM, SamplingParams

if __name__ == "__main__":
    sampling_params = SamplingParams(
        temperature=0.7,
        max_tokens=32,
    )

    llm = LLM(
        model="vllm_qmodel_2/Qwen3-30B-A3B-Instruct-2507-nvfp-w4g16/",
        tensor_parallel_size=1,
        enforce_eager=True,
        #cpu_offload_gb=100,
        max_model_len=8192,
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
    )

    prompts = ["Hello, my name is"]

    outputs = llm.generate(prompts, sampling_params)
    for idx, output in enumerate(outputs):
        generated_text = output.outputs[0].text
        print(prompts[idx], generated_text)

Output with vLLM loading:

Hello, my name is  Kofi. I am preparing for a job interview, and I've been given a list of questions to answer in English. Please help me answer these questions naturally

Output with transformers loading:

Hello, my name is  Kofi. I am preparing for a job interview, and I've been given a list of possible questions to prepare for. One of them is: "Tell

Type of Change

Bug fix

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

…e with CLI/context/model routing and VllmMixin

Signed-off-by: changwangss <chang1.wang@intel.com>
Copilot AI review requested due to automatic review settings May 12, 2026 14:41
Contributor

Copilot AI left a comment

Pull request overview

This PR integrates an experimental vLLM-based loading/calibration path into AutoRound’s new architecture by adding vLLM model detection/loading utilities, routing to a vLLM-specific compressor mixin, and exposing --use_vllm_loading through the CLI and context initialization.

Changes:

  • Added vLLM install check + vLLM model detection/loading helpers in model utilities.
  • Wired use_vllm_loading through new-arch compressor/context initialization and added vLLM-aware routing.
  • Introduced VllmMixin to perform vLLM-specific calibration behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| auto_round/utils/model.py | Adds vLLM availability check, vLLM loader helper, and vLLM model detection. |
| auto_round/context/model.py | Routes model initialization through the vLLM loading path and stores the vLLM engine handle. |
| auto_round/compressors_new/vllm_mixin.py | New mixin to implement vLLM-specific calibration behavior and constraints. |
| auto_round/compressors_new/entry.py | Adds vLLM model type routing and mixin selection in the new architecture. |
| auto_round/compressors_new/base.py | Threads use_vllm_loading into ModelContext creation. |
| auto_round/__main__.py | Exposes --use_vllm_loading and forwards it into AutoRound construction. |

Comment thread auto_round/utils/model.py Outdated
Comment thread auto_round/context/model.py
Comment thread auto_round/compressors_new/entry.py Outdated
Comment thread auto_round/compressors_new/vllm_mixin.py Outdated
Comment thread auto_round/compressors_new/vllm_mixin.py Outdated
Comment thread auto_round/utils/model.py
changwangss and others added 2 commits May 12, 2026 22:58
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…for object inputs, and add vLLM utils unit tests

Signed-off-by: changwangss <chang1.wang@intel.com>
@changwangss
Author

@copilot I’ve addressed the feedback and updated the code. Could you please review it again?

Contributor

Copilot AI commented May 13, 2026

> @copilot I’ve addressed the feedback and updated the code. Could you please review it again?

I re-reviewed the updates and confirmed the previously raised items are addressed in commit 30258de. I did not find any additional changes needed in this follow-up pass.

@changwangss
Author

> > @copilot I’ve addressed the feedback and updated the code. Could you please review it again?
>
> I re-reviewed the updates and confirmed the previously raised items are addressed in commit 30258de. I did not find any additional changes needed in this follow-up pass.

All the previous issues have been resolved. Could you please click "Resolve conversation" on this page to indicate your approval?

Contributor

Copilot AI commented May 13, 2026

> > @copilot I’ve addressed the feedback and updated the code. Could you please review it again?
>
> I re-reviewed the updates and confirmed the previously raised items are addressed in commit `3...

I can’t directly click GitHub’s “Resolve conversation” control from this environment. The fixes are already in 30258de; please resolve the thread in the PR UI.

Comment thread auto_round/utils/model.py Outdated
if model_or_path is None:
    return False
if isinstance(model_or_path, torch.nn.Module):
    return True
Contributor

I think this is incorrect: an HF model is also a subclass of torch.nn.Module.

Contributor

Fixed in c6d4034: is_vllm_model() no longer treats generic torch.nn.Module as vLLM.

Comment thread auto_round/utils/model.py
if isinstance(model_or_path, torch.nn.Module):
    return True
if "vllm" not in str(type(model_or_path)).lower():
    return False
Contributor

I don't quite understand this condition

Contributor

Updated in c6d4034: the condition now uses a clearer type/module hint plus llm_engine presence check before nested vLLM-path probing.
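
For illustration, a minimal sketch of the kind of check this reply describes, assuming detection relies on the module where the object's class is defined plus the presence of an `llm_engine` attribute. The helper name `looks_like_vllm_engine` and both fake classes are hypothetical; this is not the code from c6d4034.

```python
def looks_like_vllm_engine(obj) -> bool:
    """Heuristic: True only for objects defined in a vllm module that expose llm_engine."""
    if obj is None:
        return False
    # Type/module hint: where was the object's class defined?
    module_name = type(obj).__module__ or ""
    defined_in_vllm = module_name == "vllm" or module_name.startswith("vllm.")
    # Presence check: assumed here to be the attribute that marks an engine wrapper.
    return defined_in_vllm and hasattr(obj, "llm_engine")


class FakeHFModel:
    """Stand-in for a Hugging Face model; not defined in a vllm module."""


class FakeVllmLLM:
    """Stand-in for a vLLM engine wrapper."""

    def __init__(self):
        self.llm_engine = object()


# Pretend the class lives in the vllm package (illustration only).
FakeVllmLLM.__module__ = "vllm.entrypoints.llm"

print(looks_like_vllm_engine(FakeHFModel()))  # False
print(looks_like_vllm_engine(FakeVllmLLM()))  # True
```

Unlike an `isinstance(..., torch.nn.Module)` test, a check of this shape does not fire for ordinary Hugging Face models, which is the concern raised in the thread above.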

Comment thread auto_round/compressors_new/entry.py Outdated

    mixin = DiffusionMixin
elif model_type == "vllm":
    from auto_round.compressors_new.vllm_mixin import VllmMixin
Contributor

The vLLM path only supports RTN/Opt-RTN mode, so we'd better add a check.

Contributor

Added in c6d4034: vLLM loading now validates RTN/Opt-RTN mode and raises when iters>0.
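
A minimal sketch of such a guard, assuming the rule is simply "vLLM loading requires iters == 0". The function name `validate_vllm_loading_mode`, its parameters, and the error message are hypothetical and may differ from what c6d4034 actually does.

```python
def validate_vllm_loading_mode(use_vllm_loading: bool, iters: int) -> None:
    """Raise if vLLM loading is combined with iterative tuning (iters > 0)."""
    if use_vllm_loading and iters > 0:
        raise ValueError(
            f"vLLM loading only supports RTN/Opt-RTN (iters=0), but got iters={iters}."
        )


# RTN/Opt-RTN with vLLM loading passes silently.
validate_vllm_loading_mode(use_vllm_loading=True, iters=0)

# Iterative tuning with vLLM loading is rejected.
try:
    validate_vllm_loading_mode(use_vllm_loading=True, iters=200)
except ValueError as err:
    print(err)
```

Failing early in this way surfaces the unsupported combination before any model is loaded.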

@wenhuach21
Contributor

Thanks for this feature, it’s really helpful. I noticed that you already tested Qwen3-MoE, but it would be even better to test Qwen3.6-MoE as well, since it is a VLM and also includes MTP.

@wenhuach21 wenhuach21 requested a review from xin3he May 13, 2026 01:55
@wenhuach21 wenhuach21 requested a review from yiliu30 May 13, 2026 01:56
@wenhuach21
Contributor

Besides, if VLLM_ENABLE_V1_MULTIPROCESSING is required, should we set it directly in the code?

Comment thread auto_round/__main__.py
"--enable_torch_compile", action="store_true", help="Enable PyTorch compilation for faster execution. "
)
basic.add_argument(
    "--use_vllm_loading",
Contributor

We generally use the enable_xxx naming convention for boolean variables. Please sync with the other reviewers to confirm whether this name is appropriate or whether there is a better one.

Contributor

Updated in c6d4034: CLI now supports --enable_vllm_loading (with --use_vllm_loading kept as alias) to align boolean naming conventions.
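
One way to keep the old flag as an alias with standard argparse is to pass both option strings to a single `add_argument` call. This is a sketch under assumptions: the standalone `parser`, the `dest` name, and the help text are illustrative, and the PR itself registers the flag on an argument group named `basic`.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable_vllm_loading",
    "--use_vllm_loading",  # older spelling kept as an alias
    dest="enable_vllm_loading",
    action="store_true",
    help="Load the model through vLLM (assumed help text).",
)

# Both spellings set the same attribute.
print(parser.parse_args(["--enable_vllm_loading"]).enable_vllm_loading)  # True
print(parser.parse_args(["--use_vllm_loading"]).enable_vllm_loading)  # True
print(parser.parse_args([]).enable_vllm_loading)  # False
```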

else:
    self.dataloader = self.dataset

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1)
Contributor

Although I don’t think these parameters have any effect during tuning, I would still recommend using t=1.0.

Contributor

Adjusted in c6d4034: vLLM calibration sampling now uses temperature=1.0.

    assert model_utils.is_vllm_model(fake_vllm_obj) is False


def test_vllm_load_model_raises_when_vllm_not_installed(monkeypatch):
Contributor

It would be best to add a model test within the test_cuda folder to ensure that this feature is robust and reliable.

Contributor

Added in c6d4034: a CUDA integration test (test_vllm_loading_requires_rtn_mode) under test/test_cuda/integrations/test_vllm.py to guard this vLLM path.

@chensuyue chensuyue added this to the 0.13.0 milestone May 13, 2026
total_cnt += step
pbar.update(step)
if total_cnt >= nsamples:
    break
Contributor

Should we clean up the vLLM instance and release the resources manually?

Contributor

Addressed in c6d4034: after vLLM calibration we now clear the retained vLLM handle (model_context.llm = None) and trigger GC to release resources earlier.
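
The cleanup pattern described here can be sketched as follows. `FakeEngine`, `release_vllm_handle`, and the `SimpleNamespace` context are hypothetical stand-ins; only the `model_context.llm = None` assignment and the GC call come from the reply above.

```python
import gc
import weakref
from types import SimpleNamespace


class FakeEngine:
    """Stand-in for the retained vLLM engine handle."""


def release_vllm_handle(model_context) -> None:
    """Drop the context's reference to the engine and run a GC pass."""
    model_context.llm = None
    gc.collect()


ctx = SimpleNamespace(llm=FakeEngine())
probe = weakref.ref(ctx.llm)  # lets us observe whether the engine object is freed

release_vllm_handle(ctx)
print(probe() is None)  # whether the engine object has been collected
```

Dropping the reference only frees the engine if nothing else still points at it, and a real vLLM engine may hold GPU memory that needs further cleanup beyond this step.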

Comment thread auto_round/utils/model.py Outdated
from vllm import LLM

if isinstance(pretrained_model_name_or_path, str):
    os.environ.setdefault("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
Contributor

Please export it before importing vLLM.

Contributor

Done in c6d4034: VLLM_ENABLE_V1_MULTIPROCESSING is now exported via os.environ.setdefault(...) before importing vLLM.
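
A minimal sketch of the ordering being requested, assuming the variable should be set before `vllm` is first imported. The helper names `prepare_vllm_env` and `load_llm_class` are hypothetical. Because `os.environ.setdefault` only writes when the key is absent, a value the user exported beforehand is preserved.

```python
import os

ENV_KEY = "VLLM_ENABLE_V1_MULTIPROCESSING"


def prepare_vllm_env() -> str:
    """Set the default before vLLM is imported; keep any user-provided value."""
    os.environ.setdefault(ENV_KEY, "0")
    return os.environ[ENV_KEY]


def load_llm_class():
    """Import vLLM lazily, only after the environment has been prepared."""
    prepare_vllm_env()
    from vllm import LLM  # requires vLLM to be installed

    return LLM


os.environ.pop(ENV_KEY, None)
print(prepare_vllm_env())  # 0

os.environ[ENV_KEY] = "1"
print(prepare_vllm_env())  # 1
```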

@changwangss
Author

@copilot resolve the merge conflicts in this pull request

Co-authored-by: changwangss <24910407+changwangss@users.noreply.github.com>
Contributor

Copilot AI commented May 14, 2026

> besides, if VLLM_ENABLE_V1_MULTIPROCESSING is required, do we need to set it directly in the code?

Implemented in c6d4034: the code now sets VLLM_ENABLE_V1_MULTIPROCESSING with setdefault before vLLM import, so users don’t need to set it manually in normal flow.

> @copilot resolve the merge conflicts in this pull request

Merge conflicts are resolved in c6d4034 (merged origin/main into this branch and fixed the vLLM-related conflicts).


7 participants