Integrate vLLM loading support into AutoRound new architecture #1804
changwangss wants to merge 6 commits into
…e with CLI/context/model routing and VllmMixin Signed-off-by: changwangss <chang1.wang@intel.com>
for more information, see https://pre-commit.ci
Pull request overview
This PR integrates an experimental vLLM-based loading/calibration path into AutoRound’s new architecture by adding vLLM model detection/loading utilities, routing to a vLLM-specific compressor mixin, and exposing --use_vllm_loading through the CLI and context initialization.
Changes:
- Added vLLM install check + vLLM model detection/loading helpers in model utilities.
- Wired use_vllm_loading through new-arch compressor/context initialization and added vLLM-aware routing.
- Introduced VllmMixin to perform vLLM-specific calibration behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| auto_round/utils/model.py | Adds vLLM availability check, vLLM loader helper, and vLLM model detection. |
| auto_round/context/model.py | Routes model initialization through the vLLM loading path and stores the vLLM engine handle. |
| auto_round/compressors_new/vllm_mixin.py | New mixin to implement vLLM-specific calibration behavior and constraints. |
| auto_round/compressors_new/entry.py | Adds vLLM model type routing and mixin selection in the new architecture. |
| auto_round/compressors_new/base.py | Threads use_vllm_loading into ModelContext creation. |
| auto_round/main.py | Exposes --use_vllm_loading and forwards it into AutoRound construction. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…for object inputs, and add vLLM utils unit tests Signed-off-by: changwangss <chang1.wang@intel.com>
@copilot I’ve addressed the feedback and updated the code. Could you please review it again?
I re-reviewed the updates and confirmed the previously raised items are addressed in commit
All the previous issues have been resolved. Could you please click "Resolve conversation" on this page to indicate your approval?
I can’t directly click GitHub’s “Resolve conversation” control from this environment. The fixes are already in
    if model_or_path is None:
        return False
    if isinstance(model_or_path, torch.nn.Module):
        return True
I think this one is incorrect: an HF model is also a subclass of torch.nn.Module.
Fixed in c6d4034: is_vllm_model() no longer treats generic torch.nn.Module as vLLM.
    if isinstance(model_or_path, torch.nn.Module):
        return True
    if "vllm" not in str(type(model_or_path)).lower():
        return False
I don't quite understand this condition
Updated in c6d4034: the condition now uses a clearer type/module hint plus llm_engine presence check before nested vLLM-path probing.
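The check described in this reply can be pictured with a minimal sketch. It is an illustration under assumptions, not the code from c6d4034: the helper name looks_like_vllm_object is hypothetical, and it covers only the type/module hint and the llm_engine presence check, not the nested vLLM-path probing.

```python
def looks_like_vllm_object(obj) -> bool:
    """Hypothetical helper: guess whether obj is a vLLM engine wrapper."""
    # None and plain path strings are never treated as vLLM objects here.
    if obj is None or isinstance(obj, str):
        return False
    cls = type(obj)
    # Type/module hint: the class must be defined under a "vllm" module.
    hint = f"{cls.__module__}.{cls.__qualname__}".lower()
    if "vllm" not in hint:
        return False
    # Presence check: the wrapper is expected to expose an llm_engine attribute.
    return hasattr(obj, "llm_engine")
```

Because the decision depends on where the class is defined rather than on torch.nn.Module inheritance, an ordinary HF model is not misclassified as vLLM.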
        mixin = DiffusionMixin
    elif model_type == "vllm":
        from auto_round.compressors_new.vllm_mixin import VllmMixin
The vLLM path only supports RTN/Opt-RTN mode, so we'd better add a checker.
Added in c6d4034: vLLM loading now validates RTN/Opt-RTN mode and raises when iters>0.
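A guard of this kind might look like the sketch below. The function name check_vllm_loading_mode and the choice of ValueError are assumptions for illustration; the thread only states that the check raises when iters>0.

```python
def check_vllm_loading_mode(use_vllm_loading: bool, iters: int) -> None:
    """Hypothetical guard: the vLLM path supports only RTN/Opt-RTN (iters == 0)."""
    if use_vllm_loading and iters > 0:
        # Tuning iterations are not supported on the vLLM path, so fail early.
        raise ValueError(
            f"vLLM loading supports only RTN/Opt-RTN mode (iters=0), got iters={iters}."
        )
```

Running the check once, before the model is loaded, keeps an unsupported configuration from paying the cost of starting a vLLM engine.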
Thanks for this feature, it’s really helpful. I noticed that you already tested Qwen3-MoE, but it would be even better to test Qwen3.6-MoE as well, since it is a VLM and also includes MTP.
Besides, if VLLM_ENABLE_V1_MULTIPROCESSING is required, do we need to set it directly in the code?
        "--enable_torch_compile", action="store_true", help="Enable PyTorch compilation for faster execution. "
    )
    basic.add_argument(
        "--use_vllm_loading",
We generally use the enable_xxx naming convention for boolean variables. Please sync with the other reviewers to confirm whether this name is appropriate or whether there is a better one.
Updated in c6d4034: CLI now supports --enable_vllm_loading (with --use_vllm_loading kept as alias) to align boolean naming conventions.
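With argparse, one way to keep an old flag as an alias is to pass both option strings to a single add_argument call and pin dest. The sketch below assumes a group named basic as in the diff context; the group title and help text are placeholders, not taken from the PR.

```python
import argparse

parser = argparse.ArgumentParser()
basic = parser.add_argument_group("Basic Arguments")  # title is a placeholder
basic.add_argument(
    "--enable_vllm_loading",
    "--use_vllm_loading",  # older spelling kept as an alias
    dest="enable_vllm_loading",  # both spellings write to the same attribute
    action="store_true",
    help="Load the model through vLLM (placeholder help text).",
)
```

Both spellings then set args.enable_vllm_loading, so downstream code only needs to read one attribute.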
    else:
        self.dataloader = self.dataset

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1)
Although I don’t think these parameters have any effect during tuning, I would still recommend using temperature=1.0.
Adjusted in c6d4034: vLLM calibration sampling now uses temperature=1.0.
        assert model_utils.is_vllm_model(fake_vllm_obj) is False


    def test_vllm_load_model_raises_when_vllm_not_installed(monkeypatch):
It would be best to add a model test within the test_cuda folder to ensure that this feature is robust and reliable.
Added in c6d4034: a CUDA integration test (test_vllm_loading_requires_rtn_mode) under test/test_cuda/integrations/test_vllm.py to guard this vLLM path.
        total_cnt += step
        pbar.update(step)
        if total_cnt >= nsamples:
            break
Should we clean up the vLLM instance and release the resources manually?
Addressed in c6d4034: after vLLM calibration we now clear the retained vLLM handle (model_context.llm = None) and trigger GC to release resources earlier.
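The cleanup described in this reply can be sketched roughly as follows. The helper name release_vllm_handle is hypothetical; the only details taken from the thread are that the retained handle lives at model_context.llm and that GC is triggered afterwards.

```python
import gc


def release_vllm_handle(model_context) -> None:
    """Hypothetical helper: drop the retained vLLM handle after calibration."""
    # Clearing the attribute removes this reference to the engine object.
    model_context.llm = None
    # A collection pass also frees objects that are only kept alive by cycles.
    gc.collect()
```

This releases the engine only if no other object still holds a reference to it, so any other long-lived references would need to be cleared as well.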
    from vllm import LLM

    if isinstance(pretrained_model_name_or_path, str):
        os.environ.setdefault("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
Please export it before importing vLLM.
Done in c6d4034: VLLM_ENABLE_V1_MULTIPROCESSING is now exported via os.environ.setdefault(...) before importing vLLM.
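The ordering requested here (set the variable first, import vLLM afterwards) can be sketched as below. The function names configure_vllm_env and load_llm are hypothetical, and load_llm assumes vLLM is installed; only the os.environ.setdefault call and the variable name come from the thread.

```python
import os


def configure_vllm_env() -> str:
    """Hypothetical helper: set the default before vLLM is imported."""
    # setdefault keeps a value the user already exported and only fills a gap.
    os.environ.setdefault("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
    return os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"]


def load_llm(model_path: str):
    """Hypothetical loader: import vLLM only after the environment is prepared."""
    configure_vllm_env()
    from vllm import LLM  # deferred import, so the variable is already set

    return LLM(model=model_path)
```

Using setdefault rather than a plain assignment means a user who deliberately exports VLLM_ENABLE_V1_MULTIPROCESSING=1 is not silently overridden.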
@copilot resolve the merge conflicts in this pull request
Co-authored-by: changwangss <24910407+changwangss@users.noreply.github.com>
Implemented in
Merge conflicts are resolved in
Description
Task #1119.
The previous PR #1775 implemented the vLLM model-initialization feature in the old architecture.
This PR adds end-to-end vLLM support in the new AutoRound architecture by introducing vLLM-aware routing and loading paths, wiring --use_vllm_loading through CLI and context initialization, adding VllmMixin for vLLM-specific calibration behavior, and extending model utilities with vLLM detection/loading helpers, while keeping non-vLLM flows unchanged.
usage
enable_opt_rtn
disable_opt_rtn
test
generate script
use vllm loading.
use transformers loading.
Type of Change
Bug fix
Related Issues
Fixes or relates to #
Checklist Before Submitting
/azp run Unit-Test-CUDA-AutoRound.