Awq algorithm#1749

Open
WeiweiZhang1 wants to merge 141 commits into main from awq_algorithm
Conversation

Contributor

@WeiweiZhang1 WeiweiZhang1 commented Apr 28, 2026

Description

Add AWQ algorithm support for the new architecture

Type of Change

New feature

Related Issues

Fixes or relates to #1469

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting `/azp run Unit-Test-CUDA-AutoRound`.

Benchmark Results (W8A8_INT8, group_size=128, sym=True, fake)

More results are still WIP.

Llama-3.1-8B-Instruct

| Method | Arc_c | Arc_e | Boolq | HellaSwag | Lambada | MMLU | OpenBookQA | PIQA | TruthfulQA | WinoGrande | AVG | Peak RAM / VRAM | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BF16 | 0.5333 | 0.8215 | 0.8550 | 0.5976 | 0.7213 | 0.6838 | 0.3560 | 0.8020 | 0.3819 | 0.7411 | 0.6494 | | |
| AR_AWQ | 0.5478 | 0.8258 | 0.8517 | 0.5960 | 0.7223 | 0.6805 | 0.3580 | 0.7949 | 0.3782 | 0.7372 | 0.6492 | 24.68 / 18.80 GB | 927s |
| AR_RTN | 0.5350 | 0.8203 | 0.8505 | 0.5963 | 0.7223 | 0.6748 | 0.3580 | 0.7965 | 0.3819 | 0.7332 | 0.6469 | 27.13 / 1.41 GB | 200s |
| AR | 0.5290 | 0.8190 | 0.8517 | 0.5959 | 0.7192 | 0.6785 | 0.3600 | 0.8020 | 0.3758 | 0.7340 | 0.6465 | 22.83 / 21.29 GB | 1413s |
| LLMC_AWQ | 0.5358 | 0.8203 | 0.8526 | 0.5957 | 0.7169 | 0.6778 | 0.3560 | 0.7949 | 0.3794 | 0.7348 | 0.6464 | 4.41 / 30.99 GB* | 782s |
| LLMC_SmoothQuant | 0.5341 | 0.8203 | 0.8492 | 0.5966 | 0.7190 | 0.6754 | 0.3560 | 0.7943 | 0.3745 | 0.7324 | 0.6452 | WIP | 1035s |

Qwen3-8B

| Method | Arc_c | Arc_e | Boolq | HellaSwag | Lambada | MMLU | OpenBookQA | PIQA | TruthfulQA | WinoGrande | AVG | Peak RAM / VRAM | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BF16 | 0.5529 | 0.8363 | 0.8670 | 0.5715 | 0.6420 | 0.7288 | 0.3160 | 0.7655 | 0.3647 | 0.6788 | 0.6324 | | |
| AR_AWQ | 0.5512 | 0.8279 | 0.8642 | 0.5716 | 0.6583 | 0.7251 | 0.3200 | 0.7693 | 0.3623 | 0.6819 | 0.6332 | 22.60 / 16.46 GB | |
| AR_RTN | 0.5555 | 0.8224 | 0.8654 | 0.5689 | 0.6682 | 0.7279 | 0.3180 | 0.7699 | 0.3574 | 0.6803 | 0.6334 | 27.00 / 1.25 GB | |
| AR | 0.5392 | 0.8224 | 0.8664 | 0.5654 | 0.6270 | 0.7303 | 0.3020 | 0.7606 | 0.3574 | 0.6796 | 0.6250 | 15.07 / 19.51 GB | |
| LLMC_AWQ | 0.5469 | 0.8291 | 0.8700 | 0.5651 | 0.6375 | 0.7296 | 0.3160 | 0.7644 | 0.3660 | 0.6898 | 0.6314 | | |
| LLMC_SmoothQuant | 0.5529 | 0.8270 | 0.8654 | 0.5702 | 0.6716 | 0.7242 | 0.3160 | 0.7606 | 0.3427 | 0.6890 | 0.6320 | | |

Notes:

  • AR_AWQ / LLMC_AWQ: nsamples=128, calibration data=pile (AR) / ultrachat_200k (LLMC)
  • LLMC_SmoothQuant: nsamples=512, ultrachat_200k

n1ck-guo and others added 30 commits March 13, 2026 10:06
Signed-off-by: n1ck-guo <heng.guo@intel.com>
@WeiweiZhang1 WeiweiZhang1 removed the WIP label May 10, 2026
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
@WeiweiZhang1 WeiweiZhang1 requested a review from Copilot May 11, 2026 05:16
@WeiweiZhang1
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@WeiweiZhang1
Contributor Author

@copilot resolve the merge conflicts in this pull request

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 9 comments.

Comments suppressed due to low confidence (1)

auto_round/compressors_new/base.py:1

  • format.save_quantized(...) is being called with a new algorithm= kwarg. Unless every format implementation’s save_quantized signature (and any downstream wrappers) has been updated to accept this parameter, this will raise a TypeError at runtime for unrelated formats. Consider either (a) updating the base format interface to accept algorithm: str | None = None everywhere, or (b) only passing algorithm when the target format explicitly supports it (e.g., by feature-detecting via inspect.signature, or by routing through kwargs with a known-safe key for that formatter).
# Copyright (c) 2026 Intel Corporation
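Option (b) could look like the following sketch; `save_quantized_compat`, `fmt`, and the parameter names are illustrative stand-ins, not the actual auto_round interface:

```python
import inspect

def save_quantized_compat(fmt, output_dir, algorithm=None, **kwargs):
    """Pass `algorithm` to fmt.save_quantized only if the format accepts it.

    Hypothetical helper: feature-detects the target signature so formats
    that were never updated to take `algorithm` keep working unchanged.
    """
    params = inspect.signature(fmt.save_quantized).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if "algorithm" in params or accepts_var_kw:
        kwargs["algorithm"] = algorithm
    return fmt.save_quantized(output_dir, **kwargs)
```

This keeps the base interface untouched while letting formats opt in to the new kwarg at their own pace.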

Comment thread auto_round/__main__.py
Comment thread auto_round/compressors_new/entry.py
Comment thread auto_round/algorithms/quantization/awq/quantizer.py
Comment thread auto_round/algorithms/quantization/awq/quantizer.py
Comment thread auto_round/compressors_new/awq.py
Comment thread test/test_cuda/algorithms/test_awq.py Outdated
Comment thread test/test_cuda/algorithms/test_awq.py Outdated
Comment thread test/test_cuda/algorithms/test_awq.py
Comment thread test/test_cpu/algorithms/test_awq.py Outdated
Co-authored-by: WeiweiZhang1 <109071285+WeiweiZhang1@users.noreply.github.com>
Contributor

Copilot AI commented May 11, 2026

Why is AR so slow now? Without torch.compile it should be <1200s; with torch.compile, <800s.

You’re right. The AR time shown there is slower than expected; that number was from a non-optimized run and doesn’t reflect the expected <1200s (no compile) / <800s (with torch.compile) target. I’ll rerun and refresh the benchmark numbers accordingly.

What's the scheme? For W4A16, MMLU should be below 0.67; is this W8A8?

It’s W8A8 (fake quant), not W4A16. The benchmark section header is W8A8_INT8, group_size=128, sym=True, fake; I’ll make this more prominent to avoid confusion.
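For reference, the group-wise symmetric int8 fake quantization being benchmarked can be sketched as follows; this is an illustrative standalone snippet under the stated settings (sym=True, group_size=128), not this PR's implementation:

```python
import torch

def fake_quant_sym_int8(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Symmetric int8 fake quantization with per-group scales.

    Weights are rounded to the int8 grid and immediately dequantized, so the
    model keeps running in floating point ("fake" quant). Illustrative sketch.
    """
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    # One scale per group of `group_size` values, max-abs calibrated.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-5) / 127
    q = torch.clamp(torch.round(w / scale), -128, 127)
    return (q * scale).reshape(orig_shape)
```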

@copilot resolve the merge conflicts in this pull request

Resolved by merging origin/main into this branch and fixing the conflicts in docs/step_by_step.md and docs/step_by_step_CN.md (kept both AWQ and Model-Free TOC entries). Commit: 3d8c72b.

@WeiweiZhang1
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

WeiweiZhang1 and others added 3 commits May 11, 2026 16:35
# Cohere / Command-R
"CohereForCausalLM": _cohere_mappings,
"Cohere2ForCausalLM": _cohere_mappings,
"Cohere2VisionForConditionalGeneration": _cohere_mappings,
Contributor


Is this mapping the same as llmc's or gptqmodel's? Since we may not have time to keep this file updated, it would be better to reuse the same file those repos use.

from auto_round.logger import logger


class AWQQuantizer(RTNQuantizer):
Contributor


Why does this inherit from RTNQuantizer here?

@@ -225,6 +227,10 @@ def __new__(
if isinstance(quant_config, SignRoundConfig):
return _get_compressor_class(model_type, CalibCompressor)(alg_configs, **local_args, **kwargs)
Contributor


@n1ck-guo It would be better to use a registry or something similar. An algorithm developer should only need to care about the code inside their own alg folder.
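A minimal sketch of that registry idea; all names here are illustrative, not the PR's actual classes. Each algorithm folder registers its own config-to-compressor pairing at import time, so the entry point needs no per-algorithm `isinstance` chain:

```python
# Hypothetical registry: config class -> compressor class.
_COMPRESSOR_REGISTRY = {}

def register_compressor(config_cls):
    """Decorator an algorithm folder uses to self-register its compressor."""
    def decorator(compressor_cls):
        _COMPRESSOR_REGISTRY[config_cls] = compressor_cls
        return compressor_cls
    return decorator

def compressor_for(quant_config):
    """Look up the compressor registered for this config's type."""
    for config_cls, compressor_cls in _COMPRESSOR_REGISTRY.items():
        if isinstance(quant_config, config_cls):
            return compressor_cls
    raise ValueError(f"No compressor registered for {type(quant_config).__name__}")
```

With this pattern, adding a new algorithm touches only that algorithm's own folder.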

act_sym=act_sym,
act_data_type=act_data_type,
act_dynamic=act_dynamic,
duo_scaling=duo_scaling,
Contributor


same issue

Comment thread auto_round/__main__.py
type=str.lower,
choices=["auto_round", "rtn", "awq"],
help="Quantization algorithm to use. "
"auto_round: SignSGD-based optimization (default when iters > 0). "
Contributor


How do we set multiple algorithms?

return

# Resolve mappings
self._resolved_mappings = resolve_mappings(model, self._user_mappings)
Contributor


@n1ck-guo The algorithm should control the activation hook.

# subsequent mappings.)
seen_parents = set()
for mapping in block_mappings:
pid = id(mapping.parent)
Contributor


Where is the entry point that quantizes the layer?

self.n_grid = config.n_grid

# Populated during calibration
self._user_mappings = config.mappings
Contributor


Please add two more args to control the two algorithm parts in AWQ: (1) enable_minmax_tuning (this name must not change) and (2) apply_smooth (feel free to pick a better name).
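The requested toggles could be sketched like this; apart from enable_minmax_tuning, every name below is a placeholder, not the PR's actual config:

```python
from dataclasses import dataclass

@dataclass
class AWQConfig:
    # Illustrative config sketch; only enable_minmax_tuning is a fixed name.
    group_size: int = 128
    enable_minmax_tuning: bool = True  # toggles the min/max tuning stage
    apply_smooth: bool = True          # toggles the AWQ smoothing/scale search

def planned_stages(cfg: AWQConfig) -> list[str]:
    """Return which of the two AWQ stages would run for this config."""
    stages = []
    if cfg.apply_smooth:
        stages.append("smooth")
    if cfg.enable_minmax_tuning:
        stages.append("minmax_tuning")
    return stages
```

Two independent booleans keep the stages orthogonal, so either part of the algorithm can be ablated on its own.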

@chensuyue chensuyue added this to the 0.13.0 milestone May 14, 2026