From ba70b4ac510e9aaf9500e8cdb14111b4eee99d83 Mon Sep 17 00:00:00 2001
From: Aamir Nazir
Date: Mon, 11 May 2026 17:11:05 +0400
Subject: [PATCH 1/6] Update openvino_quantizer.rst

---
 unstable_source/openvino_quantizer.rst | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/unstable_source/openvino_quantizer.rst b/unstable_source/openvino_quantizer.rst
index f8609d2a70d..59b89a0f030 100644
--- a/unstable_source/openvino_quantizer.rst
+++ b/unstable_source/openvino_quantizer.rst
@@ -118,7 +118,8 @@ After we capture the FX Module to be quantized, we will import the OpenVINOQuant

 .. code-block:: python

-    from nncf.experimental.torch.fx import OpenVINOQuantizer
+    from executorch.backends.openvino.quantizer import OpenVINOQuantizer
+    from executorch.backends.openvino.quantizer import QuantizationMode

     quantizer = OpenVINOQuantizer()

@@ -126,21 +127,20 @@ After we capture the FX Module to be quantized, we will import the OpenVINOQuant

 Below is the list of essential parameters and their description:

-* ``preset`` - defines quantization scheme for the model. Two types of presets are available:
+* ``mode`` - defines quantization scheme for the model. Multiple modes are supported:

-    * ``PERFORMANCE`` (default) - defines symmetric quantization of weights and activations
+    * ``INT8_SYM`` (default) - defines symmetric quantization of weights and activations. This is the best for performance

-    * ``MIXED`` - weights are quantized with symmetric quantization and the activations are quantized with asymmetric quantization. This preset is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.
+    * ``INT8_MIXED`` - weights are quantized with symmetric quantization and the activations are quantized with asymmetric quantization. This preset is recommended for models with non-ReLU and asymmetric activation functions, e.g. ELU, PReLU, GELU, etc.

-    .. code-block:: python
-
-        OpenVINOQuantizer(preset=nncf.QuantizationPreset.MIXED)
+    * ``INT8_TRANSFORMER`` - special quantization scheme to preserve accuracy after quantization of Transformer models (BERT, Llama, etc.). None is default, i.e. no specific scheme is defined.

-* ``model_type`` - used to specify quantization scheme required for specific type of the model. Transformer is the only supported special quantization scheme to preserve accuracy after quantization of Transformer models (BERT, Llama, etc.). None is default, i.e. no specific scheme is defined.
+    * ``INT8WO_SYM``, ``INT8WO_ASYM``, ``INT4WO_SYM``, ``INT4WO_ASYM`` - these are weights-only quantization schemes. They apply vanilla min-max quantization to model weights to INT8/INT4 with Symmetric and Asymmetric schemes.

     .. code-block:: python

-        OpenVINOQuantizer(model_type=nncf.ModelType.Transformer)
+        OpenVINOQuantizer(mode=QuantizationMode.INT8_SYM)
+

 * ``ignored_scope`` - this parameter can be used to exclude some layers from the quantization process to preserve the model accuracy. For example, when you want to exclude the last layer of the model from quantization. Below are some examples of how to use this parameter:


From d89f3e889f904d728351b9ef3f112d04d7bb92ca Mon Sep 17 00:00:00 2001
From: Aamir Nazir
Date: Mon, 11 May 2026 17:17:44 +0400
Subject: [PATCH 2/6] Update openvino_quantizer.rst

---
 unstable_source/openvino_quantizer.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/unstable_source/openvino_quantizer.rst b/unstable_source/openvino_quantizer.rst
index 59b89a0f030..52ba3110477 100644
--- a/unstable_source/openvino_quantizer.rst
+++ b/unstable_source/openvino_quantizer.rst
@@ -15,7 +15,7 @@ Introduction

     This is an experimental feature, the quantization API is subject to change.

-This tutorial demonstrates how to use ``OpenVINOQuantizer`` from `Neural Network Compression Framework (NNCF) `_ in PyTorch 2 Export Quantization flow to generate a quantized model customized for the `OpenVINO torch.compile backend `_ and explains how to lower the quantized model into the `OpenVINO `_ representation.
+This tutorial demonstrates how to use ``OpenVINOQuantizer`` from `Executorch `_ in PyTorch 2 Export Quantization flow to generate a quantized model customized for the `OpenVINO torch.compile backend `_ and explains how to lower the quantized model into the `OpenVINO `_ representation.
 ``OpenVINOQuantizer`` unlocks the full potential of low-precision OpenVINO kernels due to the placement of quantizers designed specifically for the OpenVINO.

 The PyTorch 2 export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
@@ -135,7 +135,7 @@ Below is the list of essential parameters and their description:

     * ``INT8_TRANSFORMER`` - special quantization scheme to preserve accuracy after quantization of Transformer models (BERT, Llama, etc.). None is default, i.e. no specific scheme is defined.

-    * ``INT8WO_SYM``, ``INT8WO_ASYM``, ``INT4WO_SYM``, ``INT4WO_ASYM`` - these are weights-only quantization schemes. They apply vanilla min-max quantization to model weights to INT8/INT4 with Symmetric and Asymmetric schemes.
+    * ``INT8WO_SYM``, ``INT8WO_ASYM``, ``INT4WO_SYM``, ``INT4WO_ASYM`` - these are weights-only quantization schemes. They apply simple min-max quantization to model weights to INT8/INT4 with Symmetric and Asymmetric schemes.

     .. code-block:: python


From d652ee07f6bb7995a239524ffa37078863ae4c43 Mon Sep 17 00:00:00 2001
From: Aamir Nazir
Date: Mon, 11 May 2026 17:19:14 +0400
Subject: [PATCH 3/6] Update openvino_quantizer.rst

---
 unstable_source/openvino_quantizer.rst | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/unstable_source/openvino_quantizer.rst b/unstable_source/openvino_quantizer.rst
index 52ba3110477..7108d2e6231 100644
--- a/unstable_source/openvino_quantizer.rst
+++ b/unstable_source/openvino_quantizer.rst
@@ -165,12 +165,6 @@ Below is the list of essential parameters and their description:
         OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))


-* ``target_device`` - defines the target device, the specificity of which will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.
-
-    .. code-block:: python
-
-        OpenVINOQuantizer(target_device=nncf.TargetDevice.CPU)
-
 For further details on `OpenVINOQuantizer` please see the `documentation `_.

 After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.

From 695bd7de734d895e7c981c44be2348366df4fe37 Mon Sep 17 00:00:00 2001
From: Aamir Nazir
Date: Mon, 11 May 2026 17:21:03 +0400
Subject: [PATCH 4/6] update ovquantizer location in executorch

---
 unstable_source/openvino_quantizer.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/unstable_source/openvino_quantizer.rst b/unstable_source/openvino_quantizer.rst
index 7108d2e6231..ce3afebb33d 100644
--- a/unstable_source/openvino_quantizer.rst
+++ b/unstable_source/openvino_quantizer.rst
@@ -15,7 +15,7 @@ Introduction

     This is an experimental feature, the quantization API is subject to change.

-This tutorial demonstrates how to use ``OpenVINOQuantizer`` from `Executorch `_ in PyTorch 2 Export Quantization flow to generate a quantized model customized for the `OpenVINO torch.compile backend `_ and explains how to lower the quantized model into the `OpenVINO `_ representation.
+This tutorial demonstrates how to use ``OpenVINOQuantizer`` from `Executorch `_ in PyTorch 2 Export Quantization flow to generate a quantized model customized for the `OpenVINO torch.compile backend `_ and explains how to lower the quantized model into the `OpenVINO `_ representation.
 ``OpenVINOQuantizer`` unlocks the full potential of low-precision OpenVINO kernels due to the placement of quantizers designed specifically for the OpenVINO.

 The PyTorch 2 export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.

From 6397dec631def3388cf13629078e5a0c23e598d9 Mon Sep 17 00:00:00 2001
From: Aamir Nazir
Date: Tue, 12 May 2026 16:27:15 +0400
Subject: [PATCH 5/6] Update openvino_quantizer.rst

---
 unstable_source/openvino_quantizer.rst | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/unstable_source/openvino_quantizer.rst b/unstable_source/openvino_quantizer.rst
index ce3afebb33d..9af50819754 100644
--- a/unstable_source/openvino_quantizer.rst
+++ b/unstable_source/openvino_quantizer.rst
@@ -211,9 +211,8 @@ This should significantly speed up inference time in comparison with the eager m
 4. Optional: Improve quantized model metrics
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-NNCF implements advanced quantization algorithms like `SmoothQuant `_ and `BiasCorrection `_, which help
-to improve the quantized model metrics while minimizing the output discrepancies between the original and compressed models.
-These advanced NNCF algorithms can be accessed via the NNCF `quantize_pt2e` API:
+NNCF implements advanced quantization algorithms like `SmoothQuant `_ and `BiasCorrection `_ for static activation and weights quantization. For weights-only quantization, there are `AWQ https://arxiv.org/abs/2306.00978`_ and `Scale Estimation https://github.com/openvinotoolkit/nncf/blob/develop/src/nncf/quantization/algorithms/weight_compression/scale_estimation.py`_ algorithms. These techniques help in improving the quantized model metrics while minimizing the output discrepancies between the original and compressed models.
+These advanced NNCF algorithms can be accessed via the NNCF `quantize_pt2e` API for static activation and weights or `compress_pt2e` for weights-only quantization:

 .. code-block:: python

@@ -234,7 +233,7 @@ These advanced NNCF algorithms can be accessed via the NNCF `quantize_pt2e` API:


 For further details, please see the `documentation `_
-and a complete `example on Resnet18 quantization `_.
+and `for some examples with llama and stable_diffusion checkout `_. For `YoloV26 example with this API `

 Conclusion
 ------------

From 96a8dee151729463cd582be7f04e705b5ffd2432 Mon Sep 17 00:00:00 2001
From: Aamir Nazir
Date: Mon, 18 May 2026 01:28:53 +0400
Subject: [PATCH 6/6] Update openvino_quantizer.rst

---
 unstable_source/openvino_quantizer.rst | 86 +++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 14 deletions(-)

diff --git a/unstable_source/openvino_quantizer.rst b/unstable_source/openvino_quantizer.rst
index 9af50819754..fe9940ecd2a 100644
--- a/unstable_source/openvino_quantizer.rst
+++ b/unstable_source/openvino_quantizer.rst
@@ -36,27 +36,27 @@ The high-level architecture of this flow could look like this:
     float_model(Python)                         Example Input
         \                                               /
          \                                             /
-    —--------------------------------------------------------
+    ---------------------------------------------------------
     |                        export                         |
-    —--------------------------------------------------------
+    ---------------------------------------------------------
                                 |
                         FX Graph in ATen
                                 |
                                 |           OpenVINOQuantizer
                                 |                 /
-    —--------------------------------------------------------
+    ---------------------------------------------------------
     |                     prepare_pt2e                      |
     |                           |                           |
     |                       Calibrate                       |
     |                           |                           |
     |                     convert_pt2e                      |
-    —--------------------------------------------------------
+    ---------------------------------------------------------
                                 |
                          Quantized Model
                                 |
-    —--------------------------------------------------------
+    ---------------------------------------------------------
     |                  Lower into Inductor                  |
-    —--------------------------------------------------------
+    ---------------------------------------------------------
                                 |
                          OpenVINO model

@@ -164,10 +164,15 @@ Below is the list of essential parameters and their description:
         subgraph = nncf.Subgraph(inputs=['layer_1', 'layer_2'], outputs=['layer_3'])
         OpenVINOQuantizer(ignored_scope=nncf.IgnoredScope(subgraphs=[subgraph]))

+* ``target_device`` - defines the target device, the specificity of which will be taken into account during optimization. The following values are supported: ``ANY`` (default), ``CPU``, ``CPU_SPR``, ``GPU``, and ``NPU``.
+
+    .. code-block:: python
+
+        OpenVINOQuantizer(target_device=nncf.TargetDevice.CPU)

 For further details on `OpenVINOQuantizer` please see the `documentation `_.

-After we import the backend-specific Quantizer, we will prepare the model for post-training quantization.
+After we import the backend-specific Quantizer, we will prepare the model for post-training quantization/weights-only quantization.
 ``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators, and inserts observers in appropriate places in the model.

 .. code-block:: python
@@ -209,10 +214,17 @@ The optimized model is using low-level kernels designed specifically for Intel C
 This should significantly speed up inference time in comparison with the eager model.

 4. Optional: Improve quantized model metrics
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-NNCF implements advanced quantization algorithms like `SmoothQuant `_ and `BiasCorrection `_ for static activation and weights quantization. For weights-only quantization, there are `AWQ https://arxiv.org/abs/2306.00978`_ and `Scale Estimation https://github.com/openvinotoolkit/nncf/blob/develop/src/nncf/quantization/algorithms/weight_compression/scale_estimation.py`_ algorithms. These techniques help in improving the quantized model metrics while minimizing the output discrepancies between the original and compressed models.
-These advanced NNCF algorithms can be accessed via the NNCF `quantize_pt2e` API for static activation and weights or `compress_pt2e` for weights-only quantization:
+NNCF implements advanced quantization algorithms that help improve the metrics of a compressed model while minimizing the output discrepancies between the original and compressed models. These are accessed via the NNCF ``quantize_pt2e`` API for static activation and weights quantization, or ``compress_pt2e`` for weights-only quantization.
+
+Post Training Quantization
+""""""""""""""""""""""""""
+
+``quantize_pt2e`` can be applied on top of any ``torchao`` Quantizer to improve the accuracy of the quantized model. Key algorithms:
+
+- `SmoothQuant `_ - Reduces activation quantization error by inserting smoothing scales before weighted layers, migrating quantization difficulty from hard-to-quantize activations onto the weights.
+- `BiasCorrection `_ - Compares quantized and original layer outputs layer-by-layer and adjusts convolution biases to align them, compensating for the error introduced by quantization.

 .. code-block:: python

@@ -220,20 +232,66 @@ These advanced NNCF algorithms can be accessed via the NNCF `quantize_pt2e` API

     calibration_loader = torch.utils.data.DataLoader(...)

-
     def transform_fn(data_item):
         images, _ = data_item
         return images

-
     calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
     quantized_model = quantize_pt2e(
         exported_model, quantizer, calibration_dataset, smooth_quant=True, fast_bias_correction=False
     )

+Weights Only Quantization
+"""""""""""""""""""""""""
+
+``compress_pt2e`` applies weight compression to a ``torch.fx.GraphModule``, targeting LLM deployment. The following activation-aware algorithms use a small calibration subset to capture activation statistics:
+
+- `AWQ `_ - Activation-aware Weight Quantization that finds per-channel scales to minimize quantization error based on activation distributions.
+- `Scale Estimation `_ - Estimates scales to minimize the layer-wise output error for INT4 weight layers, iteratively refining the scales on a calibration subset.
+
+.. code-block:: python
+
+    from nncf.experimental.torch.fx import compress_pt2e
+
+    calibration_loader = torch.utils.data.DataLoader(...)
+
+    def transform_fn(data_item):
+        images, _ = data_item
+        return images
+
+    calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
+    compressed_model = compress_pt2e(
+        exported_model, quantizer, calibration_dataset, awq=True, scale_estimation=True
+    )
+
+Data-free algorithms
+~~~~~~~~~~~~~~~~~~~~
+
+When no calibration data is available, ``compress_pt2e`` can perform weight compression relying solely on the pretrained weights. Data-Free Compression uses only the weight tensor statistics, with no activations observed at any point. It can be combined with the AWQ and Mixed Precision algorithms when richer behavior is needed without giving up the no-dataset workflow.
+
+.. code-block:: python
+
+    from nncf.experimental.torch.fx import compress_pt2e
+
+    compressed_model = compress_pt2e(exported_model, quantizer, awq=True, ratio=0.8)
+
+Mixed Precision algorithms
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Mixed Precision assigns different bit-widths (e.g. INT4 vs INT8) to individual layers based on their sensitivity, keeping more sensitive layers at higher precision while aggressively compressing the rest. NNCF supports several sensitivity-ranking criteria:
+
+- **Weight Quantization Error** - Data-free metric that measures the per-layer error introduced by quantizing the weights themselves, requiring no calibration data.
+- **Hessian** - Activation-aware metric that uses second-order information about the loss to estimate how much the model output changes when a layer's weights are perturbed by quantization.
+- **Mean Variance** and **Max Variance** - Activation-aware metrics that rank layers by the mean or maximum variance of their input activations, on the intuition that layers with more spread-out activations are harder to quantize.
+- **Mean Magnitude** - Activation-aware metric that ranks layers by the average magnitude of their input activations.
+
+.. code-block:: python
+    from nncf import SensitivityMetric
+    compressed_model = compress_pt2e(
+        exported_model, quantizer, calibration_dataset, awq=True, scale_estimation=True, ratio=0.8, sensitivity_metric=SensitivityMetric.MAX_ACTIVATION_VARIANCE
+    )

-For further details, please see the `documentation `_
-and `for some examples with llama and stable_diffusion checkout `_. For `YoloV26 example with this API `
+Check out some `resnet `_, `llama `_, `stable diffusion `_ and `Yolo26 `_ examples with this API.

 Conclusion
 ------------
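
The ``INT8WO_SYM`` / ``INT8WO_ASYM`` modes touched by this series are described as simple min-max quantization of weights with symmetric and asymmetric schemes. A minimal pure-Python sketch of that idea follows. It is not the NNCF or ExecuTorch implementation: the function names, the signed and unsigned integer ranges, and the sample weights are assumptions chosen only for illustration.

```python
# Illustrative sketch only: min-max quantization of a single weight row.
# NOT the NNCF/ExecuTorch code path; all names here are hypothetical.

def quantize_symmetric(weights, num_bits=8):
    """Symmetric min-max: zero-point is fixed at 0, scale comes from max |w|."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def quantize_asymmetric(weights, num_bits=8):
    """Asymmetric min-max: scale and zero-point come from the [min, max] range."""
    qmin, qmax = 0, 2 ** num_bits - 1           # assumed unsigned range, 0..255
    w_min = min(min(weights), 0.0)              # keep 0.0 exactly representable
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / (qmax - qmin) if w_max > w_min else 1.0
    zero_point = round(qmin - w_min / scale)
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point=0):
    """Map integer codes back to approximate floating-point weights."""
    return [(v - zero_point) * scale for v in q]


# A deliberately skewed weight row: the asymmetric scheme can spend its whole
# integer range on [-0.1, 1.9], while the symmetric one covers [-1.9, 1.9].
weights = [-0.1, 0.0, 0.5, 1.9]
q_sym, s_sym = quantize_symmetric(weights)
q_asym, s_asym, zp = quantize_asymmetric(weights)

err_sym = max(abs(a - b) for a, b in zip(weights, dequantize(q_sym, s_sym)))
err_asym = max(abs(a - b) for a, b in zip(weights, dequantize(q_asym, s_asym, zp)))
print("symmetric codes:", q_sym, "max abs error:", err_sym)
print("asymmetric codes:", q_asym, "max abs error:", err_asym)
```

For this skewed row the asymmetric scheme reconstructs the weights with a smaller worst-case error, which is the usual motivation for offering both variants. For weights centred on zero the two behave similarly and the symmetric form avoids storing a zero-point.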