From d2d18a69d0a595c567cc8fca8f84d8ed8787ccad Mon Sep 17 00:00:00 2001
From: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Date: Mon, 25 May 2026 04:49:39 +0000
Subject: [PATCH 1/2] release/v1.21.6 doc update

Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
---
 docs/source/release_docs.md | 60 +++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
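
Reviewer note on the Llama4 entry added by this patch: the notes say `QEffLlama4VisionModel.forward()` gains `**kwargs` so it can accept `vision_feature_layer` and `vision_feature_select_strategy` from newer Transformers callers. The sketch below illustrates only the general Python pattern, in plain Python without the library. The class names `LegacyVisionModel` and `TolerantVisionModel`, the helper `call_like_newer_caller`, and the doubling "computation" are all hypothetical stand-ins, and whether the real method uses or ignores the extra arguments is not stated in the notes.

```python
class LegacyVisionModel:
    """Hypothetical wrapper whose forward() has a fixed signature."""

    def forward(self, pixel_values):
        # Placeholder computation standing in for the real vision tower.
        return [2 * v for v in pixel_values]


class TolerantVisionModel:
    """Hypothetical wrapper whose forward() tolerates extra keyword arguments."""

    def forward(self, pixel_values, **kwargs):
        # Extra keyword arguments from newer callers are accepted here.
        # In this sketch they are simply ignored (an assumption).
        return [2 * v for v in pixel_values]


def call_like_newer_caller(model, pixel_values):
    """Mimic a caller that always forwards the two newer arguments."""
    return model.forward(
        pixel_values,
        vision_feature_layer=-1,
        vision_feature_select_strategy="default",
    )


if __name__ == "__main__":
    print(call_like_newer_caller(TolerantVisionModel(), [1, 2]))
    try:
        call_like_newer_caller(LegacyVisionModel(), [1, 2])
    except TypeError as exc:
        print("fixed signature rejects the call:", type(exc).__name__)
```

Older callers that pass only `pixel_values` work with either class, which is the sense in which such a change stays backward compatible.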

diff --git a/docs/source/release_docs.md b/docs/source/release_docs.md
index 880c3a4e4c..afc48a95a1 100644
--- a/docs/source/release_docs.md
+++ b/docs/source/release_docs.md
@@ -1,3 +1,63 @@
+# Efficient Transformer Library - 1.21.6 Release Notes
+
+Welcome to the official release of **Efficient Transformer Library v1.21.6**! This targeted release builds on the v1.21 line with multi-resolution Vision Language Model workflows, Qwen3-VL stability fixes, on-device sampling enablement, and compatibility updates for newer model and framework APIs.
+
+> ✅ The exact release content is available on the [`release/v1.21.6`](https://github.com/quic/efficient-transformers/tree/release/v1.21.6) branch. The package version for this branch is `1.21.6.0`.
+
+---
+
+## Branch Summary
+
+- **Release branch**: [`release/v1.21.6`](https://github.com/quic/efficient-transformers/tree/release/v1.21.6)
+- **Release head**: `25e7c53` (`Updated release version to 1.21.6.0`)
+- **Mainline comparison**: Reviewed against `upstream/main`; the release branch contains 11 release commits from merge base `d02f717`.
+
+---
+
+## Key Features & Enhancements
+
+- **Multi-specialization vision compilation for Qwen VLMs**
+  - Qwen2.5-VL, Qwen3-VL Dense, and Qwen3-VL-MoE can compile multiple vision resolution and frame configurations in one pass.
+  - `height`, `width`, and `num_frames` can be supplied as lists when building specializations.
+  - Runtime generation can select the matching specialization through the multi-frame generation path.
+  - New example scripts are available for [Qwen2.5-VL](https://github.com/quic/efficient-transformers/tree/release/v1.21.6/examples/image_text_to_text/models/qwen2_5_vl), [Qwen3-VL Dense](https://github.com/quic/efficient-transformers/tree/release/v1.21.6/examples/image_text_to_text/models/qwen3vl), and [Qwen3-VL-MoE](https://github.com/quic/efficient-transformers/tree/release/v1.21.6/examples/image_text_to_text/models/qwen3_vl_moe).
+
+- **Qwen3-VL Dense on-device sampling**
+  - Registers Qwen3-VL Dense with the sampler transform path.
+  - Handles Qwen3-VL Dense deepstack feature inputs and outputs for on-device sampling.
+  - Adds sampler coverage to validate the new transform behavior.
+
+- **Large embedding export robustness**
+  - Adds `SplitTensorsTransform` to `QEFFAutoModel` ONNX transforms so large initializers are emitted as `*.onnx.data` sidecar files.
+  - Prevents ONNX ModelProto parser failures when exports exceed the 2 GB protobuf limit.
+  - Adds regression coverage for large embedding and reranker model export flows.
+
+- **Qwen VLM runtime stability**
+  - Fixes RoPE handling for Qwen3-VL-MoE disaggregated mode.
+  - Fixes Qwen3-VL Dense continuous batching with multi-image, multi-prompt inputs by preserving the complete hidden-state tensor during broadcast.
+  - Handles multi-resolution `vision_embeds` edge cases for Qwen2.5-VL, Qwen3-VL Dense, and Qwen3-VL-MoE.
+  - Moves Qwen2.5-VL examples into a dedicated `qwen2_5_vl` example directory.
+
+- **Gemma3 configuration compatibility**
+  - Updates Gemma3 cache handling for the newer `_sliding_window_pattern` config field.
+  - Preserves sliding-window behavior for Gemma3 models using updated Transformers configs.
+
+- **Llama4 compatibility with Transformers `4.57.3`**
+  - Adds `**kwargs` support to `QEffLlama4VisionModel.forward()`.
+  - Accepts `vision_feature_layer` and `vision_feature_select_strategy` forwarded by newer Transformers Llama4 APIs.
+  - Fixes ONNX export failures for Llama4 vision models while remaining backward compatible.
+
+---
+
+## Validation & Quality Updates
+
+- Added tests for Qwen3-VL Dense on-device sampling transformations.
+- Added regression tests that verify large ONNX initializers are split into external data files.
+- Updated image-text model configs and Qwen3-VL examples for continuous batching and multi-specialization workflows.
+- Reverted a temporary Qwen VLM multi-image test/config change before landing the stable Qwen3-VL Dense continuous batching fix.
+
+---
+
 # Efficient Transformer Library - 1.21.0 Release Notes
 
 Welcome to the official release of **Efficient Transformer Library v1.21.0**! This release introduces advanced attention mechanisms, expanded model support, optimized serving capabilities, and significant improvements to fine-tuning and deployment workflows.

From 2bae9804087fd0ebecc588ddf9cc5a5348aa6421 Mon Sep 17 00:00:00 2001
From: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Date: Tue, 26 May 2026 01:38:07 +0000
Subject: [PATCH 2/2] comments are incorporated

Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
---
 docs/source/release_docs.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/docs/source/release_docs.md b/docs/source/release_docs.md
index afc48a95a1..9cd7332eeb 100644
--- a/docs/source/release_docs.md
+++ b/docs/source/release_docs.md
@@ -1,6 +1,6 @@
 # Efficient Transformer Library - 1.21.6 Release Notes
 
-Welcome to the official release of **Efficient Transformer Library v1.21.6**! This targeted release builds on the v1.21 line with multi-resolution Vision Language Model workflows, Qwen3-VL stability fixes, on-device sampling enablement, and compatibility updates for newer model and framework APIs.
+Welcome to the official release of **Efficient Transformer Library v1.21.6**! This targeted release builds on the v1.21 line with multi-resolution Vision Language Model workflows, Qwen3-VL stability fixes, on-device sampling enablement, online serving support for Gemma3 through vLLM, and compatibility updates for newer model and framework APIs.
 
 > ✅ The exact release content is available on the [`release/v1.21.6`](https://github.com/quic/efficient-transformers/tree/release/v1.21.6) branch. The package version for this branch is `1.21.6.0`.
 
@@ -17,10 +17,10 @@ Welcome to the official release of **Efficient Transformer Library v1.21.6**! Th
 ## Key Features & Enhancements
 
 - **Multi-specialization vision compilation for Qwen VLMs**
-  - Qwen2.5-VL, Qwen3-VL Dense, and Qwen3-VL-MoE can compile multiple vision resolution and frame configurations in one pass.
+  - Qwen2.5-VL and Qwen3-VL Dense can compile multiple vision resolution and frame configurations in one pass.
   - `height`, `width`, and `num_frames` can be supplied as lists when building specializations.
   - Runtime generation can select the matching specialization through the multi-frame generation path.
-  - New example scripts are available for [Qwen2.5-VL](https://github.com/quic/efficient-transformers/tree/release/v1.21.6/examples/image_text_to_text/models/qwen2_5_vl), [Qwen3-VL Dense](https://github.com/quic/efficient-transformers/tree/release/v1.21.6/examples/image_text_to_text/models/qwen3vl), and [Qwen3-VL-MoE](https://github.com/quic/efficient-transformers/tree/release/v1.21.6/examples/image_text_to_text/models/qwen3_vl_moe).
+  - New example scripts are available for [Qwen2.5-VL](https://github.com/quic/efficient-transformers/tree/release/v1.21.6/examples/image_text_to_text/models/qwen2_5_vl) and [Qwen3-VL Dense](https://github.com/quic/efficient-transformers/tree/release/v1.21.6/examples/image_text_to_text/models/qwen3vl).
 
 - **Qwen3-VL Dense on-device sampling**
   - Registers Qwen3-VL Dense with the sampler transform path.
@@ -33,7 +33,6 @@ Welcome to the official release of **Efficient Transformer Library v1.21.6**! Th
   - Adds regression coverage for large embedding and reranker model export flows.
 
 - **Qwen VLM runtime stability**
-  - Fixes RoPE handling for Qwen3-VL-MoE disaggregated mode.
   - Fixes Qwen3-VL Dense continuous batching with multi-image, multi-prompt inputs by preserving the complete hidden-state tensor during broadcast.
   - Handles multi-resolution `vision_embeds` edge cases for Qwen2.5-VL, Qwen3-VL Dense, and Qwen3-VL-MoE.
   - Moves Qwen2.5-VL examples into a dedicated `qwen2_5_vl` example directory.
@@ -41,12 +40,16 @@ Welcome to the official release of **Efficient Transformer Library v1.21.6**! Th
 - **Gemma3 configuration compatibility**
   - Updates Gemma3 cache handling for the newer `_sliding_window_pattern` config field.
   - Preserves sliding-window behavior for Gemma3 models using updated Transformers configs.
+  - Adds online serving support for Gemma3 through vLLM.
 
 - **Llama4 compatibility with Transformers `4.57.3`**
   - Adds `**kwargs` support to `QEffLlama4VisionModel.forward()`.
   - Accepts `vision_feature_layer` and `vision_feature_select_strategy` forwarded by newer Transformers Llama4 APIs.
   - Fixes ONNX export failures for Llama4 vision models while remaining backward compatible.
 
+- **GPT-OSS batch size flexibility**
+  - Enables GPT-OSS 120B with batch size greater than 1 and GPT-OSS 20B with batch size greater than 2.
+
 ---
 
 ## Validation & Quality Updates