
Added fp16/bf16 based export and compile support for VLMs #819

Merged

quic-hemagnih merged 54 commits into quic:main from asmigosw:custom_dtype
Apr 14, 2026

Conversation

@asmigosw
Contributor

@asmigosw asmigosw commented Mar 2, 2026

Added fp16/bf16 based export and compile support for VLMs

Comment thread QEfficient/base/modeling_qeff.py Outdated
Comment thread QEfficient/transformers/models/internvl/modeling_internvl.py
Comment thread QEfficient/transformers/models/llama4/modeling_llama4.py
Comment thread QEfficient/transformers/models/llama4/modeling_llama4.py
retained_state=True,
specializations=specializations["lang"],
convert_to_fp16=True,
convert_to_fp16=(DTYPE_TO_STRING_MAP[needed_dtype] == "float16"),
Contributor

nit: why is this condition required? Is it for AI200? @quic-rishinr

Contributor Author

@asmigosw asmigosw Mar 9, 2026

This condition is required in case the user wants bf16 support, which will come with AI200. I have updated the code so that convert_to_fp16 = True when the passed dtype is either fp16 or fp32.
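The decision described above can be sketched as follows. This is an illustrative sketch, not the repository's actual code: DTYPE_TO_STRING_MAP and should_convert_to_fp16 are assumed names based on the diff excerpts in this thread, with dtype keys shown as strings so the sketch runs without torch.

```python
# Sketch of the convert_to_fp16 decision discussed above (assumed names).
# fp16 and fp32 both compile with fp16 custom IO (the compiler has no fp32
# path); only bf16 keeps its own precision, targeting AI200.

DTYPE_TO_STRING_MAP = {
    "torch.float16": "float16",
    "torch.bfloat16": "bfloat16",
    "torch.float32": "float32",
}


def should_convert_to_fp16(needed_dtype: str) -> bool:
    """Return True when the model should be compiled with fp16 custom IO."""
    return DTYPE_TO_STRING_MAP[needed_dtype] in ("float16", "float32")
```

With this shape, only a bf16 model config leaves convert_to_fp16 False.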

Comment thread tests/transformers/models/test_causal_lm_models.py Outdated
Comment thread tests/transformers/models/test_causal_lm_models.py Outdated
Comment thread QEfficient/utils/generate_inputs.py Outdated
Comment thread QEfficient/utils/generate_inputs.py Outdated
Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
Comment thread QEfficient/transformers/models/llama_swiftkv/modeling_llama_swiftkv.py Outdated
Comment thread QEfficient/transformers/models/internvl/modeling_internvl.py
Comment thread QEfficient/transformers/models/llama4/modeling_llama4.py
Comment thread QEfficient/transformers/models/molmo/modeling_molmo.py Outdated
Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
self.model, transformed = transform.apply(self.model)
any_transformed = any_transformed or transformed

self._normalize_torch_dtype()
Contributor

nit: does this take care of embedding and ASR models too?

"allenai/Molmo-7B-D-0924",
"meta-llama/Llama-3.2-11B-Vision-Instruct",
]:
pytest.skip("Test skipped for this model due to some issues.")
Contributor

nit: with our dummy configs, can we run all sample lm models w/this test quickly?

@quic-rishinr quic-rishinr marked this pull request as ready for review March 13, 2026 08:34
torch.nn.functional.pad(inputs["input_values"], (0, self.seq_len - input_ids_len), "constant", 0)
)
needed_dtype = self.model.config.torch_dtype
input_values = input_values.astype(CUSTOM_IO_DTYPE_MAP[needed_dtype])
Contributor

Since the inputs are in NumPy format, we should be using Torch_to_numpy_map, right?
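A torch-to-numpy dtype map along the lines the reviewer suggests could look like this. The map name, its entries, and cast_input are illustrative assumptions, not the repository's actual constant; keys are strings so the sketch runs without torch. Note that NumPy has no native bfloat16, which is why a separate patch (mentioned later in this PR) maps bfloat16 to np.float16 for inference.

```python
import numpy as np

# Illustrative torch-dtype-name -> numpy-dtype map for preparing NumPy
# inputs for an exported model (assumed names, not the repo's constant).
TORCH_TO_NUMPY_MAP = {
    "torch.float16": np.float16,
    "torch.float32": np.float32,
    "torch.int64": np.int64,
}


def cast_input(arr: np.ndarray, torch_dtype: str) -> np.ndarray:
    """Cast a NumPy input array to the dtype the exported model expects."""
    return arr.astype(TORCH_TO_NUMPY_MAP[torch_dtype])
```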

Comment thread QEfficient/transformers/models/gemma3/modeling_gemma3.py
router_logits = self.gate(hidden_states)

routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
routing_weights = F.softmax(router_logits, dim=1, dtype=self.gate.weight.dtype)
Contributor

Can we update the softmax to run in the original precision?
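The diff above swaps a hard-coded float32 softmax for one that follows the gate weights' own dtype. A minimal NumPy sketch of the difference (NumPy stands in for F.softmax's dtype argument so the example runs without torch):

```python
import numpy as np


def softmax(x: np.ndarray, dtype=None) -> np.ndarray:
    # Mirror of torch.nn.functional.softmax's dtype argument: compute in
    # `dtype` if given, otherwise in the input's own dtype.
    if dtype is not None:
        x = x.astype(dtype)
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


logits = np.random.randn(2, 4).astype(np.float16)
upcast = softmax(logits, dtype=np.float32)  # old behaviour: forced fp32
original = softmax(logits)                  # new behaviour: stays fp16
```

Computing the router softmax in the original precision avoids an implicit upcast to float32, which matters for the fp16/bf16 export path this PR enables.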

Comment thread QEfficient/transformers/models/gpt2/modeling_gpt2.py
Comment thread QEfficient/transformers/models/grok_1/modeling_grok1.py Outdated
Comment thread QEfficient/transformers/models/gemma3/modeling_gemma3.py Outdated
Comment thread QEfficient/transformers/models/olmo2/modeling_olmo2.py
Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
Comment thread QEfficient/base/modeling_qeff.py
Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
asmigosw and others added 17 commits March 18, 2026 05:49
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
… and inference.

Almost all LLMs can now be compiled and run in fp16. The test_causal_lm_models script uses the following notation to record how the tests went:
# means the model wasn't tested due to its size; unsure whether it would run through or hit an accuracy mismatch.
## means the outputs match for fp16 and things worked fine.
### means outputs are produced but don't match the HF tokens properly.
#### means it is a quantized model and additional effort is needed to enable it.
These commits cover almost all LLMs currently supported.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
asmigosw and others added 2 commits March 18, 2026 05:49
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
…both in bfloat16.

Added a patch in cloud infer to map the bfloat16 (or 11) key type to np.float16 for AI200 inference.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
quic-dhirajku and others added 17 commits March 18, 2026 08:16
…e to False for _compile when appropriate params are missing.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Comment thread QEfficient/transformers/cache_utils.py Outdated
Comment thread QEfficient/transformers/models/gemma3/modeling_gemma3.py Outdated
Comment thread examples/text_generation/basic_inference.py Outdated
CUSTOM_IO_DTYPE_MAP = {
torch.float16: "float16",
torch.bfloat16: "bfloat16",
torch.float32: "float16", # Since compiler doesn't support fp32
Contributor

nit: I did not understand this: torch.float32: "float16",  # Since compiler doesn't support fp32
Is it to avoid compiling for fp32?

Contributor Author

Yes, if torch_dtype=torch.float32 is set in the model's config, it should compile in float16 because the compiler doesn't support fp32 precision. Earlier we were hardcoding fp16 everywhere, so this didn't cause a problem; now that bf16 support is also there, the map makes it generic.
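Putting the diff excerpt and the explanation above together, the map resolves a config's torch_dtype to a custom-IO dtype string, silently downgrading fp32 to fp16. This is a sketch based on the excerpt, not the repository's exact code: custom_io_dtype is an assumed helper name, and keys are shown as strings so the sketch runs without torch.

```python
# Custom-IO dtype strings keyed by the model config's torch_dtype
# (per the diff excerpt above; keys shown as strings, assumed helper name).
CUSTOM_IO_DTYPE_MAP = {
    "torch.float16": "float16",
    "torch.bfloat16": "bfloat16",
    "torch.float32": "float16",  # compiler doesn't support fp32; fall back to fp16
}


def custom_io_dtype(config_torch_dtype: str) -> str:
    """Resolve the custom-IO dtype string for a model config's torch_dtype."""
    return CUSTOM_IO_DTYPE_MAP.get(config_torch_dtype, "float16")
```

So a model whose config says fp32 still exports and compiles, just with fp16 IO, while bf16 configs keep bf16.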

Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
asmigosw added 2 commits April 9, 2026 06:41
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
@quic-hemagnih quic-hemagnih merged commit 3992d3c into quic:main Apr 14, 2026
5 checks passed
5 participants