
Added fp16/bf16 based export and compile support for VLMs #819

Merged

quic-hemagnih merged 54 commits into quic:main from asmigosw:custom_dtype
Apr 14, 2026

Conversation

@asmigosw
Contributor

@asmigosw asmigosw commented Mar 2, 2026

Added fp16/bf16 based export and compile support for VLMs

Comment thread QEfficient/base/modeling_qeff.py Outdated
Comment thread QEfficient/transformers/models/internvl/modeling_internvl.py
Comment thread QEfficient/transformers/models/llama4/modeling_llama4.py
Comment thread QEfficient/transformers/models/llama4/modeling_llama4.py
retained_state=True,
specializations=specializations["lang"],
convert_to_fp16=True,
convert_to_fp16=(DTYPE_TO_STRING_MAP[needed_dtype] == "float16"),
Contributor

nit: why is this condition required? Is it for AI200? @quic-rishinr

Contributor Author

@asmigosw asmigosw Mar 9, 2026

This condition is required in case the user wants bf16 support, which will come with AI200. I have updated the code so that convert_to_fp16 = True when the passed dtype is either fp16 or fp32.
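The decision described above can be sketched as follows. This is an illustrative sketch, not the repository's actual code: DTYPE_TO_STRING_MAP and should_convert_to_fp16 are assumed names based on the diff excerpts in this thread, with dtype keys shown as strings so the sketch runs without torch.

```python
# Sketch of the convert_to_fp16 decision discussed above (assumed names).
# fp16 and fp32 both compile with fp16 custom IO (the compiler has no fp32
# path); only bf16 keeps its own precision, targeting AI200.

DTYPE_TO_STRING_MAP = {
    "torch.float16": "float16",
    "torch.bfloat16": "bfloat16",
    "torch.float32": "float32",
}


def should_convert_to_fp16(needed_dtype: str) -> bool:
    """Return True when the model should be compiled with fp16 custom IO."""
    return DTYPE_TO_STRING_MAP[needed_dtype] in ("float16", "float32")
```

With this shape, only a bf16 model config leaves convert_to_fp16 False.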

Comment thread tests/transformers/models/test_causal_lm_models.py Outdated
Comment thread tests/transformers/models/test_causal_lm_models.py Outdated
Comment thread QEfficient/utils/generate_inputs.py Outdated
Comment thread QEfficient/utils/generate_inputs.py Outdated
Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
Comment thread QEfficient/transformers/models/llama_swiftkv/modeling_llama_swiftkv.py Outdated
Comment thread QEfficient/transformers/models/internvl/modeling_internvl.py
Comment thread QEfficient/transformers/models/llama4/modeling_llama4.py
Comment thread QEfficient/transformers/models/molmo/modeling_molmo.py Outdated
Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
self.model, transformed = transform.apply(self.model)
any_transformed = any_transformed or transformed

self._normalize_torch_dtype()
Contributor

nit: does this take care of embedding and ASR models too?

"allenai/Molmo-7B-D-0924",
"meta-llama/Llama-3.2-11B-Vision-Instruct",
]:
pytest.skip("Test skipped for this model due to some issues.")
Contributor

nit: with our dummy configs, can we run all sample lm models w/this test quickly?

@quic-rishinr quic-rishinr marked this pull request as ready for review March 13, 2026 08:34
torch.nn.functional.pad(inputs["input_values"], (0, self.seq_len - input_ids_len), "constant", 0)
)
needed_dtype = self.model.config.torch_dtype
input_values = input_values.astype(CUSTOM_IO_DTYPE_MAP[needed_dtype])
Contributor

Since the inputs are in NumPy format, we should be using Torch_to_numpy_map, right?
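A torch-to-numpy dtype map along the lines the reviewer suggests could look like this. The map name, its entries, and cast_input are illustrative assumptions, not the repository's actual constant; keys are strings so the sketch runs without torch. Note that NumPy has no native bfloat16, which is why a separate patch (mentioned later in this PR) maps bfloat16 to np.float16 for inference.

```python
import numpy as np

# Illustrative torch-dtype-name -> numpy-dtype map for preparing NumPy
# inputs for an exported model (assumed names, not the repo's constant).
TORCH_TO_NUMPY_MAP = {
    "torch.float16": np.float16,
    "torch.float32": np.float32,
    "torch.int64": np.int64,
}


def cast_input(arr: np.ndarray, torch_dtype: str) -> np.ndarray:
    """Cast a NumPy input array to the dtype the exported model expects."""
    return arr.astype(TORCH_TO_NUMPY_MAP[torch_dtype])
```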

Comment thread QEfficient/transformers/models/gemma3/modeling_gemma3.py
router_logits = self.gate(hidden_states)

routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
routing_weights = F.softmax(router_logits, dim=1, dtype=self.gate.weight.dtype)
Contributor

Can we update the softmax to run in the original precision?
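The diff above swaps a hard-coded float32 softmax for one that follows the gate weights' own dtype. A minimal NumPy sketch of the difference (NumPy stands in for F.softmax's dtype argument so the example runs without torch):

```python
import numpy as np


def softmax(x: np.ndarray, dtype=None) -> np.ndarray:
    # Mirror of torch.nn.functional.softmax's dtype argument: compute in
    # `dtype` if given, otherwise in the input's own dtype.
    if dtype is not None:
        x = x.astype(dtype)
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


logits = np.random.randn(2, 4).astype(np.float16)
upcast = softmax(logits, dtype=np.float32)  # old behaviour: forced fp32
original = softmax(logits)                  # new behaviour: stays fp16
```

Computing the router softmax in the original precision avoids an implicit upcast to float32, which matters for the fp16/bf16 export path this PR enables.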

Comment thread QEfficient/transformers/models/gpt2/modeling_gpt2.py
Comment thread QEfficient/transformers/models/grok_1/modeling_grok1.py Outdated
Comment thread QEfficient/transformers/models/gemma3/modeling_gemma3.py Outdated
Comment thread QEfficient/transformers/models/olmo2/modeling_olmo2.py
Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
Comment thread QEfficient/base/modeling_qeff.py
Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
asmigosw and others added 17 commits March 18, 2026 05:49
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
… and inference.

Almost all LLMs can now be compiled and run in fp16. The test_causal_lm_models script uses the following notation to record how the tests went:
# means the model wasn't tested due to its size; unsure whether it would run through or hit an accuracy mismatch.
## means the outputs match for fp16 and things worked fine.
### means outputs are produced but don't match the HF tokens properly.
#### means it is a quantized model and additional effort is needed to enable it.
These commits cover almost all LLMs currently supported.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
asmigosw and others added 2 commits March 18, 2026 05:49
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
…both in bfloat16.

Added a patch in cloud infer to map the bfloat16 (or 11) key type to np.float16 for AI200 inference.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
quic-dhirajku and others added 17 commits March 18, 2026 08:16
…e to False for _compile when appropriate params are missing.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Comment thread QEfficient/transformers/cache_utils.py Outdated
Comment thread QEfficient/transformers/models/gemma3/modeling_gemma3.py Outdated
Comment thread examples/text_generation/basic_inference.py Outdated
CUSTOM_IO_DTYPE_MAP = {
torch.float16: "float16",
torch.bfloat16: "bfloat16",
torch.float32: "float16", # Since compiler doesn't support fp32
Contributor

nit: I did not understand this: torch.float32: "float16",  # Since compiler doesn't support fp32
Is it to avoid compiling for fp32?

Contributor Author

Yes, if torch_dtype=torch.float32 is set in the model's config, it should compile in float16 because the compiler doesn't support fp32 precision. Earlier we were hardcoding fp16 everywhere, so this didn't cause a problem; now that bf16 support is also there, the map makes it generic.
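Putting the diff excerpt and the explanation above together, the map resolves a config's torch_dtype to a custom-IO dtype string, silently downgrading fp32 to fp16. This is a sketch based on the excerpt, not the repository's exact code: custom_io_dtype is an assumed helper name, and keys are shown as strings so the sketch runs without torch.

```python
# Custom-IO dtype strings keyed by the model config's torch_dtype
# (per the diff excerpt above; keys shown as strings, assumed helper name).
CUSTOM_IO_DTYPE_MAP = {
    "torch.float16": "float16",
    "torch.bfloat16": "bfloat16",
    "torch.float32": "float16",  # compiler doesn't support fp32; fall back to fp16
}


def custom_io_dtype(config_torch_dtype: str) -> str:
    """Resolve the custom-IO dtype string for a model config's torch_dtype."""
    return CUSTOM_IO_DTYPE_MAP.get(config_torch_dtype, "float16")
```

So a model whose config says fp32 still exports and compiles, just with fp16 IO, while bf16 configs keep bf16.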

Comment thread QEfficient/transformers/models/modeling_auto.py Outdated
asmigosw added 2 commits April 9, 2026 06:41
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
@quic-hemagnih quic-hemagnih merged commit 3992d3c into quic:main Apr 14, 2026
5 checks passed
5 participants