🐛 Describe the bug
I want to lower and quantize LLMs for the QNN backend. To lower and quantize the SmolLM2 model, I followed these two guides:
https://docs.pytorch.org/executorch/main/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.html
https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/README.md#smollm2
I'm running the following command:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} --decoder_model smollm2_135m --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --compile_only
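For context, the export step this command drives ultimately goes through the QNN lowering helper that appears in the traceback below. A rough Python-level sketch of that step (function names are taken from the traceback and the ExecuTorch QNN examples; the exact signatures, arguments, and the SM8650 target are assumptions and may differ between versions):

```python
# Sketch only: signatures assumed from the ExecuTorch QNN examples.
from executorch.backends.qualcomm.serialization.qc_schema import QcomChipset
from executorch.backends.qualcomm.utils.utils import (
    generate_htp_compiler_spec,
    generate_qnn_executorch_compiler_spec,
    to_edge_transform_and_lower_to_qnn,
)

# Quantized (non-fp16) HTP backend options.
backend_options = generate_htp_compiler_spec(use_fp16=False)
compile_spec = generate_qnn_executorch_compiler_spec(
    soc_model=QcomChipset.SM8650,  # hypothetical target SoC
    backend_options=backend_options,
)
# `model` is the quantized SmolLM2 nn.Module, `example_inputs` its inputs;
# this is the call where the traceback below originates.
edge_prog_mgr = to_edge_transform_and_lower_to_qnn(
    model, example_inputs, compile_spec
)
```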
It fails with the following error after quite some time:
return cls.__new__(cls, *args)
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in SAVE MODE.
[INFO] [Qnn ExecuTorch]: Running level=3 optimization.
[INFO] [Qnn ExecuTorch]: Running level=3 optimization.
[INFO 2026-01-15 18:01:48,832 qnn_preprocess.py:172] Processing Method(0): (1/73)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[INFO 2026-01-15 18:01:48,847 wrappers.py:99] HybridTextDecoder::compile completed in 498.08749079704285s
[INFO 2026-01-15 18:01:48,847 wrappers.py:99] MultiModalManager::compile completed in 498.0877661705017s
Traceback (most recent call last):
  File "/home/felix/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/fx/passes/infra/pass_manager.py", line 277, in __call__
    res = fn(module)
  File "/home/felix/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/fx/passes/infra/pass_base.py", line 47, in __call__
    res = self.call(graph_module)
  File "/home/felix/workspace/executorch/backends/qualcomm/_passes/insert_io_qdq.py", line 151, in call
    self._insert(graph_module)
  File "/home/felix/workspace/executorch/backends/qualcomm/_passes/insert_io_qdq.py", line 147, in _insert
    self.q_dq_map[n.meta[QCOM_QUANT_ATTRS][QCOM_ENCODING]],
KeyError: <EdgeOpOverload: quantized_decomposed.dequantize_per_tensor.default>: schema = quantized_decomposed::dequantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype, *, ScalarType? out_dtype=None) -> Tensor

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 637, in main
    export_llama(args)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 599, in export_llama
    compile(
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 143, in compile
    multi_modal_mgr.compile(compile_specs=compile_specs, pte_filenames=pte_filenames)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 106, in wrapper
    func(cls, *args, **kwargs)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 1064, in compile
    self.process(compile_request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 1046, in process
    Processor.process(self, request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 124, in process
    return self._next_handler.process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 150, in process
    super().process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 124, in process
    return self._next_handler.process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 150, in process
    super().process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 124, in process
    return self._next_handler.process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 149, in process
    getattr(self, request.method_name)(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 106, in wrapper
    func(cls, *args, **kwargs)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 890, in compile
    edge_prog_mgr = to_edge_transform_and_lower_to_qnn(
  File "/home/felix/workspace/executorch/backends/qualcomm/utils/utils.py", line 442, in to_edge_transform_and_lower_to_qnn
    return to_edge_transform_and_lower(
  File "/home/felix/workspace/executorch/exir/program/_program.py", line 115, in wrapper
    return func(*args, **kwargs)
  File "/home/felix/workspace/executorch/exir/program/_program.py", line 1379, in to_edge_transform_and_lower
    edge_manager = edge_manager.to_backend(method_to_partitioner)
  File "/home/felix/workspace/executorch/exir/program/_program.py", line 115, in wrapper
    return func(*args, **kwargs)
  File "/home/felix/workspace/executorch/exir/program/_program.py", line 1681, in to_backend
    new_edge_programs = to_backend(method_to_programs_and_partitioners)
  File "/home/felix/miniconda3/envs/executorch/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/felix/workspace/executorch/exir/backend/backend_api.py", line 762, in _
    lower_all_submodules_to_backend(
  File "/home/felix/workspace/executorch/exir/backend/backend_api.py", line 591, in lower_all_submodules_to_backend
    backend_name_to_subclass[backend_id].preprocess_multimethod(
  File "/home/felix/workspace/executorch/backends/qualcomm/qnn_preprocess.py", line 173, in preprocess_multimethod
    py_op_wrappers = QnnBackend._build_op_wrappers(
  File "/home/felix/workspace/executorch/backends/qualcomm/qnn_preprocess.py", line 54, in _build_op_wrappers
    graph_module = QnnPassManager().transform_for_preprocess_pipeline(
  File "/home/felix/workspace/executorch/backends/qualcomm/_passes/qnn_pass_manager.py", line 262, in transform_for_preprocess_pipeline
    self._transform(exported_program.graph_module)
  File "/home/felix/workspace/executorch/backends/qualcomm/_passes/qnn_pass_manager.py", line 134, in _transform
    return self(graph_module).graph_module
  File "/home/felix/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/fx/passes/infra/pass_manager.py", line 303, in __call__
    raise Exception(msg) from e # noqa: TRY002
Exception: An error occurred when running the 'InsertIOQDQ' pass after the following passes: ['FoldQDQ', 'ConvertMhaToSha', 'InsertRequantize']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 648, in <module>
    main()
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 643, in main
    raise Exception(e)
Exception: An error occurred when running the 'InsertIOQDQ' pass after the following passes: ['FoldQDQ', 'ConvertMhaToSha', 'InsertRequantize']
This looks like a bug to me. The same error occurs with other models, such as Qwen and Gemma.
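From the traceback, the failing statement is the dictionary lookup in InsertIOQDQ._insert: the node's QCOM_ENCODING is quantized_decomposed.dequantize_per_tensor.default, and that op is not a key of self.q_dq_map. A minimal standalone sketch of that failure shape (the map contents below are hypothetical stand-in strings, under the assumption that q_dq_map registers only quantize ops as keys):

```python
# Hypothetical stand-ins for the real EdgeOpOverload keys; this only
# reproduces the shape of the lookup failure, not the pass itself.
q_dq_map = {
    "quantized_decomposed.quantize_per_tensor.default":
        "quantized_decomposed.dequantize_per_tensor.default",
}

# The I/O node arrives carrying a *dequantize* encoding in its metadata.
node_encoding = "quantized_decomposed.dequantize_per_tensor.default"
op = q_dq_map[node_encoding]  # KeyError, same shape as the failure above
```

If that reading is right, one of the preceding passes listed in the exception message ('FoldQDQ', 'ConvertMhaToSha', 'InsertRequantize') seems to leave a dequantize encoding on an I/O node that InsertIOQDQ does not expect.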
#qcom_aisw
Versions
PyTorch version: 2.11.0.dev20251222+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Arch Linux (x86_64)
GCC version: (GCC) 15.2.1 20260103
Clang version: 21.1.6
CMake version: version 3.31.10
Libc version: glibc-2.42
Python version: 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.42
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 576.52
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
CPU family: 25
Model: 8
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 2
BogoMIPS: 7186.50
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 16 MiB (32 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] executorch==1.1.0a0+cccf977
[pip3] numpy==2.2.6
[pip3] pytorch_tokenizers==1.0.1
[pip3] torch==2.11.0.dev20251222+cpu
[pip3] torchao==0.16.0+git08e5e203f
[pip3] torchaudio==2.10.0.dev20251222+cpu
[pip3] torchdata==0.11.0
[pip3] torchsr==1.0.4
[pip3] torchtune==0.6.1
[pip3] torchvision==0.25.0.dev20251222+cpu
[conda] executorch 1.1.0a0+cccf977 pypi_0 pypi
[conda] numpy 2.2.6 pypi_0 pypi
[conda] pytorch-tokenizers 1.0.1 pypi_0 pypi
[conda] torch 2.11.0.dev20251222+cpu pypi_0 pypi
[conda] torchao 0.16.0+git08e5e203f pypi_0 pypi
[conda] torchaudio 2.10.0.dev20251222+cpu pypi_0 pypi
[conda] torchdata 0.11.0 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchtune 0.6.1 pypi_0 pypi
[conda] torchvision 0.25.0.dev20251222+cpu pypi_0 pypi
cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin