🐛 Describe the bug
I want to lower and quantize LLMs for the QNN backend. To lower and quantize the SmolLM2 model, I followed these two guides:
https://docs.pytorch.org/executorch/main/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.html
https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/README.md#smollm2
I'm running the following command:
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} --decoder_model smollm2_135m --model_mode hybrid --prefill_ar_len 128 --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --compile_only
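For context, the export step this command drives ultimately goes through the QNN lowering helper that appears in the traceback below. A rough Python-level sketch of that step (function names are taken from the traceback and the ExecuTorch QNN examples; the exact signatures, arguments, and the SM8650 target are assumptions and may differ between versions):

```python
# Sketch only: signatures assumed from the ExecuTorch QNN examples.
from executorch.backends.qualcomm.serialization.qc_schema import QcomChipset
from executorch.backends.qualcomm.utils.utils import (
    generate_htp_compiler_spec,
    generate_qnn_executorch_compiler_spec,
    to_edge_transform_and_lower_to_qnn,
)

# Quantized (non-fp16) HTP backend options.
backend_options = generate_htp_compiler_spec(use_fp16=False)
compile_spec = generate_qnn_executorch_compiler_spec(
    soc_model=QcomChipset.SM8650,  # hypothetical target SoC
    backend_options=backend_options,
)
# `model` is the quantized SmolLM2 nn.Module, `example_inputs` its inputs;
# this is the call where the traceback below originates.
edge_prog_mgr = to_edge_transform_and_lower_to_qnn(
    model, example_inputs, compile_spec
)
```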
It fails with the following error after quite some time:
return cls.__new__(cls, *args)
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in SAVE MODE.
[INFO] [Qnn ExecuTorch]: Running level=3 optimization.
[INFO] [Qnn ExecuTorch]: Running level=3 optimization.
[INFO 2026-01-15 18:01:48,832 qnn_preprocess.py:172] Processing Method(0): (1/73)
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[INFO 2026-01-15 18:01:48,847 wrappers.py:99] HybridTextDecoder::compile completed in 498.08749079704285s
[INFO 2026-01-15 18:01:48,847 wrappers.py:99] MultiModalManager::compile completed in 498.0877661705017s
Traceback (most recent call last):
  File "/home/felix/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/fx/passes/infra/pass_manager.py", line 277, in __call__
    res = fn(module)
  File "/home/felix/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/fx/passes/infra/pass_base.py", line 47, in __call__
    res = self.call(graph_module)
  File "/home/felix/workspace/executorch/backends/qualcomm/_passes/insert_io_qdq.py", line 151, in call
    self._insert(graph_module)
  File "/home/felix/workspace/executorch/backends/qualcomm/_passes/insert_io_qdq.py", line 147, in _insert
    self.q_dq_map[n.meta[QCOM_QUANT_ATTRS][QCOM_ENCODING]],
KeyError: <EdgeOpOverload: quantized_decomposed.dequantize_per_tensor.default>: schema = quantized_decomposed::dequantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype, *, ScalarType? out_dtype=None) -> Tensor

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 637, in main
    export_llama(args)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 599, in export_llama
    compile(
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 143, in compile
    multi_modal_mgr.compile(compile_specs=compile_specs, pte_filenames=pte_filenames)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 106, in wrapper
    func(cls, *args, **kwargs)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 1064, in compile
    self.process(compile_request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 1046, in process
    Processor.process(self, request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 124, in process
    return self._next_handler.process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 150, in process
    super().process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 124, in process
    return self._next_handler.process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 150, in process
    super().process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 124, in process
    return self._next_handler.process(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 149, in process
    getattr(self, request.method_name)(request)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 106, in wrapper
    func(cls, *args, **kwargs)
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/wrappers.py", line 890, in compile
    edge_prog_mgr = to_edge_transform_and_lower_to_qnn(
  File "/home/felix/workspace/executorch/backends/qualcomm/utils/utils.py", line 442, in to_edge_transform_and_lower_to_qnn
    return to_edge_transform_and_lower(
  File "/home/felix/workspace/executorch/exir/program/_program.py", line 115, in wrapper
    return func(*args, **kwargs)
  File "/home/felix/workspace/executorch/exir/program/_program.py", line 1379, in to_edge_transform_and_lower
    edge_manager = edge_manager.to_backend(method_to_partitioner)
  File "/home/felix/workspace/executorch/exir/program/_program.py", line 115, in wrapper
    return func(*args, **kwargs)
  File "/home/felix/workspace/executorch/exir/program/_program.py", line 1681, in to_backend
    new_edge_programs = to_backend(method_to_programs_and_partitioners)
  File "/home/felix/miniconda3/envs/executorch/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/felix/workspace/executorch/exir/backend/backend_api.py", line 762, in _
    lower_all_submodules_to_backend(
  File "/home/felix/workspace/executorch/exir/backend/backend_api.py", line 591, in lower_all_submodules_to_backend
    backend_name_to_subclass[backend_id].preprocess_multimethod(
  File "/home/felix/workspace/executorch/backends/qualcomm/qnn_preprocess.py", line 173, in preprocess_multimethod
    py_op_wrappers = QnnBackend._build_op_wrappers(
  File "/home/felix/workspace/executorch/backends/qualcomm/qnn_preprocess.py", line 54, in _build_op_wrappers
    graph_module = QnnPassManager().transform_for_preprocess_pipeline(
  File "/home/felix/workspace/executorch/backends/qualcomm/_passes/qnn_pass_manager.py", line 262, in transform_for_preprocess_pipeline
    self._transform(exported_program.graph_module)
  File "/home/felix/workspace/executorch/backends/qualcomm/_passes/qnn_pass_manager.py", line 134, in _transform
    return self(graph_module).graph_module
  File "/home/felix/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/fx/passes/infra/pass_manager.py", line 303, in __call__
    raise Exception(msg) from e # noqa: TRY002
Exception: An error occurred when running the 'InsertIOQDQ' pass after the following passes: ['FoldQDQ', 'ConvertMhaToSha', 'InsertRequantize']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 648, in <module>
    main()
  File "/home/felix/workspace/executorch/examples/qualcomm/oss_scripts/llama/llama.py", line 643, in main
    raise Exception(e)
Exception: An error occurred when running the 'InsertIOQDQ' pass after the following passes: ['FoldQDQ', 'ConvertMhaToSha', 'InsertRequantize']
This looks like a bug to me. The same error occurs with other models, such as Qwen and Gemma.
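From the traceback, the failing statement is the dictionary lookup in InsertIOQDQ._insert: the node's QCOM_ENCODING is quantized_decomposed.dequantize_per_tensor.default, and that op is not a key of self.q_dq_map. A minimal standalone sketch of that failure shape (the map contents below are hypothetical stand-in strings, under the assumption that q_dq_map registers only quantize ops as keys):

```python
# Hypothetical stand-ins for the real EdgeOpOverload keys; this only
# reproduces the shape of the lookup failure, not the pass itself.
q_dq_map = {
    "quantized_decomposed.quantize_per_tensor.default":
        "quantized_decomposed.dequantize_per_tensor.default",
}

# The I/O node arrives carrying a *dequantize* encoding in its metadata.
node_encoding = "quantized_decomposed.dequantize_per_tensor.default"
op = q_dq_map[node_encoding]  # KeyError, same shape as the failure above
```

If that reading is right, one of the preceding passes listed in the exception message ('FoldQDQ', 'ConvertMhaToSha', 'InsertRequantize') seems to leave a dequantize encoding on an I/O node that InsertIOQDQ does not expect.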
#qcom_aisw
Versions
PyTorch version: 2.11.0.dev20251222+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A
OS: Arch Linux (x86_64)
GCC version: (GCC) 15.2.1 20260103
Clang version: 21.1.6
CMake version: version 3.31.10
Libc version: glibc-2.42
Python version: 3.10.19 (main, Oct 21 2025, 16:43:05) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.42
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 576.52
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5975WX 32-Cores
CPU family: 25
Model: 8
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 2
BogoMIPS: 7186.50
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 16 MiB (32 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] executorch==1.1.0a0+cccf977
[pip3] numpy==2.2.6
[pip3] pytorch_tokenizers==1.0.1
[pip3] torch==2.11.0.dev20251222+cpu
[pip3] torchao==0.16.0+git08e5e203f
[pip3] torchaudio==2.10.0.dev20251222+cpu
[pip3] torchdata==0.11.0
[pip3] torchsr==1.0.4
[pip3] torchtune==0.6.1
[pip3] torchvision==0.25.0.dev20251222+cpu
[conda] executorch 1.1.0a0+cccf977 pypi_0 pypi
[conda] numpy 2.2.6 pypi_0 pypi
[conda] pytorch-tokenizers 1.0.1 pypi_0 pypi
[conda] torch 2.11.0.dev20251222+cpu pypi_0 pypi
[conda] torchao 0.16.0+git08e5e203f pypi_0 pypi
[conda] torchaudio 2.10.0.dev20251222+cpu pypi_0 pypi
[conda] torchdata 0.11.0 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchtune 0.6.1 pypi_0 pypi
[conda] torchvision 0.25.0.dev20251222+cpu pypi_0 pypi
cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin