Llama 3.2 1B export appears to be broken on ExecuTorch Vulkan backend #16647

### 🐛 Describe the bug

I am trying to export Llama 3.2 1B to run on an Android device using the Vulkan backend, following this tutorial: https://docs.pytorch.org/executorch/main/backends/vulkan/tutorials/etvk-llama-tutorial.html. I was unable to install the `executorch` package from pip, so I built it from source following this guide: https://docs.pytorch.org/executorch/main/using-executorch-building-from-source.html. After that, to export Llama, I run the command:
```
python -m examples.models.llama.export_llama \
   -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \
   -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \
   -d fp32 --${BACKEND} \
   -qmode ${QUANT} -G ${GROUP_SIZE} \
   --max_seq_length ${CONTEXT_LENGTH} \
   --max_context_length ${CONTEXT_LENGTH} \
   -kv --use_sdpa_with_kv_cache \
   --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
   --model "llama3_2" \
   --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte
```
and it fails with a dynamic shape error in the slice_copy operation. The main segment of the error is this:

```
While executing %aten_slice_copy_tensor : [num_users=2] = call_function[target=executorch.exir.dialects.edge._ops.aten.slice_copy.Tensor](args = (%b_rope_freqs_cos, 0, %_local_scalar_dense, %add), kwargs = {})
Original traceback:
  File "/users/Lakshman/executorch/.executorch-qnn/lib/python3.12/site-packages/torch/_dynamo/functional_export.py", line 216, in forward
    res = self._export_root(*args, **kwargs)
  File "/users/Lakshman/executorch/.executorch-qnn/lib/python3.12/site-packages/executorch/examples/models/llama/llama_transformer.py", line 196, in forward
    freqs_cos, freqs_sin = self.rope.get_freqs(
  File "/users/Lakshman/executorch/.executorch-qnn/lib/python3.12/site-packages/executorch/examples/models/llama/rope.py", line 300, in get_freqs
    freqs_cos = self.freqs_cos.narrow(0, input_pos_item, seq_len)

Use tlparse to see full graph. (https://github.com/pytorch/tlparse?tab=readme-ov-file#tlparse-parse-structured-pt2-logs)
```
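For reference, the failing node in the trace is `slice_copy(b_rope_freqs_cos, 0, _local_scalar_dense, add)`, which looks like the lowered form of the `narrow(0, input_pos_item, seq_len)` call in `rope.py`: a slice from `start` to `start + seq_len`, where `start` is only known at runtime. Below is a pure-Python sketch of that equivalence. The helper name `narrow_rows` and the bounds assertion are illustrative assumptions, not ExecuTorch code.

```python
def narrow_rows(table, start, length):
    """Mimic tensor.narrow(0, start, length) on a plain list of rows."""
    end = start + length  # plays the role of the %add operand in the trace
    # Assumed validity condition for the slice; illustrative only.
    assert 0 <= start and end <= len(table), "slice out of bounds"
    return table[start:end]


# Stand-in for a precomputed RoPE frequency table with 8 positions.
freqs = [[float(i)] for i in range(8)]
rows = narrow_rows(freqs, 3, 2)
print(rows)
```

If the exporter cannot establish a bound like the one asserted here for a data-dependent `start`, that might explain why the error only shows up with dynamic shapes enabled, but I have not confirmed this.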

Since the error seems to be related to dynamic shapes, I ran the command again with the `--disable_dynamic_shapes` flag to bypass the failing section of the code. The export then fails with a `KeyError` in `get_op_features` in `op_registry.py`:

```
File "/users/Lakshman/executorch/.executorch-qnn/lib/python3.12/site-packages/executorch/backends/vulkan/op_registry.py", line 877, in get_op_features
    return vulkan_supported_ops[target.name()]
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
KeyError: 'aten::index.Tensor'
```

I think this means `aten::index.Tensor` is not a supported Vulkan op but is used somewhere in the model. I tried disabling quantization and rerunning the export, but I ran into the same error. I worked around it by returning a basic default `OpFeatures` object when the operation is not in `vulkan_supported_ops`. After that, I hit an SDPA error: SDPA appears to expect KV cache update nodes, which the Vulkan partitioner seems to skip explicitly (all the `llama.update_cache.default` nodes were skipped). Is this expected?
Finally, I tried the export without the SDPA and KV cache flags, and it passes and produces an export.
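The `KeyError` workaround described above has roughly the following shape. This is a minimal sketch, not the actual patch: `OpFeatures` and `vulkan_supported_ops` here are simplified stand-ins for the objects in `executorch/backends/vulkan/op_registry.py`, the `supported` field and the function name `get_op_features_with_fallback` are hypothetical, and the real function receives an op target rather than a string.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the real OpFeatures class; its actual
# fields are not reproduced here.
@dataclass
class OpFeatures:
    supported: bool = False


# Hypothetical stand-in for the registry dict named in the traceback.
vulkan_supported_ops = {
    "aten::add.Tensor": OpFeatures(supported=True),
}


def get_op_features_with_fallback(op_name: str) -> OpFeatures:
    """Return the registered features, or a default object for unknown ops."""
    # dict.get avoids the KeyError raised by vulkan_supported_ops[op_name].
    return vulkan_supported_ops.get(op_name, OpFeatures())


print(get_op_features_with_fallback("aten::index.Tensor"))
```

In this sketch the default object marks the op as unsupported. Whether a default-constructed `OpFeatures` has the same effect in the real registry is something I have not verified, so the fallback may only be hiding the underlying problem.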

### Versions


I would like to know whether I am missing dependencies or steps that the tutorial does not mention explicitly and that would resolve these issues more directly, since the version I currently have working does not seem to be the intended way to export the model. When building from source, the default Android NDK `glslc` also does not work, as noted here: https://github.com/pytorch/executorch/issues/14507. I therefore run the build with a manually installed Vulkan SDK, which I don't think is listed as a requirement in the building-from-source docs.
I am using Vulkan SDK version 1.4.335, and this is the output of `collect_env`. I am using venv, not conda.
```
Collecting environment information...
PyTorch version: 2.11.0.dev20251222+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.10
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jan  8 2026, 11:30:50) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-71-generic-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7543 32-Core Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   2
Core(s) per socket:                   32
Socket(s):                            2
Stepping:                             1
BogoMIPS:                             5590.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca sev sev_es debug_swap
Virtualization:                       AMD-V
L1d cache:                            2 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             32 MiB (64 instances)
L3 cache:                             512 MiB (16 instances)
NUMA node(s):                         8
NUMA node0 CPU(s):                    0-7,64-71
NUMA node1 CPU(s):                    8-15,72-79
NUMA node2 CPU(s):                    16-23,80-87
NUMA node3 CPU(s):                    24-31,88-95
NUMA node4 CPU(s):                    32-39,96-103
NUMA node5 CPU(s):                    40-47,104-111
NUMA node6 CPU(s):                    48-55,112-119
NUMA node7 CPU(s):                    56-63,120-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] executorch==1.1.0a0+082b62b
[pip3] numpy==2.4.1
[pip3] pytorch_tokenizers==1.0.1
[pip3] torch==2.11.0.dev20251222+cpu
[pip3] torchao==0.16.0+git08e5e203f
[pip3] torchaudio==2.10.0.dev20251222+cpu
[pip3] torchdata==0.11.0
[pip3] torchsr==1.0.4
[pip3] torchtune==0.6.1
[pip3] torchvision==0.25.0.dev20251222+cpu
[conda] Could not collect
``` 

cc @SS-JIA @manuelcandales @digantdesai @cbilgin
