Add recurrent gated delta rule custom op for Qwen3.5 attention #18088
Phineas1500 wants to merge 3 commits into pytorch:main
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18088
Note: Links to docs will display an error until the docs builds have been completed.
As of commit e5540ad with merge base 8c0a60b: ❌ 1 Awaiting Approval, 29 New Failures, 1 Cancelled Job, 3 Pending.
Pull request overview
Adds a fused llama::recurrent_gated_delta_rule custom op and integrates it into Qwen3.5 GatedDeltaNet attention to avoid the Python per-token recurrence loop when the op is available, along with tighter custom-op library discovery/loading and new test coverage.
Changes:
- Implemented and registered `llama::recurrent_gated_delta_rule` (runtime kernel + ATen/AOT registrations) and updated attention to use it with a fallback path.
- Refined `custom_ops_aot_lib` discovery/loading (package-local by default, optional `EXECUTORCH_CUSTOM_OPS_AOT_LIB` override).
- Added tests for recurrent-state correctness/parity, chunked prefill behavior, and export graph op selection.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| extension/llm/custom_ops/test_update_cache.py | Adds unit tests for recurrent gated delta rule correctness, .out behavior, and chunking parity. |
| extension/llm/custom_ops/op_tile_crop_aot.cpp | Replaces WRAP_TO_ATEN usage with explicit ET↔ATen conversion helpers for .out. |
| extension/llm/custom_ops/op_sdpa_aot.cpp | Adds ATen bindings for recurrent op; refactors multiple .out wrappers to explicit conversions. |
| extension/llm/custom_ops/op_sdpa.h | Declares the new recurrent_gated_delta_rule_out kernel signature. |
| extension/llm/custom_ops/op_sdpa.cpp | Implements recurrent kernel logic and registers the ExecuTorch kernel. |
| extension/llm/custom_ops/op_fast_hadamard_transform_aten.cpp | Refactors .out binding to explicit ET↔ATen conversion helpers. |
| extension/llm/custom_ops/custom_ops.py | Tightens custom op library discovery/loading; adds meta impl for recurrent op. |
| extension/llm/custom_ops/CMakeLists.txt | Adds MSVC /Zc:__cplusplus compile option. |
| examples/models/llama/tests/test_qwen3_5_attention.py | Adds chunked prefill parity + fused-op vs fallback parity tests. |
| examples/models/llama/tests/test_export_llama_lib.py | Adds tiny Qwen3.5 export test asserting recurrent op selection in graph. |
| examples/models/llama/attention.py | Adds lazy lookup/loading for fused recurrent op and uses it when available. |
```python
def _get_custom_ops_library_override() -> Path | None:
    override = os.environ.get("EXECUTORCH_CUSTOM_OPS_AOT_LIB")
    if override is None:
        return None

    lib_path = Path(override).expanduser().resolve()
    assert lib_path.is_file(), (
        "EXECUTORCH_CUSTOM_OPS_AOT_LIB must point to an existing "
        f"custom_ops_aot_lib, but got {lib_path}"
    )
    return lib_path


def _find_custom_ops_library() -> Path:
    override = _get_custom_ops_library_override()
    if override is not None:
        return override

    package_path = Path(__file__).parent.resolve()
    logging.info(f"Looking for libcustom_ops_aot_lib.so in {package_path}")
    candidates = []
    patterns = (
        "**/custom_ops_aot_lib.dll",
        "**/custom_ops_aot_lib.so",
        "**/custom_ops_aot_lib.dylib",
    )

    for pattern in patterns:
        candidates.extend(package_path.glob(pattern))

    libs = sorted({path.resolve() for path in candidates if path.is_file()})
    assert libs, f"Could not find custom_ops_aot_lib under {package_path}"
    return max(libs, key=lambda path: path.stat().st_mtime)
```
Avoid using assert for runtime validation of EXECUTORCH_CUSTOM_OPS_AOT_LIB / library discovery. Assertions can be stripped with python -O, turning these into silent misconfigurations; raise a ValueError/FileNotFoundError with the same message instead.
```python
try:
    from executorch.extension.llm.custom_ops import custom_ops  # noqa: F401
except Exception:
    return None
```
_get_recurrent_gated_delta_rule_op() swallows all exceptions when importing executorch.extension.llm.custom_ops.custom_ops. Catching broad Exception can hide real load/link errors and make debugging difficult; consider narrowing to ImportError/OSError (or logging the exception at debug level) so unexpected failures surface.
```cpp
std::vector<float> kv_mem(v_head_dim);
std::vector<float> delta(v_head_dim);
```
recurrent_gated_delta_rule_out allocates std::vector buffers (kv_mem, delta) inside the per-(batch, head) loop. For long sequences / many heads this adds repeated heap allocations and can dominate runtime; allocate these buffers once per call (or reuse a scratch buffer) and resize as needed, or use stack/arena allocation when sizes are small.
```cpp
namespace {
template <typename EType, typename AType>
auto to_et_arg(AType&& value) {
  return executorch::extension::internal::type_convert<AType, EType>(
      std::forward<AType>(value));
}

at::Tensor& copy_et_result_to_out(Tensor& et_result, at::Tensor& out) {
  auto converted_result =
      executorch::extension::internal::type_convert<Tensor&, at::Tensor>(
          et_result)
          .call();
  at::native::resize_output(out, converted_result.sizes());
  out.copy_(converted_result);
  return out;
}
```
The to_et_arg / copy_et_result_to_out helpers are duplicated here and in other custom-op AOT wrappers (tile_crop / sdpa / fast_hadamard_transform). Consider factoring them into a shared utility header to reduce copy-paste and keep conversion semantics consistent across ops.
```python
if os.name == "nt":
    os.add_dll_directory(str(lib_path.parent))
    torch_lib_dir = Path(torch.__file__).resolve().parent / "lib"
    if torch_lib_dir.is_dir():
        os.add_dll_directory(str(torch_lib_dir))
```
On Windows, os.add_dll_directory() returns a handle that must be kept alive; otherwise the directory is removed immediately (CPython refcounting), which can cause torch.ops.load_library() to fail to resolve dependent DLLs. Store the returned handles (e.g., in a module-level list) at least through the load (and ideally for process lifetime).
@lucylq this is a PR for optimizing the recurrence in Qwen 3.5, which we discussed here: #17801 (comment). I'm next going to make a PR for quantization. I'll let you know once that's up.
```cpp
m.impl("tile_crop", torch::executor::native::tile_crop_aten);
m.impl(
    "tile_crop.out",
    WRAP_TO_ATEN(torch::executor::native::tile_crop_out_no_context, 2));
```
what problem did you run into with WRAP_TO_ATEN?
Can you tell me a bit more about the serialization issue you ran into as well as the MSVC one?
```python
    )
    return core_attn_out.transpose(1, 2).contiguous().to(initial_dtype)

core_attn_out = torch.zeros(
```
Can you put this logic in a function called something like `naive_gated_delta_rule_op`, and then just have the if statement switch between them, to tidy this function up a bit?
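The suggested restructuring might look like the following skeleton (names, signatures, and the placeholder update rule are illustrative, not the PR's actual math):

```python
def naive_gated_delta_rule_op(inputs, state):
    """Stand-in for the existing per-token Python recurrence loop."""
    out = []
    for x in inputs:                 # one step per token
        state = 0.9 * state + x      # placeholder update, not the real rule
        out.append(state)
    return out, state


def gated_delta_rule(inputs, state, fused_op=None):
    """Single dispatch point: fused custom op when available, else the loop."""
    if fused_op is not None:
        return fused_op(inputs, state)
    return naive_gated_delta_rule_op(inputs, state)
```

The point is structural: the branch chooses between two implementations with identical signatures, so the calling function shrinks to one dispatch line.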
```cmake
set(_common_compile_options
    $<$<CXX_COMPILER_ID:MSVC>:/wd4996>
    $<$<CXX_COMPILER_ID:MSVC>:/Zc:__cplusplus>
```
What codepath are you going down that isn't triggering properly without this? Typically the c10 pattern is to just have explicit MSVC conditions and not rely on the C++ version on Windows, IIRC. I could be wrong on that though.
Summary
This PR adds a fused `llama::recurrent_gated_delta_rule` custom op and wires Qwen3.5 GatedDeltaNet attention to use it instead of the Python per-token recurrence loop when the op is available. It also tightens local custom-op loading so we no longer implicitly scan repo-local `cmake-out*` directories, and adds coverage for recurrent-state correctness, chunked prefill behavior, and export graph selection.

What changed

- `llama::recurrent_gated_delta_rule` runtime and AOT registrations
- `custom_ops_aot_lib` discovery: `EXECUTORCH_CUSTOM_OPS_AOT_LIB` override, no more implicit `cmake-out*` scanning
- Tests covering `.out` variant behavior and `llama.recurrent_gated_delta_rule`

Validation
Linux CPU-only (aarch64)
Built `custom_ops_aot_lib` successfully and loaded it via `EXECUTORCH_CUSTOM_OPS_AOT_LIB`. Passed:

- `pytest extension/llm/custom_ops/test_update_cache.py::RecurrentGatedDeltaRuleTest -q`: 3 passed
- `pytest examples/models/llama/tests/test_qwen3_5_attention.py -q`: 7 passed
- `pytest examples/models/llama/tests/test_export_llama_lib.py::ExportLlamaLibTest::test_tiny_qwen35_export_uses_recurrent_gated_delta_rule -q`: 1 passed

Real-model CPU validation
On a real `Qwen3.5-0.8B` CPU run, fused recurrence matched the fallback path on next-token selection with very small logit drift, and improved eager prefill latency on the tested prompt. Observed on local CPU validation:

- logit drift on the order of `1e-5`
- eager prefill speedup of about `1.6x` on the tested prompt

Windows note
A local Windows-only FFHT/MSVC workaround was used during development to keep the local build usable, but that workaround is intentionally not included in this PR.
Non-goals / separate issues
I did not treat the local `program.fbs` serialization issue as part of this change. This branch does not modify `exir/_serialize/*` or `schema/program.fbs`, and serialization-focused checks passed on both this branch and clean `main` once the local environment was set up correctly.

A separate end-to-end tiny Qwen3.5 `.pte` export probe hit `RuntimeError: Missing out variants: {'aten::alias'}`. That appears to be a separate pre-existing export issue outside this change set.