feat: generate dispatch entrypoints for bindings #604

Merged
voltjia merged 3 commits into master from feat/generated-dispatch on May 13, 2026

@voltjia voltjia commented May 12, 2026

Summary

  • Generate generated/bindings/generated_dispatch.h and generated/bindings/generated_dispatch.cc from scripts/generate_wrappers.py.
  • Split generated Python bindings into per-operator translation units while routing Make, Call, instance invocation, cache clearing, and active implementation queries through the generated dispatch entrypoints.
  • Compile every generated binding .cc via CMake instead of compiling only the monolithic ops.cc.
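
The file layout described in the bullets above can be sketched as follows. This is a minimal illustration, not the contents of scripts/generate_wrappers.py: the function name `plan_generated_files`, the per-operator `<op>.cc` naming scheme, and the operator names are assumptions. Only generated_dispatch.h, generated_dispatch.cc, and the generated/bindings/ directory come from this description.

```python
from pathlib import PurePosixPath

# Directory named in the PR summary.
BINDINGS_DIR = PurePosixPath("generated/bindings")


def plan_generated_files(operator_names):
    """Return the files a split generator would emit, as a sorted list.

    Hypothetical helper: one binding translation unit per operator plus a
    single shared dispatch header/source pair.
    """
    files = {
        BINDINGS_DIR / "generated_dispatch.h",
        BINDINGS_DIR / "generated_dispatch.cc",
    }

    for name in operator_names:
        # The `<op>.cc` naming scheme is an assumption for illustration.
        files.add(BINDINGS_DIR / f"{name}.cc")

    return sorted(str(path) for path in files)


planned = plan_generated_files(["add", "gemm"])
print("\n".join(planned))
```

Under these assumptions, the build would compile every path in the returned list, so adding an operator adds one small translation unit rather than growing a single monolithic source.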

Motivation

This change prepares the binding generator for larger generated operator sets without relying on each binding translation unit to see every backend implementation header.

The existing monolithic binding source works because all Operator<Key, Device, Index> specializations are visible in one translation unit. A naive split binding layout changes ActiveImplementations visibility and can produce different runtime dispatch behavior. The generated dispatch source keeps implementation visibility centralized while allowing the pybind registration code to compile in smaller per-operator translation units.
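
The visibility problem can be modelled outside C++. The sketch below is a toy Python model, not the project's code: each translation unit is represented by the list of implementation headers it includes, and `active_implementations` stands in for whatever ActiveImplementations derives from the visible Operator<Key, Device, Index> specializations. The header names and index values are made up for illustration.

```python
# Toy model: which (operator, index) implementations each header makes
# visible. Header names and indices are hypothetical.
HEADER_IMPLEMENTATIONS = {
    "cpu/add.h": {("add", 0)},
    "nvidia/add.h": {("add", 1)},
    "nvidia/gemm.h": {("gemm", 0), ("gemm", 1)},
}


def active_implementations(operator, included_headers):
    """Return the sorted implementation indices visible for `operator`."""
    visible = set()

    for header in included_headers:
        visible |= HEADER_IMPLEMENTATIONS.get(header, set())

    return sorted(index for name, index in visible if name == operator)


all_headers = list(HEADER_IMPLEMENTATIONS)

# Monolithic binding source: every header is included in one place.
monolithic = active_implementations("add", all_headers)

# Naive split: the per-operator file happens to include only one backend.
naive_split = active_implementations("add", ["cpu/add.h"])

# Centralized dispatch: per-operator files include no backend headers and
# forward to one dispatch unit that includes all of them.
centralized = active_implementations("add", all_headers)

print(monolithic, naive_split, centralized)
```

In this model the naive split sees fewer implementations than the monolithic source, while the centralized dispatch unit sees the same set. That is the property the generated dispatch source is meant to preserve.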

Related: #593

Type of Change

  • feat — new feature / new operator / new platform
  • N/A fix — bug fix
  • N/A perf — performance improvement (no behavioral change)
  • N/A refactor — code restructuring without behavior change
  • N/A test — adding or fixing tests only
  • N/A docs — documentation only
  • build / ci — build system or CI configuration
  • N/A chore — tooling, formatting, or other non-code changes
  • N/A Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • N/A PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

Latest full-platform rerun after commit a00c8e51 (refactor: rely on overloads in generated dispatch): generated-dispatch-a00c8e51-20260513-150730. Each platform used pytest tests/ from the generated-dispatch branch on GPU/NPU 6 unless noted otherwise.

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | Yes | 4151 passed, 1375 skipped in 318.69s | GPU 6 |
| Iluvatar | Yes | 3651 passed, 375 skipped in 263.04s | GPU 6 |
| MetaX | Yes | 5795 passed, 1447 skipped in 365.97s | GPU 6 |
| Cambricon | Yes | 12 failed, 3061 passed, 3857 skipped in 911.23s | GPU 6; known existing tests/test_add.py MLU int16 random_ failures per prior platform runs |
| Moore | Yes | 300 failed, 5459 passed, 1483 skipped in 525.00s | GPU 6; known existing tests/test_gemm.py failures per prior platform runs |
| Ascend | Yes | 24 failed, 3804 passed, 138 skipped in 464.88s; job reported exit code 137 after pytest output | NPU 6 requested; failures are tests/test_linear.py OOMs, with container-visible npu:0 reporting only tens of MiB free |

Notes on result differences from the previous table:

  • The earlier table was from generated-dispatch-full-20260513; this table supersedes it for the current head commit a00c8e51.
  • NVIDIA and Iluvatar counts match the historical baseline for a direct pytest tests/ run on this branch. MetaX, Moore, and Cambricon match the previously known result classes.
  • Cambricon and Moore failures are pre-existing platform test failures, not new dispatch-generation failures.
  • Ascend differs from the prior table, which recorded 16 failures: this rerun used NPU 6 and failed 24 tests/test_linear.py cases due to OOM. Logs show PyTorch reporting NPU 0 inside the container with only ~24-30 MiB free while allocating 34-66 MiB tensors, so this appears to be a device-memory state or mapping issue rather than a generated-dispatch compile or symbol issue.
Full `pytest` output summary
Full-platform rerun generated-dispatch-a00c8e51-20260513-150730:
NVIDIA: 4151 passed, 1375 skipped in 318.69s
Iluvatar: 3651 passed, 375 skipped in 263.04s
MetaX: 5795 passed, 1447 skipped in 365.97s
Cambricon: 12 failed, 3061 passed, 3857 skipped in 911.23s
Moore: 300 failed, 5459 passed, 1483 skipped in 525.00s
Ascend: 24 failed, 3804 passed, 138 skipped in 464.88s

Benchmark / Performance Impact

This PR is expected to reduce pybind binding compilation pressure by moving operator registration into per-op translation units while keeping implementation dispatch centralized in generated_dispatch.cc.

Timing from full-platform rerun generated-dispatch-a00c8e51-20260513-150730:

| Platform | Test command time |
| --- | --- |
| NVIDIA | 318.69s |
| Iluvatar | 263.04s |
| MetaX | 365.97s |
| Cambricon | 911.23s |
| Moore | 525.00s |
| Ascend | 464.88s |

Notes for Reviewers

The important invariant is that all code paths which depend on ActiveImplementations visibility must remain in generated_dispatch.cc, because that file includes the backend implementation headers. Per-op binding files intentionally include only base/operator binding headers plus lightweight device marker headers, then call the generated dispatch entrypoints.
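
One way to guard this invariant would be a small check over the generated per-op sources. The sketch below is hypothetical and not part of this PR: the allowed include prefixes, the example source text, and the function name `disallowed_includes` are assumptions chosen for illustration. A real check would need the project's actual header layout.

```python
import re

# Matches `#include "path"` or `#include <path>` and captures the path.
INCLUDE_PATTERN = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)

# Hypothetical allow-list for per-op binding files.
ALLOWED_PREFIXES = ("pybind11/", "base/", "generated_dispatch.h")


def disallowed_includes(source_text, allowed_prefixes=ALLOWED_PREFIXES):
    """Return included paths that do not start with an allowed prefix."""
    includes = INCLUDE_PATTERN.findall(source_text)

    return [path for path in includes if not path.startswith(allowed_prefixes)]


# Made-up per-op binding source; the last include breaks the invariant.
example_source = """\
#include <pybind11/pybind11.h>
#include "base/add.h"
#include "generated_dispatch.h"
#include "nvidia/add/kernel.h"
"""

print(disallowed_includes(example_source))
```

Here only the backend-specific include is reported. A check like this could run over every generated per-op file, with generated_dispatch.cc exempted because it is the one file expected to include backend implementation headers.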

This PR does not expose generated_dispatch as a public C API. It is an internal generated layer that can be reused by a future C API implementation.


Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • clang-format (version 21, per .github/workflows/clang-format.yml) has been run against all modified .h, .cc, .cuh, and .mlu files; the diff is clean.
  • N/A clang-tidy concerns (per .clang-tidy) have been reviewed — no new warnings beyond the existing baseline.
  • N/A Operator parameter order is unchanged by this PR.
  • N/A No new exception/error path is added.
  • N/A Kernel files are not changed by this PR.
  • N/A New operators are not added by this PR.
  • No raw new/delete; RAII / smart pointers / existing allocators are used.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
  • ruff format --check passes cleanly — if not, run ruff format and commit the result.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
  • No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
  • A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
  • A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
  • N/A No new docstrings or type hints were introduced.

Testing

  • pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests).
  • Every supported platform was tested; failing platforms are recorded in the table above.
  • N/A New operator tests were not added because this PR changes generated binding structure, not operator behavior.
  • Existing parameterized tests are used for smoke coverage.
  • N/A This is a feature PR, not a bug-fix PR.

Build, CI, and Tooling

  • The project builds cleanly from a fresh directory with pip install .[dev] on at least one affected platform.
  • compile_commands.json still regenerates (CMake option CMAKE_EXPORT_COMPILE_COMMANDS=ON in pyproject.toml — required by the code-lint skill and clang-tidy -p).
  • N/A No new backend/device auto-detection was added.
  • Only one CUDA-like GPU backend is selectable at a time — the existing mutual-exclusion check in CMakeLists.txt is not broken.
  • Both CI workflows (clang-format.yml, ruff.yml) are green on GitHub Actions.
  • No new runtime dependency was added without updating pyproject.toml's [project.optional-dependencies] (or justified in the PR description).

Documentation

  • N/A README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • N/A New operators, new dispatch helpers, or new public utilities are documented (docstring, header comment, or an addition to CONTRIBUTING.md §Some Code Explanations).
  • N/A No user-visible breaking change is introduced.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

@voltjia voltjia requested review from Ziminli and bitzyz May 13, 2026 05:08

voltjia commented May 13, 2026

@bitzyz for initial review, @Ziminli for final review.

@voltjia voltjia marked this pull request as ready for review May 13, 2026 05:09
@voltjia voltjia requested a review from a team May 13, 2026 05:09
bitzyz previously approved these changes May 13, 2026
@voltjia voltjia merged commit cef8806 into master May 13, 2026
4 checks passed
@voltjia voltjia deleted the feat/generated-dispatch branch May 13, 2026 09:28