feat: generate dispatch entrypoints for bindings by voltjia · Pull Request #604 · InfiniTensor/InfiniOps

voltjia · 2026-05-12T14:26:35Z

Summary

Generate generated/bindings/generated_dispatch.h and generated/bindings/generated_dispatch.cc from scripts/generate_wrappers.py.
Split generated Python bindings into per-operator translation units while routing Make, Call, instance invocation, cache clearing, and active implementation queries through the generated dispatch entrypoints.
Compile every generated binding .cc via CMake instead of compiling only the monolithic ops.cc.
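
As a rough illustration of the layout described above, the sketch below plans one shared dispatch header and source plus one small binding file per operator, then lists every `.cc` file that the build would need to compile. It is not the code in scripts/generate_wrappers.py: the operator names, the `Dispatch<Op>Call` entrypoint names, and the helper functions are all hypothetical.

```python
# Hypothetical operator names; the real generator discovers its own operator set.
OPS = ["add", "gemm"]

DISPATCH_HEADER = "generated_dispatch.h"
DISPATCH_SOURCE = "generated_dispatch.cc"


def camel(name):
    """Turn `rms_norm` into `RmsNorm` for use in illustrative C++ identifiers."""
    return "".join(part.capitalize() for part in name.split("_"))


def plan_outputs(ops):
    """Return a mapping from output file name to generated C++ text."""
    outputs = {}

    # One shared header declares an entrypoint per operator (names are made up).
    decls = "\n".join(f"void Dispatch{camel(op)}Call();" for op in ops)
    outputs[DISPATCH_HEADER] = f"#pragma once\n{decls}\n"

    # One shared source defines them; only this file would see backend headers.
    defs = "\n".join(f"void Dispatch{camel(op)}Call() {{}}" for op in ops)
    outputs[DISPATCH_SOURCE] = f'#include "{DISPATCH_HEADER}"\n{defs}\n'

    # One small binding translation unit per operator.
    for op in ops:
        outputs[f"{op}.cc"] = (
            f'#include "{DISPATCH_HEADER}"\n'
            f"// The pybind registration for `{op}` would go here.\n"
        )

    return outputs


def cc_sources(outputs):
    """List every generated `.cc` file, which is what the build must compile."""
    return sorted(name for name in outputs if name.endswith(".cc"))


outputs = plan_outputs(OPS)
sources = cc_sources(outputs)
```

With this plan, the source list grows by one file per operator, while the dispatch source stays a single file.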

Motivation

This change prepares the binding generator for larger generated operator sets without relying on each binding translation unit to see every backend implementation header.

The existing monolithic binding source works because all Operator<Key, Device, Index> specializations are visible in one translation unit. A naive split binding layout changes ActiveImplementations visibility and can produce different runtime dispatch behavior. The generated dispatch source keeps implementation visibility centralized while allowing the pybind registration code to compile in smaller per-operator translation units.
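
The visibility problem can be modeled with a small toy, shown here only as an analogy to the C++ behavior. It assumes that the set of active implementations a translation unit computes depends on which backend headers that unit includes; the header paths and implementation indices are hypothetical.

```python
# Toy model: each backend header contributes some (operator, index) entries.
# The paths and indices are hypothetical and do not come from the repository.
HEADER_IMPLS = {
    "backend_a/add.h": {("add", 0)},
    "backend_b/add.h": {("add", 1)},
}


def active_implementations(op, included_headers):
    """Return the implementation indices visible for `op` given the includes."""
    visible = set()

    for header in included_headers:
        visible |= HEADER_IMPLS.get(header, set())

    return sorted(index for name, index in visible if name == op)


# A monolithic source includes every backend header.
monolithic = active_implementations("add", ["backend_a/add.h", "backend_b/add.h"])

# A naive per-operator split might include only a subset of them.
naive_split = active_implementations("add", ["backend_a/add.h"])

# A centralized dispatch source includes everything again.
central_dispatch = active_implementations("add", list(HEADER_IMPLS))
```

In this model the naive split sees fewer implementations than the monolithic source, while the centralized dispatch source sees the same set, which is the property the generated dispatch layer is meant to preserve.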

Related: #593

Type of Change

feat — new feature / new operator / new platform
N/A fix — bug fix
N/A perf — performance improvement (no behavioral change)
N/A refactor — code restructuring without behavior change
N/A test — adding or fixing tests only
N/A docs — documentation only
build / ci — build system or CI configuration
N/A chore — tooling, formatting, or other non-code changes
N/A Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

Test Results on Supported Platforms

Latest full-platform rerun after commit a00c8e51 (refactor: rely on overloads in generated dispatch): generated-dispatch-a00c8e51-20260513-150730. Each platform used pytest tests/ from the generated-dispatch branch on GPU/NPU 6 unless noted otherwise.

| Platform | Built | `pytest` Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | Yes | `4151 passed, 1375 skipped in 318.69s` | GPU 6 |
| Iluvatar | Yes | `3651 passed, 375 skipped in 263.04s` | GPU 6 |
| MetaX | Yes | `5795 passed, 1447 skipped in 365.97s` | GPU 6 |
| Cambricon | Yes | `12 failed, 3061 passed, 3857 skipped in 911.23s` | GPU 6; known existing `tests/test_add.py` MLU `int16 random_` failures per prior platform runs |
| Moore | Yes | `300 failed, 5459 passed, 1483 skipped in 525.00s` | GPU 6; known existing `tests/test_gemm.py` failures per prior platform runs |
| Ascend | Yes | `24 failed, 3804 passed, 138 skipped in 464.88s`; job reported exit code 137 after pytest output | NPU 6 requested; failures are `tests/test_linear.py` OOMs, with container-visible `npu:0` reporting only tens of MiB free |

Notes on result differences from the previous table:

The earlier table was from generated-dispatch-full-20260513; this table supersedes it for the current head commit a00c8e51.
NVIDIA and Iluvatar counts match the size of the historical baseline for a direct pytest tests/ run on this branch. MetaX, Moore, and Cambricon match their previously known result classes.
Cambricon and Moore failures are pre-existing platform test failures, not new dispatch-generation failures.
Ascend differs from the prior table, which showed 16 failures: this rerun used NPU 6 and failed 24 tests/test_linear.py cases due to OOM. Logs show PyTorch reporting NPU 0 inside the container with only ~24-30 MiB free while allocating 34-66 MiB tensors, so this appears to be a device-memory state or mapping issue rather than a generated-dispatch compile or symbol issue.

Full `pytest` output summary

Full-platform rerun generated-dispatch-a00c8e51-20260513-150730:
NVIDIA: 4151 passed, 1375 skipped in 318.69s
Iluvatar: 3651 passed, 375 skipped in 263.04s
MetaX: 5795 passed, 1447 skipped in 365.97s
Cambricon: 12 failed, 3061 passed, 3857 skipped in 911.23s
Moore: 300 failed, 5459 passed, 1483 skipped in 525.00s
Ascend: 24 failed, 3804 passed, 138 skipped in 464.88s
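
For comparing reruns, the summary lines above can be parsed mechanically. The sketch below is a reviewer-side helper, not part of the PR; the regular expression assumes the `N failed, N passed, N skipped in Ns` shape seen in this table and would need extending for other pytest summary fields.

```python
import re

# Matches the summary shape used above; the "failed" part is optional.
SUMMARY_RE = re.compile(
    r"(?:(\d+) failed, )?(\d+) passed, (\d+) skipped in ([\d.]+)s"
)

# Summary lines copied from the rerun generated-dispatch-a00c8e51-20260513-150730.
RESULTS = {
    "NVIDIA": "4151 passed, 1375 skipped in 318.69s",
    "Iluvatar": "3651 passed, 375 skipped in 263.04s",
    "MetaX": "5795 passed, 1447 skipped in 365.97s",
    "Cambricon": "12 failed, 3061 passed, 3857 skipped in 911.23s",
    "Moore": "300 failed, 5459 passed, 1483 skipped in 525.00s",
    "Ascend": "24 failed, 3804 passed, 138 skipped in 464.88s",
}


def parse_summary(line):
    """Parse one pytest summary line into counts and elapsed seconds."""
    match = SUMMARY_RE.search(line)

    if match is None:
        raise ValueError(f"unrecognized pytest summary: {line!r}")

    failed, passed, skipped, seconds = match.groups()

    return {
        "failed": int(failed or 0),
        "passed": int(passed),
        "skipped": int(skipped),
        "seconds": float(seconds),
    }


parsed = {platform: parse_summary(line) for platform, line in RESULTS.items()}
platforms_with_failures = sorted(
    platform for platform, result in parsed.items() if result["failed"]
)
```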

Benchmark / Performance Impact

This PR is expected to reduce pybind binding compilation pressure by moving operator registration into per-op translation units while keeping implementation dispatch centralized in generated_dispatch.cc.

Timing from full-platform rerun generated-dispatch-a00c8e51-20260513-150730:

| Platform | Test command time |
| --- | --- |
| NVIDIA | 318.69s |
| Iluvatar | 263.04s |
| MetaX | 365.97s |
| Cambricon | 911.23s |
| Moore | 525.00s |
| Ascend | 464.88s |

Notes for Reviewers

The important invariant is that all code paths which depend on ActiveImplementations visibility must remain in generated_dispatch.cc, because that file includes the backend implementation headers. Per-op binding files intentionally include only base/operator binding headers plus lightweight device marker headers, then call the generated dispatch entrypoints.
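
One way a reviewer might spot-check this invariant is to scan the generated per-op sources for backend includes. The sketch below is only an assumption about how such a check could look: the backend path prefixes and the sample include paths are hypothetical and would need to match the real header layout.

```python
import re

# Captures the path of each quoted `#include` directive.
INCLUDE_RE = re.compile(r'^\s*#include\s+"([^"]+)"', re.MULTILINE)

# Hypothetical prefixes that mark backend implementation headers.
BACKEND_PREFIXES = ("nvidia/", "ascend/")


def backend_includes(source_text):
    """Return the backend implementation headers included by `source_text`."""
    return [
        path
        for path in INCLUDE_RE.findall(source_text)
        if path.startswith(BACKEND_PREFIXES)
    ]


# A per-op binding file should report no backend includes.
per_op_text = '#include "generated_dispatch.h"\n#include "base/add.h"\n'

# The dispatch source is the one place where backend includes are expected.
dispatch_text = '#include "generated_dispatch.h"\n#include "nvidia/add.h"\n'

per_op_violations = backend_includes(per_op_text)
dispatch_backends = backend_includes(dispatch_text)
```

A non-empty result for any file other than generated_dispatch.cc would suggest that a path depending on ActiveImplementations visibility has leaked into a per-op translation unit.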

This PR does not expose generated_dispatch as a public C API. It is an internal generated layer that can be reused by a future C API implementation.

Checklist

Title, Branch, and Commits

PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
No stray merge commits from master — the branch is rebased cleanly on top of the current master.
No fixup! / squash! / wip commits remain.

Scope and Design

Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
No unrelated formatting churn that would obscure the diff.
Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
No trailing whitespace, tab/space mixing, or stray BOMs.
Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
All comments and error messages are in English (CONTRIBUTING.md §Code/General).
Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

Code follows the Google C++ Style Guide strictly.
clang-format (version 21, per .github/workflows/clang-format.yml) has been run against all modified .h, .cc, .cuh, and .mlu files; the diff is clean.
N/A clang-tidy concerns (per .clang-tidy) have been reviewed — no new warnings beyond the existing baseline.
N/A Operator parameter order is unchanged by this PR.
N/A No new exception/error path is added.
N/A Kernel files are not changed by this PR.
N/A New operators are not added by this PR.
No raw new/delete; RAII / smart pointers / existing allocators are used.

Python Specific (if Python files changed)

Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
ruff format --check passes cleanly — if not, run ruff format and commit the result.
Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
A blank line appears before each return, except when it directly follows a control-flow statement (CONTRIBUTING.md §Python).
N/A No new docstrings or type hints were introduced.

Testing

pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests).
Every supported platform was tested; failing platforms are recorded in the table above.
N/A New operator tests were not added because this PR changes generated binding structure, not operator behavior.
Existing parameterized tests are used for smoke coverage.
N/A This is a feature PR, not a bug-fix PR.

Build, CI, and Tooling

The project builds cleanly from a fresh directory with pip install .[dev] on at least one affected platform.
compile_commands.json still regenerates (CMake option CMAKE_EXPORT_COMPILE_COMMANDS=ON in pyproject.toml — required by the code-lint skill and clang-tidy -p).
N/A No new backend/device auto-detection was added.
Only one CUDA-like GPU backend is selectable at a time — the existing mutual-exclusion check in CMakeLists.txt is not broken.
Both CI workflows (clang-format.yml, ruff.yml) are green on GitHub Actions.
No new runtime dependency was added without updating pyproject.toml's [project.optional-dependencies] (or justified in the PR description).

Documentation

N/A README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
N/A New operators, new dispatch helpers, or new public utilities are documented (docstring, header comment, or an addition to CONTRIBUTING.md §Some Code Explanations).
N/A No user-visible breaking change is introduced.

Security and Safety

No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
Third-party code is license-compatible and attributed.
No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

voltjia · 2026-05-13T05:09:07Z

@bitzyz, please do the initial review; @Ziminli, please do the final review.

voltjia added 2 commits May 12, 2026 21:43

feat: generate dispatch entrypoints for bindings

5958a1e

refactor: clarify generated dispatch names

9e749f3

voltjia requested review from Ziminli and bitzyz May 13, 2026 05:08

voltjia marked this pull request as ready for review May 13, 2026 05:09

voltjia requested a review from a team May 13, 2026 05:09

bitzyz previously approved these changes May 13, 2026

refactor: rely on overloads in generated dispatch

a00c8e5

voltjia dismissed bitzyz’s stale review via a00c8e5 May 13, 2026 07:01

voltjia requested a review from bitzyz May 13, 2026 08:36

bitzyz approved these changes May 13, 2026

Ziminli approved these changes May 13, 2026

voltjia merged commit cef8806 into master May 13, 2026
4 checks passed

voltjia deleted the feat/generated-dispatch branch May 13, 2026 09:28

voltjia merged 3 commits into master from feat/generated-dispatch
