
feat: add upsample_nearest1d base #515

Open

voltjia wants to merge 20 commits into feat/torch-operator-bases from codex/add-upsample_nearest1d-base

feat: add upsample_nearest1d base#515
voltjia wants to merge 20 commits into
feat/torch-operator-basesfrom
codex/add-upsample_nearest1d-base

Conversation

Collaborator

@voltjia voltjia commented May 5, 2026

Summary

  • Add the hand-written InfiniOps base class for upsample_nearest1d in src/base/upsample_nearest1d.h.
  • Let the torch code generator reuse src/base/upsample_nearest1d.h instead of emitting generated/base/upsample_nearest1d.h.
  • Apply the base class member-spacing convention required by scripts/check_conventions.py.

Motivation

This PR is part of the feat/torch-codegen base-header migration. The generated UpsampleNearest1d base declaration is moved into src/base so code generation can reuse a reviewed hand-written header.

N/A: no linked issue.

Type of Change

  • feat - new feature / new operator / new platform
  • fix - bug fix
  • perf - performance improvement (no behavioral change)
  • refactor - code restructuring without behavior change
  • test - adding or fixing tests only
  • docs - documentation only
  • build / ci - build system or CI configuration
  • chore - tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| Iluvatar | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| MetaX | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| Cambricon | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| Moore | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| Ascend | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
Full `pytest` output (optional)
N/A: pytest was intentionally not run because this PR targets `feat/torch-codegen`, not `master`, and only adds a reusable base header declaration.

Benchmark / Performance Impact

N/A. This PR only adds a base operator declaration for torch codegen reuse and does not add a runtime implementation.

Notes for Reviewers

  • This PR targets feat/torch-codegen, not master.
  • The branch diff against feat/torch-codegen contains only src/base/upsample_nearest1d.h.
  • Original branch validation reported clang-format 21 passing on src/base/upsample_nearest1d.h; the follow-up formatting commit applies the class member spacing required by scripts/check_conventions.py.

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • N/A: this automated batch uses existing codex/add-upsample_nearest1d-base PR branches targeting feat/torch-codegen; branch renaming is intentionally out of scope.
  • Each commit message follows Conventional Commits.
  • N/A: this batch intentionally keeps the base-header addition and convention-formatting follow-up as two meaningful, squashable commits.
  • N/A: this PR is based on feat/torch-codegen, not master; no master rebase is required for this integration target.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal - nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes are intentional and limited to the UpsampleNearest1d base operator declaration used by torch codegen.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences - capitalized first letter, terminal punctuation - unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • clang-format (version 21, per .github/workflows/clang-format.yml) has been run against all modified .h, .cc, .cuh, and .mlu files; the diff is clean.
  • N/A: clang-tidy was not run because this PR only adds a base declaration header for feat/torch-codegen; no runtime implementation is added.
  • Operator parameter order is inputs first, outputs last; attributes are between inputs and outputs; naming follows PyTorch → ONNX → CUDA API precedence (CONTRIBUTING.md §C++).
  • N/A: this base declaration does not add C++ error paths or exceptions.
  • N/A: this base declaration does not add error or warning messages.
  • N/A: this base declaration does not add kernel files.
  • N/A: this base declaration does not add kernel launchers.
  • Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
  • Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
  • Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).
  • N/A: this PR adds only src/base/upsample_nearest1d.h for torch codegen reuse; platform implementations are out of scope.
  • No raw new/delete; RAII / smart pointers / existing allocators are used.

Python Specific (if Python files changed)

N/A: no Python files changed.

Testing

  • N/A: platform pytest was intentionally not run because this PR targets feat/torch-codegen, not master, and only adds a reusable base header declaration.
  • N/A: the table above records the reason platform testing was skipped.
  • N/A: no runtime functionality was added, so no new tests/ coverage is required.
  • N/A: no new pytest parameterization was added.
  • N/A: no Payload-returning test was added.
  • N/A: no dtype / device parameterization was added.
  • N/A: no flaky test was added.
  • N/A: this is not a runtime bug fix.

Build, CI, and Tooling

  • N/A: full platform builds were not run because this PR targets feat/torch-codegen, not master, and only adds a reusable base header declaration.
  • N/A: compile_commands.json behavior was not changed.
  • N/A: no new backend or device was added.
  • N/A: CUDA-like backend mutual exclusion was not changed.
  • Existing CI formatting expectations are preserved; original validation reported clang-format 21 passing on src/base/upsample_nearest1d.h.
  • N/A: no new runtime dependency was added.

Documentation

  • N/A: README.md, CONTRIBUTING.md, and developer workflow are unchanged.
  • N/A: UpsampleNearest1d is an internal base declaration for torch codegen reuse; no user-facing documentation is required.
  • N/A: no user-visible breaking change.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • N/A: no third-party code was added.
  • N/A: no unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

@voltjia voltjia force-pushed the codex/add-upsample_nearest1d-base branch 2 times, most recently from 1fde1a9 to 3bd7dae Compare May 7, 2026 09:59
@voltjia voltjia changed the title feat: add UpsampleNearest1d base feat: add upsample_nearest1d base May 7, 2026

@wooway777 wooway777 left a comment

torch has no `nearest1d`

voltjia added 12 commits May 9, 2026 13:24
The torch op codegen script imports `yaml` to parse `scripts/torch_ops.yaml`
and PyTorch's `native_functions.yaml`. Since CMake invokes the script at
configure time, PyYAML must be available in the build environment.
Frees the `infini::ops::Sigmoid` name for the auto-generated PyTorch
operator class emitted by the upcoming `scripts/generate_torch_ops.py`.
Adds two pieces used by the upcoming pybind bindings for auto-generated
torch ops:

  - `detail::ListContains` and an early-out in
    `Operator::active_implementation_indices` so querying impls for a
    device the op does not support returns an empty vector instead of
    crashing in `DispatchFunc`.

  - `TryDeviceTypeFromString` returning `std::optional<Device::Type>`,
    so generated bindings can resolve a device name without aborting on
    unrecognized inputs.
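The early-out lookup can be sketched in Python (the device table and values here are illustrative; the actual helper is a C++ function returning `std::optional<Device::Type>`):

```python
from typing import Optional

# Illustrative device table; the real helper maps names onto `Device::Type`.
_DEVICE_TYPES = {
    "cpu": 0,
    "nvidia": 1,
    "iluvatar": 2,
}

def try_device_type_from_string(name: str) -> Optional[int]:
    """Return the device enum value, or None for an unrecognized name
    instead of aborting (mirroring `std::optional<Device::Type>`)."""
    return _DEVICE_TYPES.get(name.lower())
```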
For each entry in `scripts/torch_ops.yaml`, the script finds the
matching `.out` variant in PyTorch's `native_functions.yaml` (fetched
from GitHub on first invocation, cached under `generated/.cache/`),
parses its schema, and emits an InfiniOps base class plus a PyTorch
backend specialization at slot 8 that wraps `at::<op>_out`.

Key strategies:

  - Overload-aware lookup: prefers `<name>.out` then any
    `<name>.<overload>_out`, picking the variant with the most tensor
    inputs (so `pow.Tensor_Tensor_out` wins over `pow.Tensor_Scalar_out`).

  - Hidden-parameter pattern: optional types (`Scalar?`, `int[]?`,
    `ScalarType?`, `Generator?`, …), `bool` defaults, numeric
    `int`/`float` defaults, `int[N]=[]` defaults, and ATen enum
    symbols (`Mean`, `Sum`) are filtered from the user-facing API
    and substituted at the ATen call site.  Unlocks reductions, scans,
    comparisons, losses, and multi-scalar activations from a single
    mechanism.

  - Slot 8: reserved for PyTorch backends; native and vendor
    implementations use 0–7.  Also avoids a partial-specialization-after-
    instantiation conflict with `Operator<Op>` at index 0.

  - Hand-written-base coexistence: if `src/base/<op>.h` exists, the
    generator skips emitting `generated/base/<op>.h` so the
    hand-written one wins.  Ops whose pre-existing hand-written base
    has a different parameter shape (`add`, `linear`, `matmul`,
    `mul`) are kept out of the YAML; including them would cause the
    generated torch override to mismatch the hand-written base.

  - Per-op metadata (`generated/torch_ops_metadata.json`): records the
    full parameter list per op for the test harness, so adding a new op
    to the allowlist requires no code changes.
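The overload-aware lookup above can be sketched as follows, assuming overload names map to their tensor-input counts (the function name and data shape are illustrative, not the script's actual internals):

```python
def pick_out_variant(name, overloads):
    """Pick the `.out` variant for `name` from a mapping of
    `"name.overload" -> tensor-input count`: prefer the plain
    `<name>.out`, then the `_out` overload with the most tensor
    inputs (so `pow.Tensor_Tensor_out` beats `pow.Tensor_Scalar_out`)."""
    plain = f"{name}.out"
    if plain in overloads:
        return plain
    candidates = [
        k for k in overloads
        if k.startswith(f"{name}.") and k.endswith("_out")
    ]
    if not candidates:
        return None
    # Tie-break on the overload name for determinism.
    return max(candidates, key=lambda k: (overloads[k], k))
```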
When `WITH_TORCH=ON`, run `scripts/generate_torch_ops.py` at configure
time and add the generated tree to the torch source glob and include
path.  Vendor compilers (`mxcc`/`mcc`) get the same include via the
system-`g++` torch recompile loop.  When Python bindings are enabled,
also install `generated/torch_ops_metadata.json` so the torch-op test
can discover the generated catalog at runtime.
Three changes that let `generate_wrappers.py` see the codegen output:

  - `_find_base_header` resolves an op's base in `src/base/` first,
    then `generated/base/` — mirroring the C++ include-path order so a
    hand-written base wins.  `_OperatorExtractor`,
    `_find_optional_tensor_params`, and `_find_vector_tensor_params`
    use it; clang's parser also picks up `-I generated` so the include
    in a generated torch source resolves through the parser too.

  - `_get_all_ops` now scans both base directories and both impl roots
    (`src/` and `generated/`), so generated PyTorch backends are
    bound alongside hand-written ones.  `_to_include_path` strips
    either `src/` or `generated/` when emitting legacy-C `#include`
    directives.

  - Active-impl device lookup goes through the new
    `TryDeviceTypeFromString<Self>(device)` helper, returning an empty
    vector for an unknown name instead of aborting.

Also wipes the bindings/src/include output trees at start so files for
ops removed from the active set do not linger and get globbed by the
next build, and pulls `_get_system_include_flags` out as a
module-level `lru_cache` (the `subprocess` probes were the slow
path).
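The `lru_cache` hoist can be sketched like this (the probe command and parsing are a common pattern for extracting a compiler's built-in include directories; the exact flags used by the script are not shown in this PR):

```python
import functools
import subprocess

@functools.lru_cache(maxsize=None)
def get_system_include_flags(compiler="g++"):
    """Probe the compiler once for its built-in include search path;
    `lru_cache` makes every repeated call free, since the subprocess
    probe was the slow path."""
    try:
        stderr = subprocess.run(
            [compiler, "-E", "-x", "c++", "-", "-v"],
            input="", capture_output=True, text=True, check=False,
        ).stderr
    except FileNotFoundError:
        return ()
    lines = stderr.splitlines()
    try:
        start = lines.index("#include <...> search starts here:") + 1
        end = lines.index("End of search list.")
    except ValueError:
        return ()
    return tuple("-I" + line.strip() for line in lines[start:end])
```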
Tensor parameters bind to `py::object`, which accepts any Python value
and only rejects inside `TensorFromPybind11Handle` at runtime.  When
a class has both scalar and Tensor overloads of `__call__` or its
constructor (e.g. `pow.Tensor_Tensor_out` vs `pow.Tensor_Scalar_out`),
pybind's overload resolver tries them in registration order, so the
`Tensor` signature swallows scalar calls if it sits first and the call
aborts inside the conversion.

`_overload_order_key` sorts by (object-like-arg count ascending, total
arg count descending), so the most-specific signature is registered
first and pybind walks toward more permissive ones only on a real
type-mismatch.  While here, rename the `__call__` lambda's first
parameter from `self` to `op` so it does not collide with ATen ops
that take a parameter literally named `self`.
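The ordering rule can be sketched as a sort key over parameter-type lists (a simplified model of `_overload_order_key`; the real key operates on the generator's parameter records):

```python
def overload_order_key(params):
    """Sort key for pybind registration order: fewer object-like
    (Tensor-bound) parameters first, then more total parameters
    first, so the most-specific signature is registered before the
    permissive `py::object`-heavy ones."""
    object_like = sum(1 for p in params if p == "Tensor")
    return (object_like, -len(params))
```

Sorting ascending by this key puts `(Tensor, Scalar)` ahead of `(Tensor, Tensor)`, so a scalar call no longer gets swallowed by the Tensor overload's `py::object` parameter.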
A single parametrized `test_op` reads `generated/torch_ops_metadata.json`
(installed alongside the bindings, with a fallback to the source-tree
copy), synthesises inputs by parameter type, calls the InfiniOps
wrapper at slot 8, and compares each output tensor against `torch.<op>`
or its `torch.special` / `torch.nn.functional` counterpart.  Adding
an op to `scripts/torch_ops.yaml` extends coverage with no test
changes.

Skip-lists narrow the harness around known harness limitations: vendor
kernels that lack a given (op, dtype, device) combination, random ops
whose RNG state diverges from a fresh torch reference, low-precision
reductions where the functional and `_out` paths diverge, ops that
fire CUDA device-side asserts on random inputs, and ops whose inputs
or outputs use dtypes outside the InfiniOps `DataType` enum.

`tests/conftest.py` now compares non-floating outputs with
`torch.equal` (since `torch.allclose` rejects `bool`) and passes
`equal_nan=True` for floats so symmetric NaNs (common for special
functions fed out-of-domain inputs) do not fail the test.
Reviewers consistently flagged class names like `xlogy_outtensor`,
`triangular_solve_x`, `*_grad_input`, `*_forward_output`,
`*_n_scalar`, `*_dim_values`, `*_values_stable` etc. as bad
public-API naming — the suffix is just an ATen schema artifact and
carries no semantic info.

Use only the canonical `aten_name` for the InfiniOps class; multiple
ATen overloads of the same base op (e.g. `scatter.src`,
`scatter.value`, `scatter.reduce`) become overloaded `operator()`
methods on a single `Scatter` class, with tensor metadata members
shared across overloads.  Overloads that collapse to identical
visible C++ signatures after hidden defaults are still deduped by
`_dedupe_visible_overloads`.

The test harness's parametrize-id falls back to `overload_name` so
pytest does not collide ids between overloads.
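The dedup step can be sketched as follows (a simplified stand-in for `_dedupe_visible_overloads`, keyed on the visible parameter list after hidden defaults are filtered):

```python
def dedupe_visible_overloads(overloads):
    """Collapse overloads whose visible C++ signature is identical
    after hidden default-valued parameters are filtered out; the
    first occurrence wins."""
    seen = set()
    unique = []
    for name, visible_params in overloads:
        sig = tuple(visible_params)
        if sig not in seen:
            seen.add(sig)
            unique.append((name, visible_params))
    return unique
```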
Reviewers flagged on multiple PRs that scalar parameters such as `n`
on `special_chebyshev_polynomial_v` were declared in the
constructor but never stored on the class — leaving the backend with
no way to read them outside of `operator()`.  Add a `<type>
<name>_;` member for every visible non-tensor parameter, initialized
from the matching constructor argument.

Same-named scalars across overloads must agree on type; if a later
overload disagrees, that overload's value is left default-constructed
rather than emitting a conflicting member.  Tensor metadata members
(`<name>_shape_`, `_strides_`, `_type_`) keep their existing
union-across-overloads behaviour.
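The member-emission rule, including the first-type-wins conflict handling, can be sketched like this (the function name and parameter representation are illustrative):

```python
def collect_scalar_members(overloads):
    """Emit one `<type> <name>_;` member per visible non-tensor
    parameter. If a later overload declares the same name with a
    different type, the first type is kept (that overload's value is
    left default-constructed rather than emitting a conflicting
    member)."""
    members = {}
    for params in overloads:
        for name, cpp_type in params:
            if cpp_type == "Tensor":
                continue
            members.setdefault(name, cpp_type)
    return [f"{t} {n}_;" for n, t in members.items()]
```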
Reviewers consistently flagged on multiple PRs that semantically
critical default-valued parameters were being hidden by the codegen:

  - `bool upper`, `bool transpose`, `bool unitriangular` on
    `triangular_solve` (PR #580)
  - `int diagonal` on `triu` (PR #509)
  - `int n` on the `special_chebyshev_polynomial_*` family
  - `str ord` on `linalg_matrix_norm` (PR #280)
  - `int[N]` dims with `[]` defaults on reductions

These were hidden because they have a default in ATen's schema, but
defaults do not equal "optional to expose".  Stop hiding non-optional
default-valued params; they are now visible in the generated
`operator()` signatures and forwarded to ATen.

Optional ATen types (`Tensor?`, `Scalar?`, `int?`, …) remain hidden
for now — exposing them properly requires threading `std::optional`
through to ATen, which is a larger refactor and tracked separately.
…tion

libclang silently reports the type of `std::vector<int64_t>` parameters
as `int` on systems where the STL headers are not fully indexable
(observed under the NVIDIA build's libclang).  The fallback type then
leaks into the generated binding as `const int padding` instead of
`const std::vector<int64_t> padding`, and the binding's call to the
base operator fails to compile with a long instantiation trace at
`Operator::operator()` for any op with `int[N]` schema parameters
(im2col, col2im, reflection_pad*, replication_pad*, fft_*, upsample_*,
nuclear_norm, …).

Adopt the same regex-scan workaround already used for
`std::optional<Tensor>` and `std::vector<Tensor>` parameters: scan
the base header text for `std::vector<int64_t> <name>` declarations and
emit the binding parameter with that exact type, bypassing libclang's
inferred spelling.
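The regex-scan workaround amounts to a text-level lookup over the base header (a minimal sketch; the generator's actual scan may handle more declaration forms):

```python
import re

def find_int_vector_params(header_text):
    """Scan a base header for `std::vector<int64_t> <name>`
    declarations so the binding generator can emit that exact type,
    bypassing libclang's fallback spelling of `int`."""
    pattern = re.compile(r"std::vector<int64_t>\s+(\w+)")
    return set(pattern.findall(header_text))
```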
@voltjia voltjia force-pushed the feat/torch-codegen branch from dc3b3b0 to 156e83f Compare May 9, 2026 07:32
The wrapper generator picked up `generated/base/<op>.h` headers
unconditionally whenever the directory existed.  When a CI container
inherits a `generated/` tree via rsync but configures with
`WITH_TORCH=OFF` (so the codegen never re-runs and the matching
torch sources never compile), the generated bindings reference base
headers that are not on the include path of any compiled target —
`ops.cc` then fails with "fatal error: base/<op>.h: No such file or
directory".

Skip the `generated/base/` scan unless `--with-torch` is in effect,
mirroring the existing gate on `generated/torch/`.
@voltjia voltjia force-pushed the codex/add-upsample_nearest1d-base branch from 3bd7dae to dc2cef8 Compare May 9, 2026 08:14
ATen names the first tensor parameter `self` to mirror the
method-style invocation `tensor.abs()`.  InfiniOps' hand-written
bases (`Add`, `Gemm`, …) use `input` for the primary tensor input,
matching `CONTRIBUTING.md` §C++'s preference for PyTorch user-facing
naming conventions over PyTorch internal C++ names.

Rename `self` → `input` at parse time so generated headers stay
consistent with hand-written ones.
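At parse time the rename is a simple substitution over the parameter list (an illustrative sketch, not the script's exact code):

```python
def rename_self_parameter(params):
    """Rename ATen's method-style `self` parameter to `input` so
    generated headers match the hand-written bases."""
    return ["input" if p == "self" else p for p in params]
```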
@voltjia voltjia force-pushed the codex/add-upsample_nearest1d-base branch from dc2cef8 to e8b6277 Compare May 9, 2026 08:55
voltjia added 6 commits May 9, 2026 17:03
The generated torch source instantiated all 10 `Operator<Op, kDev,
8>` device specializations unconditionally.  Each instantiation pulls
in a deep ATen template tree that costs roughly 0.5-1 GB of RSS
during compilation; when the build compiles 451 ops in parallel
(scikit-build's default ninja `-j$(nproc)`), peak memory exceeds
what some CI containers can spare, and `cc1plus` is killed by the
OOM killer.

Guard each explicit instantiation with `#ifdef WITH_<DEV>`.  Each
`WITH_<DEV>` macro is set by `target_compile_definitions` (or, for
`WITH_METAX` / `WITH_MOORE` / `WITH_CPU`, added to the vendor
recompile loop's command line, since those sources are compiled
outside the cmake target with the system C++ compiler).  A typical
NVIDIA-only build now instantiates only `kCpu` + `kNvidia`, cutting
template instantiation work to 2 / 10.
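On the codegen side, the guarded emission looks roughly like this (the device list is an illustrative subset of the 10 slots, and the emitted C++ is a sketch of the pattern, not the generator's exact output):

```python
DEVICES = ["CPU", "NVIDIA", "ILUVATAR", "METAX", "CAMBRICON", "MOORE", "ASCEND"]

def emit_guarded_instantiations(op):
    """Wrap each explicit slot-8 instantiation in `#ifdef WITH_<DEV>`
    so only the devices a build enables pay the ATen template cost."""
    lines = []
    for dev in DEVICES:
        lines.append(f"#ifdef WITH_{dev}")
        lines.append(f"template class Operator<{op}, Device::Type::k{dev.capitalize()}, 8>;")
        lines.append("#endif")
    return "\n".join(lines)
```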
The hand-written bases that get added via review (`src/base/<op>.h`)
do not carry an `AUTO-GENERATED` header.  Generated and reviewed
files end up with the same content otherwise — the marker becomes
the only visible difference and produces churn during the
`generated/` ↔ `src/base/` migration.  Drop the marker so a
hand-written base is byte-for-byte the same as the generated one.
Some generated signatures (e.g. `Xlogy::operator()(const Tensor input,
const Tensor other, Tensor out)` at 89 columns) overflow the 80-column
limit enforced by `.clang-format` and CI's `clang-format-action@v4`
running `clang-format` v21.  The codegen previously emitted them as
single lines, so every base PR ran into the same line-length violation
once the workflow re-ran.

Pipe each emitted header / source through the local `clang-format`
(passing `--assume-filename=<path>` so the include-order rule treats
each `.cc`'s own header as the primary include).  Adds ~30s to a
full regeneration but eliminates the recurring CI failure across
433+ PR branches.
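The formatting pipe can be sketched as a subprocess call with `--assume-filename` (function names here are illustrative; the fallback-to-unformatted behavior is an assumption of this sketch, not necessarily what the script does on a missing binary):

```python
import subprocess

def clang_format_args(path, binary="clang-format"):
    """Build the command line; `--assume-filename` makes the
    include-order rule treat each `.cc`'s own header as the primary
    include even though the text arrives on stdin."""
    return [binary, f"--assume-filename={path}"]

def format_source(text, path, binary="clang-format"):
    """Pipe generated text through the local clang-format, falling
    back to the unformatted text if the binary is unavailable."""
    try:
        proc = subprocess.run(
            clang_format_args(path, binary), input=text,
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return text
    return proc.stdout
```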
The previous fix landed on a slightly older `ruff` version that
preferred a multi-line `base_path.write_text(\n  ...\n)` form; CI
runs the latest `ruff format --check` which collapses the line.
Reformatted to match upstream.
Each generated `<op>.cc` instantiates `at::<op>_out(...)`, which
expands roughly 0.5-1 GB of ATen template metaprogramming.  With 451
ops compiled in parallel at Ninja's default `-j$(nproc)`, peak
memory can exceed 30 GB and the OOM killer drops `cc1plus` on
build hosts that allocate less RAM (observed on metax, moore, and
cambricon CI containers).

Add a Ninja job pool `torch_compile=4` and apply it to:
  - the vendor-system-g++ `add_custom_command` recompile loop
    (metax / moore), via `JOB_POOL`;
  - a new `infiniops_torch_objs` OBJECT library for the regular
    cmake build path (cambricon / nvidia / iluvatar), via
    `JOB_POOL_COMPILE`.

The rest of the build keeps full parallelism.
@voltjia voltjia force-pushed the codex/add-upsample_nearest1d-base branch from e8b6277 to 4988e19 Compare May 9, 2026 10:10
@voltjia voltjia marked this pull request as ready for review May 13, 2026 03:13
@voltjia voltjia changed the base branch from feat/torch-codegen to feat/torch-operator-bases May 13, 2026 03:13