
feat: add upsample_nearest1d base #515

Open

voltjia wants to merge 20 commits into feat/torch-operator-bases from codex/add-upsample_nearest1d-base

feat: add upsample_nearest1d base#515
voltjia wants to merge 20 commits into
feat/torch-operator-basesfrom
codex/add-upsample_nearest1d-base

Conversation

Collaborator

@voltjia voltjia commented May 5, 2026

Summary

  • Add the hand-written InfiniOps base class for upsample_nearest1d in src/base/upsample_nearest1d.h.
  • Let the torch code generator reuse src/base/upsample_nearest1d.h instead of emitting generated/base/upsample_nearest1d.h.
  • Apply the base class member-spacing convention required by scripts/check_conventions.py.

Motivation

This PR is part of the feat/torch-codegen base-header migration. The generated UpsampleNearest1d base declaration is moved into src/base so code generation can reuse a reviewed hand-written header.

N/A: no linked issue.

Type of Change

  • feat - new feature / new operator / new platform
  • fix - bug fix
  • perf - performance improvement (no behavioral change)
  • refactor - code restructuring without behavior change
  • test - adding or fixing tests only
  • docs - documentation only
  • build / ci - build system or CI configuration
  • chore - tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| Iluvatar | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| MetaX | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| Cambricon | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| Moore | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
| Ascend | N/A | Not run | Not required for this non-master feat/torch-codegen base-header PR; no runtime implementation is added. |
Full `pytest` output (optional)
N/A: pytest was intentionally not run because this PR targets `feat/torch-codegen`, not `master`, and only adds a reusable base header declaration.

Benchmark / Performance Impact

N/A. This PR only adds a base operator declaration for torch codegen reuse and does not add a runtime implementation.

Notes for Reviewers

  • This PR targets feat/torch-codegen, not master.
  • The branch diff against feat/torch-codegen contains only src/base/upsample_nearest1d.h.
  • Original branch validation reported clang-format 21 passing on src/base/upsample_nearest1d.h; the follow-up formatting commit applies the class member spacing required by scripts/check_conventions.py.

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • N/A: this automated batch uses existing codex/add-upsample_nearest1d-base PR branches targeting feat/torch-codegen; branch renaming is intentionally out of scope.
  • Each commit message follows Conventional Commits.
  • N/A: this batch intentionally keeps the base-header addition and convention-formatting follow-up as two meaningful, squashable commits.
  • N/A: this PR is based on feat/torch-codegen, not master; no master rebase is required for this integration target.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal - nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes are intentional and limited to the UpsampleNearest1d base operator declaration used by torch codegen.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences - capitalized first letter, terminal punctuation - unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • clang-format (version 21, per .github/workflows/clang-format.yml) has been run against all modified .h, .cc, .cuh, and .mlu files; the diff is clean.
  • N/A: clang-tidy was not run because this PR only adds a base declaration header for feat/torch-codegen; no runtime implementation is added.
  • Operator parameter order is inputs first, outputs last; attributes are between inputs and outputs; naming follows PyTorch → ONNX → CUDA API precedence (CONTRIBUTING.md §C++).
  • N/A: this base declaration does not add C++ error paths or exceptions.
  • N/A: this base declaration does not add error or warning messages.
  • N/A: this base declaration does not add kernel files.
  • N/A: this base declaration does not add kernel launchers.
  • Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • Exactly one blank line between classes, between classes and functions, and between functions (CONTRIBUTING.md §C++).
  • Exactly one blank line between members (functions and variables) within a class (CONTRIBUTING.md §C++).
  • Exactly one blank line before and after the contents of a namespace (CONTRIBUTING.md §C++).
  • N/A: this PR adds only src/base/upsample_nearest1d.h for torch codegen reuse; platform implementations are out of scope.
  • No raw new/delete; RAII / smart pointers / existing allocators are used.

Python Specific (if Python files changed)

N/A: no Python files changed.

Testing

  • N/A: platform pytest was intentionally not run because this PR targets feat/torch-codegen, not master, and only adds a reusable base header declaration.
  • N/A: the table above records the reason platform testing was skipped.
  • N/A: no runtime functionality was added, so no new tests/ coverage is required.
  • N/A: no new pytest parameterization was added.
  • N/A: no Payload-returning test was added.
  • N/A: no dtype / device parameterization was added.
  • N/A: no flaky test was added.
  • N/A: this is not a runtime bug fix.

Build, CI, and Tooling

  • N/A: full platform builds were not run because this PR targets feat/torch-codegen, not master, and only adds a reusable base header declaration.
  • N/A: compile_commands.json behavior was not changed.
  • N/A: no new backend or device was added.
  • N/A: CUDA-like backend mutual exclusion was not changed.
  • Existing CI formatting expectations are preserved; original validation reported clang-format 21 passing on src/base/upsample_nearest1d.h.
  • N/A: no new runtime dependency was added.

Documentation

  • N/A: README.md, CONTRIBUTING.md, and developer workflow are unchanged.
  • N/A: UpsampleNearest1d is an internal base declaration for torch codegen reuse; no user-facing documentation is required.
  • N/A: no user-visible breaking change.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • N/A: no third-party code was added.
  • N/A: no unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

@voltjia voltjia force-pushed the codex/add-upsample_nearest1d-base branch 2 times, most recently from 1fde1a9 to 3bd7dae Compare May 7, 2026 09:59
@voltjia voltjia changed the title feat: add UpsampleNearest1d base feat: add upsample_nearest1d base May 7, 2026

@wooway777 wooway777 left a comment

torch has no `nearest1d`

voltjia added 12 commits May 9, 2026 13:24
The torch op codegen script imports `yaml` to parse `scripts/torch_ops.yaml`
and PyTorch's `native_functions.yaml`. Since CMake invokes the script at
configure time, PyYAML must be available in the build environment.
Frees the `infini::ops::Sigmoid` name for the auto-generated PyTorch
operator class emitted by the upcoming `scripts/generate_torch_ops.py`.
Adds two pieces used by the upcoming pybind bindings for auto-generated
torch ops:

  - `detail::ListContains` and an early-out in
    `Operator::active_implementation_indices` so querying impls for a
    device the op does not support returns an empty vector instead of
    crashing in `DispatchFunc`.

  - `TryDeviceTypeFromString` returning `std::optional<Device::Type>`,
    so generated bindings can resolve a device name without aborting on
    unrecognized inputs.
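The early-out lookup can be sketched in Python (the device table and values here are illustrative; the actual helper is a C++ function returning `std::optional<Device::Type>`):

```python
from typing import Optional

# Illustrative device table; the real helper maps names onto `Device::Type`.
_DEVICE_TYPES = {
    "cpu": 0,
    "nvidia": 1,
    "iluvatar": 2,
}

def try_device_type_from_string(name: str) -> Optional[int]:
    """Return the device enum value, or None for an unrecognized name
    instead of aborting (mirroring `std::optional<Device::Type>`)."""
    return _DEVICE_TYPES.get(name.lower())
```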
For each entry in `scripts/torch_ops.yaml`, the script finds the
matching `.out` variant in PyTorch's `native_functions.yaml` (fetched
from GitHub on first invocation, cached under `generated/.cache/`),
parses its schema, and emits an InfiniOps base class plus a PyTorch
backend specialization at slot 8 that wraps `at::<op>_out`.

Key strategies:

  - Overload-aware lookup: prefers `<name>.out` then any
    `<name>.<overload>_out`, picking the variant with the most tensor
    inputs (so `pow.Tensor_Tensor_out` wins over `pow.Tensor_Scalar_out`).

  - Hidden-parameter pattern: optional types (`Scalar?`, `int[]?`,
    `ScalarType?`, `Generator?`, …), `bool` defaults, numeric
    `int`/`float` defaults, `int[N]=[]` defaults, and ATen enum
    symbols (`Mean`, `Sum`) are filtered from the user-facing API
    and substituted at the ATen call site.  Unlocks reductions, scans,
    comparisons, losses, and multi-scalar activations from a single
    mechanism.

  - Slot 8: reserved for PyTorch backends; native and vendor
    implementations use 0–7.  Also avoids a partial-specialization-after-
    instantiation conflict with `Operator<Op>` at index 0.

  - Hand-written-base coexistence: if `src/base/<op>.h` exists, the
    generator skips emitting `generated/base/<op>.h` so the
    hand-written one wins.  Ops whose pre-existing hand-written base
    has a different parameter shape (`add`, `linear`, `matmul`,
    `mul`) are kept out of the YAML; including them would cause the
    generated torch override to mismatch the hand-written base.

  - Per-op metadata (`generated/torch_ops_metadata.json`): records the
    full parameter list per op for the test harness, so adding a new op
    to the allowlist requires no code changes.
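The overload-aware lookup above can be sketched as follows, assuming overload names map to their tensor-input counts (the function name and data shape are illustrative, not the script's actual internals):

```python
def pick_out_variant(name, overloads):
    """Pick the `.out` variant for `name` from a mapping of
    `"name.overload" -> tensor-input count`: prefer the plain
    `<name>.out`, then the `_out` overload with the most tensor
    inputs (so `pow.Tensor_Tensor_out` beats `pow.Tensor_Scalar_out`)."""
    plain = f"{name}.out"
    if plain in overloads:
        return plain
    candidates = [
        k for k in overloads
        if k.startswith(f"{name}.") and k.endswith("_out")
    ]
    if not candidates:
        return None
    # Tie-break on the overload name for determinism.
    return max(candidates, key=lambda k: (overloads[k], k))
```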
When `WITH_TORCH=ON`, run `scripts/generate_torch_ops.py` at configure
time and add the generated tree to the torch source glob and include
path.  Vendor compilers (`mxcc`/`mcc`) get the same include via the
system-`g++` torch recompile loop.  When Python bindings are enabled,
also install `generated/torch_ops_metadata.json` so the torch-op test
can discover the generated catalog at runtime.
Three changes that let `generate_wrappers.py` see the codegen output:

  - `_find_base_header` resolves an op's base in `src/base/` first,
    then `generated/base/` — mirroring the C++ include-path order so a
    hand-written base wins.  `_OperatorExtractor`,
    `_find_optional_tensor_params`, and `_find_vector_tensor_params`
    use it; clang's parser also picks up `-I generated` so the include
    in a generated torch source resolves through the parser too.

  - `_get_all_ops` now scans both base directories and both impl roots
    (`src/` and `generated/`), so generated PyTorch backends are
    bound alongside hand-written ones.  `_to_include_path` strips
    either `src/` or `generated/` when emitting legacy-C `#include`
    directives.

  - Active-impl device lookup goes through the new
    `TryDeviceTypeFromString<Self>(device)` helper, returning an empty
    vector for an unknown name instead of aborting.

Also wipes the bindings/src/include output trees at start so files for
ops removed from the active set do not linger and get globbed by the
next build, and pulls `_get_system_include_flags` out as a
module-level `lru_cache` (the `subprocess` probes were the slow
path).
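The `lru_cache` hoist can be sketched like this (the probe command and parsing are a common pattern for extracting a compiler's built-in include directories; the exact flags used by the script are not shown in this PR):

```python
import functools
import subprocess

@functools.lru_cache(maxsize=None)
def get_system_include_flags(compiler="g++"):
    """Probe the compiler once for its built-in include search path;
    `lru_cache` makes every repeated call free, since the subprocess
    probe was the slow path."""
    try:
        stderr = subprocess.run(
            [compiler, "-E", "-x", "c++", "-", "-v"],
            input="", capture_output=True, text=True, check=False,
        ).stderr
    except FileNotFoundError:
        return ()
    lines = stderr.splitlines()
    try:
        start = lines.index("#include <...> search starts here:") + 1
        end = lines.index("End of search list.")
    except ValueError:
        return ()
    return tuple("-I" + line.strip() for line in lines[start:end])
```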
Tensor parameters bind to `py::object`, which accepts any Python value
and only rejects inside `TensorFromPybind11Handle` at runtime.  When
a class has both scalar and Tensor overloads of `__call__` or its
constructor (e.g. `pow.Tensor_Tensor_out` vs `pow.Tensor_Scalar_out`),
pybind's overload resolver tries them in registration order, so the
`Tensor` signature swallows scalar calls if it sits first and the call
aborts inside the conversion.

`_overload_order_key` sorts by (object-like-arg count ascending, total
arg count descending), so the most-specific signature is registered
first and pybind walks toward more permissive ones only on a real
type-mismatch.  While here, rename the `__call__` lambda's first
parameter from `self` to `op` so it does not collide with ATen ops
that take a parameter literally named `self`.
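The ordering rule can be sketched as a sort key over parameter-type lists (a simplified model of `_overload_order_key`; the real key operates on the generator's parameter records):

```python
def overload_order_key(params):
    """Sort key for pybind registration order: fewer object-like
    (Tensor-bound) parameters first, then more total parameters
    first, so the most-specific signature is registered before the
    permissive `py::object`-heavy ones."""
    object_like = sum(1 for p in params if p == "Tensor")
    return (object_like, -len(params))
```

Sorting ascending by this key puts `(Tensor, Scalar)` ahead of `(Tensor, Tensor)`, so a scalar call no longer gets swallowed by the Tensor overload's `py::object` parameter.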
A single parametrized `test_op` reads `generated/torch_ops_metadata.json`
(installed alongside the bindings, with a fallback to the source-tree
copy), synthesises inputs by parameter type, calls the InfiniOps
wrapper at slot 8, and compares each output tensor against `torch.<op>`
or its `torch.special` / `torch.nn.functional` counterpart.  Adding
an op to `scripts/torch_ops.yaml` extends coverage with no test
changes.

Skip-lists narrow the harness around known harness limitations: vendor
kernels that lack a given (op, dtype, device) combination, random ops
whose RNG state diverges from a fresh torch reference, low-precision
reductions where the functional and `_out` paths diverge, ops that
fire CUDA device-side asserts on random inputs, and ops whose inputs
or outputs use dtypes outside the InfiniOps `DataType` enum.

`tests/conftest.py` now compares non-floating outputs with
`torch.equal` (since `torch.allclose` rejects `bool`) and passes
`equal_nan=True` for floats so symmetric NaNs (common for special
functions fed out-of-domain inputs) do not fail the test.
Reviewers consistently flagged class names like `xlogy_outtensor`,
`triangular_solve_x`, `*_grad_input`, `*_forward_output`,
`*_n_scalar`, `*_dim_values`, `*_values_stable` etc. as bad
public-API naming — the suffix is just an ATen schema artifact and
carries no semantic info.

Use only the canonical `aten_name` for the InfiniOps class; multiple
ATen overloads of the same base op (e.g. `scatter.src`,
`scatter.value`, `scatter.reduce`) become overloaded `operator()`
methods on a single `Scatter` class, with tensor metadata members
shared across overloads.  Overloads that collapse to identical
visible C++ signatures after hidden defaults are still deduped by
`_dedupe_visible_overloads`.

The test harness's parametrize-id falls back to `overload_name` so
pytest does not collide ids between overloads.
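The dedup step can be sketched as follows (a simplified stand-in for `_dedupe_visible_overloads`, keyed on the visible parameter list after hidden defaults are filtered):

```python
def dedupe_visible_overloads(overloads):
    """Collapse overloads whose visible C++ signature is identical
    after hidden default-valued parameters are filtered out; the
    first occurrence wins."""
    seen = set()
    unique = []
    for name, visible_params in overloads:
        sig = tuple(visible_params)
        if sig not in seen:
            seen.add(sig)
            unique.append((name, visible_params))
    return unique
```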
Reviewers flagged on multiple PRs that scalar parameters such as `n`
on `special_chebyshev_polynomial_v` were declared in the
constructor but never stored on the class — leaving the backend with
no way to read them outside of `operator()`.  Add a `<type>
<name>_;` member for every visible non-tensor parameter, initialized
from the matching constructor argument.

Same-named scalars across overloads must agree on type; if a later
overload disagrees, that overload's value is left default-constructed
rather than emitting a conflicting member.  Tensor metadata members
(`<name>_shape_`, `_strides_`, `_type_`) keep their existing
union-across-overloads behaviour.
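The member-emission rule, including the first-type-wins conflict handling, can be sketched like this (the function name and parameter representation are illustrative):

```python
def collect_scalar_members(overloads):
    """Emit one `<type> <name>_;` member per visible non-tensor
    parameter. If a later overload declares the same name with a
    different type, the first type is kept (that overload's value is
    left default-constructed rather than emitting a conflicting
    member)."""
    members = {}
    for params in overloads:
        for name, cpp_type in params:
            if cpp_type == "Tensor":
                continue
            members.setdefault(name, cpp_type)
    return [f"{t} {n}_;" for n, t in members.items()]
```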
Reviewers consistently flagged on multiple PRs that semantically
critical default-valued parameters were being hidden by the codegen:

  - `bool upper`, `bool transpose`, `bool unitriangular` on
    `triangular_solve` (PR #580)
  - `int diagonal` on `triu` (PR #509)
  - `int n` on the `special_chebyshev_polynomial_*` family
  - `str ord` on `linalg_matrix_norm` (PR #280)
  - `int[N]` dims with `[]` defaults on reductions

These were hidden because they have a default in ATen's schema, but
defaults do not equal "optional to expose".  Stop hiding non-optional
default-valued params; they are now visible in the generated
`operator()` signatures and forwarded to ATen.

Optional ATen types (`Tensor?`, `Scalar?`, `int?`, …) remain hidden
for now — exposing them properly requires threading `std::optional`
through to ATen, which is a larger refactor and tracked separately.
…tion

libclang silently reports the type of `std::vector<int64_t>` parameters
as `int` on systems where the STL headers are not fully indexable
(observed under the NVIDIA build's libclang).  The fallback type then
leaks into the generated binding as `const int padding` instead of
`const std::vector<int64_t> padding`, and the binding's call to the
base operator fails to compile with a long instantiation trace at
`Operator::operator()` for any op with `int[N]` schema parameters
(im2col, col2im, reflection_pad*, replication_pad*, fft_*, upsample_*,
nuclear_norm, …).

Adopt the same regex-scan workaround already used for
`std::optional<Tensor>` and `std::vector<Tensor>` parameters: scan
the base header text for `std::vector<int64_t> <name>` declarations and
emit the binding parameter with that exact type, bypassing libclang's
inferred spelling.
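The regex-scan workaround amounts to a text-level lookup over the base header (a minimal sketch; the generator's actual scan may handle more declaration forms):

```python
import re

def find_int_vector_params(header_text):
    """Scan a base header for `std::vector<int64_t> <name>`
    declarations so the binding generator can emit that exact type,
    bypassing libclang's fallback spelling of `int`."""
    pattern = re.compile(r"std::vector<int64_t>\s+(\w+)")
    return set(pattern.findall(header_text))
```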
@voltjia voltjia force-pushed the feat/torch-codegen branch from dc3b3b0 to 156e83f Compare May 9, 2026 07:32
The wrapper generator picked up `generated/base/<op>.h` headers
unconditionally whenever the directory existed.  When a CI container
inherits a `generated/` tree via rsync but configures with
`WITH_TORCH=OFF` (so the codegen never re-runs and the matching
torch sources never compile), the generated bindings reference base
headers that are not on the include path of any compiled target —
`ops.cc` then fails with "fatal error: base/<op>.h: No such file or
directory".

Skip the `generated/base/` scan unless `--with-torch` is in effect,
mirroring the existing gate on `generated/torch/`.
@voltjia voltjia force-pushed the codex/add-upsample_nearest1d-base branch from 3bd7dae to dc2cef8 Compare May 9, 2026 08:14
ATen names the first tensor parameter `self` to mirror the
method-style invocation `tensor.abs()`.  InfiniOps' hand-written
bases (`Add`, `Gemm`, …) use `input` for the primary tensor input,
matching `CONTRIBUTING.md` §C++'s preference for PyTorch user-facing
naming conventions over PyTorch internal C++ names.

Rename `self` → `input` at parse time so generated headers stay
consistent with hand-written ones.
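At parse time the rename is a simple substitution over the parameter list (an illustrative sketch, not the script's exact code):

```python
def rename_self_parameter(params):
    """Rename ATen's method-style `self` parameter to `input` so
    generated headers match the hand-written bases."""
    return ["input" if p == "self" else p for p in params]
```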
@voltjia voltjia force-pushed the codex/add-upsample_nearest1d-base branch from dc2cef8 to e8b6277 Compare May 9, 2026 08:55
voltjia added 6 commits May 9, 2026 17:03
The generated torch source instantiated all 10 `Operator<Op, kDev,
8>` device specializations unconditionally.  Each instantiation pulls
in a deep ATen template tree that costs roughly 0.5-1 GB of RSS
during compilation; when the build compiles 451 ops in parallel
(scikit-build's default ninja `-j$(nproc)`), peak memory exceeds
what some CI containers can spare, and `cc1plus` is killed by the
OOM killer.

Guard each explicit instantiation with `#ifdef WITH_<DEV>`.  Each
`WITH_<DEV>` macro is set by `target_compile_definitions` (or, for
`WITH_METAX` / `WITH_MOORE` / `WITH_CPU`, added to the vendor
recompile loop's command line, since those sources are compiled
outside the cmake target with the system C++ compiler).  A typical
NVIDIA-only build now instantiates only `kCpu` + `kNvidia`, cutting
template instantiation work to 2 / 10.
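On the codegen side, the guarded emission looks roughly like this (the device list is an illustrative subset of the 10 slots, and the emitted C++ is a sketch of the pattern, not the generator's exact output):

```python
DEVICES = ["CPU", "NVIDIA", "ILUVATAR", "METAX", "CAMBRICON", "MOORE", "ASCEND"]

def emit_guarded_instantiations(op):
    """Wrap each explicit slot-8 instantiation in `#ifdef WITH_<DEV>`
    so only the devices a build enables pay the ATen template cost."""
    lines = []
    for dev in DEVICES:
        lines.append(f"#ifdef WITH_{dev}")
        lines.append(f"template class Operator<{op}, Device::Type::k{dev.capitalize()}, 8>;")
        lines.append("#endif")
    return "\n".join(lines)
```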
The hand-written bases that get added via review (`src/base/<op>.h`)
do not carry an `AUTO-GENERATED` header.  Generated and reviewed
files end up with the same content otherwise — the marker becomes
the only visible difference and produces churn during the
`generated/` ↔ `src/base/` migration.  Drop the marker so a
hand-written base is byte-for-byte the same as the generated one.
Some generated signatures (e.g. `Xlogy::operator()(const Tensor input,
const Tensor other, Tensor out)` at 89 columns) overflow the 80-column
limit enforced by `.clang-format` and CI's `clang-format-action@v4`
running `clang-format` v21.  The codegen previously emitted them as
single lines, so every base PR ran into the same line-length violation
once the workflow re-ran.

Pipe each emitted header / source through the local `clang-format`
(passing `--assume-filename=<path>` so the include-order rule treats
each `.cc`'s own header as the primary include).  Adds ~30s to a
full regeneration but eliminates the recurring CI failure across
433+ PR branches.
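The formatting pipe can be sketched as a subprocess call with `--assume-filename` (function names here are illustrative; the fallback-to-unformatted behavior is an assumption of this sketch, not necessarily what the script does on a missing binary):

```python
import subprocess

def clang_format_args(path, binary="clang-format"):
    """Build the command line; `--assume-filename` makes the
    include-order rule treat each `.cc`'s own header as the primary
    include even though the text arrives on stdin."""
    return [binary, f"--assume-filename={path}"]

def format_source(text, path, binary="clang-format"):
    """Pipe generated text through the local clang-format, falling
    back to the unformatted text if the binary is unavailable."""
    try:
        proc = subprocess.run(
            clang_format_args(path, binary), input=text,
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return text
    return proc.stdout
```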
The previous fix landed on a slightly older `ruff` version that
preferred a multi-line `base_path.write_text(\n  ...\n)` form; CI
runs the latest `ruff format --check` which collapses the line.
Reformatted to match upstream.
Each generated `<op>.cc` instantiates `at::<op>_out(...)`, which
expands roughly 0.5-1 GB of ATen template metaprogramming.  With 451
ops compiled in parallel at Ninja's default `-j$(nproc)`, peak
memory can exceed 30 GB and the OOM killer drops `cc1plus` on
build hosts that allocate less RAM (observed on metax, moore, and
cambricon CI containers).

Add a Ninja job pool `torch_compile=4` and apply it to:
  - the vendor-system-g++ `add_custom_command` recompile loop
    (metax / moore), via `JOB_POOL`;
  - a new `infiniops_torch_objs` OBJECT library for the regular
    cmake build path (cambricon / nvidia / iluvatar), via
    `JOB_POOL_COMPILE`.

The rest of the build keeps full parallelism.
@voltjia voltjia force-pushed the codex/add-upsample_nearest1d-base branch from e8b6277 to 4988e19 Compare May 9, 2026 10:10
@voltjia voltjia marked this pull request as ready for review May 13, 2026 03:13
@voltjia voltjia changed the base branch from feat/torch-codegen to feat/torch-operator-bases May 13, 2026 03:13