[ci] add PPU unittest workflow by tiankongdeguiji · Pull Request #522 · alibaba/TorchEasyRec

tiankongdeguiji · 2026-05-21T02:35:07Z

Summary

Adds a PPU unit-test lane (.github/workflows/unittest_ppu_ci.yml) for release/v1.1, targeting the self-hosted tzrec-ppu-runner with image mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-test:1.1-ppu. Shaped after master's unittest_h20_ci.yml (inline docker run instead of container: to dodge the DSW authZ docker.sock injection), with PPU-specific adjustments: alixpu /dev device passthrough instead of --gpus all, --shm-size=256g, no CUDA_HOME env, 24h job timeout, and a dedicated scripts/ci/ci_test_ppu.sh that skips pip install (the image ships deps pre-installed).

Trigger policy: auto-fires on PRs targeting release/** (where stability matters) plus workflow_dispatch. PRs to master do not auto-trigger — PPU runner capacity stays opt-in for development branches.

Making the lane green

Initial run (26218011590) had 8 failures + 5 errors. Fixed by root cause:

CUDA-only wheel missing on PPU image — hstu_attn_2_cuda (cutlass HSTU), torch_fx_tool (RTP export), dynamicemb. Gate the affected tests on @unittest.skipUnless(<dep>_available, ...) so they skip cleanly on any host missing the wheel (not a PPU-named carve-out).
PPU Triton mis-compiles num_warps=32 in the _get_bwd_dwdb_configs reduction kernel — produced a deterministic ~0.127 abs-diff in bf16 rms_norm bwd (test_ln). Cap autotune at num_warps≤16 on PPU, mirroring the existing HIP carve-out. This single fix also resolved several downstream integration tests that were timing out or crashing due to the bad kernel.
alixpu Triton fp32 matmul ULP drift vs PyTorch eager (~5e-6 rel, ~1e-4 abs) — wider than CUDA but mathematically correct. Add get_compare_tolerance(dtype) that returns (atol=3e-5, rtol=2e-5) for fp32 on PPU (~2.1x / ~4.4x headroom on observed worst) and PyTorch defaults elsewhere, so NVIDIA's tight regression guard stays intact.
Latent NameError in dynamicemb_util.build_dynamicemb_constraints — referenced DynamicEmbScoreStrategy outside the try: import dynamicemb guard. Made it fail loudly with a clear RuntimeError when has_dynamicemb is False.

New tzrec.ops.is_ppu_arch() cached helper detects PPU by device-name substring (PPUs report as (8, 0) compute capability so there's no other way to tell them apart from real Ampere).

Test plan

Re-ran each originally failing test on PPU devices 2,3 (devices 0,1 are CI-reserved). All now pass or skip cleanly.
PPU CI lane goes green on this PR.
After merge, verify the workflow auto-fires for the next PR targeting release/v1.1 and does NOT auto-fire for master PRs.

🤖 Generated with Claude Code

Adds .github/workflows/unittest_ppu_ci.yml — a workflow_dispatch-only lane that runs `bash scripts/ci/ci_test.sh` inside the `mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-test:1.1-ppu` image on the self-hosted `tzrec-ppu-runner`. Uses the same `docker run` shape as master's unittest_h20_ci.yml (avoids the DSW authZ block on the `container:` form's docker.sock injection). No `pull_request` trigger and no PR-number `concurrency:` group — PPU runner capacity is opt-in and manually dispatched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The PPU test image (tzrec-test:1.1-ppu) ships requirements pre-installed, so the ppu lane skips the pip-install preface that ci_test.sh runs and goes straight to gen_proto + ci_data + tzrec/tests/run.py. Keeps ci_test.sh untouched so the existing GPU lane is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the pull_request trigger (opened/reopened/synchronize) and a PR-number concurrency group, mirroring the other tzrec CI lanes. Revert this commit (or drop the two top blocks) once the PPU runner is stable and we want manual-only dispatch again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PPUs are Alibaba's alixpu accelerators, not NVIDIA GPUs — expose them to the container via explicit --device for /dev/alixpu_ppu0, /dev/alixpu_ppu1, /dev/alixpu, and /dev/alixpu_ctl rather than the nvidia-container-runtime --gpus hook (which the PPU host doesn't have). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PPU test runs exceed the default 6h job timeout (run 26203745164 hit the cap). Bump timeout-minutes to 1440 until the lane is fast enough to fit under the default again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PPU CI run 26218011590 hit `ModuleNotFoundError: No module named 'hstu_attn_2_cuda'` because the PPU image lacks that CUDA-only wheel, while the existing `gpu_unavailable` skip is False on this lane (the PPU image still reports torch.cuda.is_available()). Introduce a `cutlass_hstu_unavailable` probe alongside `gpu_unavailable` in tzrec/utils/test_util.py that flips when the wheel can't be imported, and stack it on the four cutlass-only tests: - tzrec/ops/hstu_attention_test.py: test_attn_cutlass, test_sla_attn_cutlass - tzrec/tests/rank_integration_test.py: test_rank_dlrm_hstu_cutlass_train_eval_export, test_rank_ultra_hstu_cutlass_train_eval_export This is an availability gate, not a PPU-named skip — it fires on any host missing the wheel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PPU CI run 26218011590 reproduced a deterministic ref_dw vs opt_dw mismatch (abs 0.127 on 49.8% of D=257 elements) in test_ln at dtype=bfloat16, norm_type=rms_norm. Bisecting on ty_ppu showed that restricting the `_get_bwd_dwdb_configs` autotune to num_warps in [8, 16] (dropping the num_warps=32 config) makes the test pass deterministically in 12.6s. Master's kernel is byte-identical and works on NVIDIA H20, so this is an alixpu Triton runtime bug at num_warps=32 for this reduction shape; we work around by adding a small `is_ppu_arch()` detector in tzrec/ops/__init__.py and using it to gate the autotune sweep, mirroring how `torch.ops.hip` already restricts the AMD lane. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PPU CI run 26218011590 hit numerical drift in test_addmm (fp32 abs 1.22e-4, rel 5e-6 vs default rtol 1.3e-6) and test_jagged_dense_bmm_broadcast_add_triton (fp32 abs 2.86e-5 on 1.4% of elements). Local repro on ty_ppu confirms these are legitimate ULP-level differences from alixpu Triton's matmul reduction order vs PyTorch eager -- not a kernel bug. Add a `get_compare_tolerance(dtype)` helper to tzrec/utils/test_util.py that returns wider (atol, rtol) on PPU and PyTorch defaults elsewhere, and default the helper into _test_addmm and _test_jagged_dense_bmm_broadcast_add when the caller passes atol=rtol=None. NVIDIA's tight regression guard is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PPU CI run 26218011590 reported `test_dlrm_hstu_rtp_train_export` and `test_multi_tower_din_rtp_train_export` failing with `assertTrue(False)`. Local repro on ty_ppu surfaces two distinct missing-dependency root causes: 1. `test_dlrm_hstu_rtp_train_export` hits `ModuleNotFoundError: No module named 'torch_fx_tool'` inside the export subprocess (export_util.py:671). The PPU image doesn't ship the torch_fx_tool wheel that RTP export depends on. 2. `test_multi_tower_din_rtp_train_export` hits `NameError: name 'DynamicEmbScoreStrategy' is not defined` inside train_eval (dynamicemb_util.py:151) because the test config sets `dynamicemb { }` on features but the PPU image has no dynamicemb. The NameError is a latent bug -- `build_dynamicemb_constraints` references symbols only imported inside the `try: import dynamicemb` block; on hosts without dynamicemb the function fails with a NameError instead of a clear message. Gate both tests on the relevant dependency (mirroring the cutlass Family-A pattern), and harden `build_dynamicemb_constraints` to raise a clear RuntimeError when `has_dynamicemb` is False. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PPU fp32 Triton-vs-PyTorch tolerance was conservatively (1e-4, 1e-5). Tightening to (3e-5, 2e-5) gives ~2.1x headroom on test_jagged_dense_bmm (worst observed abs 2.86e-5, |y|~1.53) and ~4.4x on test_addmm (worst abs 1.22e-4, |y|~25.5), verified on ty_ppu. Drop the bf16/fp16 branches that were invented without measurement -- those dtypes fall back to PyTorch defaults. Also trim verbose docstrings/comments per [[feedback_concise_inline_comments]]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gger Replace the temporary universal pull_request trigger from edaba50 with a permanent branches: ['release/**'] filter. PRs targeting any release branch auto-trigger the PPU lane (where stability matters); PRs to master still need workflow_dispatch (where PPU runner capacity is opt-in). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-22T06:06:19Z

+def get_compare_tolerance(
+    dtype: torch.dtype,
+) -> Tuple[Optional[float], Optional[float]]:
+    """Return (atol, rtol) for Triton-vs-PyTorch comparisons; widen fp32 on PPU."""
+    from tzrec.ops import is_ppu_arch
+
+    if is_ppu_arch() and dtype == torch.float32:
+        return (3e-5, 2e-5)
+    return (None, None)


The (None, None) return path makes the contract awkward — every caller has to special-case it, and the docstring doesn't mention it. Two concrete issues this invites:

Partial-override blind spot at call sites. Both callsites added in this PR use if atol is None and rtol is None: (see jagged_tensors_test.py:389, mm_test.py:79). A future caller that passes atol=1e-6, rtol=None silently skips the widening and rtol falls back to torch's dtype defaults — which on PPU could mask or surface false regressions depending on direction. Widening each independently would be safer:
ppu_atol, ppu_rtol = get_compare_tolerance(dtype) atol = atol if atol is not None else ppu_atol rtol = rtol if rtol is not None else ppu_rtol

Returning (None, None) to mean "use defaults" is unusual. Consider returning concrete defaults for non-PPU/non-fp32 (e.g., the dtype's standard torch.testing.assert_close defaults), or restructure as get_compare_tolerance(dtype, atol, rtol) so the helper owns the precedence rule and callsites stay one line.

At minimum, the docstring should call out the (None, None) return path.

github-actions · 2026-05-22T06:06:26Z

+def is_ppu_arch() -> bool:
+    """Return True if a CUDA device is an Alibaba PPU (alixpu) accelerator."""
+    global _is_ppu_arch_cached
+    if _is_ppu_arch_cached is None:
+        try:
+            _is_ppu_arch_cached = torch.cuda.is_available() and any(
+                "PPU" in torch.cuda.get_device_name(i)
+                for i in range(torch.cuda.device_count())
+            )
+        except Exception:
+            _is_ppu_arch_cached = False
+    return _is_ppu_arch_cached


Two notes on the implementation vs. the docstring:

Scope is broader than the docstring suggests. The function returns True if any visible CUDA device's name contains the substring "PPU", and the result is cached for the process lifetime. The docstring reads as if it tests "a CUDA device" — worth tightening to e.g. "Return True if any visible CUDA device name contains 'PPU'. Cached after first call." This also surfaces the mixed-fleet edge case (PPU + non-PPU on one host both report True).

except Exception is wider than needed. The only realistic failure modes are RuntimeError from a broken driver and AssertionError from a stale device index. The broad clause will silently swallow future typos / API changes and quietly disable PPU code paths. Narrow to the specific exceptions, or at least leave a one-line comment naming the failure mode being defended against — otherwise the next reader has to guess.

github-actions · 2026-05-22T06:06:27Z

+concurrency:
+  group: unittest-ppu-ci-${{ github.event.pull_request.number }}
+  cancel-in-progress: true


github.event.pull_request.number is empty for workflow_dispatch, so all manual invocations collapse into the group key unittest-ppu-ci- and cancel-in-progress: true will cancel each other. Suggest:

Suggested change

concurrency:

group: unittest-ppu-ci-${{ github.event.pull_request.number }}

cancel-in-progress: true

concurrency:

group: unittest-ppu-ci-${{ github.event.pull_request.number || github.run_id }}

cancel-in-progress: true

github-actions · 2026-05-22T06:06:32Z

+      - name: FetchCommit
+        uses: actions/checkout@v4
+        with:
+          path: run_${{ github.run_id }}


Minor drift from unittest_h20_ci.yml: the H20 lane pins ref: ${{ github.event.pull_request.head.sha }}, this one doesn't. With pull_request, the default checkout is the merge ref — slightly different semantics (catches conflicts, but won't pin if the PR is rebased mid-run). If the intent was parity with H20, add the ref: line; if intentional, worth a one-line comment noting why.

github-actions · 2026-05-22T06:06:50Z

Review summary

Focused, well-scoped PR: a new release-branch-only PPU CI lane plus minimal source changes (capability gating + tolerance widening) to make the existing test suite green on alixpu. The PR description's failure-mode → root-cause walk is unusually clear. Posture comments below; specific issues are inline.

Test gating approach is right. Using ImportError-based flags (cutlass_hstu_unavailable, torch_fx_tool_unavailable) rather than a is_ppu_arch() carve-out means tests skip cleanly on any host missing the wheel — not just PPU. Tolerance widening, on the other hand, is correctly arch-named because it's a numerics issue, not a wheel-availability one. The asymmetry is worth one comment in test_util.py so a future contributor doesn't "fix" it for consistency.

num_warps=32 exclusion on PPU. Correct call — Triton autotune still has to compile each candidate; "let Triton find it's bad" requires compilation to succeed. Excluding the known-broken config is the right behavior, and it leaves the NVIDIA search space unchanged.

Security posture (informational, not a blocker). The lane runs on a self-hosted runner on pull_request, which means code from forked PRs targeting release/** will execute on tzrec-ppu-runner with raw /dev/alixpu* device passthrough and --ulimit memlock=-1. The H20 lane on master has the same posture, so this isn't a regression — but the PPU device-node attack surface (raw ioctls into the driver) is genuinely broader than --gpus all. Worth tracking as a repo-wide policy item (fork allowlist / label-gated trigger / required-reviewers environment) rather than blocking this PR.

Inline comments cover:

test_util.py:67-75 — get_compare_tolerance (None, None) contract + partial-override blind spot
tzrec/ops/__init__.py:30-41 — is_ppu_arch() docstring scope + over-broad except Exception
unittest_ppu_ci.yml:9-11 — concurrency group collapses workflow_dispatch runs
unittest_ppu_ci.yml:19-22 — drift from H20 lane's ref: head.sha pin

tiankongdeguiji and others added 3 commits May 21, 2026 10:34

tiankongdeguiji changed the title ~~[ci] add manual PPU unittest workflow for release/v1.1~~ [ci] add manual PPU unittest workflow May 21, 2026

tiankongdeguiji and others added 8 commits May 21, 2026 11:30

[ci] raise PPU CI job timeout to 24h

9d229e5

PPU test runs exceed the default 6h job timeout (run 26203745164 hit the cap). Bump timeout-minutes to 1440 until the lane is fast enough to fit under the default again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tiankongdeguiji changed the title ~~[ci] add manual PPU unittest workflow~~ [ci] add PPU unittest workflow May 22, 2026

tiankongdeguiji added the claude-review Let Claude Review label May 22, 2026

github-actions Bot removed the claude-review Let Claude Review label May 22, 2026

github-actions Bot reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ci] add PPU unittest workflow #522

[ci] add PPU unittest workflow #522
tiankongdeguiji wants to merge 11 commits into
release/v1.1from
feat/ppu-ci-workflow

tiankongdeguiji commented May 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot May 22, 2026

Uh oh!

github-actions Bot May 22, 2026

Uh oh!

github-actions Bot May 22, 2026

Uh oh!

github-actions Bot May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tiankongdeguiji commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Making the lane green

Test plan

Uh oh!

github-actions Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 22, 2026

Review summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tiankongdeguiji commented May 21, 2026 •

edited

Loading