feat: CI Stream for NVIDIA platform#21

Open
zhangyue207 wants to merge 13 commits into feat/dev-infra from feat/ci-nvidia

Conversation

@zhangyue207
Collaborator

No description provided.

@zhangyue207 zhangyue207 changed the title feat/CI system for NVIDIA platform feat: Test Stream for NVIDIA platform Mar 13, 2026
@zhangyue207 zhangyue207 changed the title feat: Test Stream for NVIDIA platform feat: CI Stream for NVIDIA platform Mar 16, 2026
@zhangyue207 zhangyue207 marked this pull request as draft March 19, 2026 08:29
@zhangyue207 zhangyue207 removed the draft label Mar 19, 2026
zhangyue207 and others added 13 commits March 25, 2026 02:58
- Pass host UID/GID into container and `chown` results after tests,
  so mounted `ci-results/` is accessible by the host user.
- Limit `pytest-xdist` workers from `-n auto` to `-n 8` to prevent
  OOM worker crashes on high-core-count machines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
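The UID/GID hand-off described in this commit can be sketched as follows. This is a minimal illustration, not the PR's actual `run.py` code: the function name `build_docker_run`, the mount path, and the image handling are hypothetical; the pattern shown is passing the host UID/GID into the container, capping `pytest-xdist` at 8 workers, and `chown`ing the mounted results directory before the container exits.

```python
import os

def build_docker_run(image: str, test_cmd: str) -> list[str]:
    # Hypothetical sketch: forward the host UID/GID so files written to
    # the bind-mounted ci-results/ can be chown'ed back to the host user.
    uid, gid = os.getuid(), os.getgid()
    inner = (
        f"{test_cmd} -n 8; status=$?; "               # cap xdist workers at 8
        f"chown -R {uid}:{gid} /workspace/ci-results; "
        f"exit $status"                               # preserve the test exit code
    )
    return [
        "docker", "run", "--rm",
        "-e", f"HOST_UID={uid}", "-e", f"HOST_GID={gid}",
        "-v", f"{os.getcwd()}/ci-results:/workspace/ci-results",
        image, "bash", "-c", inner,
    ]
```

Running the `chown` inside the same `bash -c` command, with the exit status saved first, keeps the results readable on the host even when the tests fail.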
Add Dockerfile, config, and mx-smi GPU detection for MetaX (MACA) platform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add GPU detection via mthreads-gmi, Dockerfile, config, and update docs
with Moore and MetaX platform deployment instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Capture last 50 lines of Docker output via ring buffer so failed
  jobs return diagnostic info to the CLI client.
- Store raw bytes during execution; decode only on the failure path.
- Align job name columns in `<==` result lines for readability.
- Show summary only when jobs fail, removing redundant all-pass output.

Co-Authored-By: Claude <noreply@anthropic.com>
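The ring-buffer capture and decode-on-failure behavior from this commit can be sketched with a `collections.deque` of bounded length. The function names here are hypothetical; the point is that raw bytes are buffered during execution and decoding happens only when a job fails.

```python
from collections import deque
from typing import Iterable

def capture_tail(lines: Iterable[bytes], max_lines: int = 50) -> deque:
    # A deque with maxlen acts as a ring buffer: once full, appending a
    # new line silently discards the oldest one. Raw bytes are stored
    # as-is during execution.
    tail: deque = deque(maxlen=max_lines)
    for line in lines:
        tail.append(line)
    return tail

def decode_on_failure(tail: deque, failed: bool) -> str:
    # Decode only on the failure path, so passing jobs never pay the
    # cost of decoding (or the risk of a partial-UTF-8 crash mid-run).
    if not failed:
        return ""
    return b"".join(tail).decode("utf-8", errors="replace")
```

This keeps memory bounded no matter how verbose the Docker build output is, while still giving the CLI client the last 50 lines of diagnostics.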
- Add .ci/images/cambricon/Dockerfile for AnolisOS-based Cambricon image
- Add cambricon platform to config.yaml with MLU-style GPU passthrough
- Add GPU_STYLE_MLU constant and MLU_VISIBLE_DEVICES support in run.py
- Add cnmon-based GPU detection (_detect_gpus_cambricon) in ci_resource.py
- Add --test CLI flag to override pytest test path at runtime
- Skip empty stage run commands instead of erroring (compilation-only mode)
- Fix _torch_gemm fallback for CPU float16/bfloat16 (upcast to float32)
- Skip bfloat16 on MLU (cnnlBatchMatMulEx does not support it)
- Hoist _PYTEST_VALUE_FLAGS to module level; add ValueError guard in cambricon parser
- Remove redundant yaml import guard in agent.py (utils.py already handles it)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
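The `GPU_STYLE_MLU` / `MLU_VISIBLE_DEVICES` support from this commit can be sketched as a device-visibility selector. The constants and function below are illustrative stand-ins, not the PR's actual `run.py` definitions; the idea is that Cambricon MLU devices are exposed to the container via `MLU_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES`.

```python
# Hypothetical GPU-style constants mirroring those described above.
GPU_STYLE_CUDA = "cuda"
GPU_STYLE_MLU = "mlu"

def visibility_env(style: str, gpu_ids: list[int]) -> tuple[str, str]:
    # Pick the environment variable the platform's runtime honors for
    # device passthrough, and join the scheduler-assigned IDs.
    var = "MLU_VISIBLE_DEVICES" if style == GPU_STYLE_MLU else "CUDA_VISIBLE_DEVICES"
    return var, ",".join(str(i) for i in gpu_ids)
```

A `build_docker_args`-style caller would turn the returned pair into a `-e VAR=IDS` flag on the `docker run` command line.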
…DIA scheduler

- Rewrite README.md entirely in English; add Cambricon to platform
  table and directory tree.
- Translate all inline comments in config.yaml to English.
- Replace `gpu_ids: "0"` with `ngpus: 1` for NVIDIA platform so the
  scheduler auto-picks a free GPU rather than pinning to device 0.
- Add `ngpus` support to `parse_gpu_requirement` in ci_resource.py so
  scheduler correctly counts NVIDIA GPU demand.
- Replace deprecated `gpu_count` fallback with `ngpus` in run.py
  `build_docker_args`.

Co-Authored-By: Claude <noreply@anthropic.com>
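The `ngpus` handling added to `parse_gpu_requirement` in this commit can be sketched as follows, assuming a per-platform config dict. The fallback order shown (prefer `ngpus`, count explicit `gpu_ids`, default to one GPU) is an assumption consistent with the commit message, not verified against `ci_resource.py`:

```python
def parse_gpu_requirement(platform_cfg: dict) -> int:
    # Prefer the new `ngpus` count so the scheduler can auto-pick any
    # free GPUs rather than pinning to specific devices.
    if "ngpus" in platform_cfg:
        return int(platform_cfg["ngpus"])
    # Legacy configs may still pin explicit IDs, e.g. gpu_ids: "0,1".
    if "gpu_ids" in platform_cfg:
        return len(str(platform_cfg["gpu_ids"]).split(","))
    return 1  # sensible default: one GPU per job
```

With `ngpus: 1` in `config.yaml`, the scheduler counts demand correctly and is free to place the NVIDIA job on whichever device is idle.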
@zhangyue207 zhangyue207 marked this pull request as ready for review March 25, 2026 03:04
@zhangyue207
Collaborator Author

zhangyue207 commented Mar 25, 2026

Feature demonstration:

(python3.10) zhangyue@server:~/InfiniOps$ python .ci/agent.py run --branch feat/dev-infra
==> dispatching nvidia gpu job to http://localhost:8080
    job_id: f088f369
==> dispatching iluvatar gpu job to http://172.22.162.13:8080
    job_id: 9f280eae
==> dispatching metax gpu job to http://172.22.162.49:8080
    job_id: e7a94c55
==> dispatching moore gpu job to http://172.22.162.95:8080
    job_id: a8de4bc5
==> dispatching cambricon gpu job to http://172.22.162.16:8080
error: failed to dispatch to http://172.22.162.16:8080: HTTP Error 502: Bad Gateway
    failed to dispatch cambricon_gpu
<== FAIL  moore_gpu     (24s)
--- error output (last 50 lines) ---
          -- Found OpenMP_CXX: -fopenmp (found version "4.5")
          -- Found OpenMP: TRUE (found version "4.5")
          -- Found Python: /usr/bin/python (found version "3.10.12") found components: Interpreter
          -- Generating wrappers - done
          -- Found Python: /usr/bin/python (found version "3.10.12") found components: Interpreter Development Development.Module Development.Embed
          -- Performing Test HAS_FLTO_AUTO
          -- Performing Test HAS_FLTO_AUTO - Success
          -- Found pybind11: /usr/local/lib/python3.10/dist-packages/pybind11/include (found version "3.0.1")
          -- Configuring done (10.9s)
          -- Generating done (0.0s)
          -- Build files have been written to: /tmp/tmpmqa_49uf/build
          *** Building project with Ninja...
          [0/2] Re-checking globbed directories...
          [1/10] Building CXX object src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o
          FAILED: src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o
          /workspace/repo/scripts/mcc_wrapper.sh -DWITH_CPU=1 -DWITH_MOORE=1 -Dops_EXPORTS -I/workspace/repo/src -I/usr/local/musa/include -I/workspace/repo -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/pybind11/include -O3 -DNDEBUG -std=gnu++17 -fPIC -fvisibility=hidden -x musa -MD -MT src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o -MF src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o.d -o src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o -c /workspace/repo/generated/bindings/ops.cc
          In file included from /workspace/repo/generated/bindings/ops.cc:4:
          In file included from /workspace/repo/src/moore/add/kernel.h:14:
          In file included from /workspace/repo/src/cuda/add/kernel.h:6:
          In file included from /workspace/repo/src/base/add.h:6:
          In file included from /workspace/repo/src/operator.h:10:
          In file included from /workspace/repo/src/dispatcher.h:10:
          In file included from /workspace/repo/src/data_type.h:19:
          In file included from /usr/local/musa/include/musa_fp16.h:3540:
          /usr/local/musa/include/musa_fp16_mtgpu.h:1609:26: error: out-of-line declaration of 'hrcp' does not match any declaration in namespace 'infini::ops'
          inline __device__ __half hrcp(__half x);
                                   ^~~~
          /workspace/repo/src/moore/polyfills.cuh:39:27: note: expanded from macro 'hrcp'
          #define hrcp infini::ops::hrcp
                                    ^~~~
          1 error generated when compiling for mp_21.
          [2/10] Building CXX object examples/CMakeFiles/data_type.dir/data_type.cc.o
          [3/10] Building CXX object examples/CMakeFiles/tensor.dir/tensor.cc.o
          [4/10] Building CXX object src/CMakeFiles/infiniops.dir/tensor.cc.o
          [5/10] Building CXX object examples/CMakeFiles/gemm.dir/gemm/gemm.cc.o
          ninja: build stopped: subcommand failed.
          
          *** CMake build failed
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for InfiniOps
    Failed to build InfiniOps
    
    [notice] A new release of pip is available: 25.2 -> 26.0.1
    [notice] To update, run: python -m pip install --upgrade pip
    error: failed-wheel-build-for-install
    
    × Failed to build installable wheels for some pyproject.toml based projects
    ╰─> InfiniOps
---
<== PASS  iluvatar_gpu  (93s)
<== PASS  nvidia_gpu    (114s)
<== PASS  metax_gpu     (207s)

========== Failed ==========
  FAIL  cambricon_gpu  error (0s)
  FAIL  moore_gpu      failure (24s)

