feat: CI Stream for NVIDIA platform#21

Open
zhangyue207 wants to merge 13 commits into feat/dev-infra from feat/ci-nvidia

Conversation

@zhangyue207
Collaborator

No description provided.

@zhangyue207 zhangyue207 changed the title feat/CI system for NVIDIA platform feat: Test Stream for NVIDIA platform Mar 13, 2026
@zhangyue207 zhangyue207 changed the title feat: Test Stream for NVIDIA platform feat: CI Stream for NVIDIA platform Mar 16, 2026
@zhangyue207 zhangyue207 marked this pull request as draft March 19, 2026 08:29
@zhangyue207 zhangyue207 removed the draft label Mar 19, 2026
zhangyue207 and others added 13 commits March 25, 2026 02:58
- Pass host UID/GID into container and `chown` results after tests,
  so mounted `ci-results/` is accessible by the host user.
- Limit `pytest-xdist` workers from `-n auto` to `-n 8` to prevent
  OOM worker crashes on high-core-count machines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
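The UID/GID hand-off described in this commit can be sketched as follows. This is a minimal illustration, not the PR's actual `run.py` code: the function name `build_docker_run`, the mount path, and the image handling are hypothetical; the pattern shown is passing the host UID/GID into the container, capping `pytest-xdist` at 8 workers, and `chown`ing the mounted results directory before the container exits.

```python
import os

def build_docker_run(image: str, test_cmd: str) -> list[str]:
    # Hypothetical sketch: forward the host UID/GID so files written to
    # the bind-mounted ci-results/ can be chown'ed back to the host user.
    uid, gid = os.getuid(), os.getgid()
    inner = (
        f"{test_cmd} -n 8; status=$?; "               # cap xdist workers at 8
        f"chown -R {uid}:{gid} /workspace/ci-results; "
        f"exit $status"                               # preserve the test exit code
    )
    return [
        "docker", "run", "--rm",
        "-e", f"HOST_UID={uid}", "-e", f"HOST_GID={gid}",
        "-v", f"{os.getcwd()}/ci-results:/workspace/ci-results",
        image, "bash", "-c", inner,
    ]
```

Running the `chown` inside the same `bash -c` command, with the exit status saved first, keeps the results readable on the host even when the tests fail.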
Add Dockerfile, config, and mx-smi GPU detection for MetaX (MACA) platform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add GPU detection via mthreads-gmi, Dockerfile, config, and update docs
with Moore and MetaX platform deployment instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Capture last 50 lines of Docker output via ring buffer so failed
  jobs return diagnostic info to the CLI client.
- Store raw bytes during execution; decode only on the failure path.
- Align job name columns in `<==` result lines for readability.
- Show summary only when jobs fail, removing redundant all-pass output.

Co-Authored-By: Claude <noreply@anthropic.com>
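The ring-buffer capture and decode-on-failure behavior from this commit can be sketched with a `collections.deque` of bounded length. The function names here are hypothetical; the point is that raw bytes are buffered during execution and decoding happens only when a job fails.

```python
from collections import deque
from typing import Iterable

def capture_tail(lines: Iterable[bytes], max_lines: int = 50) -> deque:
    # A deque with maxlen acts as a ring buffer: once full, appending a
    # new line silently discards the oldest one. Raw bytes are stored
    # as-is during execution.
    tail: deque = deque(maxlen=max_lines)
    for line in lines:
        tail.append(line)
    return tail

def decode_on_failure(tail: deque, failed: bool) -> str:
    # Decode only on the failure path, so passing jobs never pay the
    # cost of decoding (or the risk of a partial-UTF-8 crash mid-run).
    if not failed:
        return ""
    return b"".join(tail).decode("utf-8", errors="replace")
```

This keeps memory bounded no matter how verbose the Docker build output is, while still giving the CLI client the last 50 lines of diagnostics.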
- Add .ci/images/cambricon/Dockerfile for AnolisOS-based Cambricon image
- Add cambricon platform to config.yaml with MLU-style GPU passthrough
- Add GPU_STYLE_MLU constant and MLU_VISIBLE_DEVICES support in run.py
- Add cnmon-based GPU detection (_detect_gpus_cambricon) in ci_resource.py
- Add --test CLI flag to override pytest test path at runtime
- Skip empty stage run commands instead of erroring (compilation-only mode)
- Fix _torch_gemm fallback for CPU float16/bfloat16 (upcast to float32)
- Skip bfloat16 on MLU (cnnlBatchMatMulEx does not support it)
- Hoist _PYTEST_VALUE_FLAGS to module level; add ValueError guard in cambricon parser
- Remove redundant yaml import guard in agent.py (utils.py already handles it)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
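The `GPU_STYLE_MLU` / `MLU_VISIBLE_DEVICES` support from this commit can be sketched as a device-visibility selector. The constants and function below are illustrative stand-ins, not the PR's actual `run.py` definitions; the idea is that Cambricon MLU devices are exposed to the container via `MLU_VISIBLE_DEVICES` instead of `CUDA_VISIBLE_DEVICES`.

```python
# Hypothetical GPU-style constants mirroring those described above.
GPU_STYLE_CUDA = "cuda"
GPU_STYLE_MLU = "mlu"

def visibility_env(style: str, gpu_ids: list[int]) -> tuple[str, str]:
    # Pick the environment variable the platform's runtime honors for
    # device passthrough, and join the scheduler-assigned IDs.
    var = "MLU_VISIBLE_DEVICES" if style == GPU_STYLE_MLU else "CUDA_VISIBLE_DEVICES"
    return var, ",".join(str(i) for i in gpu_ids)
```

A `build_docker_args`-style caller would turn the returned pair into a `-e VAR=IDS` flag on the `docker run` command line.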
…DIA scheduler

- Rewrite README.md entirely in English; add Cambricon to platform
  table and directory tree.
- Translate all inline comments in config.yaml to English.
- Replace `gpu_ids: "0"` with `ngpus: 1` for NVIDIA platform so the
  scheduler auto-picks a free GPU rather than pinning to device 0.
- Add `ngpus` support to `parse_gpu_requirement` in ci_resource.py so
  scheduler correctly counts NVIDIA GPU demand.
- Replace deprecated `gpu_count` fallback with `ngpus` in run.py
  `build_docker_args`.

Co-Authored-By: Claude <noreply@anthropic.com>
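The `ngpus` handling added to `parse_gpu_requirement` in this commit can be sketched as follows, assuming a per-platform config dict. The fallback order shown (prefer `ngpus`, count explicit `gpu_ids`, default to one GPU) is an assumption consistent with the commit message, not verified against `ci_resource.py`:

```python
def parse_gpu_requirement(platform_cfg: dict) -> int:
    # Prefer the new `ngpus` count so the scheduler can auto-pick any
    # free GPUs rather than pinning to specific devices.
    if "ngpus" in platform_cfg:
        return int(platform_cfg["ngpus"])
    # Legacy configs may still pin explicit IDs, e.g. gpu_ids: "0,1".
    if "gpu_ids" in platform_cfg:
        return len(str(platform_cfg["gpu_ids"]).split(","))
    return 1  # sensible default: one GPU per job
```

With `ngpus: 1` in `config.yaml`, the scheduler counts demand correctly and is free to place the NVIDIA job on whichever device is idle.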
@zhangyue207 zhangyue207 marked this pull request as ready for review March 25, 2026 03:04
@zhangyue207
Collaborator Author

zhangyue207 commented Mar 25, 2026

Feature demonstration:

(python3.10) zhangyue@server:~/InfiniOps$ python .ci/agent.py run --branch feat/dev-infra
==> dispatching nvidia gpu job to http://localhost:8080
    job_id: f088f369
==> dispatching iluvatar gpu job to http://172.22.162.13:8080
    job_id: 9f280eae
==> dispatching metax gpu job to http://172.22.162.49:8080
    job_id: e7a94c55
==> dispatching moore gpu job to http://172.22.162.95:8080
    job_id: a8de4bc5
==> dispatching cambricon gpu job to http://172.22.162.16:8080
error: failed to dispatch to http://172.22.162.16:8080: HTTP Error 502: Bad Gateway
    failed to dispatch cambricon_gpu
<== FAIL  moore_gpu     (24s)
--- error output (last 50 lines) ---
          -- Found OpenMP_CXX: -fopenmp (found version "4.5")
          -- Found OpenMP: TRUE (found version "4.5")
          -- Found Python: /usr/bin/python (found version "3.10.12") found components: Interpreter
          -- Generating wrappers - done
          -- Found Python: /usr/bin/python (found version "3.10.12") found components: Interpreter Development Development.Module Development.Embed
          -- Performing Test HAS_FLTO_AUTO
          -- Performing Test HAS_FLTO_AUTO - Success
          -- Found pybind11: /usr/local/lib/python3.10/dist-packages/pybind11/include (found version "3.0.1")
          -- Configuring done (10.9s)
          -- Generating done (0.0s)
          -- Build files have been written to: /tmp/tmpmqa_49uf/build
          *** Building project with Ninja...
          [0/2] Re-checking globbed directories...
          [1/10] Building CXX object src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o
          FAILED: src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o
          /workspace/repo/scripts/mcc_wrapper.sh -DWITH_CPU=1 -DWITH_MOORE=1 -Dops_EXPORTS -I/workspace/repo/src -I/usr/local/musa/include -I/workspace/repo -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/pybind11/include -O3 -DNDEBUG -std=gnu++17 -fPIC -fvisibility=hidden -x musa -MD -MT src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o -MF src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o.d -o src/CMakeFiles/ops.dir/__/generated/bindings/ops.cc.o -c /workspace/repo/generated/bindings/ops.cc
          In file included from /workspace/repo/generated/bindings/ops.cc:4:
          In file included from /workspace/repo/src/moore/add/kernel.h:14:
          In file included from /workspace/repo/src/cuda/add/kernel.h:6:
          In file included from /workspace/repo/src/base/add.h:6:
          In file included from /workspace/repo/src/operator.h:10:
          In file included from /workspace/repo/src/dispatcher.h:10:
          In file included from /workspace/repo/src/data_type.h:19:
          In file included from /usr/local/musa/include/musa_fp16.h:3540:
          /usr/local/musa/include/musa_fp16_mtgpu.h:1609:26: error: out-of-line declaration of 'hrcp' does not match any declaration in namespace 'infini::ops'
          inline __device__ __half hrcp(__half x);
                                   ^~~~
          /workspace/repo/src/moore/polyfills.cuh:39:27: note: expanded from macro 'hrcp'
          #define hrcp infini::ops::hrcp
                                    ^~~~
          1 error generated when compiling for mp_21.
          [2/10] Building CXX object examples/CMakeFiles/data_type.dir/data_type.cc.o
          [3/10] Building CXX object examples/CMakeFiles/tensor.dir/tensor.cc.o
          [4/10] Building CXX object src/CMakeFiles/infiniops.dir/tensor.cc.o
          [5/10] Building CXX object examples/CMakeFiles/gemm.dir/gemm/gemm.cc.o
          ninja: build stopped: subcommand failed.
          
          *** CMake build failed
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for InfiniOps
    Failed to build InfiniOps
    
    [notice] A new release of pip is available: 25.2 -> 26.0.1
    [notice] To update, run: python -m pip install --upgrade pip
    error: failed-wheel-build-for-install
    
    × Failed to build installable wheels for some pyproject.toml based projects
    ╰─> InfiniOps
---
<== PASS  iluvatar_gpu  (93s)
<== PASS  nvidia_gpu    (114s)
<== PASS  metax_gpu     (207s)

========== Failed ==========
  FAIL  cambricon_gpu  error (0s)
  FAIL  moore_gpu      failure (24s)

