EXPERIMENT: GB drop wave1 OFED for inbox RDMA + float kernel #8776
Open
xuexu6666 wants to merge 2 commits into
Experiment (NOT for merge) to test whether the GB image can use the kernel's inbox RDMA stack instead of full DOCA/OFED. Branched off main; changes ONLY the OFED->inbox + kernel-pin variables, everything else stays at the main baseline.

- pre-install-dependencies.sh: float the kernel. Install the latest linux-azure-nvidia unpinned (like the vanilla arm64 image) instead of the pinned 6.14.0-1003.3 + PPA + curl fallback. The latest (6.14.0-1007.7) was verified to ship the Data Direct GPUDirect-RDMA support inbox in mlx5_ib.
- gb-mai-bom.json: drop versions-wave1 (doca-ofed + MLNX verbs), doca-custom-repo, and kernel-versions. wave2 (driver) + wave3 unchanged.
- install-dependencies.sh: skip the DOCA repo setup and wave1 install; disable the staged doca-net.list so RDMA userspace resolves to distro rdma-core (not MLNX); install rdma-core + ibverbs-providers + ibverbs-utils; drop 'systemctl enable openibd'.

nvidia-peermem (wave2 driver DKMS output) now builds against inbox ib_core.

Key build/hardware signals: peermem DKMS compiles; ibv_devinfo shows data_direct on CX8; distro rdma-core exposes the data_direct verbs path + SHARP; NCCL perf parity.
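For reviewers, the install-dependencies.sh change described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: the helper names `remove_staged_apt_source` and `inbox_rdma_install_cmd` are hypothetical, and the `/etc/apt/sources.list.d/` location of doca-net.list is assumed. Only the package names and the list file name come from the description.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Distro RDMA userspace packages named in the PR description.
INBOX_RDMA_PACKAGES=(rdma-core ibverbs-providers ibverbs-utils)

# Hypothetical helper: delete a staged apt source list so that later
# `apt-get update` runs never contact that repo. A missing file is not an error.
remove_staged_apt_source() {
  local list_file="$1"
  rm -f -- "${list_file}"
}

# Hypothetical helper: print (not run) the install command for the
# inbox userspace stack, so the package set can be reviewed or logged.
inbox_rdma_install_cmd() {
  echo "apt-get install -y ${INBOX_RDMA_PACKAGES[*]}"
}

# Example wiring (assumed path; left commented out because it needs root):
#   remove_staged_apt_source /etc/apt/sources.list.d/doca-net.list
#   apt-get update
#   apt-get install -y "${INBOX_RDMA_PACKAGES[@]}"
```

Removing the list before the first `apt-get update` is the point of the ordering: once the DOCA source is gone, the rdma-core packages can only resolve from the distro archive.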
Contributor
PR Title Lint Failed ❌
Current Title:
Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:
Conventional Commits Format:
Guidelines:
Examples:
Please update your PR title and the lint check will run again automatically.
Contributor
Pull request overview
Experimental VHD builder changes for the NVIDIA GB Ubuntu 24.04 arm64 image to validate running on the kernel’s inbox RDMA stack (rdma-core + in-kernel mlx5/ib_core) instead of installing DOCA/OFED wave1 packages, and to float the linux-azure-nvidia kernel version.
Changes:
- Removed DOCA/OFED (wave1) package pinning/config from the GB BOM and switched userspace RDMA to distro `rdma-core` tooling.
- Updated the GB install flow to disable the staged DOCA apt source and skip `openibd` enablement.
- Simplified kernel installation logic on Ubuntu 24.04 arm64 to install unpinned `linux-azure-nvidia` from the repo.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vhdbuilder/packer/pre-install-dependencies.sh | Drops pinned/PPA/fallback kernel logic and installs unpinned linux-azure-nvidia for Ubuntu 24.04 arm64. |
| vhdbuilder/packer/install-dependencies.sh | Skips DOCA/OFED wave1; disables DOCA apt repo and installs distro RDMA userspace packages; removes openibd enable. |
| vhdbuilder/packer/gb-mai-bom.json | Removes wave1 + doca repo + kernel pin entries; documents experiment intent and keeps wave2/wave3 pinned sets. |
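
The difference between the pinned and floating kernel install comes down to whether apt is given a `package=version` spec or a bare package name. A minimal sketch, assuming a hypothetical helper `kernel_package_spec`; attaching the 6.14.0-1003.3 pin directly to the linux-azure-nvidia package name is also an assumption made for illustration:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: build the apt package spec for the kernel.
# With a version argument it emits "pkg=version" (pinned); with none it
# emits the bare package name, so apt picks the newest candidate (floating).
kernel_package_spec() {
  local pkg="$1"
  local version="${2:-}"
  if [ -n "${version}" ]; then
    echo "${pkg}=${version}"
  else
    echo "${pkg}"
  fi
}

# Pinned (main baseline, assumed form):
#   apt-get install -y "$(kernel_package_spec linux-azure-nvidia 6.14.0-1003.3)"
# Floating (this experiment):
#   apt-get install -y "$(kernel_package_spec linux-azure-nvidia)"
```

With the floating form, the installed kernel depends on what the repo serves at build time, so two builds of the same commit can produce different kernels.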
    # EXPERIMENT: install the latest linux-azure-nvidia from the repo with no version pin (like
    # the vanilla arm64 image), so GB floats the kernel onto the newest azure-nvidia kernel
    # (verified to ship the Data Direct GPUDirect-RDMA support inbox in mlx5_ib).
    apt-get update
Comment on lines 213 to +214
    else
      apt-get update
      if apt-cache show "${NVIDIA_KERNEL_PACKAGE}" &> /dev/null; then
        echo "ARM64 image. Installing NVIDIA kernel and its packages alongside LTS kernel"
        wait_for_apt_locks
        sudo apt install --no-install-recommends -y "${NVIDIA_KERNEL_PACKAGE}"
        echo "after installation:"
        dpkg -l | grep "linux-.*-azure-nvidia" || true
      else
        echo "ARM64 image. NVIDIA kernel not available, skipping installation."
      fi
      echo "ARM64 image. NVIDIA kernel not available, skipping installation."
Build 169569810 failed: the staged doca-net.list points at the DOCA 'latest' repo, which is signed with a key (DC726C5E41B9CC50) that is not in the shipped keyring. Every apt-get update then emits a GPG 'is not signed' W:/E: line, which the retry-wrapped apt_get_update helper treats as fatal (10 retries, then exit 99), so the build fails before the GB block's rm of the list ever runs. On main this is masked because the GB block replaces 'latest' with the pinned doca/3.1.0 repo (valid key); this experiment dropped that replacement, exposing the broken 'latest' repo.

Fix: gate off the DOCA repo staging in packer_source.sh entirely (we install no OFED here; inbox mlx5/ib + distro rdma-core provide RDMA), so no apt-get update ever sees the DOCA repo. Keep the rm in install-dependencies.sh as belt-and-suspenders.
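
The failure mode can be reproduced in isolation. The sketch below is not the repository's apt_get_update helper; `retry_update` and `apt_output_is_clean` are hypothetical stand-ins that assume the real helper fails an attempt whenever the output contains a line starting with `W:` or `E:`, and they omit any sleep between attempts. The exit code 99 is taken from the build failure described above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical check: succeed only if the captured apt output has no
# warning (W:) or error (E:) lines.
apt_output_is_clean() {
  ! grep -Eq '^(W|E):' <<<"$1"
}

# Hypothetical retry wrapper: run the given command up to N times and
# return 0 on the first attempt that both exits 0 and produces clean
# output. If every attempt is rejected, return 99.
retry_update() {
  local retries="$1"
  shift
  local i out
  for ((i = 1; i <= retries; i++)); do
    if out="$("$@" 2>&1)" && apt_output_is_clean "${out}"; then
      return 0
    fi
  done
  return 99
}

# Example (needs root, so left commented out):
#   retry_update 10 apt-get update
```

Under this model a repo with an unknown signing key never recovers on retry: `apt-get update` can exit 0 while still printing the GPG warning, so every attempt is rejected and the wrapper always ends in exit 99. That is why the fix removes the source up front instead of tuning the retries.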