EXPERIMENT: GB drop wave1 OFED for inbox RDMA + float kernel #8776

Open
xuexu6666 wants to merge 2 commits into main from xuex/gb-wave1-drop-experiment


@xuexu6666


Experiment (NOT for merge) to test whether the GB image can use the kernel's inbox RDMA stack instead of full DOCA/OFED. Branched off main; changes ONLY the OFED->inbox + kernel-pin variables, everything else stays at the main baseline.

  • pre-install-dependencies.sh: float the kernel - install the latest linux-azure-nvidia unpinned (like the vanilla arm64 image) instead of the pinned 6.14.0-1003.3 + PPA + curl fallback. The latest (6.14.0-1007.7) was verified to ship the Data Direct GPUDirect-RDMA support inbox in mlx5_ib.
  • gb-mai-bom.json: drop versions-wave1 (doca-ofed + MLNX verbs), doca-custom-repo, and kernel-versions. wave2 (driver) + wave3 unchanged.
  • install-dependencies.sh: skip the DOCA repo setup and wave1 install; disable the staged doca-net.list so RDMA userspace resolves to distro rdma-core (not MLNX); install rdma-core + ibverbs-providers + ibverbs-utils; drop 'systemctl enable openibd'.
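
The install-dependencies.sh change in the last bullet could look roughly like the sketch below. This is not the PR's actual diff: the function names, the `.disabled` suffix, and the `APT_SOURCES_DIR` variable are assumptions made for illustration. Only the `doca-net.list` file name and the three package names come from the description above.

```shell
#!/bin/sh
# Sketch only: names other than doca-net.list and the three packages are hypothetical.

# Directory holding apt source lists; overridable so the function can be tried safely.
APT_SOURCES_DIR="${APT_SOURCES_DIR:-/etc/apt/sources.list.d}"

# Rename a staged source list so apt no longer reads it.
disable_apt_source() {
    src="${APT_SOURCES_DIR}/$1"
    if [ -f "$src" ]; then
        mv "$src" "${src}.disabled"
        echo "disabled $1"
    else
        echo "$1 not present, nothing to disable"
    fi
}

# Distro RDMA userspace that replaces the MLNX builds from wave1.
INBOX_RDMA_PACKAGES="rdma-core ibverbs-providers ibverbs-utils"

install_inbox_rdma() {
    disable_apt_source doca-net.list
    apt-get update
    # Word splitting of the package list is intended here.
    apt-get install -y --no-install-recommends $INBOX_RDMA_PACKAGES
}
```

Renaming rather than deleting keeps the staged file around for debugging; removing it outright would work just as well for the purpose of keeping RDMA userspace on distro packages.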

nvidia-peermem (wave2 driver DKMS output) now builds against inbox ib_core. Key build/hardware signals: peermem DKMS compiles; ibv_devinfo shows data_direct on CX8; distro rdma-core exposes the data_direct verbs path + SHARP; NCCL perf parity.
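
One way to turn the signals above into a repeatable check is a small pass/fail wrapper, sketched here. `check_signal` is a hypothetical helper, not something in the PR, and the assumption that `ibv_devinfo -v` output contains the literal string `data_direct` is taken from the description rather than verified independently.

```shell
#!/bin/sh
# Sketch only: check_signal is a hypothetical helper, not part of the PR.

# Run a command quietly and report whether the named signal was observed.
check_signal() {
    signal_name="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS: ${signal_name}"
        return 0
    fi
    echo "FAIL: ${signal_name}"
    return 1
}

# Example invocations on a GB node (commented out: they need the real hardware).
# check_signal "nvidia-peermem module present" modinfo nvidia-peermem
# check_signal "data_direct reported" sh -c 'ibv_devinfo -v | grep -q data_direct'
```

NCCL perf parity is not covered by this sketch, since it needs a benchmark run and a baseline to compare against.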

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

@github-actions

Contributor

PR Title Lint Failed ❌

Current Title: EXPERIMENT: GB drop wave1 OFED for inbox RDMA + float kernel

Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:

Conventional Commits Format:

  • feat: add new feature - for new features
  • fix: resolve bug in component - for bug fixes
  • docs: update README - for documentation changes
  • refactor: improve code structure - for refactoring
  • test: add unit tests - for test additions
  • chore: remove dead code - for maintenance tasks
  • chore(deps): update dependencies - for updating dependencies
  • ci: update build pipeline - for CI/CD changes

Guidelines:

  • Use lowercase for the type and description
  • Keep the description concise but descriptive
  • Use imperative mood (e.g., "add" not "adds" or "added")
  • Don't end with a period

Examples:

  • ✅ feat(windows): add secure TLS bootstrapping for Windows nodes
  • ✅ fix: resolve kubelet certificate rotation issue
  • ✅ docs: update installation guide
  • ❌ Added new feature
  • ❌ Fix bug.
  • ❌ Update docs

Please update your PR title and the lint check will run again automatically.
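
For checking a title locally before pushing, a rough approximation of these rules is sketched below. The pattern is an assumption built from the types and guidelines listed in this comment; the repository's real lint may accept more types or apply different rules. `is_conventional_title` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: the pattern approximates the rules listed above; the real lint may differ.

# Succeeds when the title looks like "type(scope): description":
# a known lowercase type, optional scope, lowercase first letter, no trailing period.
is_conventional_title() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq \
        '^(feat|fix|docs|refactor|test|chore|ci)(\([a-z0-9-]+\))?: [a-z].*[^.]$'
}
```

With this pattern the current title is rejected because `EXPERIMENT` is not one of the listed types.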

Copilot AI left a comment

Contributor

Pull request overview

Experimental VHD builder changes for the NVIDIA GB Ubuntu 24.04 arm64 image to validate running on the kernel’s inbox RDMA stack (rdma-core + in-kernel mlx5/ib_core) instead of installing DOCA/OFED wave1 packages, and to float the linux-azure-nvidia kernel version.

Changes:

  • Removed DOCA/OFED (wave1) package pinning/config from the GB BOM and switched userspace RDMA to distro rdma-core tooling.
  • Updated the GB install flow to disable the staged DOCA apt source and skip openibd enablement.
  • Simplified kernel installation logic on Ubuntu 24.04 arm64 to install unpinned linux-azure-nvidia from the repo.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| vhdbuilder/packer/pre-install-dependencies.sh | Drops pinned/PPA/fallback kernel logic and installs unpinned linux-azure-nvidia for Ubuntu 24.04 arm64. |
| vhdbuilder/packer/install-dependencies.sh | Skips DOCA/OFED wave1; disables DOCA apt repo and installs distro RDMA userspace packages; removes openibd enable. |
| vhdbuilder/packer/gb-mai-bom.json | Removes wave1 + doca repo + kernel pin entries; documents experiment intent and keeps wave2/wave3 pinned sets. |

# EXPERIMENT: install the latest linux-azure-nvidia from the repo with no version pin (like
# the vanilla arm64 image), so GB floats the kernel onto the newest azure-nvidia kernel
# (verified to ship the Data Direct GPUDirect-RDMA support inbox in mlx5_ib).
apt-get update
Comment on lines 213 to +214
else
  apt-get update
  if apt-cache show "${NVIDIA_KERNEL_PACKAGE}" &> /dev/null; then
    echo "ARM64 image. Installing NVIDIA kernel and its packages alongside LTS kernel"
    wait_for_apt_locks
    sudo apt install --no-install-recommends -y "${NVIDIA_KERNEL_PACKAGE}"
    echo "after installation:"
    dpkg -l | grep "linux-.*-azure-nvidia" || true
  else
    echo "ARM64 image. NVIDIA kernel not available, skipping installation."
  fi
Build 169569810 failed: the staged doca-net.list points at the DOCA 'latest'
repo, which is signed with a key (DC726C5E41B9CC50) that is not in the shipped
keyring. Every apt-get update therefore emits GPG 'is not signed' W:/E: lines,
which the retry-wrapped apt_get_update helper treats as fatal. After 10 retries
it exits with 99 and the build fails before the GB block's rm of the list ever runs.
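
The failure mode can be illustrated with a simplified stand-in for a retry-wrapped update helper. This is not the repository's apt_get_update implementation; `update_with_retries` and its argument order are hypothetical, and the only behavior taken from the description is that W:/E: output counts as failure and that exhausting the retries ends in exit code 99.

```shell
#!/bin/sh
# Sketch only: a simplified stand-in for a retry-wrapped update helper, not the repo's code.

# Run "$@" up to $1 times. Any W: or E: line in the output counts as a failure,
# even when the command itself exits 0. Returns 99 once retries are exhausted.
update_with_retries() {
    max_retries="$1"
    shift
    attempt=1
    while [ "$attempt" -le "$max_retries" ]; do
        output="$("$@" 2>&1)" && status=0 || status=$?
        if [ "$status" -eq 0 ] && ! printf '%s\n' "$output" | grep -Eq '^(W|E):'; then
            return 0
        fi
        echo "attempt ${attempt}/${max_retries} failed" >&2
        attempt=$((attempt + 1))
    done
    return 99
}

# Real use would be something like: update_with_retries 10 apt-get update
```

Because the warning comes from the repo's signature and not from a transient network problem, every retry sees the same output, so retrying cannot recover.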

On main this is masked because the GB block replaces 'latest' with the pinned
doca/3.1.0 repo (valid key); this experiment dropped that replacement, exposing
the broken 'latest' repo.

Fix: gate off the DOCA repo staging in packer_source.sh entirely (we install no
OFED here - inbox mlx5/ib + distro rdma-core provide RDMA), so no apt-get update
ever sees the DOCA repo. Keep the rm in install-dependencies.sh as belt-and-suspenders.
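
A gate of this kind might be sketched as follows. The `STAGE_DOCA_REPO` flag, the function name, and the argument layout are hypothetical and not taken from packer_source.sh; the sketch only shows the idea of never copying the list into place when no OFED is installed.

```shell
#!/bin/sh
# Sketch only: STAGE_DOCA_REPO and the arguments are hypothetical, not names from packer_source.sh.

# Copy the DOCA apt source into place only when the image actually installs OFED.
stage_doca_repo() {
    list_src="$1"   # staged doca-net.list in the build payload
    dest_dir="$2"   # e.g. /etc/apt/sources.list.d
    if [ "${STAGE_DOCA_REPO:-false}" != "true" ]; then
        echo "skipping DOCA repo staging (inbox RDMA build)"
        return 0
    fi
    cp "$list_src" "${dest_dir}/doca-net.list"
    echo "staged DOCA repo"
}
```

Defaulting the flag to false means a build that forgets to set it gets the inbox behavior, which matches the intent of this experiment.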