feat: replaced vm runner with test gpu arc from cncf #3067
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: (none yet). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
This PR updates the GitHub Actions self-hosted runner configuration for GPU-based end-to-end testing, replacing the VM runner with a GPU ARC (Actions Runner Controller) runner from CNCF. This change is part of the ongoing GSoC project for GPU testing of LLM Blueprints.
- Updates the runner name from `oracle-vm-16cpu-a10gpu-240gb` to `oracle-vm-gpu-a10-1` for the main GPU E2E test job
- Aligns with the transition to using CNCF-provided GPU infrastructure with Actions Runner Controller
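For context, the change amounts to swapping the `runs-on` label in the workflow. A minimal sketch, assuming a job named `e2e-test-gpu` (the job name is illustrative; only the two runner labels come from this PR):

```yaml
jobs:
  e2e-test-gpu:
    # Before this PR: runs-on: oracle-vm-16cpu-a10gpu-240gb
    runs-on: oracle-vm-gpu-a10-1  # CNCF-provided GPU ARC runner
```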
Pull Request Test Coverage Report for Build 20776951787

💛 - Coveralls
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
.github/workflows/test-e2e-gpu.yaml
Outdated
```sh
  exit 0
else
  echo "Label found. Requesting environment approval to run GPU tests."
  echo "skip=false" >> $GITHUB_OUTPUT
```
Shall we disable this flag to always run this action since we have ARC now?
We can remove the restriction and run this on every PR.
That said, I’d suggest introducing some guardrails in the future (for example, based on file paths or directory structure) so it doesn’t run unnecessarily on any and every PR.
For now, let’s enable it for all PRs and observe how it performs.
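One possible shape for those future guardrails, assuming a paths-based trigger (the path list below is illustrative, not part of this PR):

```yaml
# Sketch: only trigger the GPU E2E workflow when files that can
# affect it change, instead of on every PR.
on:
  pull_request:
    paths:
      - "pkg/**"
      - "cmd/**"
      - "hack/e2e-setup-gpu-cluster.sh"
      - ".github/workflows/test-e2e-gpu.yaml"
```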
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
I see a few pieces of software we still have to install on ARC, since on the VM the bootstrap script handled the installation. I will review and update this PR by today.
.github/workflows/test-e2e-gpu.yaml
Outdated
```yaml
      path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/artifacts/*
      retention-days: 1

  delete-kind-cluster:
```
@jaiakash this delete-kind-cluster job is wrong. It starts a new environment (a new VM in this case) for nothing. You can move it as a step to the other job, or just delete it, because the VM will also be deleted in any case.
I think we can remove it and just follow the same workflow we use for CPU-based runners: https://github.com/kubeflow/trainer/blob/master/.github/workflows/test-e2e.yaml
As @koksay mentioned, the GPU Pod will be deleted after the action is complete.
Yeah, sure, that's not needed now.
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
.github/workflows/test-e2e-gpu.yaml
Outdated
```yaml
      - name: Install dependencies
        if: steps.check-label.outputs.skip == 'false'
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer
```
You can just set it once at the job level:

trainer/.github/workflows/test-e2e.yaml, lines 12 to 14 in 9e0093c:

```yaml
defaults:
  run:
    working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer
```
Sure, it was failing initially. I have fixed it in a recent commit.
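For reference, a sketch of what the fix looks like under that suggestion, mirroring the defaults block from the CPU workflow quoted above (the job name and make target are illustrative):

```yaml
jobs:
  e2e-test-gpu:
    runs-on: oracle-vm-gpu-a10-1
    defaults:
      run:
        # Applies to every run step in the job, so the per-step
        # working-directory lines can be dropped.
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer
    steps:
      - name: Install dependencies
        if: steps.check-label.outputs.skip == 'false'
        run: make test-e2e-setup  # illustrative target
```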
hack/e2e-setup-gpu-cluster.sh
Outdated
| echo "Install nvidia-ctk tool" | ||
| export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1 | ||
| sudo apt-get install -y \ | ||
| nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \ | ||
| nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \ | ||
| libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \ | ||
| libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION} |
Yes, that would be necessary for any CNCF project using a GPU runner.
Also, I am installing https://github.com/NVIDIA/nvkind as part of our action, since standard kind doesn't support GPUs; if possible, we could include that as part of the GPU runner as well.
Yes, I think we should install nvkind as well.
Yeah @koksay, sharing some context here:
- `nvidia-ctk` is already something that almost any project using GPU runners ends up installing as part of its CI workflow, so having it pre-installed in the runner image makes sense.
- We also install `nvkind` in our CI because standard kind does not support creating GPU-enabled clusters. While not every project may need GPU clusters, for those that do, having nvkind available reduces duplication.

Given this, would it make sense to add the above two to the GPU runner image?
https://github.com/cncf/automation/blob/main/ci/gha-runner-image/Dockerfile
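For reference, a minimal sketch of the two installs as CI steps today, which pre-baking them into the runner image would replace (the toolkit version pin matches the setup script above; the nvkind `go install` path is an assumption based on its repository layout):

```yaml
      - name: Install NVIDIA Container Toolkit
        run: |
          export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1
          sudo apt-get install -y \
            nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
            nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

      - name: Install nvkind
        run: |
          # Assumed install path for the nvkind CLI; standard kind
          # cannot create GPU-enabled clusters, nvkind wraps it.
          go install github.com/NVIDIA/nvkind/cmd/nvkind@latest
```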
@jaiakash that's not the file to update, but yeah I will add those.
@jaiakash the image is ready, can you re-run the action?
Sure, thanks a lot for adding both to the script itself.
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
should do it
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
What this PR does / why we need it:
This PR replaces the VM runner with a GPU-based ARC runner from CNCF (see cncf/automation#115). This is just for testing; once verified, the related issue will be fixed. Added them for reference.
Parent Issue: [GSoC] Project 7: GPU Testing for LLM Blueprints
Related:
Checklist: