Member

@jaiakash jaiakash commented Jan 5, 2026

What this PR does / why we need it:
This PR replaces the VM runner with the GPU-based ARC (Actions Runner Controller) runner from CNCF (see cncf/automation#115). This is only for testing; once it is verified, the related issue will be fixed. The related issues are added for reference.

Parent Issue: [GSoC] Project 7: GPU Testing for LLM Blueprints

Related:

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings January 5, 2026 20:25
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the GitHub Actions self-hosted runner configuration for GPU-based end-to-end testing by replacing the VM runner with a GPU ARC (Actions Runner Controller) runner from CNCF. This change is part of the ongoing GSoC project for GPU testing of LLM Blueprints.

  • Updates the runner name from oracle-vm-16cpu-a10gpu-240gb to oracle-vm-gpu-a10-1 for the main GPU E2E test job
  • Aligns with the transition to using CNCF-provided GPU infrastructure with Actions Runner Controller

coveralls commented Jan 5, 2026

Pull Request Test Coverage Report for Build 20776951787

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 51.435%

Totals
Change from base Build 20758373125: 0.0%
Covered Lines: 1237
Relevant Lines: 2405

💛 - Coveralls

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
  exit 0
else
  echo "Label found. Requesting environment approval to run GPU tests."
  echo "skip=false" >> $GITHUB_OUTPUT
Member

Should we disable this label check so that this action always runs, now that we have ARC?

Member Author

We can remove the restriction and run this on every PR.

That said, I’d suggest introducing some guardrails in the future (for example, based on file paths or directory structure) so it doesn’t run unnecessarily on any and every PR.

For now, let’s enable it for all PRs and observe how it performs.
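
A path-based guardrail of the kind suggested above could be sketched roughly as follows. This is only a minimal sketch, not the project's implementation: the function name `gpu_skip_decision` and the path patterns are hypothetical placeholders, and it reuses the `skip=true/false` output convention of the existing label-check step.

```shell
# Sketch: decide whether GPU E2E tests should run, based on changed paths.
# Reads changed file paths on stdin (one per line) and prints "skip=false"
# if any path looks GPU-relevant, otherwise "skip=true".
# NOTE: the path patterns below are hypothetical placeholders.
gpu_skip_decision() {
  local path
  local skip=true
  while IFS= read -r path; do
    case "$path" in
      pkg/*|cmd/*|test/e2e/*) skip=false ;;
    esac
  done
  echo "skip=${skip}"
}
```

In a workflow step, the input could come from something like `git diff --name-only "origin/${GITHUB_BASE_REF}...HEAD"` (assuming the base ref has been fetched), with the output appended to `"$GITHUB_OUTPUT"`. GitHub Actions also offers a built-in `paths` filter on the `pull_request` trigger, which may be simpler when the whole workflow should be skipped.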

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Jan 6, 2026
Member Author

jaiakash commented Jan 6, 2026

I see there is some software we still have to install on ARC, since on the VM a bootstrap script handled the installation:
https://github.com/kubeflow/trainer/blob/master/docs/proposals/2432-gpu-testing-on-llm-blueprints/OCI%20VM/bootstrap.sh

I will review and update this PR by today.

path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/artifacts/*
retention-days: 1

delete-kind-cluster:
Contributor

@jaiakash this delete-kind-cluster job is wrong. It starts a new environment (a new VM in this case), for nothing. You can move it as a step to the other job (or just delete it, because the VM will also be deleted in any case)

Member

I think we can remove it and just follow the same workflow we use for CPU-based runners: https://github.com/kubeflow/trainer/blob/master/.github/workflows/test-e2e.yaml
As @koksay mentioned, the GPU Pod will be deleted after the action completes.

Member Author

Yeah, sure, that's not needed now.

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

- name: Install dependencies
if: steps.check-label.outputs.skip == 'false'
working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer
Member

You can just set it once at the job level:

defaults:
  run:
    working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

Member Author

Sure, it was failing initially. I have fixed it in a recent commit.

Comment on lines 63 to 69
echo "Install nvidia-ctk tool"
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1
sudo apt-get install -y \
  nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
  libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
Member

Can we move the installation of those tools to the ARC runner, @koksay @jaiakash?

Member Author

@jaiakash jaiakash Jan 6, 2026

Yes, that would be necessary for any CNCF project using the GPU runner.
Also, I am installing https://github.com/NVIDIA/nvkind as part of our action because standard kind doesn't support GPUs. If possible, maybe we can include that in the GPU runner as well.

Member

Yes, I think we should install nvkind as well.

Member Author

@jaiakash jaiakash Jan 6, 2026

Yeah @koksay, sharing some context here:

  • nvidia-ctk is already something that almost any project using GPU runners ends up installing as part of their CI workflow, so having it pre-installed in the runner image makes sense.
  • We also install nvkind in our CI because standard kind does not support creating GPU-enabled clusters. While not every project may need GPU clusters, for those that do, having nvkind available reduces duplication.

Given this, would it make sense to add the above two to the GPU runner image?
https://github.com/cncf/automation/blob/main/ci/gha-runner-image/Dockerfile
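
Once the two tools are baked into the runner image, a workflow could fail fast if either is missing instead of installing them itself. The following is a minimal sketch under that assumption; `missing_tools` and `check_gpu_runner_tools` are hypothetical helper names, and only the tool names `nvidia-ctk` and `nvkind` come from the discussion above.

```shell
# Hypothetical helper: print the name of each tool that is not found
# on PATH, one per line. Prints nothing if all tools are present.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Hypothetical preflight check for a CI step: returns non-zero when the
# runner image lacks one of the tools discussed in this thread.
check_gpu_runner_tools() {
  local missing
  missing="$(missing_tools nvidia-ctk nvkind)"
  if [ -n "$missing" ]; then
    printf 'Missing tools on runner:\n%s\n' "$missing" >&2
    return 1
  fi
  echo "All GPU runner tools found."
}
```

Calling `check_gpu_runner_tools` as the first command of a step would surface a missing tool immediately, rather than later as an unrelated-looking failure during cluster creation.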

Contributor

@jaiakash that's not the file to update, but yeah I will add those.

Contributor

@jaiakash the image is ready, can you re-run the action?

Member Author

@jaiakash jaiakash Jan 7, 2026

Sure, thanks a lot for adding both to the script itself.

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Contributor

koksay commented Jan 7, 2026

@jaiakash

sudo cp "$(sudo go env GOPATH)/bin/nvkind" /usr/local/bin/nvkind

should do it
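
The one-liner above copies the `nvkind` binary out of root's Go bin directory into a directory on `PATH`. A slightly more defensive version of the same step might look like the sketch below. `install_from_gopath` is a hypothetical helper, not part of the PR; it takes the GOPATH as an argument so that it can be tried without Go installed.

```shell
# Hypothetical helper: copy a Go-installed binary from <gopath>/bin
# into a destination directory, failing loudly if the binary is absent.
# Usage: install_from_gopath <binary-name> <gopath> <dest-dir>
install_from_gopath() {
  local name="$1"
  local gopath="$2"
  local dest="$3"
  local src="${gopath}/bin/${name}"

  if [ ! -f "$src" ]; then
    echo "Binary not found: ${src}" >&2
    return 1
  fi
  cp "$src" "${dest}/${name}"
}
```

Copying into `/usr/local/bin` normally requires root, which is why the suggested command runs `cp` under `sudo`.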

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
4 participants