feat: replaced vm runner with test gpu arc from cncf #3067
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: (none yet). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
This PR updates the GitHub Actions self-hosted runner configuration for GPU-based end-to-end testing, replacing the VM runner with a GPU ARC (Actions Runner Controller) runner from CNCF. This change is part of the ongoing GSoC project for GPU testing of LLM Blueprints.
- Updates the runner name from `oracle-vm-16cpu-a10gpu-240gb` to `oracle-vm-gpu-a10-1` for the main GPU E2E test job
- Aligns with the transition to using CNCF-provided GPU infrastructure with Actions Runner Controller
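For context, the change amounts to swapping the `runs-on` label in the workflow. A minimal sketch, assuming a job named `e2e-test-gpu` (the job name is illustrative; only the two runner labels come from this PR):

```yaml
jobs:
  e2e-test-gpu:
    # Before this PR: runs-on: oracle-vm-16cpu-a10gpu-240gb
    runs-on: oracle-vm-gpu-a10-1  # CNCF-provided GPU ARC runner
```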
Pull Request Test Coverage Report for Build 20776951787

💛 - Coveralls
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
.github/workflows/test-e2e-gpu.yaml
Outdated
```sh
  exit 0
else
  echo "Label found. Requesting environment approval to run GPU tests."
  echo "skip=false" >> $GITHUB_OUTPUT
```
Shall we disable this flag to always run this action since we have ARC now?
We can remove the restriction and run this on every PR.
That said, I’d suggest introducing some guardrails in the future (for example, based on file paths or directory structure) so it doesn’t run unnecessarily on any and every PR.
For now, let’s enable it for all PRs and observe how it performs.
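One possible shape for those future guardrails, assuming a paths-based trigger (the path list below is illustrative, not part of this PR):

```yaml
# Sketch: only trigger the GPU E2E workflow when files that can
# affect it change, instead of on every PR.
on:
  pull_request:
    paths:
      - "pkg/**"
      - "cmd/**"
      - "hack/e2e-setup-gpu-cluster.sh"
      - ".github/workflows/test-e2e-gpu.yaml"
```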
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
I see a few pieces of software we still have to install on ARC, since on the VM the bootstrap script handled the installation. I will review and update this PR by today.
.github/workflows/test-e2e-gpu.yaml
Outdated
```yaml
      path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/artifacts/*
      retention-days: 1

  delete-kind-cluster:
```
@jaiakash this delete-kind-cluster job is wrong. It starts a new environment (a new VM in this case) for nothing. You can move it as a step to the other job, or just delete it, because the VM will also be deleted in any case.
I think we can remove it and just follow the same workflow we use for CPU-based runners: https://github.com/kubeflow/trainer/blob/master/.github/workflows/test-e2e.yaml
As @koksay mentioned, the GPU Pod will be deleted after the action is complete.
Yeah, sure, that's not needed now.
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
.github/workflows/test-e2e-gpu.yaml
Outdated
```yaml
      - name: Install dependencies
        if: steps.check-label.outputs.skip == 'false'
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer
```
You can just set it once at the job level:

trainer/.github/workflows/test-e2e.yaml, lines 12 to 14 in 9e0093c:

```yaml
defaults:
  run:
    working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer
```
Sure, it was failing initially. I have fixed it in a recent commit.
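For reference, a sketch of what the fix looks like under that suggestion, mirroring the defaults block from the CPU workflow quoted above (the job name and make target are illustrative):

```yaml
jobs:
  e2e-test-gpu:
    runs-on: oracle-vm-gpu-a10-1
    defaults:
      run:
        # Applies to every run step in the job, so the per-step
        # working-directory lines can be dropped.
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer
    steps:
      - name: Install dependencies
        if: steps.check-label.outputs.skip == 'false'
        run: make test-e2e-setup  # illustrative target
```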
hack/e2e-setup-gpu-cluster.sh
Outdated
| echo "Install nvidia-ctk tool" | ||
| export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1 | ||
| sudo apt-get install -y \ | ||
| nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \ | ||
| nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \ | ||
| libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \ | ||
| libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION} |
Yes, that would be necessary for any CNCF project using a GPU runner.
Also, I am installing https://github.com/NVIDIA/nvkind as part of our action, since standard kind doesn't support GPUs; if possible, we could include that as part of the GPU runner as well.
Yes, I think we should install nvkind as well.
Yeah @koksay, sharing some context here:
- `nvidia-ctk` is already something that almost any project using GPU runners ends up installing as part of its CI workflow, so having it pre-installed in the runner image makes sense.
- We also install `nvkind` in our CI because standard kind does not support creating GPU-enabled clusters. While not every project may need GPU clusters, for those that do, having nvkind available reduces duplication.

Given this, would it make sense to add the above two to the GPU runner image?
https://github.com/cncf/automation/blob/main/ci/gha-runner-image/Dockerfile
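For reference, a minimal sketch of the two installs as CI steps today, which pre-baking them into the runner image would replace (the toolkit version pin matches the setup script above; the nvkind `go install` path is an assumption based on its repository layout):

```yaml
      - name: Install NVIDIA Container Toolkit
        run: |
          export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1
          sudo apt-get install -y \
            nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
            nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

      - name: Install nvkind
        run: |
          # Assumed install path for the nvkind CLI; standard kind
          # cannot create GPU-enabled clusters, nvkind wraps it.
          go install github.com/NVIDIA/nvkind/cmd/nvkind@latest
```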
@jaiakash that's not the file to update, but yeah I will add those.
@jaiakash the image is ready, can you re-run the action?
Sure, thanks a lot for adding both to the script itself.
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
should do it
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
What this PR does / why we need it:
This PR replaces the VM runner with a GPU-based ARC runner from CNCF (see cncf/automation#115). This is just for testing; once verified, the related issue will be fixed. Added them for reference.
Parent Issue: [GSoC] Project 7: GPU Testing for LLM Blueprints
Related:
Checklist: