Merged

30 commits
bcc35bf
test: gpu based ARC from cncf instead of standalone vm
jaiakash Jan 5, 2026
f392c21
fix: removed label for gpu e2e test
jaiakash Jan 6, 2026
2959582
fix: remove delete cluster action
jaiakash Jan 6, 2026
d20c893
fix: rm separate script, rm path and nvidia smi command for testing
jaiakash Jan 7, 2026
71b83ff
fix: using nvkind with sudo
jaiakash Jan 7, 2026
1fb948c
fix: move nvkind from noexec to local/bin
jaiakash Jan 7, 2026
ec90b13
tmp: install nvkind, as the one from ARC is not working
jaiakash Jan 7, 2026
1c9cf3a
fix: run the commands as sudo (check https://github.com/NVIDIA/nvkind…
jaiakash Jan 7, 2026
c17e0c0
fix: nvkind as sudo
jaiakash Jan 7, 2026
94a0da0
test: ignore the patch error
jaiakash Jan 13, 2026
f15ca1c
fix: downgrade the version for nvidia ctk
jaiakash Jan 16, 2026
2983295
add: service restart
jaiakash Jan 16, 2026
fa261f1
fix: patch version for nctk
jaiakash Jan 16, 2026
5fee385
fix: command
jaiakash Jan 16, 2026
62cd2fb
refactor: split into different script
jaiakash Jan 16, 2026
04edc60
chore: transfer to legacy mode
jaiakash Jan 16, 2026
d93b29d
fix: legacy mode for nctk
jaiakash Jan 16, 2026
a2106ef
chore: downgrade nctk
jaiakash Jan 16, 2026
c300fa7
fix: non root kubectl
jaiakash Jan 16, 2026
0f81f11
fix: helm dirs
jaiakash Jan 16, 2026
db14744
chore: refactored the code
jaiakash Jan 16, 2026
27eedb9
chore: rm separate script, migrated to prod gpu arc, and other misc
jaiakash Jan 22, 2026
facda40
chore: rm sudo for kind
jaiakash Jan 22, 2026
cd59f0c
add: wait for qwen to complete
jaiakash Jan 22, 2026
36ec784
chore: fix qwen nb and fix gpu operator to use host driver
jaiakash Jan 23, 2026
e27657c
test: use default value for ntk toolkit
jaiakash Feb 8, 2026
ef72d9e
hotfix: patch CRDs to run on GPU nodes (Check #3067)
jaiakash Feb 8, 2026
f43c8a0
fix: the patching to model and initializers
jaiakash Feb 8, 2026
f742201
fix: single node for gpu e2e test
jaiakash Feb 8, 2026
45af997
revert: jax eg
jaiakash Feb 8, 2026
38 changes: 6 additions & 32 deletions .github/workflows/test-e2e-gpu.yaml
@@ -11,7 +11,9 @@ permissions:
jobs:
gpu-e2e-test:
name: GPU E2E Test
runs-on: oracle-vm-16cpu-a10gpu-240gb
runs-on:
labels: oracle-vm-gpu-a10-1
group: GPUs

env:
GOPATH: ${{ github.workspace }}/go
@@ -26,54 +28,36 @@ jobs:
kubernetes-version: ["1.33.1"]

steps:
- name: Check GPU label
id: check-label
run: |
if [[ "${{ join(github.event.pull_request.labels.*.name, ',') }}" != *"ok-to-test-gpu-runner"* ]]; then
echo "✅ Skipping GPU E2E tests (label not present)."
echo "skip=true" >> $GITHUB_OUTPUT
exit 0
else
echo "Label found. Requesting environment approval to run GPU tests."
echo "skip=false" >> $GITHUB_OUTPUT
fi

- name: Check out code
if: steps.check-label.outputs.skip == 'false'
uses: actions/checkout@v6
with:
ref: ${{ github.event.pull_request.head.sha }}
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

- name: Setup Go
if: steps.check-label.outputs.skip == 'false'
uses: actions/setup-go@v6
with:
go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/go.mod

- name: Setup Python
if: steps.check-label.outputs.skip == 'false'
uses: actions/setup-python@v6
with:
python-version: 3.11

- name: Install dependencies
if: steps.check-label.outputs.skip == 'false'
run: |
pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5
pip install git+https://github.com/kubeflow/sdk.git@main

- name: Setup cluster with GPU support using nvidia/kind
if: steps.check-label.outputs.skip == 'false'
- name: Setup GPU cluster with nvkind
run: |
make test-e2e-setup-gpu-cluster K8S_VERSION=${{ matrix.kubernetes-version }}

- name: Run e2e test on GPU cluster
if: steps.check-label.outputs.skip == 'false'
run: |
mkdir -p artifacts/notebooks
make test-e2e-notebook NOTEBOOK_INPUT=./examples/torchtune/qwen2_5/qwen2.5-1.5B-with-alpaca.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_qwen2_5_with_alpaca-trainjob-yaml.ipynb TIMEOUT=900
make test-e2e-notebook NOTEBOOK_INPUT=./examples/jax/image-classification/mnist.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_jax_mnist.ipynb PAPERMILL_PARAMS="-p num_cpu 3 -p num_gpu 1" TIMEOUT=600
make test-e2e-notebook NOTEBOOK_INPUT=./examples/torchtune/qwen2_5/qwen2.5-1.5B-with-alpaca.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_qwen2_5_with_alpaca-trainjob-yaml.ipynb TIMEOUT=600
make test-e2e-notebook NOTEBOOK_INPUT=./examples/jax/image-classification/mnist.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_jax_mnist.ipynb PAPERMILL_PARAMS="-p num_cpu 8 -p num_gpu 1 -p num_nodes 1" TIMEOUT=600

- name: Upload Artifacts to GitHub
if: always()
@@ -82,13 +66,3 @@ jobs:
name: ${{ matrix.kubernetes-version }}
path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/artifacts/*
retention-days: 1

delete-kind-cluster:
name: Delete kind Cluster
runs-on: oracle-vm-16cpu-a10gpu-240gb
needs: [gpu-e2e-test]
if: always()
steps:
- name: Delete any existing kind cluster
run: |
sudo kind delete cluster --name kind-gpu && echo "kind cluster has been deleted" || echo "kind cluster doesn't exist"
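The `ok-to-test-gpu-runner` gate in the workflow above can be exercised offline with a minimal shell sketch. The label list here is hypothetical, and `$labels` stands in for the workflow expression `join(github.event.pull_request.labels.*.name, ',')`:

```shell
# Offline sketch of the workflow's PR-label gate (hypothetical label list).
# "$labels" stands in for join(github.event.pull_request.labels.*.name, ',').
labels="lgtm,ok-to-test-gpu-runner"
case "$labels" in
  *ok-to-test-gpu-runner*) decision="run" ;;
  *)                       decision="skip" ;;
esac
echo "$decision"   # prints "run" for this label list
```

Substring matching on the joined label string is exactly what the workflow's `[[ ... != *"ok-to-test-gpu-runner"* ]]` test does, so this mirrors the skip/run decision without needing a live PR.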
1 change: 1 addition & 0 deletions cmd/trainers/torchtune/requirements.txt
@@ -1,3 +1,4 @@
torchao>=0.9.0
torchtune==0.6.1
bitsandbytes>=0.41.1
kagglehub>=0.4.0
@jaiakash (Member, Author) Feb 8, 2026
We need to specifically add kagglehub 0.4.0, as there is a breaking change in the library.
Ref: Kaggle/kagglehub#268

Error:

Step: node-0, Status: Running, Devices: gpu x 1

Traceback (most recent call last):
  File "/opt/conda/bin/tune", line 3, in <module>
    from torchtune._cli.tune import main
  File "/opt/conda/lib/python3.11/site-packages/torchtune/_cli/tune.py", line 12, in <module>
    from torchtune._cli.download import Download
  File "/opt/conda/lib/python3.11/site-packages/torchtune/_cli/download.py", line 22, in <module>
    from kagglehub.auth import set_kaggle_credentials
ImportError: cannot import name 'set_kaggle_credentials' from 'kagglehub.auth' (/opt/conda/lib/python3.11/site-packages/kagglehub/auth.py)
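The API drift behind this pin can be sketched with a defensive import guard (a hypothetical sketch, not part of this PR): some kagglehub releases no longer expose `kagglehub.auth.set_kaggle_credentials`, which torchtune's CLI imports unconditionally, producing the traceback above.

```python
# Hypothetical guard for the API drift this pin works around: torchtune's CLI
# imports kagglehub.auth.set_kaggle_credentials unconditionally, so a release
# without that symbol crashes at import time (see the traceback above).
try:
    from kagglehub.auth import set_kaggle_credentials  # legacy API
except ImportError:  # kagglehub absent, or the symbol was removed
    set_kaggle_credentials = None

def kaggle_auth_available() -> bool:
    """True only when the legacy credential helper can actually be called."""
    return set_kaggle_credentials is not None
```

Pinning the requirement, as this diff does, is the more robust fix, since the guard would only defer the failure to the point where Kaggle credentials are needed.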

11 changes: 6 additions & 5 deletions examples/jax/image-classification/mnist.ipynb
@@ -26,7 +26,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [
{
@@ -315,7 +315,7 @@
},
{
"cell_type": "code",
"execution_count": 44,
"execution_count": null,
"metadata": {
"editable": true,
"slideshow": {
@@ -329,12 +329,13 @@
"source": [
"#parameters\n",
"num_cpu=3\n",
"num_gpu=0"
"num_gpu=0\n",
"num_nodes=3"
]
},
{
"cell_type": "code",
"execution_count": 45,
"execution_count": null,
"metadata": {
"editable": true,
"execution": {
@@ -369,7 +370,7 @@
" trainer=CustomTrainer(\n",
" func=jax_train_mnist,\n",
" # Set how many JAX nodes you want to use for distributed training.\n",
" num_nodes=3,\n",
" num_nodes=num_nodes,\n",
" resources_per_node=resources_per_node,\n",
" ),\n",
" runtime=\"jax-distributed\",\n",
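The notebook change above moves `num_nodes` into the tagged parameters cell so papermill can override it. Conceptually, papermill takes the cell's defaults and applies each `-p name value` flag on top before executing the notebook; a dependency-free sketch of that merge (names taken from mnist.ipynb, the merge function itself is hypothetical):

```python
# Sketch of papermill's parameter injection: the notebook's tagged
# "parameters" cell supplies defaults, and each "-p name value" flag
# overrides one of them before the notebook is executed.
defaults = {"num_cpu": 3, "num_gpu": 0, "num_nodes": 3}  # from mnist.ipynb

def apply_papermill_params(defaults: dict, overrides: dict) -> dict:
    """Merge CLI-supplied overrides over the notebook's default parameters."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged

# Mirrors the workflow's PAPERMILL_PARAMS="-p num_cpu 8 -p num_gpu 1 -p num_nodes 1"
params = apply_papermill_params(defaults, {"num_cpu": 8, "num_gpu": 1, "num_nodes": 1})
```

This is why hardcoding `num_nodes=3` inside `CustomTrainer` had to go: only values read from the parameters cell can be overridden per-run, which the single-GPU CI job relies on to force `num_nodes=1`.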