Migrate A100 CUDA CI jobs to OSDC runners #20212

Merged
Gasoonjia merged 4 commits into main from huydo/cuda-ci-a100-osdc on Jun 11, 2026

@huydhn huydhn commented Jun 11, 2026

Contributor

Moves the A100-dependent CUDA CI jobs from pytorch/test-infra linux_job_v2 (AWS) to linux_job_v3 (OSDC/ARC), and remaps their runner labels per pytorch/.github/arc.yaml.

Migrated jobs (now on OSDC / linux_job_v3)

  • cuda.yml: export-model-cuda-artifact, test-model-cuda-e2e
  • cuda-perf.yml: export-models, benchmark-cuda

Runner label mapping

| AWS label | OSDC label |
| --- | --- |
| linux.aws.a100 | mt-l-x86iavx512-11-125-a100 |
| linux.g5.4xlarge.nvidia.gpu (A10G fallback branch) | mt-l-x86aavx2-29-113-a10g |

The A10G fallback branch in each conditional runner expression had to move to an OSDC label too, since linux_job_v3 requires ARC labels and that branch belongs to the same A100-dependent jobs.
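The remapping described above can be sketched in Python. This is only an illustration: the two label pairs come from the table in this description, but the helper names and the `needs_a100` flag are hypothetical, and the actual workflow condition that chooses between the A100 and A10G branches is not shown in this PR.

```python
# Label pairs copied from the PR's mapping table.
AWS_TO_OSDC = {
    "linux.aws.a100": "mt-l-x86iavx512-11-125-a100",
    "linux.g5.4xlarge.nvidia.gpu": "mt-l-x86aavx2-29-113-a10g",
}


def remap_runner(label: str, migrated: bool) -> str:
    """Return the OSDC label for a migrated job, else the label unchanged."""
    if not migrated:
        return label
    try:
        return AWS_TO_OSDC[label]
    except KeyError:
        raise ValueError(f"no OSDC mapping for runner label {label!r}") from None


def pick_runner(needs_a100: bool, migrated: bool = True) -> str:
    """Mimic a conditional runner expression with an A100 branch and an A10G fallback.

    `needs_a100` is a hypothetical stand-in for the real workflow condition.
    """
    aws_label = "linux.aws.a100" if needs_a100 else "linux.g5.4xlarge.nvidia.gpu"
    return remap_runner(aws_label, migrated)
```

Both branches of a migrated job resolve to OSDC labels, while a job left on linux_job_v2 (`migrated=False`) keeps its AWS label.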

Left unchanged

Jobs that never run on A100 stay on linux_job_v2 / linux.g5.4xlarge.nvidia.gpu: test-cuda-builds, test-models-cuda, unittest-cuda, test-cuda-pybind.

linux_job_v3 resolves the docker image and --gpus all identically to v2 for these jobs (none set docker-image), so build/runtime behavior is unchanged.

Authored with Claude Code.

Move the A100-dependent jobs in cuda.yml (export-model-cuda-artifact,
test-model-cuda-e2e) and cuda-perf.yml (export-models, benchmark-cuda)
from pytorch/test-infra linux_job_v2 (AWS) to linux_job_v3 (OSDC/ARC).
Runner labels are remapped per pytorch/.github/arc.yaml:
linux.aws.a100 -> mt-l-x86iavx512-11-125-a100 and the A10G fallback
linux.g5.4xlarge.nvidia.gpu -> mt-l-x86aavx2-29-113-a10g.

Jobs that never run on A100 stay on linux_job_v2 / linux.g5.4xlarge.nvidia.gpu.

Authored with Claude Code.

pytorch-bot Bot commented Jun 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20212

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 36 Pending

As of commit 1b44041 with merge base 129c687 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 11, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

OSDC runners can't reach the public PyPI CDN that download.pytorch.org's
transitive deps resolve to, so the torch install in install_requirements.py
fails fetching e.g. sympy from files.pythonhosted.org. Pre-install torch's
pure-python deps from the in-cluster pypi-cache and clear PIP_EXTRA_INDEX_URL
in the four migrated CUDA jobs, mirroring the torchtitan/ao OSDC workaround.

Authored with Claude Code.
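A rough sketch of the workaround in the commit message above, for illustration only. The cache URL is a placeholder (the PR does not give the real in-cluster address), the dependency list contains only sympy because that is the one package this commit names, and `build_preinstall` is a hypothetical helper rather than code from the workflows.

```python
import os
import sys

# Placeholder index URL; the real in-cluster pypi-cache address is not given in the PR.
PYPI_CACHE_URL = "https://pypi-cache.example.internal/simple"

# Only sympy is named in this commit; the full pre-install list is not shown.
PURE_PYTHON_DEPS = ["sympy"]


def build_preinstall(deps, index_url=PYPI_CACHE_URL, base_env=None):
    """Build a pip command and environment that resolve only from the cache index."""
    env = dict(os.environ if base_env is None else base_env)
    # Clear the extra index so pip does not fall back to an unreachable host.
    env.pop("PIP_EXTRA_INDEX_URL", None)
    cmd = [sys.executable, "-m", "pip", "install", "--index-url", index_url, *deps]
    return cmd, env
```

The returned pair could be passed to `subprocess.run(cmd, env=env, check=True)` before the torch install step, so that torch's pure-python dependencies are already satisfied when it runs.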
@huydhn huydhn requested a review from Gasoonjia June 11, 2026 01:36
@huydhn huydhn marked this pull request as ready for review June 11, 2026 01:54
The example-deps install (torchvision==0.27.0 torchaudio==2.11.0) pulls
pillow, which still resolved from files.pythonhosted.org and failed on
OSDC. Add pillow to the pre-installed pure-python deps, matching the
torchao OSDC list.

Authored with Claude Code.
@huydhn huydhn temporarily deployed to upload-benchmark-results June 11, 2026 03:38 — with GitHub Actions Inactive
The examples install pulls datasets==3.6.0, which pins
fsspec[http]<=2025.3.0. The unpinned pre-installed fsspec was newer, so
pip tried to downgrade it via download.pytorch.org's pythonhosted link,
which OSDC can't reach. Pre-install fsspec at <=2025.3.0 so only-if-needed
leaves it in place.

Authored with Claude Code.
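The downgrade described in the commit message above comes down to a version-bound check, sketched below. The comparison handles only plain dotted numeric versions (it is not a PEP 440 implementation), the helper names are hypothetical, and the "newer" version used in the usage note is an arbitrary example, since the PR does not say which fsspec version was actually installed.

```python
def parse_version(text, width=3):
    """Parse a dotted numeric version such as '2025.3.0' into a padded tuple."""
    parts = [int(piece) for piece in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies_upper_bound(installed, upper):
    """True if the installed version already meets a '<= upper' pin."""
    return parse_version(installed) <= parse_version(upper)


def needs_reinstall(installed, upper="2025.3.0"):
    """True if pip would have to replace the installed package to meet the pin."""
    return not satisfies_upper_bound(installed, upper)
```

Under these assumptions, `needs_reinstall("2025.5.1")` is true, which corresponds to the failing case where pip tries to fetch an older fsspec. `needs_reinstall("2025.3.0")` is false, so with the only-if-needed upgrade strategy pip leaves the pre-installed copy alone.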
@huydhn huydhn temporarily deployed to upload-benchmark-results June 11, 2026 17:53 — with GitHub Actions Inactive
@Gasoonjia Gasoonjia merged commit e7c5415 into main Jun 11, 2026
304 of 307 checks passed
@Gasoonjia Gasoonjia deleted the huydo/cuda-ci-a100-osdc branch June 11, 2026 18:00
Labels

ciflow/cuda, CLA Signed, topic: not user facing
