feat: add production-ready MNIST example for PyTorch #3063

Snehadas2005 · 2026-01-03T02:15:52Z

What this PR does / why we need it:
This PR refactors and updates the existing PyTorch Jupyter Notebooks to fully support the Kubeflow Trainer V2 SDK. These updates transition the notebooks from legacy patterns to production-ready workflows that are compatible with the latest SDK features and cross-platform environments.

Updated Workflows:

Image Classification (mnist.ipynb): Refactored to demonstrate the official V2 DDP training workflow on the Fashion MNIST dataset.
Question Answering (fine-tune-distilbert.ipynb): Updated to demonstrate fine-tuning with Hugging Face integration, including critical fixes for offset mapping, Fast Tokenizers, and Accelerate backend requirements.
Speech Recognition (speech-recognition.ipynb): Added a new transformer-based workflow for audio classification. Implements custom audio preprocessing/sampling using torchaudio and soundfile with native DDP support and multi-environment scaling.

Key Improvements:

SDK V2 Native: Migrated all training logic to use TrainerClient and CustomTrainer.
Windows & Local Compatibility: Standardized the distributed backend to gloo to ensure notebooks run successfully on Windows machines and local environments.
Robust Verification Logic: Implemented a KUBEFLOW_TRAINER_TEST environment flag system. For LLM and Audio tasks, this now uses max_steps=1 and a tiny data subset to allow for instant logic verification without requiring high-compute resources.
Unified Execution Model: Each notebook now provides three clear, tested paths for users:
Direct Python: Quick kernel-level experimentation.
SDK Local: Isolated environment verification using LocalProcessBackendConfig.
Cluster Scaling: Distributed execution on Kubernetes with num_nodes scaling.
Environment Stability: Added explicit dependency checks for accelerate, torchaudio, and tokenizers to prevent common runtime errors in notebook environments.
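
For reviewers, the verification flag described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the `RunSettings` and `resolve_settings` names, the accepted truthy spellings, and the subset size of 32 are all assumptions made for the example. Only the `KUBEFLOW_TRAINER_TEST` variable, `max_steps=1`, and the `gloo` backend come from this description.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunSettings:
    """Hypothetical container for the knobs the test flag changes."""
    max_steps: Optional[int]    # None means "no step cap"
    subset_size: Optional[int]  # None means "use the full dataset"
    backend: str                # distributed backend name


def resolve_settings(env: Mapping[str, str] = os.environ) -> RunSettings:
    """Pick training settings based on the KUBEFLOW_TRAINER_TEST flag."""
    # Assumption: the flag is treated as enabled for common truthy spellings.
    raw = env.get("KUBEFLOW_TRAINER_TEST", "")
    test_mode = raw.strip().lower() in {"1", "true", "yes"}

    if test_mode:
        # One optimizer step on a tiny subset (size 32 is illustrative only).
        return RunSettings(max_steps=1, subset_size=32, backend="gloo")

    # Full run: no step cap, full dataset, same gloo backend.
    return RunSettings(max_steps=None, subset_size=None, backend="gloo")


if __name__ == "__main__":
    print(resolve_settings())
```

Passing the environment in as a mapping keeps the helper easy to check without mutating `os.environ`; a training function could then hand `backend` to its process-group setup and `max_steps` to its trainer arguments.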

Which issue(s) this PR fixes:
Fixes #3062
Fixes #2040
PR: #2830

Checklist:

Docs included if any changes are user facing (Updated PyTorch README)

google-oss-prow · 2026-01-03T02:15:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jaiakash · 2026-01-03T18:21:52Z

Thanks for raising this, @Snehadas2005.

I see it's still a draft PR, but here are a few minor suggestions that will help you.

Please use Jupyter notebook (.ipynb) files for examples instead of standard Python files (.py). You can check out these example PRs (feat: qwen 2.5 1.5b runtime, example and fix gpu e2e test #2835, feat(runtimes): Support Distributed MLX on CUDA #2790) as references.
Your commits are not signed, which is why DCO is failing. Check this for more info on how to sign your current commits and any future commits.

Happy contributing.

Snehadas2005 · 2026-01-04T03:50:41Z

Thank you so much, @jaiakash, for the detailed feedback and references. I really appreciate it.

That makes sense. I will convert the example into a Jupyter notebook and align it with the existing example patterns you shared, focusing on clarity and readability for data scientists.

I also appreciate the note on DCO signing. I will fix the commit signatures and ensure all future commits are properly signed.

Thanks again for the guidance, happy to iterate further and adjust based on feedback from the team.

* feat: kep for flux hpc (2841)

This KEP proposes adding an hpcPolicy to support Flux Framework and (in the future) other workload managers that provide more traditional HPC features.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* review: see updates below.

Changed crd examples to reflect documentation
removed tasks from definition - can go in settings
removed mentions of minicluster out of context
specified train image instead of custom logic
added user stories

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* feat: flux policy

Update the KEP to define a FluxMLPolicySource that exposes attributes specific to Flux.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* review: add details of cm and init container

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

---------

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Co-authored-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

review-notebook-app · 2026-01-04T05:13:18Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

google-oss-prow bot requested review from jinchihe and kuizhiqing January 3, 2026 02:15

google-oss-prow bot added the size/L label Jan 3, 2026

Snehadas2005 marked this pull request as draft January 3, 2026 02:25

google-oss-prow bot added the do-not-merge/work-in-progress label Jan 3, 2026

vsoch and others added 3 commits January 4, 2026 10:00

feat: add PyTorch MNIST training example with Kubeflow Trainer integration

c96e5d8

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

docs: add README for PyTorch examples with Kubeflow Trainer SDK

a1f1c75

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Snehadas2005 force-pushed the master branch from 7dc39c6 to a1f1c75 Compare January 4, 2026 04:42

Snehadas2005 added 2 commits January 4, 2026 10:17

removed train_mnist.py

fedc0b1

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

update MNIST notebook with local and Kubernetes training options

0a062d9

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

google-oss-prow bot added size/XL and removed size/L labels Jan 4, 2026

Snehadas2005 added 2 commits January 4, 2026 10:54

update MNIST notebook

4a5ca81

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

update question-answering fine-tuning and README.md file based on that

c004660

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

google-oss-prow bot added size/XXL and removed size/XL labels Jan 6, 2026

Snehadas2005 added 2 commits January 6, 2026 10:23

update question-answering fine-tuning and README.md file based on that

da6daac

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Merge branch 'master' of https://github.com/Snehadas2005/trainer

e4ebbb3

Snehadas2005 marked this pull request as ready for review January 6, 2026 05:12

google-oss-prow bot removed the do-not-merge/work-in-progress label Jan 6, 2026

Snehadas2005 added 2 commits January 8, 2026 21:53

feat: add Speech Recognition PyTorch

b649de1

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

updated README.md file

c96f037

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Snehadas2005 mentioned this pull request Jan 8, 2026

chore: Add Speech Recognition with DDP Example #2830

Open
