@Snehadas2005

Snehadas2005 commented Jan 3, 2026

What this PR does / why we need it:
This PR refactors and updates the existing PyTorch Jupyter Notebooks to fully support the Kubeflow Trainer V2 SDK. These updates transition the notebooks from legacy patterns to production-ready workflows that are compatible with the latest SDK features and cross-platform environments.

Updated Workflows:

  • Image Classification (mnist.ipynb): Refactored to demonstrate the official V2 DDP training workflow on the Fashion MNIST dataset.
  • Question Answering (fine-tune-distilbert.ipynb): Updated to demonstrate fine-tuning with Hugging Face integration, including critical fixes for offset mapping, Fast Tokenizers, and Accelerate backend requirements.
  • Speech Recognition (speech-recognition.ipynb): Added a new transformer-based workflow for audio classification. Implements custom audio preprocessing/sampling using torchaudio and soundfile with native DDP support and multi-environment scaling.
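
For readers unfamiliar with the offset-mapping step mentioned in the Question Answering item, here is a minimal sketch of the general technique: converting a character-level answer span into token indices using the `(start, end)` character offsets a fast tokenizer can return. The helper name `char_span_to_token_span` and the hard-coded offsets are hypothetical and for illustration only. This is not necessarily the logic used in `fine-tune-distilbert.ipynb`.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(
    offsets: List[Tuple[int, int]],
    answer_start: int,
    answer_end: int,
) -> Optional[Tuple[int, int]]:
    """Map the character span [answer_start, answer_end) to token indices.

    `offsets` holds one (char_start, char_end) pair per token. Tokens with
    an empty span, such as (0, 0), are treated as special tokens and skipped
    (an assumption that matches common tokenizer output).
    """
    start_token = None
    end_token = None
    for idx, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue  # skip special tokens with an empty span
        # First token whose span extends past the start of the answer.
        if start_token is None and tok_end > answer_start:
            start_token = idx
        # Last token that begins before the answer ends.
        if tok_start < answer_end:
            end_token = idx
    if start_token is None or end_token is None or end_token < start_token:
        return None  # the answer lies outside this tokenized window
    return start_token, end_token


# Hypothetical offsets for "Paris is in France" with special tokens at both ends.
example_offsets = [(0, 0), (0, 5), (6, 8), (9, 11), (12, 18), (0, 0)]
france_span = char_span_to_token_span(example_offsets, 12, 18)
```

Returning `None` for an answer outside the window lets the caller decide how to label such examples, which matters when long contexts are split into overlapping windows.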

Key Improvements:

  • SDK V2 Native: Migrated all training logic to use TrainerClient and CustomTrainer.
  • Windows & Local Compatibility: Standardized the distributed backend to gloo to ensure notebooks run successfully on Windows machines and local environments.
  • Robust Verification Logic: Implemented a KUBEFLOW_TRAINER_TEST environment flag system. For LLM and Audio tasks, this now uses max_steps=1 and a tiny data subset to allow for instant logic verification without requiring high-compute resources.
  • Unified Execution Model: Each notebook now provides three clear, tested paths for users:
      • Direct Python: Quick kernel-level experimentation.
      • SDK Local: Isolated environment verification using LocalProcessBackendConfig.
      • Cluster Scaling: Distributed execution on Kubernetes with num_nodes scaling.
  • Environment Stability: Added explicit dependency checks for accelerate, torchaudio, and tokenizers to prevent common runtime errors in notebook environments.
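
The verification flag and dependency checks described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebooks' actual code: the helper names (`is_test_mode`, `training_overrides`, `missing_dependencies`), the accepted flag values, the `subset_size` of 8, and the use of `max_steps=-1` to mean "no step limit" are all hypothetical. Only the `KUBEFLOW_TRAINER_TEST` variable name, `max_steps=1`, the `gloo` backend, and the package names come from the description.

```python
import importlib.util
import os
from typing import Dict, List, Mapping, Optional


def is_test_mode(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when KUBEFLOW_TRAINER_TEST is set to a truthy value.

    The accepted values ("1", "true", "yes") are an assumption.
    """
    env = os.environ if env is None else env
    return env.get("KUBEFLOW_TRAINER_TEST", "").strip().lower() in {"1", "true", "yes"}


def training_overrides(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Pick a tiny configuration in test mode and a full run otherwise."""
    if is_test_mode(env):
        # One optimizer step on a small data subset for quick logic checks.
        return {"max_steps": 1, "subset_size": 8, "backend": "gloo"}
    # Hypothetical defaults: no step limit, full dataset.
    return {"max_steps": -1, "subset_size": None, "backend": "gloo"}


def missing_dependencies(names: List[str]) -> List[str]:
    """Return the top-level packages that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages named in the description; report what is missing before training.
missing = missing_dependencies(["accelerate", "torchaudio", "tokenizers"])
if missing:
    print("Missing packages: " + ", ".join(missing))
```

Passing `env` explicitly keeps the helpers easy to exercise without mutating the real process environment, for example `training_overrides({"KUBEFLOW_TRAINER_TEST": "1"})`.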

Which issue(s) this PR fixes:
Fixes #3062
Fixes #2040
PR: #2830

Checklist:

  • Docs included if any changes are user facing (Updated PyTorch README)

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jaiakash
Member

jaiakash commented Jan 3, 2026

Thanks for raising this, @Snehadas2005.

I see it's still a draft PR, but I have a few minor suggestions that should help you.

Happy contributing.

@Snehadas2005
Author

Snehadas2005 commented Jan 4, 2026

Thank you so much, @jaiakash, for the detailed feedback and references. I really appreciate it.

That makes sense. I will convert the example into a Jupyter notebook and align it with the existing example patterns you shared, focusing on clarity and readability for data scientists.

I also appreciate the note on DCO signing. I will fix the commit signatures and ensure all future commits are properly signed.

Thanks again for the guidance. I'm happy to iterate further and adjust based on feedback from the team.

vsoch and others added 3 commits January 4, 2026 10:00
* feat: kep for flux hpc (2841)

This KEP proposes adding an hpcPolicy to support Flux
Framework and (in the future) other workload managers
that provide more traditional HPC features.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* review: see updates below.

Changed crd examples to reflect documentation
removed tasks from definition - can go in settings
removed mentions of minicluster out of context
specified train image instead of custom logic
added user stories

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* feat: flux policy

Update the KEP to define a FluxMLPolicySource that
exposes attributes specific to Flux.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

* review: add details of cm and init container

Signed-off-by: vsoch <vsoch@users.noreply.github.com>

---------

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
Co-authored-by: vsoch <vsoch@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
…ation

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jan 4, 2026
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
@google-oss-prow google-oss-prow bot added size/XXL and removed size/XL labels Jan 6, 2026
@Snehadas2005 Snehadas2005 marked this pull request as ready for review January 6, 2026 05:12
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

Add production-ready standalone Python examples for PyTorch (non-notebook)
Add more AI/ML Training Example with Kubeflow Trainer

3 participants