@JEETDESAI25

What this PR does / why we need it:

  • Decouple JobSet suspend toggling from the SSA payload so that the controller no longer trips the JobSet webhook's "spec.replicatedJobs is immutable" error when suspending or resuming existing workloads.
  • Add a clarifying comment that suspend for existing JobSets is handled via SyncSuspend, preventing future regressions.
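The decoupling described above can be sketched roughly as follows. This is a self-contained illustration, not the code in this PR: `JobSetSpec`, `ReplicatedJob`, `validateUpdate`, `coupledUpdate`, and the `syncSuspend` signature are hypothetical stand-ins, and the assumption that the rebuilt spec differs from the stored one in `replicatedJobs` is made only to trigger the error path.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// ReplicatedJob and JobSetSpec are simplified, hypothetical stand-ins
// for the real JobSet API types.
type ReplicatedJob struct {
	Name     string
	Replicas int32
}

type JobSetSpec struct {
	ReplicatedJobs []ReplicatedJob
	Suspend        *bool
}

var errImmutable = errors.New("spec.replicatedJobs is immutable")

// validateUpdate mimics a webhook that rejects any change to
// replicatedJobs on an existing object (assumed behavior).
func validateUpdate(oldSpec, newSpec JobSetSpec) error {
	if !reflect.DeepEqual(oldSpec.ReplicatedJobs, newSpec.ReplicatedJobs) {
		return errImmutable
	}
	return nil
}

// coupledUpdate sends the whole rebuilt spec, suspend included.
// Any drift in replicatedJobs makes the suspend toggle fail too.
func coupledUpdate(stored, desired JobSetSpec) (JobSetSpec, error) {
	if err := validateUpdate(stored, desired); err != nil {
		return stored, err
	}
	return desired, nil
}

// syncSuspend (hypothetical signature) changes only the suspend field
// of the stored object, so replicatedJobs is never part of the update.
func syncSuspend(stored JobSetSpec, suspend bool) (JobSetSpec, error) {
	updated := stored
	updated.Suspend = &suspend
	if err := validateUpdate(stored, updated); err != nil {
		return stored, err
	}
	return updated, nil
}

func main() {
	suspended, resume := true, false
	stored := JobSetSpec{
		ReplicatedJobs: []ReplicatedJob{{Name: "node", Replicas: 2}},
		Suspend:        &suspended,
	}
	// Assumed drift between the rebuilt spec and the stored one.
	desired := JobSetSpec{
		ReplicatedJobs: []ReplicatedJob{{Name: "node", Replicas: 4}},
		Suspend:        &resume,
	}

	_, err := coupledUpdate(stored, desired)
	fmt.Println("coupled update error:", err)

	updated, err := syncSuspend(stored, false)
	fmt.Println("decoupled update error:", err, "suspend:", *updated.Suspend)
}
```

The point of the sketch is that the suspend toggle succeeds whenever it is the only field touched, regardless of how the rest of the rebuilt spec compares with what is stored.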

Which issue(s) this PR fixes:

Checklist:

  • [ ] Docs included if any changes are user facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@astefanutti
Contributor

/ok-to-test

@coveralls

coveralls commented Dec 18, 2025

Pull Request Test Coverage Report for Build 20355704238

Details

  • 24 of 48 (50.0%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.03%) to 51.469%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/runtime/core/clustertrainingruntime.go | 0 | 3 | 0.0% |
| pkg/runtime/core/trainingruntime.go | 0 | 3 | 0.0% |
| pkg/controller/trainjob_controller.go | 0 | 8 | 0.0% |
| pkg/runtime/framework/plugins/jobset/jobset.go | 15 | 25 | 60.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller/trainjob_controller.go | 1 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 20289967122: | 0.03% |
| Covered Lines: | 1261 |
| Relevant Lines: | 2450 |

💛 - Coveralls

Signed-off-by: Jeet <113221510+JEETDESAI25@users.noreply.github.com>
Signed-off-by: Jeet <113221510+JEETDESAI25@users.noreply.github.com>
@JEETDESAI25 force-pushed the fix-suspend-resume-3008 branch from e35d722 to 4e615d7 on December 19, 2025 00:30
@google-oss-prow bot added the size/L label and removed the size/M label on Dec 19, 2025
@JEETDESAI25
Author

/assign @terrytangyuan

Development

Successfully merging this pull request may close these issues.

TrainJob suspend/resume fails with JobSet webhook validation error

4 participants