
Conversation

@RohitYandigeri

Fixes #2713

This PR improves TrainJob observability by adding intermediate status
conditions that reflect the lifecycle of the underlying jobs.

What’s added

  • Pending condition when a TrainJob is created but underlying jobs have not started
  • Running condition when underlying jobs are actively executing
  • Status is derived from Job-level signals via the TrainJobStatus framework plugin (a rough sketch of this mapping is shown after this list)
  • kubectl get trainjob now reflects these states consistently
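
For reviewers who want a concrete picture, here is a minimal sketch of the kind of mapping the plugin performs. It assumes a hypothetical helper (`deriveTrainJobCondition`) that reads plain batch/v1 Jobs; the actual change lives in the TrainJobStatus framework plugin and may read JobSet-level signals instead, so treat the names and structure below as illustrative only, not the real plugin interface.

```go
// Illustrative sketch only: maps child batch/v1 Job signals to
// TrainJob-style conditions. Not the actual TrainJobStatus plugin API.
package trainjobstatus

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// deriveTrainJobCondition returns a Running condition once at least one
// child Job reports active pods, and a Pending condition otherwise.
func deriveTrainJobCondition(jobs []batchv1.Job) metav1.Condition {
	for _, job := range jobs {
		// Job.Status.Active counts this Job's pods that are in the Running phase.
		if job.Status.Active > 0 {
			return metav1.Condition{
				Type:    "Running",
				Status:  metav1.ConditionTrue,
				Reason:  "JobsActive",
				Message: "Underlying jobs are actively executing",
			}
		}
	}
	return metav1.Condition{
		Type:    "Pending",
		Status:  metav1.ConditionTrue,
		Reason:  "JobsNotStarted",
		Message: "TrainJob is created but underlying jobs have not started",
	}
}
```

If no child Job has started, the TrainJob stays Pending; as soon as one reports active pods, the condition flips to Running, matching the two states listed above.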

Notes

  • Status is not set during TrainJob creation (controller-owned status), in line
    with Kubernetes conventions
  • CRD regeneration could not be performed locally on Windows; happy to update
    manifests if requested

Why

This aligns TrainJob with standard Kubernetes resource status patterns and
provides better UX for users monitoring training lifecycle.

cc @kubeflow/kubeflow-trainer-team

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


github-actions bot commented Dec 4, 2025

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@RohitYandigeri changed the title from "feat(trainjob): add Pending and Running status conditions" to "feat(runtimes): add Pending and Running status conditions for TrainJob" on Dec 4, 2025

@andreyvelich (Member) left a comment

Thank you for this contribution @RohitYandigeri!
As we discussed before, the Job Active condition doesn't indicate that the training process is running. Sometimes, PyTorch can get stuck in the initialization or synchronization phase, and the Kubernetes batch Job doesn't detect that.

As part of KEP #2905, we are working on exposing training progress in the TrainJob status. That will help us detect whether the actual model training process is happening.

cc @robert-bell @kubeflow/kubeflow-trainer-team

Let's hold this PR for now and discuss more details in #2905.
/hold


Successfully merging this pull request may close these issues.

Add more status conditions to TrainJob for better visibility of execution state, like Running/Pending