
Conversation

@RohitYandigeri

Fixes #2713

This PR improves TrainJob observability by adding intermediate status
conditions that reflect the lifecycle of the underlying jobs.

What’s added

  • Pending condition when a TrainJob is created but underlying jobs have not started
  • Running condition when underlying jobs are actively executing
  • Status is derived from Job-level signals via the TrainJobStatus framework plugin (a rough sketch of this mapping is shown after this list)
  • kubectl get trainjob now reflects these states consistently
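
For reviewers who want a concrete picture, here is a minimal sketch of the kind of mapping the plugin performs. It assumes a hypothetical helper (`deriveTrainJobCondition`) that reads plain batch/v1 Jobs; the actual change lives in the TrainJobStatus framework plugin and may read JobSet-level signals instead, so treat the names and structure below as illustrative only, not the real plugin interface.

```go
// Illustrative sketch only: maps child batch/v1 Job signals to
// TrainJob-style conditions. Not the actual TrainJobStatus plugin API.
package trainjobstatus

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// deriveTrainJobCondition returns a Running condition once at least one
// child Job reports active pods, and a Pending condition otherwise.
func deriveTrainJobCondition(jobs []batchv1.Job) metav1.Condition {
	for _, job := range jobs {
		// Job.Status.Active counts this Job's pods that are in the Running phase.
		if job.Status.Active > 0 {
			return metav1.Condition{
				Type:    "Running",
				Status:  metav1.ConditionTrue,
				Reason:  "JobsActive",
				Message: "Underlying jobs are actively executing",
			}
		}
	}
	return metav1.Condition{
		Type:    "Pending",
		Status:  metav1.ConditionTrue,
		Reason:  "JobsNotStarted",
		Message: "TrainJob is created but underlying jobs have not started",
	}
}
```

If no child Job has started, the TrainJob stays Pending; as soon as one reports active pods, the condition flips to Running, matching the two states listed above.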

Notes

  • Status is not set during TrainJob creation (controller-owned status), in line
    with Kubernetes conventions
  • CRD regeneration could not be performed locally on Windows; happy to update
    manifests if requested

Why

This aligns TrainJob with standard Kubernetes resource status patterns and
provides better UX for users monitoring training lifecycle.

cc @kubeflow/kubeflow-trainer-team

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


github-actions bot commented Dec 4, 2025

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@RohitYandigeri changed the title from "feat(trainjob): add Pending and Running status conditions" to "feat(runtimes): add Pending and Running status conditions for TrainJob" on Dec 4, 2025

@andreyvelich (Member) left a comment

Thank you for this contribution @RohitYandigeri!
As we discussed before, the Job Active condition doesn't indicate that the training process is running. Sometimes, PyTorch can get stuck in the initialization or synchronization phase, and the Kubernetes batch Job doesn't detect that.

As part of KEP #2905, we are working on exposing training progress in the TrainJob status. That will help us detect whether the actual model training process is happening.

cc @robert-bell @kubeflow/kubeflow-trainer-team

Let's hold this PR for now and discuss more details in #2905.
/hold


Successfully merging this pull request may close these issues.

Add more status conditions to TrainJob for better visibility of execution state, like Running/Pending