@XploY04 XploY04 commented Jan 4, 2026

Summary

Add TTLSecondsAfterFinished and ActiveDeadlineSeconds fields to the TrainJob CRD to enable automatic cleanup of finished jobs and to enforce maximum runtime limits, respectively, bringing TrainJob lifecycle management in line with Kubernetes Jobs and JobSets.

Motivation

Currently, TrainJob resources persist in the cluster indefinitely after completion unless manually deleted, and there is no way to cap how long a TrainJob may run. This leads to:

  • Etcd Bloat: Accumulation of stale metadata in the cluster state.
  • Resource Contention: Runaway training jobs can consume GPU/CPU resources indefinitely if they hang or enter an infinite loop.

Goals

  • Add TTLSecondsAfterFinished for automatic deletion of finished TrainJobs
  • Add ActiveDeadlineSeconds to enforce maximum runtime
  • Follow Kubernetes Job/JobSet patterns

Proposal

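  • Interaction with Suspend: If a TrainJob is suspended, the ActiveDeadlineSeconds timer continues to count down, because the deadline is measured from the TrainJob creation time. Note that Kubernetes Jobs behave differently: resuming a suspended Job resets its startTime, which restarts the activeDeadlineSeconds timer.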
  • Clock Skew: Both TTL and deadline enforcement compare the controller's local clock against timestamps recorded on the resource (for example, the creation timestamp set by the Kubernetes API server). Significant clock skew between the two could lead to premature or delayed actions.

Risks and Mitigations

  • User Confusion: Users might be surprised when their finished TrainJobs disappear.
    • Mitigation: We will rely on clear documentation and potentially add a webhook warning if TTL is set to a very short value (< 60s).
  • Loss of Job History: Automatic deletion removes the resource and its status.
    • Mitigation: Users should utilize logging/monitoring solutions or the future TrainJob History Server to persist results beyond the resource lifespan.

Design Details

API Design

Add two optional fields to TrainJobSpec in pkg/apis/trainer/v1alpha1/trainjob_types.go:

type TrainJobSpec struct {
    // ... existing fields ...

    // TTLSecondsAfterFinished limits the lifetime of a TrainJob that has finished
    // execution (either Complete or Failed). If this field is set, once the TrainJob
    // finishes, it will be deleted after ttlSecondsAfterFinished expires. If this
    // field is unset, the TrainJob will not be automatically deleted. If set to zero,
    // the TrainJob becomes eligible for immediate deletion after finishing.
    // +optional
    TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`

    // ActiveDeadlineSeconds specifies the duration in seconds relative to the TrainJob
    // creation time that the TrainJob may be active before the system tries to terminate
    // it. Value must be a positive integer. Once reached, all running Pods are terminated
    // and the TrainJob status becomes Failed with reason: DeadlineExceeded.
    // +optional
    // +kubebuilder:validation:Minimum=1
    ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
}
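
Both fields are pointers so that "unset" and "zero" stay distinguishable, which matters for TTLSecondsAfterFinished (unset means never delete, zero means delete immediately). A minimal sketch of that distinction, assuming a stand-in type named lifecycleSpec and a helper named encode that are not part of this PR:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lifecycleSpec is a hypothetical stand-in that mirrors only the two new
// TrainJobSpec fields and their JSON tags.
type lifecycleSpec struct {
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
	ActiveDeadlineSeconds   *int64 `json:"activeDeadlineSeconds,omitempty"`
}

// encode is a hypothetical helper that marshals the spec to a JSON string.
func encode(s lifecycleSpec) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	zero := int32(0)

	// Unset TTL: the nil pointer is omitted from the serialized object.
	fmt.Println(encode(lifecycleSpec{}))

	// Explicit zero TTL: the pointer is non-nil, so the zero value is kept.
	fmt.Println(encode(lifecycleSpec{TTLSecondsAfterFinished: &zero}))
}
```

With a plain int32 and omitempty, an explicit zero would be dropped on serialization and become indistinguishable from an unset field.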

Add new condition reason:

const (
    // TrainJobDeadlineExceededReason is used when ActiveDeadlineSeconds is exceeded
    TrainJobDeadlineExceededReason string = "DeadlineExceeded"
)

User Example:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: my-training
spec:
  ttlSecondsAfterFinished: 3600    # Delete 1 hour after completion
  activeDeadlineSeconds: 7200      # Max 2 hours runtime
  runtimeRef:
    name: torch-distributed
  trainer:
    numNodes: 2

Implementation Overview

Controller Changes (pkg/controller/trainjob_controller.go):

  1. TTL Reconciliation:
    • Check if job is finished and TTL is set.
    • Calculate deleteTime = finishTime + TTL.
    • If expired, delete TrainJob (cascades to owned resources).
    • Otherwise, requeue at deleteTime.
  2. Deadline Enforcement:
    • Check if job is running and deadline is set.
    • Calculate deadline = creationTime + ActiveDeadlineSeconds.
    • If exceeded, mark TrainJob as Failed (Reason: DeadlineExceeded) and delete JobSet.
    • Otherwise, requeue at deadline.
  3. Integration: Add both logic blocks to the main Reconcile() loop.

Webhook Validation (pkg/webhooks/trainjob_webhook.go):

  • Validate TTLSecondsAfterFinished >= 0.
  • Validate ActiveDeadlineSeconds > 0.
  • Warn if TTLSecondsAfterFinished < 60s.
  • Make fields immutable after creation.
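
The validation rules above can be sketched as follows. This is a simplified illustration, not the webhook code in this PR: validateCreate, validateUpdate, samePtrValue, and shortTTLWarningSeconds are assumed names, and a real webhook would return field-path errors and admission warnings instead of plain strings and errors.

```go
package main

import (
	"errors"
	"fmt"
)

// shortTTLWarningSeconds is the warning threshold named in the proposal (< 60s).
const shortTTLWarningSeconds = 60

// validateCreate is a hypothetical helper that applies the range checks and
// the short-TTL warning to the two optional fields.
func validateCreate(ttl *int32, deadline *int64) (warnings []string, errs []error) {
	if ttl != nil {
		switch {
		case *ttl < 0:
			errs = append(errs, fmt.Errorf("ttlSecondsAfterFinished must be >= 0, got %d", *ttl))
		case *ttl < shortTTLWarningSeconds:
			warnings = append(warnings, fmt.Sprintf("ttlSecondsAfterFinished is %ds; the TrainJob will be deleted shortly after it finishes", *ttl))
		}
	}
	if deadline != nil && *deadline <= 0 {
		errs = append(errs, fmt.Errorf("activeDeadlineSeconds must be > 0, got %d", *deadline))
	}
	return warnings, errs
}

// samePtrValue reports whether two optional values are both unset or both
// set to the same value.
func samePtrValue[T comparable](a, b *T) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// validateUpdate is a hypothetical helper that rejects any change to either
// field after creation.
func validateUpdate(oldTTL, newTTL *int32, oldDeadline, newDeadline *int64) error {
	if !samePtrValue(oldTTL, newTTL) {
		return errors.New("ttlSecondsAfterFinished is immutable")
	}
	if !samePtrValue(oldDeadline, newDeadline) {
		return errors.New("activeDeadlineSeconds is immutable")
	}
	return nil
}

func main() {
	ttl, deadline := int32(30), int64(7200)
	warnings, errs := validateCreate(&ttl, &deadline)
	fmt.Println(warnings, errs)

	longer := int32(3600)
	fmt.Println(validateUpdate(&ttl, &longer, &deadline, &deadline))
}
```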

Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

Prerequisite testing updates

None identified.

Unit Tests

  • pkg/controller/trainjob/: 2026-01-04 - High coverage expected for new logic

Test Cases:

  • TTL not set → no deletion
  • TTL expired → job deleted
  • TTL not expired → requeue at correct time
  • TTL = 0 → immediate deletion
  • Deadline exceeded → job failed, pods terminated
  • Deadline not reached → requeue at deadline
  • Both fields set → correct interaction

E2E tests

  • test/e2e/trainjob_ttl_test.go:
    • Real training workload with TTL: Verify resource disappears after expiration.
    • Real training workload with deadline: Verify job fails at timeout with DeadlineExceeded reason.
    • Verify no orphaned resources remain.

Integration tests

  • test/integration/controller/trainjob_controller_test.go:
    • End-to-end TTL deletion workflow.
    • End-to-end deadline enforcement.
    • Cascade deletion of owned resources.
    • Controller restart (verify timers resume correctly).

Implementation History

  • 2025-10-20: Issue opened #2899.
  • 2026-01-04: KEP drafted.
  • TBD: Alpha implementation.

Drawbacks

  • Potential User Confusion: Users unfamiliar with TTL may be surprised when TrainJobs disappear.
  • Loss of Job History: TTL deletion removes TrainJob metadata permanently. Use logging/exporting to mitigate.

Alternatives

…ment.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

github-actions bot commented Jan 4, 2026

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jan 5, 2026
…ment

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…or deadline enforcement

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
…and TTL enforcement in TrainJob

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
@XploY04 XploY04 marked this pull request as ready for review January 5, 2026 19:58
Copilot AI review requested due to automatic review settings January 5, 2026 19:58
Copilot AI left a comment

Pull request overview

This PR adds TTLSecondsAfterFinished and ActiveDeadlineSeconds fields to the TrainJob CRD to enable automatic cleanup of finished jobs and enforce maximum runtime limits. The implementation follows Kubernetes Job/JobSet patterns and includes comprehensive test coverage across unit, integration, and e2e tests.

Key changes:

  • Added two new optional fields to TrainJobSpec: TTLSecondsAfterFinished (int32) and ActiveDeadlineSeconds (int64) with immutability constraints
  • Implemented controller logic for TTL-based deletion and deadline enforcement with appropriate requeue mechanisms
  • Added RBAC permissions for TrainJob deletion and webhook validation for TTL warnings

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

File | Description
pkg/apis/trainer/v1alpha1/trainjob_types.go | Added TTLSecondsAfterFinished and ActiveDeadlineSeconds fields to TrainJobSpec with validation markers and DeadlineExceeded condition reason
pkg/controller/trainjob_controller.go | Implemented reconcileTTL and reconcileDeadline functions to handle automatic deletion and deadline enforcement; added delete RBAC permission
pkg/webhooks/trainjob_webhook.go | Added validateTTLSecondsAfterFinished function to warn users about short TTL values (<60s)
pkg/util/testing/wrapper.go | Added helper methods TTLSecondsAfterFinished and ActiveDeadlineSeconds to TrainJobWrapper for testing
pkg/controller/trainjob_ttl_test.go | Added comprehensive unit tests for TTL cleanup and deadline enforcement logic
test/integration/controller/trainjob_controller_test.go | Added integration tests for TTL deletion (no TTL, TTL=3s, TTL=0) and deadline enforcement scenarios
test/e2e/e2e_test.go | Added e2e tests for real-world TTL deletion and deadline exceeded scenarios
manifests/base/rbac/role.yaml | Split trainjobs resource into separate rule entries with delete verb added for TTL cleanup
manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml | Added activeDeadlineSeconds and ttlSecondsAfterFinished to CRD schema with validation and immutability constraints
charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml | Mirrored CRD changes for Helm chart deployment
pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go | Generated DeepCopyInto methods for new pointer fields
pkg/apis/trainer/v1alpha1/zz_generated.openapi.go | Generated OpenAPI schema definitions for new fields
pkg/client/applyconfiguration/trainer/v1alpha1/trainjobspec.go | Added WithTTLSecondsAfterFinished and WithActiveDeadlineSeconds methods to apply configuration builder
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_train_job_spec.py | Added active_deadline_seconds and ttl_seconds_after_finished fields to Python API model
api/openapi-spec/swagger.json | Updated OpenAPI specification with new field definitions

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>