Release v0.16.0 #8462

@mimowo

Release Checklist

  • OWNERS must LGTM the release proposal.
    At least two for minor or major releases. At least one for a patch release.
  • Verify that the changelog in this issue and the CHANGELOG folder is up-to-date
  • For major or minor releases (v$MAJ.$MIN.0), create a new release branch.
    • An OWNER creates a vanilla release branch with
      git branch release-$MAJ.$MIN main
    • An OWNER pushes the new release branch with
      git push upstream release-$MAJ.$MIN
  • Update the release branch:
    • Update RELEASE_BRANCH and RELEASE_VERSION in Makefile and run make prepare-release-branch
    • Update the CHANGELOG
    • Submit a pull request with the changes:
  • An OWNER creates a signed tag running
    git tag -s $VERSION
    and inserts the changelog into the tag description.
    To perform this step, you need a PGP key registered on GitHub.
  • An OWNER pushes the tag with
    git push upstream $VERSION
    • Triggers prow to build and publish a staging container image
      us-central1-docker.pkg.dev/k8s-staging-images/kueue/kueue:$VERSION
  • An OWNER prepares a draft release
    • Create the draft release pointing to the created tag.
    • Write the change log into the draft release.
    • Run
      make artifacts IMAGE_REGISTRY=registry.k8s.io/kueue GIT_TAG=$VERSION
      to generate the artifacts in the artifacts folder.
    • Upload the files in the artifacts folder to the draft release - either
      via UI or gh release --repo kubernetes-sigs/kueue upload $VERSION artifacts/*.
  • Submit a PR against k8s.io to
    promote the container images and Helm Chart
    to production:
    • Update registry.k8s.io/images/k8s-staging-kueue/images.yaml.
  • Wait for the PR to be merged and verify that the image registry.k8s.io/kueue/kueue:$VERSION is available.
  • Publish the draft release prepared at the GitHub releases page.
    Link:
  • Run the openvex action to generate openvex data. The action will add the file to the release artifacts.
  • Run the SBOM action to generate the SBOM and add it to the release.
  • Update the main branch:
    • Update RELEASE_VERSION in Makefile and run make prepare-release-branch
    • Release notes in the CHANGELOG
    • SECURITY-INSIGHTS.yaml values by running make update-security-insights GIT_TAG=$VERSION
    • Submit a pull request with the changes:
    • Cherry-pick the pull request onto the website branch
  • For major or minor releases, merge the main branch into the website branch to publish the updated documentation.
  • Send an announcement email to sig-scheduling@kubernetes.io and wg-batch@kubernetes.io with the subject [ANNOUNCE] kueue $VERSION is released.
  • For a major or minor release, prepare the repo for the next version:
    • Create an unannotated devel tag in the
      main branch, on the first commit that gets merged after the release
      branch has been created (presumably the README update commit above), and push the tag:
      DEVEL=v$MAJ.$(($MIN+1)).0-devel; git tag $DEVEL main && git push upstream $DEVEL
      This ensures that the devel builds on the main branch will have a meaningful version number.
    • Create a milestone for the next minor release and update prow to set it automatically for new PRs:
    • Create the presubmits and the periodic jobs for the next patch release:
    • Drop CI Jobs for testing the out-of-support branch:
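
The branch and tag names used in the checklist can be derived from the version string. The following is a minimal sketch, assuming versions of the form `v$MAJ.$MIN.$PATCH`; the function names `release_branch` and `next_devel_tag` are hypothetical and not part of the Kueue tooling.

```shell
#!/usr/bin/env bash
# Sketch only: derive the names used in the checklist from a version
# such as v0.16.0. The function names are hypothetical helpers; the
# output formats follow the checklist (release-$MAJ.$MIN and
# v$MAJ.$(($MIN+1)).0-devel).

release_branch() {
  # v0.16.0 -> release-0.16
  local version="${1#v}"      # strip the leading "v"
  local maj="${version%%.*}"  # text before the first dot
  local rest="${version#*.}"  # text after the first dot
  local min="${rest%%.*}"     # minor component
  echo "release-${maj}.${min}"
}

next_devel_tag() {
  # v0.16.0 -> v0.17.0-devel
  local version="${1#v}"
  local maj="${version%%.*}"
  local rest="${version#*.}"
  local min="${rest%%.*}"
  echo "v${maj}.$((min + 1)).0-devel"
}

release_branch v0.16.0   # prints release-0.16
next_devel_tag v0.16.0   # prints v0.17.0-devel
```

The printed names can then be passed to the `git branch`, `git tag`, and `git push upstream` commands listed in the checklist.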

Changelog

Changes since `v0.15.0`:

## Urgent Upgrade Notes 

### (No, really, you MUST read this before you upgrade)

- Removed FlavorFungibilityImplicitPreferenceDefault feature gate.
  
  Configure flavor selection preference using the ClusterQueue field `spec.flavorFungibility.preference` instead. (#8134, @mbobrovskyi)
- The short name "wl" for workloads has been removed to avoid potential conflicts with the in-tree workload object coming into Kubernetes. (#8472, @kannon92)
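
Scripts that still call `kubectl` with the removed "wl" short name will need updating, for example to the full resource name or to the "kwl" short name listed under Feature in this changelog. The following is a rough sketch of a check for such command lines; the function name `uses_removed_shortname` and the namespace `team-a` are hypothetical, and the pattern only covers the simple `kubectl <verb> wl` form (flags placed before the verb are not handled).

```shell
#!/usr/bin/env bash
# Sketch only: flag command lines that use the removed "wl" short name.
# Matches "kubectl <verb> wl" where "wl" is followed by whitespace,
# a "/" (as in wl/name), or the end of the line.
uses_removed_shortname() {
  printf '%s\n' "$1" |
    grep -Eq 'kubectl[[:space:]]+[a-z-]+[[:space:]]+wl([[:space:]/]|$)'
}

# Sample command lines; "team-a" is a made-up namespace.
for cmd in 'kubectl get wl -n team-a' \
           'kubectl get kwl -n team-a' \
           'kubectl get workloads'; do
  if uses_removed_shortname "$cmd"; then
    echo "update: $cmd"
  else
    echo "ok:     $cmd"
  fi
done
```

Only the first sample line is reported as needing an update; the other two use names that remain valid.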
 
## Changes by Kind

### API Change

- Add field multiplyBy for ResourceTransformation (#7599, @calvin0327)
- V1beta2: Use v1beta2 as storage for 0.16
  
  The v1beta1 API version will be removed in the v0.17.0 release.
  Please migrate all resources from v1beta1 to v1beta2 before then. Make sure the migration is complete. (#8020, @mbobrovskyi)
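
One way to check whether objects may still be persisted as v1beta1 is to inspect the `status.storedVersions` field of each Kueue CRD. The following is a minimal sketch, not an official migration procedure: the helper `needs_migration` is hypothetical, and it assumes the usual Kubernetes behavior that entries in `storedVersions` are not removed automatically, so the list has to be updated after stored objects are rewritten.

```shell
#!/usr/bin/env bash
# Sketch only: decide from a CRD's status.storedVersions whether
# v1beta1 is still recorded as a storage version.
# needs_migration is a hypothetical helper, not part of Kueue.
needs_migration() {
  case "$1" in
    *v1beta1*) return 0 ;;  # v1beta1 still listed
    *)         return 1 ;;  # only newer versions listed
  esac
}

# On a live cluster the input could come from, for example:
#   kubectl get crd clusterqueues.kueue.x-k8s.io \
#     -o jsonpath='{.status.storedVersions}'
# A fixed sample value is used here so the sketch runs without a cluster.
stored='["v1beta1","v1beta2"]'

if needs_migration "$stored"; then
  echo "migration pending"
else
  echo "migration complete"
fi
```

With the sample value above the script reports a pending migration; a list containing only `v1beta2` would report completion.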

### Feature

- Adds support for PodsReady when JobSet dependsOn is used. (#7889, @MaysaMacedo)
- CLI: Support "kwl" and "kueueworkload" as short names for Kueue Workloads. (#8379, @kannon92)
- Enable Pod-based integrations by default (#8096, @sohankunkerkar)
- Logs now include `replica-role` field to identify Kueue instance roles (leader/follower/standalone). (#8107, @IrvingMg)
- Observability: Add more details (the preemptionMode) to the QuotaReserved condition message,
  and the related event, about the skipped flavors which were considered for preemption. 
  Before: "Quota reserved in ClusterQueue preempt-attempts-cq, wait time since queued was 9223372037s; Flavors considered: main: on-demand(Preempt;insufficient unused quota for cpu in flavor on-demand, 1 more needed)"
  After: "Quota reserved in ClusterQueue preempt-attempts-cq, wait time since queued was 9223372037s; Flavors considered: main: on-demand(preemptionMode=Preempt;insufficient unused quota for cpu in flavor on-demand, 1 more needed)" (#8024, @mykysha)
- Ray: Support RayJob InTreeAutoscaling by using the ElasticJobsViaWorkloadSlices feature. (#8082, @hiboyang)
- TAS: extend the information in condition messages and events about nodes excluded from calculating the
  assignment due to various recognized reasons like: taints, node affinity, node resource constraints. (#8043, @sohankunkerkar)

### Bug or Regression

- DRA: fix the race condition bug leading to undefined behavior due to concurrent operations
  on the Workload object, manifested by the "WARNING: DATA RACE" in test logs. (#8073, @mbobrovskyi)
- Fix `TrainJob` controller not correctly setting the `PodSet` count value based on `numNodes` for the expected number of training nodes. (#8135, @kaisoz)
- Fix a bug where changes to a WorkloadPriorityClass value did not trigger Workload priority updates. (#8442, @ASverdlov)
- Fix a performance bug where some "read-only" functions took an unnecessary "write" lock. (#8181, @ErikJiang)
- Fix the race condition bug where the kueue_pending_workloads metric may not be updated to 0 after the last 
  workload is admitted and there are no new workloads incoming. (#8037, @Singularity23x0)
- Fixed a bug where Kueue's scheduler would re-evaluate and update already finished workloads, significantly
  impacting overall scheduling throughput. This re-evaluation of a finished workload would be triggered when:
  1. Kueue is restarted
  2. There is any event related to LimitRange or RuntimeClass instances referenced by the workload (#8186, @mbobrovskyi)
- Fixed the following bugs for the StatefulSet integration by ensuring the Workload object
  has the ownerReference to the StatefulSet:
  1. Kueue doesn't keep the StatefulSet as deactivated
  2. Kueue marks the Workload as Finished if all StatefulSet's Pods are deleted
  3. changing the "queue-name" label could occasionally result in the StatefulSet getting stuck (#4799, @mbobrovskyi)
- JobFramework: Fixed a bug that allowed a deactivated workload to be activated. (#8424, @chengjoey)
- Kubeflow TrainJob v2: fix the bug to prevent duplicate pod template overrides when starting the Job is retried. (#8269, @j-skiba)
- MultiKueue via ClusterProfile: Fix a panic when the configuration for ClusterProfiles was not provided in the configMap. (#8071, @mszadkow)
- MultiKueue: Fixed status sync for CRD-based jobs (JobSet, Kubeflow, Ray, etc.) that was blocked while the local job was suspended. (#8308, @IrvingMg)
- MultiKueue: fix a bug where, for the Pod integration, the AdmissionCheck status would stay Pending indefinitely,
  even when the Pods are already running.
  
  The analogous fix is also applied to batch/Job when the MultiKueueBatchJobWithManagedBy feature gate is disabled. (#8189, @IrvingMg)
- MultiKueue: fix the eviction when initiated by the manager cluster (due to e.g. Preemption or WaitForPodsReady timeout). (#8151, @mbobrovskyi)
- ProvisioningRequest: Fixed a bug that prevented events from being updated when the AdmissionCheck state changed. (#8394, @mbobrovskyi)
- Scheduling: fix a bug where evictions submitted by the scheduler (preemptions and evictions due to a failing TAS NodeHotSwap)
  could result in a conflict when another controller modified the workload concurrently.
  In some scenarios, when the eviction was initiated by TAS NodeHotSwap, this could cause the scheduler's requests
  to fail indefinitely. (#7933, @mbobrovskyi)
- TAS NodeHotSwap: fixed a bug that allowed the scheduler to requeue a workload even after it had been deleted on TAS NodeHotSwap eviction. (#8278, @mbobrovskyi)
- TAS: Fix handling of admission for workloads using the LeastFreeCapacity algorithm when the "unconstrained"
  mode is used. In that case scheduling would fail if there is at least one node in the cluster which does not have
  enough capacity to accommodate at least one Pod. (#8168, @PBundyra)
- TAS: fix TAS resource flavor controller to extract only scheduling-relevant node updates to prevent unnecessary reconciliation. (#8452, @Ladicle)
- TAS: fix a performance bug where continuous reconciles of TAS ResourceFlavors (and related ClusterQueues)
  were triggered by updates to Nodes' heartbeat times. (#8342, @PBundyra)
- TAS: fix a bug where, when TopologyAwareScheduling is disabled but a ResourceFlavor is configured with topologyName, preemptions fail with "workload requires Topology, but there is no TAS cache information". (#8167, @zhifei92)
- TAS: fixed a performance issue caused by unnecessary (empty) requests from TopologyUngater. (#8279, @mbobrovskyi)

### Other (Cleanup or Flake)

- Fix: Removed outdated comments incorrectly stating that deployment, statefulset, and leaderworkerset integrations require pod integration to be enabled. (#8053, @IrvingMg)
- Improve error messages for validation errors regarding WorkloadPriorityClass changes in workloads. (#8334, @olekzabl)
- MultiKueue: improve the MultiKueueCluster reconciler to skip reconciliation, instead of throwing errors,
  when the corresponding Secret or ClusterProfile objects don't exist. Reconciliation is triggered when
  the objects are created. (#8144, @mszadkow)
- Removes ConfigurableResourceTransformations feature gate. (#8133, @mbobrovskyi)
