Release Checklist
- OWNERS must LGTM the release proposal. At least two for minor or major releases. At least one for a patch release.
- Verify that the changelog in this issue and the `CHANGELOG` folder is up-to-date.
  - Use https://github.com/kubernetes/release/tree/master/cmd/release-notes to gather notes.
    Example: `release-notes --org kubernetes-sigs --repo kueue --branch release-0.3 --start-sha 4a0ebe7a3c5f2775cdf5fc7d60c23225660f8702 --end-sha a51cf138afe65677f5f5c97f8f8b1bc4887f73d2 --dependencies=false --required-author=""`
- For major or minor releases (v$MAJ.$MIN.0), create a new release branch.
  - An OWNER creates a vanilla release branch with `git branch release-$MAJ.$MIN main`
  - An OWNER pushes the new release branch with `git push upstream release-$MAJ.$MIN`
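The branch-cut steps above can be sketched end to end. This is a sketch only, run in a throwaway repository so nothing touches the real remote; the `$MAJ`/`$MIN` values are illustrative, and the actual push to `upstream` is shown only as a comment:

```shell
set -e
MAJ=0; MIN=16                        # illustrative version numbers
tmp=$(mktemp -d); cd "$tmp"          # throwaway repo instead of a kueue checkout
git init -q -b main
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m seed
git branch "release-$MAJ.$MIN" main  # create the vanilla release branch from main
git rev-parse --verify "release-$MAJ.$MIN" >/dev/null && echo "branch created"
# In the real flow, an OWNER then pushes it:
#   git push upstream "release-$MAJ.$MIN"
```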
- Update the release branch:
  - Update `RELEASE_BRANCH` and `RELEASE_VERSION` in `Makefile` and run `make prepare-release-branch`
  - Update the `CHANGELOG`
  - Submit a pull request with the changes:
- An OWNER creates a signed tag running `git tag -s $VERSION` and inserts the changelog into the tag description. To perform this step, you need a PGP key registered on GitHub.
- An OWNER pushes the tag with `git push upstream $VERSION`
  - Triggers prow to build and publish a staging container image `us-central1-docker.pkg.dev/k8s-staging-images/kueue/kueue:$VERSION`
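The tagging step above can be sketched in a throwaway repository. The real flow uses `git tag -s` (a signed tag, which requires the registered PGP key); the sketch below uses `-a` only so it runs unsigned, and `$VERSION` is illustrative:

```shell
set -e
VERSION=v0.16.0                      # illustrative
tmp=$(mktemp -d); cd "$tmp"
git init -q -b main
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m seed
printf 'Changes since v0.15.0:\n- ...\n' > "$tmp/notes.txt"
# -F puts the changelog into the tag description; real flow: git tag -s "$VERSION" -F notes
git tag -a "$VERSION" -F "$tmp/notes.txt"
git tag --list "$VERSION"            # prints v0.16.0
# Real flow then: git push upstream "$VERSION" (this triggers the prow image build)
```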
- An OWNER prepares a draft release
  - Create the draft release pointing to the created tag.
  - Write the change log into the draft release.
  - Run `make artifacts IMAGE_REGISTRY=registry.k8s.io/kueue GIT_TAG=$VERSION` to generate the artifacts in the `artifacts` folder.
  - Upload the files in the `artifacts` folder to the draft release, either via the UI or `gh release --repo kubernetes-sigs/kueue upload $VERSION artifacts/*`.
- Submit a PR against k8s.io to promote the container images and Helm Chart to production:
  - Update `registry.k8s.io/images/k8s-staging-kueue/images.yaml`.
- Wait for the PR to be merged and verify that the image `registry.k8s.io/kueue/kueue:$VERSION` is available.
- Publish the draft release prepared at the GitHub releases page. Link:
- Run the openvex action to generate openvex data. The action will add the file to the release artifacts.
- Run the SBOM action to generate the SBOM and add it to the release.
- Update the `main` branch:
  - Update `RELEASE_VERSION` in `Makefile` and run `make prepare-release-branch`
  - Release notes in the `CHANGELOG`
  - `SECURITY-INSIGHTS.yaml` values by running `make update-security-insights GIT_TAG=$VERSION`
  - Submit a pull request with the changes:
  - Cherry-pick the pull request onto the `website` branch
- For major or minor releases, merge the `main` branch into the `website` branch to publish the updated documentation.
- Send an announcement email to `sig-scheduling@kubernetes.io` and `wg-batch@kubernetes.io` with the subject `[ANNOUNCE] kueue $VERSION is released`.
- For a major or minor release, prepare the repo for the next version:
  - Create an unannotated devel tag in the `main` branch, on the first commit that gets merged after the release branch has been created (presumably the README update commit above), and push the tag:
    `DEVEL=v$MAJ.$(($MIN+1)).0-devel; git tag $DEVEL main && git push upstream $DEVEL`
    This ensures that the devel builds on the `main` branch will have a meaningful version number.
  - Create a milestone for the next minor release and update prow to set it automatically for new PRs:
  - Create the presubmits and the periodic jobs for the next patch release:
  - Drop CI Jobs for testing the out-of-support branch:
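The devel-tag naming above relies on shell arithmetic to bump the minor version. A minimal sketch, with illustrative `$MAJ`/`$MIN` values and the push left as a comment:

```shell
MAJ=0; MIN=16                          # illustrative: the release just cut was v0.16.0
DEVEL="v$MAJ.$(($MIN+1)).0-devel"      # $((...)) bumps the minor version
echo "$DEVEL"                          # prints v0.17.0-devel
# Real flow: git tag "$DEVEL" main && git push upstream "$DEVEL"
```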
Changelog
Changes since `v0.15.0`:
## Urgent Upgrade Notes
### (No, really, you MUST read this before you upgrade)
- Removed FlavorFungibilityImplicitPreferenceDefault feature gate.
Configure flavor selection preference using the ClusterQueue field `spec.flavorFungibility.preference` instead. (#8134, @mbobrovskyi)
- The short name "wl" for workloads has been removed to avoid potential conflicts with the in-tree workload object coming into Kubernetes (#8472, @kannon92)
## Changes by Kind
### API Change
- Add field multiplyBy for ResourceTransformation (#7599, @calvin0327)
- V1beta2: Use v1beta2 as storage for 0.16
The v1beta1 API version will be removed in the v0.17.0 release.
Please migrate all resources from v1beta1 to v1beta2 before then. Make sure the migration is complete. (#8020, @mbobrovskyi)
### Feature
- Adds support for PodsReady when JobSet dependsOn is used. (#7889, @MaysaMacedo)
- CLI: Support "kwl" and "kueueworkload" as a shortname for Kueue Workloads. (#8379, @kannon92)
- Enable Pod-based integrations by default (#8096, @sohankunkerkar)
- Logs now include `replica-role` field to identify Kueue instance roles (leader/follower/standalone). (#8107, @IrvingMg)
- Observability: Add more details (the preemptionMode) to the QuotaReserved condition message,
and the related event, about the skipped flavors which were considered for preemption.
Before: "Quota reserved in ClusterQueue preempt-attempts-cq, wait time since queued was 9223372037s; Flavors considered: main: on-demand(Preempt;insufficient unused quota for cpu in flavor on-demand, 1 more needed)"
After: "Quota reserved in ClusterQueue preempt-attempts-cq, wait time since queued was 9223372037s; Flavors considered: main: on-demand(preemptionMode=Preempt;insufficient unused quota for cpu in flavor on-demand, 1 more needed)" (#8024, @mykysha)
- Ray: Support RayJob InTreeAutoscaling by using the ElasticJobsViaWorkloadSlices feature. (#8082, @hiboyang)
- TAS: extend the information in condition messages and events about nodes excluded from calculating the
  assignment due to various recognized reasons such as taints, node affinity, and node resource constraints. (#8043, @sohankunkerkar)
### Bug or Regression
- DRA: fix the race condition bug leading to undefined behavior due to concurrent operations
on the Workload object, manifested by the "WARNING: DATA RACE" in test logs. (#8073, @mbobrovskyi)
- Fix `TrainJob` controller not correctly setting the `PodSet` count value based on `numNodes` for the expected number of training nodes. (#8135, @kaisoz)
- Fix a bug where WorkloadPriorityClass value changes did not trigger Workload priority updates. (#8442, @ASverdlov)
- Fix a performance bug where some "read-only" functions took an unnecessary "write" lock. (#8181, @ErikJiang)
- Fix the race condition bug where the kueue_pending_workloads metric may not be updated to 0 after the last
workload is admitted and there are no new workloads incoming. (#8037, @Singularity23x0)
- Fixed a bug that Kueue's scheduler would re-evaluate and update already finished workloads, significantly
impacting overall scheduling throughput. This re-evaluation of a finished workload would be triggered when:
1. Kueue is restarted
2. There is any event related to LimitRange or RuntimeClass instances referenced by the workload (#8186, @mbobrovskyi)
- Fixed the following bugs for the StatefulSet integration by ensuring the Workload object
has the ownerReference to the StatefulSet:
1. Kueue doesn't keep the StatefulSet as deactivated
2. Kueue marks the Workload as Finished if all StatefulSet's Pods are deleted
3. changing the "queue-name" label could occasionally result in the StatefulSet getting stuck (#4799, @mbobrovskyi)
- JobFramework: Fixed a bug that allowed a deactivated workload to be activated. (#8424, @chengjoey)
- Kubeflow TrainJob v2: fix the bug to prevent duplicate pod template overrides when starting the Job is retried. (#8269, @j-skiba)
- MultiKueue via ClusterProfile: Fix the panic if the configuration for ClusterProfiles wasn't provided in the configMap. (#8071, @mszadkow)
- MultiKueue: Fixed status sync for CRD-based jobs (JobSet, Kubeflow, Ray, etc.) that was blocked while the local job was suspended. (#8308, @IrvingMg)
- MultiKueue: fix the bug that for Pod integration the AdmissionCheck status would be kept Pending indefinitely,
even when the Pods are already running.
The analogous fix is also done for the batch/Job when the MultiKueueBatchJobWithManagedBy feature gate is disabled. (#8189, @IrvingMg)
- MultiKueue: fix the eviction when initiated by the manager cluster (due to, e.g., Preemption or WaitForPodsReady timeout). (#8151, @mbobrovskyi)
- ProvisioningRequest: Fixed a bug that prevented events from being updated when the AdmissionCheck state changed. (#8394, @mbobrovskyi)
- Scheduling: fix a bug where evictions submitted by the scheduler (preemptions, and evictions due to TAS NodeHotSwap failing)
  could result in a conflict in case of concurrent workload modification by another controller.
  This could lead to indefinitely failing requests sent by the scheduler in some scenarios when the eviction is initiated by
  TAS NodeHotSwap. (#7933, @mbobrovskyi)
- TAS NodeHotSwap: fixed a bug that allowed the scheduler to requeue a workload even if it was already deleted on TAS NodeHotSwap eviction. (#8278, @mbobrovskyi)
- TAS: Fix handling of admission for workloads using the LeastFreeCapacity algorithm when the "unconstrained"
  mode is used. In that case, scheduling would fail if at least one node in the cluster did not have
  enough capacity to accommodate at least one Pod. (#8168, @PBundyra)
- TAS: fix TAS resource flavor controller to extract only scheduling-relevant node updates to prevent unnecessary reconciliation. (#8452, @Ladicle)
- TAS: fix a performance bug where continuous reconciles of the TAS ResourceFlavor (and related ClusterQueues)
  were triggered by updates to Nodes' heartbeat times. (#8342, @PBundyra)
- TAS: fix bug that when TopologyAwareScheduling is disabled, but there is a ResourceFlavor configured with topologyName, then preemptions fail with "workload requires Topology, but there is no TAS cache information". (#8167, @zhifei92)
- TAS: fixed a performance issue due to unnecessary (empty) requests by the TopologyUngater (#8279, @mbobrovskyi)
### Other (Cleanup or Flake)
- Fix: Removed outdated comments incorrectly stating that deployment, statefulset, and leaderworkerset integrations require pod integration to be enabled. (#8053, @IrvingMg)
- Improve error messages for validation errors regarding WorkloadPriorityClass changes in workloads. (#8334, @olekzabl)
- MultiKueue: improve the MultiKueueCluster reconciler to skip attempting to reconcile and throw errors
when the corresponding Secret or ClusterProfile objects don't exist. The reconcile will be triggered on
creation of the objects. (#8144, @mszadkow)
- Removes ConfigurableResourceTransformations feature gate. (#8133, @mbobrovskyi)