feat: add Collections support + runtime/controller rewrite#936
feat: add Collections support + runtime/controller rewrite#936k8s-ci-robot merged 23 commits intokubernetes-sigs:mainfrom
Conversation
Rewrites `instanceGraphReconciler` implementation and associated unit tests. This component is needs to be prepped to be used for deeper integrations into levelled apply Signed-off-by: Jakob Möller <contact@jakob-moeller.com>
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: a-hilaly The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/ok-to-test |
ee2c5bd to
7d96612
Compare
|
/retest |
7d96612 to
15bf0cd
Compare
|
/test all |
Implements KREP-002. Adds the `forEach` field to resource definitions, enabling a single resource block to create multiple resources based on runtime data. A collection resource remains one DAG node but expands to N kubernetes resources when evaluated. This patch removes the need to statically define each resource in the RGD. Instead of writing N blocks for N resources, users express "create N resources where N comes from spec" and kro handles the rest. This patch modifies API types, CRD schema, CEL environment, graph builder, runtime, and instance controller. Includes unit tests for expansion logic and integration tests for end-to-end collection lifecycle.
When CEL expressions reference fields that don't exist yet (e.g `status.vpcID` before the resource is reconciled), the controller now uses delayed requeue instead of exponential backoff. This fixes integration test failures caused by the controller entering exponential backoff when resources are still being populated with status fields.
Move applyset from `pkg/applyset/` to `pkg/controller/instance/applyset/` and redesign it as a stateless component tightly integrated with the instance controller. The new API separates concerns cleanly: Project() computes parent metadata without I/O, Apply() handles SSA and returns batch metadata, and Prune() is a separate method that accepts scope from Project(). The controller now owns all GET operations and passes resource versions for change detection. Added integration tests for multi-GVK apply, pruning behavior, and instance/RGD instacnes isolation.
The runtime package was using an interface-heavy design that created unnecessary abstraction layers between graph definitions and runtime execution. This change consolidates around a concrete `Node` type that represents resources, collections, and external references uniformly. The old `Synchronize` method was a monolith that tried to do everything: evaluate static variables, propagate resource variables, check readiness, resolve conditions, and update instance status - all in one pass. It was optimizing for cases that rarely mattered while making the code hard to reason about. It worked, but mysteriously so. Key changes: The graph package now exports a `Node` type with `NodeMeta` containing all identification data (ID, type, GVR, namespaced, dependencies). `NodeType` distinguishes `Resource`, `Collection`, `External`, and `Instance` nodes. The old `Resource` struct and its getter methods are replaced by direct field access through `Node.Meta`. The runtime package replaces `interfaces.go` with `node.go` and `eval.go`. Runtime nodes wrap graph nodes and add evaluation state (desired, observed, ignored). CEL evaluation is centralized in `eval.go`. Collection expansion uses cartesian product for multi-dimensional `forEach` with proper identity validation to detect duplicate resources. The instance controller now interacts with `runtime.Node` through four methods only: `GetDesired()`, `SetObserved()`, `IsIgnored()`, and `IsReady()`. This clean API boundary replaces the previous `ResourceDescriptor` interface abstraction. Bug fixes along the way: - CEL data pending errors (e.g., `status.field` not yet populated) now trigger delayed requeue instead of exponential backoff - String interpolation in templates now resolves correctly - Instance status fields shrink when CEL expressions no longer resolve (e.g., when a dependency becomes ignored) - Collection expansion detects duplicate resource identities before apply, preventing silent conflicts - `forEach` iterator names are validated against resource IDs and reserved keywords at build time - Template/externalRef combinations are validated early, catching invalid resource definitions before runtime
15bf0cd to
8a58fa9
Compare
8a58fa9 to
15c9a01
Compare
c4fdeee to
44ffe5a
Compare
Deletion could fail when upstream resources lost data that templates depended on. Now `GetDesiredIdentity()` resolves only `metadata.name` and `metadata.namespace`, skipping readiness gates and unrelated fields. Deletion uses two phases: `observeDeletionState` walks nodes in topo order and GETs/LISTs current state, then `deleteTarget` issues the DELETE for the last resolvable node. `DeleteTargets()` centralizes the logic for what to delete. `normalizeNamespaces` moves to runtime.
Split into planResourcesForApply, pruneOrphans. Simplify prepareResource signature. Fix handleExternalRef NotFound handling. Add panic for unknown node types.
7e313ba to
3e11e91
Compare
Collections now treat unresolved desired state as not ready, while resolved-empty collections remain ready. This keeps dependency gating correct and clarifies the empty vs unresolved contract in tests and hard-resolve behavior. Could've been a boolean `resolved` on node. But not super necessary for now.
- evaluate includeWhen with schema-only env/context - remove schema from readyWhen contexts and update collection comment
jakobmoellerdev
left a comment
There was a problem hiding this comment.
/lgtm
/hold
feel free to unhold based on how much feedback you want before merge.
|
while i appreciate continuous improvements this should be done in a separate PR imho. The last 2 commits alone were worth 2 prs. |
|
/label tide/merge-method-squash |
| convert the map to a sorted array first to make resource ordering stable. | ||
| Unstable ordering can lead to noisy diffs and unexpected reconciliation churn. | ||
|
|
||
| ### Identity Must Include All Iterator Dimensions |
There was a problem hiding this comment.
I think this should be fairly easy to check at compile. we can verify that name / namespace include all iterator dimensions by checking their variable in the appropriate fieldpath. WDYT?
|
|
||
| ### Map Iteration Order Is Not Deterministic | ||
|
|
||
| Map iteration order in Go is randomized. If you iterate over map keys or values, |
There was a problem hiding this comment.
I think this mutation is very unnice to do in CEL. if you have a "nice" way of doing that its probably worth adding here.
There was a problem hiding this comment.
Yep. We do not have a better way that sort the keys for now. best we can do is provide custom CEL functions for that.
494bafe to
aedcc62
Compare
|
@jakobmoellerdev I jusst clipped the last two commits - sending them as seperate PRs. |
|
/unhold |
|
@a-hilaly, thank you for your work on this. It is very much appreciated! |
…s-sigs#936) * refactor: rewrite instance controller Rewrites `instanceGraphReconciler` implementation and associated unit tests. This component is needs to be prepped to be used for deeper integrations into levelled apply Signed-off-by: Jakob Möller <contact@jakob-moeller.com> * feat: add collections support Implements KREP-002. Adds the `forEach` field to resource definitions, enabling a single resource block to create multiple resources based on runtime data. A collection resource remains one DAG node but expands to N kubernetes resources when evaluated. This patch removes the need to statically define each resource in the RGD. Instead of writing N blocks for N resources, users express "create N resources where N comes from spec" and kro handles the rest. This patch modifies API types, CRD schema, CEL environment, graph builder, runtime, and instance controller. Includes unit tests for expansion logic and integration tests for end-to-end collection lifecycle. * fix: handle CEL data pending errors with delayed requeue When CEL expressions reference fields that don't exist yet (e.g `status.vpcID` before the resource is reconciled), the controller now uses delayed requeue instead of exponential backoff. This fixes integration test failures caused by the controller entering exponential backoff when resources are still being populated with status fields. * Add data pending integration tests * refactor: relocate and simplify applyset to instance controller package Move applyset from `pkg/applyset/` to `pkg/controller/instance/applyset/` and redesign it as a stateless component tightly integrated with the instance controller. The new API separates concerns cleanly: Project() computes parent metadata without I/O, Apply() handles SSA and returns batch metadata, and Prune() is a separate method that accepts scope from Project(). The controller now owns all GET operations and passes resource versions for change detection. Added integration tests for multi-GVK apply, pruning behavior, and instance/RGD instacnes isolation. * refactor: unify graph and runtime with node-based architecture The runtime package was using an interface-heavy design that created unnecessary abstraction layers between graph definitions and runtime execution. This change consolidates around a concrete `Node` type that represents resources, collections, and external references uniformly. The old `Synchronize` method was a monolith that tried to do everything: evaluate static variables, propagate resource variables, check readiness, resolve conditions, and update instance status - all in one pass. It was optimizing for cases that rarely mattered while making the code hard to reason about. It worked, but mysteriously so. Key changes: The graph package now exports a `Node` type with `NodeMeta` containing all identification data (ID, type, GVR, namespaced, dependencies). `NodeType` distinguishes `Resource`, `Collection`, `External`, and `Instance` nodes. The old `Resource` struct and its getter methods are replaced by direct field access through `Node.Meta`. The runtime package replaces `interfaces.go` with `node.go` and `eval.go`. Runtime nodes wrap graph nodes and add evaluation state (desired, observed, ignored). CEL evaluation is centralized in `eval.go`. Collection expansion uses cartesian product for multi-dimensional `forEach` with proper identity validation to detect duplicate resources. The instance controller now interacts with `runtime.Node` through four methods only: `GetDesired()`, `SetObserved()`, `IsIgnored()`, and `IsReady()`. This clean API boundary replaces the previous `ResourceDescriptor` interface abstraction. Bug fixes along the way: - CEL data pending errors (e.g., `status.field` not yet populated) now trigger delayed requeue instead of exponential backoff - String interpolation in templates now resolves correctly - Instance status fields shrink when CEL expressions no longer resolve (e.g., when a dependency becomes ignored) - Collection expansion detects duplicate resource identities before apply, preventing silent conflicts - `forEach` iterator names are validated against resource IDs and reserved keywords at build time - Template/externalRef combinations are validated early, catching invalid resource definitions before runtime * Refactor: Disolve the kind priority system. Simplified `getOrCreateExpr` to only cache non-iteration expressions. Iteration expressions now get fresh state objects each time (they need per-item evaluation anyway). * Install ginkgo CLI via go tools * Document ResourceStates in `instance` package * refactor(applyset): precompute label selector, unify namespace logic, parallelize listing - Build `labelSelector` once during initialization instead of per-prune - Remove `DeepCopy()` in applyResource - mutate directly since SSA returns new object - Extract `resolveNamespace()` to consolidate fallback logic (obj -> parent -> default) - List resources in parallel during prune using `errgroup` * Set default number of workers to 3 for integration tests * Add documentation on collection labelling system * Add TODO comments for collections and applyset id overwrites * feat(applyset): detect and reject `ApplySet` membership conflicts Prevent silent resource takeover when a resource already belongs to a different ApplySet. Instead of overwriting the `applyset.kubernetes.io/part-of label`, return an `ApplySetConflictError` with details about the conflict. This ensures controllers don't accidentally claim resources managed by other instances, improving multi-tenant safety. * feat(graph): validate forEach dimensions are used in resource identity Adds validation to ensure all forEach dimensions (iterators) are used in resource identity fields (metadata.name and/or metadata.namespace). This prevents collections from creating duplicate resources with the same identity. For namespaced resources, iterators can appear in either name or namespace. For cluster-scoped resources, iterators must appear in the name field since namespace is ignored. * applyset: use current object for ownership checks Replace the applyset conflict check to read the applyset membership label from the live object already fetched by the controller, * refactor(instance): improve error handling in status conditions and use t.Context() - Handle JSON marshal/unmarshal errors explicitly in GetConditions, SetConditions, and initialStatus instead of silently ignoring them - Use `t.Context()` instead of `context.Background()` in applyset tests - Return empty slice instead of nil from GetConditions for consistency * refactor(runtime): add identity-only resolution for deletion Deletion could fail when upstream resources lost data that templates depended on. Now `GetDesiredIdentity()` resolves only `metadata.name` and `metadata.namespace`, skipping readiness gates and unrelated fields. Deletion uses two phases: `observeDeletionState` walks nodes in topo order and GETs/LISTs current state, then `deleteTarget` issues the DELETE for the last resolvable node. `DeleteTargets()` centralizes the logic for what to delete. `normalizeNamespaces` moves to runtime. * refactor(instance): reorganize reconcileResources into distinct phases Split into planResourcesForApply, pruneOrphans. Simplify prepareResource signature. Fix handleExternalRef NotFound handling. Add panic for unknown node types. * fix(runtime): distinguish unresolved vs empty collections in readiness Collections now treat unresolved desired state as not ready, while resolved-empty collections remain ready. This keeps dependency gating correct and clarifies the empty vs unresolved contract in tests and hard-resolve behavior. Could've been a boolean `resolved` on node. But not super necessary for now. * fix(runtime): scope includeWhen/readyWhen contexts - evaluate includeWhen with schema-only env/context - remove schema from readyWhen contexts and update collection comment * test(integration): retry status partial spec updates on conflict * docs: add clarify on collection readiness and lifecycle --------- Signed-off-by: Jakob Möller <contact@jakob-moeller.com> Co-authored-by: Jakob Möller <contact@jakob-moeller.com>
Implements KREP-002 collections, but getting there required rewriting the
runtime and instance controller.
1/ collections (KREP-002)
Adds
forEachto resource definitions, enabling a single resource block toexpand into N Kubernetes resources based on runtime data. Instead of writing
N blocks for N resources, users express "create N resources where N comes
from spec" and kro handles the rest.
Docs: https://a-hilaly.github.io/kro/next/docs/concepts/rgd/resource-definitions/collections
2/
pkg/runtimerewriteThe old
Synchronizemethod tried to do everything in one pass. Now thecontroller interacts with
runtime.Nodethrough four methods:GetDesired(),SetObserved(),IsIgnored(),IsReady(). Graph andruntime unified around a concrete
Nodetype, replacing theResourceDescriptorinterface abstraction.3/
pkg/applysetrefactorMoved to
pkg/controller/instance/applyset/and redesigned as stateless:Project()computes metadata,Apply()handles SSA,Prune()cleans uporphans.
4/ some bugs fixes found on the refactor way: