Skip to content

feat: add Collections support + runtime/controller rewrite#936

Merged
k8s-ci-robot merged 23 commits intokubernetes-sigs:mainfrom
a-hilaly:imeplement-collections
Jan 25, 2026
Merged

feat: add Collections support + runtime/controller rewrite#936
k8s-ci-robot merged 23 commits intokubernetes-sigs:mainfrom
a-hilaly:imeplement-collections

Conversation

@a-hilaly
Copy link
Copy Markdown
Member

@a-hilaly a-hilaly commented Jan 6, 2026

Implements KREP-002 collections, but getting there required rewriting the
runtime and instance controller.

1/ collections (KREP-002)

Adds forEach to resource definitions, enabling a single resource block to
expand into N Kubernetes resources based on runtime data. Instead of writing
N blocks for N resources, users express "create N resources where N comes
from spec" and kro handles the rest.

Docs: https://a-hilaly.github.io/kro/next/docs/concepts/rgd/resource-definitions/collections

2/ pkg/runtime rewrite

The old Synchronize method tried to do everything in one pass. Now the
controller interacts with runtime.Node through four methods:
GetDesired(), SetObserved(), IsIgnored(), IsReady(). Graph and
runtime unified around a concrete Node type, replacing the
ResourceDescriptor interface abstraction.

3/ pkg/applyset refactor

Moved to pkg/controller/instance/applyset/ and redesigned as stateless:
Project() computes metadata, Apply() handles SSA, Prune() cleans up
orphans.

4/ some bugs fixes found on the refactor way:

  • CEL data pending errors use delayed requeue instead of exponential backoff
  • String interpolation resolves correctly
  • Instance status shrinks when expressions stop resolving

Rewrites `instanceGraphReconciler` implementation and associated unit tests. This component is needs to be prepped to be used for deeper integrations into levelled apply

Signed-off-by: Jakob Möller <contact@jakob-moeller.com>
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 6, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: a-hilaly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 6, 2026
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 6, 2026
@a-hilaly
Copy link
Copy Markdown
Member Author

a-hilaly commented Jan 6, 2026

/ok-to-test
/test all

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jan 6, 2026
@a-hilaly a-hilaly force-pushed the imeplement-collections branch from ee2c5bd to 7d96612 Compare January 6, 2026 08:48
@a-hilaly
Copy link
Copy Markdown
Member Author

a-hilaly commented Jan 6, 2026

/retest

@a-hilaly a-hilaly force-pushed the imeplement-collections branch from 7d96612 to 15bf0cd Compare January 6, 2026 09:22
@a-hilaly
Copy link
Copy Markdown
Member Author

a-hilaly commented Jan 6, 2026

/test all

Implements KREP-002. Adds the `forEach` field to resource definitions,
enabling a single resource block to create multiple resources based on
runtime data. A collection resource remains one DAG node but expands to
N kubernetes resources when evaluated.

This patch removes the need to statically define each resource in the RGD.
Instead of writing N blocks for N resources, users express "create N
resources where N comes from spec" and kro handles the rest.

This patch modifies API types, CRD schema, CEL environment, graph builder,
runtime, and instance controller. Includes unit tests for expansion logic
and integration tests for end-to-end collection lifecycle.
When CEL expressions reference fields that don't exist yet (e.g
`status.vpcID` before the resource is reconciled), the controller
now uses delayed requeue instead of exponential backoff.

This fixes integration test failures caused by the controller
entering exponential backoff when resources are still being
populated with status fields.
Move applyset from `pkg/applyset/` to `pkg/controller/instance/applyset/` and
redesign it as a stateless component tightly integrated with the instance
controller. The new API separates concerns cleanly: Project() computes parent
metadata without I/O, Apply() handles SSA and returns batch metadata, and
Prune() is a separate method that accepts scope from Project(). The controller
now owns all GET operations and passes resource versions for change detection.

Added integration tests for multi-GVK apply, pruning behavior, and instance/RGD
instacnes isolation.
The runtime package was using an interface-heavy design that created
unnecessary abstraction layers between graph definitions and runtime
execution. This change consolidates around a concrete `Node` type that
represents resources, collections, and external references uniformly.

The old `Synchronize` method was a monolith that tried to do everything:
evaluate static variables, propagate resource variables, check readiness,
resolve conditions, and update instance status - all in one pass. It was
optimizing for cases that rarely mattered while making the code hard to
reason about. It worked, but mysteriously so.

Key changes:

The graph package now exports a `Node` type with `NodeMeta` containing
all identification data (ID, type, GVR, namespaced, dependencies).
`NodeType` distinguishes `Resource`, `Collection`, `External`, and
`Instance` nodes. The old `Resource` struct and its getter methods are
replaced by direct field access through `Node.Meta`.

The runtime package replaces `interfaces.go` with `node.go` and
`eval.go`. Runtime nodes wrap graph nodes and add evaluation state
(desired, observed, ignored). CEL evaluation is centralized in
`eval.go`. Collection expansion uses cartesian product for
multi-dimensional `forEach` with proper identity validation to detect
duplicate resources.

The instance controller now interacts with `runtime.Node` through four
methods only: `GetDesired()`, `SetObserved()`, `IsIgnored()`, and
`IsReady()`. This clean API boundary replaces the previous
`ResourceDescriptor` interface abstraction.

Bug fixes along the way:

- CEL data pending errors (e.g., `status.field` not yet populated) now
  trigger delayed requeue instead of exponential backoff
- String interpolation in templates now resolves correctly
- Instance status fields shrink when CEL expressions no longer resolve
  (e.g., when a dependency becomes ignored)
- Collection expansion detects duplicate resource identities before
  apply, preventing silent conflicts
- `forEach` iterator names are validated against resource IDs and
  reserved keywords at build time
- Template/externalRef combinations are validated early, catching
  invalid resource definitions before runtime
@a-hilaly a-hilaly force-pushed the imeplement-collections branch from 15bf0cd to 8a58fa9 Compare January 14, 2026 09:50
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2026
@a-hilaly a-hilaly force-pushed the imeplement-collections branch from 8a58fa9 to 15c9a01 Compare January 14, 2026 09:59
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2026
@a-hilaly a-hilaly marked this pull request as ready for review January 14, 2026 10:12
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 14, 2026
@a-hilaly a-hilaly changed the title feat: add Collections support feat: add Collections support + runtime/controller rewrite Jan 14, 2026
@a-hilaly a-hilaly force-pushed the imeplement-collections branch 2 times, most recently from c4fdeee to 44ffe5a Compare January 14, 2026 10:35
Deletion could fail when upstream resources lost data that templates
depended on. Now `GetDesiredIdentity()` resolves only `metadata.name`
and `metadata.namespace`, skipping readiness gates and unrelated fields.

Deletion uses two phases: `observeDeletionState` walks nodes in topo
order and GETs/LISTs current state, then `deleteTarget` issues the
DELETE for the last resolvable node. `DeleteTargets()` centralizes the
logic for what to delete. `normalizeNamespaces` moves to runtime.
Split into planResourcesForApply, pruneOrphans. Simplify prepareResource
signature. Fix handleExternalRef NotFound handling. Add panic for unknown
node types.
@a-hilaly a-hilaly force-pushed the imeplement-collections branch from 7e313ba to 3e11e91 Compare January 25, 2026 10:04
Collections now treat unresolved desired state as not ready, while resolved-empty
collections remain ready. This keeps dependency gating correct and clarifies the
empty vs unresolved contract in tests and hard-resolve behavior.

Could've been a boolean `resolved` on node. But not super necessary for now.
- evaluate includeWhen with schema-only env/context
- remove schema from readyWhen contexts and update collection comment
Copy link
Copy Markdown
Member

@jakobmoellerdev jakobmoellerdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold

feel free to unhold based on how much feedback you want before merge.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2026
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jan 25, 2026
@jakobmoellerdev
Copy link
Copy Markdown
Member

while i appreciate continuous improvements this should be done in a separate PR imho. The last 2 commits alone were worth 2 prs.

@jakobmoellerdev
Copy link
Copy Markdown
Member

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jan 25, 2026
convert the map to a sorted array first to make resource ordering stable.
Unstable ordering can lead to noisy diffs and unexpected reconciliation churn.

### Identity Must Include All Iterator Dimensions
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be fairly easy to check at compile. we can verify that name / namespace include all iterator dimensions by checking their variable in the appropriate fieldpath. WDYT?

Copy link
Copy Markdown
Member Author

@a-hilaly a-hilaly Jan 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already done in 4b35e62


### Map Iteration Order Is Not Deterministic

Map iteration order in Go is randomized. If you iterate over map keys or values,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this mutation is very unnice to do in CEL. if you have a "nice" way of doing that its probably worth adding here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. We do not have a better way that sort the keys for now. best we can do is provide custom CEL functions for that.

@a-hilaly a-hilaly force-pushed the imeplement-collections branch 2 times, most recently from 494bafe to aedcc62 Compare January 25, 2026 20:51
@a-hilaly
Copy link
Copy Markdown
Member Author

@jakobmoellerdev I jusst clipped the last two commits - sending them as seperate PRs.

@a-hilaly
Copy link
Copy Markdown
Member Author

opened #970 and #971

Copy link
Copy Markdown
Member

@jakobmoellerdev jakobmoellerdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 25, 2026
@a-hilaly
Copy link
Copy Markdown
Member Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2026
@k8s-ci-robot k8s-ci-robot merged commit 245cf4e into kubernetes-sigs:main Jan 25, 2026
8 checks passed
@ross-grinvalds
Copy link
Copy Markdown

@a-hilaly, thank you for your work on this. It is very much appreciated!

shivansh-gohem pushed a commit to shivansh-gohem/kro that referenced this pull request Jan 29, 2026
…s-sigs#936)

* refactor: rewrite instance controller

Rewrites `instanceGraphReconciler` implementation and associated unit tests. This component is needs to be prepped to be used for deeper integrations into levelled apply

Signed-off-by: Jakob Möller <contact@jakob-moeller.com>

* feat: add collections support

Implements KREP-002. Adds the `forEach` field to resource definitions,
enabling a single resource block to create multiple resources based on
runtime data. A collection resource remains one DAG node but expands to
N kubernetes resources when evaluated.

This patch removes the need to statically define each resource in the RGD.
Instead of writing N blocks for N resources, users express "create N
resources where N comes from spec" and kro handles the rest.

This patch modifies API types, CRD schema, CEL environment, graph builder,
runtime, and instance controller. Includes unit tests for expansion logic
and integration tests for end-to-end collection lifecycle.

* fix: handle CEL data pending errors with delayed requeue

When CEL expressions reference fields that don't exist yet (e.g
`status.vpcID` before the resource is reconciled), the controller
now uses delayed requeue instead of exponential backoff.

This fixes integration test failures caused by the controller
entering exponential backoff when resources are still being
populated with status fields.

* Add data pending integration tests

* refactor: relocate and simplify applyset to instance controller package

Move applyset from `pkg/applyset/` to `pkg/controller/instance/applyset/` and
redesign it as a stateless component tightly integrated with the instance
controller. The new API separates concerns cleanly: Project() computes parent
metadata without I/O, Apply() handles SSA and returns batch metadata, and
Prune() is a separate method that accepts scope from Project(). The controller
now owns all GET operations and passes resource versions for change detection.

Added integration tests for multi-GVK apply, pruning behavior, and instance/RGD
instacnes isolation.

* refactor: unify graph and runtime with node-based architecture

The runtime package was using an interface-heavy design that created
unnecessary abstraction layers between graph definitions and runtime
execution. This change consolidates around a concrete `Node` type that
represents resources, collections, and external references uniformly.

The old `Synchronize` method was a monolith that tried to do everything:
evaluate static variables, propagate resource variables, check readiness,
resolve conditions, and update instance status - all in one pass. It was
optimizing for cases that rarely mattered while making the code hard to
reason about. It worked, but mysteriously so.

Key changes:

The graph package now exports a `Node` type with `NodeMeta` containing
all identification data (ID, type, GVR, namespaced, dependencies).
`NodeType` distinguishes `Resource`, `Collection`, `External`, and
`Instance` nodes. The old `Resource` struct and its getter methods are
replaced by direct field access through `Node.Meta`.

The runtime package replaces `interfaces.go` with `node.go` and
`eval.go`. Runtime nodes wrap graph nodes and add evaluation state
(desired, observed, ignored). CEL evaluation is centralized in
`eval.go`. Collection expansion uses cartesian product for
multi-dimensional `forEach` with proper identity validation to detect
duplicate resources.

The instance controller now interacts with `runtime.Node` through four
methods only: `GetDesired()`, `SetObserved()`, `IsIgnored()`, and
`IsReady()`. This clean API boundary replaces the previous
`ResourceDescriptor` interface abstraction.

Bug fixes along the way:

- CEL data pending errors (e.g., `status.field` not yet populated) now
  trigger delayed requeue instead of exponential backoff
- String interpolation in templates now resolves correctly
- Instance status fields shrink when CEL expressions no longer resolve
  (e.g., when a dependency becomes ignored)
- Collection expansion detects duplicate resource identities before
  apply, preventing silent conflicts
- `forEach` iterator names are validated against resource IDs and
  reserved keywords at build time
- Template/externalRef combinations are validated early, catching
  invalid resource definitions before runtime

* Refactor: Disolve the kind priority system.

Simplified `getOrCreateExpr` to only cache non-iteration expressions. Iteration expressions now get fresh state objects each time (they need per-item evaluation anyway).

* Install ginkgo CLI via go tools

* Document ResourceStates in `instance` package

* refactor(applyset): precompute label selector, unify namespace logic, parallelize listing

- Build `labelSelector` once during initialization instead of per-prune
- Remove `DeepCopy()` in applyResource - mutate directly since SSA returns new object
- Extract `resolveNamespace()` to consolidate fallback logic (obj -> parent -> default)
- List resources in parallel during prune using `errgroup`

* Set default number of workers to 3 for integration tests

* Add documentation on collection labelling system

* Add TODO comments for collections and applyset id overwrites

* feat(applyset): detect and reject `ApplySet` membership conflicts

Prevent silent resource takeover when a resource already belongs to a
different ApplySet. Instead of overwriting the `applyset.kubernetes.io/part-of
label`, return an `ApplySetConflictError` with details about the conflict.

This ensures controllers don't accidentally claim resources managed by
other instances, improving multi-tenant safety.

* feat(graph): validate forEach dimensions are used in resource identity

Adds validation to ensure all forEach dimensions (iterators) are used
in resource identity fields (metadata.name and/or metadata.namespace).
This prevents collections from creating duplicate resources with the
same identity.

For namespaced resources, iterators can appear in either name or
namespace. For cluster-scoped resources, iterators must appear in
the name field since namespace is ignored.

* applyset: use current object for ownership checks

Replace the applyset conflict check to read the applyset membership
label from the live object already fetched by the controller,

* refactor(instance): improve error handling in status conditions and use t.Context()

- Handle JSON marshal/unmarshal errors explicitly in GetConditions, SetConditions,
  and initialStatus instead of silently ignoring them
- Use `t.Context()` instead of `context.Background()` in applyset tests
- Return empty slice instead of nil from GetConditions for consistency

* refactor(runtime): add identity-only resolution for deletion

Deletion could fail when upstream resources lost data that templates
depended on. Now `GetDesiredIdentity()` resolves only `metadata.name`
and `metadata.namespace`, skipping readiness gates and unrelated fields.

Deletion uses two phases: `observeDeletionState` walks nodes in topo
order and GETs/LISTs current state, then `deleteTarget` issues the
DELETE for the last resolvable node. `DeleteTargets()` centralizes the
logic for what to delete. `normalizeNamespaces` moves to runtime.

* refactor(instance): reorganize reconcileResources into distinct phases

Split into planResourcesForApply, pruneOrphans. Simplify prepareResource
signature. Fix handleExternalRef NotFound handling. Add panic for unknown
node types.

* fix(runtime): distinguish unresolved vs empty collections in readiness

Collections now treat unresolved desired state as not ready, while resolved-empty
collections remain ready. This keeps dependency gating correct and clarifies the
empty vs unresolved contract in tests and hard-resolve behavior.

Could've been a boolean `resolved` on node. But not super necessary for now.

* fix(runtime): scope includeWhen/readyWhen contexts

- evaluate includeWhen with schema-only env/context
- remove schema from readyWhen contexts and update collection comment

* test(integration): retry status partial spec updates on conflict

* docs: add clarify on collection readiness and lifecycle

---------

Signed-off-by: Jakob Möller <contact@jakob-moeller.com>
Co-authored-by: Jakob Möller <contact@jakob-moeller.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants