
KREP-006: Propagation Control#861

Open
ellistarn wants to merge 2 commits into kubernetes-sigs:main from ellistarn:prop-krep

Conversation

@ellistarn
Contributor

KREP-006 introduces `propagateWhen`, a per-resource mechanism to conditionally gate mutation as
changes propagate through the graph. `propagateWhen` and `readyWhen` are complementary, bookending
when mutation for a node in the graph can start and when it is considered complete.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 24, 2025
@ellistarn ellistarn marked this pull request as draft November 24, 2025 22:32
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 24, 2025
@ellistarn ellistarn force-pushed the prop-krep branch 2 times, most recently from ec0eda0 to 06c4cc8 Compare November 24, 2025 22:48
@ellistarn ellistarn marked this pull request as ready for review November 25, 2025 16:28
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 25, 2025
Mechanically, supporting concurrent mutations will require new machinery in KRO. We defer the exact
details of this discussion to the implementation phase, due to the magnitude of the change.

Directionally, we could introduce a new `ResourceGraphRevision` CRD for each unique set of inputs to
Member

Resource graph revision will become absolutely critical for the long-term stability of KRO and its graph reconciliation.

BUT I believe the graph revision should be split into its own KREP, separate from the propagation policy. We need a stable implementation plan first and foremost. I love the ideas, but considering KRO's state right now, we need to start thinking about how to get this in without causing significant breaks.

IMHO, both ApplySets and Static Type Eval have caused too many regressions because we didn't focus enough on test plans.

@a-hilaly a-hilaly added this to the 0.10 milestone Nov 26, 2025
@jamal

jamal commented Dec 3, 2025

I wanted to +1 this, it would be fantastic to have this! I'm currently iterating on how to solve a problem I have where propagation of RGD changes can cause some impact to developers. To explain my use case a bit, I'm using kro to manage ephemeral development environments (somewhat similar to what Tilt, DevSpace or Skaffold will let you do). This is targeting game developers/designers who don't have kubectl or work with infrastructure/backend at all, so I wanted to avoid requiring installing things like kubectl, or even giving them cluster access. The RGD deploys a set of services and their dependencies so that a developer can have an isolated environment to work on.

But one of the issues I'm running into is that the developer may be actively working and have state/configuration on the service that gets lost when the pod gets replaced. I'm iterating on options, from trying to persist that state (which would add a ton of complexity) to just controlling when the instance can be updated.

Anyhow, having a way to control propagation would make that a lot simpler to solve. I'm still debating what the control would be: either time-based (to try to do things outside normal working hours) or manually managed by the developer using the CLI tool.

Thank you! Definitely looking forward to seeing how this evolves.

Contributor

@barney-s barney-s left a comment

Thanks for the proposal.

defined within. For example, an organization that has used KRO to unify application deployment with
an Application CRD risks cluster-wide impact from a bad change to the ResourceGraphDefinition. A
ResourceGraphDefinition that loops over a collection of zones to deploy a set of zonal Deployments
risks regional impact from a bad change in the deployment's configuration.
Contributor

Just to clarify, are the proposed controls scoped to a single instance, or do they apply across instances of an RGD?

Contributor Author

Both.

// Returns true when updated items grow exponentially: 1, 2, 4, 8, 16...
// An item is considered updated when its generation annotation matches the graph revision generation
exponentiallyUpdated(collection, each) =
size(collection.filter(i, i.metadata.annotations['kro.run/generation'] == string(schema.metadata.generation))) >=
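The quoted expression is truncated here; as a rough Python sketch of the intended exponential gating (batches of 1, 2, 4, 8, ...). This is illustrative only, since the KREP expresses this in CEL, and the function name and window math are assumptions:

```python
def exponentially_ready(index, updated_count):
    """An item at position `index` may update once the allowed window,
    which doubles each time progress fills it (1, 2, 4, 8, ...),
    covers that index."""
    allowed = 1
    while allowed <= updated_count:
        allowed *= 2  # window doubles as updated items accumulate
    return index < allowed
```

With nothing updated yet only the first item may proceed; once it reports updated, the next one unlocks, then two more, then four, and so on.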
Contributor

Performance consideration: collection.filter(...) iterates over the entire collection.
If propagateWhen is evaluated for every resource in the collection during a reconciliation loop, this logic becomes O(N^2).
For large collections (e.g. thousands of resources), this could be a performance bottleneck.
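For illustration, one way an implementation could avoid the quadratic cost is to count updated items once per reconcile and reuse that count for every per-item check. This is a hypothetical Python sketch, not actual KRO code:

```python
def gate_linear(updated_flags, batch=3):
    """Count updated items once per reconcile (O(N)) instead of
    re-filtering the collection for every item (O(N^2))."""
    updated = sum(updated_flags)
    threshold = (updated // batch + 1) * batch
    # Each per-item check is then a constant-time index comparison.
    return [i for i in range(len(updated_flags)) if i < threshold]
```

The per-item CEL evaluation would then amount to an index comparison against a precomputed threshold.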

Contributor Author

This was just an example; we'll optimize this in the implementation.

// ... existing fields ...

// PropagateWhen defines CEL expressions that allow the object to be mutated when true
PropagateWhen []string `json:"propagateWhen,omitempty"`
Contributor

Please clarify the semantics of the []string slice.
Are these CEL expressions evaluated as a logical AND (all must be true) or logical OR (at least one must be true)?

Contributor Author

@ellistarn ellistarn Jan 28, 2026

It's identical to the ReadyWhen semantics. All must pass (AND)
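Conceptually, something like this Python sketch, with expressions modeled as callables for illustration (not KRO's actual CEL machinery):

```python
def propagation_allowed(propagate_when, env):
    """All propagateWhen expressions must evaluate true (logical AND),
    matching readyWhen semantics. An empty list gates nothing, since
    all() over an empty iterable is True."""
    return all(expr(env) for expr in propagate_when)
```

So `propagateWhen: []` trivially allows propagation, and a single false expression blocks it.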

// ... existing fields ...

// PropagateWhen defines CEL expressions that allow the object to be mutated when true
PropagateWhen []string `json:"propagateWhen,omitempty"`
Contributor

Consider adding a FailurePolicy field.
If a CEL expression fails to evaluate (e.g., division by zero, missing field, type error), should the propagation be allowed (FailOpen) or blocked (FailClosed)?

Contributor Author

Let's consider this in future scope. Fail closed is appropriate, I think.
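A fail-closed evaluation would look something like this sketch (hypothetical wrapper; as noted, the actual policy is deferred to future scope):

```python
def evaluate_gate(expr, env):
    """Fail closed: any evaluation error blocks propagation rather
    than letting a misconfigured expression wave changes through."""
    try:
        return bool(expr(env))
    except Exception:
        return False  # e.g. missing field or type error -> blocked
```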

resources defined in my ResourceGraphDefinitions when the inputs to the graph change. As an
administrator, I want to limit the rate of change to the instances of a ResourceGraphDefinition
when the definition itself changes.
2. **Time Controls**: As an administrator, I want to prevent changes from happening outside of
Contributor

For "Time Controls", will the CEL environment provide a safe and deterministic way to access the current timestamp (e.g., time.now())?
This is often restricted in standard CEL environments to ensure determinism.

Contributor Author

Agreed -- punting time to an external KREP. You can always drive it with a cronjob and externalRef.


Probably.

Changes can be made to the inputs of the graph while other changes are still propagating through.
Contributor

If we make a change to an RGD, existing propagations would see the latest RGD. How can we have multiple views of the RGD across multiple propagations?

Changes can be made to the inputs of the graph while other changes are still propagating through.
This is similar to Kubernetes deployments, which can be mutated mid-rollout. A common use case for
this is Rollback, described above. A mutation is made to the graph, and then the inverse of the
mutation is made before it completes. Overlapping propagations can be more complicated, with up to
Contributor

Are propagations stateful? If so, where is the state stored?

Contributor Author

Not currently, but I believe we need to make them so. Without this, we can only support a single propagation at a time. I am deferring this work to a future KREP (ResourceGraphRevisions). I think this work stands on its own, though.

```
Graph: A → B, C, D (collection with linearlyUpdated)

T1: Mutation T2: Mutation Propagates T3: Rollback Starts T4: Rollback Propagates
```
Contributor

Effectively indicating there is only one view or propagation at a given time?

Comment on lines +170 to +172
increasingly stale. One clear example of this is when using KRO to model software release pipelines.
Given an RGD for `SoftwareReleaseEnvironment` and a pipeline that deploys this environment to many
stages and regions, each of which are dependent on each other, it may take O(days) to propagate a
Contributor

Again, a reason why this may not be the right abstraction for such use cases. We may need to define what is in scope and what is not.

Mechanically, supporting concurrent mutations will require new machinery in KRO. We defer the exact
details of this discussion to the implementation phase, due to the magnitude of the change.

Directionally, we could introduce a new `ResourceGraphRevision` CRD for each unique set of inputs to
@chrisdoherty4
Contributor

I like the approach.

KREP-0003 discusses decorators and how generally they should be considered singletons; however it doesn't enforce the constraint and acknowledges there could be unknown use cases where a schema can be provided and result in non-conflicting changes between decorator instances. Is it worth considering how propagation control works in that context or at least making it an explicit non-goal?

@ellistarn
Contributor Author

KREP-0003 discusses decorators and how generally they should be considered singletons; however it doesn't enforce the constraint and acknowledges there could be unknown use cases where a schema can be provided and result in non-conflicting changes between decorator instances. Is it worth considering how propagation control works in that context or at least making it an explicit non-goal?

I owe this community a doc on "singletons" or "Schemaless Graphs".

@ellistarn ellistarn force-pushed the prop-krep branch 4 times, most recently from 73714ee to 58f6e04 Compare January 28, 2026 17:38
Introduces propagateWhen, a per-resource mechanism to conditionally gate
mutation as changes propagate through the graph.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…earlyReady

The new names better describe the intent - "is this item ready to propagate?"
rather than describing the mechanism (checking update counts).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Comment on lines +72 to +73
pod.ready() // true if readyWhen conditions are satisfied
pod.updated() // true if updated to the current graph generation
Contributor Author

I wonder if these should be collapsed. Both the linearlyReady and exponentiallyReady functions below use updated().

Contributor Author

I.e., what if ready() returned false when the resource was itself ready but the generation has moved on? Thoughts on modeling this, @a-hilaly?


// exponentiallyReady(item, collection) -> bool
// Item can proceed when exponential batch (1, 2, 4, 8...) is reached
exponentiallyReady(pod, pods)
@cheeseandcereal cheeseandcereal Mar 10, 2026

Should the exponential factor be exposed as an optional parameter instead of always being forced to 2.0?

## Proposed API and Behavior

1. Add `propagateWhen` to `ResourceGraphDefinitionSpec` to control propagation to graph instances
2. Add `propagateWhen` to `Resource` to control propagation to a resource in a graph instance
@k1ranpk k1ranpk Mar 10, 2026

Is it possible for the propagation to be stuck in a cycle between the different controls expressed across multiple resources and RGDs?


1. **Rate Controls**: As an administrator, I want to limit the rate of change to collections of
resources defined in my ResourceGraphDefinitions when the inputs to the graph change. As an
administrator, I want to limit the rate of change to the instances of a ResourceGraphDefinition
Contributor

What is the method for overriding? Would it be updating the RGD to no longer have these propagation controls?

Contributor Author

If you specify `propagateWhen: []`, then there are no gates.

└─ InstanceManaged - Instance finalizers and labels are properly set
```

Above, we assert that readiness and propagation are separate concepts, and thus we introduce a
Contributor

This may tie into resource graph revisions, but what would be the way for a user to know whether an instance has had the RGD propagated to it?

Is this check all-or-nothing, or is there a way to have a per-instance understanding of which changes have been applied?

Contributor Author

I was imagining an annotation on the resource that shows a graph revision, with KRO managing those revisions internally.

## Summary

KREP-006 introduces `propagateWhen`, a per-resource mechanism to conditionally gate mutation as
changes propagate through the graph. Both `propagateWhen` and `readyWhen` are complementary and

How will propagateWhen interact with includeWhen? If an RGD update changes an includeWhen condition to remove or add a node from a graph, is the node removal controlled by propagateWhen?

Contributor Author

My understanding is that includeWhen removes the node from the graph, so it would not be counted in the propagation math. If it resolves to true, it is included in the graph and is taken into account. Though I suppose order matters here -- if it poofs into existence, we probably want to start with it, depending on the order.

This may be a problem to solve with resource graph revisions

indexOf(pod, pods) < (pods.filter(p, p.updated()).size() / 3 + 1) * 3

// exponentiallyReady(item, collection) -> bool
// Item can proceed when exponential batch (1, 2, 4, 8...) is reached
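For reference, the linear formula quoted above translates to roughly this Python sketch (illustrative rendering of the CEL, with the batch size of 3 taken from the expression):

```python
def linearly_ready(index, updated, batch=3):
    """Mirrors: indexOf(pod, pods) < (updated.size() / 3 + 1) * 3.
    Items proceed in fixed-size batches: with nothing updated,
    indices 0-2 are allowed; once 3 are updated, 0-5; and so on."""
    return index < (updated // batch + 1) * batch
```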

How are instances batched for update? Are they chosen randomly per RGD generation, or is the ordering fixed between generations?

resources:
- id: pods
forEach:
- pod: ${ schema.spec.pods }
Contributor Author

This needs a revisit. Cannot reference `schema.spec.pods`.

Probably.

Changes can be made to the inputs of the graph while other changes are still propagating through.
This is similar to Kubernetes deployments, which can be mutated mid-rollout. A common use case for

Deployments use an intermediate object (ReplicaSets) between Deployments and Pods. Does KRO need a similar construct to allow overlapping propagations?

Contributor Author

Yes! ResourceGraphRevision

Comment on lines +41 to +42
1. Add `propagateWhen` to `ResourceGraphDefinitionSpec` to control propagation to graph instances
2. Add `propagateWhen` to `Resource` to control propagation to a resource in a graph instance
Contributor

Can we define the CEL context propagateWhen will have for each scenario?


## Proposed API and Behavior

1. Add `propagateWhen` to `ResourceGraphDefinitionSpec` to control propagation to graph instances
Contributor

In what order would we process the instances?

@linux-foundation-easycla

CLA Missing ID CLA Not Signed

One or more co-authors of this pull request were not found. You must specify co-authors in commit message trailer via:

Co-authored-by: name <email>

Supported Co-authored-by: formats include:

  1. Anything <id+login@users.noreply.github.com> - it will locate your GitHub user by id part.
  2. Anything <login@users.noreply.github.com> - it will locate your GitHub user by login part.
  3. Anything <public-email> - it will locate your GitHub user by public-email part. Note that this email must be made public on Github.
  4. Anything <other-email> - it will locate your GitHub user by other-email part but only if that email was used before for any other CLA as a main commit author.
  5. login <any-valid-email> - it will locate your GitHub user by login part, note that login part must be at least 3 characters long.

Alternatively, if the co-author should not be included, remove the Co-authored-by: line from the commit message.

Please update your commit message(s) by doing git commit --amend and then git push [--force] and then request re-running CLA check via commenting on this pull request:

/easycla

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ellistarn
Once this PR has been reviewed and has the lgtm label, please assign jlbutler for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 16, 2026
michaelhtm added a commit to michaelhtm/kro that referenced this pull request Apr 17, 2026
This KREP discusses the implementation plan for instance propagation
control across GraphRevisions.

builds on:
* kubernetes-sigs#861
* kubernetes-sigs#1174
michaelhtm added a commit to michaelhtm/kro that referenced this pull request Apr 30, 2026
This KREP discusses the implementation plan for instance propagation
control across GraphRevisions.

builds on:
* kubernetes-sigs#861
* kubernetes-sigs#1174

Labels

cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/krep size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
