KREP-006: Propagation Control #861
Force-pushed from ec0eda0 to 06c4cc8
> Mechanically, supporting concurrent mutations will require new machinery in KRO. We defer the exact
> details of this discussion to the implementation phase, due to the magnitude of the change.
>
> Directionally, we could introduce a new `ResourceGraphRevision` CRD for each unique set of inputs to

The resource graph revision will become absolutely critical for the long-term stability of kro and its graph reconciliation.
BUT I believe the graph revision should be covered in its own KREP, separate from the propagation policy. We need a stable implementation plan first and foremost. I love the ideas, but considering kro's state right now, we need to start thinking about how to get this in without causing significant breaks.
IMHO, both ApplySets and Static Type Eval have caused too many regressions because we didn't focus enough on test plans.
this is what my proposal talks about https://docs.google.com/document/d/1kmi9hXK7tF5JkBBT-FLNDnz5j37RtiXuRIQYYBiFbUE/edit?usp=sharing
I wanted to +1 this; it would be fantastic to have! I'm currently iterating on how to solve a problem where propagation of RGD changes can impact developers.

To explain my use case: I'm using kro to manage ephemeral development environments (somewhat similar to what Tilt, DevSpace, or Skaffold let you do). This targets game developers/designers who don't use kubectl or work with infrastructure/backend at all, so I wanted to avoid requiring tools like kubectl, or even giving them cluster access. The RGD deploys a set of services and their dependencies so that a developer can have an isolated environment to work in. But one issue I'm running into is that the developer may be actively working and have state/configuration on the service that gets lost when the pod is replaced. I'm weighing options, from trying to persist that state (which would add a ton of complexity) to just controlling when the instance can be updated.

Anyhow, having a way to control propagation would make that much simpler to solve. I'm still debating what the control would be: either time-based (to do things outside normal working hours) or manually managed by the developer using the CLI tool. Thank you! Definitely looking forward to seeing how this evolves.
barney-s left a comment:
thanks for the proposal.
> defined within. For example, an organization that has used KRO to unify application deployment with
> an Application CRD risks cluster-wide impact from a bad change to the ResourceGraphDefinition. A
> ResourceGraphDefinition that loops over a collection of zones to deploy a set of zonal Deployments
> risks regional impact from a bad change in the deployment's configuration.

Just to clarify, are the proposed controls scoped to a single instance? Or do they apply across instances of an RGD?
> // Returns true when updated items grow exponentially: 1, 2, 4, 8, 16...
> // An item is considered updated when its generation annotation matches the graph revision generation
> exponentiallyUpdated(collection, each) =
>   size(collection.filter(i, i.metadata.annotations['kro.run/generation'] == string(schema.metadata.generation))) >=

Performance consideration: `collection.filter(...)` iterates over the entire collection.
If `propagateWhen` is evaluated for every resource in the collection during a reconciliation loop, this logic becomes O(N^2).
For large collections (e.g. thousands of resources), this could be a performance bottleneck.
This was just an example. Optimizing this in the impl.
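To make the O(N^2) concern above concrete: the count of updated items can be computed once per reconcile pass, which turns each item's gate check into O(1). A minimal Go sketch; `item` and `exponentialGate` are hypothetical names standing in for KRO's real reconciler types, not proposed API.

```go
package main

import "fmt"

// item models a graph resource plus an "updated" marker, standing in for the
// kro.run/generation annotation comparison in the CEL example above.
type item struct {
	name    string
	updated bool
}

// exponentialGate reports, per item index, whether the exponential batch
// (1, 2, 4, 8, ...) has been reached. The updated count is computed once,
// so gating N items costs O(N) rather than O(N^2).
func exponentialGate(items []item) []bool {
	updated := 0
	for _, it := range items {
		if it.updated {
			updated++
		}
	}
	// Allowed batch size doubles until it exceeds the updated count.
	batch := 1
	for batch <= updated {
		batch *= 2
	}
	out := make([]bool, len(items))
	for i := range items {
		out[i] = i < batch
	}
	return out
}

func main() {
	items := []item{{"a", true}, {"b", false}, {"c", false}, {"d", false}}
	fmt.Println(exponentialGate(items)) // one item updated -> batch of 2
}
```

The same single-pass counting applies to the linear variant; only the batch-size formula changes.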
> // ... existing fields ...
>
> // PropagateWhen defines CEL expressions that allow the object to be mutated when true
> PropagateWhen []string `json:"propagateWhen,omitempty"`

Please clarify the semantics of the []string slice.
Are these CEL expressions evaluated as a logical AND (all must be true) or logical OR (at least one must be true)?
It's identical to the ReadyWhen semantics. All must pass (AND)
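Assuming those semantics, the gate reduces to a conjunction over the evaluated expressions, with an empty list passing vacuously (no gates, matching the "propagateWhen: []" answer later in the thread). A sketch only; `propagationAllowed` is an illustrative helper, not KRO's API.

```go
package main

import "fmt"

// propagationAllowed ANDs the results of already-evaluated propagateWhen
// expressions, mirroring the readyWhen semantics described above: all
// expressions must be true. An empty list gates nothing.
func propagationAllowed(results []bool) bool {
	for _, r := range results {
		if !r {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(propagationAllowed([]bool{true, true}))  // all pass -> true
	fmt.Println(propagationAllowed([]bool{true, false})) // one fails -> false
	fmt.Println(propagationAllowed(nil))                 // no gates -> true
}
```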
> // ... existing fields ...
>
> // PropagateWhen defines CEL expressions that allow the object to be mutated when true
> PropagateWhen []string `json:"propagateWhen,omitempty"`

Consider adding a FailurePolicy field.
If a CEL expression fails to evaluate (e.g., division by zero, missing field, type error), should the propagation be allowed (FailOpen) or blocked (FailClosed)?
Let's consider this in future scope. Fail closed is appropriate, I think.
> resources defined in my ResourceGraphDefinitions when the inputs to the graph change. As an
> administrator, I want to limit the rate of change to the instances of a ResourceGraphDefinition
> when the definition itself changes.
> 2. **Time Controls**: As an administrator, I want to prevent changes from happening outside of

For "Time Controls", will the CEL environment provide a safe and deterministic way to access the current timestamp (e.g., time.now())?
This is often restricted in standard CEL environments to ensure determinism.
Agreed -- punting time to an external KREP. You can always drive it with a cronjob and externalRef.
> Probably.
>
> Changes can be made to the inputs of the graph while other changes are still propagating through.

If we make a change to an RGD, existing propagations would see the latest RGD. How can we have multiple views of the RGD across multiple propagations?
> Changes can be made to the inputs of the graph while other changes are still propagating through.
> This is similar to Kubernetes deployments, which can be mutated mid-rollout. A common use case for
> this is Rollback, described above. A mutation is made to the graph, and then the inverse of the
> mutation is made before it completes. Overlapping propagations can be more complicated, with up to

Are propagations stateful? If so, where is the state stored?
Not currently, but I believe we need to make them so. Without this, we can only support a single propagation at a time. I am deferring this work to a future KREP (ResourceGraphRevisions). I think this work stands on its own, though.
> ```
> Graph: A → B, C, D (collection with linearlyUpdated)
>
> T1: Mutation    T2: Mutation Propagates    T3: Rollback Starts    T4: Rollback Propagates

Effectively indicating there is only one view or propagation at a given time?
> increasingly stale. One clear example of this is when using KRO to model software release pipelines.
> Given an RGD for `SoftwareReleaseEnvironment` and a pipeline that deploys this environment to many
> stages and regions, each of which are dependent on each other, it may take O(days) to propagate a

again a reason why this may not be the right abstraction for such use cases. We may need to define what is scoped and what is not.
I like the approach. KREP-0003 discusses decorators and how, generally, they should be considered singletons; however, it doesn't enforce the constraint and acknowledges there could be unknown use cases where a schema can be provided and result in non-conflicting changes between decorator instances. Is it worth considering how propagation control works in that context, or at least making it an explicit non-goal?
I owe this community a doc on "singletons" or "Schemaless Graphs".
Force-pushed from 73714ee to 58f6e04
Introduces propagateWhen, a per-resource mechanism to conditionally gate mutation as changes propagate through the graph. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…earlyReady The new names better describe the intent - "is this item ready to propagate?" rather than describing the mechanism (checking update counts). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
> pod.ready()   // true if readyWhen conditions are satisfied
> pod.updated() // true if updated to the current graph generation

I wonder if these should be collapsed. Both the linearlyReady and exponentiallyReady functions below use updated().
i.e., what if ready() returned false when the resource was itself ready, but the generation has moved on? Thoughts on modeling this @a-hilaly?
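One way to model the collapse suggested here: a resource only reads as ready while it is on the current graph generation. A sketch; `effectivelyReady` is a hypothetical helper, not something the KREP defines.

```go
package main

import "fmt"

// effectivelyReady folds updated() into ready(): a resource counts as
// ready only when its readyWhen conditions hold AND it has been updated
// to the current graph generation.
func effectivelyReady(ready bool, observedGeneration, graphGeneration int64) bool {
	return ready && observedGeneration == graphGeneration
}

func main() {
	fmt.Println(effectivelyReady(true, 3, 3)) // ready on current generation
	fmt.Println(effectivelyReady(true, 2, 3)) // ready, but generation moved on
}
```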
> // exponentiallyReady(item, collection) -> bool
> // Item can proceed when exponential batch (1, 2, 4, 8...) is reached
> exponentiallyReady(pod, pods)

Should the exponential factor be exposed as an optional parameter instead of always being forced to 2.0?
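If the factor were exposed as a parameter, the batch progression generalizes directly; the KREP as written hard-codes doubling. A sketch with a hypothetical `factor` argument, not proposed API.

```go
package main

import "fmt"

// batchSize returns the allowed rollout batch given how many items have
// already updated, generalizing the hard-coded factor of 2.0: the batch
// grows by `factor` until it exceeds the updated count.
func batchSize(updated int, factor float64) int {
	batch := 1.0
	for int(batch) <= updated {
		batch *= factor
	}
	return int(batch)
}

func main() {
	fmt.Println(batchSize(0, 2.0)) // 1
	fmt.Println(batchSize(1, 2.0)) // 2
	fmt.Println(batchSize(3, 2.0)) // 4
	fmt.Println(batchSize(3, 3.0)) // 9
}
```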
> ## Proposed API and Behavior
>
> 1. Add `propagateWhen` to `ResourceGraphDefinitionSpec` to control propagation to graph instances
> 2. Add `propagateWhen` to `Resource` to control propagation to a resource in a graph instance

Is it possible for the propagation to be stuck in a cycle between the different controls expressed across multiple resources and RGDs?
> 1. **Rate Controls**: As an administrator, I want to limit the rate of change to collections of
> resources defined in my ResourceGraphDefinitions when the inputs to the graph change. As an
> administrator, I want to limit the rate of change to the instances of a ResourceGraphDefinition

what is the method for overriding? Would it be updating the RGD to no longer have these propagate controls?
If you specify propagateWhen: [], then there are no gates.
> └─ InstanceManaged - Instance finalizers and labels are properly set
> ```
>
> Above, we assert that readiness and propagation are separate concepts, and thus we introduce a

this may tie into resource graph revisions, but what would be the way for a user to know whether an instance has had the RGD propagated to it?
Is this check all-or-nothing, or is there a way to have a per-instance understanding of which changes have or haven't been applied?
I was imagining an annotation on the resource that shows a graph revision, with KRO managing those revisions internally.
> ## Summary
>
> KREP-006 introduces `propagateWhen`, a per-resource mechanism to conditionally gate mutation as
> changes propagate through the graph. Both `propagateWhen` and `readyWhen` are complementary and

How will propagateWhen interact with includeWhen? If an RGD update changes an includeWhen condition to remove or add a node from a graph, is the node removal controlled by propagateWhen?
My understanding is that includeWhen removes the node from the graph, so it would not be counted in the propagation math. If it resolves to true, it is included in the graph and taken into account. Though I suppose order matters here: if it poofs into existence, we probably want to start with it, depending on the order.
This may be a problem to solve with resource graph revisions.
> indexOf(pod, pods) < (pods.filter(p, p.updated()).size() / 3 + 1) * 3
>
> // exponentiallyReady(item, collection) -> bool
> // Item can proceed when exponential batch (1, 2, 4, 8...) is reached

How are instances batched for update? Are they chosen randomly per RGD generation, or is the ordering fixed between generations?
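One deterministic answer to this question would be to fix the ordering with a stable key, such as the instance name, so the same instances land in the same batches across generations. A sketch only; KRO's actual ordering is not decided in this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// rolloutOrder returns instances in a deterministic order by sorting on a
// stable key (the name), without mutating the caller's slice. Batching on
// top of this order stays consistent between RGD generations.
func rolloutOrder(names []string) []string {
	ordered := append([]string(nil), names...)
	sort.Strings(ordered)
	return ordered
}

func main() {
	fmt.Println(rolloutOrder([]string{"zonal-c", "zonal-a", "zonal-b"}))
}
```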
> resources:
>   - id: pods
>     forEach:
>       - pod: ${ schema.spec.pods }

This needs a revisit. Cannot reference `schema.spec.pods` here.
> Probably.
>
> Changes can be made to the inputs of the graph while other changes are still propagating through.
> This is similar to Kubernetes deployments, which can be mutated mid-rollout. A common use case for

Deployments use an intermediate object (replica sets) between deployments and pods. Does KRO needs a similar construct to allow overlapping propagations?
Yes! ResourceGraphRevision
> 1. Add `propagateWhen` to `ResourceGraphDefinitionSpec` to control propagation to graph instances
> 2. Add `propagateWhen` to `Resource` to control propagation to a resource in a graph instance

can we define the CEL context propagateWhen will have for each scenario?
> ## Proposed API and Behavior
>
> 1. Add `propagateWhen` to `ResourceGraphDefinitionSpec` to control propagation to graph instances

in what order would we process the instances?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ellistarn. Needs approval from an approver in each of the affected files.
This KREP discusses the implementation plan for instance propagation control across GraphRevisions. Builds on: kubernetes-sigs#861, kubernetes-sigs#1174
> KREP-006 introduces `propagateWhen`, a per-resource mechanism to conditionally gate mutation as
> changes propagate through the graph. Both `propagateWhen` and `readyWhen` are complementary and
> bookend when mutation for a node in the graph can start and is considered complete.