feat: Virtual cluster lifecycle state model #89
Conversation
tombentley
left a comment
Thanks @SamBarker I think this is useful work, but have doubts that we really need to have "healthy" as a state.
> 1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.
>
> 2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.
It's not immediately apparent to me how this would work in practice. Suppose I have a producer pipelining Produce requests. How do we intend to stop that flow in such a way that the producer doesn't end up with an unacknowledged request?
You mention a timeout later on, but:
- the broker doesn't know what the client's timeout is configured to, so it doesn't know how long it needs to wait
- if the client times out request 1 while thinking that 2, 3 and 4 are in flight then it will send request 5
The exact mechanics of how the proxy tracks in-flight requests per connection are an implementation concern rather than a design one. The design's position is that draining is best-effort: clients are given a time window to complete naturally, with the drain timeout as the hard backstop after which connections are closed regardless. Kafka clients are required to handle connection loss, so forced closure after timeout is protocol-compliant — the goal is to avoid unnecessary disruption, not to guarantee zero errors. I've clarified this in the proposal.
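For illustration, the best-effort semantics described here (give in-flight requests a window to complete, then force-close at the deadline) can be sketched as follows; the connection-tracking shape is hypothetical, not the proxy's actual internals:

```python
import time

def drain(in_flight_by_conn, drain_timeout_s):
    """Best-effort drain: let in-flight requests finish, then force-close
    whatever remains once the timeout backstop expires. `in_flight_by_conn`
    maps a connection id to its count of in-flight requests; this shape is
    illustrative only."""
    deadline = time.monotonic() + drain_timeout_s
    closed_gracefully, force_closed = [], []
    while in_flight_by_conn and time.monotonic() < deadline:
        for conn, in_flight in list(in_flight_by_conn.items()):
            if in_flight == 0:  # all requests acknowledged: close cleanly
                closed_gracefully.append(conn)
                del in_flight_by_conn[conn]
        time.sleep(0.005)  # polling only for the sketch; the real thing is event-driven
    # Hard backstop: connections still open are closed regardless.
    force_closed.extend(in_flight_by_conn)
    in_flight_by_conn.clear()
    return closed_gracefully, force_closed
```

The key property is that the deadline bounds the wait unconditionally: a connection that never quiesces is still closed, which is the "protocol-compliant forced closure" the comment describes.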
Sorry @SamBarker, but I think this does need to be explained.
In the reloading proposal it's been observed that idempotent (but not transactional) producers get their idempotency from the scope of a connection. So trying to minimise duplicate records is clearly a concern for some users, and I think it's only fair that you at least sketch how you think this will work, so everyone can see what the trade-offs are.
> clients are given a time window to complete naturally
This is simply not the case, at least as far as I understand what's proposed.
For a real Kafka broker, the clean shutdown process involves the broker telling the controller that it needs to stop, the controller selects replacement leaders for all the partitions that that broker is currently leading, and only once the broker is not leading any partitions does it shut down. In that way producers avoid having any requests in-flight to the broker that's shutting down.
The proxy lacks that mechanism, which means that we are delivering a weaker idempotency guarantee under disconnections than Kafka does. That's an existing problem. But it will be made worse by building a dynamic restart feature which causes more frequent disconnections.
To me it looks like this is the proposal where we should be figuring out a solution and everyone can assure themselves that it will work well enough.
Sorry, I've got local changes that were meant to go out in sync with the comment.
Also I thought you were pushing for the opposite solution, let the protocol handle it all and skip the drain.
I'll push the changes shortly and come back to this.
The idempotent producer concern is real and the proposal now acknowledges it explicitly. However, this is an existing proxy limitation that predates this proposal — any proxy restart today drops all connections with the same effect. A reload mechanism building on this lifecycle model actually improves the situation by confining disruption to a single virtual cluster rather than the whole process.
Solving the fundamental problem — replicating broker-style partition leader migration to avoid session disruption entirely — is a substantial piece of work well beyond the scope of a lifecycle model, with a multitude of possible solutions that the proxy cannot currently express. The proposal notes the limitation and directs users to consider how analogous situations (broker crash, network split) are handled without the proxy, since clients must already be prepared for those. Users requiring exactly-once guarantees across reconnections should use transactional producers.
The draining section has been updated to clarify in-flight semantics and correct an earlier misstatement about idempotent producers.
On the specific concern about idempotent producers: a proxy-initiated drain is no different from any other connection loss event from the client's perspective. Idempotent producers retain their Producer ID and sequence numbers in client memory across reconnections — a connection drop does not start a new session. When the client reconnects after a drain, it retries unacknowledged requests with the same (PID, epoch, sequence) tuple, and the broker deduplicates them using its existing producer state. A new PID is only assigned on producer process restart, not on reconnection. The drain timeout is the hard backstop after which connections close, but this is protocol-compliant — Kafka clients are required to handle connection loss, and the idempotent producer's deduplication mechanism is designed to survive exactly this scenario.
References:
- KIP-360 documents how the broker retains producer state and how idempotent producers can safely bump their epoch when broker state is lost — the key point being that the PID is retained client-side across reconnections.
- The Idempotent Producer design doc confirms that deduplication is based on (PID, epoch, sequence) per partition, not per connection: "The broker maintains in memory the sequence numbers it receives for each topic partition from every PID."
- KIP-854 covers broker-side producer state expiry — the state is retained well beyond any reasonable drain timeout.
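As a toy illustration of the deduplication these references describe, a broker-side partition can be modelled as tracking the highest sequence per (PID, epoch); the class and method names here are invented for the sketch, not Kafka's actual code:

```python
class PartitionProducerState:
    """Toy model of broker-side idempotent-producer deduplication for one
    partition. A retried request after reconnection carries the same
    (pid, epoch, seq) and is acknowledged without being re-appended."""

    def __init__(self):
        self.last_seq = {}  # (pid, epoch) -> highest sequence appended
        self.log = []

    def append(self, pid, epoch, seq, record):
        last = self.last_seq.get((pid, epoch), -1)
        if seq <= last:
            return "duplicate"        # retry after reconnect: ack, don't re-append
        if seq != last + 1:
            return "out_of_sequence"  # gap: client sees OutOfOrderSequence
        self.log.append(record)
        self.last_seq[(pid, epoch)] = seq
        return "appended"

# A connection drop does not reset (pid, epoch, seq) on the client, so the
# retry after reconnecting is detected as a duplicate rather than appended twice.
```

This is the mechanism the comment relies on: deduplication is keyed per partition on the producer's identity and sequence, not on the connection that carried the request.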
The lifecycle model actually improves the posture for idempotent producers compared to the status quo. Today, any configuration change requires a full process restart, dropping every connection across every virtual cluster. With per-cluster lifecycle and reload, only the affected cluster's connections drain — the blast radius shrinks from "all clusters" to "one cluster" and planned restarts become less frequent.
I think the proposal goes far enough on draining. The lifecycle model defines that draining exists as a state and its high-level semantics (reject new connections, reject new requests, allow in-flight to complete, timeout as backstop). The detailed mechanics — specific error codes for rejected requests, interaction with pipelining, TCP-level behaviour — are implementation concerns that don't change the lifecycle state model and are better resolved in the reload proposal where draining becomes a frequent operation rather than a rare shutdown event.
> Some configuration changes will likely always require draining — for example, changes to the upstream cluster identity or TLS configuration that invalidate existing connections. The optimisation is about identifying changes where draining can be safely skipped, not eliminating it.

> ### Proxy-level lifecycle
You've defined how proxy startup works with respect to the VC state machines. It's not really clear to me whether we need a proposal about the rest, unless you think that state is also exposed via metrics/management etc.
Yes — proxy-level state would also be exposed via metrics and management endpoints, which is the argument for capturing it as future work. The per-cluster metrics defined in this proposal are one layer; a proxy-level lifecycle would add an aggregate layer above them covering port binding, process startup/shutdown sequencing, and overall proxy health. Defining it now felt premature, but it's a natural next step once the per-cluster model is established.
The future enhancements section flags proxy-level lifecycle as a problem space that exists, not as work we're committed to. Whether proxy-level state needs its own formal model, observable via metrics/management endpoints, or is simply emergent from the per-VC lifecycles plus standard process management is an open question — and one that might not need answering.
Every proxy has some form of process-level lifecycle (HAProxy expresses it through old/new process coexistence during reload, nginx through worker generations, httpd through child worker replacement), but none of them formalise it as a state machine — it's just how startup and shutdown work. Kroxylicious may end up the same way: the proxy starts VCs, manages their lifecycles, and exits. If that's sufficient, no proposal is needed.
I've reworded the section to make it clearer that these are genuine open questions without obvious answers (port binding significance, health aggregation across deployment models) rather than planned future work.
@tombentley similar question here, are we happy?
- Fix startup wording: clusters that fail to start never become available (they cannot be "taken down" if they never came up)
- Expand shutdown description to note Kafka protocol resilience
- Replace "not modelled" with "notional" for clarity
- Disambiguate "operator" → "Kubernetes operator"
- Remove internal Java representation section — implementation detail better suited to PR review, not design proposal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Adds a paragraph explaining the value of draining (reducing unnecessary errors during planned shutdown) and its ecosystem context — brokers drain before stopping and operators expect the same. Makes explicit that the drain timeout is the hard backstop and that forced closure after timeout remains protocol-compliant. In response to PR kroxylicious#89 review feedback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Virtual clusters are entities (in the Domain-Driven Design sense) whose identifier is their name. The state machine tracks the lifecycle of the entity, not a configuration instance — a reload is a transition the entity passes through, not a replacement of it.

Consequences made explicit:
- stopped means permanently removed from configuration, not just idle
- reload transitions happen on the same entity; recovery transition covers any retry regardless of what configuration is supplied
- renaming a cluster is a destructive operation; documentation should warn users of this

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Replace vague "implementation concern" deferral with concrete commitments:
- Management endpoint must expose per-cluster state and failure reason
- Metrics must capture current state, time in state, and transition count (with suggested names following Prometheus/Micrometer conventions)
- Proposal is intentionally non-exhaustive; implementations may add more

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The policy applies whenever clusters initialise — on first startup or during reload — so startupPolicy was too narrow a name. Renamed to partialInitialisationPolicy to reflect that it governs the proxy's behaviour when initialisation is partial (some clusters failed, some succeeded). Values changed from fail-fast/best-effort to serve-none/serve-others, which describe the outcome from the operator's perspective rather than the mechanism. In response to PR kroxylicious#89 review feedback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
…oxylicious#89

Circuit-breaking: note under the 'runtime health as lifecycle state' rejected alternative that circuit-breaking is a manifestation of runtime health concerns — the same reasoning that excludes runtime health from the lifecycle model excludes circuit-breaking with it.

In-flight: clarify that the proxy takes the same view of in-flight as the Kafka client (sent but not yet acknowledged). Draining rejects both new connections and new requests on existing connections — only existing in-flight requests are completed, ensuring the cluster quiesces.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The proxy cannot replicate broker clean shutdown (partition leader migration before connection close), so connection drops during draining cause idempotent producers to lose session state on reconnect. This is an existing proxy limitation; a reload mechanism building on this lifecycle model can improve the situation by confining disruption to a single virtual cluster rather than a full process restart. The proposal now notes this explicitly and directs readers to consider analogous broker crash/netsplit scenarios, which clients must already handle. Users requiring exactly-once guarantees across reconnections should use transactional producers. In response to PR kroxylicious#89 review feedback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Force-pushed 2c59c54 to c22b1c8
tombentley
left a comment
Thanks @SamBarker, I think this is looking pretty good. I had a question about the configuration API for the termination behaviour.
> ```yaml
> proxy:
>   partialInitialisationPolicy: serve-none # default — any cluster failure prevents serving traffic
> ```
I'm on-board with the broad concept, but:
- I don't find this name to be very intuitive
- Is there an equivalent thing which defines how the proxy handles failures during reload? I don't think we'd really need a separate config for that (or did you have a use case in mind?)

This really seems to be about specifying the user's tolerance for virtual clusters entering/remaining in the failed state. Currently the state machine diagram shows a retry loop, but the way you're describing it here, that loop is, in actual fact, the Kubernetes pod crash loop. And therefore that control loop does not exist when the proxy is not running in Kube.

Perhaps we should borrow some ideas from how Kubernetes allows pods to fail, but apply them directly in the proxy (whether or not it happens to be running in Kube). Specifically, some number of retries, a retry delay ± a backoff. And implicitly we terminate when we run out of retries for any VC. (You can configure the number of retries as Integer.MAX_VALUE if you don't want to terminate.) We can add some other policy than terminate later on if it's needed.
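A minimal sketch of that retry policy, with the retry count, delay, and backoff factor as parameters and terminate-on-exhaustion as the implicit outcome (all names here are illustrative, nothing is committed API):

```python
def retry_failed_cluster(start, max_retries, initial_delay_s, backoff_factor, sleep):
    """Attempt to start a failed VC, retrying with exponential backoff.
    Returns True on success; False once retries are exhausted, at which
    point the proxy would terminate (the only policy proposed so far).
    `start` and `sleep` are injected so the sketch is testable."""
    delay = initial_delay_s
    for attempt in range(max_retries + 1):
        if start():
            return True
        if attempt < max_retries:
            sleep(delay)
            delay *= backoff_factor  # back off before the next attempt
    return False  # out of retries -> terminate
```

Setting `max_retries` to a very large value (the Integer.MAX_VALUE idea above) makes the loop effectively retry forever.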
I can see where you're coming from. I had hoped that initialisation would be read as VC initialisation rather than proxy startup.
How about restructuring as:
```yaml
proxy:
  onVirtualClusterFailure:
    serve: none # default
    # serve: remaining # keep serving healthy clusters
```

onVirtualClusterFailure avoids ambiguity with target clusters. The values describe the proxy-level consequence rather than the individual cluster's fate.
I think retry is orthogonal to this policy. serve answers "what about the rest of the proxy?" while retry answers "what about the failed cluster?" They compose independently. By making onVirtualClusterFailure a structured block rather than a flat string, new keys can be added alongside serve in future proposals without breaking existing configs. For example:
```yaml
proxy:
  onVirtualClusterFailure:
    serve: remaining
    # future extensions — new keys, existing configs unaffected:
    # retry:
    #   maxRetries: 3
    #   backoff: 5s
    # escalate:
    #   url: http://example.slack.com/
    #   messageFormat: "%s failed with %s"
    #   args: [VIRTUAL_CLUSTER, EXCEPTION_MESSAGE]
```

This proposal defines serve with its two values. The rejected alternative on automatic retry still applies — we're not committing to implementing retry here, but the config structure leaves room for it without a breaking change.
> 1. All `serving` clusters transition to `draining`.
> 2. All `failed` clusters transition directly to `stopped`.
> 3. New connections are rejected for draining clusters. New requests on existing connections are also rejected.
> 4. For each existing connection, the proxy waits for in-flight requests to complete, up to a configurable drain timeout. The proxy takes the same view of in-flight as the Kafka client: a request is in-flight from the moment the client sends it until the client receives a response.
The proxy cannot know when/whether the client receives a response that's been forwarded to it (at least not without inferring that fact from the receipt of another request, but that defeats the point).
In any case, I just don't see that this can really be completely solved by the proxy in its current shape. To solve it properly I think we'd need to "virtualise the brokers". That would allow the proxy to leverage Kafka's own technique for shutting down brokers cleanly: Tell the client that this broker is no longer the leader or coordinator for anything, so that all the clients update their metadata to find the new leader/coordinator. Doing that would require a proper cluster-of-proxies, which is not something we have today.
So I think what you're describing here is not perfect, but it's as good as we're likely to get for now, and we shouldn't block this proposal while we figure out a more perfect solution.
Agreed on both points. "As best it can" now hedges the in-flight definition — the proxy observes its own reads and writes, not what the client has actually sent or received.
And yes, virtualising the brokers would let the proxy leverage Kafka's own clean shutdown mechanism (leader migration via metadata updates). That would be a really compelling capability if we can figure out how to make it work. For now, best-effort draining with a timeout backstop is as good as we can get without it.
The entity model is well-defined, and the state machine is clean and practical. The separation of runtime health from lifecycle state is the right call. One minor suggestion left on drainTimeout scoping, but nothing blocking.
cc: @Uzziee
> ```yaml
> proxy:
>   drainTimeout: 60s # default TBD
> ```
Since virtual clusters are modelled as independent entities with their own lifecycles, should drainTimeout be configured at the virtual cluster level rather than the root proxy: level? Different VCs may proxy to Kafka clusters with very different timeout characteristics (e.g. max.poll.interval.ms, producer acks=all latency), so a single global drain timeout could be too aggressive for some and too lenient for others.
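Sketching the suggestion, a per-cluster override with a proxy-level fallback might look like this; the key names and layout are hypothetical, not part of the proposal:

```yaml
# Hypothetical shape only -- key names are illustrative.
virtualClusters:
  - name: payments        # acks=all producers, latency-sensitive
    drainTimeout: 30s
  - name: analytics       # long-polling consumers, generous client timeouts
    drainTimeout: 5s
proxy:
  drainTimeout: 60s       # fallback for clusters that don't set their own
```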
Should draining be opt-out? Kafka clients are designed to handle broker disconnections — they retry, reconnect, and rediscover leaders automatically. A proxy restart without switchover looks like a broker disconnect to clients. Draining primarily adds value during switchover scenarios where in-flight requests must complete on the old path before traffic moves to a new one. For simpler cases like reloading authz rules or config changes that don't affect connectivity, the drain phase adds unnecessary latency. Consider supporting drainTimeout: 0s (or an explicit drain: false) to skip draining entirely and transition directly to stopped. This also aligns with making drainTimeout per-VC — some VCs may never need draining.
cc: @Uzziee
Good suggestion on per-VC drain timeouts — the proposal now does exactly that. Different clusters can have different workload profiles, and per-cluster config also provides the right foundation for reload where only the affected cluster drains.
On making draining opt-out: draining is best-effort rather than a guarantee. It operates within the Netty shutdown window, which is the hard backstop — if a drain exceeds it, Netty force-closes everything. So the cost of having draining enabled is bounded: at worst it uses time that Netty was going to wait anyway. Making it opt-out adds configuration surface without a clear benefit, since disabling it just means connections close immediately rather than getting a chance to complete gracefully.
The proposal also now acknowledges that some operations (notably long-polling consumers) will routinely exceed any reasonable drain timeout. Synthesising early responses to unblock these is a possible future enhancement, but would require its own proposal as it introduces a new signal to the filters.
Thanks for accepting the per-VC drainTimeout.
On the other point, my reading of the updated design is that drainTimeout must be ≤ network.proxy.shutdownTimeout (default 15s), since Netty's graceful shutdown is the hard outer boundary. If so, the default drainTimeout of 60s exceeds that bound, which gives the misleading impression that drain can run longer than Netty allows. The implementation should log a warning in this case, and the default should be 15s.
The proposal should clarify the relationship between drainTimeout and shutdownTimeout. As I understand it, drainTimeout is the lower bound (how long we attempt to let in-flight requests complete) and shutdownTimeout is the upper bound (Netty's hard stop). This also means drainTimeout: 0s is a natural opt-out — skip straight to Netty's shutdown, no backpressure or drain phase needed. This framing would help implementors understand that precise in-flight tracking isn't necessary — drain is best-effort within the window, and Netty force-closes at the upper bound regardless.
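That relationship can be stated in a few lines; this clamp-and-warn behaviour is the reviewer's suggestion, not settled design, and the function name is invented for the sketch:

```python
def effective_drain_window(drain_timeout_s, shutdown_timeout_s, warn):
    """Clamp the drain window to the graceful-shutdown bound. Draining is
    best-effort within this window; the shutdown timeout force-closes
    everything at the upper bound regardless."""
    if drain_timeout_s > shutdown_timeout_s:
        warn(f"drainTimeout {drain_timeout_s}s exceeds shutdownTimeout "
             f"{shutdown_timeout_s}s; draining is capped at the shutdown bound")
        return shutdown_timeout_s
    return drain_timeout_s
```

Under this framing `drainTimeout: 0s` yields an empty window, i.e. the natural opt-out: no drain phase, straight to shutdown.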
This introduces a proposal index (README.md) for the proposals directory to solve the problem of colliding proposal numbers. Currently, multiple open PRs have chosen proposal numbers that conflict with already-merged proposals. This creates confusion and makes it difficult to maintain stable identifiers for proposals - which is essential for productive discussion on the mailing list.

The new process:
1. Authors create proposals with temporary filenames (nnn-*)
2. A separate PR to this README allocates the next sequential number
3. Authors rename their proposal once the number is allocated
4. The index is updated again when proposals are accepted

This ensures:
- Proposal numbers are allocated chronologically
- No number collisions occur
- Each proposal gets a stable identifier before mailing list discussion
- Clear visibility of open vs. accepted proposals

Current collisions resolved by this index:
- PR kroxylicious#70, kroxylicious#82, kroxylicious#83, kroxylicious#88, kroxylicious#89 all have numbers conflicting with accepted proposals and have been allocated new sequential numbers (016-024)

Assisted-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Introduce proposal 014 defining a lifecycle state machine for virtual clusters: initializing, degraded, healthy, draining, failed, and stopped. This provides a foundation for resilient startup (best-effort mode), graceful shutdown with drain timeouts, runtime health distinction (degraded vs healthy), and configuration reload via the hot reload proposal (012). Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
014 and 015 are already claimed by other proposals.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
…ve tenancy motivation

- Replace healthy/degraded states with single 'accepting' state that makes no health claims — lifecycle tracks what the proxy is doing, not runtime health
- Reframe motivation around failure domains rather than multi-tenancy
- Replace ASCII state diagram with excalidraw PNG
- Add 'Runtime health as lifecycle state' rejected alternative explaining why health is orthogonal to lifecycle

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Health criteria vary by deployment and may become per-destination once request-level routing is in play. Defining a health model prematurely would constrain future design without immediate value.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
- Fix startup wording: clusters that fail to start never become available (they cannot be "taken down" if they never came up)
- Expand shutdown description to note Kafka protocol resilience
- Replace "not modelled" with "notional" for clarity
- Disambiguate "operator" → "Kubernetes operator"
- Remove internal Java representation section — implementation detail better suited to PR review, not design proposal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Adds a paragraph explaining the value of draining (reducing unnecessary errors during planned shutdown) and its ecosystem context — brokers drain before stopping and operators expect the same. Makes explicit that the drain timeout is the hard backstop and that forced closure after timeout remains protocol-compliant.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Virtual clusters are entities (in the Domain-Driven Design sense) whose identifier is their name. The state machine tracks the lifecycle of the entity, not a configuration instance — a reload is a transition the entity passes through, not a replacement of it.

Consequences made explicit:
- stopped means permanently removed from configuration, not just idle
- reload transitions happen on the same entity; recovery transition covers any retry regardless of what configuration is supplied
- renaming a cluster is a destructive operation; documentation should warn users of this

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
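The entity-per-name state machine can be sketched as a Java enum keyed by the cluster's name. State names follow the proposal; the transition table is an illustrative reading of it (stopped terminal, reload routing draining → initializing, recovery from failed), not a committed API.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Sketch of the lifecycle state machine for a virtual cluster entity.
// The entity's identity (its name) persists across all transitions.
enum VirtualClusterState {
    INITIALIZING, SERVING, DRAINING, FAILED, STOPPED;

    private static final Map<VirtualClusterState, Set<VirtualClusterState>> ALLOWED = Map.of(
            INITIALIZING, EnumSet.of(SERVING, FAILED),
            SERVING, EnumSet.of(DRAINING, FAILED),
            // reload routes draining -> initializing on the same entity;
            // stopped means the cluster was removed from configuration
            DRAINING, EnumSet.of(INITIALIZING, STOPPED),
            // the recovery transition covers any retry, regardless of configuration
            FAILED, EnumSet.of(INITIALIZING, STOPPED),
            STOPPED, EnumSet.noneOf(VirtualClusterState.class)); // terminal

    boolean canTransitionTo(VirtualClusterState next) {
        return ALLOWED.get(this).contains(next);
    }
}
```

Under this sketch, a reload is `SERVING → DRAINING → INITIALIZING` on the same entity, and no path re-enters the machine from `STOPPED`.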
Replace vague "implementation concern" deferral with concrete commitments:
- Management endpoint must expose per-cluster state and failure reason
- Metrics must capture current state, time in state, and transition count (with suggested names following Prometheus/Micrometer conventions)
- Proposal is intentionally non-exhaustive; implementations may add more

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
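The three metric categories above might surface in Prometheus exposition format along these lines. The metric and label names here are hypothetical illustrations of the convention, not the names the proposal commits to.

```
# Hypothetical metric names following Prometheus conventions.
# Current state as a one-hot gauge per (cluster, state) pair:
kroxylicious_virtual_cluster_state{cluster="demo",state="serving"} 1
# Time spent in the current state:
kroxylicious_virtual_cluster_state_seconds{cluster="demo"} 42.0
# Count of lifecycle transitions since process start:
kroxylicious_virtual_cluster_state_transitions_total{cluster="demo"} 3
```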
The policy applies whenever clusters initialise — on first startup or during reload — so startupPolicy was too narrow a name. Renamed to partialInitialisationPolicy to reflect that it governs the proxy's behaviour when initialisation is partial (some clusters failed, some succeeded). Values changed from fail-fast/best-effort to serve-none/serve-others, which describe the outcome from the operator's perspective rather than the mechanism.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
…oxylicious#89

Circuit-breaking: note under the 'runtime health as lifecycle state' rejected alternative that circuit-breaking is a manifestation of runtime health concerns — the same reasoning that excludes runtime health from the lifecycle model excludes circuit-breaking with it.

In-flight: clarify that the proxy takes the same view of in-flight as the Kafka client (sent but not yet acknowledged). Draining rejects both new connections and new requests on existing connections — only existing in-flight requests are completed, ensuring the cluster quiesces.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
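The quiescing behaviour described above can be sketched as a per-connection tracker. "In-flight" here means requests the proxy has forwarded but not yet answered — the proxy's own view, which may lag the client's. The class and method names are illustrative, and a production version would need to close the race between the draining check and the increment.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of best-effort in-flight tracking during draining.
final class ConnectionDrainTracker {
    private final AtomicInteger inFlight = new AtomicInteger();
    private volatile boolean draining = false;

    // Called when a request arrives; returns false if it must be rejected.
    boolean tryAcceptRequest() {
        if (draining) {
            return false; // draining rejects new requests on existing connections
        }
        inFlight.incrementAndGet();
        return true;
    }

    // Called once the upstream response has been written back to the client.
    void requestCompleted() {
        inFlight.decrementAndGet();
    }

    void startDraining() {
        draining = true;
    }

    // The connection has quiesced once no forwarded requests remain unanswered.
    boolean isQuiesced() {
        return draining && inFlight.get() == 0;
    }
}
```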
'Accepting' implies the proxy is merely open to new connections. 'Serving' better captures what the proxy is actually doing in this state — actively handling traffic for the cluster, not just admitting connections. This also aligns terminology with the partialInitialisationPolicy values (serve-none/serve-others).

Also update compatibility section to reference the new policy name and values, and fix a stale description that referred to "connections completing" rather than "in-flight requests completing".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The proxy cannot replicate broker clean shutdown (partition leader migration before connection close), so connection drops during draining cause idempotent producers to lose session state on reconnect. This is an existing proxy limitation; a reload mechanism building on this lifecycle model can improve the situation by confining disruption to a single virtual cluster rather than a full process restart.

The proposal now notes this explicitly and directs readers to consider analogous broker crash/netsplit scenarios, which clients must already handle. Users requiring exactly-once guarantees across reconnections should use transactional producers.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
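For the recommendation above, the relevant client-side settings are standard Kafka producer configuration. The `transactional.id` value is illustrative; the other keys are real producer configuration properties.

```properties
# Opting a producer into transactions; the id must be stable across
# restarts of this producer instance for zombie fencing to work.
transactional.id=my-app-producer-1
# Idempotence is required (and implied) when transactional.id is set.
enable.idempotence=true
# acks=all is required for idempotent delivery guarantees.
acks=all
```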
The future enhancements section now acknowledges that proxy-level lifecycle involves genuinely unsettled questions (port binding failure significance, health aggregation across deployment models) rather than presenting it as committed future work.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The previous text incorrectly stated that connection drops during draining cause idempotent producers to lose their session state. In fact, producers retain their PID and sequence numbers in client memory across reconnections — the broker deduplicates retries using (PID, epoch, sequence) per partition, not per connection. A new PID is only assigned on producer process restart.

References: KIP-360, Idempotent Producer design doc, KIP-854.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Add liveness probes alongside readiness probes in the observability discussion. Hedge in-flight definition to acknowledge the proxy observes its own reads/writes, not the client's.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Restructure as a block to separate orthogonal concerns: `serve` controls proxy-level behaviour (none vs remaining), leaving room for future extensions like retry and escalation as additional keys. Rename avoids ambiguity with target clusters and better describes the trigger (a virtual cluster failure) rather than the mechanism.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The serve policy now hangs off onVirtualClusterStopped rather than onVirtualClusterFailure. This cleanly separates the blast radius decision (what about the other clusters?) from recovery policies (what about the failed cluster?) that a future reload proposal will define under onVirtualClusterFailed.

Renamed serve values from none|remaining to none|successful — "remaining" implied a prior known-good state which is misleading at startup where nothing was serving yet.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
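The restructured policy block might look as follows. The exact placement of the key in the proxy configuration is an assumption of this sketch, and whether `none` remains the default under the new naming is inferred from the earlier serve-none default.

```yaml
# Illustrative shape only — key placement is an assumption.
onVirtualClusterStopped:
  serve: successful   # or "none" (assumed default, matching current all-or-nothing behaviour)
  # future keys (e.g. retry, escalation policies) could sit alongside serve
```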
Draining now operates within the Netty shutdown window rather than as an independent timeout. Per-VC drain timeouts replace the proxy-level drainTimeout to support different workload profiles and provide the right foundation for reload where only the affected cluster drains.

Acknowledges inherent draining limitations with long-polling consumers and notes that synthesising early responses would require a new proposal (filter SPI change).

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
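A per-VC drainTimeout might be expressed like this; cluster names and key placement are illustrative, not part of the proposal text.

```yaml
# Illustrative shape only — each cluster drains on its own budget,
# and every budget must still fit within Netty's shutdown window.
virtualClusters:
  - name: latency-sensitive
    drainTimeout: 5s    # short requests quiesce quickly
  - name: batch-workload
    drainTimeout: 15s   # larger window for long-running requests
```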
Aligns the example with Netty's default shutdown timeout so the documented default sits comfortably within the hard ceiling.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
ebb7fc6 to 2210a12
Summary
Introduces proposal 016 defining a lifecycle state model for virtual clusters. Virtual clusters are modelled as entities in the Domain-Driven Design sense — each has a persistent identity (its name) that carries through state changes. Reload is a transition the entity passes through, not a replacement of it.
The model defines five states: initializing, serving, draining, failed, and stopped.
Key Design Decisions
- `serving`, not `accepting` — the state name reflects that the proxy is actively serving traffic, not merely open to new connections
- `serving` is not split into healthy/degraded; runtime health is orthogonal to lifecycle and better addressed by readiness probes and metrics. Circuit-breaking is noted as a manifestation of the same concern.
- `partialInitialisationPolicy: serve-none | serve-others` — governs proxy behaviour when cluster initialisation fails, whether on startup or reload. Default is `serve-none` (matches current behaviour).
- `stopped` is terminal — reload routes through `draining` → `initializing`, never through `stopped`

Relationship to Other Proposals
This provides a foundation for 012 - Hot Reload, which can define reload transitions on this state model. Note: the Hot Reload proposal's "Cluster modification: remove + add" framing describes internal runtime mechanics; from the lifecycle model's perspective the named entity persists through modification.
Test plan