DNM: complete cluster autoscaling branch, for CI and reference #36738

Draft
aljoscha wants to merge 17 commits into MaterializeInc:main from aljoscha:adapter-cluster-autoscaling

@aljoscha aljoscha commented May 26, 2026

Contributor

Complete branch for SQL-316. The intention is to open PRs for individual chunks of work from this branch, but I am keeping this PR as a reference for reviewers and to run CI on the full feature.

Contributes to SQL-316

@aljoscha aljoscha force-pushed the adapter-cluster-autoscaling branch 5 times, most recently from d9d177f to 69060d0 Compare June 1, 2026 12:07
@aljoscha aljoscha force-pushed the adapter-cluster-autoscaling branch 14 times, most recently from d5de464 to 135056a Compare June 10, 2026 11:43
@aljoscha aljoscha force-pushed the adapter-cluster-autoscaling branch 8 times, most recently from 7403ecf to dae3f3e Compare June 15, 2026 19:31
aljoscha and others added 3 commits June 15, 2026 23:17
…ss tracker

Companion to the cluster autoscaling design doc: a staged 7-PR plan and live
progress tracker, with an operating protocol for implementation sessions, the
controller boundary and gating model, codebase anchors, and per-PR scope and
checklists.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add the additive, behaviourally-inert durable state the cluster
controller will need, with one catalog migration (v85->v86) defaulting
it for existing clusters. No new field is read, so this is dark.

On the managed cluster config: `auto_scaling_strategy` (user policy),
`reconfiguration`, and `burst` (in-flight runtime records), all
`Option`, `None` by default.

On the managed replica location: collapse the single `availability_zone`
user-pin into an `availability_zones: Vec<String>` recording the zones
the replica was provisioned under -- a managed cluster's AVAILABILITY
ZONES pool, or an unmanaged replica's pin as a zero-/one-element list.
The controller needs this durable to tell realized- from target-shape
replicas by config shape (including an AVAILABILITY ZONES divergence). A
single-AZ field and a separate provisioned-list field would be mutually
exclusive and both collapse to a list at the orchestrator, so one list
is the honest shape. The migration backfills managed-cluster replicas
from their cluster's `availability_zones` and unmanaged-cluster replicas
from their pin.
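
The backfill rule above can be sketched as a small pure function. This is a minimal illustration, not the migration code: `OldReplica` and `backfill_availability_zones` are hypothetical names, and the real v85->v86 migration operates on catalog protos rather than plain structs.

```rust
/// Hypothetical stand-in for the pre-migration replica record.
pub struct OldReplica {
    /// The single user pin (`availability_zone`), if any.
    pub availability_zone: Option<String>,
}

/// Backfill the new `availability_zones` list for one replica.
///
/// `managed_cluster_azs` is `Some(pool)` when the owning cluster is managed
/// and `None` when it is unmanaged (an assumption of this sketch).
pub fn backfill_availability_zones(
    replica: &OldReplica,
    managed_cluster_azs: Option<&[String]>,
) -> Vec<String> {
    match managed_cluster_azs {
        // Managed: record the cluster's AVAILABILITY ZONES pool.
        Some(pool) => pool.to_vec(),
        // Unmanaged: the pin becomes a zero- or one-element list.
        None => replica.availability_zone.iter().cloned().collect(),
    }
}
```

An empty result means "no constraint" in both cases, which matches how the later commits describe the orchestrator treating an empty list.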

Concretization stays inert: it re-derives the managed pool from the
cluster's current config and reads the durable list as the pin only for
unmanaged clusters, so the new managed-replica list is written but not
yet read. The in-memory `ManagedReplicaAvailabilityZones` enum is
unchanged here -- it is the discriminator the persistence path uses, and
can only be simplified once this durable field stores the list
unconditionally, which it now does.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`ManagedReplicaAvailabilityZones` distinguished a managed cluster's
`AVAILABILITY ZONES` pool (`FromCluster`) from an unmanaged cluster's
single user pin (`FromReplica`), but both collapse to a list of
acceptable zones at the orchestrator, the distinction is recoverable
from whether the owning cluster is managed, and the durable replica
record now stores a single `availability_zones` list. Replace the enum
on `ManagedReplicaLocation` with a bare `Vec<String>` (empty = no
constraint).

Now that the durable field stores the list unconditionally, the
in-memory->durable `From` is a passthrough and concretization fills the
list directly -- re-deriving the managed pool from the cluster config,
reading the durable list as the pin for unmanaged clusters -- so this
stays behaviour-preserving. The orchestrator maps an empty list to "no
constraint"; the convert-to-managed check iterates the pin(s); and the
`mz_cluster_replicas.availability_zone` column is unchanged -- it still
surfaces only an unmanaged replica's single pin, now derived from the
list plus the cluster's managed-ness, with a comment recording what a
future plural `text list` column would entail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
aljoscha and others added 14 commits June 16, 2026 00:14
…iring

Stand up the cluster controller end to end with only the implicit baseline
strategy, so the reconcile loop runs but is a no-op for steady-state clusters.
Establish the task boundary, the input/decision types, the compare-and-append
apply path, and the master gate -- all dark.

New pure crate `mz-cluster-controller` (no adapter/catalog dep, no new
third-party license):

* `ClusterControllerCtx` -- the strategy-agnostic pull/apply seam. Reads are
  batched and pulled on demand (`managed_cluster_ids`, `cluster_states` with a
  latched `now`, and a `collections_hydrated_on_replicas` method that exists for
  the strategies that follow but goes unused here); the single `apply` transacts
  a tick's batch under a compare-and-append guard. The controller depends on
  exactly this trait, which is what makes it testable against a fake impl and
  extractable later without touching controller code.
* The pure `Strategy` trait (`update_state` / `desired_replicas`) and the
  implicit `BaselineStrategy`, which desires `replication_factor` replicas at the
  realized cluster shape. Baseline-only means desired equals realized, so a
  steady managed cluster reconciles to no decisions.
* The reconcile kernel: phase 1 unions every strategy's `update_state` per
  cluster and applies under the guard (a rejected cluster is skipped this tick);
  phase 2 re-reads, unions `desired_replicas` (multiset union is max-per-shape,
  not sum, so a replica survives iff some strategy desires its shape), matches by
  `ReplicaShape` against the actual replicas, and emits the creates and drops
  that close the gap, with per-create strategy attribution. Phase 2 reuses the
  phase-1 read when no state was written, keeping the barrier only for the
  writing strategies that follow.
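
A rough sketch of the phase-2 arithmetic described in the kernel bullet: union the per-strategy desires with max-per-shape, then diff against the actual replicas. The types here are simplified assumptions (the real `ReplicaShape` also carries logging config, and the real kernel tracks replica ids and per-create attribution); `union_desired` and `diff` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Simplified, hypothetical replica shape.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ReplicaShape {
    pub size: String,
    pub availability_zones: Vec<String>,
}

/// A multiset of shapes: shape -> number of replicas of that shape.
pub type Desired = BTreeMap<ReplicaShape, usize>;

/// Union per-strategy desires, taking the max count per shape (not the sum).
pub fn union_desired(per_strategy: &[Desired]) -> Desired {
    let mut out = Desired::new();
    for desired in per_strategy {
        for (shape, count) in desired {
            let slot = out.entry(shape.clone()).or_insert(0);
            *slot = (*slot).max(*count);
        }
    }
    out
}

/// Diff desired against actual; returns (creates, drops) as counts per shape.
pub fn diff(desired: &Desired, actual: &Desired) -> (Desired, Desired) {
    let mut creates = Desired::new();
    let mut drops = Desired::new();
    for (shape, &want) in desired {
        let have = actual.get(shape).copied().unwrap_or(0);
        if want > have {
            creates.insert(shape.clone(), want - have);
        }
    }
    for (shape, &have) in actual {
        let want = desired.get(shape).copied().unwrap_or(0);
        if have > want {
            drops.insert(shape.clone(), have - want);
        }
    }
    (creates, drops)
}
```

With max-per-shape, two strategies that both want the realized shape do not double the replica count, and a replica is dropped only when no strategy desires its shape.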

Every decision -- the `UpdateClusterState` state writes and the create/drop
batch alike -- carries the `ExpectedClusterState` it was diffed against. The
apply path re-reads each target cluster's durable config and records and rejects
the whole batch on any mismatch, so a stale create or drop can never reshape the
replica set against a config a concurrent `ALTER` has since established; the
controller recomputes on the next tick.
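
The all-or-nothing guard can be illustrated with a small sketch. Everything here is an assumption for illustration: the witness is reduced to a single `config_version` counter (the real `ExpectedClusterState` covers the durable config and records), and `Decision` and `apply` are hypothetical stand-ins for the seam's types.

```rust
use std::collections::BTreeMap;

/// Hypothetical witness of the durable state a decision was diffed against.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ExpectedClusterState {
    pub config_version: u64,
}

/// One decision in a tick's batch, tagged with its witness.
pub struct Decision {
    pub cluster_id: u32,
    pub expected: ExpectedClusterState,
    pub action: String,
}

/// Apply a batch all-or-nothing: reject it if any witness is stale.
/// Returns the actions to transact, or `Err` with the first stale cluster id.
pub fn apply(
    durable: &BTreeMap<u32, ExpectedClusterState>,
    batch: &[Decision],
) -> Result<Vec<String>, u32> {
    for d in batch {
        // Re-read the target cluster's durable state and compare.
        if durable.get(&d.cluster_id) != Some(&d.expected) {
            return Err(d.cluster_id);
        }
    }
    Ok(batch.iter().map(|d| d.action.clone()).collect())
}
```

On `Err` the caller drops the whole batch and recomputes on the next tick, so a decision diffed against a superseded config is never transacted.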

Adapter driver `coord/cluster_controller.rs` is the other half of the seam. It
runs the controller as a separate task whose `CoordCtx` marshals each batched
pull/apply to the coordinator loop over `internal_cmd_tx` plus a oneshot
(`Message::ClusterControllerRequest`), because the catalog and live signals are
reachable only from that loop. On a held guard it builds `Op::UpdateClusterConfig`
(cut-over / record write), `Op::CreateClusterReplica` (reusing replica-location
concretization), and `Op::DropObjects`, transacted together. The interim
create/drop audit reason is `Manual`; the attribution-carrying controller reason
lands with the graceful strategy.

Add dyncfgs `enable_cluster_controller` (default false) and
`cluster_controller_tick_interval` (default 5s), both re-read each tick: a
runtime flip of the gate needs no restart, and the cadence is a live operational
knob. With the gate off, reads report no clusters and applies reject, keeping the
controller fully inert and every legacy path unchanged.

The frontier and read-timestamp reads the controller will also need are left to
their first consumer (the graceful and on-refresh strategies): their signatures
are dictated by that consumer, and declaring them speculatively would fix the
wrong shape and pull an unused frontier dependency into the pure crate. They land
the same pull-on-demand way as the hydration read.

Tested by boundary tests against a fake `ClusterControllerCtx` (steady no-op,
under/over-provision, wrong-shape drop+create, union max-not-sum, distinct-shape
attribution, and compare-and-append reject-and-recover against a state a
concurrent `ALTER` changed, for both the state-write and create/drop guards) and
an slt (`cluster_controller.slt`) that forces the gate on, drives the tick
interval to 5ms, waits across hundreds of ticks, and asserts a steady managed
cluster's replica ids and names are unchanged -- an assertion gate-off behaviour,
which never runs the loop, cannot satisfy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + wait-shim

Move graceful (zero-downtime) cluster reconfiguration into the cluster
controller as a pure strategy, driven by the durable `reconfiguration` record,
with hydration-aware cut-over and a durable, honored timeout. Everything lands
dark behind the `enable_cluster_controller` master gate; the legacy 3-stage
machine still runs when the gate is off.

Strategy (mz-cluster-controller). New pure `GracefulReconfigurationStrategy`,
engaged whenever the `reconfiguration` record is present. `desired_replicas`
contributes `target.replication_factor` replicas at the target shape (size,
logging, AZ list) on top of the baseline's realized set -- the hydrate-overlap.
`update_state` cuts the realized config over to the target and clears the
record once those replicas are all present and hydrated; success takes
precedence over the deadline. Past the deadline with the target not fully
hydrated it applies the record's `on_timeout`: `ROLLBACK` (the default) drops
the target set and keeps the record as a tombstone that parks the strategy;
`COMMIT` cuts over to the still-unhydrated target and clears it.
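
The precedence rules in this commit can be sketched as a single decision function. This is an illustrative sketch only: `GracefulStep` and `graceful_step` are hypothetical names, timestamps are plain milliseconds, and treating `now == deadline` as already past the deadline is an assumption.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OnTimeout {
    Rollback,
    Commit,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GracefulStep {
    /// Target not ready and deadline not reached: keep the overlap running.
    KeepWaiting,
    /// Cut the realized config over to the target and clear the record.
    CutOver,
    /// Drop the target set and keep the record as a parked tombstone.
    RollbackAndPark,
}

pub fn graceful_step(
    target_hydrated: bool,
    now_ms: u64,
    deadline_ms: u64,
    on_timeout: OnTimeout,
) -> GracefulStep {
    if target_hydrated {
        // Success takes precedence over the deadline.
        return GracefulStep::CutOver;
    }
    if now_ms < deadline_ms {
        return GracefulStep::KeepWaiting;
    }
    match on_timeout {
        OnTimeout::Rollback => GracefulStep::RollbackAndPark,
        // COMMIT cuts over to the still-unhydrated target.
        OnTimeout::Commit => GracefulStep::CutOver,
    }
}
```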

Hydration seam. New `ClusterControllerCtx::hydrated_replicas(cluster, replicas)
-> BTreeSet` ("which of these replicas have all current collections hydrated"),
the shape its only consumer needs and that the underlying controller APIs can
express. The controller pulls it on demand -- only while a reconfiguration is
in flight -- into the live-signal field `ClusterState::hydrated_replicas`
(excluded from the compare-and-append witness). The adapter driver backs it
per-replica against the compute and storage controllers, which collapse a
replica list to a single "hydrated on any" bool.
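
One way to recover a per-replica answer from an API that only reports "hydrated on any of these replicas" is to query one replica at a time. The sketch below assumes that approach; `HydratedOnAny` and `FakeController` are hypothetical stand-ins for the compute and storage controllers, and requiring both controllers to agree is an assumption about what "all current collections hydrated" means.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a controller API that collapses a replica list
/// to a single "hydrated on any of these" answer.
pub trait HydratedOnAny {
    fn hydrated_on_any(&self, replicas: &[u32]) -> bool;
}

/// Test double: remembers which replica ids count as hydrated.
pub struct FakeController(pub BTreeSet<u32>);

impl HydratedOnAny for FakeController {
    fn hydrated_on_any(&self, replicas: &[u32]) -> bool {
        replicas.iter().any(|r| self.0.contains(r))
    }
}

/// Ask about one replica at a time to get a per-replica answer.
pub fn hydrated_replicas(
    compute: &dyn HydratedOnAny,
    storage: &dyn HydratedOnAny,
    replicas: &[u32],
) -> BTreeSet<u32> {
    replicas
        .iter()
        .copied()
        // A single-element list turns "any" into "this one".
        .filter(|r| compute.hydrated_on_any(&[*r]) && storage.hydrated_on_any(&[*r]))
        .collect()
}
```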

ALTER reshape (gated). With the master gate on, a managed-cluster `ALTER` that
changes a replica's config shape (SIZE, logging, AVAILABILITY ZONES) -- or any
`ALTER` while a record is already in flight -- writes/folds the
`reconfiguration` record onto the realized config and leaves the realized shape
in place; the controller converges and cuts over. A fold overlays the `ALTER`
onto the in-flight target per dimension: a dimension the `ALTER` sets
re-targets, one left `Unchanged` keeps the in-flight target's value (seeding
`Unchanged` dimensions from the realized config would silently revert the
in-flight transition, since the realized config only advances at cut-over),
while the deadline and `on_timeout` are replaced wholesale by the latest
`ALTER`'s. Non-shape changes with no record in flight keep updating the
realized config directly. The deadline is `now + timeout` and `on_timeout` is
threaded from the existing `WITH (WAIT ...)` clause: `WAIT UNTIL READY (TIMEOUT,
ON TIMEOUT ...)` verbatim, `WAIT FOR` desugars to `ON TIMEOUT COMMIT`, and
omitting `WAIT` falls back to the `default_cluster_reconfiguration_timeout`
dyncfg and the default action. The planner's implicit `OnTimeoutAction::default()`
flips `COMMIT`->`ROLLBACK` globally -- the safe default that never silently
induces downtime by cutting over to an un-hydrated target -- and the legacy
foreground path reads the same `default()`.
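
The fold rule reads naturally as a per-dimension overlay. The sketch below is a simplified assumption covering only two shape dimensions; it is not the branch's `fold_reconfiguration_target`, and `Alter`, `Target`, `AlterRequest`, and `fold_target_sketch` are hypothetical names.

```rust
/// One dimension of an `ALTER`: set to a new value, or left alone.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Alter<T> {
    Set(T),
    Unchanged,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum OnTimeout {
    Rollback,
    Commit,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Target {
    pub size: String,
    pub availability_zones: Vec<String>,
    pub deadline_ms: u64,
    pub on_timeout: OnTimeout,
}

pub struct AlterRequest {
    pub size: Alter<String>,
    pub availability_zones: Alter<Vec<String>>,
    pub deadline_ms: u64,
    pub on_timeout: OnTimeout,
}

/// A set dimension re-targets; an unchanged one keeps the in-flight value.
fn overlay<T: Clone>(alter: &Alter<T>, in_flight: &T) -> T {
    match alter {
        Alter::Set(v) => v.clone(),
        Alter::Unchanged => in_flight.clone(),
    }
}

pub fn fold_target_sketch(in_flight: &Target, alter: &AlterRequest) -> Target {
    Target {
        size: overlay(&alter.size, &in_flight.size),
        availability_zones: overlay(&alter.availability_zones, &in_flight.availability_zones),
        // Deadline and on-timeout action come wholesale from the latest ALTER.
        deadline_ms: alter.deadline_ms,
        on_timeout: alter.on_timeout,
    }
}
```

Seeding `Unchanged` from `in_flight` rather than from the realized config is the point of the design: the realized config still holds the pre-transition values until cut-over.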

Wait-shim. New `ClusterStage::AwaitReconfiguration` polls the durable record
until it clears (done) or its deadline passes (timeout); since the strategy
still cuts over past the deadline once hydrated, the shim grants one grace
re-poll before surfacing `AlterClusterTimeout`. With the new
`enable_background_alter_cluster` dyncfg on, `ALTER` returns immediately
instead. Session disconnect no longer aborts a reconfiguration.
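
The polling rule with its single grace re-poll can be sketched as a tiny state machine. This is a hedged illustration, not the `ClusterStage::AwaitReconfiguration` code: `WaitShim`, `WaitOutcome`, and `on_poll` are hypothetical names, and each call models one poll of the durable record.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum WaitOutcome {
    /// The record cleared: the reconfiguration is done.
    Done,
    /// Deadline passed and the grace re-poll was already spent.
    Timeout,
    /// Keep polling.
    PollAgain,
}

pub struct WaitShim {
    grace_used: bool,
}

impl WaitShim {
    pub fn new() -> Self {
        WaitShim { grace_used: false }
    }

    /// Evaluate one poll of the durable `reconfiguration` record.
    pub fn on_poll(&mut self, record_present: bool, now_ms: u64, deadline_ms: u64) -> WaitOutcome {
        if !record_present {
            return WaitOutcome::Done;
        }
        if now_ms < deadline_ms {
            return WaitOutcome::PollAgain;
        }
        if !self.grace_used {
            // The strategy may still cut over past the deadline once
            // hydrated, so allow exactly one more poll.
            self.grace_used = true;
            return WaitOutcome::PollAgain;
        }
        WaitOutcome::Timeout
    }
}
```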

Audit. New `ReplicaCreateDropReason::GracefulReconfiguration` ->
`CreateOrDropClusterReplicaReasonV1::Reconfiguration`, carried on the
controller's graceful-desired replica creates; the audit proto enum is added in
place in the unshipped v86 snapshot.

Design/plan. Settle the per-`ALTER` timeout surface in the design doc and
tracker: keep `WITH (WAIT ...)` as the permanent spelling (it already carries
both the deadline and the on-timeout action) and record `on_timeout` as a
durable, controller-honored knob defaulting to `ROLLBACK`.

Tests: graceful kernel/flow cases in mz-cluster-controller (in-flight desire,
cut-over, partial hydration, timeout-vs-hydrated precedence, timeout park,
`COMMIT`- vs `ROLLBACK`-at-timeout, AZ-only shape change, full overlap then
cut-over, ALTER-back, and the `fold_reconfiguration_target` overlay), FakeCtx
seam tests that drive reconcile end-to-end past a forced deadline, and an
extended `cluster_controller.slt` asserting a background `ALTER` cuts the
realized size over and that the omitted/`COMMIT`/`ROLLBACK` spellings each drive
a record under the gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…udit lifecycle

Surface in-flight cluster reconfigurations in SQL so a background ALTER
CLUSTER is observable, and record the reconfiguration lifecycle in the audit
log. All dark: the durable reconfiguration/burst records only ever move under
the `enable_cluster_controller` master gate, so for an ordinary cluster the
new introspection reports current == target with nothing in flight.

Introspection view. New builtin materialized view
`mz_internal.mz_cluster_reconfigurations` -- one row per managed cluster,
computed in `mz_catalog_server` from the raw catalog (`mz_catalog_raw`): the
realized managed config and the durable `reconfiguration`/`burst` records
yield current vs. target size / replication factor / availability-zone list,
an in-flight flag, the active deadline, and a placeholder `burst_size` column
for the burst strategy. Indexed on `cluster_id` in `mz_catalog_server`.
Deriving it from the catalog rather than imperatively packing builtin-table
rows keeps the relation a pure function of durable state; a new builtin needs
no migration.

SHOW CLUSTERS. `mz_show_clusters` LEFT JOINs the new relation to add
`current_size`, `target_size`, and `reconfiguration_in_flight`, so a single
SHOW CLUSTERS answers "what's there now", "what is it moving to", and "is
something in flight". The view is indexed in `mz_catalog_server`, so it stays
non-temporal; the timed-out-vs-in-progress split is read from
`reconfiguration_deadline` rather than computed with `mz_now()`.

Audit lifecycle. New `EventDetails::AlterClusterReconfigurationV1` records a
started / finalized / timed-out / cancelled transition with the target shape
and, where it applies, the active deadline. It is emitted from the single
`Op::UpdateClusterConfig` durable write site, classified purely from the
before/after `reconfiguration` record and the write timestamp vs. the
deadline: a record write or re-target is `started`, a hydrated clear (or any
clear under `ROLLBACK`) is `finalized`, a `COMMIT`-on-timeout clear past the
deadline is `timed-out`, and an ALTER-back whose new target equals the
realized shape is `cancelled` -- all without adding vocabulary to the
controller seam. A clear is `timed-out` only when the record's `on_timeout`
is `COMMIT`, so a hydrated-but-late success is never mislabeled. The proto
variant is added in place to the unshipped v86 snapshot.

Tests: a `classify_reconfiguration_transition` unit test in mz-adapter; an
observability section in `cluster_controller.slt` and gate-off column
assertions in `show_clusters.slt`; and the catalog-snapshot slt/testdrive
expectations and the `mz_internal` system-catalog doc updated for the new
relation, its index, and the SHOW CLUSTERS columns.

Deferred: the ROLLBACK-timeout audit event -- it performs no config write, so
it would need a controller-seam signal or a durable tombstone stamp; the
rollback is already visible via the tombstoned record and the none-desired
replica drops.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Port the existing `ON REFRESH` cluster scheduling into the cluster
controller as a pure `OnRefreshStrategy`, so the controller framework owns
all three existing behaviours (baseline, graceful reconfiguration,
on-refresh). Everything lands dark behind the master gate: the legacy
`cluster_scheduling.rs` policy still runs when `enable_cluster_controller`
is off and the strategy is exercised only with the gate forced on in tests.

The strategy contributes one replica at the cluster's realized shape while
the cluster is inside a refresh window and nothing otherwise. The window
decision (an MV still needs a refresh, or is estimated to still need Persist
compaction) is ported verbatim from `check_refresh_policy`. A scheduled
cluster's `replication_factor` is the controller's domain, so `update_state`
normalizes it to 0 at runtime — self-healing, no migration — and the
implicit baseline contributes nothing on a scheduled cluster, leaving the
on-refresh strategy as the sole contributor there.
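
A minimal sketch of the strategy's two outputs, under heavy simplification: the real window decision is ported from `check_refresh_policy` and works on timestamps and frontiers, whereas here each MV's status is assumed to be precomputed into two booleans. `RefreshMv` and the function names are hypothetical.

```rust
/// Hypothetical, precomputed per-MV status for a scheduled cluster.
pub struct RefreshMv {
    /// The MV still needs a refresh.
    pub needs_refresh: bool,
    /// The MV is estimated to still need Persist compaction.
    pub needs_compaction: bool,
}

/// The cluster is inside a refresh window if any bound MV still needs work.
pub fn in_refresh_window(mvs: &[RefreshMv]) -> bool {
    mvs.iter().any(|mv| mv.needs_refresh || mv.needs_compaction)
}

/// One replica at the realized shape inside the window, nothing otherwise.
pub fn on_refresh_desired_replicas(mvs: &[RefreshMv]) -> usize {
    if in_refresh_window(mvs) {
        1
    } else {
        0
    }
}

/// A scheduled cluster's replication factor is normalized to 0 at runtime,
/// so the implicit baseline contributes nothing there.
pub fn normalized_replication_factor(scheduled: bool, current: u32) -> u32 {
    if scheduled {
        0
    } else {
        current
    }
}
```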

The refresh-window signals (read timestamp, compaction estimate, bound
REFRESH MV write frontiers and schedules) are pulled through a new
`refresh_window_inputs` ctx method, on demand and only for scheduled
clusters, the same pay-for-what-you-use way as hydration. The MV write
frontier is modeled as the antichain's single upper bound, which keeps the
pure crate free of a direct timely dependency while preserving the exact
frontier semantics. The cluster schedule is added to the compare-and-append
witness so a concurrent `SET (SCHEDULE = ...)` rejects an in-flight
on-refresh decision.

The legacy `check_scheduling_policies` / `handle_scheduling_decisions`
no-op when the controller is enabled, so the two never both write a
scheduled cluster's replica set; the legacy path stays for gate-off and is
removed in the final cleanup PR.

Tests: ten on-refresh kernel and seam tests in `mz-cluster-controller`
(window in/out decision, the read-ts boundary, the hydration estimate and
compaction windows, replication-factor normalization, and end-to-end
create/drop through a fake ctx), and an ON REFRESH section in
`cluster_controller.slt` with the gate forced on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, break-glass

PR 6 of the cluster-autoscaling plan: the hydration-burst capability, the last
feature PR, landing dark behind its gates.

SQL surface. New `AUTO SCALING STRATEGY = (ON HYDRATION (HYDRATION SIZE = '...',
LINGER DURATION = '...'))` cluster option at CREATE CLUSTER and ALTER CLUSTER
SET/RESET, with a dedicated `ClusterAutoScalingStrategyOptionValue` AST node, the
`AUTO`/`SCALING`/`LINGER`/`DURATION` keywords, and the parser/planner threading
it into the durable `ClusterVariantManaged.auto_scaling_strategy` field. SHOW
CREATE CLUSTER renders it. Acceptance is gated by the item-parsing
`enable_auto_scaling_strategy` feature flag (not a dyncfg) so a stored statement
re-parses at catalog rehydration. Validations reject HYDRATION SIZE equal to the
cluster SIZE and AUTO SCALING STRATEGY combined with a non-MANUAL SCHEDULE.

Strategy. New pure `HydrationBurstStrategy`: while the cluster carries an
ON HYDRATION policy, the break-glass flag is on, the cluster is On, and no
steady-state replica is hydrated, it writes a durable `burst` record and desires
one extra replica at the hydration size; it stamps the steady-hydration time,
tears the record down a linger after, re-arms if the steady set un-hydrates, and
clears a stale record whenever a burst is no longer warranted. The seam gains the
`AutoScalingPolicy` mirror on the cluster state (and the compare-and-append
witness) plus the `enable_hydration_burst` break-glass and
`default_hydration_burst_linger` config signals; the controller now probes
hydration when a burst is in flight or warranted.

Observability and audit. Burst create/drop carry a new `HydrationBurst` reason; a
new `ClusterHydrationBurstV1` started/finished lifecycle event is classified at
the cluster-config write site. Both proto variants are added in place to the
unshipped v86 snapshot. The PR-4 `mz_cluster_reconfigurations.burst_size` column
already surfaces the record.

All of this is dark: the gates default to the no-burst state and existing tests
pass unchanged.

Tests: kernel arm/linger/re-arm/teardown/break-glass cases and a fake-ctx seam
test in mz-cluster-controller; parser testdata for the new option; a burst
section in cluster_controller.slt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Flip enable_cluster_controller and enable_background_alter_cluster on for the
test harnesses only -- mzcompose get_minimal_system_parameters and the
sqllogictest binary's system-parameter defaults -- so CI exercises the
controller owning the managed-cluster replica set fleet-wide. The production
dyncfg defaults stay false: the feature still lands dark, and the real default
flip is a separate rollout commit after a prod bake.

With the gate on in CI, the legacy tests that assert behavior the controller
does not reproduce are migrated to controller behavior:

- managed_cluster.slt: a SIZE change reshapes in the background and advances the
  realized config only at cut-over, so the surviving replica churns its name
  across reshapes (r1 -> r2 -> r3).
- materialized_views.slt: the controller owns a scheduled cluster's replica set,
  holding the realized replication_factor at 0 and toggling a single replica in
  and out of the refresh window, so the scheduling shows up as ordinary
  manual-reason replica create/drops without the legacy per-policy
  scheduling-decision audit detail.
- test/cluster: graceful reconfiguration writes a durable record that resumes
  and completes across an environmentd restart, and the controller never
  creates a legacy "-pending" replica.

The async-awareness these and other tests also need (driving the controller
tick down and waiting for replica-set reconciles, which are correct with the
gate on or off) rides with the respective feature commits; this commit carries
only the changes that are incompatible with the legacy behavior, alongside the
CI flag flip that makes them apply.

A product-facing writeup of the new cluster-controller capabilities:
background graceful reconfiguration, hydration-burst autoscaling, and the
new introspection surfaces, plus the user-facing side-effect changes and
rollout gates. Every SQL/output pair was captured verbatim from a local
environmentd built from this branch; pm-showcase-replay.sh regenerates all
transcripts end-to-end (including the mid-reconfiguration restart) against
a fresh environment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…burst arming)

Records the three rough edges found while live-testing for the PM showcase
and the workshopped fixes as a new tracked stage: (8a) controller drops
audited with a new uniform 'retired' reason and on-refresh creates restored
to 'schedule'; (8b) a durable rolled_back_at stamp on the reconfiguration
record, an event-vocabulary re-carve (finalized = cut over, timed-out =
parked), and a 'state' column on mz_cluster_reconfigurations — closing
PR 4's deferred ROLLBACK-timeout item; (8c) burst arming gated on the
existence of an un-hydrated object, so empty clusters never burst.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Spec'd, workshopped, and implemented 2026-06-10; the tracker records the
settled decisions, the implementation checklist, and the dissolution of
PR 8a + PR 9 into fixup! commits against their mainline PR targets.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A ROLLBACK at the deadline previously performed no durable write: the
graceful strategy just stopped desiring the target replicas, the record
stayed behind as an unmodified tombstone, and the audit trail ended at
'started'. The strategy now clears the reconfiguration record on a tick
past the deadline with the target un-hydrated (success precedence
unchanged: a hydrated target still cuts over first), leaving the
realized config untouched. No tombstone is retained: a rolled-back
cluster reads settled in mz_cluster_reconfigurations and SHOW CLUSTERS,
and the timeout's papertrail is the audit event.

The audit vocabulary is re-carved around the clear: only the controller
clears a record, and a clear either advanced the realized config to the
cleared record's target ('finalized' — a hydrated success under either
action, or a forced COMMIT past the deadline) or left it short
('timed-out' — the rollback). Both carry the record's deadline, so a
late/forced cut-over is distinguishable from an in-time one. This
closes the deferred ROLLBACK-timeout audit event and removes the
documented COMMIT-late-success imprecision, with no catalog schema
change.

A cleared record is no longer conclusively success, so the foreground
wait-shim now carries the ALTER's target and, when the record clears,
reports success iff the realized config reached it, AlterClusterTimeout
otherwise.
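
Both the re-carved audit classification and the wait-shim verdict reduce to one comparison: did the realized config reach the target? A small sketch under that reading, with a hypothetical two-field `Shape` and hypothetical function names; the error is a plain string standing in for `AlterClusterTimeout`.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Shape {
    pub size: String,
    pub replication_factor: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ClearEvent {
    /// The clear advanced the realized config to the record's target.
    Finalized,
    /// The clear left the realized config short of the target (rollback).
    TimedOut,
}

/// Classify a record clear by whether the realized config reached the target.
pub fn classify_clear(realized_after: &Shape, cleared_target: &Shape) -> ClearEvent {
    if realized_after == cleared_target {
        ClearEvent::Finalized
    } else {
        ClearEvent::TimedOut
    }
}

/// Foreground wait-shim verdict once the record has cleared.
pub fn wait_result(realized_after: &Shape, alter_target: &Shape) -> Result<(), &'static str> {
    if realized_after == alter_target {
        Ok(())
    } else {
        Err("AlterClusterTimeout")
    }
}
```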

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

doc: consolidate rollback/audit semantics in autoscaling design

The tombstone -> clear-and-audit model change had its semantics
re-explained across several sections. Keep the full mechanism in its two
canonical homes — the ROLLBACK case under Failure handling and the audit
event list under Observability — and trim the echoes:

- Graceful reconfiguration strategy: drop the audit-papertrail aside; the
  strategy bullet only needs the mechanism.
- Introspection view: keep "reads settled", drop the restated audit
  division.
- Notable user-facing changes: cut the internal mechanism (record clear,
  durable-outcome read), keep the user-observable fact.
- Align the audit-log prose to the `timed-out` event name (was the stale
  "timeout-fired").

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…drate

The burst strategy armed whenever no steady replica reported itself
fully hydrated, which is also true when the replica simply does not exist
yet or has not registered with the compute controller. As a result, a
brand-new cluster with an AUTO SCALING STRATEGY would burst at creation,
before any object existed, wasting an extra replica's boot time plus its
linger duration.

Arming is now existential: a burst is warranted iff there exists an object
on the cluster that no steady-state replica has hydrated. With zero
objects the existential is trivially false, so the empty-cluster case
needs no special-casing. The controller pulls a has_hydratable_objects
signal through the ctx on demand, answered from the cluster's bound
objects filtered to dataflow-backed items (webhook sources excluded) — a
catalog-level approximation of what the per-replica hydration check
counts, whose mismatches self-heal through the linger path. The signal is
a live input excluded from the compare-and-append witness, and doubles as
a probe gate: an object-less cluster skips the per-replica hydration
probe entirely. A crashed or restarting steady replica on a cluster with
objects still bursts, which is intended.
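
The existential arming condition can be written down directly. This sketch assumes object and replica ids are plain integers and that hydration is available as a per-replica set of hydrated object ids; the real controller instead pulls a `has_hydratable_objects` signal and a per-replica hydration probe through the ctx. `burst_warranted` is a hypothetical name.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// `hydrated` maps each steady-state replica id to the object ids it has
/// hydrated. A burst is warranted iff some object on the cluster has been
/// hydrated by no steady-state replica.
pub fn burst_warranted(
    objects: &BTreeSet<u32>,
    hydrated: &BTreeMap<u32, BTreeSet<u32>>,
) -> bool {
    objects
        .iter()
        .any(|obj| !hydrated.values().any(|done| done.contains(obj)))
}
```

An empty `objects` set yields `false` with no special case, while a cluster that has objects but no registered steady replica yields `true`, matching the crashed-replica behaviour described above.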

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ified

Re-ran the full PM-showcase replay on a fresh environment with the PR 8
build: controller drops audit as 'retired', a rollback-at-deadline reads
started -> timed-out -> cancelled -> finalized with the deadline on every
transition and the new 'state' column live, and the burst audit shows a
single started/finished pair (no at-creation burst). Updates the showcase
report's rough-edges section to record the fixes and what a current build
prints where its transcripts predate them; marks PR 8 complete in the plan
tracker.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Re-run the full showcase replay now that the rough edges the original run
surfaced are fixed, and fold the new behavior into the document:

- The timeout-rollback section shows the controller rolling back on its
  own at the deadline — settled SHOW CLUSTERS, no parked record to clear,
  and the started/timed-out audit pair carrying the deadline and the
  abandoned target.
- Scenario 2 demonstrates that an empty cluster does not burst (burst
  arming is existential), and the burst audit trail is a single
  started/finished pair.
- The "rough edges found during this run" section is gone; transcripts
  now show retired drops and the rest of the fixed behavior directly.
- All timestamps and timings refreshed from the new run.

The replay script follows: the timeout capture queries the timed-out
audit event, the obsolete clear-parked-record step is removed, and the
at-creation-burst settle poll is replaced by a no-burst capture.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…lumn

Rework the in-flight reconfiguration introspection surface into two sparse,
JSON-forward base relations, and collapse the SHOW CLUSTERS additions into a
single column.

mz_cluster_reconfigurations keeps its name, OID, and mz_catalog_server
index but is reshaped: a row only while a graceful reconfiguration is in flight,
with typed deadline and on_timeout columns and the full target shape as a jsonb
column. The realized config already lives in mz_clusters, so the relation
carries only the in-flight delta.

mz_cluster_auto_scaling_strategies is new (own OID + index): one row while
a cluster has an AUTO SCALING STRATEGY configured or an autoscaling action
running, with the configured policy as a jsonb strategy column and the in-flight
runtime as a jsonb state column keyed by strategy ({"burst": ...}, NULL when
idle). JSON keeps the schema stable as strategies grow.

SHOW CLUSTERS drops the four added columns for a single nullable activity
column, built by LEFT JOINing both relations: a short summary of any in-flight
reconfiguration and/or burst, NULL when steady. It needs no mz_now(), so the
indexed mz_show_clusters view stays non-temporal.

Snapshots (oid.slt, mz_catalog_server_index_accounting.slt,
autogenerated/mz_internal.slt) regenerated from a live engine; the new MV takes
s506, shifting later catalog IDs. Tests updated: cluster_controller.slt and
show_clusters.slt for the new shapes, plus the catalog-inventory snapshots,
canary-clusters.td, and the test_managed_cluster.py rollback assertion.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@aljoscha aljoscha force-pushed the adapter-cluster-autoscaling branch 3 times, most recently from 55e75ca to 57e3371 Compare June 16, 2026 12:46