feat: add replicated host sync gate by ku524 · Pull Request #2018 · Altinity/clickhouse-operator

ku524 · 2026-06-30T00:37:58Z

Summary

Add a default-off rolling gate, reconcile.host.wait.replicas.sync, for replicated hosts.
Wait for a recreated host to catch up to a bounded ClickHouse replication baseline before advancing to the next host.
Use SYSTEM SYNC REPLICA ... LIGHTWEIGHT, replicated database sync, async-loader settling, and a stable health window.
Do not require replication_queue to become empty.
Clear stale caught-up markers on data-loss/missing-volume paths only when the gate is enabled.
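
One way to read the "stable health window" above is that the host must stay healthy without interruption for a fixed duration before the marker is written. The following Python sketch illustrates that reading only; it is not the operator's Go code, and the function name, the sample format, and the reset-on-unhealthy rule are assumptions.

```python
def stable_since(samples, window_seconds):
    """Return the first timestamp at which health has held for window_seconds.

    samples: iterable of (timestamp_seconds, healthy) pairs in time order.
    Returns None if the window is never satisfied.
    All names here are hypothetical; the PR does not specify this shape.
    """
    streak_start = None
    for ts, healthy in samples:
        if not healthy:
            # Assumed rule: any unhealthy sample resets the window.
            streak_start = None
            continue
        if streak_start is None:
            streak_start = ts
        if ts - streak_start >= window_seconds:
            return ts
    return None


samples = [(0, True), (5, True), (10, False), (15, True), (20, True), (25, True)]
print(stable_since(samples, window_seconds=10))
```

In this example the unhealthy sample at t=10 discards the earlier healthy streak, so the window is only satisfied by the streak that starts at t=15.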

Contribution checklist

All commits in the PR are squashed.
The PR targets the dedicated next-release branch 0.27.2, not master.
The commit is signed off.

Operational scenario

This is primarily aimed at local or direct-attached storage recovery cases, such as NVMe-backed Local PVs. When a recreated ClickHouse pod starts on an empty or replaced local disk, Kubernetes readiness and the existing absolute_delay marker can become true before the replica has discovered all replicated objects and fetched the known parts from peers. The opt-in sync gate prevents the rolling reconcile from advancing to the next host until that recreated replica has passed ClickHouse-level async-load, object-discovery, sync, and health checks.

Why the existing caught-up check is not enough

The existing caught-up marker path is intentionally left unchanged when sync.enabled=false. It is a weak proxy for recreated-host recovery, however: it only polls the local host's MAX(absolute_delay) from system.replicas before writing status.hostsWithReplicaCaughtUp, and that metric covers only replicated objects already loaded and visible on the local server.
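
To make that limitation concrete, here is a minimal Python model of the legacy decision. It is a sketch, not the operator's Go implementation: the query text mirrors the MAX(absolute_delay) poll described above, while the function name, the input shape, and the max_delay threshold are assumptions for illustration.

```python
# Polled on the local host only; system.replicas and absolute_delay are
# real ClickHouse names, the rest of this sketch is hypothetical.
LEGACY_QUERY = "SELECT max(absolute_delay) FROM system.replicas"


def legacy_caught_up(visible_delays, max_delay=0):
    """Model of the legacy check over locally visible replicated tables.

    visible_delays: absolute_delay values for tables the local server has
    already loaded. Tables not yet loaded contribute no rows at all.
    max_delay: assumed threshold; the operator's real setting may differ.
    """
    # An aggregate over no rows is modeled as 0 (the numeric default).
    worst = max(visible_delays, default=0)
    return worst <= max_delay


# A recreated host that has not loaded any replicated table yet has no
# rows to report, so it looks caught up while holding no data.
print(legacy_caught_up([]))
print(legacy_caught_up([0, 42]))
```

The first call reports the host as caught up purely because nothing is visible yet, which is the gap the opt-in gate is meant to close.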

That can miss the failure mode this PR targets. During startup/recreation, asynchronous database/table loading may not have exposed every replicated object on the local host yet, and a local delay metric cannot discover replicated DBs/tables that exist on peer replicas or issue a ClickHouse sync barrier for their known parts. This PR keeps the old behavior as the default, and adds an opt-in gate that first settles async loading, discovers replicated objects from peers, runs DB/table sync barriers, and only then writes the caught-up marker after a stable health window.
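
The ordering described above (settle async loading, discover from peers, run sync barriers, hold a stable health window, then write the marker) can be sketched as a short pipeline. This is an illustrative Python sketch under assumed names; the step callables are placeholders and do not correspond to functions in the operator.

```python
def run_sync_gate(steps, write_marker):
    """Run gate steps in order; write the marker only if every step passes.

    steps: ordered list of (name, callable) pairs, each callable returning
    True on success. Names are hypothetical placeholders.
    Returns the name of the first failed step, or None on success.
    """
    for name, step in steps:
        if not step():
            return name  # stop early; the caught-up marker is not written
    write_marker()
    return None


log = []
steps = [
    ("settle_async_loading", lambda: log.append("settle") or True),
    ("discover_from_peers", lambda: log.append("discover") or True),
    ("sync_barriers", lambda: log.append("sync") or True),
    ("stable_health_window", lambda: log.append("health") or True),
]
failed = run_sync_gate(steps, lambda: log.append("marker"))
print(failed, log)
```

The point of the sketch is that the marker write is the last action and is unreachable if any earlier step fails.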

Workload and side-effect considerations

The legacy check is lighter: it only polls MAX(absolute_delay) on the local host. The new gate does more work when explicitly enabled, so it can extend rolling reconcile time and add ClickHouse/Keeper/replication load while a recreated host catches up.

The added work is intentionally scoped:

Default-off: clusters keep the legacy lightweight behavior unless sync.enabled=true is configured.
Rolling scope: the gate runs in the existing per-host rolling path, not as a cluster-wide concurrent sweep.
Bounded object scope: it discovers replicated DBs/tables from peers and syncs those objects; it does not wait for the entire system.replication_queue to drain.
Lightweight table sync: table sync uses SYSTEM SYNC REPLICA ... LIGHTWEIGHT instead of legacy full sync, so it waits for the relevant known part-acquisition work without blocking on unrelated merges, mutations, or new ingest after the sync baseline.
Deadline controls: the whole gate uses a shared timeout; onTimeout=proceed can advance without writing the caught-up marker when operators prefer availability over blocking the rollout.
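
As an illustration of the bounded object scope and lightweight table sync items above, the sketch below builds one sync statement per object discovered on peers. It is a hedged Python sketch, not the operator's code: the helper names and input shapes are hypothetical, and mapping "replicated database sync" to SYSTEM SYNC DATABASE REPLICA is an assumption about which ClickHouse statement is meant.

```python
def quote_ident(name):
    """Backtick-quote a ClickHouse identifier, escaping backslashes and backticks."""
    return "`" + name.replace("\\", "\\\\").replace("`", "\\`") + "`"


def build_sync_statements(replicated_databases, replicated_tables):
    """Build sync barriers for objects discovered on peer replicas.

    replicated_databases: names of databases assumed to use the Replicated engine.
    replicated_tables: (database, table) pairs for replicated tables seen on peers.
    Duplicates reported by several peers are collapsed.
    """
    stmts = [
        f"SYSTEM SYNC DATABASE REPLICA {quote_ident(db)}"
        for db in sorted(set(replicated_databases))
    ]
    stmts += [
        f"SYSTEM SYNC REPLICA {quote_ident(db)}.{quote_ident(tbl)} LIGHTWEIGHT"
        for db, tbl in sorted(set(replicated_tables))
    ]
    return stmts


# Two peers report the same table; only one barrier is emitted for it.
for stmt in build_sync_statements(["db1"], [("db1", "events"), ("db1", "events")]):
    print(stmt)
```

The work is proportional to the number of discovered objects, which is why the gate does not need system.replication_queue to drain.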

Operationally, enabling this gate trades faster rolling progress for a stronger recovery guarantee. That tradeoff is intended for local/direct-attached PV recovery cases where advancing to the next host before the recreated replica has rebuilt from peers is riskier than the extra catch-up work.

Related to #1704. This mitigates the “advance before recreated replica catches up” path, but does not close #1704 because the cross-operator-restart sequencing gap remains out of scope.

Safety

Default-off: existing behavior is unchanged when sync.enabled=false.
onTimeout=proceed advances without writing the caught-up marker.
Parent context cancellation, query/connection errors, async-loader failures, and readonly/session-expired at deadline remain hard failures.
Unsupported LIGHTWEIGHT versions fail explicitly.
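
The safety rules above reduce to a small decision table: what the reconcile does with the host, and whether the caught-up marker is written. The Python sketch below models that table under assumptions. The Outcome names and the "fail" default are hypothetical; only the proceed value of onTimeout appears in this PR description.

```python
from enum import Enum


class Outcome(Enum):
    SYNCED = "synced"              # every gate step passed
    TIMEOUT = "timeout"            # shared deadline expired, host otherwise healthy
    HARD_FAILURE = "hard_failure"  # cancellation, query/connection error, etc.


def decide(outcome, on_timeout="fail"):
    """Return (advance_to_next_host, write_caught_up_marker).

    on_timeout: "proceed" is taken from the PR text; any other value is
    treated as blocking here, which is an assumption.
    """
    if outcome is Outcome.SYNCED:
        return (True, True)
    if outcome is Outcome.TIMEOUT and on_timeout == "proceed":
        # Availability over blocking: advance, but never claim caught-up.
        return (True, False)
    return (False, False)


print(decide(Outcome.TIMEOUT, on_timeout="proceed"))
print(decide(Outcome.HARD_FAILURE, on_timeout="proceed"))
```

Note that proceed only relaxes a plain timeout; in this model, readonly or session-expired at the deadline is classified as HARD_FAILURE and still blocks.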

Test plan

bash ./dev/run_code_generator.sh
bash ./dev/build_manifests.sh
bash ./dev/generate_helm_chart.sh
bash ./dev/go_build_all.sh
bash ./dev/find_unformatted_sources.sh
go test -count=1 ./pkg/apis/clickhouse.altinity.com/v1/... ./pkg/model/chi/schemer/... ./pkg/controller/chi/... ./pkg/controller/common/announcer
python3 -m py_compile tests/e2e/test_operator.py
yq eval-all 'true' tests/e2e/manifests/chopconf/test-079-sync-gate.yaml tests/e2e/manifests/chi/test-079-sync-gate-1.yaml tests/e2e/manifests/chi/test-079-sync-gate-2.yaml
git diff --check

Not run locally:

e2e runtime for test_010079* and test_010056*: the local docker-compose runner fails on Apple Silicon (nested minikube/runc), and no safe local native cluster is currently available.
Runtime confirmation of ClickHouse system.asynchronous_loader.is_ready behavior on the target e2e image; this is covered by the new e2e path/CI.

Signed-off-by: ku524 <yeonjuyeong@gmail.com>

ku524 marked this pull request as ready for review June 30, 2026 00:40

ku524 force-pushed the pr/replicated-host-sync-gate branch 2 times, most recently from 0fbfea6 to edaf1e7, June 30, 2026 06:52

feat: add replicated host sync gate

2e28e6c

Signed-off-by: ku524 <yeonjuyeong@gmail.com>

ku524 force-pushed the pr/replicated-host-sync-gate branch from edaf1e7 to 2e28e6c, June 30, 2026 07:00

ku524 wants to merge 1 commit into Altinity:0.27.2 from ku524:pr/replicated-host-sync-gate
