feat: add replicated host sync gate #2018

Open
ku524 wants to merge 1 commit into Altinity:0.27.2 from ku524:pr/replicated-host-sync-gate
@ku524 ku524 commented Jun 30, 2026

Summary

  • Add default-off reconcile.host.wait.replicas.sync rolling gate for replicated hosts.
  • Wait for a recreated host to catch up to a bounded ClickHouse replication baseline before advancing to the next host.
  • Use SYSTEM SYNC REPLICA ... LIGHTWEIGHT, replicated database sync, async-loader settling, and a stable health window.
  • Do not require replication_queue to become empty.
  • Clear stale caught-up markers on data-loss/missing-volume paths only when the gate is enabled.

Contribution checklist

  • All commits in the PR are squashed.
  • The PR targets the dedicated next-release branch 0.27.2, not master.
  • The commit is signed off.

Operational scenario

This is primarily aimed at local or direct-attached storage recovery cases, such as NVMe-backed Local PVs. When a recreated ClickHouse pod starts on an empty or replaced local disk, Kubernetes readiness and the existing absolute_delay marker can become true before the replica has discovered all replicated objects and fetched the known parts from peers. The opt-in sync gate prevents the rolling reconcile from advancing to the next host until that recreated replica has passed ClickHouse-level async-load, object-discovery, sync, and health checks.

Why the existing caught-up check is not enough

The existing caught-up marker path is intentionally left unchanged when sync.enabled=false. It is a weak proxy for recreated-host recovery, however, because it only polls the local host's MAX(absolute_delay) from system.replicas before writing status.hostsWithReplicaCaughtUp, and that metric covers only the replicated objects already loaded and visible on the local server.

That can miss the failure mode this PR targets. During startup/recreation, asynchronous database/table loading may not have exposed every replicated object on the local host yet, and a local delay metric cannot discover replicated DBs/tables that exist on peer replicas or issue a ClickHouse sync barrier for their known parts. This PR keeps the old behavior as the default, and adds an opt-in gate that first settles async loading, discovers replicated objects from peers, runs DB/table sync barriers, and only then writes the caught-up marker after a stable health window.

Workload and side-effect considerations

The legacy check is lighter: it only polls MAX(absolute_delay) on the local host. The new gate does more work when explicitly enabled, so it can extend rolling reconcile time and add ClickHouse/Keeper/replication load while a recreated host catches up.

The added work is intentionally scoped:

  • Default-off: clusters keep the legacy lightweight behavior unless sync.enabled=true is configured.
  • Rolling scope: the gate runs in the existing per-host rolling path, not as a cluster-wide concurrent sweep.
  • Bounded object scope: it discovers replicated DBs/tables from peers and syncs those objects; it does not wait for the entire system.replication_queue to drain.
  • Lightweight table sync: table sync uses SYSTEM SYNC REPLICA ... LIGHTWEIGHT instead of a legacy full sync, so it waits only for the relevant known part-acquisition work and does not block on unrelated merges, mutations, or new ingest after the sync baseline.
  • Deadline controls: the whole gate uses a shared timeout; onTimeout=proceed can advance without writing the caught-up marker when operators prefer availability over blocking the rollout.
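
To make the bounded object scope concrete, the sketch below builds one sync statement per discovered object instead of waiting on the whole replication queue. It is an illustration under assumptions: `Table`, `quoteIdent`, and `SyncStatements` are hypothetical names, the use of `SYSTEM SYNC DATABASE REPLICA` for the replicated database sync is an assumption about what the PR issues, and the backslash-based identifier escaping is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// Table identifies one replicated table discovered from peers.
// Hypothetical type for this sketch.
type Table struct {
	Database string
	Name     string
}

// quoteIdent wraps an identifier in backticks and backslash-escapes
// backslashes and backticks inside it (simplified escaping, an assumption).
func quoteIdent(s string) string {
	r := strings.NewReplacer("\\", "\\\\", "`", "\\`")
	return "`" + r.Replace(s) + "`"
}

// SyncStatements returns database-level statements first, then one
// LIGHTWEIGHT table sync per discovered table.
func SyncStatements(databases []string, tables []Table) []string {
	stmts := make([]string, 0, len(databases)+len(tables))
	for _, db := range databases {
		stmts = append(stmts, "SYSTEM SYNC DATABASE REPLICA "+quoteIdent(db))
	}
	for _, t := range tables {
		stmts = append(stmts, fmt.Sprintf("SYSTEM SYNC REPLICA %s.%s LIGHTWEIGHT",
			quoteIdent(t.Database), quoteIdent(t.Name)))
	}
	return stmts
}

func main() {
	stmts := SyncStatements(
		[]string{"analytics"},
		[]Table{{Database: "analytics", Name: "events"}},
	)
	for _, s := range stmts {
		fmt.Println(s)
	}
}
```

Because the statement list is fixed at discovery time, work that arrives after that baseline does not extend the wait.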

Operationally, enabling this gate gives up some rolling-reconcile speed in exchange for a stronger recovery guarantee. That tradeoff is intended for local/direct-attached PV recovery cases where advancing to the next host before the recreated replica has rebuilt from peers is riskier than the extra catch-up work.

Related to #1704. This mitigates the “advance before recreated replica catches up” path, but does not close #1704 because the cross-operator-restart sequencing gap remains out of scope.

Safety

  • Default-off: existing behavior is unchanged when sync.enabled=false.
  • onTimeout=proceed advances without writing the caught-up marker.
  • Parent context cancellation, query/connection errors, async-loader failures, and readonly/session-expired at deadline remain hard failures.
  • Unsupported LIGHTWEIGHT versions fail explicitly.

Test plan

  • bash ./dev/run_code_generator.sh
  • bash ./dev/build_manifests.sh
  • bash ./dev/generate_helm_chart.sh
  • bash ./dev/go_build_all.sh
  • bash ./dev/find_unformatted_sources.sh
  • go test -count=1 ./pkg/apis/clickhouse.altinity.com/v1/... ./pkg/model/chi/schemer/... ./pkg/controller/chi/... ./pkg/controller/common/announcer
  • python3 -m py_compile tests/e2e/test_operator.py
  • yq eval-all 'true' tests/e2e/manifests/chopconf/test-079-sync-gate.yaml tests/e2e/manifests/chi/test-079-sync-gate-1.yaml tests/e2e/manifests/chi/test-079-sync-gate-2.yaml
  • git diff --check

Not run locally:

  • test_010079* and test_010056* e2e runtime, because the local docker-compose runner fails on Apple Silicon nested minikube/runc and no safe local native cluster is currently available.
  • Runtime confirmation of ClickHouse system.asynchronous_loader.is_ready behavior on the target e2e image; this is covered by the new e2e path/CI.

ku524 marked this pull request as ready for review June 30, 2026 00:40
ku524 force-pushed the pr/replicated-host-sync-gate branch 2 times, most recently from 0fbfea6 to edaf1e7 June 30, 2026 06:52
Signed-off-by: ku524 <yeonjuyeong@gmail.com>
ku524 force-pushed the pr/replicated-host-sync-gate branch from edaf1e7 to 2e28e6c June 30, 2026 07:00