PHOENIX-7562 HAGroupStore peer cache: fail-closed replay on peer loss by ritegarg · Pull Request #2547 · apache/phoenix

ritegarg · 2026-06-25T18:43:03Z

What changes were proposed in this pull request?

Make a STANDBY RegionServer's replication replay fail closed while its peer cluster is not visible, and move all peer-connection handling out of HAGroupStoreClient into a dedicated PeerClusterWatcher.

PeerClusterWatcher (new) — owns the peer PathChildrenCache + PhoenixHAAdmin for one HA group and reports peer state via a PeerStateListener (onPeerStateChanged / onPeerVisible / onPeerBlind):
- Retry when peer ZK is down (e.g. unreachable at startup): if the cache can't be built, the watcher goes BLIND and a per-client daemon thread retries the build every phoenix.ha.group.store.peer.cache.retry.interval.seconds until the peer returns; the watcher then goes VISIBLE with no restart needed. The retry daemon is scheduled lazily, only once a peer is configured, so an unused watcher starts no thread.
- Forced redelivery after reconnect: peer records are de-duplicated by znode version, with exactly one forced redelivery on reconnect so no transition is missed across the disconnect.
- Concurrency: lock order transitionLock → stateLock; the blocking cache build and listener callbacks run outside stateLock, and visible/blind transitions are delivered atomically with their notifications.
HAGroupStoreCacheUtil (new) — shared helpers to parse a znode into (record, stat) and build/start a PathChildrenCache (init latch released in finally).
HAGroupStoreClient — delegates peer handling to PeerClusterWatcher via a PeerStateListener; adds the in-memory effective-state overlay getEffectiveHAGroupStoreRecord() (reports a local STANDBY as DEGRADED_STANDBY whenever the peer is blind — whether the peer drops while already STANDBY or the role reaches STANDBY after the peer is blind — never persisted), suppresses a redundant real STANDBY while degraded, and serializes the local degrade/recover notifications so they cannot reorder.
HAGroupStoreManager / HAGroupStoreRecord — add getEffectiveHAGroupStoreRecord(haGroupName) and withHAGroupState(...) (immutable copy for the overlay), and document DEGRADED_STANDBY's dual nature. The DEGRADED_STANDBY state and the subscription framework are pre-existing.
Replication — the replay consumers (ReplicationLogGroup.init, ReplicationLogDiscoveryReplay.getHAGroupRecord, ReplicationLogReplay.init) read the effective record; mode mapping is otherwise unchanged.
Config — new phoenix.ha.group.store.peer.cache.retry.interval.seconds (default 60s; 0 disables retry), with jittered retry and rate-limited WARN (1st + every 10th attempt).
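For reviewers, the retry behaviour described in the Config item can be pictured with the sketch below. It is illustrative only and is not the code in this PR: the class name `PeerCacheRetryPolicy`, the 10% jitter fraction, and the reading of "1st + every 10th attempt" as attempts 1, 10, 20, ... are all assumptions.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper, not the PR's actual class. */
final class PeerCacheRetryPolicy {
    // Value of phoenix.ha.group.store.peer.cache.retry.interval.seconds (0 disables retry).
    private final long intervalSeconds;
    // Assumed jitter fraction; the PR description does not state the real value.
    private final double jitterFraction;

    PeerCacheRetryPolicy(long intervalSeconds, double jitterFraction) {
        if (intervalSeconds < 0 || jitterFraction < 0 || jitterFraction >= 1) {
            throw new IllegalArgumentException("invalid retry settings");
        }
        this.intervalSeconds = intervalSeconds;
        this.jitterFraction = jitterFraction;
    }

    /** An interval of 0 means the retry task is never scheduled. */
    boolean retryEnabled() {
        return intervalSeconds > 0;
    }

    /** Base interval plus or minus a random spread, so clients do not retry in lockstep. */
    long nextDelayMillis() {
        if (!retryEnabled()) {
            throw new IllegalStateException("retry is disabled");
        }
        long base = intervalSeconds * 1000L;
        long spread = (long) (base * jitterFraction);
        if (spread == 0) {
            return base;
        }
        return base + ThreadLocalRandom.current().nextLong(-spread, spread + 1);
    }

    /** Rate-limited WARN: log the first failed attempt and every tenth one after that. */
    static boolean shouldWarn(int attempt) {
        return attempt == 1 || attempt % 10 == 0;
    }
}
```

With the default of 60 seconds and the assumed 10% jitter, each delay falls between 54 and 66 seconds.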

Why are the changes needed?

When the peer ZK is unreachable — including down at startup — a STANDBY can't reliably determine peer state, so replay risked proceeding as if in sync (fail-open). It must instead fail closed (STORE_AND_FORWARD) until the peer is reachable, then recover automatically. Extracting the peer lifecycle into PeerClusterWatcher decouples replay from peer connectivity and keeps HAGroupStoreClient focused on the effective HA view.
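The fail-closed rule can be summarised as a pure function of the persisted local state and peer visibility. The sketch below is a simplification for discussion, not the PR's implementation: `HaState` is a hypothetical, trimmed-down stand-in for the real state enum (`ACTIVE` is a placeholder for the non-standby states), and `EffectiveStateOverlay` is an invented name.

```java
/** Hypothetical, trimmed-down state enum; the real HA group state has more values. */
enum HaState { ACTIVE, STANDBY, DEGRADED_STANDBY }

/** Hypothetical illustration of the in-memory effective-state overlay. */
final class EffectiveStateOverlay {
    private EffectiveStateOverlay() { }

    /**
     * Returns the state that replay consumers should act on.
     * Only a local STANDBY is overlaid, and only while the peer is blind;
     * the persisted record is never modified by this computation.
     */
    static HaState effectiveState(HaState persisted, boolean peerVisible) {
        if (persisted == HaState.STANDBY && !peerVisible) {
            return HaState.DEGRADED_STANDBY;
        }
        return persisted;
    }
}
```

Because the result is recomputed from the current inputs, recovery needs no explicit undo step: once the peer is visible again, the same function returns the persisted STANDBY.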

Does this PR introduce any user-facing change?

Yes, within the unreleased consistent-failover feature branch (no change vs released Phoenix):

New config phoenix.ha.group.store.peer.cache.retry.interval.seconds (default 60s).
A STANDBY whose peer ZK is unreachable presents an effective DEGRADED_STANDBY (in-memory only, never written to ZK), so replay fails closed until the peer is visible again. Persisted wire format is unchanged.

How was this patch tested?

New and existing unit/integration tests: HAGroupStoreClientIT, HAGroupStoreManagerIT, HAGroupStateSubscriptionIT, ReplicationLogDiscoveryReplayTestIT, ReplicationLogGroupIT, ReplicationLogGroupTest, HAGroupStoreRecordTest, PeerClusterWatcherTest.

Key cases: peer-loss degrade/recover; the role entering STANDBY while the peer is already blind, and cold start with peer ZK down, both present DEGRADED_STANDBY while the persisted record stays STANDBY; peer ZK down at startup then retry rebuilds the cache; lazy retry scheduling (a never-configured watcher starts no retry thread); in-memory-only overlay; STANDBY re-entry suppressed while peer-blind; forced reconnect redelivery; visible/blind serialization; close() idempotency; replayer degrade → abort → recover; and a listener that throws on the cache INITIALIZED event does not prevent the client from initializing as healthy.
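The forced reconnect redelivery case is easiest to follow with a small model of version-based de-duplication. This is a sketch under stated assumptions rather than the watcher's real code: `VersionDedupDelivery` and its methods are hypothetical names, and znode versions are assumed to only increase.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of de-duplicated delivery with one forced redelivery per reconnect. */
final class VersionDedupDelivery {
    private int lastDeliveredVersion = -1;
    private boolean forceNext = false;
    private final List<Integer> delivered = new ArrayList<>();

    /** Called when the peer connection is re-established. */
    synchronized void onReconnect() {
        forceNext = true;
    }

    /**
     * Offers a peer record identified by its znode version.
     * Returns true if the record is delivered to the listener.
     */
    synchronized boolean offer(int znodeVersion) {
        boolean isNewer = znodeVersion > lastDeliveredVersion;
        if (!isNewer && !forceNext) {
            return false; // same or older version: suppress the duplicate
        }
        forceNext = false; // the forced redelivery is consumed exactly once
        lastDeliveredVersion = Math.max(lastDeliveredVersion, znodeVersion);
        delivered.add(znodeVersion);
        return true;
    }

    /** Snapshot of the versions delivered so far, in delivery order. */
    synchronized List<Integer> delivered() {
        return new ArrayList<>(delivered);
    }
}
```

In this model, offering version 3 twice delivers it once; after `onReconnect()`, version 3 is delivered one more time, and later duplicates are suppressed again until a newer version arrives.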

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor

Extract peer-connection handling from HAGroupStoreClient into a dedicated PeerClusterWatcher: peer cache lifecycle, background retry (scheduled lazily, only once a peer is configured) when peer ZK is unreachable, connection-state handling, de-duplicated delivery with one forced redelivery after reconnect, and a visible/blind state machine. While this RegionServer is STANDBY and cannot see the peer, present an effective local DEGRADED_STANDBY so replication replay fails closed. The overlay is in-memory only; the shared HA record is never modified. The replication replay consumers read the effective HA state rather than peer-connectivity details, and peer reconcile runs off Curator event threads. Add phoenix.ha.group.store.peer.cache.retry.interval.seconds (default 60s) with retry jitter and rate-limited, reason-tagged logging. Co-authored-by: Cursor <cursoragent@cursor.com>

ritegarg force-pushed the PHOENIX-7562-peer-cache branch from d2b1191 to 717b765 Compare June 25, 2026 20:49

ritegarg force-pushed the PHOENIX-7562-peer-cache branch from 717b765 to f59328b Compare June 29, 2026 01:50

ritegarg wants to merge 1 commit into apache:PHOENIX-7562-feature-new from ritegarg:PHOENIX-7562-peer-cache