PHOENIX-7787 Make CCF HAGroupStore ZK Updates backward compatible with existing ZK based client #2476
Open
ritegarg wants to merge 1 commit into
…h existing ZK based client Co-authored-by: Cursor <cursoragent@cursor.com>
What changes were proposed in this pull request?
Adds an opt-in sync from `/phoenix/consistentHA/<G>` to the legacy `/phoenix/ha/<G>` znode in each region server's `HAGroupStoreClient`. Each client derives a combined `ClusterRoleRecord` (local role + peer role) and writes it to its local `/phoenix/ha` via ZK stat-version CAS, so pre-consistentHA ZK-registry clients keep reading a current view of the HA group.

New config keys (in `QueryServices`/`QueryServicesOptions`):

- `phoenix.ha.legacy.crr.sync.enabled`: master switch (default `false`).
- `phoenix.ha.legacy.crr.reconciliation.interval.seconds`: periodic reconciliation cadence (default `60`; `0` disables only the periodic loop, the event-driven path still runs).

Convergence model:

- Sync triggers on `CHILD_ADDED`/`CHILD_UPDATED` events (LOCAL or PEER).
- A `NodeCache` watcher on the legacy znode itself, so per-sync reads are in-memory rather than a ZK round-trip.
- On `BadVersion`: log and bail. The `NodeCache` watch plus the next event/periodic cycle reconverges. No inline retry.

Code-level changes:

- `ClusterRoleRecord`: new explicit-`RegistryType` ctor; `@JsonCreator` now round-trips `registryType` (the legacy znode is always written with `RegistryType.ZK` for backward compatibility, and must read back as ZK). New `isLogicallyEqualIgnoringVersionAndRegistry` helper.
- `PhoenixHAAdmin`: new helpers on the legacy namespace (`getClusterRoleRecordAndStatInZooKeeper`, `createOrUpdateClusterRoleRecordWithCAS`); atomic-read fix on the existing `getHAGroupStoreRecordInZooKeeper` (replaces a pre-existing 2-RPC pattern that could return a stat version not matching its data bytes). Named CAS sentinels: `CREATE_NEW_RECORD_STAT_VERSION = -2`, `ZK_MATCH_ANY_VERSION = -1`. The `-1` value is only used as ZK's wildcard, never as our create marker; using it for create would silently bypass CAS.
- `HAGroupStoreClient`: new `legacyHaAdmin`, `legacyCrrNodeCache`, private `legacyCrrSyncLock`, and a dedicated periodic executor. Listener hooks for `CHILD_ADDED`/`CHILD_UPDATED`. `CHILD_REMOVED` is intentionally a no-op: the operator owns the legacy znode lifecycle.
- New `StaleClusterRoleRecordVersionException` (CRR-specific analog of `StaleHAGroupStoreRecordVersionException`).

Why are the changes needed?
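For reference, enabling the feature might look like the following fragment. The key names and defaults come from this PR; placing them in `hbase-site.xml` follows the usual convention for Phoenix `QueryServices` properties and is an assumption here, not something the PR specifies.

```xml
<!-- Illustrative fragment: turn on the legacy-CRR sync, keep the default cadence. -->
<property>
  <name>phoenix.ha.legacy.crr.sync.enabled</name>
  <value>true</value>
</property>
<property>
  <name>phoenix.ha.legacy.crr.reconciliation.interval.seconds</name>
  <value>60</value>
</property>
```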
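The stat-version CAS and sentinel semantics described above can be sketched with an in-memory stand-in. This is not Phoenix code: a real sync goes through ZooKeeper's `setData(path, data, expectedVersion)` and surfaces a lost race as `BadVersionException`, but the decision logic (create-only sentinel, wildcard sentinel, bail on mismatch with no inline retry) is the same.

```java
// Minimal sketch of ZK stat-version compare-and-set semantics, assuming the
// sentinel values named in this PR. VersionedZnode is illustrative only.
final class VersionedZnode {
    static final int CREATE_NEW_RECORD_STAT_VERSION = -2; // "node must not exist yet"
    static final int ZK_MATCH_ANY_VERSION = -1;           // ZK wildcard: skip version check

    private byte[] data;      // null => znode does not exist
    private int version = -1; // mirrors ZK Stat.version (0 right after create)

    /**
     * CAS write. Returning false plays the role of BadVersionException/NoNode:
     * the caller logs and bails, and the next event or periodic cycle reconverges.
     */
    synchronized boolean compareAndSet(byte[] newData, int expectedVersion) {
        if (expectedVersion == CREATE_NEW_RECORD_STAT_VERSION) {
            if (data != null) {
                return false; // lost the create race
            }
            data = newData;
            version = 0;
            return true;
        }
        if (data == null) {
            return false;     // NoNode analog
        }
        if (expectedVersion != ZK_MATCH_ANY_VERSION && expectedVersion != version) {
            return false;     // BadVersion analog: stale reader loses the CAS
        }
        data = newData;
        version++;
        return true;
    }

    synchronized int getVersion() {
        return version;
    }
}
```

Note why `-1` cannot double as the create marker: passing `-1` to ZK's `setData` matches any existing version, so a "create" expressed as `-1` would silently overwrite a concurrent writer instead of failing, which is exactly the CAS bypass the dedicated `-2` sentinel avoids.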
The Consistent Cluster Failover work moved HA group state from `/phoenix/ha` (which held a `ClusterRoleRecord`) to `/phoenix/consistentHA` (which holds an `HAGroupStoreRecord` and uses an RPC-registry view of CRRs). Existing Phoenix clients built against the older ZK-registry contract still read `/phoenix/ha/<G>` and expect `registryType` to be `ZK`.

Without this patch, after a CCF rollout the `/phoenix/ha` znode goes stale or absent, and old ZK-registry clients break. This change keeps `/phoenix/ha` in sync with the new authoritative `/phoenix/consistentHA` records on a per-RS basis, gated by an opt-in flag so it only runs where the operator needs it.

Does this PR introduce any user-facing change?
Yes, but the new behavior is gated entirely behind the new master switch (`phoenix.ha.legacy.crr.sync.enabled`), which defaults to `false`. When the flag is enabled, `HAGroupStoreClient` writes to (and maintains) the legacy `/phoenix/ha/<G>` znode on its local ZK, so clients reading `/phoenix/ha` will see fresh records again.

One incidental visible change independent of the flag: `ClusterRoleRecord` JSON deserialization now respects the persisted `registryType` field. Previously it was hard-coded to `RPC` regardless of JSON content. Existing JSON without a `registryType` field still defaults to `RPC`, so deployments not exposed to legacy-sync writes see no behavior change.

How was this patch tested?
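The deserialization-defaulting rule can be illustrated with a small stand-in for the `@JsonCreator` path. The `RegistryType` values and the defaulting behavior follow the PR description; the factory method itself is hypothetical, not the actual Jackson-annotated constructor.

```java
// Sketch of the defaulting rule: a registryType absent from old JSON still maps
// to RPC, while a persisted value now round-trips instead of being hard-coded.
enum RegistryType { RPC, ZK }

final class ClusterRoleRecordSketch {
    final RegistryType registryType;

    private ClusterRoleRecordSketch(RegistryType registryType) {
        this.registryType = registryType;
    }

    /** Stand-in for the @JsonCreator: null means the field was missing in JSON. */
    static ClusterRoleRecordSketch fromPersisted(String registryTypeField) {
        RegistryType t = (registryTypeField == null)
                ? RegistryType.RPC                         // legacy JSON without the field
                : RegistryType.valueOf(registryTypeField); // round-trip the persisted value
        return new ClusterRoleRecordSketch(t);
    }
}
```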
New `HAGroupStoreClientIT` cases covering: feature-off no-op, initial create, role-changing transition, state-only transition (no rewrite), peer-driven role flip, peer-absent → `role2=UNKNOWN` → recovery, pre-existing stale znode overwrite, two-writer CAS race plus invalid-sentinel rejection, LOCAL/PEER `CHILD_REMOVED` (znode preserved), `registryType` stays ZK across multiple role cycles, reconciliation interval=0 (event-driven still works), and periodic-loop recovery after an external divergence.

- `HAGroupStoreClientIT`: 39/39 pass.
- `ClusterRoleRecordTest` (unit): 12/12 pass.
- `mvn spotless:check` clean across the repo.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude (Anthropic).