PHOENIX-7566 HAGroupStore admin tool: HDFS URLs, URL validation, failover recovery guidance #2519

Open: ritegarg wants to merge 4 commits into apache:PHOENIX-7562-feature-new from ritegarg:PHOENIX-7562-pr3-admin-tool

@ritegarg (Contributor) commented Jun 11, 2026

HAGroupStore admin tool (PhoenixHAAdminTool) fixes for consistent failover (PHOENIX-7566).

What changes were proposed in this pull request?

  • HDFS URLs in the tool: get/list print HDFS URL / Peer HDFS URL; update accepts --hdfs-url / --peer-hdfs-url (registered, counted in the "at least one field" guard, shown in proposed changes).
  • Canonical SYSTEM.HA_GROUP writes: create/update write the slot columns (including HDFS_URL_1/2) in a canonical order keyed on the formatted ZK URL, so each slot's ZK/CLUSTER/ROLE/HDFS stay paired and both clusters persist identical rows (matching the periodic ZK→SYSTEM.HA_GROUP sync). Previously update wrote local-first and never wrote HDFS, which could leave ZK_URL_n unpaired from HDFS_URL_n.
  • URL validation on write, tolerance on read: create/update validate URL fields against the registry type the read path uses (ZK quorum vs RPC/master). ZK URLs are always validated because they identify the HA pair; --force stores only malformed cluster URLs as-is. HAGroupStoreClient.getHAGroupNames skips a row whose ZK URL won't parse and logs a WARN instead of failing enumeration for all callers. get-cluster-role-record renders an unparseable stored cluster URL as <invalid> and surfaces the underlying cause instead of throwing.
  • Failover recovery guidance: on initiate-failover / abort-failover timeout, prints manual-recovery steps (inspect both clusters, restore connectivity, abort on the standby, or force a steady state).
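
To illustrate the canonical-write bullet above, here is a minimal sketch of slot assignment keyed on the formatted ZK URL. It is not the code in this PR. It assumes "keyed on" means plain lexicographic ordering, the class CanonicalSlotOrderSketch and its helpers are hypothetical, and the column names CLUSTER_URL_n and CLUSTER_ROLE_n and the role values are placeholders (only ZK_URL_n and HDFS_URL_n are named in the description).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only; not the PhoenixHAAdminTool implementation. */
final class CanonicalSlotOrderSketch {

    /** One cluster's fields, held together so they cannot be split across slots. */
    static final class ClusterFields {
        final String zkUrl;
        final String clusterUrl;
        final String role;
        final String hdfsUrl;

        ClusterFields(String zkUrl, String clusterUrl, String role, String hdfsUrl) {
            this.zkUrl = zkUrl;
            this.clusterUrl = clusterUrl;
            this.role = role;
            this.hdfsUrl = hdfsUrl;
        }
    }

    // Assumption: the canonical order is modeled as lexicographic order of the
    // formatted ZK URLs; the real comparison may differ.
    static boolean firstClusterTakesSlot1(String formattedZkUrlA, String formattedZkUrlB) {
        return formattedZkUrlA.compareTo(formattedZkUrlB) <= 0;
    }

    /** Builds the same row regardless of which cluster is "local". */
    static Map<String, String> toRow(ClusterFields local, ClusterFields peer) {
        boolean localFirst = firstClusterTakesSlot1(local.zkUrl, peer.zkUrl);
        Map<String, String> row = new LinkedHashMap<>();
        putSlot(row, 1, localFirst ? local : peer);
        putSlot(row, 2, localFirst ? peer : local);
        return row;
    }

    // Writes all four columns of one slot from a single cluster, so the
    // ZK/CLUSTER/ROLE/HDFS values of a slot always belong together.
    private static void putSlot(Map<String, String> row, int n, ClusterFields c) {
        row.put("ZK_URL_" + n, c.zkUrl);
        row.put("CLUSTER_URL_" + n, c.clusterUrl);  // column name assumed
        row.put("CLUSTER_ROLE_" + n, c.role);       // column name assumed
        row.put("HDFS_URL_" + n, c.hdfsUrl);
    }
}
```

Because the slot is chosen by comparing URLs rather than by which side runs the tool, toRow(a, b) and toRow(b, a) produce the same row, which is the property the description relies on when it says both clusters persist identical rows.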

Why are the changes needed?

The tool could not display or set HDFS URLs, and update could persist a row whose ZK/CLUSTER/ROLE/HDFS slots were unpaired. A single malformed stored URL could crash list / get-cluster-role-record and break HA-group enumeration for server-side callers. A timed-out failover left the operator with no recovery direction.
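
The read-side tolerance described above can be sketched as follows. This is a hedged illustration, not the HAGroupStoreClient code: the class TolerantEnumerationSketch, the in-memory map standing in for SYSTEM.HA_GROUP rows, and the simplified "host[,host...]:port" check are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

/** Illustrative only; the real parser and row source are different. */
final class TolerantEnumerationSketch {
    private static final Logger LOG =
        Logger.getLogger(TolerantEnumerationSketch.class.getName());

    // Stand-in validity check (assumption): "host[,host...]:port" with a numeric port.
    static boolean parses(String zkUrl) {
        if (zkUrl == null) {
            return false;
        }
        int colon = zkUrl.lastIndexOf(':');
        if (colon <= 0 || colon == zkUrl.length() - 1) {
            return false;
        }
        for (String host : zkUrl.substring(0, colon).split(",", -1)) {
            if (host.isEmpty() || host.contains(" ")) {
                return false;
            }
        }
        try {
            int port = Integer.parseInt(zkUrl.substring(colon + 1));
            return port > 0 && port <= 65535;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Returns the names of groups whose ZK URL parses; bad rows are skipped with a warning. */
    static List<String> haGroupNames(Map<String, String> zkUrlByGroupName) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> row : zkUrlByGroupName.entrySet()) {
            if (!parses(row.getValue())) {
                // One bad row must not fail enumeration for every caller.
                LOG.warning("Skipping HA group " + row.getKey()
                    + ": unparseable ZK URL '" + row.getValue() + "'");
                continue;
            }
            names.add(row.getKey());
        }
        return names;
    }
}
```

The design point is that the failure is contained to the offending row: callers still see every healthy group, and the warning leaves a trail for the operator to repair the stored URL.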

Does this PR introduce any user-facing change?

Yes, CLI only: new --hdfs-url / --peer-hdfs-url options on update, and HDFS URLs in get/list output; canonical, paired SYSTEM.HA_GROUP rows; malformed cluster URLs rejected on create/update (overridable with --force; ZK URLs are always validated); clearer get-cluster-role-record and failover-timeout output. No API or wire changes.
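
The write-side rule (ZK URLs always validated, --force bypassing only cluster-URL validation) could look roughly like the sketch below. The class WriteValidationSketch, the method names, and the simplified "host[,host...]:port" check are hypothetical; the actual tool validates against the registry type used by the read path.

```java
/** Illustrative only; not the validation code in PhoenixHAAdminTool. */
final class WriteValidationSketch {

    // Stand-in validity check (assumption): "host[,host...]:port" with a numeric port.
    static boolean looksValid(String url) {
        if (url == null) {
            return false;
        }
        int colon = url.lastIndexOf(':');
        if (colon <= 0 || colon == url.length() - 1) {
            return false;
        }
        for (String host : url.substring(0, colon).split(",", -1)) {
            if (host.isEmpty() || host.contains(" ")) {
                return false;
            }
        }
        try {
            int port = Integer.parseInt(url.substring(colon + 1));
            return port > 0 && port <= 65535;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /**
     * Throws if the write should be rejected.
     * The ZK URL is checked unconditionally; force only relaxes the cluster URL check.
     */
    static void validateForWrite(String zkUrl, String clusterUrl, boolean force) {
        if (!looksValid(zkUrl)) {
            throw new IllegalArgumentException("Invalid ZK URL: " + zkUrl);
        }
        if (!looksValid(clusterUrl) && !force) {
            throw new IllegalArgumentException(
                "Invalid cluster URL: " + clusterUrl + " (use --force to store as-is)");
        }
    }
}
```

With this shape, validateForWrite("zk1:2181", "garbage", true) returns normally, while a malformed ZK URL is rejected even when force is true.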

How was this patch tested?

  • PhoenixHAAdminToolIT (18/18) and PhoenixHAAdminIT (9/9) pass.

Was this patch authored or co-authored using generative AI tooling?

Yes — co-authored with Cursor.

@ritegarg ritegarg force-pushed the PHOENIX-7562-pr3-admin-tool branch 2 times, most recently from fad8d80 to fa67679 Compare June 11, 2026 23:53
…nical SYSTEM.HA_GROUP writes, failover recovery guidance

- get/list print HDFS URL / Peer HDFS URL; update accepts --hdfs-url/--peer-hdfs-url
  (register options, count them in the field guard, show them in proposed changes).
- Validate URL fields on create/update against the registry type the read path uses
  (ZK quorum vs RPC/master), with a --force bypass; HAGroupStoreClient.getHAGroupNames
  skips + WARNs a row whose ZK URL will not parse instead of breaking enumeration for all
  callers; render an unparseable stored cluster URL as <invalid> in get-cluster-role-record
  instead of crashing.
- create/update now write the SYSTEM.HA_GROUP slot columns (including HDFS_URL_1/2) in a
  canonical order keyed on the formatted ZK URL, so each slot's ZK/CLUSTER/ROLE/HDFS columns
  stay paired and both clusters persist identical rows (matching the periodic
  ZK->SYSTEM.HA_GROUP sync). update previously wrote local-first and never wrote HDFS, which
  could leave ZK_URL_n unpaired from HDFS_URL_n.
- On initiate-failover/abort-failover timeout, print manual-recovery guidance (inspect both
  sides, restore connectivity, abort on standby, or force a steady state).

Co-authored-by: Cursor <cursoragent@cursor.com>
@ritegarg ritegarg force-pushed the PHOENIX-7562-pr3-admin-tool branch from fa67679 to f39678c Compare June 12, 2026 00:33
Ritesh Garg and others added 3 commits June 11, 2026 20:34
- firstClusterTakesSlot1: rename fa/fb to formattedZkUrlA/formattedZkUrlB
  and restore insertIntoSystemTable's javadoc.
- printFailoverRecoveryGuidance: drop the FAILOVER_RUNBOOK.md reference
  (not in repo) and merge the overlapping transitional-state recovery
  steps into one.
- get-cluster-role-record: inline the invalid-URL fallback and remove the
  redundant printClusterRoleRecordWithInvalidUrls helper.

Co-authored-by: Cursor <cursoragent@cursor.com>
…allback)

- firstClusterTakesSlot1: correct the javadoc. The canonical order is keyed
  on the formatted ZK URL; ClusterRoleRecord canonicalizes url1/url2 on the
  cluster URL (not the ZK URL), and the periodic ZK->SYSTEM.HA_GROUP sync now
  matches this ordering (apache#2521).
- get-cluster-role-record: on the invalid-URL fallback, surface the underlying
  cause so an unrelated RuntimeException is not silently mislabeled as a bad
  URL, and label the raw values Cluster URL / Peer Cluster URL (local/peer)
  instead of the slot-based Cluster 1/2 URL.
- update help: note that --force also stores malformed URLs as-is.

Co-authored-by: Cursor <cursoragent@cursor.com>