Skip to content

PHOENIX-7566 HAGroupStore state machine: ANISTS non-blocking, DS->STANDBY_TO_ACTIVE#2518

Open
ritegarg wants to merge 1 commit into
apache:PHOENIX-7562-feature-newfrom
ritegarg:PHOENIX-7562-pr1-state-machine
Open

PHOENIX-7566 HAGroupStore state machine: ANISTS non-blocking, DS->STANDBY_TO_ACTIVE#2518
ritegarg wants to merge 1 commit into
apache:PHOENIX-7562-feature-newfrom
ritegarg:PHOENIX-7562-pr1-state-machine

Conversation

@ritegarg

@ritegarg ritegarg commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

HAGroupStore state-machine fixes for consistent failover (PHOENIX-7566).

What changes were proposed in this pull request?

  • In HAGroupStoreRecord.getClusterRole(), map ACTIVE_NOT_IN_SYNC_TO_STANDBY to the ACTIVE role (non-blocking). Only ACTIVE_IN_SYNC_TO_STANDBY remains ACTIVE_TO_STANDBY (mutation-blocking).
  • Add STANDBY_TO_ACTIVE to DEGRADED_STANDBY.allowedTransitions, so a degraded standby can promote directly on failover (matches the TLA+ model).

Why are the changes needed?

On the not-in-sync failover path, an active that is still draining its backlog was treated as mutation-blocked, even though it should keep serving writes until it is in sync and ready to hand off. Separately, a DEGRADED_STANDBY had no allowed transition to STANDBY_TO_ACTIVE, so it could not promote during failover. Together these prevented a clean failover when the active is behind.

Does this PR introduce any user-facing change?

No. Internal HA state-machine semantics only; no API, config, or CLI changes.

How was this patch tested?

  • HAGroupStoreRecordTest (17/17): role mapping + mutation-block coverage (ACTIVE_NOT_IN_SYNC_TO_STANDBY non-blocking, ACTIVE_IN_SYNC_TO_STANDBY blocking) and the DEGRADED_STANDBY → STANDBY / STANDBY_TO_ACTIVE transitions.
  • HAGroupStoreManagerIT (17/17): per-group mutation blocking.

Was this patch authored or co-authored using generative AI tooling?

Yes — co-authored with Cursor.

…ED_STANDBY -> STANDBY_TO_ACTIVE

- Map ACTIVE_NOT_IN_SYNC_TO_STANDBY to ClusterRole ACTIVE (non-blocking); only
  ACTIVE_IN_SYNC_TO_STANDBY remains mutation-blocking.
- Allow DEGRADED_STANDBY -> STANDBY_TO_ACTIVE so a degraded standby can promote on failover.
- Tests: HAGroupStoreRecordTest role/mutation-block and transition assertions; HAGroupStoreManagerIT
  multi-group blocking uses ACTIVE_IN_SYNC_TO_STANDBY.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ritegarg ritegarg force-pushed the PHOENIX-7562-pr1-state-machine branch from e846600 to eb84f89 Compare June 11, 2026 20:36
@ritegarg ritegarg changed the title PHOENIX-7566 HAGroupStore state machine: ANISTS non-blocking, DS->STANDBY_TO_ACTIVE, no anti-flap wait on ANISTS->AISTS PHOENIX-7566 HAGroupStore state machine: ANISTS non-blocking, DS->STANDBY_TO_ACTIVE Jun 11, 2026
@apurtell apurtell requested a review from Copilot June 12, 2026 22:27

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts Phoenix HAGroupStore state-machine semantics to enable cleaner failover behavior when the active cluster is not yet in sync, by making the ACTIVE_NOT_IN_SYNC_TO_STANDBY path non-mutation-blocking and allowing a degraded standby to promote directly.

Changes:

  • Map HAGroupState.ACTIVE_NOT_IN_SYNC_TO_STANDBY to ClusterRole.ACTIVE (non-blocking), leaving only ACTIVE_IN_SYNC_TO_STANDBY as ACTIVE_TO_STANDBY (mutation-blocking).
  • Allow DEGRADED_STANDBY -> STANDBY_TO_ACTIVE transition to enable direct promotion during failover.
  • Update unit/integration tests to reflect the updated mapping and transition behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
phoenix-core-client/src/main/java/org/apache/phoenix/jdbc/HAGroupStoreRecord.java Updates HA state→role mapping and extends DEGRADED_STANDBY allowed transitions to include STANDBY_TO_ACTIVE.
phoenix-core/src/test/java/org/apache/phoenix/jdbc/HAGroupStoreRecordTest.java Updates role-mapping expectations and adds assertions covering mutation-blocking behavior and degraded-standby transitions.
phoenix-core/src/it/java/org/apache/phoenix/jdbc/HAGroupStoreManagerIT.java Adjusts the integration test to use the mutation-blocking drain state (ACTIVE_IN_SYNC_TO_STANDBY) and clarifies per-group blocking expectations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@apurtell apurtell left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.
I did the review myself, but did have the robots help with considering TLA model updates after this goes in. (I'll have some fun there.)

Then I asked for suggestions on improving test coverage because the current coverage seems a bit weak. These were the suggestions, fyi

  1. Drive (ANIS, DS) -> (S, AIS) through the real ZK + FailoverManagementListener path in something like FailoverPhoenixConnectionIT or HAGroupStoreManagerIT, and assert both clusters converge to (S, AIS). The current unit test only pins the transition table.

  2. Nit in HAGroupStoreRecordTest: Add an assertFalse(DEGRADED_STANDBY.isTransitionAllowed(SOMETHING_NOT_IN_SET)) to lock down the allowed set.

ImmutableSet.of(ABORT_TO_ACTIVE_IN_SYNC, STANDBY);
STANDBY_TO_ACTIVE.allowedTransitions = ImmutableSet.of(ABORT_TO_STANDBY, ACTIVE_IN_SYNC);
DEGRADED_STANDBY.allowedTransitions = ImmutableSet.of(STANDBY);
DEGRADED_STANDBY.allowedTransitions = ImmutableSet.of(STANDBY, STANDBY_TO_ACTIVE);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Fixes a DS side deadlock at peer AISTS.

case ABORT_TO_ACTIVE_NOT_IN_SYNC:
case ACTIVE_IN_SYNC:
case ACTIVE_NOT_IN_SYNC:
case ACTIVE_NOT_IN_SYNC_TO_STANDBY:

@apurtell apurtell Jun 12, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Reduces unavailability during failover. The safety argument shifts a bit, but the transition tables do not allow the peer to promote, so we still have only a single writer in all states.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants