Skip to content

Catch per-startup failures during ConfigNode leader warm-up#17898

Merged
CRZbulabula merged 1 commit into
masterfrom
fix-confignode-leader-warmup-startup-failure
Jun 10, 2026
Merged

Catch per-startup failures during ConfigNode leader warm-up#17898
CRZbulabula merged 1 commit into
masterfrom
fix-confignode-leader-warmup-startup-failure

Conversation

@CRZbulabula

Copy link
Copy Markdown
Contributor

Background

Follow-up to #17821 ("Improve ConfigNode leader warm-up before serving"), addressing a review comment from @Caideyipi.

Problem

In ConfigRegionStateMachine.becomeLeader(), the parallel leader-service startups are joined with a barrier:

CompletableFuture.allOf(startups).join();

Each startup is wrapped by startInParallelIfEpochCurrent(), whose Javadoc claims the returned future "always completes normally so CompletableFuture#allOf acts as a clean join barrier." But the body ran startup.run() without any exception guard:

return CompletableFuture.runAsync(
    () -> {
      if (isCurrentLeaderServicesEpoch(epoch)) {
        startup.run(); // unguarded
      }
    },
    leaderServicesStartupPool);

If any of the parallel startups — CQ scheduler, pipe meta-sync/heartbeat, subscription meta-sync, metrics, cluster-id check, etc. — throws a RuntimeException, the corresponding future completes exceptionally. allOf(startups).join() then rethrows it as a CompletionException, which propagates out of becomeLeader() before ProcedureManager.startExecutor() and markLeaderServicesReadyIfEpochCurrent() run.

As a result leaderServicesReady is never set to true, and the ConfigNode keeps returning CONFIG_NODE_LEADER_WARMING_UP to clients indefinitely — the leader silently never becomes serviceable. The documented "always completes normally" invariant was simply false.

Fix

Make the invariant true by catching and logging each startup's failure inside startInParallelIfEpochCurrent() so it can never escape into the join barrier:

  • Each startup now runs inside a try/catch; a failure is logged at ERROR and swallowed, so a single misbehaving service can no longer abort the whole transition.
  • A failed service stays unstarted until the next leadership transition, but the node still finishes warming up and begins serving — and the failure is observable via the error log instead of a silent hang.
  • To make that log actionable, each startup is paired with a human-readable name through a small LeaderServiceStartup holder, so the log identifies exactly which service failed.
  • The startInParallelIfEpochCurrent() Javadoc is updated to describe the actual (now-correct) behavior.

Behavior change

Before After
One parallel startup throws becomeLeader() aborts; node stuck at CONFIG_NODE_LEADER_WARMING_UP forever Failure logged with the service name; warm-up completes; node serves
Other startups Their outcome irrelevant — transition already aborted Continue and complete normally

Testing

  • mvn spotless:apply -pl iotdb-core/confignode — clean.
  • mvn compile -pl iotdb-core/confignode — compiles successfully.

The change is a self-contained defensive guard around existing private startup logic; no public surface or existing tests are affected.

Within becomeLeader(), the parallel leader-service startups are joined with
CompletableFuture.allOf(startups).join(). startInParallelIfEpochCurrent()
ran startup.run() unguarded, so if any startup (CQ, pipe, subscription,
metrics, clusterId, ...) threw a RuntimeException, its future completed
exceptionally and join() rethrew it as a CompletionException out of
becomeLeader(). That aborted the transition before startExecutor() and
markLeaderServicesReadyIfEpochCurrent() ran, so leaderServicesReady never
flipped to true and the node kept returning CONFIG_NODE_LEADER_WARMING_UP
forever -- even though startInParallelIfEpochCurrent()'s Javadoc claimed the
future "always completes normally".

Make that claim true: each startup now catches and logs its own failure
(tagged with the service name via a small LeaderServiceStartup holder) and
never lets it escape. A single failing service stays unavailable until the
next leadership transition, but the node still finishes warming up and begins
serving, and the failure is observable through the error log.
@CRZbulabula CRZbulabula requested a review from Caideyipi June 10, 2026 07:38
@sonarqubecloud

Copy link
Copy Markdown

@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.74%. Comparing base (11a178a) to head (83ac794).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
...nsensus/statemachine/ConfigRegionStateMachine.java 0.00% 30 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17898      +/-   ##
============================================
+ Coverage     40.62%   40.74%   +0.11%     
- Complexity     2621     2623       +2     
============================================
  Files          5244     5244              
  Lines        362633   362646      +13     
  Branches      46684    46684              
============================================
+ Hits         147315   147754     +439     
+ Misses       215318   214892     -426     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Caideyipi Caideyipi left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness concern: this makes every parallel leader-service startup best-effort but still marks leaderServicesReady true afterward. Some entries in leaderServiceStartups() are not merely optional background services, e.g. PipeConfigNodeRuntime, CQScheduler, PipeMetaSync/PipeHeartbeat, and SubscriptionMetaSync. If any of those fails during startup, confirmLeader() can start returning SUCCESS while pipe/CQ/subscription functionality stays disabled until the next leadership transition, since this patch only logs the failure and does not retry it. That changes the system from warming up to serving with unavailable leader-only functionality. Can we either distinguish required vs optional startups, keep the leader warming/not ready for required failures, or add retry/health gating so a transient startup failure is recovered without waiting for a leadership change?

@CRZbulabula

Copy link
Copy Markdown
Contributor Author

@Caideyipi Thanks, this is a fair concern and I want to be explicit about the trade-off I'm making.

You're right that this turns every parallel startup into best-effort and still marks leaderServicesReady true, so in the worst case confirmLeader() can return SUCCESS while one leader-only service (pipe/CQ/subscription) stays disabled until the next leadership transition. I considered the three options you listed (required-vs-optional gating, keep-warming on required failure, retry/health gating) but I think for this PR best-effort is actually the right line, for a few reasons:

  1. These startups almost never fail. They are the same activation calls that already ran successfully on every previous leadership transition; a RuntimeException here is a genuinely exceptional event (a bug or hard environmental fault), not an expected transient condition. So we're optimizing the handling of a near-zero-probability path.

  2. There is no good in-place recovery handle. If startCQScheduler() / startPipeMetaSync() / startSubscriptionMetaSync() throws, none of them expose an idempotent "try again" entry point that's safe to call mid-epoch — re-running them isn't a free retry, and a retry/health-gating loop would have to reason about partial state each one left behind. Building that recovery machinery is a much larger change than the bug this PR fixes, and it would add real complexity to the warm-up path for a case that essentially never fires.

  3. leaderServicesReady blocking forever is strictly worse than best-effort serving. The pre-fix behavior is: one throw aborts becomeLeader() before markLeaderServicesReadyIfEpochCurrent(), so the node returns CONFIG_NODE_LEADER_WARMING_UP indefinitely — the entire ConfigNode is unserviceable, including all the leader functionality that started fine, with no log pointing at the cause. The keep-warming-on-required-failure option you mention reproduces exactly this hang for required services; it just makes it intentional. Isolating the blast radius so one failed service can't take down the whole leader — and surfacing it via an ERROR log naming the service — is already a large improvement over a silent total stall.

So the design intent of this PR is deliberately narrow: make the documented "always completes normally" invariant true, and make a failure observable instead of silent, rather than try to auto-heal a service that we have no clean way to restart anyway.

If we do want required-vs-optional distinction or retry/health gating later, I'd rather do it as a focused follow-up that also gives each service a real idempotent restart path (so retry is meaningful) and a health endpoint to gate on — otherwise we're gating readiness on a flag we can't recover. Happy to file that as a separate issue. Does scoping it that way sound reasonable to you?

@CRZbulabula CRZbulabula merged commit 86987e4 into master Jun 10, 2026
43 of 44 checks passed
@CRZbulabula CRZbulabula deleted the fix-confignode-leader-warmup-startup-failure branch June 10, 2026 09:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants