Catch per-startup failures during ConfigNode leader warm-up by CRZbulabula · Pull Request #17898 · apache/iotdb

CRZbulabula · 2026-06-10T07:38:48Z

Background

Follow-up to #17821 ("Improve ConfigNode leader warm-up before serving"), addressing a review comment from @Caideyipi.

Problem

In ConfigRegionStateMachine.becomeLeader(), the parallel leader-service startups are joined with a barrier:

CompletableFuture.allOf(startups).join();

Each startup is wrapped by startInParallelIfEpochCurrent(), whose Javadoc claims the returned future "always completes normally so CompletableFuture#allOf acts as a clean join barrier." But the body ran startup.run() without any exception guard:

return CompletableFuture.runAsync(
    () -> {
      if (isCurrentLeaderServicesEpoch(epoch)) {
        startup.run(); // unguarded
      }
    },
    leaderServicesStartupPool);

If any of the parallel startups — CQ scheduler, pipe meta-sync/heartbeat, subscription meta-sync, metrics, cluster-id check, etc. — throws a RuntimeException, the corresponding future completes exceptionally. allOf(startups).join() then rethrows it as a CompletionException, which propagates out of becomeLeader() before ProcedureManager.startExecutor() and markLeaderServicesReadyIfEpochCurrent() run.

As a result leaderServicesReady is never set to true, and the ConfigNode keeps returning CONFIG_NODE_LEADER_WARMING_UP to clients indefinitely — the leader silently never becomes serviceable. The documented "always completes normally" invariant was simply false.

Fix

Make the invariant true by catching and logging each startup's failure inside startInParallelIfEpochCurrent() so it can never escape into the join barrier:

Each startup now runs inside a try/catch; a failure is logged at ERROR and swallowed, so a single misbehaving service can no longer abort the whole transition.
A failed service stays unstarted until the next leadership transition, but the node still finishes warming up and begins serving — and the failure is observable via the error log instead of a silent hang.
To make that log actionable, each startup is paired with a human-readable name through a small LeaderServiceStartup holder, so the log identifies exactly which service failed.
The startInParallelIfEpochCurrent() Javadoc is updated to describe the actual (now-correct) behavior.

Behavior change

	Before	After
One parallel startup throws	`becomeLeader()` aborts; node stuck at `CONFIG_NODE_LEADER_WARMING_UP` forever	Failure logged with the service name; warm-up completes; node serves
Other startups	Their outcome irrelevant — transition already aborted	Continue and complete normally

Testing

mvn spotless:apply -pl iotdb-core/confignode — clean.
mvn compile -pl iotdb-core/confignode — compiles successfully.

The change is a self-contained defensive guard around existing private startup logic; no public surface or existing tests are affected.

Within becomeLeader(), the parallel leader-service startups are joined with CompletableFuture.allOf(startups).join(). startInParallelIfEpochCurrent() ran startup.run() unguarded, so if any startup (CQ, pipe, subscription, metrics, clusterId, ...) threw a RuntimeException, its future completed exceptionally and join() rethrew it as a CompletionException out of becomeLeader(). That aborted the transition before startExecutor() and markLeaderServicesReadyIfEpochCurrent() ran, so leaderServicesReady never flipped to true and the node kept returning CONFIG_NODE_LEADER_WARMING_UP forever -- even though startInParallelIfEpochCurrent()'s Javadoc claimed the future "always completes normally". Make that claim true: each startup now catches and logs its own failure (tagged with the service name via a small LeaderServiceStartup holder) and never lets it escape. A single failing service stays unavailable until the next leadership transition, but the node still finishes warming up and begins serving, and the failure is observable through the error log.

sonarqubecloud · 2026-06-10T07:49:01Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

codecov · 2026-06-10T08:28:14Z

Codecov Report

❌ Patch coverage is 0% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.74%. Comparing base (11a178a) to head (83ac794).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
...nsensus/statemachine/ConfigRegionStateMachine.java	0.00%	30 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17898      +/-   ##
============================================
+ Coverage     40.62%   40.74%   +0.11%     
- Complexity     2621     2623       +2     
============================================
  Files          5244     5244              
  Lines        362633   362646      +13     
  Branches      46684    46684              
============================================
+ Hits         147315   147754     +439     
+ Misses       215318   214892     -426

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Caideyipi

Correctness concern: this makes every parallel leader-service startup best-effort but still marks leaderServicesReady true afterward. Some entries in leaderServiceStartups() are not merely optional background services, e.g. PipeConfigNodeRuntime, CQScheduler, PipeMetaSync/PipeHeartbeat, and SubscriptionMetaSync. If any of those fails during startup, confirmLeader() can start returning SUCCESS while pipe/CQ/subscription functionality stays disabled until the next leadership transition, since this patch only logs the failure and does not retry it. That changes the system from warming up to serving with unavailable leader-only functionality. Can we either distinguish required vs optional startups, keep the leader warming/not ready for required failures, or add retry/health gating so a transient startup failure is recovered without waiting for a leadership change?

CRZbulabula · 2026-06-10T09:10:51Z

@Caideyipi Thanks, this is a fair concern and I want to be explicit about the trade-off I'm making.

You're right that this turns every parallel startup into best-effort and still marks leaderServicesReady true, so in the worst case confirmLeader() can return SUCCESS while one leader-only service (pipe/CQ/subscription) stays disabled until the next leadership transition. I considered the three options you listed (required-vs-optional gating, keep-warming on required failure, retry/health gating) but I think for this PR best-effort is actually the right line, for a few reasons:

These startups almost never fail. They are the same activation calls that already ran successfully on every previous leadership transition; a RuntimeException here is a genuinely exceptional event (a bug or hard environmental fault), not an expected transient condition. So we're optimizing the handling of a near-zero-probability path.
There is no good in-place recovery handle. If startCQScheduler() / startPipeMetaSync() / startSubscriptionMetaSync() throws, none of them expose an idempotent "try again" entry point that's safe to call mid-epoch — re-running them isn't a free retry, and a retry/health-gating loop would have to reason about partial state each one left behind. Building that recovery machinery is a much larger change than the bug this PR fixes, and it would add real complexity to the warm-up path for a case that essentially never fires.
leaderServicesReady blocking forever is strictly worse than best-effort serving. The pre-fix behavior is: one throw aborts becomeLeader() before markLeaderServicesReadyIfEpochCurrent(), so the node returns CONFIG_NODE_LEADER_WARMING_UP indefinitely — the entire ConfigNode is unserviceable, including all the leader functionality that started fine, with no log pointing at the cause. The keep-warming-on-required-failure option you mention reproduces exactly this hang for required services; it just makes it intentional. Isolating the blast radius so one failed service can't take down the whole leader — and surfacing it via an ERROR log naming the service — is already a large improvement over a silent total stall.

So the design intent of this PR is deliberately narrow: make the documented "always completes normally" invariant true, and make a failure observable instead of silent, rather than try to auto-heal a service that we have no clean way to restart anyway.

If we do want required-vs-optional distinction or retry/health gating later, I'd rather do it as a focused follow-up that also gives each service a real idempotent restart path (so retry is meaningful) and a health endpoint to gate on — otherwise we're gating readiness on a flag we can't recover. Happy to file that as a separate issue. Does scoping it that way sound reasonable to you?

CRZbulabula requested a review from Caideyipi June 10, 2026 07:38

Caideyipi reviewed Jun 10, 2026

View reviewed changes

CRZbulabula merged commit 86987e4 into master Jun 10, 2026
43 of 44 checks passed

CRZbulabula deleted the fix-confignode-leader-warmup-startup-failure branch June 10, 2026 09:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Catch per-startup failures during ConfigNode leader warm-up#17898

Catch per-startup failures during ConfigNode leader warm-up#17898
CRZbulabula merged 1 commit into
masterfrom
fix-confignode-leader-warmup-startup-failure

CRZbulabula commented Jun 10, 2026

Uh oh!

sonarqubecloud Bot commented Jun 10, 2026

Uh oh!

codecov Bot commented Jun 10, 2026

Uh oh!

Caideyipi left a comment

Uh oh!

CRZbulabula commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

CRZbulabula commented Jun 10, 2026

Background

Problem

Fix

Behavior change

Testing

Uh oh!

sonarqubecloud Bot commented Jun 10, 2026

Quality Gate passed

Uh oh!

codecov Bot commented Jun 10, 2026

Codecov Report

Uh oh!

Caideyipi left a comment

Choose a reason for hiding this comment

Uh oh!

CRZbulabula commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants