Catch per-startup failures during ConfigNode leader warm-up #17898
Within becomeLeader(), the parallel leader-service startups are joined with CompletableFuture.allOf(startups).join(). startInParallelIfEpochCurrent() ran startup.run() unguarded, so if any startup (CQ, pipe, subscription, metrics, clusterId, ...) threw a RuntimeException, its future completed exceptionally and join() rethrew it as a CompletionException out of becomeLeader().

That aborted the transition before startExecutor() and markLeaderServicesReadyIfEpochCurrent() ran, so leaderServicesReady never flipped to true and the node kept returning CONFIG_NODE_LEADER_WARMING_UP forever, even though startInParallelIfEpochCurrent()'s Javadoc claimed the future "always completes normally".

Make that claim true: each startup now catches and logs its own failure (tagged with the service name via a small LeaderServiceStartup holder) and never lets it escape. A single failing service stays unavailable until the next leadership transition, but the node still finishes warming up and begins serving, and the failure is observable through the error log.
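The failure mode described above can be reproduced outside IoTDB with a few lines of plain Java. The sketch below is only an illustration: JoinBarrierDemo, unguarded() and warmUp() are hypothetical names, and the boolean return value stands in for the steps that follow the barrier (startExecutor() and markLeaderServicesReadyIfEpochCurrent()). The real becomeLeader() lets the exception propagate instead of catching it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Hypothetical stand-in for the warm-up join barrier; not the IoTDB code. */
class JoinBarrierDemo {

  /** Runs the startup with no exception guard, mirroring the pre-fix behavior. */
  static CompletableFuture<Void> unguarded(Runnable startup) {
    return CompletableFuture.runAsync(startup);
  }

  /** Returns true only if execution gets past the join barrier. */
  static boolean warmUp(CompletableFuture<?>... startups) {
    try {
      CompletableFuture.allOf(startups).join();
    } catch (CompletionException e) {
      // join() wraps the startup's RuntimeException. In the scenario from the
      // PR this exception escapes, so the "ready" step below is never reached.
      return false;
    }
    return true; // stands in for "mark leader services ready"
  }

  public static void main(String[] args) {
    Runnable healthy = () -> {};
    Runnable broken = () -> {
      throw new IllegalStateException("startup failed");
    };
    System.out.println("all healthy: " + warmUp(unguarded(healthy)));
    System.out.println("one broken:  " + warmUp(unguarded(healthy), unguarded(broken)));
  }
}
```

One throwing startup is enough to fail the combined future, regardless of how many other startups succeed.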
Caideyipi left a comment:
Correctness concern: this makes every parallel leader-service startup best-effort but still marks leaderServicesReady true afterward. Some entries in leaderServiceStartups() are not merely optional background services, e.g. PipeConfigNodeRuntime, CQScheduler, PipeMetaSync/PipeHeartbeat, and SubscriptionMetaSync. If any of those fails during startup, confirmLeader() can start returning SUCCESS while pipe/CQ/subscription functionality stays disabled until the next leadership transition, since this patch only logs the failure and does not retry it. That changes the system from warming up to serving with unavailable leader-only functionality. Can we either distinguish required vs optional startups, keep the leader warming/not ready for required failures, or add retry/health gating so a transient startup failure is recovered without waiting for a leadership change?
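For concreteness, one possible shape of the "required vs optional" option raised in this comment is sketched below. It is not what the PR implements. The Startup descriptor, its required flag, and the choice of which services count as required are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical readiness gate; names and the required flag are assumptions. */
class RequiredStartupGate {

  static final class Startup {
    final String name;
    final boolean required;
    final Runnable action;

    Startup(String name, boolean required, Runnable action) {
      this.name = name;
      this.required = required;
      this.action = action;
    }
  }

  /** The future never fails; it yields false only when a required startup threw. */
  static CompletableFuture<Boolean> start(Startup s) {
    return CompletableFuture.supplyAsync(
        () -> {
          try {
            s.action.run();
            return true;
          } catch (RuntimeException e) {
            System.err.println("Leader service " + s.name + " failed to start: " + e);
            return !s.required;
          }
        });
  }

  /** Returns true (leader may report ready) only if no required startup failed. */
  static boolean leaderReady(List<Startup> startups) {
    List<CompletableFuture<Boolean>> futures = new ArrayList<>();
    for (Startup s : startups) {
      futures.add(start(s));
    }
    boolean ready = true;
    for (CompletableFuture<Boolean> f : futures) {
      ready = f.join() && ready; // join() does not throw: failures were caught above
    }
    return ready;
  }

  public static void main(String[] args) {
    Runnable broken = () -> {
      throw new IllegalStateException("boom");
    };
    // Treating CQScheduler as required and metrics as optional is an assumption.
    System.out.println(
        leaderReady(
            Arrays.asList(
                new Startup("metrics", false, broken),
                new Startup("CQScheduler", true, () -> {}))));
  }
}
```

A gate like this keeps the node in the warming-up state after a required failure, so without a retry path it would reintroduce the stuck-leader symptom for those services.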
@Caideyipi Thanks, this is a fair concern and I want to be explicit about the trade-off I'm making. You're right that this turns every parallel startup into best-effort and still marks
So the design intent of this PR is deliberately narrow: make the documented "always completes normally" invariant true, and make a failure observable instead of silent, rather than try to auto-heal a service that we have no clean way to restart anyway. If we do want required-vs-optional distinction or retry/health gating later, I'd rather do it as a focused follow-up that also gives each service a real idempotent restart path (so retry is meaningful) and a health endpoint to gate on; otherwise we're gating readiness on a flag we can't recover. Happy to file that as a separate issue. Does scoping it that way sound reasonable to you?



Background
Follow-up to #17821 ("Improve ConfigNode leader warm-up before serving"), addressing a review comment from @Caideyipi.
Problem
In ConfigRegionStateMachine.becomeLeader(), the parallel leader-service startups are joined with a barrier, CompletableFuture.allOf(startups).join().

Each startup is wrapped by startInParallelIfEpochCurrent(), whose Javadoc claims the returned future "always completes normally so CompletableFuture#allOf acts as a clean join barrier." But the body ran startup.run() without any exception guard.

If any of the parallel startups (CQ scheduler, pipe meta-sync/heartbeat, subscription meta-sync, metrics, cluster-id check, etc.) throws a RuntimeException, the corresponding future completes exceptionally. allOf(startups).join() then rethrows it as a CompletionException, which propagates out of becomeLeader() before ProcedureManager.startExecutor() and markLeaderServicesReadyIfEpochCurrent() run.

As a result, leaderServicesReady is never set to true, and the ConfigNode keeps returning CONFIG_NODE_LEADER_WARMING_UP to clients indefinitely: the leader silently never becomes serviceable. The documented "always completes normally" invariant was simply false.
Fix
Make the invariant true by catching and logging each startup's failure inside startInParallelIfEpochCurrent() so it can never escape into the join barrier:
- Each startup runs inside a try/catch; a failure is logged at ERROR and swallowed, so a single misbehaving service can no longer abort the whole transition.
- Each startup is tagged with its service name via a small LeaderServiceStartup holder, so the log identifies exactly which service failed.
- The startInParallelIfEpochCurrent() Javadoc is updated to describe the actual (now-correct) behavior.
Behavior change
When a startup throws:
- Before: becomeLeader() aborts; node stuck at CONFIG_NODE_LEADER_WARMING_UP forever.
- After: the failure is logged at ERROR and the node still finishes warming up and begins serving; the failing service stays unavailable until the next leadership transition.
Testing
- mvn spotless:apply -pl iotdb-core/confignode: clean.
- mvn compile -pl iotdb-core/confignode: compiles successfully.

The change is a self-contained defensive guard around existing private startup logic; no public surface or existing tests are affected.
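The guard described in the Fix section can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the PR's actual diff: GuardedStartupSketch, NamedStartup (modeled loosely on the LeaderServiceStartup holder), startGuarded() and the in-memory failed list are hypothetical, and System.err stands in for the ERROR-level logger.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch of a per-startup guard; not the IoTDB implementation. */
class GuardedStartupSketch {

  /** Pairs a startup with a service name so a failure can be attributed. */
  static final class NamedStartup {
    final String name;
    final Runnable action;

    NamedStartup(String name, Runnable action) {
      this.name = name;
      this.action = action;
    }
  }

  /** Records failed service names; used here only to make the outcome inspectable. */
  static final List<String> failed = new CopyOnWriteArrayList<>();

  /** Exceptions thrown by the startup are caught, so the future completes normally. */
  static CompletableFuture<Void> startGuarded(NamedStartup s) {
    return CompletableFuture.runAsync(
        () -> {
          try {
            s.action.run();
          } catch (Exception e) {
            failed.add(s.name);
            System.err.println("Failed to start leader service " + s.name + ": " + e);
          }
        });
  }

  /** Joins all startups, then reaches the "ready" step even if some failed. */
  static boolean warmUp(List<NamedStartup> startups) {
    CompletableFuture<?>[] futures =
        startups.stream().map(GuardedStartupSketch::startGuarded).toArray(CompletableFuture[]::new);
    CompletableFuture.allOf(futures).join();
    return true; // stands in for markLeaderServicesReadyIfEpochCurrent()
  }

  public static void main(String[] args) {
    boolean ready =
        warmUp(
            Arrays.asList(
                new NamedStartup("metrics", () -> {}),
                new NamedStartup(
                    "cq",
                    () -> {
                      throw new IllegalStateException("boom");
                    })));
    System.out.println("ready=" + ready + ", failed=" + failed);
  }
}
```

Swallowing the exception inside the task is what lets allOf() act as a plain join barrier; the trade-off, as discussed in the review thread, is that a failed service stays down until something restarts it.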