Improve ConfigNode leader warm-up before serving #17821
Codecov Report
@@             Coverage Diff              @@
##             master   #17821      +/-   ##
============================================
+ Coverage     40.90%   40.99%   +0.09%
- Complexity     2610     2611       +1
============================================
  Files          5186     5188       +2
  Lines        351388   352111     +723
  Branches      44991    45086      +95
============================================
+ Hits         143722   144345     +623
- Misses       207666   207766     +100
Caideyipi left a comment
I reviewed the warm-up changes on a57680d2542. I think there are a few issues that should be fixed before merge:
- AINode treats the new warm-up status as a hard failure.
  `ConfigManager.registerAINode()` now returns `CONFIG_NODE_LEADER_WARMING_UP` while `confirmLeader()` is warming up, but the Python AINode client only treats `REDIRECTION_RECOMMEND` as retryable in `_update_config_node_leader()`. `node_register()` / `node_restart()` then call `verify_success()` and raise on status `1014`, so an AINode can fail startup if it hits the leader during warm-up. Please add the new code to the AINode constants and retry handling paths.
- Non-seed ConfigNode registration has the same gap.
  `registerConfigNode()` can now return `CONFIG_NODE_LEADER_WARMING_UP`, but `ConfigNode.sendRegisterConfigNodeRequest()` only retries success/redirection/internal-retry statuses and throws `StartupException` for anything else. A ConfigNode joining during leader warm-up can fail immediately instead of waiting and retrying.
- The async leader-service startup has a stepdown race.
  `notifyLeaderReady()` now submits `startLeaderServicesAfterLoadReady()` asynchronously. That task checks `isLeaderReady()` only once before starting leader-only services and setting `leaderServicesReady=true`. If `notifyNotLeader()` runs after that check but before/during service startup, the old task can re-enable services after cleanup. Please guard this with a leader epoch/cancellation token, and re-check before setting `leaderServicesReady`.
- The DataNode register retry budget is too tight for the 30s warm-up tolerance.
  On `CONFIG_NODE_LEADER_WARMING_UP`, `updateConfigNodeLeader()` sleeps 2s and returns retryable, while `registerDataNode()` has 15 attempts. The final request can still happen before the 30s tolerance expires, then sleep and exit without one post-tolerance attempt. A deadline-based retry or a larger retry budget would avoid this edge case.
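To make the epoch/cancellation-token suggestion for the stepdown race concrete, here is a minimal sketch. It is written in Python for brevity and is not the ConfigNode code, which is Java. Only the names `notify_leader_ready`, `notify_not_leader`, and `leader_services_ready` mirror the review. The class, the epoch counter, the lock, and the `start_services` callback that returns a stoppable handle are all hypothetical. The idea is that every leadership change bumps an epoch, the async startup task carries the epoch it was created for, and readiness is only published if that epoch is still current after the slow startup finishes.

```python
import threading


class LeaderServiceGuard:
    """Hypothetical guard; start_services() must return an object with stop()."""

    def __init__(self, start_services):
        self._lock = threading.Lock()
        self._epoch = 0        # bumped on every leadership change
        self._active = None    # handle of the services published for this term
        self._start_services = start_services

    @property
    def leader_services_ready(self):
        with self._lock:
            return self._active is not None

    def notify_leader_ready(self):
        """Open a new term and return the token the async task must carry."""
        with self._lock:
            self._epoch += 1
            return self._epoch

    def notify_not_leader(self):
        """Invalidate in-flight startup tasks, then stop published services."""
        with self._lock:
            self._epoch += 1
            handle, self._active = self._active, None
        if handle is not None:
            handle.stop()

    def start_leader_services(self, epoch):
        """Async task body: check the token before and after the slow startup."""
        with self._lock:
            if epoch != self._epoch:
                return False   # task belongs to an older term, do nothing
        handle = self._start_services()  # slow; a stepdown may happen here
        with self._lock:
            if epoch == self._epoch:
                self._active = handle    # still leader of the same term
                return True
        # Stepped down during startup: undo our own work, never publish it.
        handle.stop()
        return False
```

Because a stale task only ever stops the handle it created itself, it cannot tear down services started by a later term. Whether the real leader-only services can be modeled as per-term handles is an assumption of this sketch.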
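The timing behind the DataNode retry-budget point can be checked with a small simulation. This is a sketch under stated assumptions, not the DataNode code. The 15 attempts, the 2s sleep, and the 30s tolerance come from the review. The fake clock, the `make_leader` and `register_*` helpers, the status strings, and the 40s deadline are hypothetical, and RPC latency is ignored. With a fixed budget the requests go out at t = 0, 2, ..., 28s, so the last one lands before the 30s tolerance ends. A deadline-based loop keeps going and gets its first post-tolerance attempt at t = 30s.

```python
WARMING_UP = "CONFIG_NODE_LEADER_WARMING_UP"
SUCCESS = "SUCCESS"


class FakeClock:
    """Deterministic clock so the timing can be checked without real sleeps."""

    def __init__(self):
        self.now = 0.0

    def sleep(self, seconds):
        self.now += seconds


def make_leader(clock, warm_up_seconds=30.0):
    """Hypothetical leader: reports warming-up until warm_up_seconds elapse."""

    def send_register():
        return SUCCESS if clock.now >= warm_up_seconds else WARMING_UP

    return send_register


def register_fixed_attempts(send_register, clock, attempts=15, backoff=2.0):
    """Fixed budget: each warming-up reply costs one attempt plus one sleep."""
    for _ in range(attempts):
        if send_register() == SUCCESS:
            return True
        clock.sleep(backoff)
    return False


def register_with_deadline(send_register, clock, deadline=40.0, backoff=2.0):
    """Deadline budget: retry until the (assumed) deadline has passed."""
    while True:
        if send_register() == SUCCESS:
            return True
        if clock.now >= deadline:
            return False
        clock.sleep(backoff)


if __name__ == "__main__":
    clock = FakeClock()
    print(register_fixed_attempts(make_leader(clock), clock))  # False
    clock = FakeClock()
    print(register_with_deadline(make_leader(clock), clock))   # True
```

Any deadline comfortably above the warm-up tolerance would work here; raising the attempt count to 16 or more would also cover this case, but it ties correctness to the sleep interval staying at 2s.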



Summary
Tests