Improve ConfigNode leader warm-up before serving #17821

Open
CRZbulabula wants to merge 2 commits into master from improve-confignode-leader-confirm

@CRZbulabula
Contributor

Summary

  • Gate ConfigNode leader confirmation on LoadCache warm-up after consensus leader-ready.
  • Track first heartbeat coverage for Nodes, Regions, RegionGroups, and ConsensusGroups before serving requests.
  • Return CONFIG_NODE_LEADER_WARMING_UP during warm-up so DataNodes wait and retry the current ConfigNode instead of treating it as redirection.
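
For readers following along, the gating summarized above could look roughly like the sketch below. This is not the code in this PR: `WarmUpGate`, its method names, the string member IDs, and the `SUCCESS` placeholder are hypothetical, and the tolerance handling is an assumption. The idea is that the leader keeps answering CONFIG_NODE_LEADER_WARMING_UP until every expected member has sent a first heartbeat or a tolerance window has elapsed.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: tracks which expected members still owe a first heartbeat. */
final class WarmUpGate {
  // IDs of members (nodes, regions, ...) that have not reported yet.
  private final Set<String> pending = ConcurrentHashMap.newKeySet();
  // Point in time (System.nanoTime scale) after which the gate opens regardless.
  private final long deadlineNanos;

  WarmUpGate(Iterable<String> expectedIds, long toleranceMillis, long nowNanos) {
    for (String id : expectedIds) {
      pending.add(id);
    }
    this.deadlineNanos = nowNanos + toleranceMillis * 1_000_000L;
  }

  /** Records the first heartbeat of a member; repeated calls are harmless. */
  void onHeartbeat(String id) {
    pending.remove(id);
  }

  /** True once every member has reported or the tolerance window has elapsed. */
  boolean isWarmedUp(long nowNanos) {
    return pending.isEmpty() || nowNanos - deadlineNanos >= 0;
  }

  /** Status name a request handler would return; "SUCCESS" is a placeholder. */
  String statusFor(long nowNanos) {
    return isWarmedUp(nowNanos) ? "SUCCESS" : "CONFIG_NODE_LEADER_WARMING_UP";
  }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic in tests; a real implementation would more likely read `System.nanoTime()` internally.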

Tests

  • mvn spotless:apply -pl iotdb-core/confignode,iotdb-core/datanode,iotdb-client/service-rpc
  • mvn compile -pl iotdb-client/service-rpc,iotdb-core/confignode
  • mvn test -pl iotdb-core/confignode -Dtest=LoadManagerTest
  • mvn compile -pl iotdb-client/service-rpc,iotdb-core/datanode (fails in unrelated existing sources: ArrayDeviceTimeIndex.java and TableDeviceSchemaCache.java still pass IDeviceID to PartialPath.matchFullPath)

@codecov

codecov Bot commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 30.60109% with 127 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.99%. Comparing base (99f0af1) to head (a57680d).
⚠️ Report is 12 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...iotdb/confignode/manager/load/cache/LoadCache.java | 38.33% | 37 Missing ⚠️ |
| ...c/handlers/heartbeat/DataNodeHeartbeatHandler.java | 0.00% | 30 Missing ⚠️ |
| ...confignode/manager/consensus/ConsensusManager.java | 0.00% | 24 Missing ⚠️ |
| ...nsensus/statemachine/ConfigRegionStateMachine.java | 6.66% | 14 Missing ⚠️ |
| ...che/iotdb/db/protocol/client/ConfigNodeClient.java | 0.00% | 11 Missing ⚠️ |
| ...che/iotdb/confignode/manager/load/LoadManager.java | 74.35% | 10 Missing ⚠️ |
| ...handlers/heartbeat/ConfigNodeHeartbeatHandler.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17821      +/-   ##
============================================
+ Coverage     40.90%   40.99%   +0.09%     
- Complexity     2610     2611       +1     
============================================
  Files          5186     5188       +2     
  Lines        351388   352111     +723     
  Branches      44991    45086      +95     
============================================
+ Hits         143722   144345     +623     
- Misses       207666   207766     +100     

@sonarqubecloud

sonarqubecloud Bot commented Jun 3, 2026

Collaborator

@Caideyipi Caideyipi left a comment

I reviewed the warm-up changes on a57680d2542. I think there are a few issues that should be fixed before merge:

  1. AINode treats the new warm-up status as a hard failure. ConfigManager.registerAINode() now returns CONFIG_NODE_LEADER_WARMING_UP while confirmLeader() is warming up, but the Python AINode client only treats REDIRECTION_RECOMMEND as retryable in _update_config_node_leader(). node_register() / node_restart() then call verify_success() and raise on status 1014, so an AINode can fail startup if it hits the leader during warm-up. Please add the new code to the AINode constants and retry handling paths.

  2. Non-seed ConfigNode registration has the same gap. registerConfigNode() can now return CONFIG_NODE_LEADER_WARMING_UP, but ConfigNode.sendRegisterConfigNodeRequest() only retries success/redirection/internal-retry statuses and throws StartupException for anything else. A ConfigNode joining during leader warm-up can fail immediately instead of waiting and retrying.

  3. The async leader-service startup has a stepdown race. notifyLeaderReady() now submits startLeaderServicesAfterLoadReady() asynchronously. That task checks isLeaderReady() only once before starting leader-only services and setting leaderServicesReady=true. If notifyNotLeader() runs after that check but before/during service startup, the old task can re-enable services after cleanup. Please guard this with a leader epoch/cancellation token, and re-check before setting leaderServicesReady.

  4. The DataNode register retry budget is too tight for the 30s warm-up tolerance. On CONFIG_NODE_LEADER_WARMING_UP, updateConfigNodeLeader() sleeps 2s and returns a retryable result, while registerDataNode() has 15 attempts. The final request can therefore be sent before the 30s tolerance expires; the client then sleeps and exits without making one post-tolerance attempt. A deadline-based retry or a larger retry budget would avoid this edge case.
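
To make items 3 and 4 concrete, here is a rough sketch of what I have in mind. All names (`LeaderEpochGuard`, `DeadlineRetry`, and their methods) are hypothetical and not taken from the PR; treat it as one possible shape under those assumptions, not a drop-in patch. The guard hands the async start-up task a token that a step-down invalidates, and the retry helper keeps trying until a deadline and always makes one last attempt at or after it. For reference, if each request returned instantly, 15 attempts spaced 2s apart would put the last one about 28s after the first.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch for item 3: an epoch token that invalidates stale start-up tasks. */
final class LeaderEpochGuard {
  private long epoch;
  private volatile boolean leaderServicesReady;

  /** Called on leader-ready; returns the token the async start-up task must carry. */
  synchronized long onLeaderReady() {
    return ++epoch;
  }

  /** Called on step-down; bumps the epoch so in-flight start-up tasks become stale. */
  synchronized void onNotLeader() {
    epoch++;
    leaderServicesReady = false;
  }

  /** Publishes readiness only if no step-down happened since the token was issued. */
  synchronized boolean markReadyIfCurrent(long token) {
    if (epoch != token) {
      return false; // stale task: the caller should undo whatever it started
    }
    leaderServicesReady = true;
    return true;
  }

  boolean isLeaderServicesReady() {
    return leaderServicesReady;
  }
}

/** Hypothetical sketch for item 4: retry against a deadline instead of a fixed count. */
final class DeadlineRetry {
  private DeadlineRetry() {}

  /** Returns true on the first successful attempt, false once an attempt at or after the deadline fails. */
  static boolean retryUntilDeadline(BooleanSupplier attempt, long budgetMillis, long sleepMillis) {
    final long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
    while (true) {
      if (attempt.getAsBoolean()) {
        return true;
      }
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false; // the attempt that just failed ran at or after the deadline
      }
      try {
        // Never sleep past the deadline, so the next attempt lands right after it.
        Thread.sleep(Math.min(sleepMillis, Math.max(1L, remainingNanos / 1_000_000L)));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

In this shape, the start-up task would call `markReadyIfCurrent(token)` as its last step and stop the services it started when that returns false, while the register path would pass a budget slightly above the warm-up tolerance.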

2 participants