Make Tier-A metadata operations HA when a DataNode is down (metadata lease + self-fencing) #17831
JackieTien97 wants to merge 17 commits into
Table-model DDL and ~30 other ConfigNode->DataNode metadata broadcasts fail when any single DataNode is unreachable, because the ConfigNode requires every registered DataNode to acknowledge a cache invalidation before committing (to stop a partitioned DataNode from serving stale CN-pushed caches and generating dirty data). This adds the test-covered foundation for a metadata-lease/fencing mechanism that lets such operations tolerate DataNode failures without sacrificing correctness.

DataNode side:
- MetadataLeaseManager tracks the lease via the ConfigNode heartbeat (monotonic clock). isFenced() returns true when no heartbeat has arrived within metadata_lease_fence_ms (default 20s), and recovery listeners fire when a heartbeat arrives after a fence.
- DataNodeTableCache fails closed (retryable error) in getTableInWrite/getTable while fenced, and registers invalidateAll() as a recovery listener so a recovered DataNode re-fetches fresh schema.
- getDataNodeHeartBeat records the heartbeat; a metadata_lease_heartbeat_age_ms gauge is exposed.

ConfigNode side:
- DataNodeContactTracker records, per DataNode, the time the ConfigNode last received a successful heartbeat response. The timestamp is taken on the ConfigNode clock, only on success (in DataNodeHeartbeatHandler.onComplete), and is never advanced by onError. This is the sound signal for deciding whether an unreachable DataNode has self-fenced.
- MetadataBroadcastVerdict is the pure decision logic (capability checked first; FENCED_SAFE only via hbAge>=T_proceed or retired-from-routing; no additive fast-path).

No ConfigNode procedure control flow is changed yet (the verdict is not wired into procedures); the DataNode fail-closed behavior is active only while a DataNode is actually fenced. Config: metadata_lease_fence_ms in CommonConfig. 20 new unit tests.
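The DataNode-side lease described above can be pictured with a minimal sketch. This is not the PR's MetadataLeaseManager: the class name LeaseSketch, the >= comparison at the fence boundary, and the choice to run recovery listeners before renewing the lease are all assumptions made for illustration, and the sketch assumes heartbeats are delivered by a single thread.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Illustrative lease tracker (hypothetical; not the PR's MetadataLeaseManager). */
final class LeaseSketch {
  private final long fenceNanos;
  private final LongSupplier nanoClock; // monotonic clock, injectable for tests
  private final AtomicLong lastHeartbeatNanos;
  private final List<Runnable> recoveryListeners = new CopyOnWriteArrayList<>();

  LeaseSketch(long fenceMs, LongSupplier nanoClock) {
    this.fenceNanos = fenceMs * 1_000_000L;
    this.nanoClock = nanoClock;
    this.lastHeartbeatNanos = new AtomicLong(nanoClock.getAsLong());
  }

  void addRecoveryListener(Runnable listener) {
    recoveryListeners.add(listener);
  }

  /** Lazy check on the local clock only; no timer thread and no heartbeat is needed. */
  boolean isFenced() {
    return nanoClock.getAsLong() - lastHeartbeatNanos.get() >= fenceNanos;
  }

  /** Called when a ConfigNode heartbeat arrives (assumed single-threaded here). */
  void onHeartbeat() {
    if (isFenced()) {
      // Assumption: drop stale caches first, so nothing stale is served
      // in the window between un-fencing and invalidation.
      recoveryListeners.forEach(Runnable::run);
    }
    lastHeartbeatNanos.set(nanoClock.getAsLong());
  }
}
```

In production the clock would be System::nanoTime; a test can pass a fake supplier and advance it by hand.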
Add an optional supportsMetadataLeaseFencing flag to TDataNodeHeartbeatResp. The DataNode advertises it (true); the ConfigNode records it per-DataNode in DataNodeContactTracker. The verdict checks capability before any liveness/timing test, so a not-yet-upgraded DataNode that omits the flag is recorded as not-capable and never treated as fenced (strict, rolling-upgrade safe). DataNodeContactTracker gains recordCapability/supportsFencing (default false). 3 new unit tests.
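A minimal sketch of the capability-first ordering, under stated assumptions: the names NodeVerdictSketch, NodeVerdict and classify are hypothetical, and treating an explicit ack as sufficient regardless of capability is an assumption of this sketch rather than something the commit message states.

```java
/** Illustrative per-DataNode classification (hypothetical names, not the PR's API). */
final class NodeVerdictSketch {
  enum NodeVerdict { ACKED, FENCED_SAFE, UNSAFE }

  static NodeVerdict classify(
      boolean acked,
      boolean supportsFencing,
      boolean retiredFromRouting,
      long hbAgeMs,
      long tProceedMs) {
    if (acked) {
      return NodeVerdict.ACKED; // assumption: an ack is enough on its own
    }
    // Capability is checked before any liveness/timing test: a DataNode that
    // never advertised the flag is never treated as fenced, however silent.
    if (!supportsFencing) {
      return NodeVerdict.UNSAFE;
    }
    if (retiredFromRouting || hbAgeMs >= tProceedMs) {
      return NodeVerdict.FENCED_SAFE;
    }
    return NodeVerdict.UNSAFE; // recently contacted, may still be serving
  }
}
```

With this ordering, a not-yet-upgraded DataNode that has been silent for hours is still UNSAFE, which matches the strict rolling-upgrade behavior described above.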
The DataNode permission cache (ClusterAuthorityFetcher) is invalidated by a ConfigNode broadcast after GRANT/REVOKE. A DataNode partitioned from the ConfigNode can miss that broadcast and keep authorizing a privilege that was already revoked. The pre-existing refreshToken() timeout did not close this window: it only marks the cache stale when a heartbeat finally arrives after a long gap, so during an ongoing partition (no heartbeat at all) refreshToken() is never called and the stale cache keeps being served until recovery. checkCacheAvailable() now also drops the cache when MetadataLeaseManager reports the lease fenced. isFenced() is evaluated on the DataNode's own clock and needs no heartbeat to fire, so an ongoing partition forces a re-fetch from the ConfigNode, which fails closed while partitioned (deny, not allow).
The tree-model schema cache (TreeDeviceSchemaCacheManager) is read-through: on a miss the caller re-fetches from the quorum-backed SchemaRegion, and a ConfigNode broadcast only invalidates entries after a DELETE TIMESERIES / datatype change. A DataNode partitioned from the ConfigNode can miss that broadcast and keep a stale cached entry, then validate a write or resolve a query against schema that no longer exists. All six cache reads funnel through getDeviceSchema(String[]); route them through getDeviceSchemaOrMissWhenFenced, which returns null (a miss) while the lease is fenced so the caller re-fetches from the authoritative SchemaRegion. This is more available than hard-failing (the op still succeeds whenever the SchemaRegion quorum is reachable, and only fails closed when it is not) and keeps the gate tree-scoped, since getDeviceSchema is also used by table-model fetching. On lease recovery cleanUp() drops entries cached before the partition that were never re-read while fenced.
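The miss-when-fenced idea can be sketched as a generic read-through cache. FencedReadThroughSketch and its method names are hypothetical stand-ins; the real getDeviceSchemaOrMissWhenFenced lives in TreeDeviceSchemaCacheManager and its signature is not shown in this PR text.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;
import java.util.function.Function;

/** Illustrative read-through cache that reports a miss while the lease is fenced. */
final class FencedReadThroughSketch<K, V> {
  private final Map<K, V> cache = new ConcurrentHashMap<>();
  private final BooleanSupplier fenced;
  private final Function<K, V> authoritativeFetch; // e.g. a quorum-backed lookup

  FencedReadThroughSketch(BooleanSupplier fenced, Function<K, V> authoritativeFetch) {
    this.fenced = fenced;
    this.authoritativeFetch = authoritativeFetch;
  }

  /** Returns null (a miss) while fenced, so the caller falls through to a re-fetch. */
  V getOrMissWhenFenced(K key) {
    return fenced.getAsBoolean() ? null : cache.get(key);
  }

  V get(K key) {
    V cached = getOrMissWhenFenced(key);
    if (cached != null) {
      return cached;
    }
    V fresh = authoritativeFetch.apply(key); // may throw if the quorum is unreachable
    if (fresh == null) {
      cache.remove(key);
    } else {
      cache.put(key, fresh);
    }
    return fresh;
  }

  /** Recovery hook: drop entries cached before the partition. */
  void cleanUp() {
    cache.clear();
  }
}
```

The operation only fails when the authoritative fetch itself fails, which is the availability argument made in the commit message.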
Compaction physically deletes data older than its TTL window, reading the TTL from DataNodeTTLCache (pushed by the ConfigNode). A DataNode partitioned from the ConfigNode can miss a TTL update; a too-short stale TTL would make compaction permanently delete data that a missed TTL-increase says to keep - an irreversible loss. MultiTsFileDeviceIterator.nextDevice() now uses an infinite TTL (Long.MAX_VALUE -> timeLowerBound Long.MIN_VALUE -> no TTL-based deletion) when the lease is fenced, scoped to the compaction path only (query/write TTL behavior is unchanged). The check runs before the cache reads, so the table-model path also avoids the now fail-closed DataNodeTableCache. Real TTL deletion resumes once the lease recovers and the cache resyncs.
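The TTL override can be written out as a small worked example. The class and method names below are hypothetical; only the Long.MAX_VALUE -> Long.MIN_VALUE mapping comes from the commit message, and the "timestamp below the lower bound is deleted" rule is an assumption about how the bound is applied.

```java
/** Illustrative only: how a fenced lease can disable TTL-based deletion in compaction. */
final class CompactionTtlSketch {
  /** While fenced, ignore the possibly stale cached TTL and treat it as infinite. */
  static long effectiveTtlMs(boolean leaseFenced, long cachedTtlMs) {
    return leaseFenced ? Long.MAX_VALUE : cachedTtlMs;
  }

  /** Oldest timestamp still kept; an infinite TTL maps to Long.MIN_VALUE. */
  static long timeLowerBound(long nowMs, long ttlMs) {
    return ttlMs == Long.MAX_VALUE ? Long.MIN_VALUE : nowMs - ttlMs;
  }

  /** Assumed deletion rule: points strictly older than the lower bound are dropped. */
  static boolean deletedByTtl(long timestampMs, long timeLowerBound) {
    return timestampMs < timeLowerBound;
  }
}
```

With a lower bound of Long.MIN_VALUE no timestamp can compare below it, so a fenced DataNode never deletes by TTL.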
The ConfigNode side of the metadata-lease HA change. ClusterCachePropagator broadcasts a cache-invalidation to all registered DataNodes and turns the per-DataNode responses into a PROCEED/WAIT/FAIL verdict via the already-built MetadataBroadcastVerdict, instead of the legacy 'any unreachable DataNode fails the operation'. For each registered DataNode it builds a DataNodeState from: acked (SUCCESS response), supportsFencing and hbAge (from DataNodeContactTracker). A DataNode that is provably self-fenced (capable + silent past T_proceed) is safe to proceed past; a non-SUCCESS or recently-contacted unacked DataNode is unsafe. propagate() retries on WAIT until a DataNode acks/crosses T_proceed or the wait budget (T_proceed + buffer) runs out. The caller supplies a CacheBroadcast closure wrapping its specific RPC, so the propagator is agnostic to the request type. Clock and sleep are injectable; the verdict construction and the retry loop are covered by unit tests.
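The retry-on-WAIT loop can be sketched as follows. This is a simplified stand-in, not ClusterCachePropagator itself: the round supplier plays the role of the CacheBroadcast closure plus verdict construction, and the retry interval and the rule that an exhausted budget yields FAIL are assumptions of the sketch.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative retry loop with injectable clock and sleep (hypothetical names). */
final class PropagateSketch {
  enum Verdict { PROCEED, WAIT, FAIL }

  /**
   * Re-runs one broadcast round while it returns WAIT, until the wait budget is spent.
   * Each round is expected to broadcast and fold the responses into a single Verdict.
   */
  static Verdict propagate(
      Supplier<Verdict> round,
      long budgetMs,
      long retryIntervalMs,
      LongSupplier clockMs,
      LongConsumer sleepMs) {
    long deadline = clockMs.getAsLong() + budgetMs;
    while (true) {
      Verdict verdict = round.get();
      if (verdict != Verdict.WAIT) {
        return verdict;
      }
      long remaining = deadline - clockMs.getAsLong();
      if (remaining <= 0) {
        return Verdict.FAIL; // budget exhausted while a DataNode is still unsafe
      }
      sleepMs.accept(Math.min(retryIntervalMs, remaining));
    }
  }
}
```

Because the clock and the sleep are parameters, a unit test can drive the loop with a fake clock that the sleep function advances, so no real time passes.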
The metadata-broadcast verdict reads each DataNode's last-successful-response time from DataNodeContactTracker. Two lifecycle events must keep that signal sound:
- On (re)acquiring ConfigRegion leadership (notifyLeaderReady), reset every registered DataNode's contact time to now. Otherwise a timestamp left from a previous leadership term, during which another ConfigNode was contacting the DataNodes, could make the verdict wrongly judge a live DataNode as fenced.
- On permanent DataNode removal (removeDataNodePersistence), drop its tracker entry so stale contact/capability state is not retained and a future DataNode reusing the id cannot inherit it.
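A rough sketch of the two lifecycle hooks, assuming a simple map-based tracker. ContactTrackerSketch and its methods are hypothetical; in particular, reporting an age of 0 for an unknown DataNode (so it can never look fenced) is a conservative assumption of this sketch, not a documented behavior of DataNodeContactTracker.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Illustrative contact tracker (hypothetical; not the PR's DataNodeContactTracker). */
final class ContactTrackerSketch {
  private final Map<Integer, Long> lastSuccessMs = new ConcurrentHashMap<>();
  private final LongSupplier clockMs; // ConfigNode-local clock

  ContactTrackerSketch(LongSupplier clockMs) {
    this.clockMs = clockMs;
  }

  /** Stamped only on a successful heartbeat response, never on error. */
  void recordSuccess(int dataNodeId) {
    lastSuccessMs.put(dataNodeId, clockMs.getAsLong());
  }

  /** On (re)acquiring leadership: forget timestamps from a previous term. */
  void resetAllToNow(Collection<Integer> registeredDataNodeIds) {
    long now = clockMs.getAsLong();
    registeredDataNodeIds.forEach(id -> lastSuccessMs.put(id, now));
  }

  /** On permanent removal: a reused id must not inherit old state. */
  void remove(int dataNodeId) {
    lastSuccessMs.remove(dataNodeId);
  }

  /** Assumption: an unknown DataNode reports age 0, so it is never judged fenced. */
  long hbAgeMs(int dataNodeId) {
    Long last = lastSuccessMs.get(dataNodeId);
    return last == null ? 0L : clockMs.getAsLong() - last;
  }
}
```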
First Tier-A procedure migrated off 'any unreachable DataNode fails the op'. CreateTableProcedure's PRE_RELEASE step now broadcasts via ClusterCachePropagator: it proceeds once every unacked DataNode is provably self-fenced (which, per Phase 1, fails closed on its stale table cache and resyncs on lease recovery, so it cannot serve dirty schema), and only fails when an unacked DataNode is not provably fenced. SchemaUtils gains preUpdateTableReq() (request builder) and broadcastTableUpdate() (returns the full per-nodeId response map the verdict needs); the legacy preReleaseTable() is now a thin wrapper returning only failures, so its other callers are unchanged. The happy path (all DataNodes ack -> PROCEED) is behaviorally identical; CreateTableProcedureTest still passes. COMMIT_RELEASE stays best-effort (warn-only) as before. This is the template for the remaining Tier-A procedures.
End-to-end verification of the metadata-lease/fence HA change. With a short metadata_lease_fence_ms, the test starts 1 ConfigNode + 3 DataNodes, creates a database, stops one DataNode, and asserts CREATE TABLE still succeeds - whereas before the change a table DDL hard-failed whenever any DataNode could not acknowledge the cache-invalidation broadcast. Adds setMetadataLeaseFenceMs to the IT CommonConfig framework (interface + MppCommonConfig / MppSharedCommonConfig / RemoteCommonConfig) so the fence threshold can be shortened, keeping the proceed-past-fenced wait fast.
AbstractAlterOrDropTableProcedure is the base for all 8 table-mutation procedures (AddTableColumn, DropTableColumn, DropTable, RenameTable, RenameTableColumn, SetTableProperties, AlterTableColumnDataType, DeleteDevices), so wiring it migrates them all at once. Both the forward pre-release and the rollback pre-release now broadcast via ClusterCachePropagator and proceed once every unreachable DataNode is provably self-fenced, instead of hard-failing on the first unreachable DataNode. The rollback path is included so a down DataNode cannot block rollback either. commitRelease stays best-effort (warn-only) as before, since the change is already authoritative once committed. SchemaUtils gains rollbackUpdateTableReq() to mirror preUpdateTableReq(); both legacy preReleaseTable()/rollbackPreRelease() remain thin failure-returning wrappers. All 7 alter/drop procedure serialization tests still pass.
DeleteTimeSeriesProcedure.invalidateCache is the shared static helper that broadcasts INVALIDATE_MATCHED_SCHEMA_CACHE; AlterEncodingCompressorProcedure reuses it, so wiring it covers both. It now proceeds once every unreachable DataNode is provably self-fenced instead of hard-failing on the first one. It runs before the physical delete in the state machine, so the 'delete only after PROCEED' ordering holds with no reordering. Because the propagator may re-broadcast while waiting for unacked DataNodes, the broadcast closure builds a fresh request with patternTreeBytes.duplicate() on every attempt, so a consumed buffer can never be re-sent as an empty (silently-successful) invalidation. DeleteTimeSeries and AlterEncodingCompressor serialization tests still pass.
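The consumed-buffer hazard is easy to reproduce with plain java.nio. In this sketch, send() is a hypothetical stand-in for whatever serializes the request and advances the buffer position; the pattern string is an arbitrary example payload.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative only: why each broadcast attempt should get its own buffer view. */
final class BufferResendSketch {
  /** Stand-in for serialization: drains the buffer and returns how many bytes went out. */
  static int send(ByteBuffer payload) {
    int sent = payload.remaining();
    payload.position(payload.limit()); // the buffer is now consumed
    return sent;
  }

  public static void main(String[] args) {
    ByteBuffer shared = ByteBuffer.wrap("root.sg.**".getBytes(StandardCharsets.UTF_8));
    System.out.println(send(shared)); // first attempt sends the full payload
    System.out.println(send(shared)); // second attempt sends nothing

    ByteBuffer source = ByteBuffer.wrap("root.sg.**".getBytes(StandardCharsets.UTF_8));
    // duplicate() shares the bytes but has an independent position and limit,
    // so draining the duplicate leaves the source untouched for the next attempt.
    System.out.println(send(source.duplicate()));
    System.out.println(send(source.duplicate()));
  }
}
```

The second send on the shared buffer transmits zero bytes without raising any error, which is the "silently-successful" empty invalidation the commit guards against.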
Add SchemaUtils.invalidateMatchedSchemaCache() as the single place that broadcasts INVALIDATE_MATCHED_SCHEMA_CACHE via ClusterCachePropagator (proceed once every unreachable DataNode is provably self-fenced) with the buffer.duplicate() safety against the propagator's re-broadcast on WAIT. Route all five INVALIDATE_MATCHED_SCHEMA_CACHE callers through it, refactoring them off their inline broadcasts onto the shared helper:
- AlterLogicalViewProcedure, DeleteLogicalViewProcedure (views)
- DeactivateTemplateProcedure (template)
- DeleteTimeSeriesProcedure, AlterTimeSeriesDataTypeProcedure (static helpers, also used by AlterEncodingCompressorProcedure)

Each keeps its own error message and runs before its physical delete/alter step, so the 'delete/alter only after PROCEED' ordering holds. The region-task broadcasts (CONSTRUCT_*_BLACK_LIST, ALTER_VIEW, ALTER_TIMESERIES_DATATYPE) are deliberately untouched, since they go through region consensus. Affected procedure tests still pass.
Add SchemaUtils.broadcastTemplateUpdate(cm, Supplier<TUpdateTemplateReq>): the single place that broadcasts UPDATE_TEMPLATE via ClusterCachePropagator, proceeding once every unreachable DataNode is provably self-fenced. The request is rebuilt from the supplier on each attempt because the propagator may re-broadcast on WAIT and TUpdateTemplateReq's binary field is ByteBuffer-backed (reusing one request could re-send a consumed, empty payload). SetTemplateProcedure (ADD_TEMPLATE_PRE_SET_INFO forward step) and UnsetTemplateProcedure (INVALIDATE_TEMPLATE_SET_INFO) now use it instead of hard-failing on the first unreachable DataNode; both keep their own messages and their state advance / throw-on-failure semantics. The region-task validation (VALIDATE_TIMESERIES_EXISTENCE) is unchanged. Template procedure tests pass.
SetTTLProcedure's UPDATE_DATANODE_CACHE step (and the symmetric rollback restore) broadcast SET_TTL to all DataNodes after the authoritative ConfigNode write. Both used to hard-fail on the first unreachable DataNode, which also forced a full rollback of the committed TTL whenever any DataNode was down. Both now go through a new overridable broadcastTTLAndDecide() seam backed by ClusterCachePropagator: proceed once every unreachable DataNode is provably self-fenced (it fails closed on TTL in compaction and resyncs on recovery), and fail (→ rollback) only when a live DataNode is genuinely unacked. TSetTTLReq has no ByteBuffer field, so the request is reused safely across the propagator's re-broadcasts. The test overrides broadcastTTLAndDecide instead of sendTTLRequest, keeping the rollback-on-live-failure scenario deterministic and fast (no real propagator sleep). All 6 SetTTL tests pass.
…gator ConfigNodeProcedureEnv.invalidateCache (the DeleteDatabase INVALIDATE_CACHE step) synchronously invalidates partition+schema cache on every DataNode. It used to poll an Unknown DataNode for 5s then hard-fail, so a single down DataNode broke DROP DATABASE. It now runs through ClusterCachePropagator: a per-round closure synchronously invalidates each reachable DataNode (SUCCESS only if both partition and schema succeed) and reports Unknown/erroring DataNodes as not-acked WITHOUT sync-sending to them (a sync send to a dead DataNode would block on connect timeouts). The propagator then proceeds once every not-acked DataNode is provably self-fenced (it fails closed and resyncs on recovery) and fails only on a live unacked DataNode. This runs before DELETE_DATABASE_SCHEMA, so the delete-after-PROCEED ordering holds. Removed the now-dead Unknown-polling loop and unused getNodeManager() helper. DeleteDatabaseProcedureTest passes.
The DATANODE_AUTHCACHE_INVALIDING step broadcast INVALIDATE_PERMISSION_CACHE and, after datanode_token_timeout_ms, SILENTLY DROPPED any still-unacked DataNode from the list, leaving a live DataNode serving a just-revoked permission until its own token timeout. (Phase 1 already closed the fenced-DataNode case via DN-side fail-closed; this closes the live-transiently-unacked case.) The step now runs through ClusterCachePropagator over the live registered DataNodes: it proceeds once every unreachable DataNode is provably self-fenced (it fails closed on auth and resyncs on recovery) and fails only when a live DataNode stays unacked, never silently dropping one. Unknown DataNodes are not sync-contacted (avoids connect-timeout stalls) and are reported as not-acked for the verdict. Fields/serialization are unchanged for procedure-restart compatibility (dataNodesToInvalid is now vestigial). AuthOperationProcedureTest passes.
These were accidentally committed by a 'git add -A'; they are informal design notes, not part of the change, and carry no Apache license header.
I reviewed this implementation and the design note. A few safety-semantics points still look worth tightening or proving explicitly:
These points do not block the availability goal for a stopped DataNode, but they do affect whether the fencing scheme is fully safe under the failure modes it is meant to cover.
Problem
Many cluster metadata operations broadcast a cache-invalidation to every registered DataNode and treat any unreachable DataNode as a hard failure. As a result, with multiple replicas configured, a single down DataNode still makes these operations fail, contradicting the HA goal. The original report was table-model DDL (CREATE TABLE, ALTER TABLE ...), but the same pattern affects nearly all Tier-A metadata ops (tree-model schema, templates, views, TTL, DROP DATABASE, permissions).

The reason the broadcast cannot simply "skip" an unreachable DataNode is correctness: a DataNode caches ConfigNode-pushed metadata (table/tree schema, permissions, TTL, ...). If it misses an invalidation during a network partition and keeps serving the stale cache, it can produce dirty data / stale reads / stale authorization.
Approach — metadata lease + self-fencing + a broadcast verdict
- DataNode self-fencing: if a DataNode receives no ConfigNode heartbeat within metadata_lease_fence_ms (T_fence, default 20s, aligned with the failure detector), it considers itself fenced and stops trusting ConfigNode-pushed caches. On the next heartbeat (recovery) it resyncs.
- ConfigNode broadcast verdict: the ConfigNode proceeds past an unacked DataNode only when that DataNode is provably fenced, that is, silent for at least T_proceed = T_fence + margin, and known to support fencing. Such a DataNode is fail-closed and will resync, so it cannot serve dirty data. A DataNode that is reachable-but-unacked (still possibly serving) blocks the op (wait, then fail).

This gives the desired CP behavior: a healthy majority makes progress; a partitioned minority fails closed.
What this PR contains
DataNode side (fail-closed when fenced):
- MetadataLeaseManager: lease tracking, lazy fence check, recovery listeners (monotonic clock, injectable for tests)
- Fail-closed caches: table cache (DataNodeTableCache) throws a retryable error; tree schema cache (TreeDeviceSchemaCacheManager) forces a re-fetch from the quorum-backed SchemaRegion; permission cache (ClusterAuthorityFetcher) drops the cache and re-fetches (deny while partitioned); TTL: compaction uses an infinite TTL when fenced (never deletes by a stale TTL)
- supportsMetadataLeaseFencing capability bit; a metadata_lease_heartbeat_age_ms metric

ConfigNode side (proceed past provably-fenced DataNodes):
- MetadataBroadcastVerdict (pure decision logic) + DataNodeContactTracker (sound per-DataNode last-successful-response signal, capability bit) + ClusterCachePropagator (broadcast → verdict → PROCEED/WAIT/FAIL with retry)
- Migrated procedures: table model (CreateTable + all 8 alter/drop procedures, forward + rollback), tree-model schema (DeleteTimeSeries, AlterTimeSeriesDataType, AlterEncodingCompressor, logical views, DeactivateTemplate), templates (Set/Unset), SetTTL, and DeleteDatabase (sync path). Region-task / physical-deletion broadcasts are deliberately left on region consensus.
- AuthOperationProcedure: fixed a silent permission-staleness hole (it used to drop an unacked DataNode after a timeout, leaving a live DataNode serving a just-revoked permission); now proceeds only past provably-fenced DataNodes.

Testing
- Unit tests (MetadataLeaseManager, DataNodeTableCache, tree schema, permission, TTL-compaction, MetadataBroadcastVerdict, DataNodeContactTracker, ClusterCachePropagator), plus no-regression on the affected procedures.
- Integration test (IoTDBTableDDLHAIT): with one DataNode stopped, CREATE TABLE succeeds after ~T_proceed (instead of failing immediately). A new setMetadataLeaseFenceMs IT-config setter keeps the wait short.

Configuration & compatibility
- metadata_lease_fence_ms (default 20000). T_proceed = T_fence + internal margin (~5s).

Not in this PR (follow-up)
Tier-B resource operations (DROP/CREATE FUNCTION/TRIGGER/PIPE PLUGIN, SET SYSTEM STATUS=ReadOnly) and the quota re-pull. These need DataNode-side fail-closed on resource execution (not a read-through cache) plus a resource recovery-resync mechanism, which is a separate, self-contained effort.