add epoch to the add maintainer related opeartions #5181

Open
3AceShowHand wants to merge 4 commits into pingcap:master from 3AceShowHand:fix-maintainer-stuck

@3AceShowHand 3AceShowHand commented Jun 1, 2026

What problem does this PR solve?

Issue Number: close #5179

After the CDC owner is network-isolated from other CDC nodes and later recovers, an old maintainer generation can still be alive while a new owner/coordinator generation is scheduling the same changefeed. Without generation fencing, stale maintainer scheduling messages and dispatcher-manager lifecycle messages can race with the new generation.

The visible symptom is more than a brief maintainer nodeID mismatch window. Stale add/remove/bootstrap/post-bootstrap/close messages can keep affecting the current control path after recovery, so the changefeed may stay stuck while its Kafka sink lag keeps growing for much longer.

What is changed and how it works?

This PR adds maintainer generation/epoch fencing across maintainer scheduling and dispatcher-manager ownership paths.

  • Propagate maintainer_epoch through maintainer scheduling messages, including add, remove, bootstrap, post-bootstrap, close, and the corresponding dispatcher-manager responses.
  • Make add/move/stop operators send the expected maintainer epoch and finish only when the reported maintainer status belongs to the expected generation.
  • Reject stale remove requests in the maintainer manager when the request epoch is older than the local maintainer epoch.
  • Reject stale bootstrap/post-bootstrap/close requests in the dispatcher orchestrator when the request epoch is older than the local dispatcher manager epoch.
  • Echo maintainer epoch in dispatcher-manager responses and ignore stale responses in the maintainer.
  • Let the dispatcher orchestrator pending queue replace older pending requests with newer-epoch requests, so a new generation is not blocked by a queued stale request.
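
The queue-replacement rule in the last bullet can be pictured with a small sketch. This is an illustration only, not the code in this PR: the `pendingQueue` type, the `request` struct, and the `Push` method are hypothetical names, and the tie-breaking choice (an equal or older epoch leaves the queued request in place) is an assumption that the description does not spell out.

```go
package main

import "fmt"

// request is a hypothetical stand-in for a maintainer control message.
type request struct {
	ChangefeedID    string
	Kind            string // for example "bootstrap" or "close"
	MaintainerEpoch uint64
}

// pendingQueue keeps at most one pending request per changefeed.
type pendingQueue struct {
	pending map[string]request
}

func newPendingQueue() *pendingQueue {
	return &pendingQueue{pending: make(map[string]request)}
}

// Push stores req and reports whether it was stored.
// Assumption: only a strictly newer epoch replaces a queued request.
func (q *pendingQueue) Push(req request) bool {
	old, ok := q.pending[req.ChangefeedID]
	if ok && req.MaintainerEpoch <= old.MaintainerEpoch {
		return false
	}
	q.pending[req.ChangefeedID] = req
	return true
}

func main() {
	q := newPendingQueue()
	fmt.Println(q.Push(request{"cf-1", "bootstrap", 3})) // true: queue was empty
	fmt.Println(q.Push(request{"cf-1", "bootstrap", 2})) // false: older generation
	fmt.Println(q.Push(request{"cf-1", "bootstrap", 4})) // true: newer generation wins
	fmt.Println(q.pending["cf-1"].MaintainerEpoch)       // 4
}
```

With a rule of this shape, a request from the new generation is never left waiting behind a stale one for the same changefeed.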

Check List

Tests

  • Unit test

Commands run locally:

  • make generate-protobuf
  • make fmt
  • go test ./coordinator/changefeed ./coordinator/operator ./maintainer ./downstreamadapter/dispatcherorchestrator ./downstreamadapter/dispatchermanager ./heartbeatpb
  • go test ./coordinator/...
  • git diff --check

Questions

Will it cause performance regression or break compatibility?

No expected performance regression. The runtime overhead is limited to passing a uint64 epoch and doing simple integer comparisons on maintainer control messages.

The protobuf changes are additive. Epoch 0 is treated as the compatibility path for requests or responses that do not participate in maintainer epoch fencing yet.
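
As a rough sketch of the compatibility rule: the description only says that epoch 0 is the compatibility path, so treating a zero epoch on either side as "not participating in fencing" is an assumption here. `acceptRequest` is a hypothetical helper, not a function from this PR.

```go
package main

import "fmt"

// acceptRequest reports whether a maintainer control request should be applied.
// Assumption: a zero epoch on either side skips the fencing comparison.
func acceptRequest(requestEpoch, localEpoch uint64) bool {
	if requestEpoch == 0 || localEpoch == 0 {
		return true
	}
	// A request is stale only when its epoch is older than the local one.
	return requestEpoch >= localEpoch
}

func main() {
	fmt.Println(acceptRequest(0, 5)) // true: sender does not fence yet
	fmt.Println(acceptRequest(4, 5)) // false: older generation
	fmt.Println(acceptRequest(5, 5)) // true: same generation
	fmt.Println(acceptRequest(6, 5)) // true: newer generation
}
```

Because an unset uint64 protobuf field decodes as 0, a check of this shape keeps old senders working without a separate version flag.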

Do you need to update user documentation, design documentation or monitoring documentation?

No user-facing behavior or configuration is changed.

Release note

Fix a changefeed lag issue caused by stale maintainer scheduling and dispatcher-manager lifecycle messages after CDC owner isolation recovery.

Summary by CodeRabbit

  • New Features

    • Introduced maintainer-epoch fields in maintainer/heartbeat messages and status.
  • Improvements

    • Add/move/stop/bootstrap flows now propagate and validate maintainer epochs to prevent stale operations and improve coordination.
    • Controller and dispatcher now respect and echo epochs; status formatting includes maintainer epoch for diagnostics.
  • Tests

    • Added and expanded tests covering epoch semantics, request acceptance/replacement, staleness gating, and operator behavior.

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed labels Jun 1, 2026

ti-chi-bot Bot commented Jun 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hongyunyan for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai Bot commented Jun 1, 2026

Walkthrough

This PR propagates a uint64 maintainer_epoch through heartbeat protos and coordinator flows: Changefeeds expose epochs; operators capture and include epochs in add/remove messages and require epoch matches to finish; maintainers echo and validate epochs and ignore stale responses; dispatcher components accept/replace maintainer requests based on epoch comparisons.

Changes

Maintainer Epoch Fencing

| Layer | File(s) | Summary |
|---|---|---|
| Changefeed message construction and accessor | coordinator/changefeed/changefeed.go | Add Changefeed.GetMaintainerEpoch(). NewAddMaintainerMessage accepts maintainerEpoch and sets AddMaintainerRequest.MaintainerEpoch. Add RemoveMaintainerMessageWithEpoch and wire remove-call sites. |
| Changefeed message tests | coordinator/changefeed/changefeed_test.go | Tests assert AddMaintainerRequest and RemoveMaintainerRequest include expected MaintainerEpoch values (default 0 and provided epochs). |
| AddMaintainerOperator: capture and fence by epoch | coordinator/operator/operator_add.go, coordinator/operator/operator_add_test.go | AddMaintainerOperator stores maintainerEpoch from changefeed, includes it in scheduled add message, and requires destination MaintainerEpoch (or expected 0) plus BootstrapDone before finishing; test validates behavior. |
| Stop operator wiring from Controller | coordinator/operator/operator_controller.go | Controller.StopChangefeed reads changefeed epoch and forwards it into stop-operator creation path. |
| StopChangefeedOperator: include epoch in removal | coordinator/operator/operator_stop.go, coordinator/operator/operator_stop_test.go | StopChangefeedOperator stores maintainerEpoch, NewStopChangefeedOperator accepts it, Schedule uses RemoveMaintainerMessageWithEpoch, Check gates completion by epoch; tests updated to pass/assert epoch. |
| MoveMaintainerOperator: epoch during move | coordinator/operator/operator_move.go, coordinator/operator/operator_move_test.go | MoveMaintainerOperator captures maintainerEpoch when scheduling add-to-dest, includes it in add message, and requires matching MaintainerEpoch + BootstrapDone before marking finished; tests assert epoch behavior. |
| Controller bootstrap and stale removal messaging | coordinator/controller.go, coordinator/controller_test.go | When bootstrapping changefeeds already running remotely, controller aligns in-memory Epoch with remote maintainer epoch if non-zero; stale maintainer removals now send RemoveMaintainerMessageWithEpoch; test added to verify removal message includes epoch. |
| Maintainer core: echo and validate epoch | maintainer/maintainer.go, maintainer/maintainer_manager_maintainers.go, maintainer/maintainer_epoch_test.go, maintainer/maintainer_test.go | NewMaintainerForRemove accepts maintainerEpoch into ChangeFeedInfo.Epoch. GetMaintainerStatus includes MaintainerEpoch. Maintain responses (bootstrap/post-bootstrap/close) are ignored when epochs mismatch; outgoing bootstrap/close requests include MaintainerEpoch. Added/updated tests for epoch matching and removal gating. |
| Dispatcher orchestrator & helper: epoch-aware acceptance and queue rules | downstreamadapter/dispatcherorchestrator/*, downstreamadapter/dispatcherorchestrator/helper.go, downstreamadapter/dispatchermanager/dispatcher_manager_info.go, downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go | Orchestrator tracks closed maintainer epochs, rejects bootstrap for closed generations, applies shouldAcceptMaintainerRequest to ignore stale requests, conditionally applies req.MaintainerEpoch to manager config, includes MaintainerEpoch in responses and errors, and updates pending-message replacement rules to prefer higher maintainer epochs; tests added for queue replacement and acceptance rules. |
| Formatting: include epoch in status string | pkg/common/format.go | FormatMaintainerStatus now prints maintainerEpoch in the formatted output. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

lgtm

Suggested reviewers

  • asddongmen
  • lidezhu

Poem

🐰 I hop through epochs, small and spry,
I whisper numbers as messages fly.
Changefeed to maintainer, the epoch I send,
Stale echoes quiet, new generations mend.
🥕 Hooray — the scheduler stays true to the end.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.82% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title mentions 'add epoch to the add maintainer related operations' but is partially related and contains a typo ('opeartions' instead of 'operations'). It addresses a real aspect of the change but undersells the scope. | Consider revising to 'Add maintainer epoch fencing across scheduling and lifecycle operations' or similar to better reflect the comprehensive scope of the changes. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | The PR successfully addresses issue #5179 by implementing maintainer epoch fencing across scheduling, bootstrap, post-bootstrap, close operations and dispatcher-manager responses to prevent stale messages from affecting new maintainer generations after CDC owner recovery. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope of fixing issue #5179: epoch propagation in protobuf messages, operator implementations, maintainer/dispatcher handling, and comprehensive test coverage. No unrelated changes detected. |

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)


@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 1, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a maintainer_epoch mechanism to fence maintainer scheduling generations from one another, updating protobuf definitions, coordinator operators, and maintainer components to propagate and validate the epoch. The review feedback identifies several critical issues: a potential runtime panic in handleAddMaintainer if info is nil, a compilation error in GetMaintainerEpoch due to comparing an atomic.Pointer directly to nil, and potential nil pointer dereferences in the Check methods of both AddMaintainerOperator and MoveMaintainerOperator if the status parameter is nil.

Comment thread maintainer/maintainer_manager_maintainers.go
Comment thread coordinator/changefeed/changefeed.go
Comment thread coordinator/operator/operator_add.go
Comment thread coordinator/operator/operator_move.go

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@coordinator/operator/operator_add.go`:
- Around line 87-90: The current maintainerEpochMatches function incorrectly
treats actual == 0 as a match; update it so only expected == 0 (operator opting
out) bypasses the fencing check and the receiver reporting zero does not count
as a match. Replace the logic in maintainerEpochMatches to return true only when
expected == 0 or when actual != 0 and expected == actual (i.e., remove the
actual == 0 branch) so Check no longer treats a zero-reported epoch as a
successful match.

In `@coordinator/operator/operator_move.go`:
- Around line 90-96: When m.originNodeStopped is true we must always initialize
m.maintainerEpoch from m.changefeed.GetMaintainerEpoch() before constructing the
AddMaintainer message; currently that assignment only runs when !m.bind so the
fallback path sends epoch 0. Move the call m.maintainerEpoch =
m.changefeed.GetMaintainerEpoch() so it executes unconditionally inside the
m.originNodeStopped branch, but keep the DB bind call
m.db.BindChangefeedToNode(m.origin, m.dest, m.changefeed) guarded by if !m.bind
and still set m.bind = true only there; then return
m.changefeed.NewAddMaintainerMessage(m.dest, m.maintainerEpoch).
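
The predicate requested in the first inline comment can be sketched as follows. This is only an illustration of the rule as worded in the comment; `epochMatches` is a hypothetical name and the code is not taken from operator_add.go.

```go
package main

import "fmt"

// epochMatches reports whether a reported maintainer epoch satisfies the
// epoch the operator expects. Only an expected epoch of 0 (the operator
// opting out of fencing) bypasses the comparison.
func epochMatches(expected, actual uint64) bool {
	if expected == 0 {
		return true
	}
	// With a non-zero expected epoch, equality already implies actual != 0,
	// so a receiver reporting 0 never counts as a match.
	return actual == expected
}

func main() {
	fmt.Println(epochMatches(0, 7)) // true: operator opted out
	fmt.Println(epochMatches(7, 0)) // false: zero report is not a match
	fmt.Println(epochMatches(7, 7)) // true: same generation
	fmt.Println(epochMatches(7, 6)) // false: different generation
}
```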

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9ce3ffe-7823-41c7-bd00-1fb124290f54

📥 Commits

Reviewing files that changed from the base of the PR and between 99f4859 and eff6e26.

⛔ Files ignored due to path filters (1)
  • heartbeatpb/heartbeat.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (12)
  • coordinator/changefeed/changefeed.go
  • coordinator/changefeed/changefeed_test.go
  • coordinator/controller.go
  • coordinator/operator/operator_add.go
  • coordinator/operator/operator_add_test.go
  • coordinator/operator/operator_move.go
  • coordinator/operator/operator_move_test.go
  • heartbeatpb/heartbeat.proto
  • maintainer/maintainer.go
  • maintainer/maintainer_manager_maintainers.go
  • maintainer/maintainer_test.go
  • pkg/common/format.go

Comment thread coordinator/operator/operator_add.go Outdated
Comment thread coordinator/operator/operator_move.go
@3AceShowHand
Collaborator Author

/test all

@3AceShowHand
Collaborator Author

/test all


@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
coordinator/changefeed/changefeed.go (1)

273-301: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pass MaintainerEpoch into coordinator remove requests instead of always using epoch-0.

coordinator/controller.go still calls changefeed.RemoveMaintainerMessage(...) (epoch-0 wrapper) in both handleNonExistentChangefeed and the stale-removal loop in finishBootstrap. This sets heartbeatpb.RemoveMaintainerRequest.MaintainerEpoch = 0, and maintainer-side shouldApplyMaintainerRemove treats requestEpoch==0 as always-apply, bypassing epoch fencing. Switch those callsites to RemoveMaintainerMessageWithEpoch(..., status.GetMaintainerEpoch()) / RemoveMaintainerMessageWithEpoch(..., rm.status.GetMaintainerEpoch()) (the coordinator already has these epochs from the incoming MaintainerStatus) to preserve the fencing behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@coordinator/changefeed/changefeed.go` around lines 273 - 301, The coordinator
currently calls RemoveMaintainerMessage(...) which sets MaintainerEpoch=0 and
bypasses epoch fencing; update the two call sites in coordinator/controller.go —
inside handleNonExistentChangefeed and inside the stale-removal loop in
finishBootstrap — to call RemoveMaintainerMessageWithEpoch(...) and pass the
correct epoch from the maintainer status (use status.GetMaintainerEpoch() in
handleNonExistentChangefeed and rm.status.GetMaintainerEpoch() in the
finishBootstrap loop) so heartbeatpb.RemoveMaintainerRequest.MaintainerEpoch
carries the actual epoch and preserves fencing semantics.
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go (1)

379-406: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix stale maintainer close requests replying Success: true (no-op ack)

handleCloseRequest initializes MaintainerCloseResponse{Success: true} and, on !shouldAcceptMaintainerRequest(...), only logs and then still falls through to sendResponse—so the sender maintainer can receive a success ack even though TryClose was not executed.

Maintainer.onMaintainerCloseResponse treats response.Success as an acknowledgement when response.MaintainerEpoch matches the maintainer’s current epoch, so this can incorrectly trigger onRemoveMaintainer for a no-op close. Align stale handling with bootstrap/post-bootstrap by not replying (or set Success=false).

🔧 Proposed fix
 	m.mutex.Lock()
 	if manager, ok := m.dispatcherManagers[cfId]; ok {
 		if !shouldAcceptMaintainerRequest(req.MaintainerEpoch, manager.GetMaintainerEpoch()) {
 			log.Info("ignore stale maintainer close request",
 				zap.Stringer("changefeedID", cfId),
 				zap.Stringer("from", from),
 				zap.Uint64("requestMaintainerEpoch", req.MaintainerEpoch),
 				zap.Uint64("localMaintainerEpoch", manager.GetMaintainerEpoch()))
+			m.mutex.Unlock()
+			return nil
 		} else if closed := manager.TryClose(req.Removed); closed {
 			delete(m.dispatcherManagers, cfId)
 			metrics.DispatcherManagerGauge.WithLabelValues(cfId.Keyspace(), cfId.Name()).Dec()
 			response.Success = true
 		} else {
 			response.Success = false
 		}
 	}
 	m.mutex.Unlock()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go` around
lines 379 - 406, The code currently initializes response.Success=true and on a
stale request (when !shouldAcceptMaintainerRequest(...)) only logs, allowing a
success ack to be sent; change the stale handling inside handleCloseRequest so
that when !shouldAcceptMaintainerRequest(req.MaintainerEpoch,
manager.GetMaintainerEpoch()) you set response.Success = false and immediately
send the response (via m.sendResponse) and return (ensuring you still unlock
m.mutex before sending/returning), instead of falling through to the TryClose
branch; reference symbols: shouldAcceptMaintainerRequest,
manager.GetMaintainerEpoch, response (heartbeatpb.MaintainerCloseResponse),
TryClose, m.sendResponse, and m.mutex.
🧹 Nitpick comments (1)
coordinator/operator/operator_controller.go (1)

130-130: 💤 Low value

Local variable changefeed shadows the imported package.

changefeed := oc.changefeedDB.GetByID(cfID) shadows the github.com/pingcap/ticdc/coordinator/changefeed import. It's safe here since the package isn't referenced inside this function, but a rename (e.g. cf, matching AddOperator above) avoids confusion and future breakage.

♻️ Suggested rename
-	changefeed := oc.changefeedDB.GetByID(cfID)
-	if changefeed == nil {
+	cf := oc.changefeedDB.GetByID(cfID)
+	if cf == nil {
 		log.Warn("stop changefeed failed, changefeed not found",
 			zap.String("role", oc.role),
 			zap.Bool("removed", removed),
 			zap.String("changefeed", cfID.Name()))
 		if old, ok := oc.operators[cfID]; ok {
 			return old.OP
 		}
 		return nil
 	}
-	keyspaceID := changefeed.GetKeyspaceID()
-	maintainerEpoch := changefeed.GetMaintainerEpoch()
+	keyspaceID := cf.GetKeyspaceID()
+	maintainerEpoch := cf.GetMaintainerEpoch()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@coordinator/operator/operator_controller.go` at line 130, The local variable
name changefeed shadows the imported package changefeed; rename the local
variable returned by oc.changefeedDB.GetByID(cfID) (e.g., to cf to match
AddOperator) to avoid confusion and potential future bugs—update all subsequent
references in this function from changefeed to cf and ensure no other
identifiers collide with the imported package name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0ee957cf-114f-4672-a5f0-3a84cf22a15d

📥 Commits

Reviewing files that changed from the base of the PR and between eff6e26 and 6f578e0.

⛔ Files ignored due to path filters (1)
  • heartbeatpb/heartbeat.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (17)
  • coordinator/changefeed/changefeed.go
  • coordinator/changefeed/changefeed_test.go
  • coordinator/operator/operator_add.go
  • coordinator/operator/operator_add_test.go
  • coordinator/operator/operator_controller.go
  • coordinator/operator/operator_move.go
  • coordinator/operator/operator_move_test.go
  • coordinator/operator/operator_stop.go
  • coordinator/operator/operator_stop_test.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_info.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
  • downstreamadapter/dispatcherorchestrator/helper.go
  • heartbeatpb/heartbeat.proto
  • maintainer/maintainer.go
  • maintainer/maintainer_epoch_test.go
  • maintainer/maintainer_manager_maintainers.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • coordinator/operator/operator_move_test.go
  • coordinator/operator/operator_add_test.go
  • coordinator/operator/operator_move.go
  • coordinator/operator/operator_add.go

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@maintainer/maintainer.go`:
- Around line 1087-1089: The predicate in maintainerResponseEpochMatches
currently rejects responseEpoch == 0 when the local epoch is non-zero, breaking
bootstrap compatibility; update the function maintainerResponseEpochMatches to
accept a responseEpoch of 0 as a compatibility wildcard (in addition to the
existing localEpoch == 0 case) so compatibility responses are not dropped —
i.e., treat responseEpoch == 0 as matching the local epoch and preserve the
existing localEpoch == 0 behavior.
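
For contrast with the operator-side check, the response-side rule this comment asks for treats a zero epoch on either side as a wildcard. A minimal sketch under that reading, with `responseEpochMatches` as a hypothetical name; the thread is marked outdated, so the code in the PR may differ:

```go
package main

import "fmt"

// responseEpochMatches reports whether a dispatcher-manager response should
// be handled by the local maintainer. A zero epoch on either side is treated
// as a compatibility wildcard, as the review comment suggests.
func responseEpochMatches(localEpoch, responseEpoch uint64) bool {
	if localEpoch == 0 || responseEpoch == 0 {
		return true
	}
	return localEpoch == responseEpoch
}

func main() {
	fmt.Println(responseEpochMatches(5, 0)) // true: compatibility response
	fmt.Println(responseEpochMatches(0, 5)) // true: local side does not fence
	fmt.Println(responseEpochMatches(5, 5)) // true: same generation
	fmt.Println(responseEpochMatches(5, 4)) // false: stale response
}
```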

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e3e3c97f-01b4-4696-8219-ec7240e96676

📥 Commits

Reviewing files that changed from the base of the PR and between 6f578e0 and 0c85219.

📒 Files selected for processing (7)
  • coordinator/operator/operator_stop.go
  • coordinator/operator/operator_stop_test.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
  • maintainer/maintainer.go
  • maintainer/maintainer_epoch_test.go
  • maintainer/maintainer_manager_maintainers.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • maintainer/maintainer_manager_maintainers.go
  • maintainer/maintainer_epoch_test.go
  • coordinator/operator/operator_stop.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • coordinator/operator/operator_stop_test.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go

Comment thread maintainer/maintainer.go Outdated
@3AceShowHand
Collaborator Author

/test all

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
coordinator/operator/operator_stop.go (1)

65-79: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Make the node match and finish transition atomic.

Line 69 snapshots nodeID, but Line 79 can still mark the operator finished after OnNodeRemove has already switched the target to coordinatorNodeID. That lets a late non-working report from the removed node retire the operator and skip the rescheduled remove on the coordinator.

Proposed fix
func (m *StopChangefeedOperator) Check(from node.ID, status *heartbeatpb.MaintainerStatus) {
	if status == nil {
		return
	}
-	if from != m.getNodeID() {
+	m.mu.RLock()
+	defer m.mu.RUnlock()
+
+	if from != m.nodeID {
		return
	}
	if !m.finished.Load() &&
		status.State != heartbeatpb.ComponentState_Working &&
		maintainerEpochCanBeStopped(m.maintainerEpoch, status.MaintainerEpoch) {
		log.Info("maintainer report non-working status",
			zap.Stringer("maintainer", m.cfID),
			zap.Uint64("operatorMaintainerEpoch", m.maintainerEpoch),
			zap.Uint64("statusMaintainerEpoch", status.MaintainerEpoch))
		m.finished.Store(true)
	}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@coordinator/operator/operator_stop.go` around lines 65 - 79, The Check method
may race with OnNodeRemove because it calls m.getNodeID() twice implicitly (once
for the comparison and again when deciding to finish); capture the current
target node ID into a local variable (e.g., nodeID := m.getNodeID()) at the
start of Check and use that single snapshot for the from comparison and the
finished transition so the non-working report only retires the operator if the
node still matches that snapshot; ensure you only call m.finished.Store(true)
when from == nodeID and maintainerEpochCanBeStopped(...) and status checks all
pass.
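
The locking variant from the proposed fix can be reduced to a toy model. Everything here is hypothetical (`stopOperator`, string node IDs, a `working` flag instead of a status message); it only shows the node comparison and the finished transition happening under one read lock while retargeting takes the write lock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stopOperator is a simplified, hypothetical stand-in for a stop operator.
type stopOperator struct {
	mu       sync.RWMutex
	nodeID   string
	finished atomic.Bool
}

// Check marks the operator finished only if the non-working report comes
// from the current target node. The read lock is held across both steps.
func (o *stopOperator) Check(from string, working bool) {
	o.mu.RLock()
	defer o.mu.RUnlock()
	if from != o.nodeID || working {
		return
	}
	o.finished.Store(true)
}

// OnNodeRemove retargets the operator under the write lock, so it cannot
// interleave with a Check that is in progress.
func (o *stopOperator) OnNodeRemove(removed, coordinator string) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if o.nodeID == removed {
		o.nodeID = coordinator
	}
}

func main() {
	op := &stopOperator{nodeID: "node-a"}
	op.OnNodeRemove("node-a", "coordinator")
	op.Check("node-a", false) // late report from the removed node
	fmt.Println(op.finished.Load()) // false
	op.Check("coordinator", false)
	fmt.Println(op.finished.Load()) // true
}
```
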
coordinator/controller.go (1)

646-657: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't overwrite a newer persisted epoch with an older remote epoch.

allChangefeeds comes from metastore, so info.Epoch is the durable generation. If a remote maintainer reports a smaller non-zero epoch here, this branch downgrades the in-memory changefeed and then records that stale maintainer as authoritative. That re-admits exactly the generation this PR is trying to fence after recovery. Please only adopt the remote epoch for compatibility/newer-generation cases, and treat remote < local as stale instead.

Possible direction
-			if epoch := rm.status.GetMaintainerEpoch(); epoch != 0 && info.Epoch != epoch {
+			if epoch := rm.status.GetMaintainerEpoch(); epoch != 0 && (info.Epoch == 0 || epoch > info.Epoch) {
 				clonedInfo, err := info.Clone()
 				if err != nil {
 					log.Panic("clone changefeed info failed",
 						zap.Stringer("changefeed", cfID),
 						zap.Error(err))
 				}
 				clonedInfo.Epoch = epoch
 				info = clonedInfo
+			} else if epoch != 0 && info.Epoch > epoch {
+				_ = c.messageCenter.SendCommand(removeStaleMaintainerMessage(cfID, rm.nodeID, rm.status))
+				continue
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@coordinator/controller.go` around lines 646 - 657, The current code
unconditionally replaces the durable in-memory epoch (info.Epoch) with the
remote maintainer epoch from rm.status.GetMaintainerEpoch(), which can downgrade
a newer persisted generation; change the logic in the branch around
rm.status.GetMaintainerEpoch() so you only adopt the remote epoch when it is
non-zero and strictly greater than info.Epoch (i.e., remote > local). If remote
<= info.Epoch treat the remote as stale: do not clone/modify info.Epoch and do
not call c.changefeedDB.AddReplicatingMaintainer based on the stale epoch;
continue creating the changefeed with the existing info and only add the
maintainer when the epoch check passes. Ensure this uses the existing symbols
rm.status.GetMaintainerEpoch(), info.Clone(), changefeed.NewChangefeed(...), and
c.changefeedDB.AddReplicatingMaintainer(...) to locate and update the code.
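
The decision table behind the "possible direction" diff can be sketched in isolation. `resolveEpoch` is a hypothetical helper, not part of coordinator/controller.go: it returns the epoch to keep in memory and whether the remote maintainer should be treated as stale.

```go
package main

import "fmt"

// resolveEpoch decides which epoch the coordinator keeps after comparing the
// persisted (local) epoch with the epoch a remote maintainer reports.
// It follows the rule suggested in the review: adopt the remote epoch only
// when the local one is unset or older, and flag an older remote as stale.
func resolveEpoch(local, remote uint64) (epoch uint64, stale bool) {
	switch {
	case remote != 0 && (local == 0 || remote > local):
		return remote, false // compatibility or newer-generation case
	case remote != 0 && remote < local:
		return local, true // remote belongs to an older generation
	default:
		return local, false // remote is 0 or equal to local
	}
}

func main() {
	epoch, stale := resolveEpoch(5, 3)
	fmt.Println(epoch, stale) // 5 true
	epoch, stale = resolveEpoch(0, 3)
	fmt.Println(epoch, stale) // 3 false
	epoch, stale = resolveEpoch(5, 7)
	fmt.Println(epoch, stale) // 7 false
}
```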

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 81789fb6-2656-4e0f-9de4-cd6cee16a38e

📥 Commits

Reviewing files that changed from the base of the PR and between 0c85219 and a9f6b9e.

📒 Files selected for processing (7)
  • coordinator/controller.go
  • coordinator/controller_test.go
  • coordinator/operator/operator_stop.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
  • maintainer/maintainer.go
  • maintainer/maintainer_epoch_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • maintainer/maintainer_epoch_test.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go


ti-chi-bot Bot commented Jun 1, 2026

@3AceShowHand: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cdc-storage-integration-heavy | a9f6b9e | link | true | /test pull-cdc-storage-integration-heavy |

Labels

do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

After the simulated cdc owner and other cdc networks are isolated for 10 minutes to recover, the changefeed lag gets larger and larger

1 participant