maintainer,dispatcher: fence stale generation requests by hongyunyan · Pull Request #5182 · pingcap/ticdc

hongyunyan · 2026-06-01T08:44:57Z

What problem does this PR solve?

Issue Number: close #5083

During maintainer failover, a delayed schedule request from the previous
maintainer can still reach a dispatcher manager after the new maintainer has
already bootstrapped and recreated the same table span. Without a receiver-side
ownership fence, the stale request can create an orphan dispatcher that enters
Working and writes to the downstream sink before the new maintainer observes
and removes it.

What is changed and how it works?

This PR adds a receiver-local maintainer generation fence:

- Adds generation to maintainer bootstrap, schedule, post-bootstrap, and close
  heartbeat messages.
- Bumps and persists the changefeed epoch before new maintainer ownership is
  scheduled through add/move operators, and before resume/retry scheduling.
- Serializes persisted epoch bumps in the backend by reading the latest stored
  ChangeFeedInfo and job status, advancing with max(candidate, persisted+1),
  preserving stored status by default, and writing info/job under info-key and
  job-key ModRevision compares.
- Writes warning retry state/error through the same epoch bump boundary instead
  of first doing an ordinary no-CAS changefeed update.
- Generates epochs from PD TSO without silent production fallback, and keeps each
  changefeed's generation strictly monotonic with max(candidate, current+1).
- Keeps AddMaintainerRequest.Config bytes synchronized with the latest
  ChangeFeedInfo.
- Stamps maintainer outbound control messages with the changefeed epoch.
- Makes dispatcher managers track the active maintainer owner plus explicit
  request generation and reject stale schedule/post-bootstrap/close requests
  locally.
- Serializes dispatcher-manager control requests with maintainer generation
  changes, and keeps currentOperatorMap keyed by dispatcher ID and generation.
- Keeps rolling-upgrade compatibility by allowing generation 0 only while the
  receiver has not observed a non-zero generation for the changefeed, and only
  for the current compatibility-mode maintainer owner.
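
The max(candidate, current+1) rule from the list above can be illustrated with a minimal sketch. The function name nextEpoch is hypothetical and not taken from the PR; the sketch only shows the arithmetic, not the PD TSO allocation or the etcd compare-and-swap that the PR describes.

```go
package main

import "fmt"

// nextEpoch is a hypothetical helper illustrating max(candidate, current+1).
// candidate stands for a freshly allocated value (the PR derives it from PD TSO);
// current stands for the generation already stored for the changefeed.
// The result is always strictly greater than current, even if candidate lags behind.
// Overflow of current+1 is ignored in this sketch.
func nextEpoch(candidate, current uint64) uint64 {
	if candidate > current {
		return candidate
	}
	return current + 1
}

func main() {
	fmt.Println(nextEpoch(100, 40))  // candidate is ahead: 100
	fmt.Println(nextEpoch(40, 100))  // candidate is behind: 101
	fmt.Println(nextEpoch(100, 100)) // tie: 101
}
```

Because the result never equals or falls below the stored value, a receiver that compares generations can treat any smaller value as stale.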

Check List

Tests

Unit test

Questions

Will it cause performance regression or break compatibility?

No expected performance regression. The new mutex only serializes per-changefeed
dispatcher-manager control operations such as bootstrap, close, and dispatcher
create/remove scheduling; it is not in the event write path.

The change is wire-compatible. New fields are optional protobuf fields, and a
new receiver still allows generation 0 from the current maintainer owner while
it remains in compatibility mode for that changefeed.
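
A rough sketch of how a receiver-local fence with a generation-0 compatibility mode might behave. The type fence and its methods are hypothetical and only loosely mirror the TryUpdateMaintainer and IsMaintainerRequestAllowed calls visible in the diff; the exact acceptance rules here (for example, that a bootstrap with generation 0 may change the owner while no non-zero generation has been seen) are assumptions, not the PR's confirmed behavior.

```go
package main

import "fmt"

// fence is a hypothetical stand-in for the per-changefeed ownership state
// that a dispatcher manager tracks.
type fence struct {
	owner      string // maintainer currently accepted as owner
	generation uint64 // highest generation observed; 0 means compatibility mode
}

// tryUpdate models a bootstrap request. It rejects lower generations and a
// different sender claiming the same non-zero generation; otherwise it adopts
// the sender and generation. Assumption: in compatibility mode (generation 0
// on both sides) a bootstrap may replace the owner.
func (f *fence) tryUpdate(from string, gen uint64) bool {
	if gen < f.generation {
		return false // stale, including generation 0 after a non-zero one was seen
	}
	if gen == f.generation && gen != 0 && from != f.owner {
		return false // same generation claimed by a different sender
	}
	f.owner, f.generation = from, gen
	return true
}

// allowed models schedule, post-bootstrap, and close requests: the request must
// come from the current owner and carry exactly the current generation.
func (f *fence) allowed(from string, gen uint64) bool {
	return from == f.owner && gen == f.generation
}

func main() {
	f := &fence{}
	fmt.Println(f.tryUpdate("m1", 0)) // true: legacy maintainer in compatibility mode
	fmt.Println(f.allowed("m1", 0))   // true
	fmt.Println(f.tryUpdate("m2", 7)) // true: first non-zero generation ends compatibility mode
	fmt.Println(f.allowed("m1", 0))   // false: delayed request from the previous owner
	fmt.Println(f.allowed("m2", 0))   // false: generation 0 is no longer accepted
	fmt.Println(f.allowed("m2", 7))   // true
}
```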

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a bug where delayed stale maintainer requests could create duplicate dispatchers during maintainer failover.

Validation

make generate-protobuf
make fmt
tools/bin/golangci-lint run --timeout 10m0s --new-from-rev=upstream/master
go test ./coordinator/changefeed ./coordinator/operator ./coordinator ./pkg/pdutil
go test ./downstreamadapter/dispatchermanager ./downstreamadapter/dispatcherorchestrator ./coordinator ./coordinator/changefeed ./coordinator/operator ./pkg/pdutil ./maintainer ./maintainer/replica ./maintainer/operator
go test ./api/v1 ./coordinator ./coordinator/changefeed ./coordinator/drain ./coordinator/operator ./coordinator/scheduler ./downstreamadapter/dispatchermanager ./downstreamadapter/dispatcherorchestrator ./maintainer ./maintainer/replica ./maintainer/operator ./pkg/bootstrap ./pkg/server ./pkg/pdutil
git diff --check

ti-chi-bot · 2026-06-01T08:45:00Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot · 2026-06-01T08:45:01Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign charlescheung96 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-01T08:45:05Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

gemini-code-assist

Code Review

This pull request introduces a maintainer generation/epoch fencing mechanism to prevent stale maintainer requests from affecting dispatcher managers. It adds generation fields to heartbeat protobuf messages, implements fencing logic in the dispatcher manager and orchestrator, and stamps outgoing requests with the current maintainer generation. The review feedback highlights two critical head-of-line blocking issues in dispatcher_orchestrator.go where the orchestrator-wide lock m.mutex is held while waiting for the per-changefeed lock manager.LockControl(), and provides suggestions to safely release the lock before acquiring the per-changefeed lock.

gemini-code-assist · 2026-06-01T08:47:36Z

 	} else {
+		manager.LockControl()
+		defer manager.UnlockControl()
+		if !manager.TryUpdateMaintainer(from, generation) {
+			log.Warn("drop stale maintainer bootstrap request",
+				zap.String("changefeed", cfId.Name()),
+				zap.String("from", from.String()),
+				zap.Uint64("requestGeneration", generation),
+				zap.Uint64("currentGeneration", manager.GetMaintainerEpoch()),
+				zap.String("currentMaintainer", manager.GetMaintainerID().String()))
+			return nil
+		}


Holding the orchestrator-wide lock m.mutex while waiting for the per-changefeed lock manager.LockControl() can cause head-of-line blocking. If a single changefeed's dispatcher manager is slow or blocked, all other changefeeds on this node will be blocked from bootstrapping or closing.

To avoid this, we should unlock m.mutex as soon as we retrieve the manager from m.dispatcherManagers, and then acquire manager.LockControl(). To prevent races with concurrent close/delete operations, we can re-verify under m.mutex that the manager is still the active one in m.dispatcherManagers before proceeding.

	} else {
		m.mutex.Unlock()
		manager.LockControl()
		defer manager.UnlockControl()

		m.mutex.Lock()
		currentManager, stillExists := m.dispatcherManagers[cfId]
		if !stillExists || currentManager != manager {
			m.mutex.Unlock()
			return nil
		}
		m.mutex.Unlock()

		if !manager.TryUpdateMaintainer(from, generation) {
			log.Warn("drop stale maintainer bootstrap request",
				zap.String("changefeed", cfId.Name()),
				zap.String("from", from.String()),
				zap.Uint64("requestGeneration", generation),
				zap.Uint64("currentGeneration", manager.GetMaintainerEpoch()),
				zap.String("currentMaintainer", manager.GetMaintainerID().String()))
			return nil
		}
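
The pattern the reviewer describes (look up under the outer lock, release it, take the per-item lock, then re-verify) can be sketched in isolation. All names below (orchestrator, manager, withManager, handled) are hypothetical and simplified; this is not the actual dispatcher_orchestrator.go code.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for a per-changefeed dispatcher manager.
type manager struct {
	control sync.Mutex // per-changefeed control lock
	handled int        // number of control requests applied
}

// orchestrator is a hypothetical stand-in for the node-wide registry.
type orchestrator struct {
	mu       sync.Mutex // orchestrator-wide lock
	managers map[string]*manager
}

// withManager runs fn under the manager's control lock. The orchestrator-wide
// lock is held only for the two short map lookups, never while waiting for the
// per-changefeed lock, so a slow changefeed does not block the others.
func (o *orchestrator) withManager(id string, fn func(*manager)) bool {
	o.mu.Lock()
	m, ok := o.managers[id]
	o.mu.Unlock()
	if !ok {
		return false
	}

	m.control.Lock()
	defer m.control.Unlock()

	// Re-verify: the manager may have been closed or replaced while we waited.
	o.mu.Lock()
	current, still := o.managers[id]
	o.mu.Unlock()
	if !still || current != m {
		return false
	}

	fn(m)
	return true
}

func main() {
	o := &orchestrator{managers: map[string]*manager{"cf1": {}}}
	ok := o.withManager("cf1", func(m *manager) { m.handled++ })
	fmt.Println(ok, o.managers["cf1"].handled) // true 1
	fmt.Println(o.withManager("missing", func(*manager) {})) // false
}
```

Note that this sketch takes the orchestrator lock while holding the per-manager lock. That ordering is only safe if no other code path waits for a manager's control lock while holding the orchestrator lock; mixing both orders could deadlock.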

gemini-code-assist · 2026-06-01T08:47:36Z

 	m.mutex.Lock()
 	if manager, ok := m.dispatcherManagers[cfId]; ok {
-		if closed := manager.TryClose(req.Removed); closed {
-			delete(m.dispatcherManagers, cfId)
-			metrics.DispatcherManagerGauge.WithLabelValues(cfId.Keyspace(), cfId.Name()).Dec()
-			response.Success = true
+		manager.LockControl()
+		if manager.IsMaintainerRequestAllowed(from, req.Generation) {
+			if closed := manager.TryClose(req.Removed); closed {
+				delete(m.dispatcherManagers, cfId)
+				metrics.DispatcherManagerGauge.WithLabelValues(cfId.Keyspace(), cfId.Name()).Dec()
+				response.Success = true
+			} else {
+				response.Success = false
+			}
 		} else {
-			response.Success = false
+			log.Warn("drop stale maintainer close request",
+				zap.String("changefeed", cfId.Name()),
+				zap.String("from", from.String()),
+				zap.Uint64("requestGeneration", req.Generation),
+				zap.Uint64("currentGeneration", manager.GetMaintainerEpoch()),
+				zap.String("currentMaintainer", manager.GetMaintainerID().String()))
 		}
+		manager.UnlockControl()
 	}
 	m.mutex.Unlock()


Holding the orchestrator-wide lock m.mutex while waiting for the per-changefeed lock manager.LockControl() can cause head-of-line blocking. If a single changefeed's dispatcher manager is slow or blocked, all other changefeeds on this node will be blocked from bootstrapping or closing.

To avoid this, we should unlock m.mutex as soon as we retrieve the manager from m.dispatcherManagers, and then acquire manager.LockControl(). To prevent races with concurrent close/delete operations, we can re-verify under m.mutex that the manager is still the active one in m.dispatcherManagers before proceeding.

	m.mutex.Lock()
	manager, ok := m.dispatcherManagers[cfId]
	if !ok {
		m.mutex.Unlock()
		return response
	}
	m.mutex.Unlock()

	manager.LockControl()
	defer manager.UnlockControl()

	m.mutex.Lock()
	currentManager, stillExists := m.dispatcherManagers[cfId]
	if !stillExists || currentManager != manager {
		m.mutex.Unlock()
		response.Success = false
		return response
	}
	if manager.IsMaintainerRequestAllowed(from, req.Generation) {
		if closed := manager.TryClose(req.Removed); closed {
			delete(m.dispatcherManagers, cfId)
			metrics.DispatcherManagerGauge.WithLabelValues(cfId.Keyspace(), cfId.Name()).Dec()
			response.Success = true
		} else {
			response.Success = false
		}
	} else {
		log.Warn("drop stale maintainer close request",
			zap.String("changefeed", cfId.Name()),
			zap.String("from", from.String()),
			zap.Uint64("requestGeneration", req.Generation),
			zap.Uint64("currentGeneration", manager.GetMaintainerEpoch()),
			zap.String("currentMaintainer", manager.GetMaintainerID().String()))
	}
	m.mutex.Unlock()

maintainer,dispatcher: fence stale generation requests

82b4765

ti-chi-bot Bot added the do-not-merge/work-in-progress (the PR should not merge because it is a work in progress), release-note (the PR will be considered when release notes are generated), and do-not-merge/needs-triage-completed labels Jun 1, 2026

ti-chi-bot Bot added the size/XXL label (the PR changes 1000+ lines, ignoring generated files) Jun 1, 2026

gemini-code-assist Bot reviewed Jun 1, 2026

hongyunyan added 9 commits June 1, 2026 17:29

maintainer,dispatcher: fix generation fence blockers

d4e8199

coordinator: serialize changefeed epoch bumps

2021ba9

coordinator: preserve status during epoch bumps

271b310

lint: fix generation fence static checks

5e4c6b6

Merge remote-tracking branch 'upstream/master' into codex/fix-5083-ge…

a078c36

…neration-fence # Conflicts: # coordinator/controller.go # heartbeatpb/heartbeat.pb.go

coordinator: keep epoch fallback compatibility

260596e

dispatcher: cover bootstrap operator generation filter

823baf3

coordinator: simplify changefeed config marshaling

c925c80

coordinator: inject pd client into operator controller

68c91e2

hongyunyan wants to merge 10 commits into pingcap:master from hongyunyan:codex/fix-5083-generation-fence
