
eventcollector: refactor dispatcher registration and session lifecycle#5175

Open
lidezhu wants to merge 17 commits into master from ldz/refactor-event-collector006

@lidezhu lidezhu commented May 30, 2026

What problem does this PR solve?

Issue Number: close #xxx

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Refactor
    • Streamlined event service registration with explicit local and remote discovery entrypoints
    • Enhanced heartbeat tracking mechanism for improved dispatcher progress accuracy
    • Optimized remote event service discovery workflow with probing-based initialization

ti-chi-bot Bot commented May 30, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels May 30, 2026

ti-chi-bot Bot commented May 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign charlescheung96 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai Bot commented May 30, 2026

📝 Walkthrough

This PR refactors dispatcher session orchestration in TiCDC's event collector by introducing explicit registration entrypoints (startLocalRegistration, startRemoteProbing, retryCurrentRegistrationIfRemovedFrom), consolidating reset semantics into session-owned helpers (commitLocalRegistration, resetCurrentEventService), changing heartbeat reporting from separate accessors to a unified getHeartbeatReport(), and updating all consumers to use the new APIs.

Changes

Dispatcher Session Registration and State Management Refactoring

| Layer / File(s) | Summary |
|---|---|
| Session orchestration refactoring: registration, probing, and reset entrypoints (`downstreamadapter/eventcollector/dispatcher_session.go`) | Introduces startLocalRegistration, startRemoteProbing, retryCurrentRegistrationIfRemovedFrom for explicit registration flow control; adds commitLocalRegistration and resetCurrentEventService to replace generic ready/reset handlers; removes isCurrentEventService wrappers; documents removeFromLocked terminal vs stale-cleanup semantics. |
| Dispatcher state: heartbeat report generation and localized reset calls (`downstreamadapter/eventcollector/dispatcher_stat.go`) | Replaces getHeartbeatProgressForEventService() with getHeartbeatReport() returning (eventServiceID, checkpointTs, epoch, ok); changes multiple reset paths to call session.resetCurrentEventService() directly; updates state-helper comments to document epoch, sequence, heartbeat, and session responsibilities. |
| Event collector: integration of new session APIs and heartbeat semantics (`downstreamadapter/eventcollector/event_collector.go`) | Updates CommitAddDispatcher to use commitLocalRegistration(); refactors groupHeartbeat to rely on getHeartbeatReport() with ok-gate; changes dispatcher-state-removed recovery to retryCurrentRegistrationIfRemovedFrom(); validates event service ID via IsEmpty(). |
| Test updates: registration entrypoints, heartbeat reporting, and reset behavior (`downstreamadapter/eventcollector/dispatcher_stat_test.go`, `downstreamadapter/eventcollector/event_collector_test.go`) | Replaces TestRegisterTo with TestRegistrationEntrypoints covering local registration, remote probing, and retry-after-removal flows; updates checkpoint tests to use getHeartbeatReport() and session-level doReset(); adjusts concurrent registration tests to startRemoteProbing(). |
| Log coordinator client: remote probing entrypoint (`downstreamadapter/eventcollector/log_coordinator_client.go`) | Changes ReusableEventServiceResponse handling from setRemoteCandidates(nodes) to startRemoteProbing(nodes) to align with new session orchestration entrypoints. |
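
The ok-gated heartbeat report described in the table above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `nodeID`, `dispatcherStat`, and `progress` types are simplified stand-ins rather than the TiCDC definitions, and locking is left out. Only the return shape (eventServiceID, checkpointTs, epoch, ok) and the idea of skipping dispatchers when ok is false come from the summary.

```go
package main

import "fmt"

// nodeID is a hypothetical stand-in for the real node ID type.
type nodeID string

func (id nodeID) IsEmpty() bool { return id == "" }

// dispatcherStat is a simplified model; the real struct has more state and locking.
type dispatcherStat struct {
	serviceID    nodeID // currently bound event service, empty if none
	checkpointTs uint64
	epoch        uint64
}

// getHeartbeatReport returns one snapshot of the fields a heartbeat needs.
// ok is false when the dispatcher has no current event service.
func (d *dispatcherStat) getHeartbeatReport() (nodeID, uint64, uint64, bool) {
	if d.serviceID.IsEmpty() {
		return "", 0, 0, false
	}
	return d.serviceID, d.checkpointTs, d.epoch, true
}

// progress is a hypothetical per-dispatcher heartbeat entry.
type progress struct {
	checkpointTs uint64
	epoch        uint64
}

// groupHeartbeat buckets progress by event service and skips any
// dispatcher whose report is not ok.
func groupHeartbeat(stats []*dispatcherStat) map[nodeID][]progress {
	grouped := make(map[nodeID][]progress)
	for _, d := range stats {
		id, ts, epoch, ok := d.getHeartbeatReport()
		if !ok {
			continue
		}
		grouped[id] = append(grouped[id], progress{checkpointTs: ts, epoch: epoch})
	}
	return grouped
}

func main() {
	stats := []*dispatcherStat{
		{serviceID: "node-a", checkpointTs: 100, epoch: 1},
		{serviceID: "", checkpointTs: 90}, // not bound yet, skipped
		{serviceID: "node-a", checkpointTs: 120, epoch: 2},
	}
	fmt.Println(len(groupHeartbeat(stats)["node-a"]))
}
```

Returning all fields from one call, instead of from separate accessors, lets the caller treat them as a single snapshot, which is presumably the motivation for the unified report.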

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • pingcap/ticdc#5022: Refactors the same dispatcher_session.go dispatcher-session transition and orchestration logic (local registration vs remote probing, ready/not-reusable handling, and reset/remove semantics), with overlapping state-machine clarification work.

Suggested labels

lgtm, approved

Suggested reviewers

  • hongyunyan
  • asddongmen

Poem

🐰 A session springs forth with paths so clear,
Local roots and probes both far and near,
Reset echoes through the heartbeat's sway,
No more wrapped checks—just direct display!
Each layer listens, true and bright.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly summarizes the main change: a refactor of dispatcher registration and session lifecycle in the eventcollector component, which aligns with the primary modifications across all affected files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)


@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 30, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the registration and session management logic for dispatchers in the event collector. It introduces more explicit lifecycle methods (such as startLocalRegistration, retryCurrentRegistration, and startRemoteProbing) and encapsulates control-plane transitions within dispatcherSession and dispatcherStat. A critical race condition was identified in retryCurrentRegistration where a concurrent call to remove() could clear the current event service ID, leading to a panic. A code suggestion has been provided to handle this scenario gracefully.

Comment thread downstreamadapter/eventcollector/dispatcher_session.go Outdated
@lidezhu lidezhu changed the title [WIP] eventcollector: refactor dispatcher registration and session lifecycle May 31, 2026
@lidezhu lidezhu marked this pull request as ready for review May 31, 2026 00:11
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 31, 2026

lidezhu commented May 31, 2026

/gemini review

lidezhu commented May 31, 2026

@coderabbitai review

coderabbitai Bot commented May 31, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the registration, reset, and probing lifecycle of dispatchers in the event collector. It replaces generic registration and reset methods with more specific, context-aware entry points such as startLocalRegistration, retryCurrentRegistrationIfRemovedFrom, and resetCurrentEventService. Additionally, it consolidates heartbeat progress queries into a single getHeartbeatReport method. Feedback on these changes suggests adding a check in retryCurrentRegistrationIfRemovedFrom to prevent unnecessary operations and logging if the dispatcher session has already been removed.

Comment on lines +314 to +319
func (s *dispatcherSession) retryCurrentRegistrationIfRemovedFrom(serverID node.ID) bool {
	s.requestMu.Lock()
	defer s.requestMu.Unlock()
	if s.connState.getCurrentEventServiceID() != serverID {
		return false
	}

medium

If the dispatcher session has already been removed, we should avoid logging the retry message and attempting to register. Checking s.connState.isRemoved() at the beginning of retryCurrentRegistrationIfRemovedFrom prevents unnecessary logging and operations on a terminated session.

func (s *dispatcherSession) retryCurrentRegistrationIfRemovedFrom(serverID node.ID) bool {
	s.requestMu.Lock()
	defer s.requestMu.Unlock()
	if s.connState.isRemoved() {
		return false
	}
	if s.connState.getCurrentEventServiceID() != serverID {
		return false
	}
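
The effect of the suggested guard can be shown with a small self-contained sketch. It assumes a toy `session` type with hypothetical fields (`removed`, `currentID`, `retries`) in place of the real `dispatcherSession` and `connState`, and it assumes, following the earlier review comment, that removal clears the current event service ID.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical, simplified model of a dispatcher session.
type session struct {
	mu        sync.Mutex
	removed   bool
	currentID string
	retries   int // counts register requests that would be re-sent
}

// remove marks the session as terminated and clears its binding
// (assumed behavior, for illustration only).
func (s *session) remove() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.removed = true
	s.currentID = ""
}

// retryIfRemovedFrom retries only when the session is still alive and
// still bound to serverID. Both checks run under the same lock.
func (s *session) retryIfRemovedFrom(serverID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.removed {
		return false
	}
	if s.currentID != serverID {
		return false
	}
	s.retries++ // stands in for logging and re-sending the register request
	return true
}

func main() {
	s := &session{currentID: "node-a"}
	fmt.Println(s.retryIfRemovedFrom("node-a"))
	s.remove()
	fmt.Println(s.retryIfRemovedFrom("node-a"))
	fmt.Println(s.retryIfRemovedFrom(""))
}
```

In this model the ID comparison alone would accept an empty `serverID` after removal, because the cleared ID is also empty; the removed check placed first rules that case out.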

ti-chi-bot Bot commented May 31, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
downstreamadapter/eventcollector/dispatcher_session.go (1)

586-600: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Don't make the init-time remote-ready path reachable until the callback contract is updated.

This now starts remote probing while the session can still be in the readyCallback != nil init state, but handleAcceptedRemoteReadyLocked still panics on that combination. The state-machine comment above explicitly allows “remote ready first”, so the first successful reusable remote during add will crash instead of resetting the remote stream. Either defer probing until after the initial local commit, or make the remote-ready path tolerate the init callback state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/eventcollector/dispatcher_session.go` around lines 586 -
600, startRemoteProbing may run while the session is still in the init
"readyCallback != nil" state, which causes handleAcceptedRemoteReadyLocked to
panic; change startRemoteProbing (in dispatcherSession) to not start probing
when the init ready callback is present: inside startRemoteProbing (after
acquiring s.requestMu and before calling s.connState.beginRemoteProbing /
s.sendRegisterRequest) check the connection/init callback state (readyCallback
!= nil or equivalent on s.connState) and if it's non-nil, defer probing by
returning early or by saving the candidate nodes to a pending field to be
processed once the init callback completes (ensure the same lock protects the
pending field), so the remote probing path is only reachable after the initial
local commit; update any tests to cover the deferred-probing behavior.
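
One way to read the deferred-probing option is sketched below. This is an illustration under stated assumptions, not the PR's code: `initPending` stands in for the `readyCallback != nil` init state, `pendingCandidates` is the hypothetical pending field the prompt mentions, and the lock that would protect both fields in the real session is omitted.

```go
package main

import "fmt"

// session is a hypothetical, simplified model of a dispatcher session.
type session struct {
	initPending       bool     // stands in for readyCallback != nil
	pendingCandidates []string // candidates received while init is pending
	probing           []string // candidates currently being probed
}

// startRemoteProbing defers probing while the init callback is pending,
// so the remote-ready path cannot be reached before the local commit.
func (s *session) startRemoteProbing(nodes []string) {
	if s.initPending {
		// Copy so later changes to the caller's slice do not leak in.
		s.pendingCandidates = append([]string(nil), nodes...)
		return
	}
	s.probing = nodes
}

// commitLocalRegistration ends the init state and replays any deferred
// candidates through the normal probing entrypoint.
func (s *session) commitLocalRegistration() {
	s.initPending = false
	if pending := s.pendingCandidates; len(pending) > 0 {
		s.pendingCandidates = nil
		s.startRemoteProbing(pending)
	}
}

func main() {
	s := &session{initPending: true}
	s.startRemoteProbing([]string{"node-b", "node-c"})
	fmt.Println(len(s.probing)) // nothing is probed during init
	s.commitLocalRegistration()
	fmt.Println(len(s.probing)) // deferred candidates are probed now
}
```

The alternative named in the comment, making the remote-ready handler tolerate the init callback state, would avoid the extra field at the cost of a more permissive state machine.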

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1b8c6473-7300-4fac-9b7e-17000770830c

📥 Commits

Reviewing files that changed from the base of the PR and between 99f4859 and 03c0a6d.

📒 Files selected for processing (6)
  • downstreamadapter/eventcollector/dispatcher_session.go
  • downstreamadapter/eventcollector/dispatcher_stat.go
  • downstreamadapter/eventcollector/dispatcher_stat_test.go
  • downstreamadapter/eventcollector/event_collector.go
  • downstreamadapter/eventcollector/event_collector_test.go
  • downstreamadapter/eventcollector/log_coordinator_client.go

Comment on lines +355 to 359
// commitLocalRegistration commits the accepted local registration by sending
// RESET to the local EventService.
func (s *dispatcherSession) commitLocalRegistration() {
	s.doReset(s.localServerID, s.target.GetCheckpointTs())
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear the one-shot readyCallback when local registration is committed.

After commitLocalRegistration() runs, readyCallback is still non-nil. Any later accepted local ready will re-enter the init-only callback branch in handleAcceptedLocalReadyLocked and skip the normal RESET path, which breaks re-registration after removal/reconnect.

Proposed fix
 func (s *dispatcherSession) commitLocalRegistration() {
-	s.doReset(s.localServerID, s.target.GetCheckpointTs())
+	s.requestMu.Lock()
+	defer s.requestMu.Unlock()
+	if s.connState.isRemoved() {
+		return
+	}
+	s.readyCallback = nil
+	s.doResetLocked(s.localServerID, s.target.GetCheckpointTs())
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 // commitLocalRegistration commits the accepted local registration by sending
 // RESET to the local EventService.
 func (s *dispatcherSession) commitLocalRegistration() {
-	s.doReset(s.localServerID, s.target.GetCheckpointTs())
+	s.requestMu.Lock()
+	defer s.requestMu.Unlock()
+	if s.connState.isRemoved() {
+		return
+	}
+	s.readyCallback = nil
+	s.doResetLocked(s.localServerID, s.target.GetCheckpointTs())
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/eventcollector/dispatcher_session.go` around lines 355 -
359, commitLocalRegistration currently calls doReset but leaves the one-shot
readyCallback set, so subsequent accepted local ready messages re-enter the
init-only branch in handleAcceptedLocalReadyLocked and skip the normal
RESET/re-registration path; fix by clearing the one-shot callback (set
s.readyCallback = nil) at the end of commitLocalRegistration (or immediately
after doReset) so the init-only callback cannot be reused and normal
RESET/re-registration logic in handleAcceptedLocalReadyLocked runs on future
events.
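
The one-shot behavior this comment asks for can be sketched in isolation. The `session` type, its `resets` counter, and `handleLocalReady` are hypothetical simplifications of the real session and `handleAcceptedLocalReadyLocked`; the sketch only shows why clearing the callback on commit changes which branch later ready messages take.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical, simplified model of a dispatcher session.
type session struct {
	mu            sync.Mutex
	readyCallback func() // one-shot callback used only during init
	resets        int    // counts RESET messages that would be sent
}

// handleLocalReady takes the init-only branch while the callback is set
// and the normal RESET path otherwise.
func (s *session) handleLocalReady() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.readyCallback != nil {
		s.readyCallback()
		return
	}
	s.resets++
}

// commitLocalRegistration clears the one-shot callback before resetting,
// so later ready messages reach the normal RESET path.
func (s *session) commitLocalRegistration() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.readyCallback = nil
	s.resets++
}

func main() {
	calls := 0
	s := &session{readyCallback: func() { calls++ }}
	s.handleLocalReady()        // init: the callback runs once
	s.commitLocalRegistration() // first RESET, callback cleared
	s.handleLocalReady()        // re-registration: normal RESET path
	fmt.Println(calls, s.resets) // prints: 1 2
}
```

Without the `s.readyCallback = nil` line, the second `handleLocalReady` call would invoke the callback again and skip the RESET, which matches the failure mode described above.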

Labels

  • do-not-merge/needs-linked-issue
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
