[fix][broker]Do not trigger topic GC if replication is still active#25915
[fix][broker]Do not trigger topic GC if replication is still active#25915poorbarcode wants to merge 20 commits into
Conversation
codelipenghui
left a comment
There was a problem hiding this comment.
@poorbarcode I'm thinking another solution
- Close the replicator producer if the producer is idle for a while (e.g. 10 mins)
- Check all the producers for detecting the inactive topic (now it's more complicated to skip the replicator producer)
- Only delete the inactive topic if there is no producers (including the replicator producer)
Now, your solution added 7 days delay for inactive topic deletion if geo-replication enable. But if there is no messages from last 7 days, the issue can still happen, right?
I agree with this concern. A replicated topic can be quiet for longer than the threshold while the replication relationship is still valid. Using |
|
@codelipenghui @void-ptr974 Changed the solution as @codelipenghui suggested, please review again |
f4194c9 to
0c2ce19
Compare
|
@codelipenghui @lhotari @void-ptr974 Please review the PR again. I have changed the solution:
|
| // Start producer and retry. | ||
| if (state == Disconnected) { | ||
| startProducer(); | ||
| retryReplicateEntries.run(); | ||
| return; |
There was a problem hiding this comment.
[BUG] Auto-resume can replicate newer positions before older ones → reordering, and silent message loss when dedup is enabled
After an idle disconnect, this resume path can replicate a newer batch before the older one it is holding, which breaks replication ordering and — with deduplication enabled — silently drops the older message.
Sequence: when a wait-read completes while state == Disconnected, this branch holds that batch (call it R1, the older positions) and reschedules it ~100ms later, while also calling startProducer(). startProducer() → setProducerAndTriggerReadEntries() flips the state to Started and immediately calls readMoreEntries():
Because R1's InFlightTask already has entries != null, it is no longer counted as a pending read, so a fresh read R2 (strictly newer positions — the cursor read position already advanced past R1 and is no longer rewound) is issued and can be sent by doReplicateEntries(R2) before R1's 100ms-deferred retry fires.
Two consequences:
- Replication order across the reconnect can be violated.
- The receiving cluster dedups replicated messages by source position, not wire order:
Once R2's higher position is recorded, R1 (lower position) is classified a duplicate and discarded on the remote, while the source side mark-deletes R1 as delivered on the (successful) send callback:
i.e. permanent silent message loss across the reconnect. (Loss requires topic dedup enabled; without it the symptom is out-of-order replication, which is still a correctness issue for ordered consumers.)
The previous design avoided this: cursor.rewind() on (re)connect plus pending-read cancellation guaranteed in-order resume from markDelete + 1. Could we hold the new read until the held batch has been replicated (or rewind / cancel the pending read on disconnect) so resume is strictly in order?
There was a problem hiding this comment.
There's also a replicator dedup related PR in #25860 by @void-ptr974
There was a problem hiding this comment.
Because R1's InFlightTask already has entries != null, it is no longer counted as a pending read, so a fresh read R2 (strictly newer positions — the cursor read position already advanced past R1 and is no longer rewound) is issued and can be sent by doReplicateEntries(R2) before R1's 100ms-deferred retry fires.
Could you point why the entries is not null?
There was a problem hiding this comment.
@poorbarcode — entries is set as the very first action of the read callback, before any state is checked. In readEntriesComplete:
InFlightTask inFlightTask = (InFlightTask) ctx;
inFlightTask.setEntries(entries); // entries becomes non-null hereSo by the time we get into replicateEntries and hit the Disconnected branch (startProducer(); retryReplicateEntries.run(); return;), inFlightTask.entries is already populated, and that branch reschedules R1 ~100 ms later without resetting entries to null. R1 also can't be recycled while it's parked (isDone() is false: completedEntries == 0 < entries.size()), so it stays in inFlightTasks with non-null entries.
That's what removes the gate: the "pending read" test is exactly readPos != null && entries == null in both getPermitsIfNoPendingRead (L961) and hasPendingRead (L1054). With entries != null, R1 no longer counts as a pending read, so when startProducer() flips the state to Started and setProducerAndTriggerReadEntries() calls readMoreEntries(), a fresh read R2 is issued at the advanced cursor.getReadPosition() — and R2 can be sent before R1's deferred retry fires.
Worth noting: the GC/idle disconnect() path goes through beforeDisconnect() (backlog check only), so unlike beforeTerminateOrCursorRewinding() it does not cancelPendingReadRequest() or rewind the cursor — which is also why R1's wait-read survives the producer close and completes while Disconnected in the first place. This is essentially what the cursor.rewind() + pending-read cancellation that this PR removed from setProducerAndTriggerReadEntries used to prevent.
On the consequence: repl dedup is a high-water-mark keyed by source position (isDuplicateReplV2), so R2 (newer) being recorded before R1 (older) is enough — once the mark is past R1, R1 comes back as a duplicate. If R2 is already persisted, R1 gets Dup, which PersistentTopic acks as success, so the source mark-deletes it → silent loss; if R2 is only pushed-not-persisted, R1 gets Unknown and the source retries (recoverable). With dedup off it's out-of-order delivery. The simplest fix is to keep resume strictly in order — don't let the deferred retry leave a held batch counting as "not a pending read" while a fresh read is issued ahead of it (e.g. gate the new read on the held task, or rewind/cancel on disconnect as before).
There was a problem hiding this comment.
Thanks for the explanation, I should let the check to be in readCompete, rather than let it to be in replicatEntries.
Fixed
There's also a replicator dedup related PR in #25860 by @void-ptr974
I will review it tomorrow ❤️
| // Has backlog. | ||
| long backlog = getNumberOfEntriesInBacklog(); | ||
| if (backlog > 0) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
[QUALITY] Loss-safety now rests entirely on the backlog == 0 precondition — worth documenting the invariant
It would help to document (a comment, or an assertion) that this backlog == 0 precondition is now load-bearing for correctness. Since both the cursor.rewind() on reconnect and the afterDisconnected() rewind were removed, nothing re-reads entries that are read-but-not-yet-acked at producer restart. No-loss safety now rests entirely on (a) this backlog == 0 check and (b) beforeDisconnect() rejecting any in-flight task below the last confirmed entry:
There's no construct of a loss case under these guards currently, so this is fine as written — but the invariant is implicit and easy to break. A future disconnect trigger that doesn't enforce backlog == 0, or a regression in beforeDisconnect(), would silently drop those entries, with no rewind left as a safety net. A short comment stating the requirement would make that explicit.
There was a problem hiding this comment.
I can not understand what concern is it
| long estimatedTimeStampProducerConnected = this.estimatedTimeStampProducerConnected; | ||
| long delayMillis; | ||
| if (estimatedTimeStampProducerConnected > System.currentTimeMillis()) { | ||
| delayMillis = (estimatedTimeStampProducerConnected - System.currentTimeMillis()) + 100; |
There was a problem hiding this comment.
line-412 prevents the situation you mentioned
There was a problem hiding this comment.
|
Please check the code style. |
56026a4 to
cd58d4d
Compare
Motivation
Replication is stuck due to the Topic GC
primary clusterandbackup clusterprimary cluster: Publishing messages intoprimary clusterbackup cluster: No consumer/producer is registered.backup cluster: Check topic GCremote producerbackup cluster: the topic GC progress is waiting for the remote-producer to be disconnected.backup cluster: backlog increases because the replicator was closedModifications
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes