[improvement](recycler) Avoid single-point read/write during sequentially reading key #62476
wyxxxcat wants to merge 1 commit into apache:master
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
/review
Blocking issue found.
- Goal of this PR
  Reduce per-rowset point reads/writes while scanning recycler keys. The batching direction is reasonable, but the current implementation changes recycler semantics on a correctness-critical path and is not safe as written.
- Critical checkpoint conclusions
  - Goal accomplished: Partially. Point operations are reduced, but the deferred abort path introduces a race that can delete rowset data for a transaction/job that commits before loop_done() runs.
  - Change size/focus: Focused on recycler batching, but it changes behavior in both recycle_rowsets() and recycle_tmp_rowsets().
  - Concurrency: Involved and currently unsafe. The recycler scan thread now queues abort work and only executes it at batch end, while concurrent commit_rowset / commit_txn / finish_tablet_job RPCs can still succeed during that window when enable_mark_delete_rowset_before_recycle=false.
  - Lifecycle/static init: No special lifecycle or static initialization concerns found in this PR.
  - Config changes: No new configs added, but an existing supported config combination regresses: enable_abort_txn_and_job_for_delete_rowset_before_recycle=true with enable_mark_delete_rowset_before_recycle=false.
  - Compatibility: No protocol/storage compatibility change observed.
  - Parallel paths: The same regression exists in both formal recycle-rowset and tmp-rowset paths.
  - Special conditions: The existing end_version() != 1 gate remains; no new explanation issues beyond the race above.
  - Test coverage: Existing tests mainly exercise the mark-before-delete flow and do not cover the interleaving where a commit/job-finish wins the race before deferred abort execution. I did not find new coverage for this regression.
  - Observability: Logging is adequate for tracing the new path.
  - Transaction/persistence: No new persistence format issue, but transaction/job state handling is where the correctness regression is introduced.
  - Data writes/modifications: Not safe on the affected path because object deletion can proceed after a successful commit that happened before the deferred abort ran.
  - FE/BE variable passing: Not applicable.
  - Performance: The batching optimization is valid in principle.
  - Other issues: No second independent blocker found beyond the deferred-abort race.
- Recycler-specific checkpoints
  - Mark-before-delete two-phase flow: Preserved when mark-delete is enabled.
  - Abort-before-delete aligned with origin: The mapping is still correct (load -> txn, compaction/schema-change -> job), but timing is no longer safe because abort is deferred.
  - Packed files: Not affected by this PR.
  - Conflict/retry/idempotency: The new deferred abort flow is not restart-safe enough on the affected config path because a concurrent commit can invalidate the recycler's stale decision before deletion submission.
Because this is a data-correctness issue on a supported code path, this should not be approved yet.
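The interleaving described in the checkpoints above can be reproduced in a small deterministic model. This is only a sketch: `ToyMetaStore`, `OwnerState`, and `batch_deletes_committed_rowset` are hypothetical names invented for illustration, not recycler or meta-service types, and the concurrent commit is simplified to a single step that lands after the scan loop and before the batch-end work.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for the state of the txn/job that owns a rowset (hypothetical).
enum class OwnerState { kPrepared, kCommitted, kAborted };

struct ToyMetaStore {
    std::unordered_map<std::string, OwnerState> owners; // rowset id -> owner state

    // Models a commit RPC: it succeeds only while the owner is still prepared.
    bool commit(const std::string& rowset_id) {
        auto it = owners.find(rowset_id);
        if (it == owners.end() || it->second != OwnerState::kPrepared) return false;
        it->second = OwnerState::kCommitted;
        return true;
    }

    // Models an abort that returns 0 even if the owner has already committed,
    // which is the case the review comment worries about.
    int abort_owner(const std::string& rowset_id) {
        auto it = owners.find(rowset_id);
        assert(it != owners.end());
        if (it->second == OwnerState::kPrepared) it->second = OwnerState::kAborted;
        return 0;
    }
};

// Runs one scan batch and reports whether deletion would hit a rowset whose
// owner committed. `late_commit` is the rowset whose commit arrives after the
// scanner saw its key but before the batch-end step.
bool batch_deletes_committed_rowset(bool defer_abort,
                                    const std::vector<std::string>& batch,
                                    const std::string& late_commit) {
    ToyMetaStore store;
    for (const auto& id : batch) store.owners[id] = OwnerState::kPrepared;

    std::vector<std::string> queued_aborts;
    for (const auto& id : batch) {
        if (defer_abort) {
            queued_aborts.push_back(id); // new behavior: only queue the abort
        } else {
            store.abort_owner(id);       // old behavior: abort while scanning
        }
    }

    store.commit(late_commit); // concurrent commit lands inside the window

    for (const auto& id : queued_aborts) store.abort_owner(id); // batch-end aborts

    // Deletion is submitted for every scanned rowset, based on the scan result.
    for (const auto& id : batch) {
        if (store.owners.at(id) == OwnerState::kCommitted) return true;
    }
    return false;
}
```

In this model, calling the function with `defer_abort = true` reports a deleted committed rowset, while `defer_abort = false` does not, because the inline abort makes the later commit fail.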
rowset_meta->end_version() != 1) {
    if (auto abort_task = make_deferred_abort_task(rowset); abort_task.has_value()) {
        LOG(INFO) << "rowset queued to abort related txn or job after current scan batch, "
                     "instance_id="
Deferring this abort until loop_done() introduces a correctness race when enable_mark_delete_rowset_before_recycle=false. Before this PR, abort_txn_or_job_for_recycle() ran immediately after we saw the expired recycle-rowset key, so a concurrent commit_rowset / finish_tablet_job could not make the rowset live before deletion was scheduled. Now the scanner only queues the abort and keeps walking the batch. A concurrent commit can succeed in that window, and then loop_done() will still submit deletion based on the stale scan result. If the later abort sees an already-committed txn/job and returns 0, we end up deleting committed rowset data. The batching optimization needs to keep the abort-before-delete invariant for this supported config path.
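One way the abort-before-delete invariant could be kept while still batching is to let the batch-end abort report what it found and to drop any rowset whose owner had already committed. The sketch below assumes a hypothetical `AbortOutcome` result and a hypothetical `select_deletable` helper; neither exists in the PR, and the real abort helper's return convention may differ.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical result of aborting the txn/job that owns a rowset.
enum class AbortOutcome { kAborted, kAlreadyAborted, kAlreadyCommitted };

// Runs the queued aborts at batch end and keeps only the rowsets that are
// still safe to delete. `abort_owner` is any callable taking a rowset id and
// returning an AbortOutcome (an assumption made for this sketch).
template <typename AbortFn>
std::vector<std::string> select_deletable(const std::vector<std::string>& queued,
                                          AbortFn&& abort_owner) {
    std::vector<std::string> deletable;
    for (const auto& rowset_id : queued) {
        assert(!rowset_id.empty());
        const AbortOutcome outcome = abort_owner(rowset_id);
        // A commit won the race: the scan result is stale, so skip deletion.
        if (outcome == AbortOutcome::kAlreadyCommitted) continue;
        deletable.push_back(rowset_id);
    }
    return deletable;
}
```

The design choice here is that deletion is driven by the abort result rather than by the earlier scan, so a stale scan entry cannot by itself cause data removal.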
return ret;
    if (auto abort_task = make_deferred_abort_task(rowset); abort_task.has_value()) {
        LOG(INFO) << "rowset queued to abort related txn or job after current scan batch, "
                     "instance_id="
Same race in the tmp-rowset path. With mark-before-delete disabled, moving the abort from the scan loop to loop_done() widens the window where commit_rowset / commit_txn / finish_tablet_job can succeed before the recycler aborts the owner txn/job. After that, the worker still deletes the tmp rowset based on a stale snapshot from the scan. This regresses the old abort-before-delete behavior and can delete rowset data that has just become visible.
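Since both comments tie the race to the configuration where mark-before-delete is disabled, another option is to defer the abort only when the mark-delete flow is on and keep it inline otherwise. The following sketch assumes, as the review implies but does not prove, that the mark-delete path is unaffected by deferral. `RecyclerFlags`, `AbortTiming`, and `choose_abort_timing` are hypothetical names; the two fields merely mirror the config options named in the review.

```cpp
#include <cassert>

// Hypothetical mirror of the two config options discussed in the review.
struct RecyclerFlags {
    bool abort_txn_and_job_before_recycle; // enable_abort_txn_and_job_for_delete_rowset_before_recycle
    bool mark_delete_before_recycle;       // enable_mark_delete_rowset_before_recycle
};

enum class AbortTiming { kNone, kInline, kDeferred };

// Picks when the owner txn/job should be aborted relative to the scan loop.
AbortTiming choose_abort_timing(const RecyclerFlags& flags) {
    if (!flags.abort_txn_and_job_before_recycle) return AbortTiming::kNone;
    // Without mark-before-delete there is no second phase to catch a late
    // commit, so the abort stays inside the scan loop as before the PR.
    return flags.mark_delete_before_recycle ? AbortTiming::kDeferred
                                            : AbortTiming::kInline;
}
```

This gives up the batching benefit on the affected configuration in exchange for keeping its previous behavior.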
342cec4 to 23e5acb
What problem does this PR solve?
fix: #58459
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)