
[improvement](recycler) Avoid single-point read/write during sequentially reading key #62476

Draft

wyxxxcat wants to merge 1 commit into apache:master from wyxxxcat:reduce_point_read_at_recycle

Conversation

@wyxxxcat (Collaborator) commented Apr 14, 2026

What problem does this PR solve?

fix: #58459

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen (Contributor)

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (ideally including the specific error message) and how it was fixed.
  2. Which behaviors were modified: what was the previous behavior, what is it now, why was it modified, and what impact might the change have.
  3. What features were added and why.
  4. Which code was refactored and why.
  5. Which functions were optimized and what differs before and after the optimization.

@wyxxxcat (Collaborator, Author)

/review


@github-actions github-actions bot left a comment


Blocking issue found.

  1. Goal of this PR
    Reduce per-rowset point reads/writes while scanning recycler keys. The batching direction is reasonable, but the current implementation changes recycler semantics on a correctness-critical path and is not safe as written.

  2. Critical checkpoint conclusions

  • Goal accomplished: Partially. Point operations are reduced, but the deferred abort path introduces a race that can delete rowset data for a transaction/job that commits before loop_done() runs.
  • Change size/focus: Focused to recycler batching, but it changes behavior in both recycle_rowsets() and recycle_tmp_rowsets().
  • Concurrency: Involved and currently unsafe. The recycler scan thread now queues abort work and only executes it at batch end, while concurrent commit_rowset / commit_txn / finish_tablet_job RPCs can still succeed during that window when enable_mark_delete_rowset_before_recycle=false.
  • Lifecycle/static init: No special lifecycle or static initialization concerns found in this PR.
  • Config changes: No new configs added, but an existing supported config combination regresses: enable_abort_txn_and_job_for_delete_rowset_before_recycle=true with enable_mark_delete_rowset_before_recycle=false.
  • Compatibility: No protocol/storage compatibility change observed.
  • Parallel paths: The same regression exists in both formal recycle-rowset and tmp-rowset paths.
  • Special conditions: The existing end_version() != 1 gate remains; no new explanation issues beyond the race above.
  • Test coverage: Existing tests mainly exercise the mark-before-delete flow and do not cover the interleaving where a commit/job-finish wins the race before deferred abort execution. I did not find new coverage for this regression.
  • Observability: Logging is adequate for tracing the new path.
  • Transaction/persistence: No new persistence format issue, but transaction/job state handling is where the correctness regression is introduced.
  • Data writes/modifications: Not safe on the affected path because object deletion can proceed after a successful commit that happened before the deferred abort ran.
  • FE/BE variable passing: Not applicable.
  • Performance: The batching optimization is valid in principle.
  • Other issues: No second independent blocker found beyond the deferred-abort race.
  3. Recycler-specific checkpoints
  • Mark-before-delete two-phase flow: Preserved when mark-delete is enabled.
  • Abort-before-delete aligned with origin: The mapping is still correct (load -> txn, compaction/schema-change -> job), but timing is no longer safe because abort is deferred.
  • Packed files: Not affected by this PR.
  • Conflict/retry/idempotency: The new deferred abort flow is not restart-safe enough on the affected config path because a concurrent commit can invalidate the recycler's stale decision before deletion submission.

Because this is a data-correctness issue on a supported code path, this should not be approved yet.

```cpp
rowset_meta->end_version() != 1) {
    if (auto abort_task = make_deferred_abort_task(rowset); abort_task.has_value()) {
        LOG(INFO) << "rowset queued to abort related txn or job after current scan batch, "
                     "instance_id="
```

Deferring this abort until loop_done() introduces a correctness race when enable_mark_delete_rowset_before_recycle=false. Before this PR, abort_txn_or_job_for_recycle() ran immediately after we saw the expired recycle-rowset key, so a concurrent commit_rowset / finish_tablet_job could not make the rowset live before deletion was scheduled. Now the scanner only queues the abort and keeps walking the batch. A concurrent commit can succeed in that window, and then loop_done() will still submit deletion based on the stale scan result. If the later abort sees an already-committed txn/job and returns 0, we end up deleting committed rowset data. The batching optimization needs to keep the abort-before-delete invariant for this supported config path.

```cpp
return ret;
    if (auto abort_task = make_deferred_abort_task(rowset); abort_task.has_value()) {
        LOG(INFO) << "rowset queued to abort related txn or job after current scan batch, "
                     "instance_id="
```

Same race in the tmp-rowset path. With mark-before-delete disabled, moving the abort from the scan loop to loop_done() widens the window where commit_rowset / commit_txn / finish_tablet_job can succeed before the recycler aborts the owner txn/job. After that, the worker still deletes the tmp rowset based on a stale snapshot from the scan. This regresses the old abort-before-delete behavior and can delete rowset data that has just become visible.

@wyxxxcat wyxxxcat marked this pull request as draft April 14, 2026 07:23
@wyxxxcat wyxxxcat force-pushed the reduce_point_read_at_recycle branch from 342cec4 to 23e5acb Compare April 14, 2026 07:32