Skip to content

redo destroy pg#435

Open
JacksonYao287 wants to merge 1 commit into
eBay:stable/v4.xfrom
JacksonYao287:fix-issues-of-destroy-pg
Open

redo destroy pg#435
JacksonYao287 wants to merge 1 commit into
eBay:stable/v4.xfrom
JacksonYao287:fix-issues-of-destroy-pg

Conversation

@JacksonYao287

Copy link
Copy Markdown
Collaborator

No description provided.

@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 17.24138% with 24 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v4.x@e1c23e1). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/lib/homestore_backend/hs_pg_manager.cpp 0.00% 19 Missing ⚠️
...ib/homestore_backend/replication_state_machine.cpp 55.55% 4 Missing ⚠️
src/lib/homestore_backend/gc_manager.cpp 0.00% 0 Missing and 1 partial ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@              Coverage Diff               @@
##             stable/v4.x     #435   +/-   ##
==============================================
  Coverage               ?   45.60%           
==============================================
  Files                  ?       36           
  Lines                  ?     5436           
  Branches               ?      685           
==============================================
  Hits                   ?     2479           
  Misses                 ?     2683           
  Partials               ?      274           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JacksonYao287 JacksonYao287 force-pushed the fix-issues-of-destroy-pg branch from 0b5e4a4 to c686ce2 Compare June 16, 2026 11:59
// when reaching here, we will not receive any new log, both for BR or leave_group case. we wait for all logs to be
// committed and trigger a cp, so that when restart, no log will be replayed since cp_lsn is equal to
// last_append_lsn.
while (true) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this while loop is probably wrong.
if this member receive a leave_group, who will tell it to commit all appened log?
similar in BR, if there are some appended but not yet committed log in logstore, but leader decide to send us a BR, probably leader wont tell us to commit

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the intent here?
avoid the ongoing commit thread racing with destroying?
crash recovery?

@JacksonYao287 JacksonYao287 Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the intent here?
avoid the ongoing commit thread racing with destroying?
crash recovery?

both.
1 currently , we pause statemachine here to avoid the ongoing commit thread racing with destroying pg. here, I try to achieve this by waiting for all the appended log to be committed, which means no more log will not be commit anymore . as a result, there is no racing between committing log and destroying pg.

2 crash recovery. after all the appended log are committed and a cp is triggered, when recovery, no log replay will happen(cp_lsn == last_append_lsn), so that we can avoid log replay to access any stale pg resource. for example, we have a put blob log, whose lsn is larger than cp_lsn but smaller than dc_lsn. if destroy_pg is called and the pg_index_Table is destroyed but before pg_super_blk is destoryed, crash occures, Then, when recovery and replay this put blob log, it will try to access the destroyed pg_index_table. other pg resources also has this potential issue(shard or pg chunk)

if this member receive a leave_group, who will tell it to commit all appened log?
similar in BR, if there are some appended but not yet committed log in logstore, but leader decide to send us a BR, probably leader wont tell us to commit

let`s say, leader starts from lsn 30(lsn 1 - lsn 29 are truncated), and the last append log in follower is lsn 20 and last commit log in follower is 10. there is an important fact that any truncated log must be already committed before it is truncated. so the commit_lsn of the whole raft group is at least lsn 29.

leader will start sending snapshot to follower only if it sends an append_entry req to follower and the next_lsn (in follower resp) follower wants is smaller than the smallest lsn leader currently has. note that, in the append_entry req, follower can see the latest commit_lsn in the raft group(at least lsn 29), which must be greater than the follower`s last_append_lsn(lsn 20).

to answer your question, the append_entry req , which will trigger install_snapshot, will tell the laggy follower what is the latest commit_lsn currently. and the latest commit_lsn must be greater than the last_append_lsn in follower , which can make sure all the appended logs to be committed.

@xiaoxichen xiaoxichen Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let`s say, leader starts from lsn 30(lsn 1 - lsn 29 are truncated), and the last append log in follower is lsn 20 and last commit log in follower is 10. there is an important fact that any truncated log must be already committed before it is truncated. so the commit_lsn of the whole raft group is at least lsn 29.

This is not necessary true, we keep maximum raft_logstore_reserve_threshold logs,

    repl_lsn_t raft_logstore_reserve_threshold = HS_DYNAMIC_CONFIG(resource_limits.raft_logstore_reserve_threshold);
    repl_lsn_t truncation_upper_limit = std::max(leader_commit_idx - raft_logstore_reserve_threshold, minimum_repl_idx);
 

to answer your question, the append_entry req , which will trigger install_snapshot, will tell the laggy follower what is the latest commit_lsn currently. and the latest commit_lsn must be greater than the last_append_lsn in follower , which can make sure all the appended logs to be committed.

This part is wrong as well, say if a member has log up to 150, commited to 100, however it gets a install_snapshot_req, it wont know the commit lsn of leader(say 50000) , and it wont commit locally to 150.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raft_logstore_reserve_threshold does not matter , only last_append_lsn and last_commit_lsn matters.

however it gets a install_snapshot_req

in what case it will get a install_snapshot_req? as I said , the only case is that leader sends an append_entry req to follower and the next_lsn in follower resp, which indicates the next log follower wants, is smaller than the smallest lsn leader currently has. does this make sense?

the append_entry will have a commit_index , which indicates the lsn that the follower can commit to. commit lsn must be larger than followers last_append_lsn.

if install_snapshot is triggered, then
leader_commit_lsn >= leader_smallest_lsn > follower_last_append_lsn >= follower_commit_lsn

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC , a rejected append_log_request would not goes to the bottom and update the target_commit_index of the follower.

Comment thread src/lib/homestore_backend/replication_state_machine.cpp
@@ -499,7 +503,7 @@ void ReplicationStateMachine::write_snapshot_obj(std::shared_ptr< homestore::sna
if (home_object_->pg_exists(pg_data->pg_id())) {
LOGI("pg already exists, clean pg resources before snapshot, pg={} {}", pg_data->pg_id(), log_suffix);
// Need to pause state machine before destroying the PG, if fail, let raft retry.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments out of date, as well as we dont have a branch that returns false as of now.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let`s remove this out-of-date comments after addressing other comments for this PR

Comment thread src/lib/homestore_backend/hs_pg_manager.cpp
// clean up the stale pg resources since no message will be received to trigger pg_destroy again.

// so, we call it here for all the above cases to make sure the stale pg resources are cleaned up.
home_object_->destroy_pg_resource(pg_id);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need cleaning GC resources? like any pending gc metablk? gc index table? no_space_left_marker? latch_lsn?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gc index table is pdev related , not pg. no_space_left_marker and latch_lsn has already done in this PR, see reset_latch_lsn, reset_no_space_left_inf.

the gc metablk is a nice catch. there is indeed a case that a gc metablk left when crash. let`s do it in handle_recovered_gc_task. if we find the pg state is destroyed, skip this task and destroy this gc meta blk.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after checking , no need to handle gc metablk.

if there is any gc metablk, it means before crash, drain_pending_gc_task is not returned yet(and thus no pg resource is destroyed). when recovery , these gc metablk will be handle by handle_recovered_gc_task before starting log replay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants