redo destroy pg#435
Conversation
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## stable/v4.x #435 +/- ##
==============================================
Coverage ? 45.60%
==============================================
Files ? 36
Lines ? 5436
Branches ? 685
==============================================
Hits ? 2479
Misses ? 2683
Partials ? 274 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
0b5e4a4 to
c686ce2
Compare
| // when reaching here, we will not receive any new log, both for BR or leave_group case. we wait for all logs to be | ||
| // committed and trigger a cp, so that when restart, no log will be replayed since cp_lsn is equal to | ||
| // last_append_lsn. | ||
| while (true) { |
There was a problem hiding this comment.
I feel this while loop is probably wrong.
if this member receive a leave_group, who will tell it to commit all appened log?
similar in BR, if there are some appended but not yet committed log in logstore, but leader decide to send us a BR, probably leader wont tell us to commit
There was a problem hiding this comment.
what is the intent here?
avoid the ongoing commit thread racing with destroying?
crash recovery?
There was a problem hiding this comment.
what is the intent here?
avoid the ongoing commit thread racing with destroying?
crash recovery?
both.
1 currently , we pause statemachine here to avoid the ongoing commit thread racing with destroying pg. here, I try to achieve this by waiting for all the appended log to be committed, which means no more log will not be commit anymore . as a result, there is no racing between committing log and destroying pg.
2 crash recovery. after all the appended log are committed and a cp is triggered, when recovery, no log replay will happen(cp_lsn == last_append_lsn), so that we can avoid log replay to access any stale pg resource. for example, we have a put blob log, whose lsn is larger than cp_lsn but smaller than dc_lsn. if destroy_pg is called and the pg_index_Table is destroyed but before pg_super_blk is destoryed, crash occures, Then, when recovery and replay this put blob log, it will try to access the destroyed pg_index_table. other pg resources also has this potential issue(shard or pg chunk)
if this member receive a leave_group, who will tell it to commit all appened log?
similar in BR, if there are some appended but not yet committed log in logstore, but leader decide to send us a BR, probably leader wont tell us to commit
let`s say, leader starts from lsn 30(lsn 1 - lsn 29 are truncated), and the last append log in follower is lsn 20 and last commit log in follower is 10. there is an important fact that any truncated log must be already committed before it is truncated. so the commit_lsn of the whole raft group is at least lsn 29.
leader will start sending snapshot to follower only if it sends an append_entry req to follower and the next_lsn (in follower resp) follower wants is smaller than the smallest lsn leader currently has. note that, in the append_entry req, follower can see the latest commit_lsn in the raft group(at least lsn 29), which must be greater than the follower`s last_append_lsn(lsn 20).
to answer your question, the append_entry req , which will trigger install_snapshot, will tell the laggy follower what is the latest commit_lsn currently. and the latest commit_lsn must be greater than the last_append_lsn in follower , which can make sure all the appended logs to be committed.
There was a problem hiding this comment.
let`s say, leader starts from lsn 30(lsn 1 - lsn 29 are truncated), and the last append log in follower is lsn 20 and last commit log in follower is 10. there is an important fact that any truncated log must be already committed before it is truncated. so the commit_lsn of the whole raft group is at least lsn 29.
This is not necessary true, we keep maximum raft_logstore_reserve_threshold logs,
repl_lsn_t raft_logstore_reserve_threshold = HS_DYNAMIC_CONFIG(resource_limits.raft_logstore_reserve_threshold);
repl_lsn_t truncation_upper_limit = std::max(leader_commit_idx - raft_logstore_reserve_threshold, minimum_repl_idx);
to answer your question, the append_entry req , which will trigger install_snapshot, will tell the laggy follower what is the latest commit_lsn currently. and the latest commit_lsn must be greater than the last_append_lsn in follower , which can make sure all the appended logs to be committed.
This part is wrong as well, say if a member has log up to 150, commited to 100, however it gets a install_snapshot_req, it wont know the commit lsn of leader(say 50000) , and it wont commit locally to 150.
There was a problem hiding this comment.
raft_logstore_reserve_threshold does not matter , only last_append_lsn and last_commit_lsn matters.
however it gets a install_snapshot_req
in what case it will get a install_snapshot_req? as I said , the only case is that leader sends an append_entry req to follower and the next_lsn in follower resp, which indicates the next log follower wants, is smaller than the smallest lsn leader currently has. does this make sense?
the append_entry will have a commit_index , which indicates the lsn that the follower can commit to. commit lsn must be larger than followers last_append_lsn.
if install_snapshot is triggered, then
leader_commit_lsn >= leader_smallest_lsn > follower_last_append_lsn >= follower_commit_lsn
There was a problem hiding this comment.
IIRC , a rejected append_log_request would not goes to the bottom and update the target_commit_index of the follower.
| @@ -499,7 +503,7 @@ void ReplicationStateMachine::write_snapshot_obj(std::shared_ptr< homestore::sna | |||
| if (home_object_->pg_exists(pg_data->pg_id())) { | |||
| LOGI("pg already exists, clean pg resources before snapshot, pg={} {}", pg_data->pg_id(), log_suffix); | |||
| // Need to pause state machine before destroying the PG, if fail, let raft retry. | |||
There was a problem hiding this comment.
comments out of date, as well as we dont have a branch that returns false as of now.
There was a problem hiding this comment.
let`s remove this out-of-date comments after addressing other comments for this PR
| // clean up the stale pg resources since no message will be received to trigger pg_destroy again. | ||
|
|
||
| // so, we call it here for all the above cases to make sure the stale pg resources are cleaned up. | ||
| home_object_->destroy_pg_resource(pg_id); |
There was a problem hiding this comment.
do we need cleaning GC resources? like any pending gc metablk? gc index table? no_space_left_marker? latch_lsn?
There was a problem hiding this comment.
gc index table is pdev related , not pg. no_space_left_marker and latch_lsn has already done in this PR, see reset_latch_lsn, reset_no_space_left_inf.
the gc metablk is a nice catch. there is indeed a case that a gc metablk left when crash. let`s do it in handle_recovered_gc_task. if we find the pg state is destroyed, skip this task and destroy this gc meta blk.
There was a problem hiding this comment.
after checking , no need to handle gc metablk.
if there is any gc metablk, it means before crash, drain_pending_gc_task is not returned yet(and thus no pg resource is destroyed). when recovery , these gc metablk will be handle by handle_recovered_gc_task before starting log replay.
No description provided.