SDSTOR-21981: make shard create/seal log-only, avoid header/footer I/O, add prod repro issue#434
Hooper9973
commented
Jun 11, 2026
- Switch create_shard and seal_shard to log-only; remove data channel path
- Stop persisting shard header/footer on disk for create/seal flows
- Add production GC issue reproductions:
  - issue1: concurrent seal_shard, create_shard, and GC
  - issue2: concurrent put_blob and seal_shard can place a blob into a sealed shard
pg_id : uint16; // pg id which this shard belongs to;
state : ubyte; // shard state;
created_lsn : uint64; // lsn on shard creation;
sealed_lsn : uint64; // lsn on shard sealing;
The default value of sealed_lsn will be zero if the new version (with this change) receives a message from an old version. Will that cause issues?
Set the default value of sealed_lsn to INT64_MAX, which is the same as shardinfo's behavior.
Can you test the behavior? That is, if an older version sends a resync_shard_data flatbuffer, it won't contain the sealed_lsn field; the receiver (new version) will then decode the flatbuffer and assign zero (the type default) to sealed_lsn during deserialization.
I strongly believe shard_meta.sealed_lsn() == 0 in this case, at the line below.
https://github.com/eBay/HomeObject/pull/434/changes#diff-2b53b1beca29a46e49d5d8b4785ca753ddbbccea43ebe648c9d6bf7b811b9cb4R83
Added a compatibility test, ResyncShardMetaDataBackwardCompat, in hs_shard_tests.cpp.
Changed the position of sealed_lsn in ResyncShardMetaData: the new field must be placed at the end to preserve backward compatibility.
Yeah, it makes sense now: new fields need to a) stay at the end and b) carry an explicit default value if 0 doesn't make sense.
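The concern in this thread can be illustrated with a toy model. The sketch below is not the FlatBuffers API and not HomeObject code: `ToyMessage`, `read_u64`, `looks_sealed_at`, and `kNotSealedLsn` are hypothetical names, and the rule "a shard counts as sealed at an lsn once sealed_lsn is at or below it" is an assumption made for illustration. It only shows why a missing field that decodes to 0 can look like "sealed at lsn 0", while a sentinel default avoids that.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <string>

// Toy stand-in for a decoded message: only fields the sender actually wrote are present.
// This is NOT the FlatBuffers API, just a model of "absent field -> default value".
using ToyMessage = std::map<std::string, uint64_t>;

// Sentinel meaning "not sealed yet" (assumed to mirror the INT64_MAX default mentioned above).
constexpr uint64_t kNotSealedLsn =
    static_cast<uint64_t>(std::numeric_limits<int64_t>::max());

// Hypothetical accessor: returns the stored value, or `dflt` when the field was never written.
inline uint64_t read_u64(const ToyMessage& msg, const std::string& field, uint64_t dflt) {
    const auto it = msg.find(field);
    return it == msg.end() ? dflt : it->second;
}

// Assumed rule: a shard looks sealed at `lsn` once its sealed_lsn is at or below `lsn`.
inline bool looks_sealed_at(const ToyMessage& msg, uint64_t lsn, uint64_t sealed_lsn_default) {
    return read_u64(msg, "sealed_lsn", sealed_lsn_default) <= lsn;
}
```

With a default of 0, a message from an old sender (no sealed_lsn) looks sealed at every lsn; with the sentinel default it does not.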
Force-pushed from 0087f47 to dbd7adc.
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## Refactor_create_seal_shard #434 +/- ##
=============================================================
Coverage ? 49.71%
=============================================================
Files ? 36
Lines ? 5361
Branches ? 676
=============================================================
Hits ? 2665
Misses ? 2422
Partials ? 274
Force-pushed from dbd7adc to efea260.
Ran the reproduction UT on the stablev4.x branch; the result shows that the UT reproduces the scenario.
Force-pushed from efea260 to 78e64d7.
xiaoxichen
left a comment
LGTM.
The compatibility of shard_info_superblk/shard_info needs to be handled carefully, both on the metablk and in the raft log entries (new code can receive or read back log entries written by old code).
JacksonYao287
left a comment
I will take a look tomorrow
}

if (lsn >= shard_sealed_lsn) {
    homestore::data_service().async_free_blk(pbas).thenValue([lsn, shard_id, blob_id, tid, &pbas](auto&& err) {
&pbas will be a dangling reference if thenValue is executed on a different thread.
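One common way to avoid this kind of dangling capture is to let the closure own its data. The sketch below does not use the real homestore or folly APIs: `Continuation` is a hypothetical stand-in for a callback that runs later, and `make_free_callback` is an illustrative helper. It only shows the capture pattern (move into the closure instead of capturing by reference).

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a continuation that may run later, on another thread,
// after the scope that created it has already exited.
using Continuation = std::function<std::size_t()>;

// Builds a continuation that owns its own copy of the block list.
// `pbas` is taken by value and moved into the closure via an init-capture,
// so the closure never refers to the caller's local variable.
inline Continuation make_free_callback(std::vector<uint64_t> pbas) {
    return [pbas = std::move(pbas)]() {
        // The vector lives inside the closure, so this access is always valid.
        return pbas.size();
    };
}
```

Capturing `&pbas` instead would leave the closure pointing at a variable that may no longer exist by the time the callback runs; the by-value capture costs one move (or copy) but removes that lifetime dependency.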
  pg_id_t pg_id = (uint16_t)(application_hint >> 16 & 0xFFFF);
  homestore::chunk_num_t v_chunk_id = (uint16_t)(application_hint & 0xFFFF);
- return select_specific_chunk(pg_id, v_chunk_id);
+ return select_specific_chunk(pg_id, v_chunk_id)->get_internal_chunk();
Maybe we need to check whether the return value of select_specific_chunk(pg_id, v_chunk_id) is nullptr.
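A guarded version of that dereference might look like the sketch below. The types here (`InternalChunk`, `VChunk`, `Hint`) and the helpers `decode_hint` and `internal_chunk_or_null` are hypothetical stand-ins, not the HomeObject definitions, and the 64-bit width of `application_hint` is an assumption; only the hint layout (pg_id in bits 16..31, v_chunk_id in bits 0..15) is taken from the diff.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-ins for the real chunk types.
struct InternalChunk {
    uint16_t id;
};

struct VChunk {
    std::shared_ptr<InternalChunk> internal;
    std::shared_ptr<InternalChunk> get_internal_chunk() const { return internal; }
};

struct Hint {
    uint16_t pg_id;
    uint16_t v_chunk_id;
};

// Splits the hint the same way the diff does: pg_id in bits 16..31, v_chunk_id in bits 0..15.
inline Hint decode_hint(uint64_t application_hint) {
    return Hint{static_cast<uint16_t>((application_hint >> 16) & 0xFFFF),
                static_cast<uint16_t>(application_hint & 0xFFFF)};
}

// Returns the internal chunk, or nullptr when the lookup itself found nothing,
// instead of dereferencing a null pointer.
inline std::shared_ptr<InternalChunk> internal_chunk_or_null(const std::shared_ptr<VChunk>& v_chunk) {
    if (!v_chunk) { return nullptr; }
    return v_chunk->get_internal_chunk();
}
```

Whether a null result should be returned to the caller, logged, or treated as a fatal error is a design decision for the PR author; the sketch only shows where the check would sit.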