Skip to content

SDSTOR-21981: make shard create/seal log-only, avoid header/footer I/O, add prod repro issue#434

Open
Hooper9973 wants to merge 1 commit into
eBay:Refactor_create_seal_shardfrom
Hooper9973:refactor_shard
Open

SDSTOR-21981: make shard create/seal log-only, avoid header/footer I/O, add prod repro issue#434
Hooper9973 wants to merge 1 commit into
eBay:Refactor_create_seal_shardfrom
Hooper9973:refactor_shard

Conversation

@Hooper9973

Copy link
Copy Markdown
Contributor
  • Switch create_shard and seal_shard to log-only; remove data channel path
  • Stop persisting shard header/footer on disk for create/seal flows
  • Add production GC issue reproductions:
    • issue1: concurrent seal_shard, create_shard, and GC
    • issue2: concurrent put_blob and seal_shard can place blob into sealed shard

Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp
Comment thread src/lib/homestore_backend/hs_shard_manager.cpp Outdated
pg_id : uint16; // pg id which this shard belongs to;
state : ubyte; // shard state;
created_lsn : uint64; // lsn on shard creation;
sealed_lsn : uint64; // lsn on shard sealing;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default value of sealed_lsn will be zero if new version (with this change) receives a message from old version, will that causing issues ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set the default value of sealed_lsn to INT64_MAX, which is same as the shardinfo's behavior

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test the behavior? i.e if older version sends a flatbuffer of resync_shard_data, it wont contains the sealed_lsn field, then the receiver (new version) will decode the flatbuffer and assign zero (type default) to sealed_lsn field during de-serialization.

I strongly believe the shard_meta.sealed_lsn() == 0 in this case in below line.
https://github.com/eBay/HomeObject/pull/434/changes#diff-2b53b1beca29a46e49d5d8b4785ca753ddbbccea43ebe648c9d6bf7b811b9cb4R83

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add one compatible test ResyncShardMetaDataBackwardCompat in hs_shard_tests.cpp.
Change the sealed_lsn order in ResyncShardMetaData, the new field must be placed at the end to protect the backward compatibility

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it makes sense now, new fields need a) stay at the end b) with default value if 0 doesnt make sense

Comment thread src/lib/homestore_backend/hs_blob_manager.cpp Outdated
Comment thread src/lib/homestore_backend/heap_chunk_selector.cpp Outdated
@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 49.47368% with 48 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (Refactor_create_seal_shard@e1c23e1). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/lib/homestore_backend/hs_shard_manager.cpp 42.10% 28 Missing and 5 partials ⚠️
...ib/homestore_backend/replication_state_machine.cpp 38.46% 7 Missing and 1 partial ⚠️
src/lib/homestore_backend/hs_blob_manager.cpp 71.42% 2 Missing and 2 partials ⚠️
src/lib/homestore_backend/heap_chunk_selector.cpp 71.42% 2 Missing ⚠️
src/lib/homestore_backend/pg_blob_iterator.cpp 0.00% 1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@                      Coverage Diff                      @@
##             Refactor_create_seal_shard     #434   +/-   ##
=============================================================
  Coverage                              ?   49.71%           
=============================================================
  Files                                 ?       36           
  Lines                                 ?     5361           
  Branches                              ?      676           
=============================================================
  Hits                                  ?     2665           
  Misses                                ?     2422           
  Partials                              ?      274           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Hooper9973

Copy link
Copy Markdown
Contributor Author

Run the Reproduced UT on the stablev4.x branch, the result show the UT reproduced the scenario.

$ ./homestore_test_gc -csv error --executor immediate --config_path ./ --chunks_per_pg 1 --override_config hs_backend_config.enable_gc=true --override_config hs_backend_config.gc_garbage_rate_threshold=0 --override_config hs_backend_config.gc_scan_interval_sec=5 --gtest_filter=HomeObjectFixture.StalePChunkRouteAfterGC:HomeObjectFixture.StaleBlobRouteAfterSealAndGC

Port 4000 is not in use.
Port 4001 is not in use.
Port 4002 is not in use.
Note: Google Test filter = HomeObjectFixture.StalePChunkRouteAfterGC:HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from HomeObjectFixture
[ RUN      ] HomeObjectFixture.StalePChunkRouteAfterGC

Note: Google Test filter = HomeObjectFixture.StalePChunkRouteAfterGC:HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from HomeObjectFixture
[ RUN      ] HomeObjectFixture.StalePChunkRouteAfterGC

Note: Google Test filter = HomeObjectFixture.StalePChunkRouteAfterGC:HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from HomeObjectFixture
[ RUN      ] HomeObjectFixture.StalePChunkRouteAfterGC
/home/ubuntu/WorkSpace/Upstream/HomeObject/src/lib/homestore_backend/tests/hs_gc_tests.cpp:1050: Failure
Expected equality of these values:
  p2.value()
    Which is: 87
  live_pchunk
    Which is: 90
shard2 on replica 2 is routed to pchunk 87 but the live vchunk->pchunk mapping is 90 (stale shard/blob route; see PG 39 / PG 3409)


/home/ubuntu/WorkSpace/Upstream/HomeObject/src/lib/homestore_backend/tests/hs_gc_tests.cpp:1080: Failure
Expected equality of these values:
  p2.value()
    Which is: 87
  live_pchunk_after
    Which is: 90
after putting blobs into shard2 on replica 2, shard2's recorded pchunk 87 diverged from the live mapping 90

[  FAILED  ] HomeObjectFixture.StalePChunkRouteAfterGC (16069 ms)
[       OK ] HomeObjectFixture.StalePChunkRouteAfterGC (17070 ms)
[ RUN      ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[ RUN      ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC
[       OK ] HomeObjectFixture.StalePChunkRouteAfterGC (15083 ms)
[ RUN      ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC

/home/ubuntu/WorkSpace/Upstream/HomeObject/src/lib/homestore_backend/tests/hs_gc_tests.cpp:1223: Failure
Value of: blob_rejected
  Actual: false
Expected: true
[StaleBlobRouteAfterSealAndGC-fix] late _put_blob should have been rejected by sealed_lsn guard!

[  FAILED  ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC (10255 ms)
[       OK ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC (10255 ms)
[----------] 2 tests from HomeObjectFixture (25338 ms total)

[----------] 2 tests from HomeObjectFixture (27326 ms total)

[----------] Global test environment tear-down
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (25338 ms total)
[==========] 2 tests from 1 test suite ran. (27326 ms total)
[  PASSED  ] 1 test.
[  PASSED  ] 2 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC

 1 FAILED TEST
[       OK ] HomeObjectFixture.StaleBlobRouteAfterSealAndGC (10255 ms)
[----------] 2 tests from HomeObjectFixture (26325 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (26325 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] HomeObjectFixture.StalePChunkRouteAfterGC

…O, add prod repro issue

  - Switch create_shard and seal_shard to log-only; remove data channel path
  - Stop persisting shard header/footer on disk for create/seal flows
  - Add production GC issue reproductions:
    - issue1: concurrent seal_shard, create_shard, and GC
    - issue2: concurrent put_blob and seal_shard can place blob into sealed shard

@xiaoxichen xiaoxichen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

The compatibility of shard_info_superblk/shard_info need to be handle carefully, on metablk as well as the raft log entries (new code can receive/read back log entries from old code)

@JacksonYao287 JacksonYao287 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look tomorrow

}

if (lsn >= shard_sealed_lsn) {
homestore::data_service().async_free_blk(pbas).thenValue([lsn, shard_id, blob_id, tid, &pbas](auto&& err) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&pba will be a dangling reference if thenvalue is executed in a different thread.

pg_id_t pg_id = (uint16_t)(application_hint >> 16 & 0xFFFF);
homestore::chunk_num_t v_chunk_id = (uint16_t)(application_hint & 0xFFFF);
return select_specific_chunk(pg_id, v_chunk_id);
return select_specific_chunk(pg_id, v_chunk_id)->get_internal_chunk();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we nee check if the return value of select_specific_chunk(pg_id, v_chunk_id) is nullptr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants