fix(testcontainers): wait for vshard storages to complete handshake before declaring cluster ready#97

Draft
dkasimovskiy wants to merge 1 commit into master from fix/box-integration-tests

@dkasimovskiy dkasimovskiy commented Jun 11, 2026

Contributor

VshardClusterConfigurator#configure() previously declared the cluster ready after three checks: the router is up, vshard.router.bootstrap() returns cleanly, and crud._VERSION is reachable on the router. None of these checks verifies that the individual storages have completed the vshard handshake during the initial rebalance. The router can report a successful bootstrap while some storages are still in the VHANDSHAKE_NOT_COMPLETE (code 40) state, and a CRUD request that targets such a storage fails immediately.

Observed in tests-crud-integration (3.5.0):

TarantoolTemplateViaJavaConfigTest.testKVTemplateDeleteEntities
  -> CrudException: Failed to truncate for storage-002
     VHANDSHAKE_NOT_COMPLETE, 'Handshake with storage-002-a have not
     been completed yet'

Add a fourth readiness step: waitUntilVshardStoragesAreReady polls vshard.router.info() until every replica in every replicaset reports status='available' and there are no unreachable buckets, with a 120s budget. After this step succeeds, subsequent CRUD requests should only reach storages that have completed the handshake.
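For reviewers, the polling loop described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class name ReadinessPoller, the method waitUntil, and the polling interval are hypothetical, and the probe stands in for whatever evaluates vshard.router.info() on the router.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of this PR: polls a probe until it succeeds
// or the time budget is exhausted.
final class ReadinessPoller {

    private ReadinessPoller() {}

    /**
     * Calls the probe repeatedly, sleeping between attempts.
     *
     * @return true if the probe returned true within the budget,
     *         false on timeout or if the thread was interrupted
     */
    static boolean waitUntil(BooleanSupplier probe, Duration budget, Duration interval) {
        long deadline = System.nanoTime() + budget.toNanos();
        while (true) {
            if (probe.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Never sleep past the deadline; sleep at least 1 ms to avoid a busy loop.
            long sleepMillis = Math.min(interval.toMillis(), Math.max(1L, remainingNanos / 1_000_000L));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With the budget from this PR, a caller would pass Duration.ofSeconds(120) and treat a false result as a configuration failure.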

Changes:

  • VshardClusterContainer: add VSHARD_STORAGES_READY_COMMAND Lua probe, TIMEOUT_VSHARD_STORAGES_READY_IN_SECONDS = 120, and waitUntilVshardStoragesAreReady / vshardStoragesAreReady methods.
  • VshardClusterConfigurator#configure(): invoke the new readiness check as the final step before marking the cluster configured.
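The readiness condition itself (every replica 'available', no unreachable buckets) could be expressed as a predicate like the one below. This is a sketch under stated assumptions, not the probe added in this PR, which is implemented as a Lua command. It assumes the result of vshard.router.info() has already been decoded into nested Java maps with a top-level bucket.unreachable counter and, per replicaset, replica and master entries that each carry a status field. The class name VshardInfoCheck and method storagesAreReady are hypothetical.

```java
import java.util.Map;

// Hypothetical predicate over a decoded vshard.router.info() result.
// The map layout (bucket.unreachable, replicasets.<name>.replica/master.status)
// is an assumption made for this sketch.
final class VshardInfoCheck {

    private VshardInfoCheck() {}

    /** Returns true only if no bucket is unreachable and every listed node is 'available'. */
    static boolean storagesAreReady(Map<String, Object> info) {
        Object bucket = info.get("bucket");
        if (!(bucket instanceof Map)) {
            return false;
        }
        Object unreachable = ((Map<?, ?>) bucket).get("unreachable");
        if (!(unreachable instanceof Number) || ((Number) unreachable).longValue() != 0L) {
            return false;
        }

        Object replicasets = info.get("replicasets");
        if (!(replicasets instanceof Map) || ((Map<?, ?>) replicasets).isEmpty()) {
            // No replicasets reported yet: treat as not ready.
            return false;
        }
        for (Object value : ((Map<?, ?>) replicasets).values()) {
            if (!(value instanceof Map)) {
                return false;
            }
            Map<?, ?> replicaset = (Map<?, ?>) value;
            if (!isAvailable(replicaset.get("replica")) || !isAvailable(replicaset.get("master"))) {
                return false;
            }
        }
        return true;
    }

    // A node counts as available only if it is a map with status == "available".
    private static boolean isAvailable(Object node) {
        return node instanceof Map && "available".equals(((Map<?, ?>) node).get("status"));
    }
}
```

Treating a missing or malformed field as "not ready" keeps the check conservative: the caller keeps polling instead of declaring the cluster configured on partial information.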

I haven't forgotten about:

  • Tests
  • Changelog
  • Documentation
    • JavaDoc was written
  • Commit messages comply with the guideline
  • Clean up the code for review. See checklist

Related issues:

@dkasimovskiy dkasimovskiy force-pushed the fix/box-integration-tests branch from a2189c3 to 44a8a09 Compare June 19, 2026 11:06
@dkasimovskiy dkasimovskiy changed the title from "fix(tests): prevent HashedWheelTimer leak in integration tests" to "fix(testcontainers): wait for vshard storages to complete handshake before declaring cluster ready" Jun 19, 2026
…laring cluster ready

VshardClusterConfigurator#configure() previously stopped at router-up +
vshard.router.bootstrap() + crud._VERSION. None of those verify that
individual storages finished the vshard handshake, so a CRUD request
right after configure() could fail with VHANDSHAKE_NOT_COMPLETE
(vshard code 40).

Add waitUntilVshardStoragesAreReady: polls vshard.router.info() until
every replica is status='available' and info.bucket.unreachable == 0,
120s budget.
@dkasimovskiy dkasimovskiy force-pushed the fix/box-integration-tests branch from 44a8a09 to 1590ff9 Compare June 19, 2026 11:32