Skip to content

coordination: fix reconnect hang on half-open TCP connection#652

Open
krasnovdm wants to merge 2 commits into
ydb-platform:masterfrom
krasnovdm:feature/coordination-reconnect-timeout
Open

coordination: fix reconnect hang on half-open TCP connection#652
krasnovdm wants to merge 2 commits into
ydb-platform:masterfrom
krasnovdm:feature/coordination-reconnect-timeout

Conversation

@krasnovdm
Copy link
Copy Markdown

Problem

When a coordination session loses its gRPC stream and begins reconnecting, a new Stream is created with disableDeadline() (intentional — the session stream is long-lived). If the underlying TCP connection enters a half-open state (TCP handshake completes, but no application data flows), the gRPC stream never delivers SessionStarted, so startFuture never completes.

The reconnect loop then stalls indefinitely:

  1. restoreSession → creates new Stream(rpc) → calls connectToSession(stream, sessionId)
  2. connectToSession returns startFuture, which never resolves
  3. The whenCompleteAsync callback in reconnect() never fires
  4. messagesToRetry (containing the original acquireEphemeralSemaphore) are frozen forever
  5. Application thread blocked on .get() hangs — observed hanging for hours until Thread.interrupt()

The connectTimeout field existed in SessionImpl but was only passed to the server via SessionStart.setTimeoutMillis (server-side session keepalive), not used as a Java-side deadline.

Reported via YDB L2 support: YDBREQUESTS-7830

Fix

Pass connectTimeout into Stream and schedule a one-shot watchdog timer:

  • If SessionStarted arrives before the deadline → timer is a no-op (startFuture.isDone() is true)
  • If the stream hangs → timer cancels the gRPC stream and completes startFuture with TIMEOUT

This unblocks the reconnect loop: connectToSession future resolves with failure, reconnect() proceeds to either the next retry attempt or completeMessagesWithBadSession, ensuring pending operations always terminate.

Testing

  • Updated StreamTest and StreamIntegrationTest to pass the required connectTimeout argument
  • All unit tests pass (BUILD SUCCESS); integration tests skipped (no Docker)

When a coordination session loses its gRPC stream and starts reconnecting,
a new Stream is created with disableDeadline(). If the underlying TCP
connection enters a half-open state (SYN established but no data flows),
the gRPC stream never delivers SessionStarted, so startFuture never
completes. The reconnect loop then stalls indefinitely, leaving all
pending CompletableFutures (e.g. acquireEphemeralSemaphore) unresolved
even after the application-level acquire timeout has expired.

Fix: pass connectTimeout into Stream and schedule a one-shot timer that
cancels the gRPC stream and completes startFuture with TIMEOUT if
SessionStarted is not received within that window. The timer fires only
when startFuture is still pending, so it is a no-op on the happy path.

Reported via YDB support: YDBREQUESTS-7830
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses coordination session reconnect hangs caused by half-open TCP connections by adding a client-side connect watchdog that cancels the gRPC stream if SessionStarted is not received within connectTimeout, allowing the reconnect loop (and pending operations) to terminate instead of stalling indefinitely.

Changes:

  • Updated Stream to accept a connectTimeout and schedule a one-shot timeout that cancels the stream and fails startFuture if startup hangs.
  • Updated StreamIntegrationTest to pass the new connectTimeout constructor argument.
  • Removed/emptied StreamTest instead of updating it (contradicts the PR description and drops unit coverage).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
coordination/src/main/java/tech/ydb/coordination/impl/Stream.java Adds connect-time watchdog logic and changes constructor signature to require connectTimeout.
coordination/src/test/java/tech/ydb/coordination/impl/StreamIntegrationTest.java Updates test to use the new Stream(Rpc, Duration) constructor.
coordination/src/test/java/tech/ydb/coordination/impl/StreamTest.java Test contents removed (file now empty), reducing unit coverage and diverging from PR description.
Comments suppressed due to low confidence (1)

coordination/src/main/java/tech/ydb/coordination/impl/Stream.java:45

  • Stream constructor now requires connectTimeout, but there are still call sites using the old new Stream(rpc) signature (e.g., SessionImpl.connect() and SessionImpl.restoreSession()), which will fail compilation. Update those call sites to pass the appropriate timeout (likely the existing connectTimeout field).
    Stream(Rpc rpc, Duration connectTimeout) {
        this.scheduler = rpc.getScheduler();
        this.stream = rpc.createSession(GrpcRequestSettings.newBuilder()
                .disableDeadline()
                .build());

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +57 to +68
// Guard against a half-open TCP connection where the gRPC stream never delivers
// SessionStarted. Without this timeout the reconnect loop stalls indefinitely,
// keeping CompletableFutures for pending operations (e.g. acquireEphemeralSemaphore)
// unresolved even after the application-level acquire timeout has expired.
scheduler.schedule(() -> {
if (startFuture.isDone()) {
return;
}
logger.warn("stream {} connect timeout after {} ms, cancelling", hashCode(), connectTimeout.toMillis());
stream.cancel();
startFuture.complete(Result.fail(Status.of(StatusCode.TIMEOUT)));
}, connectTimeout.toMillis(), TimeUnit.MILLISECONDS);
Comment on lines 29 to 33
@Test
public void stopBeforeStartTest() {
Stream stream = new Stream(RPC);
Stream stream = new Stream(RPC, java.time.Duration.ofSeconds(5));
Status stopped = stream.stop().join();

…store tests

Keep Stream(Rpc rpc) as a delegate to Stream(rpc, Duration.ofSeconds(5)) so
existing unit tests compile unchanged. SessionImpl passes the configured
connectTimeout explicitly; the no-arg overload is only used in tests.

Restore StreamTest.java (was accidentally cleared) and revert
StreamIntegrationTest.java to the original constructor call.
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 58.33333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.22%. Comparing base (eff1645) to head (06968e9).

Files with missing lines Patch % Lines
...c/main/java/tech/ydb/coordination/impl/Stream.java 60.00% 3 Missing and 1 partial ⚠️
...n/java/tech/ydb/coordination/impl/SessionImpl.java 50.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master     #652      +/-   ##
============================================
- Coverage     71.31%   71.22%   -0.10%     
+ Complexity     3365     3364       -1     
============================================
  Files           379      379              
  Lines         15920    15929       +9     
  Branches       1669     1669              
============================================
- Hits          11353    11345       -8     
- Misses         3917     3929      +12     
- Partials        650      655       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants