Skip to content

Write-focused client experiencing overly aggressive "dead socket detected" disconnects/reconnects #3100

@mhenry-cb

Description

@mhenry-cb

Since updating StackExchange.Redis from v2.10.1 to 2.12.14 and 2.13.17, I'm seeing frequent "dead socket detected" events accompanied by disconnects and reconnects and exceptions like "InvalidOperationException: Reading is not allowed after reader was completed" and "System.Threading.Tasks.TaskCanceledException: A task was canceled". After updating, these events started happening dozens of times per day, whereas before updating, it only happened a few times in the span of a month.

The client is write-focused, being write-only most of the time. The only explicit read operation is during application startup. Explicit Redis operations are not fire-and-forget. Also, there's an eval call which only occurs in some environments where the issue happens (it can perform a read and write). And in case it matters, the client uses RESP3 and listens and responds to pub/sub messages on the same connection, but there are at most a few pub/sub messages per day during roughly an hour window which does not correlate with dead socket events.

Since updating StackExchange.Redis, dead socket detection logic appears to be overly aggressive at killing the connection from this write-focused application. I noticed that the dead socket detection logic in PhysicalBridge.cs considers LastReadSecondsAgo but does not consider LastWriteSecondsAgo. And since the application is write-only almost all of the time, LastReadSecondsAgo * 1_000 will almost always be greater than {Async}TimeoutMilliseconds * 4, causing the statement to evaluate to true if there were any timeouts within the heartbeat, even when there were successful writes within that heartbeat.

Relevant snippet from PhysicalBridge.cs dead socket logic:

bool deadConnectionOnAsync = asyncTimeoutThisHeartbeat > 0 && tmp.LastReadSecondsAgo * 1_000 > (tmp.BridgeCouldBeNull?.Multiplexer.AsyncTimeoutMilliseconds * 4);
bool deadConnectionOnSync = syncTimeoutThisHeartbeat > 0 && tmp.LastReadSecondsAgo * 1_000 > (tmp.BridgeCouldBeNull?.Multiplexer.TimeoutMilliseconds * 4);

Logs suggest that some implicit fire-and-forget operations are timing out, but not async or sync operations, and don't indicate which operation is timing out. I suspect these are getting counted with syncTimeoutThisHeartbeat which may explain why dead socket detected events may have increased since PR #2965. I found LogInformationTimeoutsSummary logs, which always indicate sync timeouts: 0, async timeouts: 0, and sometimes indicate fire and forget > 0. The application doesn't call any explicit commands with the RedisFlags.FireAndForget flag.

Workarounds

I attempted increasing TimeoutMilliseconds from the default (5 seconds) to 10 seconds, and the frequency seems to have been reduced somewhat, but I still see dead socket detected events.

I suspect that adding a periodic explicit read operation may work around the issue temporarily (forcing LastReadSecondsAgo to be updated more often), though that seems less than ideal as a permanent solution.

I imagine a more robust solution for the dead socket logic might either take write operations into consideration when detecting dead sockets and/or perhaps ignore or give less weight to fire-and-forget timeouts.

Example logs

Dead socket detected, no reads in 21 seconds with 1 timeouts, issuing disconnect. EventId.Name: LogWarningDeadSocketDetected

Sync timeouts: 0; async timeouts: 0; fire and forget: 2; last heartbeat: 0s ago. EventId.Name: LogInformationTimeoutsSummary

Example exceptions

System.InvalidOperationException: Reading is not allowed after reader was completed.
at System.IO.Pipelines.ThrowHelper.ThrowInvalidOperationException_NoReadingAllowed()
at StackExchange.Redis.PhysicalConnection.ReadFromPipe()
at ExampleApplication.ExampleClass.HashSetAsync(...) ...

System.Threading.Tasks.TaskCanceledException: A task was canceled.
at ExampleApplication.ExampleClass.HashSetAsync(...) ...

Environment

  • Redis: Amazon ElastiCache Redis OSS v7.0.x
  • Client host: Kubernetes
  • Client runtime: .NET 10 on Linux (x64)
  • Client NuGet packages:
    • StackExchange.Redis: v2.12.14 and v2.13.17 (issue did not occur in v2.10.1)
    • Aspire.StackExchange.Redis: v13.2.4 and v13.4.1 (issue did not occur in v13.1.0)
  • Most commonly called StackExchange.Redis method:
    • IDatabaseAsync.HashSetAsync(RedisKey, HashEntry[], CommandFlags)

Related

Possibly related issue: #3085
Possibly related PR: #2965

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions