Since updating StackExchange.Redis from v2.10.1 to 2.12.14 and 2.13.17, I'm seeing frequent "dead socket detected" events accompanied by disconnects and reconnects and exceptions like "InvalidOperationException: Reading is not allowed after reader was completed" and "System.Threading.Tasks.TaskCanceledException: A task was canceled". After updating, these events started happening dozens of times per day, whereas before updating, it only happened a few times in the span of a month.
The client is write-focused, being write-only most of the time. The only explicit read operation is during application startup. Explicit Redis operations are not fire-and-forget. Also, there's an eval call which only occurs in some environments where the issue happens (it can perform a read and write). And in case it matters, the client uses RESP3 and listens and responds to pub/sub messages on the same connection, but there are at most a few pub/sub messages per day during roughly an hour window which does not correlate with dead socket events.
Since updating StackExchange.Redis, dead socket detection logic appears to be overly aggressive at killing the connection from this write-focused application. I noticed that the dead socket detection logic in PhysicalBridge.cs considers LastReadSecondsAgo but does not consider LastWriteSecondsAgo. And since the application is write-only almost all of the time, LastReadSecondsAgo * 1_000 will almost always be greater than {Async}TimeoutMilliseconds * 4, causing the statement to evaluate to true if there were any timeouts within the heartbeat, even when there were successful writes within that heartbeat.
Relevant snippet from PhysicalBridge.cs dead socket logic:
bool deadConnectionOnAsync = asyncTimeoutThisHeartbeat > 0 && tmp.LastReadSecondsAgo * 1_000 > (tmp.BridgeCouldBeNull?.Multiplexer.AsyncTimeoutMilliseconds * 4);
bool deadConnectionOnSync = syncTimeoutThisHeartbeat > 0 && tmp.LastReadSecondsAgo * 1_000 > (tmp.BridgeCouldBeNull?.Multiplexer.TimeoutMilliseconds * 4);
Logs suggest that some implicit fire-and-forget operations are timing out, but not async or sync operations, and don't indicate which operation is timing out. I suspect these are getting counted with syncTimeoutThisHeartbeat which may explain why dead socket detected events may have increased since PR #2965. I found LogInformationTimeoutsSummary logs, which always indicate sync timeouts: 0, async timeouts: 0, and sometimes indicate fire and forget > 0. The application doesn't call any explicit commands with the RedisFlags.FireAndForget flag.
Workarounds
I attempted increasing TimeoutMilliseconds from the default (5 seconds) to 10 seconds, and the frequency seems to have been reduced somewhat, but I still see dead socket detected events.
I suspect that adding a periodic explicit read operation may work around the issue temporarily (forcing LastReadSecondsAgo to be updated more often), though that seems less than ideal as a permanent solution.
I imagine a more robust solution for the dead socket logic might either take write operations into consideration when detecting dead sockets and/or perhaps ignore or give less weight to fire-and-forget timeouts.
Example logs
Dead socket detected, no reads in 21 seconds with 1 timeouts, issuing disconnect. EventId.Name: LogWarningDeadSocketDetected
Sync timeouts: 0; async timeouts: 0; fire and forget: 2; last heartbeat: 0s ago. EventId.Name: LogInformationTimeoutsSummary
Example exceptions
System.InvalidOperationException: Reading is not allowed after reader was completed.
at System.IO.Pipelines.ThrowHelper.ThrowInvalidOperationException_NoReadingAllowed()
at StackExchange.Redis.PhysicalConnection.ReadFromPipe()
at ExampleApplication.ExampleClass.HashSetAsync(...) ...
System.Threading.Tasks.TaskCanceledException: A task was canceled.
at ExampleApplication.ExampleClass.HashSetAsync(...) ...
Environment
- Redis: Amazon ElastiCache Redis OSS v7.0.x
- Client host: Kubernetes
- Client runtime: .NET 10 on Linux (x64)
- Client NuGet packages:
- StackExchange.Redis: v2.12.14 and v2.13.17 (issue did not occur in v2.10.1)
- Aspire.StackExchange.Redis: v13.2.4 and v13.4.1 (issue did not occur in v13.1.0)
- Most commonly called StackExchange.Redis method:
- IDatabaseAsync.HashSetAsync(RedisKey, HashEntry[], CommandFlags)
Related
Possibly related issue: #3085
Possibly related PR: #2965
Since updating StackExchange.Redis from v2.10.1 to 2.12.14 and 2.13.17, I'm seeing frequent "dead socket detected" events accompanied by disconnects and reconnects and exceptions like "InvalidOperationException: Reading is not allowed after reader was completed" and "System.Threading.Tasks.TaskCanceledException: A task was canceled". After updating, these events started happening dozens of times per day, whereas before updating, it only happened a few times in the span of a month.
The client is write-focused, being write-only most of the time. The only explicit read operation is during application startup. Explicit Redis operations are not fire-and-forget. Also, there's an eval call which only occurs in some environments where the issue happens (it can perform a read and write). And in case it matters, the client uses RESP3 and listens and responds to pub/sub messages on the same connection, but there are at most a few pub/sub messages per day during roughly an hour window which does not correlate with dead socket events.
Since updating StackExchange.Redis, dead socket detection logic appears to be overly aggressive at killing the connection from this write-focused application. I noticed that the dead socket detection logic in PhysicalBridge.cs considers
LastReadSecondsAgobut does not considerLastWriteSecondsAgo. And since the application is write-only almost all of the time,LastReadSecondsAgo * 1_000will almost always be greater than{Async}TimeoutMilliseconds * 4, causing the statement to evaluate to true if there were any timeouts within the heartbeat, even when there were successful writes within that heartbeat.Relevant snippet from PhysicalBridge.cs dead socket logic:
Logs suggest that some implicit fire-and-forget operations are timing out, but not async or sync operations, and don't indicate which operation is timing out. I suspect these are getting counted with syncTimeoutThisHeartbeat which may explain why dead socket detected events may have increased since PR #2965. I found LogInformationTimeoutsSummary logs, which always indicate sync timeouts: 0, async timeouts: 0, and sometimes indicate fire and forget > 0. The application doesn't call any explicit commands with the RedisFlags.FireAndForget flag.
Workarounds
I attempted increasing TimeoutMilliseconds from the default (5 seconds) to 10 seconds, and the frequency seems to have been reduced somewhat, but I still see dead socket detected events.
I suspect that adding a periodic explicit read operation may work around the issue temporarily (forcing LastReadSecondsAgo to be updated more often), though that seems less than ideal as a permanent solution.
I imagine a more robust solution for the dead socket logic might either take write operations into consideration when detecting dead sockets and/or perhaps ignore or give less weight to fire-and-forget timeouts.
Example logs
Example exceptions
Environment
Related
Possibly related issue: #3085
Possibly related PR: #2965