Write-focused client experiencing overly aggressive "dead socket detected" disconnects/reconnects

Since updating StackExchange.Redis from v2.10.1 to 2.12.14 and 2.13.17, I'm seeing frequent "dead socket detected" events accompanied by disconnects and reconnects and exceptions like "InvalidOperationException: Reading is not allowed after reader was completed" and "System.Threading.Tasks.TaskCanceledException: A task was canceled". After updating, these events started happening dozens of times per day, whereas before updating, it only happened a few times in the span of a month.

The client is write-focused, being write-only most of the time. The only explicit read operation is during application startup. Explicit Redis operations are _not_ fire-and-forget. Also, there's an eval call which only occurs in some environments where the issue happens (it can perform a read and write). And in case it matters, the client uses RESP3 and listens and responds to pub/sub messages on the same connection, but there are at most a few pub/sub messages per day during roughly an hour window which does not correlate with dead socket events.

Since updating StackExchange.Redis, dead socket detection logic appears to be overly aggressive at killing the connection from this write-focused application. I noticed that the dead socket detection logic in [PhysicalBridge.cs](/StackExchange/StackExchange.Redis/blob/main/src/StackExchange.Redis/PhysicalBridge.cs) considers `LastReadSecondsAgo` but does not consider `LastWriteSecondsAgo`. And since the application is write-only almost all of the time, `LastReadSecondsAgo * 1_000` will almost always be greater than `{Async}TimeoutMilliseconds * 4`, causing the statement to evaluate to true if there were any timeouts within the heartbeat, even when there were successful writes within that heartbeat.

Relevant snippet from PhysicalBridge.cs dead socket logic:

```csharp
bool deadConnectionOnAsync = asyncTimeoutThisHeartbeat > 0 && tmp.LastReadSecondsAgo * 1_000 > (tmp.BridgeCouldBeNull?.Multiplexer.AsyncTimeoutMilliseconds * 4);
bool deadConnectionOnSync = syncTimeoutThisHeartbeat > 0 && tmp.LastReadSecondsAgo * 1_000 > (tmp.BridgeCouldBeNull?.Multiplexer.TimeoutMilliseconds * 4);
```

Logs suggest that some implicit fire-and-forget operations are timing out, but not async or sync operations, and don't indicate which operation is timing out. I suspect these are getting counted with syncTimeoutThisHeartbeat which may explain why dead socket detected events may have increased since PR #2965. I found LogInformationTimeoutsSummary logs, which always indicate sync timeouts: 0, async timeouts: 0, and sometimes indicate fire and forget > 0. The application doesn't call any explicit commands with the RedisFlags.FireAndForget flag.

## Workarounds

I attempted increasing TimeoutMilliseconds from the default (5 seconds) to 10 seconds, and the frequency seems to have been reduced somewhat, but I still see dead socket detected events.

I suspect that adding a periodic explicit read operation may work around the issue temporarily (forcing LastReadSecondsAgo to be updated more often), though that seems less than ideal as a permanent solution.

I imagine a more robust solution for the dead socket logic might either take write operations into consideration when detecting dead sockets and/or perhaps ignore or give less weight to fire-and-forget timeouts.

## Example logs

> Dead socket detected, no reads in 21 seconds with 1 timeouts, issuing disconnect. EventId.Name: LogWarningDeadSocketDetected

> Sync timeouts: 0; async timeouts: 0; fire and forget: 2; last heartbeat: 0s ago. EventId.Name: LogInformationTimeoutsSummary

## Example exceptions

> System.InvalidOperationException: Reading is not allowed after reader was completed.
>   at System.IO.Pipelines.ThrowHelper.ThrowInvalidOperationException_NoReadingAllowed()
>   at StackExchange.Redis.PhysicalConnection.ReadFromPipe()
>   at ExampleApplication.ExampleClass.HashSetAsync(...) ...

> System.Threading.Tasks.TaskCanceledException: A task was canceled.
>   at ExampleApplication.ExampleClass.HashSetAsync(...) ...

## Environment

- Redis: Amazon ElastiCache Redis OSS v7.0.x
- Client host: Kubernetes
- Client runtime: .NET 10 on Linux (x64)
- Client NuGet packages:
  - StackExchange.Redis: v2.12.14 and v2.13.17 (issue did not occur in v2.10.1)
  - Aspire.StackExchange.Redis: v13.2.4 and v13.4.1 (issue did not occur in v13.1.0)
- Most commonly called StackExchange.Redis method:
  - IDatabaseAsync.HashSetAsync(RedisKey, HashEntry[], CommandFlags)

## Related

Possibly related issue: #3085
Possibly related PR: #2965

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write-focused client experiencing overly aggressive "dead socket detected" disconnects/reconnects #3100

Workarounds

Example logs

Example exceptions

Environment

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Write-focused client experiencing overly aggressive "dead socket detected" disconnects/reconnects #3100

Description

Workarounds

Example logs

Example exceptions

Environment

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions