Proposal to change how the WebSocket clients are cleaned up to remove a race condition #424
Conversation
Pull request overview
This PR tightens AsyncWebSocket client lifetime handling to avoid iterator invalidation and unsafe client pointer usage when disconnects race with broadcast sends.
Changes:
- Defers client removal from disconnect handling to `cleanupClients()` to avoid invalidating `_clients` iteration during broadcasts.
- Adds additional locking around `AsyncWebSocketClient::_client` reads and moves per-client operations (close/ping/text/binary) to execute under the websocket server lock.
- Updates `cleanupClients()` to count connected clients directly and erase all deletable clients in a single pass.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/AsyncWebSocket.h | Adds client lock usage in `shouldBeDeleted()` to make deletion checks thread-safe. |
| src/AsyncWebSocket.cpp | Introduces a locked client lookup helper, adds additional locking in client callbacks/queueing, defers disconnect removal, and revises cleanup + per-client operations to run under the server lock. |
While stress-testing with FluidNC running on macOS, I ran into some problems that showed up with websocket textAll, which tended to deadlock. GPT-5.4 did a deep analysis and came up with these fixes, which solved the problem.
status() is now synchronized under the per-client mutex, so the connected-client checks in find_connected_client_locked() and cleanupClients() no longer race with _status updates. I also changed cleanupClients() to splice deletable clients into a temporary list and destroy them after releasing AsyncWebSocket::_lock, so disconnect callbacks no longer run under the server lock.
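A minimal sketch of that splice-then-destroy pattern, using simplified stand-in types rather than the actual AsyncWebSocket classes (the idea is the same: move deletable nodes out under the lock, let destructors run after it is released):

```cpp
#include <list>
#include <mutex>

struct ClientSketch {                    // stand-in for AsyncWebSocketClient
  bool deleted = false;
  bool shouldBeDeleted() const { return deleted; }
  // ~ClientSketch() is where disconnect callbacks would fire in the real class
};

struct ServerSketch {                    // stand-in for AsyncWebSocket
  std::mutex _lock;                      // server lock guarding the client list
  std::list<ClientSketch> _clients;

  void cleanupClients() {
    std::list<ClientSketch> toDelete;    // destroyed only after the lock is gone
    {
      std::lock_guard<std::mutex> g(_lock);
      for (auto it = _clients.begin(); it != _clients.end();) {
        if (it->shouldBeDeleted()) {
          // Move the node into the temporary list; no destructor runs here.
          toDelete.splice(toDelete.end(), _clients, it++);
        } else {
          ++it;
        }
      }
    }
    // toDelete goes out of scope here, so client destructors (and any
    // disconnect callbacks they trigger) run without holding the server lock.
  }
};
```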
@MitchBradley : I have cherry-picked your commits into #427 and, in an added commit, removed the changes regarding how the clients are cleaned up, in order to split the PR into 2 parts:
For the story: we know the issue, @willmmiles also suggested once that we make some changes in this area, and we had a user run into such an issue just like you did.

Now, the proposed changes are quite important. With the middleware stack that was added, websocket client control can now be done with a middleware that directly controls the upgrade requests, to pro-actively close the WS upgrade request before it is transformed into a websocket client. So there is less need now to call cleanupClients() than before, even if it is still advised. So moving the complete client cleanup into this badly named cleanupClients() ...

So let's merge #427 first with all the good fixes, and we can rebase this PR to focus the changes on the behavior change, which we will need to discuss within the team with @me-no-dev and @willmmiles. Maybe there is also another, better way to handle that. We are also advising users to run with ...
return _clientId;
}
AwsClientStatus status() const {
  asyncsrv::lock_guard_type lock(_lock);
Although I understand the use case of locking a whole function, do we really need a lock just to read a register-sized value? status could also be updated from many other places.
Let's say client A disconnects, has taken the lock, and is about to set its status to DISCONNECTED.
At the same time, the app calls status() and reads CONNECTED for client A.
Then what?
Even if the app then triggers a call to send some data to client A, it will encounter a lock in one of the send methods.
I do not think there is a race through the cleanupClients() function either: a double call on close() should be supported, and it would also be ok if we miss a client that is currently disconnecting; it will be cleaned up next time.
Not sure we need a lock here?
I must confess that I am out of my depth here. Concurrency has always made my brain hurt. Here is what GPT-5.4 has to say:
There are two different questions:
- Is a stale read semantically acceptable?
- Is the read synchronized according to the C++ memory model?
For status(), the second question is the important one. If _status is an ordinary field, then a read from status() without taking the client lock is concurrent with writes that happen under the client lock in AsyncWebSocket.cpp. That is a real data race in C++, even if the only practical consequence most of the time would be “you briefly saw CONNECTED before DISCONNECTED”. The language does not treat that as a harmless stale read; it treats it as unsynchronized access to shared state, which is undefined behavior.
The later lock in a send path does not repair that earlier unsynchronized read. Once status() has read _status without synchronization, the damage is already done from the memory-model perspective. The same applies to cleanupClients(): whether a missed disconnect is tolerable is a behavioral question, but it does not make an unlocked read of _status safe.
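As a reduced, host-side illustration of that point (the type and member names here are simplified stand-ins, not code from this PR):

```cpp
#include <mutex>
#include <thread>

enum class AwsClientStatus { CONNECTED, DISCONNECTED };

struct ClientSketch {
  std::mutex lock;
  AwsClientStatus status = AwsClientStatus::CONNECTED;

  void onDisconnect() {
    std::lock_guard<std::mutex> g(lock);       // writer updates under the lock...
    status = AwsClientStatus::DISCONNECTED;
  }

  AwsClientStatus unlockedStatus() const {
    return status;                             // ...but this read takes no lock:
  }                                            // concurrent with onDisconnect() it is
};                                             // a data race (UB), not merely stale data

int main() {
  ClientSketch c;
  std::thread t([&] { c.onDisconnect(); });
  AwsClientStatus s = c.unlockedStatus();      // races with the write above
  t.join();
  (void)s;
}
```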
If the async_tcp task is running on the same core as the loop or user code, there is no need to add a mutex, but if async_tcp runs on core 0 and user code on another core, then accessing the value could potentially return a locally cached value while the real value in memory was updated from the other core. The question is: is this temporary situation acceptable, considering that the locally cached value won't stay long in the cache? It would take some user code silly-looping and calling status() to force many reads. So in that silly case, marking the field volatile could be enough to force a read from memory each time.
But like I said, I am not sure I ever want to go there, because if you look at the parsing logic there are many fields (most of them) set from the async_tcp task, which could be on core 0, and accessed from core 1. And that's ok. Status is no exception.
And I think it is ok to assume that no one should ever call status() in a loop at such a high frequency that the returned value could be a cached one.
Usually people calling status() are doing that because they want to execute something else based on the returned value.
So I would revert this one.
I will defer to your judgment on this - but is the mutex so expensive that removing it is worth the risk? Apparently, according to C++, if there is any unguarded access of such a variable, all bets are off with respect to memory barrier guarantees.
@MitchBradley : I agree with you regarding the memory model of C++ not being respected, in the sense that this status flag can be read/written from many places, so it could be considered shared state.
My concern is how much it can be considered shared state... For me there is no difference with the _acked and _ack fields, for example, which can also be read and accessed from many tasks (cores).
That's why I want this discussion and concern completely out of scope of this PR.
I don't disagree about the fact that some code review / update is required in this area.
But I disagree on putting it here in this PR.
This PR is to propose a change regarding where the cleanup happens, and the more important part is that this is a design / breaking change.
A new PR should be opened with the title "Review locking and unlocking of shared state to adhere to the C++ memory model". And in that PR, yes, the work can be done and discussed with the whole team, and also be scoped to more than just the status flag. The status flag is only one variable like that, but there are many more that share the same concerns and are accessed more frequently than the status flag.
Please make sure your PRs are as isolated as possible and only fix one concern at a time. It's fine if you open 2 or even 3 PRs and one has to be merged (and is based) on the other. What is important is to keep PRs focused on one issue at a time.
Otherwise it is particularly hard to review, or to read back the history, when one PR contains a big bag of several fixes that are not focused on solving the same problem.
Also, our current locks are made in a way to guard specific variables.
If you look at #429 I have renamed them for clarity.
We have 2 locks:
- one that is guarding the queues
- one that is guarding the ws clients list
That is all.
And we cannot use the same locks to lock everything...
For example, your changes tend to reuse the existing locks to sometimes guard the _client pointer, but not in every place, and also to guard the status flag, also not in every place.
So that's not consistent and can cause some unexpected slowdowns and unnecessary locking, because it could prevent access to the status flag while someone is sending a message.
That's why this review work had better be done in a more global PR and be discussed, because maybe the solution will be completely different from what you proposed.
C++ also has several other mechanisms, like atomic fields, weak pointers, etc.
Locks are not the only solution.
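As one illustration of such an alternative, a status field could be made atomic so a cross-core read is well-defined without touching either existing lock. This is only a sketch for discussion, with stand-in names, and is not tied to the current asyncsrv lock design:

```cpp
#include <atomic>

enum class AwsClientStatus { CONNECTED, DISCONNECTED };

class ClientStatusSketch {
  // An atomic field gives a defined cross-core read/write without taking
  // either of the existing locks (queue lock / ws client list lock).
  std::atomic<AwsClientStatus> _status{AwsClientStatus::CONNECTED};

public:
  AwsClientStatus status() const {
    return _status.load(std::memory_order_acquire);
  }
  void markDisconnected() {
    _status.store(AwsClientStatus::DISCONNECTED, std::memory_order_release);
  }
};
```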
void AsyncWebSocketClient::_onPoll() {
  asyncsrv::unique_lock_type lock(_lock);
this line removal change can be removed
bool AsyncWebSocketClient::_queueControl(uint8_t opcode, const uint8_t *data, size_t len, bool mask) {
  asyncsrv::lock_guard_type lock(_lock);
this line removal change can be removed
    _messageQueue.clear();
    _controlQueue.clear();
  }
  _server->_handleEvent(this, WS_EVT_DISCONNECT, NULL, NULL, 0);
Did you correctly test that moving this to _handleDisconnect works in all situations?
- javascript client (or websocat) closes the WS connection
- server-side close
- cleanupClients() call
- network connection breaks (i.e. wifi disconnects)
The call here makes sure that the user event handler is called whatever the use case, and can clean up resources.
With the move to _handleDisconnect, it will only be called if _onDisconnect is called, which is subject to how the AsyncTCP / ESPAsyncTCP / RPI / etc. implementation was done, which we have no control over.
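For context, this is the kind of user-side handler that relies on WS_EVT_DISCONNECT firing in every one of those cases. It is an Arduino-style fragment using the public ESPAsyncWebServer event API, not code from this PR:

```cpp
#include <ESPAsyncWebServer.h>

AsyncWebSocket ws("/ws");

void onWsEvent(AsyncWebSocket *server, AsyncWebSocketClient *client, AwsEventType type, void *arg, uint8_t *data, size_t len) {
  switch (type) {
    case WS_EVT_CONNECT:
      // e.g. allocate per-client application state keyed by client->id()
      break;
    case WS_EVT_DISCONNECT:
      // If this event were skipped in any of the scenarios above (browser
      // close, server-side close, cleanup, Wi-Fi drop), the per-client state
      // allocated on connect would leak.
      break;
    default:
      break;
  }
}

// Registration, typically done in setup():
//   ws.onEvent(onWsEvent);
//   server.addHandler(&ws);
```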
bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint8_t opcode, bool mask) {
  asyncsrv::unique_lock_type lock(_lock);
this line removal change can be removed
@@ -502,14 +498,16 @@ bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint
}

void AsyncWebSocketClient::close(uint16_t code, const char *message) {
I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I do not think it is needed at all to guard the status flag.
      _server->_handleEvent(this, WS_EVT_ERROR, (void *)&reasonCode, (uint8_t *)reasonString, strlen(reasonString));
    }
  }
  asyncsrv::unique_lock_type lock(_lock);
I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I do not think it is needed at all to guard the status flag.
if (_client) {
  _client->ackLater();
  {
    asyncsrv::lock_guard_type lock(_lock);
There is no need to protect the client ptr: it cannot become null between the if and the ackLater call.
_client ptr is only set to null from _client.close() and this is only called from the async_tcp task, same as this method.
Also, see: #429
this _lock is aimed at guarding the queue, not the client ptr
IPAddress AsyncWebSocketClient::remoteIP() const {
  asyncsrv::lock_guard_type lock(_lock);
this line removal change can be removed
uint16_t AsyncWebSocketClient::remotePort() const {
  asyncsrv::lock_guard_type lock(_lock);
this line removal change can be removed
// active iterator in the caller. However, emit the disconnect event now so
// applications observe the disconnect at the time it happens even though the
// client object remains in _clients until cleanup.
_handleEvent(client, WS_EVT_DISCONNECT, NULL, NULL, 0);
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/AsyncWebSocket.h:267
`AsyncWebSocketClient::client()` / `client() const` return the raw `_client` pointer without taking `_lock`, but `_client` is mutated under `_lock` (e.g. set to nullptr in `_onDisconnect()`). With `status()` now locking and other `_client` accessors being locked, these unlocked accessors become a remaining data-race/UB entry point. Consider guarding these accessors with `_lock` (or returning a snapshot via a locked getter) and/or clearly documenting that callers must externally synchronize before calling `client()`.
AwsClientStatus status() const {
  asyncsrv::lock_guard_type lock(_lock);
  return _status;
}
AsyncClient *client() {
  return _client;
}
const AsyncClient *client() const {
  return _client;
}
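If the team ever wanted to close that gap, one minimal variant of what Copilot suggests is a locked snapshot accessor. This is only a sketch with stand-in types, and it does not remove the need to coordinate with the client's lifetime after the pointer is returned:

```cpp
#include <mutex>

class AsyncClient;                      // opaque connection type from AsyncTCP

class ClientHandleSketch {
  mutable std::mutex _lock;             // stand-in for the per-client lock
  AsyncClient *_client = nullptr;

public:
  // Snapshot the pointer under the lock; the pointee can still be torn down
  // after this returns, so callers must still synchronize externally.
  AsyncClient *client() {
    std::lock_guard<std::mutex> g(_lock);
    return _client;
  }
  const AsyncClient *client() const {
    std::lock_guard<std::mutex> g(_lock);
    return _client;
  }
};
```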
// iterating _clients for broadcast sends, and erasing here invalidates the
// active iterator in the caller. However, emit the disconnect event now so
// applications observe the disconnect at the time it happens even though the
// client object remains in _clients until cleanup.
@MitchBradley : indeed, the lock added in onDisconnect is wrong and should be removed for this PR. See my other comments. First, this is not the goal of this PR, and also the lock that is used is the one that guards the queues.
@MitchBradley : I have merged #429. Could you please run your tests again?
I think just the fixes done in #429 and yesterday should be enough. If not, this should be covered by a new PR, but not in this one. Also, as I mentioned, you are not supposed to call cleanupClients(): the list cleans up by itself when a client disconnects. This is an asynchronous server, so no loop action should be required from the user. The cleanupClients() API is an old API that was put in place to limit the number of WS clients.
This fixes several lifetime and synchronization hazards in AsyncWebSocket that can surface when a client disconnects while the server is broadcasting to connected websocket clients.
Summary:
- Moves client removal from `_handleDisconnect()` to `cleanupClients()` so broadcast iteration is not invalidated by disconnect callbacks
- Adds locking around `AsyncWebSocketClient::_client` reads in `shouldBeDeleted()`, `_onPoll()`, `_queueControl()`, `_queueMessage()`, `_onTimeout()`, `_onDisconnect()`, `remoteIP()`, and `remotePort()`
- Runs `close()`, `ping()`, `text()`, and `binary()` operations under the websocket server lock instead of fetching a raw client pointer and using it later outside the lock
- Makes `cleanupClients()` count connected clients directly and erase all deletable clients safely in one pass

Why:
A disconnect callback can race with websocket send/broadcast paths. In particular, erasing a client immediately during `_handleDisconnect()` can invalidate active iteration during `textAll()`/`binaryAll()`, and unlocked `_client` access can race with teardown.

This change keeps the fix narrowly scoped to AsyncWebSocket lifetime handling. It does not include unrelated local debugging or response-state-machine changes from downstream investigation.
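To make the iterator-invalidation hazard concrete, here is a reduced stand-alone illustration with simplified types (not the library's actual code):

```cpp
#include <list>
#include <string>

struct ClientSketch {
  bool connected = true;
  void text(const std::string &) { /* queue a frame for this client */ }
};

std::list<ClientSketch> clients;

// Simplified broadcast, shaped like textAll(): iterates the client list.
void textAll(const std::string &msg) {
  for (auto it = clients.begin(); it != clients.end(); ++it) {
    it->text(msg);
    // If a disconnect callback erased *it from clients at this point,
    // it would be invalidated and the next ++it would be undefined behavior.
  }
}

// Deferred removal avoids that: the disconnect path only marks the client,
// and a later cleanup pass (outside any broadcast iteration) erases it.
void onDisconnect(ClientSketch &c) { c.connected = false; }
```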