Proposal to change how the WebSocket clients are cleaned up to remove a race condition #424
Conversation
Pull request overview
This PR tightens AsyncWebSocket client lifetime handling to avoid iterator invalidation and unsafe client pointer usage when disconnects race with broadcast sends.
Changes:
- Defers client removal from disconnect handling to `cleanupClients()` to avoid invalidating `_clients` iteration during broadcasts.
- Adds additional locking around `AsyncWebSocketClient::_client` reads and moves per-client operations (close/ping/text/binary) to execute under the websocket server lock.
- Updates `cleanupClients()` to count connected clients directly and erase all deletable clients in a single pass.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/AsyncWebSocket.h | Adds client lock usage in `shouldBeDeleted()` to make deletion checks thread-safe. |
| src/AsyncWebSocket.cpp | Introduces a locked client lookup helper, adds additional locking in client callbacks/queueing, defers disconnect removal, and revises cleanup + per-client operations to run under the server lock. |
While stress-testing with FluidNC running on macOS, I ran into some problems that showed up with websocket textAll, which tended to deadlock. GPT-5.4 did a deep analysis and came up with these fixes, which solved the problem.
status() is now synchronized under the per-client mutex, so the connected-client checks in find_connected_client_locked() and cleanupClients() no longer race with _status updates. I also changed cleanupClients() to splice deletable clients into a temporary list and destroy them after releasing AsyncWebSocket::_lock, so disconnect callbacks no longer run under the server lock.
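A minimal sketch of that splice-then-destroy pattern, using simplified stand-in types rather than the actual AsyncWebSocket classes (the idea is the same: move deletable nodes out under the lock, let destructors run after it is released):

```cpp
#include <list>
#include <mutex>

struct ClientSketch {                    // stand-in for AsyncWebSocketClient
  bool deleted = false;
  bool shouldBeDeleted() const { return deleted; }
  // ~ClientSketch() is where disconnect callbacks would fire in the real class
};

struct ServerSketch {                    // stand-in for AsyncWebSocket
  std::mutex _lock;                      // server lock guarding the client list
  std::list<ClientSketch> _clients;

  void cleanupClients() {
    std::list<ClientSketch> toDelete;    // destroyed only after the lock is gone
    {
      std::lock_guard<std::mutex> g(_lock);
      for (auto it = _clients.begin(); it != _clients.end();) {
        if (it->shouldBeDeleted()) {
          // Move the node into the temporary list; no destructor runs here.
          toDelete.splice(toDelete.end(), _clients, it++);
        } else {
          ++it;
        }
      }
    }
    // toDelete goes out of scope here, so client destructors (and any
    // disconnect callbacks they trigger) run without holding the server lock.
  }
};
```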
@MitchBradley : I have cherry-picked your commits into #427 and, in an added commit, removed the changes regarding how the clients are cleaned up, in order to split the PR into 2 parts:
For the story: we know the issue, @willmmiles also suggested once that we make some changes in this area, and we had a user run into such an issue just like you did.

Now, the proposed changes are quite important. With the middleware stack that was added, websocket client control can now be done with a middleware that directly controls the upgrade requests, to pro-actively close the WS upgrade request before it is transformed into a websocket client. So there is less need now to call cleanupClients() than before, even if it is still advised. So moving the complete client cleanup into this badly named cleanupClients() ...

So let's merge #427 first with all the good fixes, and we can rebase this PR to focus the changes on the behavior change, which we will need to discuss within the team with @me-no-dev and @willmmiles. Maybe there is also another, better way to handle that. We are also advising users to run with ...
return _clientId;
}
AwsClientStatus status() const {
  asyncsrv::lock_guard_type lock(_lock);
Although I understand the use case of locking a whole function, do we really need a lock just to read a register-sized value? status could also be updated from many other places.
Let's say client A disconnects, has taken the lock, and is about to set its status to DISCONNECTED.
At the same time, the app calls status() and reads CONNECTED for client A.
Then what?
Even if the app then triggers a call to send some data to client A, it will encounter a lock in one of the send methods.
I do not think there is a race through the cleanupClients() function either: a double call on close() should be supported, and it would also be ok if we miss a client that is currently disconnecting; it will be cleaned up next time.
Not sure we need a lock here?
I must confess that I am out of my depth here. Concurrency has always made my brain hurt. Here is what GPT-5.4 has to say:
There are two different questions:
- Is a stale read semantically acceptable?
- Is the read synchronized according to the C++ memory model?
For status(), the second question is the important one. If _status is an ordinary field, then a read from status() without taking the client lock is concurrent with writes that happen under the client lock in AsyncWebSocket.cpp. That is a real data race in C++, even if the only practical consequence most of the time would be “you briefly saw CONNECTED before DISCONNECTED”. The language does not treat that as a harmless stale read; it treats it as unsynchronized access to shared state, which is undefined behavior.
The later lock in a send path does not repair that earlier unsynchronized read. Once status() has read _status without synchronization, the damage is already done from the memory-model perspective. The same applies to cleanupClients(): whether a missed disconnect is tolerable is a behavioral question, but it does not make an unlocked read of _status safe.
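As a reduced, host-side illustration of that point (the type and member names here are simplified stand-ins, not code from this PR):

```cpp
#include <mutex>
#include <thread>

enum class AwsClientStatus { CONNECTED, DISCONNECTED };

struct ClientSketch {
  std::mutex lock;
  AwsClientStatus status = AwsClientStatus::CONNECTED;

  void onDisconnect() {
    std::lock_guard<std::mutex> g(lock);       // writer updates under the lock...
    status = AwsClientStatus::DISCONNECTED;
  }

  AwsClientStatus unlockedStatus() const {
    return status;                             // ...but this read takes no lock:
  }                                            // concurrent with onDisconnect() it is
};                                             // a data race (UB), not merely stale data

int main() {
  ClientSketch c;
  std::thread t([&] { c.onDisconnect(); });
  AwsClientStatus s = c.unlockedStatus();      // races with the write above
  t.join();
  (void)s;
}
```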
If the async_tcp task is running on the same core as the loop or user code, there is no need to add a mutex, but if async_tcp runs on core 0 and user code on another core, then accessing the value could potentially return a locally cached value while the real value in memory was updated from the other core. The question is: is this temporary situation acceptable, considering that the locally cached value won't stay long in the cache? It would take some user code silly-looping and calling status() to force many reads. So in that silly case, marking the field volatile could be enough to force a read from memory each time.
But like I said, I am not sure I ever want to go there, because if you look at the parsing logic there are many fields (most of them) set from the async_tcp task, which could be on core 0, and accessed from core 1. And that's ok. Status is no exception.
And I think it is ok to assume that no one should ever call status() in a loop at such a high frequency that the returned value could be a cached one.
Usually people calling status() are doing that because they want to execute something else based on the returned value.
So I would revert this one.
I will defer to your judgment on this - but is the mutex so expensive that removing it is worth the risk? Apparently, according to C++, if there is any unguarded access of such a variable, all bets are off with respect to memory barrier guarantees.
@MitchBradley : I agree with you regarding the memory model of C++ not being respected, in the sense that this status flag can be read/written from many places, so it could be considered shared state.
My concern is how much it can be considered shared state... For me there is no difference with the _acked and _ack fields, for example, which can also be read and accessed from many tasks (cores).
That's why I want this discussion and concern completely out of scope of this PR.
I don't disagree about the fact that some code review / update is required in this area.
But I disagree on putting it here in this PR.
This PR is to propose a change regarding where the cleanup happens, and the more important part is that this is a design / breaking change.
A new PR should be opened with the title "Review locking and unlocking of shared state to adhere to the C++ memory model". And in that PR, yes, the work can be done and discussed with the whole team, and also be scoped to more than just the status flag. The status flag is only one variable like that, but there are many more that share the same concerns and are accessed more frequently than the status flag.
Please make sure your PRs are as isolated as possible and only fix one concern at a time. It's fine if you open 2 or even 3 PRs and one has to be merged (and is based) on the other. What is important is to keep PRs focused on one issue at a time.
Otherwise it is particularly hard to review, or to read back the history, when one PR contains a big bag of several fixes that are not focused on solving the same problem.
Also, our current locks are made in a way to guard specific variables.
If you look at #429 I have renamed them for clarity.
We have 2 locks:
- one that is guarding the queues
- one that is guarding the ws clients list
That is all.
And we cannot use the same locks to lock everything...
For example, your changes tend to reuse the existing locks to sometimes guard the _client pointer, but not in every place, and also to guard the status flag, also not in every place.
So that's not consistent and can cause some unexpected slowdowns and unnecessary locking, because it could prevent access to the status flag while someone is sending a message.
That's why this review work had better be done in a more global PR and be discussed, because maybe the solution will be completely different from what you proposed.
C++ also has several other mechanisms, like atomic fields, weak pointers, etc.
Locks are not the only solution.
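As one illustration of such an alternative, a status field could be made atomic so a cross-core read is well-defined without touching either existing lock. This is only a sketch for discussion, with stand-in names, and is not tied to the current asyncsrv lock design:

```cpp
#include <atomic>

enum class AwsClientStatus { CONNECTED, DISCONNECTED };

class ClientStatusSketch {
  // An atomic field gives a defined cross-core read/write without taking
  // either of the existing locks (queue lock / ws client list lock).
  std::atomic<AwsClientStatus> _status{AwsClientStatus::CONNECTED};

public:
  AwsClientStatus status() const {
    return _status.load(std::memory_order_acquire);
  }
  void markDisconnected() {
    _status.store(AwsClientStatus::DISCONNECTED, std::memory_order_release);
  }
};
```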
void AsyncWebSocketClient::_onPoll() {
  asyncsrv::unique_lock_type lock(_lock);
this line removal change can be removed
bool AsyncWebSocketClient::_queueControl(uint8_t opcode, const uint8_t *data, size_t len, bool mask) {
  asyncsrv::lock_guard_type lock(_lock);
this line removal change can be removed
    _messageQueue.clear();
    _controlQueue.clear();
  }
  _server->_handleEvent(this, WS_EVT_DISCONNECT, NULL, NULL, 0);
Did you correctly test that moving this to _handleDisconnect works in all situations?
- javascript client (or websocat) closes the WS connection
- server-side close
- cleanupClients() call
- network connection breaks (i.e. wifi disconnects)
The call here makes sure that the user event handler is called whatever the use case, and can clean up resources.
With the move to _handleDisconnect, it will only be called if _onDisconnect is called, which is subject to how the AsyncTCP / ESPAsyncTCP / RPI / etc. implementation was done, which we have no control over.
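For context, this is the kind of user-side handler that relies on WS_EVT_DISCONNECT firing in every one of those cases. It is an Arduino-style fragment using the public ESPAsyncWebServer event API, not code from this PR:

```cpp
#include <ESPAsyncWebServer.h>

AsyncWebSocket ws("/ws");

void onWsEvent(AsyncWebSocket *server, AsyncWebSocketClient *client, AwsEventType type, void *arg, uint8_t *data, size_t len) {
  switch (type) {
    case WS_EVT_CONNECT:
      // e.g. allocate per-client application state keyed by client->id()
      break;
    case WS_EVT_DISCONNECT:
      // If this event were skipped in any of the scenarios above (browser
      // close, server-side close, cleanup, Wi-Fi drop), the per-client state
      // allocated on connect would leak.
      break;
    default:
      break;
  }
}

// Registration, typically done in setup():
//   ws.onEvent(onWsEvent);
//   server.addHandler(&ws);
```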
bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint8_t opcode, bool mask) {
  asyncsrv::unique_lock_type lock(_lock);
this line removal change can be removed
@@ -502,14 +498,16 @@ bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint
}

void AsyncWebSocketClient::close(uint16_t code, const char *message) {
I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I do not think it is needed at all to guard the status flag.
      _server->_handleEvent(this, WS_EVT_ERROR, (void *)&reasonCode, (uint8_t *)reasonString, strlen(reasonString));
    }
  }
  asyncsrv::unique_lock_type lock(_lock);
I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I do not think it is needed at all to guard the status flag.
if (_client) {
  _client->ackLater();
  {
    asyncsrv::lock_guard_type lock(_lock);
There is no need to protect the client ptr: it cannot become null between the if and the ackLater call.
_client ptr is only set to null from _client.close() and this is only called from the async_tcp task, same as this method.
Also, see: #429
this _lock is aimed at guarding the queue, not the client ptr
IPAddress AsyncWebSocketClient::remoteIP() const {
  asyncsrv::lock_guard_type lock(_lock);
this line removal change can be removed
uint16_t AsyncWebSocketClient::remotePort() const {
  asyncsrv::lock_guard_type lock(_lock);
this line removal change can be removed
// active iterator in the caller. However, emit the disconnect event now so
// applications observe the disconnect at the time it happens even though the
// client object remains in _clients until cleanup.
_handleEvent(client, WS_EVT_DISCONNECT, NULL, NULL, 0);
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/AsyncWebSocket.h:267
`AsyncWebSocketClient::client()` / `client() const` return the raw `_client` pointer without taking `_lock`, but `_client` is mutated under `_lock` (e.g. set to nullptr in `_onDisconnect()`). With `status()` now locking and other `_client` accessors being locked, these unlocked accessors become a remaining data-race/UB entry point. Consider guarding these accessors with `_lock` (or returning a snapshot via a locked getter) and/or clearly documenting that callers must externally synchronize before calling `client()`.
AwsClientStatus status() const {
  asyncsrv::lock_guard_type lock(_lock);
  return _status;
}
AsyncClient *client() {
  return _client;
}
const AsyncClient *client() const {
  return _client;
}
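If the team ever wanted to close that gap, one minimal variant of what Copilot suggests is a locked snapshot accessor. This is only a sketch with stand-in types, and it does not remove the need to coordinate with the client's lifetime after the pointer is returned:

```cpp
#include <mutex>

class AsyncClient;                      // opaque connection type from AsyncTCP

class ClientHandleSketch {
  mutable std::mutex _lock;             // stand-in for the per-client lock
  AsyncClient *_client = nullptr;

public:
  // Snapshot the pointer under the lock; the pointee can still be torn down
  // after this returns, so callers must still synchronize externally.
  AsyncClient *client() {
    std::lock_guard<std::mutex> g(_lock);
    return _client;
  }
  const AsyncClient *client() const {
    std::lock_guard<std::mutex> g(_lock);
    return _client;
  }
};
```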
// iterating _clients for broadcast sends, and erasing here invalidates the
// active iterator in the caller. However, emit the disconnect event now so
// applications observe the disconnect at the time it happens even though the
// client object remains in _clients until cleanup.
@MitchBradley : indeed, the lock added in onDisconnect is wrong and should be removed for this PR. See my other comments. First, this is not the goal of this PR, and also the lock that is used is the one that guards the queues.
@MitchBradley : I have merged #429. Could you please run your tests again?
I think just the fixes done in #429 and yesterday should be enough. If not, this should be covered by a new PR, but not in this one. Also, as I mentioned, you are not supposed to call cleanupClients(): the list cleans up by itself when a client disconnects. This is an asynchronous server, so no loop action should be required from the user. The cleanupClients() API is an old API that was put in place to limit the number of WS clients.
This fixes several lifetime and synchronization hazards in AsyncWebSocket that can surface when a client disconnects while the server is broadcasting to connected websocket clients.
Summary:
- Moves client removal from `_handleDisconnect()` to `cleanupClients()` so broadcast iteration is not invalidated by disconnect callbacks
- Adds locking around `AsyncWebSocketClient::_client` reads in `shouldBeDeleted()`, `_onPoll()`, `_queueControl()`, `_queueMessage()`, `_onTimeout()`, `_onDisconnect()`, `remoteIP()`, and `remotePort()`
- Runs `close()`, `ping()`, `text()`, and `binary()` operations under the websocket server lock instead of fetching a raw client pointer and using it later outside the lock
- Makes `cleanupClients()` count connected clients directly and erase all deletable clients safely in one pass

Why:
A disconnect callback can race with websocket send/broadcast paths. In particular, erasing a client immediately during `_handleDisconnect()` can invalidate active iteration during `textAll()`/`binaryAll()`, and unlocked `_client` access can race with teardown.

This change keeps the fix narrowly scoped to AsyncWebSocket lifetime handling. It does not include unrelated local debugging or response-state-machine changes from downstream investigation.
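To make the iterator-invalidation hazard concrete, here is a reduced stand-alone illustration with simplified types (not the library's actual code):

```cpp
#include <list>
#include <string>

struct ClientSketch {
  bool connected = true;
  void text(const std::string &) { /* queue a frame for this client */ }
};

std::list<ClientSketch> clients;

// Simplified broadcast, shaped like textAll(): iterates the client list.
void textAll(const std::string &msg) {
  for (auto it = clients.begin(); it != clients.end(); ++it) {
    it->text(msg);
    // If a disconnect callback erased *it from clients at this point,
    // it would be invalidated and the next ++it would be undefined behavior.
  }
}

// Deferred removal avoids that: the disconnect path only marks the client,
// and a later cleanup pass (outside any broadcast iteration) erases it.
void onDisconnect(ClientSketch &c) { c.connected = false; }
```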