Added deadlock detection in CommsDecoder#68377
Conversation
…concurrently on event loop
fd61c0e to
3ac9b72
Compare
|
I'm not sold on this fix. sync send on the event loop in an async should go via the existing greenlet trampoline etc to go back to the event loop, no? |
|
That should never be allowed to happen. We need to detect why sync send is being run on the async loop and trampoline back to the async version instead. |
This is because of the mask_secret, which is sync, but the get_async_conn connection in msgraph is async, and it also logs some connection details (except password ofc), but the mask_secret is still applied anyway. We could add detection, but as mask_secret is sync, it would then crash anyway, but at least we would know why. That's why it was so difficult to track and why I didn't have the issue with SFTPAsyncHook |
|
Claude propose to change mask_secret to something like this, but dunno if this is a good idea: This means detecting if mask_secret was called from an async context or not. |
Tested this solution and it doesn't solve the deadlock. |
That doesn't make sense to me. Oh, to tell the supervisor to mask it too |
If that doesn't solve the deadlock then I'm not sure that the problem is identified correctly. |
|
I think I found the issue: We should implement an async get_hook in BaseHook, hence this is why it's causing deadlock, as the get_hook method underneath calls the sync get_connection method. That was also the mean difference between my MSGraph case and SFTP one, hence why I didn't have the issue with SFTP example, because there I'm using the async SFTPClientPool: So it's good that you challenged the issue! Still I think it's a good thing to add the async aget_hook method on BaseHook. |
|
Oh greenback is currently only in the triggerer. airflow/airflow-core/src/airflow/jobs/triggerer_job_runner.py Lines 1016 to 1032 in d4c7791 I think the AsyncPython operator might need to do the same. Maybe. It's a bit invasive to set it there |
|
@dabla Thanks for digging in more. I'm just very sensitive to adding or changing locks, as I've worked quite hard to fix deadlocks on trigger (by removing locks!) |
|
I think adding the |
|
I wonder if we could also catch the deadlock in this case ( |
…() runs concurrently on event loop" This reverts commit 3ac9b72.
Will try to add the check as well, good idea. |
Detection mechanism added |
|
f3a06a9 to
22f4ff2
Compare
22f4ff2 to
bf229cb
Compare
And there's basically 0 chance of this affecting anyone today, as this is only a surfaces when you use the AsyncPythonOperator, right? (Just thinking about if this will suddenly start failing loads of existing workloads that were working before due to dumb luck) |
I did exactly the same reflection myself yesterday, but I monkey patched this check to all our environments (even production) to see what comes out of it and until now now exception where raised due to this check (+1200 DAG's), and we use a lot the async PythonOperator with iterate. A possible solution would be to add a config parameter for this check which is disabled by default for example. |
d2c27f1 to
2a0e4df
Compare
|
I've splitted the PR in two and create a separate PR to add the asynchronous aget_hook method to BaseHook. |
2a0e4df to
cbd969f
Compare
cbd969f to
ff9b930
Compare
| # Typical cause: BaseHook.get_hook() or BaseHook.get_connection() called | ||
| # from inside an async task / aexecute(). | ||
| # Fix: use 'await BaseHook.aget_hook()' or 'await BaseHook.aget_connection()'. | ||
| _on_loop_thread = self._loop_thread_id is not None and threading.get_ident() == self._loop_thread_id |
There was a problem hiding this comment.
| _on_loop_thread = self._loop_thread_id is not None and threading.get_ident() == self._loop_thread_id | |
| _on_loop_thread = threading.get_ident() == self._loop_thread_id |
I think this is enough? get_ident() can never return None so if self._loop_thread_id is None the part after and would never be True anyway.
Actually I think the entire added block can be simplified to
on_loop_thread = False
if threading.get_ident() == self._loop_thread_id:
with contextlib.suppress(RuntimeError):
on_loop_thread = bool(asyncio.get_running_loop())There was a problem hiding this comment.
Not sure if I missed anything; the two blocks seem to have mostly the same logic except the former uses blocking=False. I think we can just do
try:
if not self._thread_lock.acquire(blocking=not on_loop_thread):
raise DeadlockImminentError(...)
...
finally:
self._thread_lock.release()or, if this is too difficult to understand, with an inner function
def _communicate():
...
if on_loop_thread:
try:
if not acquire(blocking=False):
raise
_communicate()
finally:
release()
else:
with lock:
_communicate()…ock_imminent_error_when_asend_in_flight
49e591c to
3bfd48f
Compare
Summary
Adds a two-level
DeadlockImminentErrorguard inCommsDecoder.send()that catches the misuseimmediately and tells the developer exactly how to fix it.
Root cause
Async tasks that called
BaseHook.get_hook()from insideasync def/aexecute()wereinadvertently using the sync connection-fetch path:
get_hook()callsget_connection()→Connection.get()→_get_connection(), whichultimately calls
comms.send()— the blocking supervisor round-trip — on the event loopthread. Blocking the event loop thread while other coroutines are concurrently in-flight via
comms.asend()causes a deadlock or severe stall.This was the reason the issue only appeared with MSGraph and not with SFTP: the SFTP example
used
SFTPClientPooldirectly (a native async API), while the MSGraph task called the syncget_hook()from inside an async function.Changes
DeadlockImminentError+ guard inCommsDecoder.send()(task-sdk/src/airflow/sdk/execution_time/comms.py)Even with
aget_hook()available, a developer could still mistakenly call the sync path froman async context. Without a guard, the result is a silent hang that is almost impossible to
diagnose.
A new
DeadlockImminentError(aBaseExceptionsubclass, so it is not swallowed bycontextlib.suppress(Exception)) is raised immediately bysend()when it detects it isrunning on the event loop thread.
The detection uses two levels:
send()is called from the thread that last calledasend()asend()_thread_lockis currently held by a thread-pool worker from a concurrentasend()round-tripDeadlockImminentErrorinherits fromBaseExceptionrather thanExceptionfor two reasons:contextlib.suppress(Exception)— used inmask_secret()and other helpers — does notcatch it, so the error always surfaces.
failure, making
BaseExceptionsemantically correct.The error message includes the offending call stack and tells the developer exactly what to do:
Why this was hard to spot
get_hook()→ syncget_connection()SFTPClientPool→ async native APIcomms.send()blocks event loop threadcomms.asend()offloads to thread poolTesting
TestBaseHook.test_aget_hook— verifiesaget_hook()routes exclusively throughasend()and never calls the blocking
send().DeadlockImminentError— verifiessend()raises the error (both levels) whencalled from the event loop thread, and that it propagates through
contextlib.suppress(Exception).Was generative AI tooling used to co-author this PR?
Claude Opus 4.6 to diagnose the issue and assist writing unit test
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.