Summary
When two variant translation jobs run concurrently for score sets that share ClinGen allele IDs (CAIDs), both jobs can hang indefinitely — silently, with no exception raised — until the 3-hour JOB_TIMEOUT_SECONDS fires. This is caused by a combination of synchronous psycopg2 blocking the asyncio event loop and long-lived uncommitted transactions holding row locks on variant_translations.
Problem
populate_variant_translations_for_score_set in worker/jobs/external_services/variant_translation.py calls upsert_variant_translations (in lib/variant_translations.py) for each allele. That function issues an INSERT ... ON CONFLICT DO NOTHING against the variant_translations table using a synchronous SQLAlchemy Session backed by psycopg2.
The hang occurs through the following sequence:
- Both jobs run as coroutines in the same asyncio event loop (MAX_JOBS = 2 in worker/settings/worker.py).
- Score sets from the same experiment share CAIDs. When a CA allele is resolved through a shared PA, both jobs discover and attempt to insert the same (aa_clingen_id, nt_clingen_id) pairs into variant_translations.
- Transactions are long-lived: db.execute() in upsert_variant_translations flushes but does not commit. Commits only happen inside update_progress, every ~10 alleles.
- Job B's db.execute() blocks the OS thread while waiting for a row lock held by Job A's open transaction.
- Because psycopg2 is synchronous, blocking the OS thread freezes the entire asyncio event loop. Job A cannot advance to its next await point, cannot call update_progress, and cannot commit, so it never releases its locks.
- From Postgres's perspective, only Job B is waiting. Job A's transaction is idle. There is no circular wait, so Postgres does not detect a deadlock and raises no exception.
- The set-based deduplication in upsert_variant_translations (list({(aa, nt) for ...})) produces non-deterministic row ordering, which means in a multi-process scenario the jobs can also acquire locks in opposite orders: a true circular deadlock that would raise an exception in separate-process deployments but still manifests as an indefinite hang in the shared event loop case.
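The event-loop freeze at the heart of this sequence can be reproduced without a database. In this minimal sketch (hypothetical stand-ins, not the real worker code), a synchronous time.sleep plays the role of psycopg2's blocking db.execute():

```python
import asyncio
import time

async def job_a():
    # Job A is ready to commit and release its locks at its next turn
    # on the loop, but it cannot run while the thread is blocked.
    await asyncio.sleep(0)
    return "job A commits"

async def job_b():
    # A synchronous driver call (psycopg2 waiting on a row lock)
    # blocks the OS thread itself, freezing every other coroutine.
    time.sleep(0.5)  # stand-in for a blocking db.execute()
    return "job B returns"

async def main():
    start = time.monotonic()
    await asyncio.gather(job_b(), job_a())
    return time.monotonic() - start

elapsed = asyncio.run(main())
# Job A could not advance until job B's blocking call finished,
# even though job A was runnable the whole time.
print(f"loop was frozen for {elapsed:.2f}s")
```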
Steps to Reproduce
- Create two score sets within the same experiment that share mapped variants resolving to overlapping CAIDs (e.g., urn:mavedb:00001268-a-1 and urn:mavedb:00001268-b-1).
- Trigger variant translation jobs for both score sets such that they execute concurrently within the same worker process.
- Observe that both jobs log progress; they may even complete successfully. On some runs, however, both jobs stop executing and hang.
- No error or exception is logged. Both jobs remain in the RUNNING state until JOB_TIMEOUT_SECONDS (3 hours) elapses.
Expected Behavior
Concurrent variant translation jobs on overlapping score sets should either:
- Complete successfully (one waits briefly for the other to commit, then continues), or
- Fail fast with a recoverable error and be retried, rather than hanging silently for hours.
Proposed Behavior
Two changes to lib/variant_translations.py:
- Sort rows before inserting. Change list({(aa, nt) for ...}) to sorted({(aa, nt) for ...}). This ensures all transactions acquire row locks in the same canonical (aa_clingen_id, nt_clingen_id) order, eliminating any circular wait in multi-process deployments and reducing the overlap window in the shared event loop case.
- Set a per-statement lock timeout using SET LOCAL. Issue db.execute(text("SET LOCAL lock_timeout = '5s'")) immediately before the INSERT. SET LOCAL scopes the timeout to the current transaction only; it expires at the next commit and does not affect unrelated jobs or statements. When Job B's insert blocks on Job A's lock, Postgres raises ERROR: canceling statement due to lock timeout after 5 seconds. This propagates as an OperationalError through SQLAlchemy, is caught by the with_pipeline_management decorator's exception handler, and the job is marked failed and retried. On retry, the overlapping job has typically already committed its batch, so the conflict does not recur.
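Taken together, the two changes might look like the following sketch. The table and column names come from the issue; the RecordingSession stub is hypothetical and exists only so the statement ordering is visible without a database (the real code uses a SQLAlchemy Session and wraps the SQL in text() clauses):

```python
def upsert_variant_translations_sketch(db, pairs):
    """Hedged sketch of the proposed change, not the real function."""
    # Change 1: deduplicate, then sort, so every transaction acquires
    # row locks in the same canonical (aa_clingen_id, nt_clingen_id) order.
    rows = sorted({(aa, nt) for aa, nt in pairs})
    if not rows:
        return
    # Change 2: scope a lock timeout to this transaction only.
    # In the real code this string would be wrapped in sqlalchemy.text().
    db.execute("SET LOCAL lock_timeout = '5s'")
    db.execute(
        "INSERT INTO variant_translations (aa_clingen_id, nt_clingen_id) "
        "VALUES (:aa, :nt) ON CONFLICT DO NOTHING",
        [{"aa": aa, "nt": nt} for aa, nt in rows],
    )

class RecordingSession:
    """Hypothetical stand-in that records SQL instead of executing it."""
    def __init__(self):
        self.statements = []

    def execute(self, stmt, params=None):
        self.statements.append((stmt, params))

db = RecordingSession()
upsert_variant_translations_sketch(
    db, [("CA2", "CA1"), ("CA1", "CA2"), ("CA2", "CA1")]
)
# The timeout is set first, on the same session as the INSERT, and
# the duplicate pair is gone with the rows in canonical order.
assert db.statements[0][0].startswith("SET LOCAL lock_timeout")
assert db.statements[1][1] == [
    {"aa": "CA1", "nt": "CA2"},
    {"aa": "CA2", "nt": "CA1"},
]
```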
The long-term fix is tracked in #715. Once all worker DB sessions are async, db.execute() will yield to the event loop on lock waits rather than blocking the OS thread, making the lock timeout unnecessary.
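By contrast, once the sessions are async, the same wait becomes an await point and the loop stays live. A minimal sketch, with an awaited sleep standing in for an async driver's lock wait:

```python
import asyncio

order = []

async def job_a():
    # Job A runs, commits, and releases its locks while job B waits.
    await asyncio.sleep(0)
    order.append("A commits")

async def job_b():
    # With an async driver, waiting on a row lock is an await point,
    # so the event loop is free to schedule job A in the meantime.
    await asyncio.sleep(0.2)  # stand-in for an awaited lock wait
    order.append("B inserts")

async def main():
    await asyncio.gather(job_b(), job_a())

asyncio.run(main())
# Job A finished first: the loop was never frozen during B's wait.
print(order)
```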
Acceptance Criteria
- upsert_variant_translations sorts the deduplicated (aa_clingen_id, nt_clingen_id) pairs before constructing the INSERT values list.
- upsert_variant_translations issues SET LOCAL lock_timeout = '5s' on the session before executing the INSERT.
- A job whose insert blocks on another transaction's lock fails fast with an OperationalError and is retried.
- Statements outside upsert_variant_translations are not affected by the lock timeout (verified by confirming SET LOCAL scope).
- Existing tests for upsert_variant_translations continue to pass.
Implementation Notes
- The SET LOCAL statement must be issued on the same Session (and therefore the same underlying connection) as the INSERT, within the same transaction. Issuing it on a separate connection or after a commit would have no effect.
- The 5-second timeout value is a starting point. It should be long enough to avoid spurious failures under normal load but short enough to unblock the event loop well before any downstream timeout fires.
- worker/settings/worker.py already contains a comment explaining the MAX_JOBS = 2 cap and the psycopg2 event loop starvation risk. That comment should be updated to reference this fix and note that the lock timeout is a mitigation, not a resolution.
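For completeness, the retry path described in the proposal has roughly this shape. Everything here is a hypothetical stand-in (per the issue, the real handler lives in the with_pipeline_management decorator); it only illustrates how a lock-timeout failure can be told apart from other errors:

```python
class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError."""

def run_step(step):
    # Sketch of the decorator's exception handling: a lock timeout is
    # treated as a recoverable failure, anything else propagates.
    try:
        step()
    except OperationalError as exc:
        if "lock timeout" in str(exc):
            return "failed (retryable): lock timeout"
        raise
    return "ok"

def blocked_insert():
    # Postgres surfaces the SET LOCAL timeout with this message.
    raise OperationalError("canceling statement due to lock timeout")

print(run_step(blocked_insert))
```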