Summary
When two variant translation jobs run concurrently for score sets that share ClinGen allele IDs (CAIDs), both jobs can hang indefinitely — silently, with no exception raised — until the 3-hour JOB_TIMEOUT_SECONDS fires. This is caused by a combination of synchronous psycopg2 blocking the asyncio event loop and long-lived uncommitted transactions holding row locks on variant_translations.
Problem
populate_variant_translations_for_score_set in worker/jobs/external_services/variant_translation.py calls upsert_variant_translations (in lib/variant_translations.py) for each allele. That function issues an INSERT ... ON CONFLICT DO NOTHING against the variant_translations table using a synchronous SQLAlchemy Session backed by psycopg2.
The hang occurs through the following sequence:
- Both jobs run as coroutines in the same asyncio event loop (MAX_JOBS = 2 in worker/settings/worker.py).
- Score sets from the same experiment share CAIDs. When a CA allele is resolved through a shared PA, both jobs discover and attempt to insert the same (aa_clingen_id, nt_clingen_id) pairs into variant_translations.
- Transactions are long-lived: db.execute() in upsert_variant_translations flushes but does not commit. Commits only happen inside update_progress, every ~10 alleles.
- Job B's db.execute() blocks the OS thread while waiting for a row lock held by Job A's open transaction.
- Because psycopg2 is synchronous, blocking the OS thread freezes the entire asyncio event loop. Job A cannot advance to its next await point, cannot call update_progress, and cannot commit, so it never releases its locks.
- From Postgres's perspective, only Job B is waiting. Job A's transaction is idle. There is no circular wait, so Postgres does not detect a deadlock and raises no exception.
- The set-based deduplication in upsert_variant_translations (list({(aa, nt) for ...})) produces non-deterministic row ordering, which means in a multi-process scenario the jobs can also acquire locks in opposite orders: a true circular deadlock that would raise an exception in separate-process deployments but still manifests as an indefinite hang in the shared event loop case.
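The event-loop freeze at the heart of this sequence can be reproduced without a database. In this minimal sketch (hypothetical stand-ins, not the real worker code), a synchronous time.sleep plays the role of psycopg2's blocking db.execute():

```python
import asyncio
import time

async def job_a():
    # Job A is ready to commit and release its locks at its next turn
    # on the loop, but it cannot run while the thread is blocked.
    await asyncio.sleep(0)
    return "job A commits"

async def job_b():
    # A synchronous driver call (psycopg2 waiting on a row lock)
    # blocks the OS thread itself, freezing every other coroutine.
    time.sleep(0.5)  # stand-in for a blocking db.execute()
    return "job B returns"

async def main():
    start = time.monotonic()
    await asyncio.gather(job_b(), job_a())
    return time.monotonic() - start

elapsed = asyncio.run(main())
# Job A could not advance until job B's blocking call finished,
# even though job A was runnable the whole time.
print(f"loop was frozen for {elapsed:.2f}s")
```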
Steps to Reproduce
- Create two score sets within the same experiment that share mapped variants resolving to overlapping CAIDs (e.g., urn:mavedb:00001268-a-1 and urn:mavedb:00001268-b-1).
- Trigger variant translation jobs for both score sets such that they execute concurrently within the same worker process.
- Observe that both jobs log progress; they may even complete successfully. On some runs, however, both jobs stop executing and hang.
- No error or exception is logged. Both jobs remain in the RUNNING state until JOB_TIMEOUT_SECONDS (3 hours) elapses.
Expected Behavior
Concurrent variant translation jobs on overlapping score sets should either:
- Complete successfully (one waits briefly for the other to commit, then continues), or
- Fail fast with a recoverable error and be retried, rather than hanging silently for hours.
Proposed Behavior
Two changes to lib/variant_translations.py:
- Sort rows before inserting. Change list({(aa, nt) for ...}) to sorted({(aa, nt) for ...}). This ensures all transactions acquire row locks in the same canonical (aa_clingen_id, nt_clingen_id) order, eliminating any circular wait in multi-process deployments and reducing the overlap window in the shared event loop case.
- Set a per-statement lock timeout using SET LOCAL. Issue db.execute(text("SET LOCAL lock_timeout = '5s'")) immediately before the INSERT. SET LOCAL scopes the timeout to the current transaction only; it expires at the next commit and does not affect unrelated jobs or statements. When Job B's insert blocks on Job A's lock, Postgres raises ERROR: canceling statement due to lock timeout after 5 seconds. This propagates as an OperationalError through SQLAlchemy, is caught by the with_pipeline_management decorator's exception handler, and the job is marked failed and retried. On retry, the overlapping job has typically already committed its batch, so the conflict does not recur.
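Taken together, the two changes might look like the following sketch. The table and column names come from the issue; the RecordingSession stub is hypothetical and exists only so the statement ordering is visible without a database (the real code uses a SQLAlchemy Session and wraps the SQL in text() clauses):

```python
def upsert_variant_translations_sketch(db, pairs):
    """Hedged sketch of the proposed change, not the real function."""
    # Change 1: deduplicate, then sort, so every transaction acquires
    # row locks in the same canonical (aa_clingen_id, nt_clingen_id) order.
    rows = sorted({(aa, nt) for aa, nt in pairs})
    if not rows:
        return
    # Change 2: scope a lock timeout to this transaction only.
    # In the real code this string would be wrapped in sqlalchemy.text().
    db.execute("SET LOCAL lock_timeout = '5s'")
    db.execute(
        "INSERT INTO variant_translations (aa_clingen_id, nt_clingen_id) "
        "VALUES (:aa, :nt) ON CONFLICT DO NOTHING",
        [{"aa": aa, "nt": nt} for aa, nt in rows],
    )

class RecordingSession:
    """Hypothetical stand-in that records SQL instead of executing it."""
    def __init__(self):
        self.statements = []

    def execute(self, stmt, params=None):
        self.statements.append((stmt, params))

db = RecordingSession()
upsert_variant_translations_sketch(
    db, [("CA2", "CA1"), ("CA1", "CA2"), ("CA2", "CA1")]
)
# The timeout is set first, on the same session as the INSERT, and
# the duplicate pair is gone with the rows in canonical order.
assert db.statements[0][0].startswith("SET LOCAL lock_timeout")
assert db.statements[1][1] == [
    {"aa": "CA1", "nt": "CA2"},
    {"aa": "CA2", "nt": "CA1"},
]
```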
The long-term fix is tracked in #715. Once all worker DB sessions are async, db.execute() will yield to the event loop on lock waits rather than blocking the OS thread, making the lock timeout unnecessary.
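By contrast, once the sessions are async, the same wait becomes an await point and the loop stays live. A minimal sketch, with an awaited sleep standing in for an async driver's lock wait:

```python
import asyncio

order = []

async def job_a():
    # Job A runs, commits, and releases its locks while job B waits.
    await asyncio.sleep(0)
    order.append("A commits")

async def job_b():
    # With an async driver, waiting on a row lock is an await point,
    # so the event loop is free to schedule job A in the meantime.
    await asyncio.sleep(0.2)  # stand-in for an awaited lock wait
    order.append("B inserts")

async def main():
    await asyncio.gather(job_b(), job_a())

asyncio.run(main())
# Job A finished first: the loop was never frozen during B's wait.
print(order)
```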
Acceptance Criteria
- upsert_variant_translations sorts the deduplicated (aa_clingen_id, nt_clingen_id) pairs before constructing the INSERT values list.
- upsert_variant_translations issues SET LOCAL lock_timeout = '5s' on the session before executing the INSERT.
- A job whose insert blocks on another transaction's lock fails fast with an OperationalError and is retried.
- Statements outside upsert_variant_translations are not affected by the lock timeout (verified by confirming SET LOCAL scope).
- Existing tests for upsert_variant_translations continue to pass.
Implementation Notes
- The SET LOCAL statement must be issued on the same Session (and therefore the same underlying connection) as the INSERT, within the same transaction. Issuing it on a separate connection or after a commit would have no effect.
- The 5-second timeout value is a starting point. It should be long enough to avoid spurious failures under normal load but short enough to unblock the event loop well before any downstream timeout fires.
- worker/settings/worker.py already contains a comment explaining the MAX_JOBS = 2 cap and the psycopg2 event loop starvation risk. That comment should be updated to reference this fix and note that the lock timeout is a mitigation, not a resolution.
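For completeness, the retry path described in the proposal has roughly this shape. Everything here is a hypothetical stand-in (per the issue, the real handler lives in the with_pipeline_management decorator); it only illustrates how a lock-timeout failure can be told apart from other errors:

```python
class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError."""

def run_step(step):
    # Sketch of the decorator's exception handling: a lock timeout is
    # treated as a recoverable failure, anything else propagates.
    try:
        step()
    except OperationalError as exc:
        if "lock timeout" in str(exc):
            return "failed (retryable): lock timeout"
        raise
    return "ok"

def blocked_insert():
    # Postgres surfaces the SET LOCAL timeout with this message.
    raise OperationalError("canceling statement due to lock timeout")

print(run_step(blocked_insert))
```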