feat: Migrate the annotation pipeline onto the allele model + append-only event log#783
Draft
bencap wants to merge 23 commits into
Conversation
…-keyed refresh Replace MappedVariant-based gnomAD linkage with valid-time GnomadAlleleLink rows keyed on the deduplicated Allele. Linking covers every current allele of a score set (authoritative and RT-derived), so protein/coding score sets — whose genomic allele is RT-derived — are no longer dropped; per-variant annotation status flows through the _annotate_gnomad choke point against authoritative links only (interim bandaid, the AnnotationEvent migration seam). Refresh / idempotency model: - One live link per allele (unique index on allele_id); a gnomAD version bump supersedes rather than accumulating one live link per version. - Supersede only on change — an unchanged re-run leaves the live link untouched, so the valid-time history records no spurious boundary. - Version-keyed skip avoids re-fetching alleles already current at the version; a force param bypasses the skip (re-ingestion / heal) without churning unchanged links. - Per-variant status is a per-run audit event: created / preexisting / skipped. Normalize CAIDs across the Athena join: the gnomAD Hail dump drops leading zeros (CA025094 -> CA25094), so exact-string matching silently missed zero-padded CAIDs. Extract group_alleles_for_annotation as the shared allele-grouping primitive for allele-subject annotation jobs (adopted by gnomAD; VEP/ClinVar to follow). Refs #742. Partially addresses #722 (CAID-completeness tracking there stays open).
Migrate the VEP functional-consequence job (Step 2 of the annotation infrastructure migration) off MappedVariant onto the deduplicated allele model. - New ValidTime vep_allele_consequences table: a single allele-keyed row collapses record + link (the consequence is a scalar with no shared external entity, unlike gnomAD/ClinVar). One live consequence per allele via a partial unique index. - Job runs over the score set's full allele set (authoritative + RT-derived) via get_alleles_for_score_set + the shared grouping primitive; VAS still fans only to authoritative variants (the bandaid seam). - Version-key the refresh on the Ensembl release (/info/software): skip alleles already current, abort if the release can't be fetched. Supersede is value-keyed, not version-keyed — an unchanged consequence at a new release bumps source_version/access_date in place to avoid churning history. force bypasses the skip (e.g. after editing VEP_CONSEQUENCES). - A no-result is treated as a non-answer, never a negative: held consequences are not retired on an empty/failed VEP fetch. - Annotation links stay one-directional to Allele (no reverse back-ref). Adds lib + job tests covering linkage, RT-derived scope, version skip, in-place bump, supersede-on-change, no-result handling, and release-fetch failure.
Step 3 of the #742 external-annotation migration. ClinVar linkage moves off MappedVariant onto the deduplicated allele model. - Rename clinical_controls -> clinvar_controls (ORM model + table + unique constraint). Internal only: the ClinicalControl* view models, the /clinical-controls serving endpoints, and the frozen mapped_variants_clinical_controls association are unchanged, since the view-model name is both the OpenAPI schema name and the record_type discriminator consumed by the UI. - New clinvar_allele_links ValidTime table + ClinvarAlleleLink model. Multi-live: partial unique index (allele_id, clinvar_control_id) WHERE valid_to IS NULL, so an allele accumulates one live link per release. - Refactor refresh_clinvar_controls onto get_alleles_for_score_set + group_alleles_for_annotation (payload = CAID, full allele scope). Links are get-or-create; a same-version re-resolution to a different control supersedes newest-wins (gap-free retire+insert) rather than leaving two live links. VAS writes funnel through the _annotate_clinvar choke point, fanned only to authoritative_variant_ids and version-scoped. - Additively capture ClinVar's VariationID: nullable clinvar_variation_id column populated forward from the variant_summary TSV (parse degrades to None on archival schemas lacking the column). Unserved; the dedicated clinvar_variants remodel is deferred to the read-cutover. Tests rewritten for the allele model, covering the multi-live link writes, the version-scoped supersede guard, and the authoritative-only VAS fan-out.
Steps 4 & 5 of the #742 migration. Both jobs are redundant under the allele model: Allele.hgvs_g/c/p are populated by the mapping job, and the reverse-translation equivalence space (genomic/coding/protein alleles linked per MappingRecord) replaces the ClinGen PA<->CA translation table. - Remove populate_hgvs_for_score_set and populate_variant_translations_for_score_set from the pipeline DAG (both were leaf nodes — no dependency edges to repair), the worker registry (BACKGROUND_FUNCTIONS + STANDALONE_JOB_DEFINITIONS), and the external_services package exports. - Delete the two job modules and the now-orphaned ClinGen HGVS helpers (extract_hgvs_from_ca_allele_data / extract_hgvs_from_pa_allele_data), used only by the HGVS job. - Keep lib/variant_translations.py and the variant_translations table/model, marked FROZEN (serving-only) — they back old-model serving and are dropped at read-cutover. - Delete the obsolete job tests and their conftest fixtures.
get_allele_translations(db, allele_id, *, as_of=None) returns an allele's full cross-layer equivalence set (genomic/coding/protein) by traversing the MappingRecordAllele link graph: allele -> its live links -> mapping record(s) -> all co-linked alleles. The relation is co-membership in a MappingRecord's allele set, not a shared identifier — ClinGen's CAID spans only the nucleotide layers (the protein allele carries a distinct PA), so the link graph is the only thing tying all layers together. This replaces what the retired variant_translations PA<->CA table provided. Forward-compatible with temporal reads: the same half-open valid-time predicate is applied at both the anchor and fan-out hops, so passing as_of reconstructs the equivalence set as of any instant. Defaults to the currently-live set.
…bulary Introduces the AnnotationEvent log: a single append-only event table whose subjects are Variant and Allele, selected by annotation_type via a polymorphic CHECK. "Current" is derived (DISTINCT ON … id DESC), never stored. Adds the shared Disposition (present/absent/not_applicable/failed) and EventReason vocabulary, the v_current_annotation_events view, and the supporting migrations.
group_alleles_for_annotation now returns one job-specific payload per allele instead of an AlleleAnnotationGroup. Drops the authoritative_variant_ids bandaid seam (and is_authoritative from ScoreSetAlleleRow): annotation is an allele-level fact, and with the AnnotationEvent log the per-variant status fan-out goes away.
AnnotationStatusManager now writes AnnotationEvent rows via record_event and derives current status from the event log instead of maintaining per-variant status rows. Resolves subjects through VARIANT_SUBJECT_TYPES/ALLELE_SUBJECT_TYPES and the Disposition vocabulary.
link_gnomad_variants_to_alleles now returns a GnomadLinkVerdict per touched allele (CREATED/UNCHANGED) as the single source of truth for per-allele status; the job derives annotation events directly from the verdict map rather than re-querying link state. Alleles absent from the map are read as 'gnomAD had no record'.
VEP resolution now returns a VepResolution (consequences vs errored) so a genuine empty is distinguished from an unknown, and linking returns a VepLinkVerdict per allele (CREATED/UNCHANGED). The job derives annotation events from these outcomes instead of re-querying consequence state; errored HGVS are retried, not recorded as negatives.
The ClinGen allele-registration and LDH-submission jobs now record AnnotationEvents via record_event, using the shared EventReason vocabulary (created/preexisting/ reconfirmed/submitted, and the CAR-specific failure and reconfirmation-skipped reasons) in place of per-variant status updates.
The ClinVar job records AnnotationEvents via record_event, distinguishing superseded/no_record/no_caid/multi_variant_caid cases through the shared EventReason vocabulary. Clarifies the ClinicalControl comments to mark the row-level vs variation-level link identifiers.
The mapping job records AnnotationEvents via record_event, deriving disposition and reason from the typed MappingOutcome (mapped / intronic / no_protein_consequence / failed) rather than writing per-variant status rows.
The reverse-translation job records AnnotationEvents via record_event, using the RT-specific EventReason cases (translated / no_coding_transcript / no_assay_level_hgvs / translation_failed / translation_error / transcript_unresolved) to distinguish informative negatives from failures.
Adds the pipeline_tracking script for querying pipeline progress from the AnnotationEvent log, and updates the variant_annotations script to read/write through the event model.
An allele with an existing CAID always matched the preexisting branch first, so force_reregister never re-submitted it. Existing-CAID alleles now route to pending whenever force_reregister is set.
Split the resolution exception handling: an HTTPError (VEP rejecting a batch, e.g. a 400 for protein HGVS it cannot parse) now routes the batch to Variant Recoder, which may still resolve it via genomic recoding. Only transport/network errors (RequestException) are marked errored and held for retry — fixing the case where recoverable inputs were wrongly recorded as unknown.
normalize_and_identify/_rle_to_lse assume an inlined SequenceReference and integer coordinates, but the VRS models type location.sequenceReference and location.start/rle.length as unions (iriReference/Range/None). Add narrowing assertions documenting that contract — restoring a green `mypy src/` (a required CI job) and turning the unreachable bad-input case into a clear AssertionError instead of a deeper AttributeError. Adds guard tests.
The newest-wins ClinvarAlleleLink supersede stamped valid_to/valid_from with a tz-naive datetime.now() into timestamptz columns, skewing the valid-time axis against every other job's func.now()-stamped rows. Use supersede_with so retire and insert share one DB timestamp. Test now asserts the boundary is timezone-aware.
Every other FK on annotation_event was indexed except score_set_id. Add it (model __table_args__ + migration) to back the ON DELETE SET NULL cascade on score-set deletion and score-set-scoped audit queries.
Revision a1b2c3d4e5f6 creates alleles/mapping_records/mapping_record_alleles — there is no allele-closure table anywhere (equivalence is a runtime query). Rename the file so it stops misleading readers; the revision id is unchanged.
Re-add cases dropped in the manager rewrite: None/empty reads, no-op flush, auto-flush-before-read for both query methods, a FAILED-disposition round-trip through the manager, and per-subject current-status isolation.
…ents split_cis_phased_hgvs raised ValueError on bracketed input lacking an accession; it now degrades to a single-element list. join_cis_phased_hgvs emits components in coordinate order so CSV exports are deterministic regardless of VRS member ordering (the block digest is order-independent).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Builds on the deduplicated allele / valid-time mapping model from #741 (and #739). This PR moves the external-annotation pipeline onto that model and replaces per-variant annotation status with an append-only event log.
Two themes:
MappedVariant. gnomAD, VEP, and ClinVar now resolve and link against the deduplicated allele space (authoritative and RT-derived alleles), with version-keyed refresh and idempotent re-runs. CAR/LDH already ran on alleles.AnnotationEventlog. A single log across the pipeline records disposition + why + when per subject; "current" is derived (v_current_annotation_events). This can distinguish confirmed-absence from never-checked — the gapVariantAnnotationStatuscouldn't express.Still parallel-tables: new link tables and the event log are written for new runs; legacy
MappedVariantcolumns and association tables stay frozen-serving. Backfill/read-cutover/drops are deferred to #775/#747.Annotation jobs onto the allele model
gnomad_allele_links(one live link per allele; supersede-on-change; version-keyed skip with aforcebypass).vep_allele_consequences(collapsed record+link; keyed on the Ensembl release; supersede only on a changed consequence so categorical values don't churn each release). HTTP rejections are routed to Variant Recoder rather than failing the batch.clinical_controlsrenamed toclinvar_controls(ORM/table/constraint only; theClinicalControl*serving contract is intentionally frozen) + multi-liveclinvar_allele_links(one live link per archival release; get-or-create). Adds forward-captured, unservedclinvar_variation_id.group_alleles_for_annotationprimitive (one work-unit per allele);force_reregisterre-registers existing CAIDs.lib/alleles.py::get_allele_translations— cross-layer equivalence via mapping-record co-membership, withas_oftemporal support.Append-only annotation event log
annotation_eventtable +v_current_annotation_eventsview (latest-event-per-subject).AnnotationStatusManagerrewritten to emit events; mapping, reverse-translation, gnomAD, VEP, ClinVar, and CAR/LDH all emit events from their resolution outcomes.disposition(present/absent/not_applicable/failed;pending= no event) + a granular per-domainreason. Events share the job's DB session (flushed-not-committed), so they're transactionally coupled to the annotation data they describe.Retired
populate_hgvs_for_score_setandpopulate_variant_translations_for_score_setjobs (hgvs.py,variant_translation.py) — both DAG leaf nodes, no edge repair needed.Allele.hgvs_*is mapping-populated; the RT equivalence space replaces variant translations.lib/variant_translations.py+ thevariant_translationstable stay frozen (serving-only).Migrations
8 new (the three annotation link/consequence tables, the
clinvar_controlsrename,clinvar_variation_id, the event table + view, and an index onannotation_event.score_set_id) plus a rename of the existing alleles/mapping-records migration to match its contents. Single alembic head; all have working downgrades.Testing
mypy src/andruffgreen. Updated/added coverage acrosstests/{lib,models,worker,routers}for each migrated job, the event model + view, and the rewritten status manager. (API/worker tests per project convention; UI out of scope.)Deferred follow-ups
clinvar_controls→clinvar_variantsserving remodel (expose VariationID) — deferred to read-cutover.v_current_annotation_events+ add a retention/compaction sweep if/when cross-subject filtering or log growth becomes hot (fine at current scale; read path is internal-only).