feat: Extend annotation pipeline to cover Allele entities #742

Description

@bencap

Context

Depends on: #741

Allele rows at each level are real variants that can be independently annotated. Because Allele rows are shared across score sets by VRS digest, annotation results are stored once per allele and shared across all score sets that reference it. Annotation jobs must check for an existing current annotation before re-annotating.

Annotation architecture

AnnotationStatus is QC/audit only — it tracks whether the pipeline ran and whether it succeeded. The actual annotation data lives in dedicated per-type tables (VEPAnnotation, GnomADVariant, ClinicalControl), each with superseded_at for temporal support.

When a new annotation is produced for an allele that already has a current annotation of that type, the existing row's superseded_at is set to NOW() in the same transaction that inserts the new row. Because the old row closes at the same instant the new row opens, the temporal query WHERE created_at <= T AND (superseded_at IS NULL OR superseded_at > T) returns at most one row per allele and annotation type for any timestamp T, and exactly one for any T at or after the allele's first annotation.
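The supersede-and-insert step and the temporal query can be sketched as follows. This is a minimal illustration using Python's standard sqlite3 module, not the project's actual models: the table name vep_annotation, the payload column, and the integer timestamps standing in for NOW() are all assumptions made for the example.

```python
import sqlite3

def supersede_and_insert(conn, allele_id, payload, now):
    """Close the current row (if any) and insert its replacement atomically."""
    with conn:  # single transaction: commit on success, rollback on error
        conn.execute(
            "UPDATE vep_annotation SET superseded_at = ? "
            "WHERE allele_id = ? AND superseded_at IS NULL",
            (now, allele_id),
        )
        conn.execute(
            "INSERT INTO vep_annotation (allele_id, payload, created_at) "
            "VALUES (?, ?, ?)",
            (allele_id, payload, now),
        )

def annotation_as_of(conn, allele_id, t):
    """Return the annotation rows that were current at timestamp t."""
    return conn.execute(
        "SELECT payload FROM vep_annotation "
        "WHERE allele_id = ? AND created_at <= ? "
        "AND (superseded_at IS NULL OR superseded_at > ?)",
        (allele_id, t, t),
    ).fetchall()

# Hypothetical schema; integer timestamps keep the example deterministic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vep_annotation ("
    "id INTEGER PRIMARY KEY, allele_id INTEGER NOT NULL, "
    "payload TEXT NOT NULL, created_at INTEGER NOT NULL, "
    "superseded_at INTEGER)"
)
supersede_and_insert(conn, 1, "v1", now=10)
supersede_and_insert(conn, 1, "v2", now=20)
```

Using the same timestamp for the old row's superseded_at and the new row's created_at is what keeps the intervals contiguous and non-overlapping: at T = 20 the strict superseded_at > T comparison excludes the old row while created_at <= T admits the new one.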

Level-to-annotation routing

All routing is driven by the level column on the flat alleles table:

  • gnomAD allele frequency: level = 'genomic' → stored in GnomADVariant
  • VEP functional consequence: level = 'genomic' or level = 'coding' → stored in VEPAnnotation
  • ClinVar / clinical controls: level = 'coding' or level = 'protein' → stored in ClinicalControl
  • ClinGen allele ID → stored as clingen_allele_id column on the alleles table (stable, no temporal table needed)
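One way to express the level-based routing above is a lookup table keyed by level. The sketch below is illustrative only: the ROUTING mapping and the annotation_targets helper are hypothetical names, and the ClinGen allele ID is left out because it is stored on the alleles table rather than routed by level.

```python
# Hypothetical mapping from allele level to the per-type annotation tables
# listed in the routing rules above.
ROUTING = {
    "genomic": ("GnomADVariant", "VEPAnnotation"),
    "coding": ("VEPAnnotation", "ClinicalControl"),
    "protein": ("ClinicalControl",),
}

def annotation_targets(level):
    """Return the annotation tables that apply to an allele at this level."""
    try:
        return ROUTING[level]
    except KeyError:
        # Fail loudly rather than silently skipping an unexpected level.
        raise ValueError(f"unknown allele level: {level!r}") from None
```

Keeping the routing in one table means a job can decide which annotators to run from the level column alone, without branching on allele type elsewhere.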

Acceptance Criteria

  • VEP annotation job creates VEPAnnotation rows linked by allele_id; sets superseded_at on the previous current row in the same transaction
  • gnomAD annotation job creates GnomADVariant rows linked by allele_id; sets superseded_at on the previous current row in the same transaction
  • ClinVar annotation job creates ClinicalControl rows linked by allele_id; sets superseded_at on the previous current row
  • AnnotationStatus rows are created per job run per allele_id for QC tracking only
  • Annotation jobs skip alleles that already have a current annotation of the same type and source version (cross-score-set deduplication)
  • Temporal query pattern is verified: querying annotation state at a past timestamp returns the correct historical annotation
  • Existing AssayedVariant-level annotation behavior is unchanged
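The deduplication criterion (skip alleles that already have a current annotation of the same type and source version) could be checked along these lines. This is a sketch under assumptions: the gnomad_variant table name, the source_version column, and the alleles_needing_annotation helper are hypothetical, and sqlite3 stands in for the real database.

```python
import sqlite3

def alleles_needing_annotation(conn, allele_ids, source_version):
    """Filter out alleles that already have a current row for this version."""
    allele_ids = list(allele_ids)
    if not allele_ids:
        return []
    placeholders = ",".join("?" for _ in allele_ids)
    rows = conn.execute(
        "SELECT allele_id FROM gnomad_variant "
        "WHERE superseded_at IS NULL AND source_version = ? "
        f"AND allele_id IN ({placeholders})",
        (source_version, *allele_ids),
    ).fetchall()
    already_current = {row[0] for row in rows}
    # Preserve the caller's ordering for the remaining alleles.
    return [a for a in allele_ids if a not in already_current]

# Hypothetical schema and data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gnomad_variant ("
    "id INTEGER PRIMARY KEY, allele_id INTEGER NOT NULL, "
    "source_version TEXT NOT NULL, created_at INTEGER NOT NULL, "
    "superseded_at INTEGER)"
)
conn.executemany(
    "INSERT INTO gnomad_variant "
    "(allele_id, source_version, created_at, superseded_at) VALUES (?, ?, ?, ?)",
    [
        (1, "v2", 10, None),  # current row, same version: skip
        (2, "v1", 10, None),  # current row, older version: re-annotate
        (3, "v2", 10, 20),    # superseded row only: re-annotate
    ],
)
conn.commit()
```

Because the check keys on allele_id rather than on a score set, an allele annotated while processing one score set is skipped when another score set that references it is processed later.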

Metadata

Labels

  • app: backend (Task implementation touches the backend)
  • app: worker (Task implementation touches the worker)
  • type: feature (New feature)
  • workstream: clinical (Task relates to clinical features)
