Support ReferenceLengthExpression state in annotated-variants study-result pipeline #736

@bencap

Summary

The annotated-variants study-result endpoint crashes with a Pydantic ValidationError when a score set contains reference-identical variants whose VRS Allele state is a ReferenceLengthExpression (RLE). The function responsible for deserializing stored mapping results unconditionally constructs a LiteralSequenceExpression (LSE), which fails when the stored state dictionary describes an RLE.

Problem

allele_from_mapped_variant_dictionary_result in src/mavedb/lib/annotation/util.py hard-codes:

state=LiteralSequenceExpression(**variation["state"]),

Reference-identical variants are intentionally mapped to VRS Allele objects whose state is a ReferenceLengthExpression (fields: type, length, repeatSubunitLength). When such a variant passes through the annotation pipeline, Pydantic raises four simultaneous validation errors:

  • type (literal_error): value is 'ReferenceLengthExpression', not 'LiteralSequenceExpression'
  • sequence (missing): LSE requires this field; RLE doesn't have it
  • length (extra_forbidden): not a valid LSE field
  • repeatSubunitLength (extra_forbidden): not a valid LSE field
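
The four errors follow from comparing the stored dictionary's keys against the fields an LSE accepts. The sketch below approximates that comparison in plain Python; it does not run Pydantic itself. The field set is simplified to the fields named in this issue (the real model has more), and the stored values are made up for illustration.

```python
# Assumption: simplified to the LSE fields named in this issue.
LSE_FIELDS = {"type", "sequence"}
LSE_TYPE = "LiteralSequenceExpression"

# Hypothetical stored state for a reference-identical variant (values are made up).
stored_state = {
    "type": "ReferenceLengthExpression",
    "length": 1,
    "repeatSubunitLength": 1,
}

problems = []
# The type discriminator does not match the LSE literal.
if stored_state.get("type") != LSE_TYPE:
    problems.append(("type", "literal_error"))
# Fields the LSE needs but the stored dictionary lacks.
for name in sorted(LSE_FIELDS - set(stored_state)):
    problems.append((name, "missing"))
# Fields the stored dictionary has but the LSE does not accept.
for name in sorted(set(stored_state) - LSE_FIELDS):
    problems.append((name, "extra_forbidden"))

for field, kind in problems:
    print(f"{field}: {kind}")
```

Run as written, this prints the same four field and error-type pairs as the list above.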

Error observed

ValidationError: 4 validation errors for LiteralSequenceExpression
type
  Input should be 'LiteralSequenceExpression' [type=literal_error, input_value='ReferenceLengthExpression', ...]
sequence
  Field required [type=missing, ...]
length
  Extra inputs are not permitted [type=extra_forbidden, ...]
repeatSubunitLength
  Extra inputs are not permitted [type=extra_forbidden, ...]

Stack trace (production)

GET https://api.mavedb.org/api/v1/score-sets/urn:mavedb:00000657-a-1/annotated-variants/study-result

score_sets.py:1163  _stream_generated_annotations
annotate.py:38      variant_study_result
study_result.py:23  mapped_variant_to_experimental_variant_impact_study_result
util.py:181         variation_from_mapped_variant
util.py:151         vrs_object_from_mapped_variant
util.py:86          allele_from_mapped_variant_dictionary_result   <- crash

Steps to reproduce

  1. Find or create a score set that contains reference-identical variants (HGVS c.= / p.= / g.=), such as urn:mavedb:00000657-a-1.
  2. Request GET /api/v1/score-sets/{urn}/annotated-variants/study-result.
  3. Observe a 500 response containing the Pydantic ValidationError above.

Expected behavior

The endpoint returns a valid study-result payload. Reference-identical variants are deserialized into Allele objects with a ReferenceLengthExpression state, consistent with how they are stored and how dcd_mapping and the worker pipeline already handle them.

Proposed behavior

allele_from_mapped_variant_dictionary_result should inspect variation["state"]["type"] and construct either a LiteralSequenceExpression or a ReferenceLengthExpression accordingly, instead of unconditionally constructing an LSE. Downstream code that consumes Allele.state should already tolerate RLE states, mirroring the isinstance(allele.state, ReferenceLengthExpression) guards used throughout dcd_mapping.
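
A minimal sketch of that dispatch follows. It is not the project's code. To stay runnable without ga4gh.vrs installed, it uses plain dataclasses as stand-ins for the real Pydantic models, reduced to the fields named in this issue. The helper name state_from_dict is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Union


# Stand-ins for the ga4gh.vrs.models classes (assumption: simplified to the
# fields named in this issue; the real models are Pydantic classes).
@dataclass
class LiteralSequenceExpression:
    sequence: str
    type: str = "LiteralSequenceExpression"


@dataclass
class ReferenceLengthExpression:
    length: int
    repeatSubunitLength: int
    type: str = "ReferenceLengthExpression"


SequenceExpression = Union[LiteralSequenceExpression, ReferenceLengthExpression]


def state_from_dict(state: dict[str, Any]) -> SequenceExpression:
    """Hypothetical helper: build the state object that matches state["type"]."""
    if state.get("type") == "ReferenceLengthExpression":
        return ReferenceLengthExpression(**state)
    # Every other type falls back to LSE, as the acceptance criteria describe.
    return LiteralSequenceExpression(**state)


rle_state = state_from_dict(
    {"type": "ReferenceLengthExpression", "length": 3, "repeatSubunitLength": 3}
)
lse_state = state_from_dict({"type": "LiteralSequenceExpression", "sequence": "ACG"})
```

In the real function, the result of such a helper would replace the hard-coded LiteralSequenceExpression(**variation["state"]) argument when the Allele is constructed.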

Acceptance criteria

  • GET /api/v1/score-sets/{urn}/annotated-variants/study-result succeeds for score sets whose variants include reference-identical variants stored with an RLE state.
  • allele_from_mapped_variant_dictionary_result returns an Allele with a ReferenceLengthExpression state when variation["state"]["type"] == "ReferenceLengthExpression", and a LiteralSequenceExpression state otherwise.
  • A unit test covers allele_from_mapped_variant_dictionary_result with an RLE state input dict (fields: type, length, repeatSubunitLength).
  • A unit test covers vrs_object_from_mapped_variant with a top-level Allele whose state is an RLE.
  • Existing tests for LSE-state variants continue to pass.

Implementation notes

  • The fix is localized to allele_from_mapped_variant_dictionary_result in src/mavedb/lib/annotation/util.py. ReferenceLengthExpression is available from ga4gh.vrs.models and is already imported elsewhere in the annotation package.
  • Dispatch can be done with a simple if/else on variation["state"].get("type").
  • No schema migration is needed; the stored post_mapped JSON already contains valid RLE dictionaries. The bug is a deserialization oversight only.
  • CisPhasedBlock members are also deserialized via the same function (vrs_object_from_mapped_variant delegates to allele_from_mapped_variant_dictionary_result for each member), so the fix also covers haplotype-mapped variants with RLE member states.
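
The last note can be illustrated with a small sketch of the delegation. Both function names are hypothetical, and the per-allele step is reduced to reporting which state class would be built. It assumes that a CisPhasedBlock dictionary lists its alleles under "members" and that the sample dictionaries resemble stored post_mapped data.

```python
from typing import Any


def state_kind(allele: dict[str, Any]) -> str:
    """Hypothetical per-allele step: report which state class would be built."""
    if allele["state"].get("type") == "ReferenceLengthExpression":
        return "ReferenceLengthExpression"
    return "LiteralSequenceExpression"


def member_state_kinds(mapped: dict[str, Any]) -> list[str]:
    """Apply the per-allele step to a top-level Allele or to each block member."""
    if mapped.get("type") == "CisPhasedBlock":
        return [state_kind(member) for member in mapped["members"]]
    return [state_kind(mapped)]


# Made-up example: a block with one RLE member and one LSE member.
block = {
    "type": "CisPhasedBlock",
    "members": [
        {
            "type": "Allele",
            "state": {
                "type": "ReferenceLengthExpression",
                "length": 1,
                "repeatSubunitLength": 1,
            },
        },
        {
            "type": "Allele",
            "state": {"type": "LiteralSequenceExpression", "sequence": "T"},
        },
    ],
}

kinds = member_state_kinds(block)
```

Because every member passes through the same per-allele step, fixing that one step covers both top-level alleles and block members.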

Labels

app: backend (Task implementation touches the backend)
type: bug (Something isn't working)
