Summary
The annotated-variants study-result endpoint crashes with a Pydantic ValidationError when a score set contains reference-identical variants whose VRS Allele state is a ReferenceLengthExpression (RLE). The function responsible for deserializing stored mapping results unconditionally constructs a LiteralSequenceExpression (LSE), which fails when the stored state dictionary describes an RLE.
Problem
allele_from_mapped_variant_dictionary_result in src/mavedb/lib/annotation/util.py hard-codes:
state=LiteralSequenceExpression(**variation["state"]),
Reference-identical variants are intentionally mapped to VRS Allele objects whose state is a ReferenceLengthExpression (fields: type, length, repeatSubunitLength). When such a variant passes through the annotation pipeline, Pydantic raises four simultaneous validation errors:
- type: literal_error (value is 'ReferenceLengthExpression', not 'LiteralSequenceExpression')
- sequence: missing (LSE requires this field; the stored RLE dictionary does not have it)
- length: extra_forbidden (not a valid LSE field)
- repeatSubunitLength: extra_forbidden (not a valid LSE field)
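To make the four errors concrete, the sketch below mimics the three kinds of check involved (literal match on type, required fields, forbidden extras) in plain Python. It is not Pydantic's implementation, and the helper name lse_validation_problems, the LSE_REQUIRED tuple, and the numeric values in the sample dictionary are assumptions made for illustration; the real LSE model may accept additional optional fields that are not modeled here.

```python
# Simplified stand-in for strict model validation; NOT Pydantic's implementation.
LSE_TYPE = "LiteralSequenceExpression"
# Assumption: only the fields named in the error report are modeled.
LSE_REQUIRED = ("type", "sequence")


def lse_validation_problems(state: dict) -> list[tuple[str, str]]:
    """Return (field, error_kind) pairs an extra-forbidding LSE model would report."""
    problems = []
    # The type field must equal the LSE literal.
    if "type" in state and state["type"] != LSE_TYPE:
        problems.append(("type", "literal_error"))
    # Every required field must be present.
    for field in LSE_REQUIRED:
        if field not in state:
            problems.append((field, "missing"))
    # Any field the model does not declare is rejected.
    for field in state:
        if field not in LSE_REQUIRED:
            problems.append((field, "extra_forbidden"))
    return problems


# Illustrative RLE state; the numeric values are placeholders, not real data.
rle_state = {
    "type": "ReferenceLengthExpression",
    "length": 1,
    "repeatSubunitLength": 1,
}
problems = lse_validation_problems(rle_state)
```

Under these assumptions, problems holds one entry per field listed above, in the same order as the report.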
Error observed
ValidationError: 4 validation errors for LiteralSequenceExpression
type
Input should be 'LiteralSequenceExpression' [type=literal_error, input_value='ReferenceLengthExpression', ...]
sequence
Field required [type=missing, ...]
length
Extra inputs are not permitted [type=extra_forbidden, ...]
repeatSubunitLength
Extra inputs are not permitted [type=extra_forbidden, ...]
Stack trace (production)
GET https://api.mavedb.org/api/v1/score-sets/urn:mavedb:00000657-a-1/annotated-variants/study-result
score_sets.py:1163 _stream_generated_annotations
annotate.py:38 variant_study_result
study_result.py:23 mapped_variant_to_experimental_variant_impact_study_result
util.py:181 variation_from_mapped_variant
util.py:151 vrs_object_from_mapped_variant
util.py:86 allele_from_mapped_variant_dictionary_result <- crash
Steps to reproduce
- Find or create a score set that contains reference-identical variants (HGVS c.= / p.= / g.=), such as urn:mavedb:00000657-a-1.
- Request GET /api/v1/score-sets/{urn}/annotated-variants/study-result.
- Observe a 500 response containing the Pydantic ValidationError above.
Expected behavior
The endpoint returns a valid study-result payload. Reference-identical variants are deserialized into Allele objects with a ReferenceLengthExpression state, consistent with how they are stored and how dcd_mapping and the worker pipeline already handle them.
Proposed behavior
allele_from_mapped_variant_dictionary_result should inspect variation["state"]["type"] and construct either a LiteralSequenceExpression or a ReferenceLengthExpression accordingly, instead of unconditionally constructing an LSE. Downstream code that consumes Allele.state should already tolerate RLE states, mirroring the isinstance(allele.state, ReferenceLengthExpression) guards used throughout dcd_mapping.
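The dispatch described above could look roughly like the following. This is a minimal sketch, not the project's code: the two dataclasses are stand-ins for the ga4gh.vrs.models classes so the example runs without dependencies, state_from_dict is a hypothetical helper name, and the sample values are placeholders. A missing or unrecognized type falls back to LSE, matching the "LSE otherwise" wording in the acceptance criteria.

```python
from dataclasses import dataclass
from typing import Union


# Stand-ins for the ga4gh.vrs.models classes (assumption: only the fields
# named in this issue are modeled).
@dataclass
class LiteralSequenceExpression:
    sequence: str
    type: str = "LiteralSequenceExpression"


@dataclass
class ReferenceLengthExpression:
    length: int
    repeatSubunitLength: int
    type: str = "ReferenceLengthExpression"


def state_from_dict(
    state: dict,
) -> Union[LiteralSequenceExpression, ReferenceLengthExpression]:
    """Hypothetical helper: pick the state class from the stored 'type' field."""
    if state.get("type") == "ReferenceLengthExpression":
        return ReferenceLengthExpression(**state)
    # Any other (or missing) type keeps the existing LSE behavior.
    return LiteralSequenceExpression(**state)


# Placeholder inputs, not real mapping results.
rle = state_from_dict(
    {"type": "ReferenceLengthExpression", "length": 1, "repeatSubunitLength": 1}
)
lse = state_from_dict({"type": "LiteralSequenceExpression", "sequence": "A"})
```

In the real function the selected object would be passed as the state argument when constructing the Allele, in place of the hard-coded LiteralSequenceExpression call.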
Acceptance criteria
- GET /api/v1/score-sets/{urn}/annotated-variants/study-result succeeds for score sets whose variants include reference-identical variants stored with an RLE state.
- allele_from_mapped_variant_dictionary_result returns an Allele with a ReferenceLengthExpression state when variation["state"]["type"] == "ReferenceLengthExpression", and a LiteralSequenceExpression state otherwise.
- A unit test covers allele_from_mapped_variant_dictionary_result with an RLE state input dict (fields: type, length, repeatSubunitLength).
- A unit test covers vrs_object_from_mapped_variant with a top-level Allele whose state is an RLE.
- Existing tests for LSE-state variants continue to pass.
Implementation notes
- The fix is localized to allele_from_mapped_variant_dictionary_result in src/mavedb/lib/annotation/util.py. ReferenceLengthExpression is available from ga4gh.vrs.models and is already imported elsewhere in the annotation package.
- Dispatch can be done with a simple if/else on variation["state"].get("type").
- No schema migration is needed; the stored post_mapped JSON already contains valid RLE dictionaries. The bug is a deserialization oversight only.
- CisPhasedBlock members are also deserialized via the same function (vrs_object_from_mapped_variant delegates to allele_from_mapped_variant_dictionary_result for each member), so the fix also covers haplotype-mapped variants with RLE member states.
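The sketch below illustrates why a per-state dispatch also covers CisPhasedBlock members: the same decision is applied once for a top-level Allele and once per member of a block. The helper names state_kind and allele_state_kinds, the assumed dictionary shape (a members list of Allele dictionaries), and the sample values are all hypothetical; the actual delegation lives in vrs_object_from_mapped_variant.

```python
def state_kind(state: dict) -> str:
    """Hypothetical helper: name of the state class the dispatch would select."""
    if state.get("type") == "ReferenceLengthExpression":
        return "ReferenceLengthExpression"
    return "LiteralSequenceExpression"


def allele_state_kinds(variation: dict) -> list[str]:
    """State kind for a top-level Allele, or for each member of a CisPhasedBlock."""
    if variation.get("type") == "CisPhasedBlock":
        # Assumed shape: each member is an Allele dictionary with its own state.
        return [state_kind(member["state"]) for member in variation["members"]]
    return [state_kind(variation["state"])]


# Placeholder block with one RLE member and one LSE member (not real data).
block = {
    "type": "CisPhasedBlock",
    "members": [
        {
            "type": "Allele",
            "state": {
                "type": "ReferenceLengthExpression",
                "length": 1,
                "repeatSubunitLength": 1,
            },
        },
        {
            "type": "Allele",
            "state": {"type": "LiteralSequenceExpression", "sequence": "A"},
        },
    ],
}
kinds = allele_state_kinds(block)
```

Because each member is routed through the same decision, a block that mixes RLE and LSE members needs no special handling beyond the single if/else.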