fix(diann2msstats): keep raw sequence when pyopenms can't resolve a mod#82
fix(diann2msstats): keep raw sequence when pyopenms can't resolve a mod#82ypriverol wants to merge 1 commit into
Conversation
`diann2msstats` crashes the conversion when DIA-NN emits a peptide with a modification that pyopenms cannot resolve, e.g.: RuntimeError: the value 'Met-loss' was used but is not valid; Cannot convert string to peptide modification. No modification matches in our database. This fires in production whenever the runtime container ships pyopenms without the OpenMS share directory (UniMod XML) — visible via the warning that always precedes the traceback: UserWarning: OPENMS_DATA_PATH environment variable not found and no share directory was installed. In that state only a small set of common mods (Carbamidomethyl, Oxidation, Phospho, …) resolves from the compiled-in fallback, so Acetyl, Met-loss, and other valid UniMod entries crash the whole job. Any reanalysis whose SDRF declares such a mod is bricked. Fix the source side: wrap the `AASequence.fromString` round-trip in `_to_openms_sequence`, catch `RuntimeError`, log a warning, and pass the raw DIA-NN string through unchanged. The raw string is still a stable peptide identifier for MSstats (different mod forms => distinct IDs), so downstream consumers are unaffected. Also dedupe the conversion across rows: build a map over the unique PeptideSequence values and re-map back. The old `.apply(... axis=1)` re-parsed identical strings millions of times on real datasets. Adds two unit tests against `_to_openms_sequence`: * unknown mod falls back to the raw input * `^` N-term anchor is preserved on both code paths Note: the deeper fix is the container/recipe packaging — the bioconda recipe for quantms-utils should ship pyopenms with its OpenMS share directory so the parser has the full UniMod database. This source patch is the resilient companion: even after the recipe is fixed, an unexpected mod string should not crash the converter.
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| Documentation | 1 minor |
🟢 Metrics 4 complexity
Metric Results Complexity 4
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Summary
diann2msstatscrashes the whole conversion when DIA-NN emits a peptide with a modification that pyopenms cannot resolve. This is the bug that surfaces as:…always preceded by:
The runtime container ships pyopenms without the OpenMS share directory (UniMod XML), so only a small set of compiled-in common mods (Carbamidomethyl, Oxidation, Phospho, …) resolves. Any DIA-NN run whose SDRF declares Acetyl, Met-loss, Met-loss+Acetyl, or any less-common UniMod entry crashes outright.
This PR makes the converter resilient to that state.
Changes
_to_openms_sequence(seq)that wrapsAASequence.fromStringin atry/except RuntimeError, logs a warning, and returns the input unchanged on failure. The leading^anchor used by DIA-NN for N-terminal-cleavage peptides is preserved on both the canonical and fallback paths.PeptideSequencevalues and re-map back. The old.apply(..., axis=1)re-parsed identical strings on every row — wasteful on real (multi-million-row) DIA reports.TestDiannCommands:^anchor preserved on both pathsWhy the raw string is safe to pass through
MSstats only needs
PeptideSequenceto be a stable identifier per peptidoform. The DIA-NN string is already that — different mod forms produce different strings — so downstream conversion is unaffected. Only the cosmetic "canonical form" is lost for the unresolvable rows.Test plan
All three relevant tests pass locally:
Related upstream fix needed
The deeper fix is the bioconda recipe:
quantms-utilsshould ship pyopenms with the OpenMS share directory (or setOPENMS_DATA_PATH) so the parser has the full UniMod database. Filing a separate issue against this repo to track the container packaging side; this source PR is the resilient companion that prevents future surprise mods from crashing the converter even after the recipe is fixed.Reproduction context
Hit in the wild during a reanalysis of PXD004701 (Sun et al. 2023, MCP) whose SDRF correctly declared
N-term Met-loss(UniMod:765). Same failure mode previously surfaced with Acetyl (UniMod:1) on other datasets. Repro details in the companion issue.