Skip to content

refactor: uniformize somatic-variant definition across bin scripts#469

Open
m-huertasp wants to merge 9 commits into
devfrom
tests/418-uniformize-somatic-variants
Open

refactor: uniformize somatic-variant definition across bin scripts#469
m-huertasp wants to merge 9 commits into
devfrom
tests/418-uniformize-somatic-variants

Conversation

@m-huertasp

Copy link
Copy Markdown
Collaborator

Small PR.

What this does

Determines single definition of a somatic variant and applies it
consistently across the bin/ scripts. Previously the somatic/germline split was
re-implemented inline in several places with divergent forms (single-column VAF
in filter_cohort.py, three-column elsewhere, plus hardcoded thresholds in
check_contamination.py), which made the somatic and germline sets non-symmetric.
The definition now lives in two shared helpers and every call site uses them. Closes #418.

Changes

  • Add somatic_mask / germline_mask to bin/utils_filter.py — the canonical all-3-column
    rule (VAF, vd_VAF, VAF_AM vs a caller-supplied threshold), with NumPy docstrings.
  • Refactor bin/filter_cohort.py to use the helpers (somatic flagging goes single-column → all-3-column).
  • Refactor bin/check_contamination.py: germline/SNP predicates use the helpers; the hardcoded
    0.2 threshold is replaced by a new --somatic-vaf-boundary CLI option fed from
    params.germline_threshold (wired via modules/local/contamination/main.nf + conf/modules.config);
    NumPy docstrings added to all public functions.
  • Add bin/test/test_utils_filter.py — 46 unit tests covering every public function in utils_filter.py.

What to review

  • Behaviour changes: filter_cohort somatic flagging is now all-3-column; the
    between-samples contamination threshold moves 0.2 → 0.3 (default germline_threshold).

Testing

pytest bin/test/test_utils_filter.py → 46 passed + ruff clean on all touched files.
The full bin/test/ suite has 3 pre-existing failures + 1 collection error in unrelated files
(test_plot_selectionsideplots.py, test_check_samplesheet.py) that are untouched by this PR.

@m-huertasp m-huertasp changed the base branch from main to dev June 2, 2026 13:25
@m-huertasp m-huertasp requested a review from FerriolCalvet June 8, 2026 15:20
@m-huertasp m-huertasp self-assigned this Jun 8, 2026

@FerriolCalvet FerriolCalvet left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, there is only this comment on the check_contamination script that if you can address it to increase the clarity that would be great, if you think that it is clear enough I am happy to accept that as well

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that there is something that pisses me off a bit from this script and it is that the sequence of plots in the code is a bit confusing.

Could you give it a try and split the main function into different ones? I can give it a try as well otherwise.

Comment on lines +453 to +459
contamination_vaf_threshold = 0.05
somatic_snp_positions_maf = snp_positions_maf.loc[
somatic_mask(snp_positions_maf, contamination_vaf_threshold)
].reset_index(drop=True)
germline_snp_positions_maf = snp_positions_maf.loc[
germline_mask(snp_positions_maf, contamination_vaf_threshold)
].reset_index(drop=True)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that this has the same behaviour than before since there might be some mutations that stay in between the two

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe inverting the somatic_mask would be the best way to go?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does not have the same behaviour than before because we are using the three VAFs and we can loose mutations that are VAF > 0.05 but vd_VAF <0.05. Would it be best to use the previous strategy in this case?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that it makes a big difference, but probably yes, it is better to follow the same strategy as before.
I think that my suggestion of using the somatic_mask and then "inverting" the selection should work well and would not be too difficult to apply.

Comment thread bin/filter_cohort.py
@m-huertasp

Copy link
Copy Markdown
Collaborator Author

Still to do

For when I come back, or if someone does this.

1. Address Ferriol's comment

Ferriol flagged that the new somatic_mask / germline_mask usage in
contamination_detection_in_snps does not replicate the old behaviour at the SNP-contamination check:

# Current code (~line 459):
somatic_snp_positions_maf = snp_positions_maf.loc[somatic_mask(snp_positions_maf, contamination_vaf_threshold)]
germline_snp_positions_maf = snp_positions_maf.loc[germline_mask(snp_positions_maf, contamination_vaf_threshold)]

Either go back to previous strategy or invert somatic_mask for the germline side, something like:

somatic_snp_mask = somatic_mask(snp_positions_maf, contamination_vaf_threshold)
somatic_snp_positions_maf = snp_positions_maf.loc[somatic_snp_mask].reset_index(drop=True)
germline_snp_positions_maf = snp_positions_maf.loc[~somatic_snp_mask].reset_index(drop=True)

2. Refactor plot structure

Plan for doing this:

  1. Remove printing, add logging, remove dead/commented code
  2. Extract function to generate heatmap (with parameters to adapt to each case)
  3. Replace the heatmaps with the above-mentioned function
  4. Extract compute_somatic_vs_remaining_germline() as function on its own
  5. Extract detect_contaminated_pairs() as function on its own (rename subseeeet)
  6. Extract compute_snp_somatic_frequency() as function on its own
  7. Write some tests for the compute functions
  8. Compare results with previous'
    *Check plan with specific steps

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

uniformize the definition of somatic variants

2 participants