Fix: keep sarek status column when all samples are normal (status=0)#323
Open
Osamaali313 wants to merge 1 commit into
Open
Fix: keep sarek status column when all samples are normal (status=0)#323Osamaali313 wants to merge 1 commit into
status column when all samples are normal (status=0)#323Osamaali313 wants to merge 1 commit into
Conversation
_write_samplesheet selected output columns by truthiness (`any(c in row and row[c] ...)`), which treats the valid value 0 as empty. For nf-core/sarek the `status` column is 0=normal / 1=tumor, so an all-normal cohort - common for germline runs, and the guaranteed result of `--no-interactive` when sample names lack tumor/normal keywords (every sample then defaults to status=0) - wrote a samplesheet with the required `status` column silently dropped. The in-memory validation passes because it runs on the row dicts (which contain status), not on the written CSV, so the problem was masked. Use an explicit presence check (value is not None and not "") so valid falsy values are preserved while genuinely-empty columns are still dropped. Verified: all-normal and tumor-only cohorts now retain status; empty columns (e.g. single-end fastq_2) are still omitted.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates samplesheet column filtering so columns with valid falsy values (e.g., 0) are retained rather than being dropped as “empty”.
Changes:
- Replaces truthiness-based filtering with an explicit “has value” check for samplesheet columns.
- Adds an inline helper (
_has_value) to define what counts as “present” data.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+318
to
+323
| # Filter to columns that have data. Use an explicit presence check rather | ||
| # than truthiness so that valid falsy values are not treated as empty - | ||
| # notably sarek's `status` column where 0 means "normal" (an all-normal | ||
| # cohort would otherwise drop the required status column entirely). | ||
| def _has_value(v): | ||
| return v is not None and v != "" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
generate_samplesheet.pycan silently drop the requiredstatuscolumn from a generated nf-core/sarek samplesheet whenever every sample is normal.In
_write_samplesheet, output columns are selected by truthiness:The sarek
statuscolumn is an integer where0 = normal,1 = tumor(config/pipelines/sarek.yaml, described as "critical for somatic calling"). Because0is falsy, a cohort in which all samples are normal makesany(... and row[c] ...)evaluate toFalse, so thestatuscolumn is omitted from the written CSV.This is easy to hit in practice:
status=0.--no-interactivemode: any sample whose name lacks a tumor/normal keyword defaults tostatus=0(_process_sarek_samples), so a directory of plainly-named FASTQs yields an all-normal cohort and loses the column entirely.The bug is masked by validation:
validate_samplesheetruns on the in-memory row dicts (which do containstatus), not on the written file — so it reports the samplesheet as valid while the emitted CSV is missing a required column.Reproduction
A mixed cohort (a
status=1present) was unaffected, which is why this slipped through.Fix
Select columns by explicit presence (
value is not None and value != "") instead of truthiness, so valid falsy values are preserved while genuinely-empty columns are still dropped.Verification
statusretained ✅ (previously dropped)status=1) →statusretained ✅fastq_2all empty) →fastq_2still correctly dropped ✅(The repo has no unit-test harness or pytest config, and CI does not run script tests, so no test file is added — the change is a minimal one-function fix.)