Fix: keep sarek status column when all samples are normal (status=0)#323

Open
Osamaali313 wants to merge 1 commit into anthropics:main from Osamaali313:fix/samplesheet-status-column-zero

@Osamaali313

Problem

generate_samplesheet.py can silently drop the required status column from a generated nf-core/sarek samplesheet whenever every sample is normal.

In _write_samplesheet, output columns are selected by truthiness:

active_columns = [c for c in column_names if any(c in row and row[c] for row in rows)]

The sarek status column is an integer where 0 = normal, 1 = tumor (config/pipelines/sarek.yaml, described as "critical for somatic calling"). Because 0 is falsy, a cohort in which all samples are normal makes any(... and row[c] ...) evaluate to False, so the status column is omitted from the written CSV.
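A minimal sketch of the failure mode, using the same comprehension as above. The column order, the `truthy_columns` wrapper name, and the FASTQ file names are assumptions made for illustration; the real column list comes from the pipeline config.

```python
# Hypothetical column list; the real one is read from config/pipelines/sarek.yaml.
column_names = ["patient", "sample", "fastq_1", "fastq_2", "status"]


def truthy_columns(column_names, rows):
    """Old behaviour: keep a column only if some row holds a truthy value."""
    return [c for c in column_names if any(c in row and row[c] for row in rows)]


# Placeholder file names, not real paths.
all_normal = [
    {"patient": "P1", "sample": "P1_blood", "fastq_1": "P1_R1.fastq.gz",
     "fastq_2": "P1_R2.fastq.gz", "status": 0},
    {"patient": "P2", "sample": "P2_blood", "fastq_1": "P2_R1.fastq.gz",
     "fastq_2": "P2_R2.fastq.gz", "status": 0},
]
mixed = all_normal + [
    {"patient": "P1", "sample": "P1_tumor", "fastq_1": "P1T_R1.fastq.gz",
     "fastq_2": "P1T_R2.fastq.gz", "status": 1},
]

# bool(0) is False, so no row "has" a status in the all-normal cohort.
print(truthy_columns(column_names, all_normal))
# A single status=1 row is enough to keep the column.
print(truthy_columns(column_names, mixed))
```

The first call returns the list without `status`; the second keeps it, which matches the mixed-cohort behaviour described in this PR.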

This is easy to hit in practice:

  • Germline-only runs, where every sample is legitimately status=0.
  • --no-interactive mode: any sample whose name lacks a tumor/normal keyword defaults to status=0 (_process_sarek_samples), so a directory of plainly-named FASTQs yields an all-normal cohort and loses the column entirely.

The bug is masked by validation: validate_samplesheet runs on the in-memory row dicts (which do contain status), not on the written file — so it reports the samplesheet as valid while the emitted CSV is missing a required column.
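For context, a check against the written output rather than the in-memory rows would have caught this. The sketch below is not part of this PR: `missing_required_columns` and the `required` list are hypothetical, and the CSV text stands in for the file the script writes.

```python
import csv
import io


def missing_required_columns(csv_text, required):
    """Return the required column names absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return [c for c in required if c not in header]


# Assumed set of required sarek columns, for illustration only.
required = ["patient", "sample", "fastq_1", "status"]

# What the old code would have written for an all-normal cohort.
written = (
    "patient,sample,fastq_1,fastq_2\n"
    "P1,P1_blood,P1_R1.fastq.gz,P1_R2.fastq.gz\n"
)

print(missing_required_columns(written, required))  # ['status']
```

Reading the header back from the emitted text is what distinguishes this from the existing validation, which only ever sees the row dicts.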

Reproduction

rows = [
    {"patient": "P1", "sample": "P1_blood", "fastq_1": ".../P1_R1.fastq.gz", "fastq_2": ".../P1_R2.fastq.gz", "status": 0},
    {"patient": "P2", "sample": "P2_blood", "fastq_1": ".../P2_R1.fastq.gz", "fastq_2": ".../P2_R2.fastq.gz", "status": 0},
]
# before: header = patient,sample,fastq_1,fastq_2      <-- status dropped
# after:  header = patient,sample,fastq_1,fastq_2,status

A mixed cohort (at least one status=1 sample present) was unaffected, which is why this slipped through.

Fix

Select columns by explicit presence (value is not None and value != "") instead of truthiness, so valid falsy values are preserved while genuinely-empty columns are still dropped.

Verification

  • All-normal cohort → status retained ✅ (previously dropped)
  • Tumor-only cohort (all status=1) → status retained ✅
  • Mixed cohort → unchanged ✅
  • Single-end data (fastq_2 all empty) → fastq_2 still correctly dropped ✅

(The repo has no unit-test harness or pytest config, and CI does not run script tests, so no test file is added — the change is a minimal one-function fix.)

_write_samplesheet selected output columns by truthiness
(`any(c in row and row[c] ...)`), which treats the valid value 0 as
empty. For nf-core/sarek the `status` column is 0=normal / 1=tumor, so
an all-normal cohort - common for germline runs, and the guaranteed
result of `--no-interactive` when sample names lack tumor/normal
keywords (every sample then defaults to status=0) - wrote a samplesheet
with the required `status` column silently dropped.

The in-memory validation passes because it runs on the row dicts (which
contain status), not on the written CSV, so the problem was masked.

Use an explicit presence check (value is not None and value != "") so
valid falsy values are preserved while genuinely-empty columns are
still dropped. Verified: all-normal and tumor-only cohorts now retain
status; empty columns (e.g. single-end fastq_2) are still omitted.
Copilot AI review requested due to automatic review settings June 14, 2026 15:43

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates samplesheet column filtering so columns with valid falsy values (e.g., 0) are retained rather than being dropped as “empty”.

Changes:

  • Replaces truthiness-based filtering with an explicit “has value” check for samplesheet columns.
  • Adds an inline helper (_has_value) to define what counts as “present” data.

Comment on lines +318 to +323
# Filter to columns that have data. Use an explicit presence check rather
# than truthiness so that valid falsy values are not treated as empty -
# notably sarek's `status` column where 0 means "normal" (an all-normal
# cohort would otherwise drop the required status column entirely).
def _has_value(v):
    return v is not None and v != ""