From 6700478177f6322a17c6faa10ba8eb01882a4417 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Thu, 11 Jun 2026 11:06:53 -0400 Subject: [PATCH 1/9] Add GrandQC slide quality control notebook for IDC DICOM WSI Demonstrates end-to-end use of GrandQC with NCI Imaging Data Commons: - Discover H&E slides via idc-index sm_index metadata - Patch GrandQC scripts for native DICOM directory support via OpenSlide 4.0 - Download TCGA-BRCA DICOM series and run tissue detection + artifact segmentation - Visualize results using GrandQC's own wsi_colors.py palette - Validate the DICOM-based pipeline against GrandQC's pre-computed TCGA masks (zenodo.org/records/14041578) by comparing per-class artifact fractions Co-Authored-By: Claude Sonnet 4.6 --- .../grandqc_slide_quality_with_idc.ipynb | 1135 +++++++++++++++++ 1 file changed, 1135 insertions(+) create mode 100644 notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb new file mode 100644 index 0000000..d505c19 --- /dev/null +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -0,0 +1,1135 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n", + "\n", + "Automated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n", + "\n", + "[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n", + "\n", + "> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n", + "\n", + "[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n", + "\n", + "This notebook demonstrates how to:\n", + "1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n", + "2. **Set up** GrandQC and download its pre-trained models\n", + "3. **Download** slides from IDC and run GrandQC directly on DICOM files\n", + "4. **Visualize** tissue detection and artifact segmentation results\n", + "5. **Leverage** GrandQC's pre-computed QC masks for all 32 TCGA cohorts" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Disclaimer\n", + "\n", + "The code and data of this repository are provided to promote reproducible research. They are not intended for clinical care or commercial use.\n", + "\n", + "The software is provided \"as is\", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.\n", + "\n", + "**GrandQC license**: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0). Non-commercial research use only; citation of the GrandQC paper is required." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 1: Discover H&E Slides in IDC\n", + "\n", + "IDC provides the `idc-index` Python package for querying slide metadata and downloading DICOM files. We start by installing it and exploring the available whole slide image collections." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install --upgrade idc-index openslide-python openslide-bin" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import matplotlib.patches as mpatches\n", + "import numpy as np\n", + "from IPython.display import IFrame, display\n", + "from idc_index import IDCClient\n", + "\n", + "idc_client = IDCClient()\n", + "print(f\"IDC data version: {idc_client.get_idc_version()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.1 Overview of Slide Microscopy Collections\n", + "\n", + "IDC stores whole slide images using the DICOM Slide Microscopy (SM) standard. The `sm_index` table extends the main metadata index with pathology-specific attributes including staining protocol, tissue type, pixel spacing, and image dimensions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load the slide microscopy index\n", + "idc_client.fetch_index(\"sm_index\")\n", + "\n", + "# Summary of all SM collections\n", + "sm_collections = idc_client.sql_query(\"\"\"\n", + " SELECT\n", + " i.collection_id,\n", + " COUNT(DISTINCT i.PatientID) AS patients,\n", + " COUNT(DISTINCT i.SeriesInstanceUID) AS series,\n", + " ROUND(SUM(i.series_size_MB) / 1024.0, 1) AS size_GB,\n", + " i.license_short_name\n", + " FROM index i\n", + " WHERE i.Modality = 'SM'\n", + " GROUP BY i.collection_id, i.license_short_name\n", + " ORDER BY patients DESC\n", + "\"\"\")\n", + "\n", + "print(f\"Total SM collections: {len(sm_collections)}\")\n", + "print(f\"Total SM series: {sm_collections['series'].sum():,}\")\n", + "print(f\"Total size: {sm_collections['size_GB'].sum():.0f} GB\")\n", + "print()\n", + "sm_collections.head(15)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.2 Filter for H&E-Stained Slides\n", + "\n", + "The `sm_index` table includes a `staining_usingSubstance_CodeMeaning` column (stored as an array) that records the staining protocol for each slide. We filter for slides stained with hematoxylin, which identifies H&E preparations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Count H&E slides per collection\n", + "he_by_collection = idc_client.sql_query(\"\"\"\n", + " SELECT\n", + " i.collection_id,\n", + " COUNT(DISTINCT i.PatientID) AS patients,\n", + " COUNT(DISTINCT i.SeriesInstanceUID) AS he_series,\n", + " ROUND(SUM(i.series_size_MB) / 1024.0, 1) AS size_GB\n", + " FROM index i\n", + " JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID\n", + " WHERE array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'\n", + " GROUP BY i.collection_id\n", + " ORDER BY he_series DESC\n", + "\"\"\")\n", + "\n", + "print(f\"Collections with H&E slides: {len(he_by_collection)}\")\n", + "print(f\"Total H&E series: {he_by_collection['he_series'].sum():,}\")\n", + "he_by_collection.head(15)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.3 Select Slides for Quality Control\n", + "\n", + "For this tutorial we work with slides from the [TCGA-BRCA](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=tcga_brca) collection (The Cancer Genome Atlas — Breast Cancer). This collection is a good choice because:\n", + "\n", + "- It contains over 3,000 H&E-stained slides at 40× magnification\n", + "- GrandQC provides **pre-computed QC masks for all TCGA cohorts** (demonstrated in Part 6)\n", + "- Slides span both tumor and adjacent normal tissue, providing variety in tissue composition\n", + "\n", + "We query the `sm_index` to retrieve per-slide metadata including the `ContainerIdentifier` (the original TCGA slide barcode), which we will later use to match IDC series against the pre-computed GrandQC masks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Query TCGA-BRCA H&E slides with key metadata\n", + "tcga_brca_he = idc_client.sql_query(\"\"\"\n", + " SELECT\n", + " i.SeriesInstanceUID,\n", + " i.PatientID,\n", + " s.ContainerIdentifier,\n", + " s.primaryAnatomicStructureModifier_CodeMeaning AS tissue_type,\n", + " s.max_TotalPixelMatrixColumns AS width_px,\n", + " s.max_TotalPixelMatrixRows AS height_px,\n", + " s.min_PixelSpacing_2sf AS pixel_spacing_mm,\n", + " s.ObjectiveLensPower AS objective_power,\n", + " ROUND(i.series_size_MB, 1) AS size_MB\n", + " FROM index i\n", + " JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID\n", + " WHERE i.collection_id = 'tcga_brca'\n", + " AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'\n", + " ORDER BY i.series_size_MB ASC\n", + "\"\"\")\n", + "\n", + "print(f\"TCGA-BRCA H&E slides: {len(tcga_brca_he)}\")\n", + "print(f\"Tissue types: {tcga_brca_he['tissue_type'].value_counts().to_dict()}\")\n", + "print(f\"\\nSize range: {tcga_brca_he['size_MB'].min()} – {tcga_brca_he['size_MB'].max()} MB\")\n", + "tcga_brca_he.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 1.4 Select a Representative Set of Slides\n", + "\n", + "For the hands-on GrandQC run we select a small set of slides that covers:\n", + "- Different tissue types (tumor vs. normal)\n", + "- Different magnifications (20× and 40×)\n", + "- Manageable file sizes for a notebook environment\n", + "\n", + "We cap the selection at slides under 200 MB to keep download and processing times reasonable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Pick up to 5 small slides with variety in tissue type and magnification\n", + "demo_slides = (\n", + " tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200]\n", + " .drop_duplicates(subset=\"tissue_type\")\n", + " .head(5)\n", + " .reset_index(drop=True)\n", + ")\n", + "\n", + "# Fall back to the 5 smallest slides if tissue_type is missing\n", + "if len(demo_slides) < 3:\n", + " demo_slides = tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200].head(5).reset_index(drop=True)\n", + "\n", + "print(f\"Selected {len(demo_slides)} slides for GrandQC processing:\")\n", + "display(\n", + " demo_slides[[\n", + " \"PatientID\", \"ContainerIdentifier\", \"tissue_type\",\n", + " \"objective_power\", \"width_px\", \"height_px\", \"size_MB\"\n", + " ]]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Preview slides in the IDC SLIM viewer (pathology viewer)\n", + "for _, row in demo_slides.iterrows():\n", + " url = idc_client.get_viewer_URL(seriesInstanceUID=row[\"SeriesInstanceUID\"])\n", + " print(f\"{row['ContainerIdentifier']} ({row['tissue_type']}, {row['size_MB']} MB)\")\n", + " print(f\" {url}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 2: GrandQC Setup\n", + "\n", + "Before running GrandQC we need to:\n", + "1. Clone the repository from GitHub\n", + "2. Install its Python dependencies\n", + "3. Download the pre-trained models from Zenodo\n", + "4. Apply a small patch so both scripts accept IDC's DICOM series directories alongside conventional WSI files\n", + "\n", + "> **Note**: GrandQC's scripts currently iterate over files in a flat folder. IDC downloads each DICOM series into its own subdirectory. The patch below (≤ 10 lines changed across 2 files) teaches GrandQC to recognise a directory whose contents end in `.dcm` as a valid slide entry and resolves the OpenSlide path accordingly. No model weights or algorithm logic are changed." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2.1 Clone and Install Dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "# Clone GrandQC (shallow clone — we only need the latest commit)\n", + "!git clone --depth 1 https://github.com/cpath-ukk/grandqc.git\n", + "\n", + "# Install GrandQC Python dependencies.\n", + "# PyTorch is already present in Colab; we install the remaining packages.\n", + "# segmentation-models-pytorch 0.3.1 is pinned because its model.predict() API\n", + "# was removed in 0.4.0 and GrandQC relies on it.\n", + "!pip install -q \\\n", + " \"segmentation-models-pytorch==0.3.1\" \\\n", + " \"rasterio>=1.3\" \\\n", + " \"imagecodecs\" \\\n", + " \"zarr<3\" \\\n", + " \"tifffile\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2.2 Download Pre-trained Models" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "GRANDQC_SCRIPTS = \"grandqc/01_WSI_inference_OPENSLIDE_QC\"\n", + "MODELS_TD = os.path.join(GRANDQC_SCRIPTS, \"models\", \"td\")\n", + "MODELS_QC = os.path.join(GRANDQC_SCRIPTS, \"models\", \"qc\")\n", + "os.makedirs(MODELS_TD, exist_ok=True)\n", + "os.makedirs(MODELS_QC, exist_ok=True)\n", + "\n", + "# Tissue detection model (~27 MB) — Zenodo record 14507273\n", + "!wget -q --show-progress -O {MODELS_TD}/Tissue_Detection_MPP10.pth \\\n", + " \"https://zenodo.org/records/14507273/files/Tissue_Detection_MPP10.pth?download=1\"\n", + "\n", + "# Artifact segmentation model at MPP=1.5 (~25 MB) — Zenodo record 14041538\n", + "# MPP=1.5 (≈7× objective) is the recommended default for most scanners.\n", + "!wget -q --show-progress -O {MODELS_QC}/GrandQC_MPP15.pth \\\n", + " \"https://zenodo.org/records/14041538/files/GrandQC_MPP15.pth?download=1\"\n", + "\n", + "print(\"Models downloaded:\")\n", + "for f in [f\"{MODELS_TD}/Tissue_Detection_MPP10.pth\", f\"{MODELS_QC}/GrandQC_MPP15.pth\"]:\n", + " size_mb = os.path.getsize(f) / 1024**2\n", + " print(f\" {f} ({size_mb:.1f} MB)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2.3 Patch Scripts for DICOM Directory Support\n", + "\n", + "IDC downloads each series as a directory of tiled DICOM files (`*.dcm`). OpenSlide 4.0+ can open any one of those files and reconstruct the full pyramid from the same directory.\n", + "\n", + "The two changes per script are:\n", + "\n", + "| Script | Change |\n", + "|--------|--------|\n", + "| `wsi_tis_detect.py` | Replace file-only `os.listdir` filter with one that also yields subdirectories containing `.dcm` files; resolve the path before calling `OpenSlide()` |\n", + "| `main.py` | Same two changes using `open_slide()` |\n", + "\n", + "Both scripts also get automatic CUDA/MPS/CPU device selection instead of the hardcoded `'cuda'`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "def patch_grandqc_for_dicom(script_path: str) -> None:\n", + " text = Path(script_path).read_text()\n", + " original = text\n", + "\n", + " # ── 1. Auto-detect compute device ──────────────────────────────────────\n", + " text = text.replace(\n", + " \"DEVICE = 'cuda'\",\n", + " \"import torch as _t; DEVICE = ('cuda' if _t.cuda.is_available() \"\n", + " \"else ('mps' if _t.backends.mps.is_available() else 'cpu'))\",\n", + " )\n", + "\n", + " # ── 2. Replace slide-list builder to also accept DICOM directories ──────\n", + " DICOM_ITER = (\n", + " \"def _iter_slides(d):\\n\"\n", + " \" for n in sorted(os.listdir(d)):\\n\"\n", + " \" p = os.path.join(d, n)\\n\"\n", + " \" if os.path.isfile(p) or (\\n\"\n", + " \" os.path.isdir(p) and\\n\"\n", + " \" any(x.lower().endswith('.dcm') for x in os.listdir(p))):\\n\"\n", + " \" yield n\\n\"\n", + " )\n", + "\n", + " # wsi_tis_detect.py pattern\n", + " OLD_LIST_TIS = (\n", + " \"slide_names = sorted([f for f in os.listdir(SLIDE_DIR) \"\n", + " \"if os.path.isfile(os.path.join(SLIDE_DIR, f))])\"\n", + " )\n", + " NEW_LIST_TIS = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n", + "\n", + " # main.py pattern\n", + " OLD_LIST_MAIN = \"slide_names = sorted(os.listdir(SLIDE_DIR))\"\n", + " NEW_LIST_MAIN = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n", + "\n", + " text = text.replace(OLD_LIST_TIS, NEW_LIST_TIS)\n", + " text = text.replace(OLD_LIST_MAIN, NEW_LIST_MAIN)\n", + "\n", + " # ── 3. Resolve DICOM directory to first .dcm file before open ───────────\n", + " RESOLVE = (\n", + " \" if os.path.isdir(path_slide):\\n\"\n", + " \" _dcms = sorted(x for x in os.listdir(path_slide)\"\n", + " \" if x.lower().endswith('.dcm'))\\n\"\n", + " \" path_slide = os.path.join(path_slide, _dcms[0])\\n\"\n", + " )\n", + "\n", + " for open_call in (\"slide = OpenSlide(path_slide)\", \"slide = open_slide(path_slide)\"):\n", + " old = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n {open_call}\"\n", + " new = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n{RESOLVE} {open_call}\"\n", + " text = text.replace(old, new)\n", + "\n", + " if text == original:\n", + " print(f\" WARNING: no changes applied to {script_path}\")\n", + " else:\n", + " Path(script_path).write_text(text)\n", + " n = sum(1 for a, b in zip(original.splitlines(), text.splitlines()) if a != b)\n", + " n += abs(len(text.splitlines()) - len(original.splitlines()))\n", + " print(f\" Patched {script_path} (+{n} lines changed)\")\n", + "\n", + "\n", + "for script in [\n", + " f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\",\n", + " f\"{GRANDQC_SCRIPTS}/main.py\",\n", + "]:\n", + " patch_grandqc_for_dicom(script)\n", + "print(\"Done.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sanity-check: confirm the DICOM resolver was inserted in both scripts\n", + "for script in [f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\", f\"{GRANDQC_SCRIPTS}/main.py\"]:\n", + " text = Path(script).read_text()\n", + " has_dcm = \"_iter_slides\" in text\n", + " has_res = \"_dcms\" in text\n", + " has_dev = \"cuda.is_available\" in text\n", + " print(f\"{script.split('/')[-1]:30s} dicom_iter={has_dcm} dcm_resolve={has_res} auto_device={has_dev}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 3: Download Slides from IDC\n", + "\n", + "`idc_client.download_from_selection()` fetches DICOM files for the requested series and places each series in its own subdirectory named by `SeriesInstanceUID`.\n", + "\n", + "After downloading we rename each subdirectory to its `ContainerIdentifier` (the original TCGA slide barcode). This ensures that GrandQC's output files and the pre-computed TCGA mask filenames are consistently keyed by the human-readable barcode rather than an opaque UID." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3.1 Download DICOM Series" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "SLIDE_DIR = \"slides\"\n", + "os.makedirs(SLIDE_DIR, exist_ok=True)\n", + "\n", + "print(f\"Downloading {len(demo_slides)} slides → {SLIDE_DIR}/\")\n", + "print(\"(This may take a few minutes depending on file sizes and network speed)\\n\")\n", + "\n", + "idc_client.download_from_selection(\n", + " downloadDir=SLIDE_DIR,\n", + " seriesInstanceUID=demo_slides[\"SeriesInstanceUID\"].tolist(),\n", + ")\n", + "print(\"\\nDownload complete.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3.2 Rename Series Directories to TCGA Barcodes\n", + "\n", + "GrandQC uses the directory/file name as the key for all output files (tissue mask, QC map, overlay, TSV report row). By renaming each downloaded series directory from its `SeriesInstanceUID` to its `ContainerIdentifier` (TCGA barcode), we get:\n", + "\n", + "- Human-readable output filenames (`TCGA-A7-A26J-01B-02-BS2_MASK.png` instead of a UID)\n", + "- Automatic matching with GrandQC's pre-computed TCGA masks (Part 6), which are keyed by the same barcode" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Build SeriesInstanceUID → ContainerIdentifier map\n", + "uid_to_barcode = dict(zip(demo_slides[\"SeriesInstanceUID\"], demo_slides[\"ContainerIdentifier\"]))\n", + "\n", + "for uid, barcode in uid_to_barcode.items():\n", + " src = os.path.join(SLIDE_DIR, uid)\n", + " dst = os.path.join(SLIDE_DIR, barcode)\n", + " if os.path.isdir(src):\n", + " os.rename(src, dst)\n", + " print(f\" {uid} → {barcode}\")\n", + " elif os.path.isdir(dst):\n", + " print(f\" Already renamed: {barcode}\")\n", + " else:\n", + " print(f\" WARNING: directory not found: {uid}\")\n", + "\n", + "print(\"\\nSlide directories after renaming:\")\n", + "for d in sorted(os.listdir(SLIDE_DIR)):\n", + " dpath = os.path.join(SLIDE_DIR, d)\n", + " if os.path.isdir(dpath):\n", + " n = sum(1 for f in os.listdir(dpath) if f.lower().endswith(\".dcm\"))\n", + " print(f\" {d}/ ({n} .dcm files)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3.3 Verify OpenSlide Can Read the DICOM Files\n", + "\n", + "OpenSlide 4.0 (bundled in `openslide-bin`) reads DICOM WSI files natively. Opening any one `.dcm` file from a series directory is sufficient — OpenSlide discovers the remaining pyramid levels from the other files in the same directory that share the same `SeriesInstanceUID`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import openslide\n", + "\n", + "print(f\"OpenSlide version: {openslide.__library_version__}\\n\")\n", + "\n", + "for barcode in uid_to_barcode.values():\n", + " slide_dir = os.path.join(SLIDE_DIR, barcode)\n", + " dcm_files = sorted(f for f in os.listdir(slide_dir) if f.lower().endswith(\".dcm\"))\n", + " if not dcm_files:\n", + " print(f\" {barcode}: no .dcm files found\")\n", + " continue\n", + " entry = os.path.join(slide_dir, dcm_files[0])\n", + " try:\n", + " slide = openslide.OpenSlide(entry)\n", + " w, h = slide.level_dimensions[0]\n", + " mpp = slide.properties.get(\"openslide.mpp-x\", \"N/A\")\n", + " vendor = slide.properties.get(\"openslide.vendor\", \"N/A\")\n", + " print(f\" {barcode}\")\n", + " print(f\" {w:,} × {h:,} px | MPP={mpp} | vendor={vendor} | levels={slide.level_count}\")\n", + " slide.close()\n", + " except Exception as exc:\n", + " print(f\" {barcode}: ERROR — {exc}\")\n", + "\n", + "print(\"\\nAll slides verified.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 4: Run GrandQC Quality Control\n", + "\n", + "GrandQC runs as two sequential scripts, both inside `grandqc/01_WSI_inference_OPENSLIDE_QC/`:\n", + "\n", + "| Step | Script | MPP | Time (GPU) | Output |\n", + "|------|--------|-----|------------|--------|\n", + "| 1 | `wsi_tis_detect.py` | 10 | ~0.4 s/slide | Binary tissue masks |\n", + "| 2 | `main.py` | 1.5 | ~30–45 s/slide | 7-class artifact maps, GeoJSON, TSV report |\n", + "\n", + "The 7 artifact classes are: **1** clean tissue · **2** folds · **3** dark spots · **4** pen marks · **5** air bubbles/edges · **6** out-of-focus · **7** background.\n", + "\n", + "Both scripts load models via relative paths (`./models/…`) and must therefore be invoked with their directory as the working directory. We use `subprocess` with `cwd=GRANDQC_SCRIPTS` to achieve this while keeping output streaming to the cell." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.1 Step 1 — Tissue Detection\n", + "\n", + "Produces a low-resolution binary mask (tissue vs. background) for each slide. The mask is saved as a PNG and re-used by Step 2 to skip background patches and cut inference time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess, sys\n", + "\n", + "OUTPUT_DIR = \"qc_output\"\n", + "os.makedirs(OUTPUT_DIR, exist_ok=True)\n", + "\n", + "slide_dir_abs = os.path.abspath(SLIDE_DIR)\n", + "output_dir_abs = os.path.abspath(OUTPUT_DIR)\n", + "scripts_abs = os.path.abspath(GRANDQC_SCRIPTS)\n", + "\n", + "print(f\"Slides: {slide_dir_abs}\")\n", + "print(f\"Output: {output_dir_abs}\")\n", + "print(f\"Scripts dir: {scripts_abs}\")\n", + "print()\n", + "print(\"Running wsi_tis_detect.py ...\")\n", + "\n", + "result = subprocess.run(\n", + " [sys.executable, \"wsi_tis_detect.py\",\n", + " \"--slide_folder\", slide_dir_abs,\n", + " \"--output_dir\", output_dir_abs],\n", + " cwd=scripts_abs,\n", + ")\n", + "\n", + "if result.returncode != 0:\n", + " print(f\"\\nERROR: wsi_tis_detect.py exited with code {result.returncode}\")\n", + "else:\n", + " mask_dir = os.path.join(output_dir_abs, \"tis_det_mask\")\n", + " masks = sorted(os.listdir(mask_dir))\n", + " print(f\"\\nTissue detection complete — {len(masks)} mask(s) written to {mask_dir}/\")\n", + " for m in masks:\n", + " print(f\" {m}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.2 Step 2 — Artifact Segmentation\n", + "\n", + "Runs the 7-class artifact model at MPP=1.5. For each slide it produces:\n", + "\n", + "- `maps_qc/_map_QC.png` — color-coded artifact map (same resolution as tissue mask)\n", + "- `overlays_qc/_overlay_QC.jpg` — artifact map overlaid on a downscaled slide thumbnail\n", + "- `mask_qc/_mask.png` — raw integer mask (class 1–7 per pixel)\n", + "- `geojson_qc/.geojson` — polygon annotations compatible with QuPath and SLIM\n", + "- `report_qc_output_0_N_stats_per_slide.txt` — TSV with per-slide dimensions and processing time\n", + "\n", + "> **GPU note**: On a T4 GPU (Colab free tier), expect ~30–45 s per slide. On CPU only, plan for 10–30 min per slide depending on size." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Running main.py (artifact segmentation) ...\")\n", + "\n", + "result = subprocess.run(\n", + " [sys.executable, \"main.py\",\n", + " \"--slide_folder\", slide_dir_abs,\n", + " \"--output_dir\", output_dir_abs,\n", + " \"--mpp_model\", \"1.5\",\n", + " \"--create_geojson\", \"Y\"],\n", + " cwd=scripts_abs,\n", + ")\n", + "\n", + "if result.returncode != 0:\n", + " print(f\"\\nERROR: main.py exited with code {result.returncode}\")\n", + "else:\n", + " maps_dir = os.path.join(output_dir_abs, \"maps_qc\")\n", + " maps = sorted(os.listdir(maps_dir))\n", + " print(f\"\\nArtifact segmentation complete — {len(maps)} map(s) in {maps_dir}/\")\n", + " for m in maps:\n", + " print(f\" {m}\")\n", + "\n", + " # Locate the TSV report (name includes the slide count)\n", + " reports = [f for f in os.listdir(output_dir_abs) if f.endswith(\"_stats_per_slide.txt\")]\n", + " if reports:\n", + " import pandas as pd\n", + " report_path = os.path.join(output_dir_abs, reports[0])\n", + " df = pd.read_csv(report_path, sep=\"\\t\")\n", + " print(f\"\\nPer-slide report ({reports[0]}):\")\n", + " display(df[[\"slide_name\", \"mpp\", \"height\", \"width\", \"time\"]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 5: Visualize and Analyze Results\n", + "\n", + "GrandQC writes several output types per slide. We look at three of them:\n", + "\n", + "1. **Tissue detection overlays** — confirms that tissue vs. background separation was correct\n", + "2. **Artifact segmentation maps** — color-coded composite of the 7-class artifact model overlaid on a slide thumbnail\n", + "3. **Raw integer masks** (`mask_qc/`) — pixel-level class labels (1–7) used to compute per-slide artifact fractions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.1 Tissue Detection Overlays\n", + "\n", + "Each overlay blends the original thumbnail (70 %) with a two-class color mask (30 %): blue = tissue, gray = background." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import matplotlib.image as mpimg\n", + "\n", + "overlay_dir = os.path.join(OUTPUT_DIR, \"tis_det_overlay\")\n", + "fnames = sorted(os.listdir(overlay_dir))\n", + "\n", + "fig, axes = plt.subplots(1, len(fnames), figsize=(6 * len(fnames), 5))\n", + "if len(fnames) == 1:\n", + " axes = [axes]\n", + "for ax, fname in zip(axes, fnames):\n", + " ax.imshow(mpimg.imread(os.path.join(overlay_dir, fname)))\n", + " ax.set_title(fname.replace(\"_OVERLAY.jpg\", \"\"), fontsize=7)\n", + " ax.axis(\"off\")\n", + "plt.suptitle(\"Step 1 — Tissue Detection\", fontsize=11, fontweight=\"bold\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.2 Artifact Segmentation Maps\n", + "\n", + "The QC overlay composites the artifact color map onto a downscaled thumbnail of the original slide. Colors match the `wsi_colors.py` palette used by GrandQC:\n", + "\n", + "| Class | Color | Artifact |\n", + "|-------|-------|---------|\n", + "| 1 | ■ Gray `#808080` | Clean tissue |\n", + "| 2 | ■ Tomato `#FF6347` | Folds |\n", + "| 3 | ■ Lime `#00FF00` | Dark spots |\n", + "| 4 | ■ Red `#FF0000` | Pen marks |\n", + "| 5 | ■ Magenta `#FF00FF` | Air bubbles / slide edges |\n", + "| 6 | ■ Indigo `#4B0082` | Out-of-focus |\n", + "| 7 | □ White `#FFFFFF` | Background |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "overlay_qc_dir = os.path.join(OUTPUT_DIR, \"overlays_qc\")\n", + "fnames_qc = sorted(os.listdir(overlay_qc_dir))\n", + "\n", + "fig, axes = plt.subplots(1, len(fnames_qc), figsize=(6 * len(fnames_qc), 5))\n", + "if len(fnames_qc) == 1:\n", + " axes = [axes]\n", + "for ax, fname in zip(axes, fnames_qc):\n", + " ax.imshow(mpimg.imread(os.path.join(overlay_qc_dir, fname)))\n", + " ax.set_title(fname.replace(\"_overlay_QC.jpg\", \"\"), fontsize=7)\n", + " ax.axis(\"off\")\n", + "plt.suptitle(\"Step 2 — Artifact Segmentation Overlays\", fontsize=11, fontweight=\"bold\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.3 Artifact Fractions per Slide\n", + "\n", + "We read the raw integer masks from `mask_qc/` (pixel values 1–7) and compute what fraction of the **tissue area** (excluding background, class 7) belongs to each artifact class. This is the standard metric reported in the GrandQC paper." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from PIL import Image\n", + "\n", + "# GrandQC wsi_colors.py palette — RGB, indices 0-5 correspond to classes 1-6\n", + "GRANDQC_COLORS_RGB = [\n", + " [128, 128, 128], # 1 Clean tissue\n", + " [255, 99, 71], # 2 Folds\n", + " [ 0, 255, 0], # 3 Dark spots\n", + " [255, 0, 0], # 4 Pen marks\n", + " [255, 0, 255], # 5 Bubbles/edges\n", + " [ 75, 0, 130], # 6 Out-of-focus\n", + "]\n", + "CLASS_NAMES_SHORT = [\n", + " \"Clean tissue\", \"Folds\", \"Dark spots\",\n", + " \"Pen marks\", \"Bubbles/edges\", \"Out-of-focus\",\n", + "]\n", + "# Normalize to [0,1] for matplotlib\n", + "mpl_colors = [[r/255, g/255, b/255] for r, g, b in GRANDQC_COLORS_RGB]\n", + "\n", + "mask_dir = os.path.join(OUTPUT_DIR, \"mask_qc\")\n", + "rows = []\n", + "for fname in sorted(os.listdir(mask_dir)):\n", + " barcode = fname.replace(\"_mask.png\", \"\")\n", + " mask = np.array(Image.open(os.path.join(mask_dir, fname)))\n", + " tissue_px = np.sum(mask != 7) # exclude background\n", + " if tissue_px == 0:\n", + " continue\n", + " row = {\"slide\": barcode}\n", + " for cls_idx, name in enumerate(CLASS_NAMES_SHORT, start=1):\n", + " row[name] = np.sum(mask == cls_idx) / tissue_px * 100\n", + " rows.append(row)\n", + "\n", + "stats_df = pd.DataFrame(rows).set_index(\"slide\")\n", + "print(\"Artifact class fractions (% of tissue area):\")\n", + "display(stats_df.round(1))\n", + "\n", + "# ── Stacked bar chart ────────────────────────────────────────────────────────\n", + "ax = stats_df[CLASS_NAMES_SHORT].plot(\n", + " kind=\"bar\",\n", + " stacked=True,\n", + " color=mpl_colors,\n", + " figsize=(max(6, 2.5 * len(stats_df)), 5),\n", + " edgecolor=\"none\",\n", + ")\n", + "ax.set_ylabel(\"% of tissue area\")\n", + "ax.set_xlabel(\"\")\n", + "ax.set_title(\"GrandQC Artifact Fractions per Slide\", fontweight=\"bold\")\n", + "ax.set_xticklabels(ax.get_xticklabels(), rotation=25, ha=\"right\", fontsize=8)\n", + "ax.legend(loc=\"upper right\", fontsize=8, framealpha=0.8)\n", + "ax.set_ylim(0, 110)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.4 View Original Slides in IDC SLIM Viewer\n", + "\n", + "The IDC SLIM viewer is a web-based pathology viewer for DICOM whole slide images. The links below open each slide at full resolution. The GeoJSON annotations produced by GrandQC (`geojson_qc/`) can be loaded into QuPath or other tools that accept GeoJSON overlays for interactive artifact review." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import IFrame, display\n", + "\n", + "print(\"IDC SLIM viewer links (open in browser for full-resolution viewing):\\n\")\n", + "for _, row in demo_slides.iterrows():\n", + " url = idc_client.get_viewer_URL(seriesInstanceUID=row[\"SeriesInstanceUID\"])\n", + " print(f\" {row['ContainerIdentifier']}\")\n", + " print(f\" {url}\\n\")\n", + "\n", + "# Embed the first slide inline (750 px height)\n", + "first_url = idc_client.get_viewer_URL(seriesInstanceUID=demo_slides.iloc[0][\"SeriesInstanceUID\"])\n", + "print(f\"Embedding: {demo_slides.iloc[0]['ContainerIdentifier']}\")\n", + "display(IFrame(first_url, width=\"100%\", height=750))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 6: Validating the Pipeline Against Pre-computed TCGA Masks\n", + "\n", + "In Parts 3–5 we ran GrandQC locally on IDC DICOM slides using OpenSlide's native DICOM reader — a path the GrandQC authors did not test in their original publication. Before using the pipeline on new data, it is good practice to verify that our results are consistent with the authors' reference outputs.\n", + "\n", + "The GrandQC authors published pre-computed QC masks for all 32 TCGA cohorts on Zenodo:\n", + "\n", + "> **GrandQC pre-computed masks**: [zenodo.org/records/14041578](https://zenodo.org/records/14041578) \n", + "> 36 tar archives — one per TCGA cohort (some large cohorts split into two parts) — totalling ~18.3 GB.\n", + "\n", + "Each archive unpacks to PNG files named `{ContainerIdentifier}_mask.png` with integer labels 1–7 — the same format and naming convention produced by `main.py`. By downloading the TCGA-BRCA archive and comparing artifact fractions for our demo slides against the reference, we can confirm that the DICOM-based pipeline is producing equivalent results." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.1 Available Cohort Archives\n", + "\n", + "The Zenodo record contains one tar file per TCGA cohort. Large cohorts (BRCA, THCA, GBM, LUAD, UCEC, LGG, TGCT) are split into two parts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "\n", + "# Fetch the Zenodo record metadata to get the authoritative file list\n", + "rec = requests.get(\"https://zenodo.org/api/records/14041578\").json()\n", + "files = rec[\"files\"]\n", + "\n", + "rows_z = []\n", + "for f in sorted(files, key=lambda x: x[\"key\"]):\n", + " rows_z.append({\n", + " \"archive\": f[\"key\"],\n", + " \"size_MB\": round(f[\"size\"] / 1024**2, 0),\n", + " })\n", + "\n", + "zenodo_df = pd.DataFrame(rows_z)\n", + "total_gb = zenodo_df[\"size_MB\"].sum() / 1024\n", + "print(f\"{len(zenodo_df)} archives | total: {total_gb:.1f} GB\\n\")\n", + "display(zenodo_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.2 Download the TCGA-BRCA Reference Masks\n", + "\n", + "`BRCA.tar` (~2 GB) contains masks for all TCGA-BRCA slides. After extraction, each file is named `{ContainerIdentifier}_mask.png` — identical to what GrandQC writes into `mask_qc/` locally." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "REF_MASKS_DIR = \"brca_ref_masks\"\n", + "os.makedirs(REF_MASKS_DIR, exist_ok=True)\n", + "\n", + "# Download BRCA.tar (~2 GB) — takes 5-10 min on Colab\n", + "!wget -q --show-progress \\\n", + " \"https://zenodo.org/records/14041578/files/BRCA.tar?download=1\" \\\n", + " -O BRCA.tar\n", + "\n", + "# Extract — PNG files land directly in the archive root\n", + "!tar xf BRCA.tar -C {REF_MASKS_DIR}\n", + "\n", + "# Report what we got\n", + "ref_masks = sorted(f for f in os.listdir(REF_MASKS_DIR) if f.endswith(\"_mask.png\"))\n", + "print(f\"\\nExtracted {len(ref_masks)} reference masks to {REF_MASKS_DIR}/\")\n", + "print(\"Sample filenames:\")\n", + "for f in ref_masks[:5]:\n", + " print(f\" {f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.3 Pipeline Validation: Compare Local vs. Reference Masks\n", + "\n", + "The reference masks in `brca_ref_masks/` were produced by the GrandQC authors running on TIFF/SVS files. Our local masks in `qc_output/mask_qc/` were produced by the same model running on IDC DICOM files via OpenSlide's native DICOM reader.\n", + "\n", + "We compare per-class artifact fractions for each demo slide. Close agreement (< 2 percentage points per class) confirms that the DICOM-based path does not introduce bias relative to the authors' original pipeline." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def mask_fractions(mask_path):\n", + " mask = np.array(Image.open(mask_path))\n", + " tissue_px = np.sum(mask != 7)\n", + " if tissue_px == 0:\n", + " return None\n", + " return {name: np.sum(mask == cls) / tissue_px * 100\n", + " for cls, name in enumerate(CLASS_NAMES_SHORT, start=1)}\n", + "\n", + "\n", + "# Build comparison table for each demo slide\n", + "compare_rows = []\n", + "for barcode in uid_to_barcode.values():\n", + " local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n", + " ref_path = os.path.join(REF_MASKS_DIR, f\"{barcode}_mask.png\")\n", + "\n", + " if not os.path.exists(local_path):\n", + " print(f\" MISSING local mask: {barcode}\")\n", + " continue\n", + " if not os.path.exists(ref_path):\n", + " print(f\" MISSING reference mask: {barcode} — slide may not be in BRCA.tar\")\n", + " continue\n", + "\n", + " local_frac = mask_fractions(local_path)\n", + " ref_frac = mask_fractions(ref_path)\n", + " if local_frac is None or ref_frac is None:\n", + " continue\n", + "\n", + " for name in CLASS_NAMES_SHORT:\n", + " compare_rows.append({\n", + " \"slide\": barcode,\n", + " \"class\": name,\n", + " \"local_%\": round(local_frac[name], 2),\n", + " \"ref_%\": round(ref_frac[name], 2),\n", + " \"delta_pp\": round(local_frac[name] - ref_frac[name], 2),\n", + " })\n", + "\n", + "compare_df = pd.DataFrame(compare_rows)\n", + "print(\"Per-class fraction comparison (local DICOM run vs. Zenodo reference):\\n\")\n", + "display(compare_df.pivot_table(index=\"slide\", columns=\"class\", values=\"delta_pp\").round(2))\n", + "print(\"\\nδ = local − reference (percentage points). Values near 0 confirm agreement.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scatter plot: local fraction vs. reference fraction for all classes × slides\n", + "fig, ax = plt.subplots(figsize=(5, 5))\n", + "\n", + "class_colors = {name: mpl_colors[i] for i, name in enumerate(CLASS_NAMES_SHORT)}\n", + "\n", + "for cls_name, grp in compare_df.groupby(\"class\"):\n", + " ax.scatter(grp[\"ref_%\"], grp[\"local_%\"],\n", + " label=cls_name, color=class_colors[cls_name],\n", + " s=60, edgecolors=\"black\", linewidths=0.4, zorder=3)\n", + "\n", + "# Identity line\n", + "lim_max = max(compare_df[[\"local_%\", \"ref_%\"]].max()) * 1.05\n", + "ax.plot([0, lim_max], [0, lim_max], \"k--\", linewidth=0.8, label=\"y = x\")\n", + "ax.set_xlabel(\"Reference fraction (%) — Zenodo\")\n", + "ax.set_ylabel(\"Local fraction (%) — DICOM run\")\n", + "ax.set_title(\"Pipeline Validation: Local vs. Reference Artifact Fractions\",\n", + " fontweight=\"bold\", fontsize=9)\n", + "ax.legend(fontsize=7, framealpha=0.8)\n", + "ax.set_xlim(0, lim_max)\n", + "ax.set_ylim(0, lim_max)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 6.4 Cohort-level QC Statistics\n", + "\n", + "With validation confirmed, we compute per-slide artifact fractions across all TCGA-BRCA slides in the reference archive. This gives a population-level view of slide quality without running any inference locally — each mask is read once and discarded, keeping memory usage constant." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm.notebook import tqdm\n", + "\n", + "cohort_rows = []\n", + "for fname in tqdm(ref_masks, desc=\"Computing stats\"):\n", + " barcode = fname.replace(\"_mask.png\", \"\")\n", + " fracs = mask_fractions(os.path.join(REF_MASKS_DIR, fname))\n", + " if fracs is not None:\n", + " fracs[\"slide\"] = barcode\n", + " cohort_rows.append(fracs)\n", + "\n", + "cohort_df = pd.DataFrame(cohort_rows).set_index(\"slide\")\n", + "\n", + "print(f\"Cohort: TCGA-BRCA | {len(cohort_df)} slides\\n\")\n", + "print(\"Median artifact fractions (% of tissue area):\")\n", + "display(cohort_df[CLASS_NAMES_SHORT].median().rename(\"median_%\").to_frame().T.round(1))\n", + "print()\n", + "print(\"90th-percentile artifact fractions:\")\n", + "display(cohort_df[CLASS_NAMES_SHORT].quantile(0.9).rename(\"p90_%\").to_frame().T.round(1))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Violin plot of per-slide artifact fractions across TCGA-BRCA\n", + "# Exclude Clean tissue (class 1) — typically >80%, dominates the y-axis\n", + "artifact_classes = [c for c in CLASS_NAMES_SHORT if c != \"Clean tissue\"]\n", + "artifact_colors = [mpl_colors[i] for i, c in enumerate(CLASS_NAMES_SHORT) if c != \"Clean tissue\"]\n", + "\n", + "fig, ax = plt.subplots(figsize=(9, 4))\n", + "data = [cohort_df[cls].dropna().values for cls in artifact_classes]\n", + "parts = ax.violinplot(data, positions=range(len(artifact_classes)),\n", + " showmedians=True, showextrema=False)\n", + "\n", + "for pc, color in zip(parts[\"bodies\"], artifact_colors):\n", + " pc.set_facecolor(color)\n", + " pc.set_alpha(0.75)\n", + "parts[\"cmedians\"].set_color(\"black\")\n", + "parts[\"cmedians\"].set_linewidth(1.5)\n", + "\n", + "ax.set_xticks(range(len(artifact_classes)))\n", + "ax.set_xticklabels(artifact_classes, rotation=20, ha=\"right\")\n", + "ax.set_ylabel(\"% of tissue area\")\n", + "ax.set_title(f\"Artifact fraction distribution — TCGA-BRCA ({len(cohort_df)} slides)\",\n", + " fontweight=\"bold\")\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Join QC stats with IDC sm_index metadata via ContainerIdentifier\n", + "cohort_meta = cohort_df.reset_index().rename(columns={\"slide\": \"ContainerIdentifier\"})\n", + "brca_meta = tcga_brca_he[[\"ContainerIdentifier\", \"tissue_type\"]].dropna(subset=[\"tissue_type\"])\n", + "\n", + "merged = cohort_meta.merge(brca_meta, on=\"ContainerIdentifier\", how=\"inner\")\n", + "\n", + "print(f\"Slides with tissue-type annotation: {len(merged)} / {len(cohort_df)}\\n\")\n", + "\n", + "# Median artifact fractions by tissue type\n", + "by_tissue = (\n", + " merged.groupby(\"tissue_type\")[CLASS_NAMES_SHORT]\n", + " .median()\n", + " .round(1)\n", + ")\n", + "print(\"Median artifact fractions (%) by tissue type:\")\n", + "display(by_tissue)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From c410ab19f3d692251eb2e68725c8fb424b545885 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Mon, 15 Jun 2026 14:25:42 -0400 Subject: [PATCH 2/9] Fix GrandQC artifact segmentation step and document inference resolution main.py loads a fully-pickled QC model via torch.load(), which fails with exit code 1 under PyTorch >=2.6 (weights_only=True default). Patch the load to weights_only=False so the trusted Zenodo weights unpickle. Also add [i/N] progress reporting and a failure guard that records per-slide errors and exits non-zero, so the notebook surfaces failures the loop used to swallow. Add grandqc_slide_quality_resolution_notes.md documenting the model working resolution (MPP 1.5) vs. the fixed 512x512 input tile, how MPP is read from openslide.mpp-x (DICOM PixelSpacing), and an empirical confirmation on an IDC slide; link it from the Part 4 intro. Co-Authored-By: Claude Opus 4.8 --- .../grandqc_slide_quality_resolution_notes.md | 178 +++ .../grandqc_slide_quality_with_idc.ipynb | 1181 +++++++++++++++-- 2 files changed, 1235 insertions(+), 124 deletions(-) create mode 100644 notebooks/pathomics/grandqc_slide_quality_resolution_notes.md diff --git a/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md new file mode 100644 index 0000000..0206cf2 --- /dev/null +++ b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md @@ -0,0 +1,178 @@ +# GrandQC + IDC: Notes on Inference Resolution + +Supplementary notes for [`grandqc_slide_quality_with_idc.ipynb`](grandqc_slide_quality_with_idc.ipynb). + +This document explains, for the GrandQC inference scripts in +`grandqc/01_WSI_inference_OPENSLIDE_QC/`, **what resolution the model runs at, +where that is controlled, and how the per-slide microns-per-pixel (MPP) value is +determined** when slides are read from NCI Imaging Data Commons (IDC) as DICOM. + +--- + +## Two distinct resolutions + +There are two separate resolution concepts in the pipeline. They are easy to +conflate but are set in different places. + +| Resolution concept | Value (this tutorial) | Where it is set | +|---|---|---| +| **Model working resolution** (MPP the network "sees") | `1.5` µm/px (≈ 7× objective) | `--mpp_model` CLI flag → `MPP_MODEL` in `main.py` | +| **Tissue-detection resolution** (Step 1) | `10` µm/px | hardcoded in `wsi_tis_detect.py` | +| **Model input tile size** | `512 × 512` px | `M_P_S_MODEL = 512`, hardcoded in `main.py` | +| **Native read size per tile** | `mpp_model / mpp × 512` px (level 0) | computed in `wsi_slide_info.py:slide_info()` | + +The model **never** runs at the slide's native 20×/40× magnification. The tissue +detector runs coarsest (MPP 10); the artifact model runs at MPP 1.5. + +--- + +## 1. Model working resolution — `--mpp_model` + +The artifact-segmentation step (notebook Part 4.2) is invoked as: + +```python +subprocess.run([sys.executable, "main.py", + "--slide_folder", slide_dir_abs, + "--output_dir", output_dir_abs, + "--mpp_model", "1.5", + "--create_geojson", "Y"], cwd=scripts_abs) +``` + +`--mpp_model` sets `MPP_MODEL` in `main.py` and does two things: + +1. **Selects the model weights** — only three values are valid: + + | `--mpp_model` | weights file | + |---|---| + | `1.0` | `GrandQC_MPP1.pth` | + | `1.5` | `GrandQC_MPP15.pth` | + | `2.0` | `GrandQC_MPP2.pth` | + + Any other value raises `Exception("mpp of the model can only be 1.0, 1.5, 2.0")`. + +2. **Sets the magnification slides are read at** (see §3). + +> **To change the inference resolution** you must set `--mpp_model` to `1.0` or +> `2.0` **and** download the matching weights. The notebook's model-download cell +> (Part 2.2) only fetches `GrandQC_MPP15.pth`, so switching MPP requires adding the +> corresponding Zenodo download. + +--- + +## 2. Model input tile size — fixed at 512×512 + +Hardcoded in `main.py`; not a CLI flag: + +```python +M_P_S_MODEL = 512 # the network always receives 512×512 px tiles +``` + +Every tile is downsampled to 512×512 before inference (§3). Changing this would +require retraining and should not be altered. + +--- + +## 3. How the two combine + +`wsi_slide_info.py:slide_info()` converts the model MPP into a patch size measured +in the slide's **native level-0 pixels**: + +```python +mpp = float(slide.properties["openslide.mpp-x"]) # slide's native MPP +p_s = int(mpp_model / mpp * m_p_s) # patch size in level-0 px +``` + +Then `wsi_process.py:slide_process_single()` reads that region at full resolution +and downsamples it to the fixed model input size: + +```python +work_patch = slide.read_region((w, h), 0, (p_s, p_s)) # read p_s × p_s from level 0 +work_patch = work_patch.resize((m_p_s, m_p_s), Image.Resampling.LANCZOS) # → 512×512 +predictions = model.predict(x_tensor) # inference at 512×512 (= MPP 1.5) +``` + +**Worked example** — a 40× TCGA-BRCA slide with native MPP ≈ `0.2485`, `mpp_model = 1.5`: + +``` +p_s = int(1.5 / 0.2485 * 512) = int(3091.0) = 3091 px # read from level 0 + → downsampled to 512×512 → fed to the network +``` + +The downsampling makes the effective resolution seen by the network MPP 1.5, +regardless of whether the slide was scanned at 20× or 40×. + +--- + +## 4. How MPP is determined + +MPP is **read from OpenSlide**, not computed. The only place it is obtained is the +single line in `wsi_slide_info.py:slide_info()`: + +```python +mpp = float(slide.properties["openslide.mpp-x"]) +``` + +Key points: + +- It is **read, not derived** from objective power. Only the X axis is used + (square pixels are assumed; `mpp-y` is ignored). +- **No fallback for MPP.** Note the asymmetry in the same function — objective + power has a fallback, MPP does not: + + ```python + try: + obj_power = slide.properties["openslide.objective-power"] + except: + obj_power = 99 # objective power: falls back to 99 + + mpp = float(slide.properties["openslide.mpp-x"]) # MPP: no fallback + ``` + + If `openslide.mpp-x` is missing/`None`, `float(...)` raises. With the + failure-guard patch added in the notebook, this is caught and reported per + slide rather than silently mis-scaling. +- `openslide.mpp-x` is the **single point of failure** for scaling any new slide: + if it is wrong or absent, `p_s` is sized wrong and the model sees the slide at + the wrong scale. + +### For IDC DICOM slides specifically + +OpenSlide's DICOM driver (added in OpenSlide 4.0.0) populates `openslide.mpp-x` +from the DICOM `PixelSpacing` attribute (mm → µm). + +**Verified empirically** on IDC slide `TCGA-A7-A26J-01B-02-BS2` +(the 9.1 MB series used in the notebook), downloaded with `idc-index` and opened +with OpenSlide 4.0.1: + +``` +openslide.vendor = dicom +openslide.mpp-x = 0.24850000000000003 +openslide.mpp-y = 0.24850000000000003 +openslide.objective-power = 40 +``` + +The raw DICOM source attribute is exposed on the same slide: + +``` +dicom.SharedFunctionalGroupsSequence[0].PixelMeasuresSequence[0].PixelSpacing[0] = .0002485 # mm +``` + +and `0.0002485 mm × 1000 = 0.2485 µm/px`, which matches `openslide.mpp-x` exactly. +This also matches the `pixel_spacing_mm = 0.00025` shown in the notebook's +metadata query (the 2-significant-figure rounded form). + +The notebook's Part 3.3 verification cell already surfaces this value with +`slide.properties.get("openslide.mpp-x", "N/A")`, so you can inspect MPP per slide +before running Step 2. + +--- + +## Quick reference: changing resolution safely + +1. Pick a supported MPP: `1.0`, `1.5`, or `2.0`. +2. Download the matching weights from Zenodo into `models/qc/` + (`GrandQC_MPP1.pth` / `GrandQC_MPP15.pth` / `GrandQC_MPP2.pth`). +3. Pass it via `--mpp_model` in the Step 2 `subprocess.run` call. +4. Leave `M_P_S_MODEL = 512` unchanged. +5. Confirm each slide reports a valid `openslide.mpp-x` (Part 3.3) before running — + it has no fallback. diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb index d505c19..2f7be61 100644 --- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -53,7 +53,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -63,9 +63,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "IDC data version: v24\n" + ] + } + ], "source": [ "import os\n", "import pandas as pd\n", @@ -90,9 +98,280 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total SM collections: 73\n", + "Total SM series: 76,299\n", + "Total size: 47871 GB\n", + "\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
collection_idpatientsseriessize_GBlicense_short_name
0ccdi_mci440745764548.0CC BY 4.0
1tcga_brca109831111678.8CC BY 3.0
2gtex971255038354.2CC BY 4.0
3pdxnet919919122.7CC BY 4.0
4tcga_gbm6072053639.9CC BY 3.0
5tcga_ov5901481476.5CC BY 3.0
6tcga_ucec56013711078.5CC BY 3.0
7tcga_kirc5372173811.7CC BY 3.0
8tcga_hnsc5231263560.5CC BY 3.0
9tcga_luad5221608631.1CC BY 3.0
10tcga_lgg51615721166.6CC BY 3.0
11tcga_thca5071158747.1CC BY 3.0
12tcga_lusc5041612610.4CC BY 3.0
13tcga_prad5001172596.4CC BY 3.0
14tcga_skcm470950721.5CC BY 3.0
\n", + "
\n", + "
\n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + " \n", + "
\n", + "
\n", + " " + ], + "text/plain": [ + " collection_id patients series size_GB license_short_name\n", + "0 ccdi_mci 4407 4576 4548.0 CC BY 4.0\n", + "1 tcga_brca 1098 3111 1678.8 CC BY 3.0\n", + "2 gtex 971 25503 8354.2 CC BY 4.0\n", + "3 pdxnet 919 919 122.7 CC BY 4.0\n", + "4 tcga_gbm 607 2053 639.9 CC BY 3.0\n", + "5 tcga_ov 590 1481 476.5 CC BY 3.0\n", + "6 tcga_ucec 560 1371 1078.5 CC BY 3.0\n", + "7 tcga_kirc 537 2173 811.7 CC BY 3.0\n", + "8 tcga_hnsc 523 1263 560.5 CC BY 3.0\n", + "9 tcga_luad 522 1608 631.1 CC BY 3.0\n", + "10 tcga_lgg 516 1572 1166.6 CC BY 3.0\n", + "11 tcga_thca 507 1158 747.1 CC BY 3.0\n", + "12 tcga_lusc 504 1612 610.4 CC BY 3.0\n", + "13 tcga_prad 500 1172 596.4 CC BY 3.0\n", + "14 tcga_skcm 470 950 721.5 CC BY 3.0" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Load the slide microscopy index\n", "idc_client.fetch_index(\"sm_index\")\n", @@ -129,9 +408,262 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collections with H&E slides: 70\n", + "Total H&E series: 73,422\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
collection_idpatientshe_seriessize_GB
0gtex971255038354.2
1ccdi_mci440145694545.8
2tcga_brca109831111678.8
3tcga_kirc5372173811.7
4tcga_gbm6072053639.9
5tcga_lusc5041612610.4
6tcga_luad5221608631.1
7tcga_lgg51615721166.6
8tcga_ov5901481476.5
9tcga_coad4601442446.5
10tcga_ucec56013711078.5
11tcga_hnsc5231263560.5
12nlst4521259847.9
13tcga_stad4431197670.4
14tcga_prad5001172596.4
\n", + "
\n", + "
\n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + " \n", + "
\n", + "
\n", + " " + ], + "text/plain": [ + " collection_id patients he_series size_GB\n", + "0 gtex 971 25503 8354.2\n", + "1 ccdi_mci 4401 4569 4545.8\n", + "2 tcga_brca 1098 3111 1678.8\n", + "3 tcga_kirc 537 2173 811.7\n", + "4 tcga_gbm 607 2053 639.9\n", + "5 tcga_lusc 504 1612 610.4\n", + "6 tcga_luad 522 1608 631.1\n", + "7 tcga_lgg 516 1572 1166.6\n", + "8 tcga_ov 590 1481 476.5\n", + "9 tcga_coad 460 1442 446.5\n", + "10 tcga_ucec 560 1371 1078.5\n", + "11 tcga_hnsc 523 1263 560.5\n", + "12 nlst 452 1259 847.9\n", + "13 tcga_stad 443 1197 670.4\n", + "14 tcga_prad 500 1172 596.4" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Count H&E slides per collection\n", "he_by_collection = idc_client.sql_query(\"\"\"\n", @@ -169,9 +701,303 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TCGA-BRCA H&E slides: 3111\n", + "Tissue types: {'Neoplasm, Primary': 2704, 'Normal': 399}\n", + "\n", + "Size range: 9.1 – 3586.3 MB\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SeriesInstanceUIDPatientIDContainerIdentifiertissue_typewidth_pxheight_pxpixel_spacing_mmobjective_powersize_MB
01.3.6.1.4.1.5962.99.1.1247079754.248369703.163...TCGA-A7-A26JTCGA-A7-A26J-01B-02-BS2Neoplasm, Primary861872430.00025409.1
11.3.6.1.4.1.5962.99.1.1279025007.575797294.163...TCGA-AC-A2QJTCGA-AC-A2QJ-11A-02-TS2Normal3399369230.000502015.5
21.3.6.1.4.1.5962.99.1.1306193840.1223239355.16...TCGA-AC-A2FBTCGA-AC-A2FB-11A-01-TSANormal3199376900.000502016.5
31.3.6.1.4.1.5962.99.1.1263739509.328647659.163...TCGA-BH-A5IZTCGA-BH-A5IZ-11A-01-TS1Normal20041150480.000502018.7
41.3.6.1.4.1.5962.99.1.1290706954.176399971.163...TCGA-AC-A2QJTCGA-AC-A2QJ-11A-01-TS1Normal4199193160.000502022.8
51.3.6.1.4.1.5962.99.1.1315471301.865934712.163...TCGA-E9-A1RFTCGA-E9-A1RF-11A-03-TSCNormal2788897840.000502023.0
61.3.6.1.4.1.5962.99.1.1322835661.1509329658.16...TCGA-GM-A2D9TCGA-GM-A2D9-11A-02-TS2Normal3599289450.000502025.0
71.3.6.1.4.1.5962.99.1.1255105493.445850269.163...TCGA-E9-A1RFTCGA-E9-A1RF-11A-01-TSANormal23904149290.000502026.9
81.3.6.1.4.1.5962.99.1.1247939226.656219549.163...TCGA-OL-A5RWTCGA-OL-A5RW-01Z-00-DX1Neoplasm, Primary12879136180.00050<NA>27.1
91.3.6.1.4.1.5962.99.1.1244105102.1853648157.16...TCGA-E9-A1NFTCGA-E9-A1NF-11A-05-TSENormal13943173300.000492027.3
\n", + "
\n", + "
\n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + " \n", + "
\n", + "
\n", + " " + ], + "text/plain": [ + " SeriesInstanceUID PatientID \\\n", + "0 1.3.6.1.4.1.5962.99.1.1247079754.248369703.163... TCGA-A7-A26J \n", + "1 1.3.6.1.4.1.5962.99.1.1279025007.575797294.163... TCGA-AC-A2QJ \n", + "2 1.3.6.1.4.1.5962.99.1.1306193840.1223239355.16... TCGA-AC-A2FB \n", + "3 1.3.6.1.4.1.5962.99.1.1263739509.328647659.163... TCGA-BH-A5IZ \n", + "4 1.3.6.1.4.1.5962.99.1.1290706954.176399971.163... TCGA-AC-A2QJ \n", + "5 1.3.6.1.4.1.5962.99.1.1315471301.865934712.163... TCGA-E9-A1RF \n", + "6 1.3.6.1.4.1.5962.99.1.1322835661.1509329658.16... TCGA-GM-A2D9 \n", + "7 1.3.6.1.4.1.5962.99.1.1255105493.445850269.163... TCGA-E9-A1RF \n", + "8 1.3.6.1.4.1.5962.99.1.1247939226.656219549.163... TCGA-OL-A5RW \n", + "9 1.3.6.1.4.1.5962.99.1.1244105102.1853648157.16... TCGA-E9-A1NF \n", + "\n", + " ContainerIdentifier tissue_type width_px height_px \\\n", + "0 TCGA-A7-A26J-01B-02-BS2 Neoplasm, Primary 8618 7243 \n", + "1 TCGA-AC-A2QJ-11A-02-TS2 Normal 33993 6923 \n", + "2 TCGA-AC-A2FB-11A-01-TSA Normal 31993 7690 \n", + "3 TCGA-BH-A5IZ-11A-01-TS1 Normal 20041 15048 \n", + "4 TCGA-AC-A2QJ-11A-01-TS1 Normal 41991 9316 \n", + "5 TCGA-E9-A1RF-11A-03-TSC Normal 27888 9784 \n", + "6 TCGA-GM-A2D9-11A-02-TS2 Normal 35992 8945 \n", + "7 TCGA-E9-A1RF-11A-01-TSA Normal 23904 14929 \n", + "8 TCGA-OL-A5RW-01Z-00-DX1 Neoplasm, Primary 12879 13618 \n", + "9 TCGA-E9-A1NF-11A-05-TSE Normal 13943 17330 \n", + "\n", + " pixel_spacing_mm objective_power size_MB \n", + "0 0.00025 40 9.1 \n", + "1 0.00050 20 15.5 \n", + "2 0.00050 20 16.5 \n", + "3 0.00050 20 18.7 \n", + "4 0.00050 20 22.8 \n", + "5 0.00050 20 23.0 \n", + "6 0.00050 20 25.0 \n", + "7 0.00050 20 26.9 \n", + "8 0.00050 27.1 \n", + "9 0.00049 20 27.3 " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Query TCGA-BRCA H&E slides with key metadata\n", "tcga_brca_he = idc_client.sql_query(\"\"\"\n", @@ -214,9 +1040,181 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Selected 3 slides for GrandQC processing:\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
PatientIDContainerIdentifiertissue_typeobjective_powerwidth_pxheight_pxsize_MB
0TCGA-A7-A26JTCGA-A7-A26J-01B-02-BS2Neoplasm, Primary40861872439.1
1TCGA-AC-A2QJTCGA-AC-A2QJ-11A-02-TS2Normal2033993692315.5
2TCGA-E2-A15KTCGA-E2-A15K-06A-01-TS1None40333191996285.7
\n", + "
\n", + "
\n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + " \n", + "
\n", + " \n", + "
\n", + "
\n", + " " + ], + "text/plain": [ + " PatientID ContainerIdentifier tissue_type objective_power \\\n", + "0 TCGA-A7-A26J TCGA-A7-A26J-01B-02-BS2 Neoplasm, Primary 40 \n", + "1 TCGA-AC-A2QJ TCGA-AC-A2QJ-11A-02-TS2 Normal 20 \n", + "2 TCGA-E2-A15K TCGA-E2-A15K-06A-01-TS1 None 40 \n", + "\n", + " width_px height_px size_MB \n", + "0 8618 7243 9.1 \n", + "1 33993 6923 15.5 \n", + "2 33319 19962 85.7 " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "# Pick up to 5 small slides with variety in tissue type and magnification\n", "demo_slides = (\n", @@ -241,9 +1239,22 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TCGA-A7-A26J-01B-02-BS2 (Neoplasm, Primary, 9.1 MB)\n", + " https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.313270449313106073714723178854675013710/series/1.3.6.1.4.1.5962.99.1.1247079754.248369703.1637629652298.2.0\n", + "TCGA-AC-A2QJ-11A-02-TS2 (Normal, 15.5 MB)\n", + " https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.337996839190358918582393097516172189888/series/1.3.6.1.4.1.5962.99.1.1279025007.575797294.1637661597551.2.0\n", + "TCGA-E2-A15K-06A-01-TS1 (None, 85.7 MB)\n", + " https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.54614374037792363265218330907954861388/series/1.3.6.1.4.1.5962.99.1.1231257225.1365336880.1637613829769.2.0\n" + ] + } + ], "source": [ "# Preview slides in the IDC SLIM viewer (pathology viewer)\n", "for _, row in demo_slides.iterrows():\n", @@ -276,7 +1287,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": {}, "outputs": [], "source": [ @@ -305,9 +1316,21 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "grandqc/01_WSI_infe 100%[===================>] 25.41M 1.06MB/s in 26s \n", + "grandqc/01_WSI_infe 100%[===================>] 24.21M 1.08MB/s in 25s \n", + "Models downloaded:\n", + " grandqc/01_WSI_inference_OPENSLIDE_QC/models/td/Tissue_Detection_MPP10.pth (25.4 MB)\n", + " grandqc/01_WSI_inference_OPENSLIDE_QC/models/qc/GrandQC_MPP15.pth (24.2 MB)\n" + ] + } + ], "source": [ "import os\n", "\n", @@ -335,109 +1358,21 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "### 2.3 Patch Scripts for DICOM Directory Support\n", - "\n", - "IDC downloads each series as a directory of tiled DICOM files (`*.dcm`). OpenSlide 4.0+ can open any one of those files and reconstruct the full pyramid from the same directory.\n", - "\n", - "The two changes per script are:\n", - "\n", - "| Script | Change |\n", - "|--------|--------|\n", - "| `wsi_tis_detect.py` | Replace file-only `os.listdir` filter with one that also yields subdirectories containing `.dcm` files; resolve the path before calling `OpenSlide()` |\n", - "| `main.py` | Same two changes using `open_slide()` |\n", - "\n", - "Both scripts also get automatic CUDA/MPS/CPU device selection instead of the hardcoded `'cuda'`." - ] + "source": "### 2.3 Patch Scripts for DICOM Directory Support\n\nIDC downloads each series as a directory of tiled DICOM files (`*.dcm`). OpenSlide 4.0+ can open any one of those files and reconstruct the full pyramid from the same directory.\n\nThe changes per script are:\n\n| Script | Change |\n|--------|--------|\n| `wsi_tis_detect.py` | Replace file-only `os.listdir` filter with one that also yields subdirectories containing `.dcm` files; resolve the path before calling `OpenSlide()` |\n| `main.py` | Same two changes using `open_slide()`; plus `torch.load(..., weights_only=False)` so the pickled QC model loads under PyTorch ≥ 2.6 |\n\nBoth scripts also get automatic CUDA/MPS/CPU device selection instead of the hardcoded `'cuda'`.\n\n> **Note**: GrandQC's scripts iterate over files in a flat folder; IDC downloads each DICOM series into its own subdirectory, so the patch teaches them to recognise a directory whose contents end in `.dcm` as a valid slide entry and resolves the OpenSlide path accordingly. The `weights_only=False` change is required because PyTorch 2.6 made `weights_only=True` the default for `torch.load`, which refuses to unpickle the `segmentation_models_pytorch` model object that `main.py` loads. No model weights or algorithm logic are changed." }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "from pathlib import Path\n", - "\n", - "def patch_grandqc_for_dicom(script_path: str) -> None:\n", - " text = Path(script_path).read_text()\n", - " original = text\n", - "\n", - " # ── 1. Auto-detect compute device ──────────────────────────────────────\n", - " text = text.replace(\n", - " \"DEVICE = 'cuda'\",\n", - " \"import torch as _t; DEVICE = ('cuda' if _t.cuda.is_available() \"\n", - " \"else ('mps' if _t.backends.mps.is_available() else 'cpu'))\",\n", - " )\n", - "\n", - " # ── 2. Replace slide-list builder to also accept DICOM directories ──────\n", - " DICOM_ITER = (\n", - " \"def _iter_slides(d):\\n\"\n", - " \" for n in sorted(os.listdir(d)):\\n\"\n", - " \" p = os.path.join(d, n)\\n\"\n", - " \" if os.path.isfile(p) or (\\n\"\n", - " \" os.path.isdir(p) and\\n\"\n", - " \" any(x.lower().endswith('.dcm') for x in os.listdir(p))):\\n\"\n", - " \" yield n\\n\"\n", - " )\n", - "\n", - " # wsi_tis_detect.py pattern\n", - " OLD_LIST_TIS = (\n", - " \"slide_names = sorted([f for f in os.listdir(SLIDE_DIR) \"\n", - " \"if os.path.isfile(os.path.join(SLIDE_DIR, f))])\"\n", - " )\n", - " NEW_LIST_TIS = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n", - "\n", - " # main.py pattern\n", - " OLD_LIST_MAIN = \"slide_names = sorted(os.listdir(SLIDE_DIR))\"\n", - " NEW_LIST_MAIN = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n", - "\n", - " text = text.replace(OLD_LIST_TIS, NEW_LIST_TIS)\n", - " text = text.replace(OLD_LIST_MAIN, NEW_LIST_MAIN)\n", - "\n", - " # ── 3. Resolve DICOM directory to first .dcm file before open ───────────\n", - " RESOLVE = (\n", - " \" if os.path.isdir(path_slide):\\n\"\n", - " \" _dcms = sorted(x for x in os.listdir(path_slide)\"\n", - " \" if x.lower().endswith('.dcm'))\\n\"\n", - " \" path_slide = os.path.join(path_slide, _dcms[0])\\n\"\n", - " )\n", - "\n", - " for open_call in (\"slide = OpenSlide(path_slide)\", \"slide = open_slide(path_slide)\"):\n", - " old = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n {open_call}\"\n", - " new = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n{RESOLVE} {open_call}\"\n", - " text = text.replace(old, new)\n", - "\n", - " if text == original:\n", - " print(f\" WARNING: no changes applied to {script_path}\")\n", - " else:\n", - " Path(script_path).write_text(text)\n", - " n = sum(1 for a, b in zip(original.splitlines(), text.splitlines()) if a != b)\n", - " n += abs(len(text.splitlines()) - len(original.splitlines()))\n", - " print(f\" Patched {script_path} (+{n} lines changed)\")\n", - "\n", - "\n", - "for script in [\n", - " f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\",\n", - " f\"{GRANDQC_SCRIPTS}/main.py\",\n", - "]:\n", - " patch_grandqc_for_dicom(script)\n", - "print(\"Done.\")" - ] + "source": "from pathlib import Path\n\ndef patch_grandqc_for_dicom(script_path: str) -> None:\n text = Path(script_path).read_text()\n original = text\n\n # ── 1. Auto-detect compute device ──────────────────────────────────────\n text = text.replace(\n \"DEVICE = 'cuda'\",\n \"import torch as _t; DEVICE = ('cuda' if _t.cuda.is_available() \"\n \"else ('mps' if _t.backends.mps.is_available() else 'cpu'))\",\n )\n\n # ── 2. Replace slide-list builder to also accept DICOM directories ──────\n DICOM_ITER = (\n \"def _iter_slides(d):\\n\"\n \" for n in sorted(os.listdir(d)):\\n\"\n \" p = os.path.join(d, n)\\n\"\n \" if os.path.isfile(p) or (\\n\"\n \" os.path.isdir(p) and\\n\"\n \" any(x.lower().endswith('.dcm') for x in os.listdir(p))):\\n\"\n \" yield n\\n\"\n )\n\n # wsi_tis_detect.py pattern\n OLD_LIST_TIS = (\n \"slide_names = sorted([f for f in os.listdir(SLIDE_DIR) \"\n \"if os.path.isfile(os.path.join(SLIDE_DIR, f))])\"\n )\n NEW_LIST_TIS = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n\n # main.py pattern\n OLD_LIST_MAIN = \"slide_names = sorted(os.listdir(SLIDE_DIR))\"\n NEW_LIST_MAIN = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n\n text = text.replace(OLD_LIST_TIS, NEW_LIST_TIS)\n text = text.replace(OLD_LIST_MAIN, NEW_LIST_MAIN)\n\n # ── 3. Resolve DICOM directory to first .dcm file before open ───────────\n RESOLVE = (\n \" if os.path.isdir(path_slide):\\n\"\n \" _dcms = sorted(x for x in os.listdir(path_slide)\"\n \" if x.lower().endswith('.dcm'))\\n\"\n \" path_slide = os.path.join(path_slide, _dcms[0])\\n\"\n )\n\n for open_call in (\"slide = OpenSlide(path_slide)\", \"slide = open_slide(path_slide)\"):\n old = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n {open_call}\"\n new = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n{RESOLVE} {open_call}\"\n text = text.replace(old, new)\n\n # ── 4. Load the full QC model with weights_only=False ───────────────────\n # main.py loads a pickled nn.Module via torch.load(). Since PyTorch 2.6 the\n # default weights_only=True refuses to unpickle the segmentation_models_pytorch\n # classes inside, raising UnpicklingError before the per-slide try/except and\n # killing the process with exit code 1. The Zenodo weights are trusted, so we\n # opt back into the full unpickling path. (wsi_tis_detect.py is unaffected —\n # it loads a plain state_dict, which weights_only=True allows.)\n text = text.replace(\n \"model_prim = torch.load(MODEL_QC_DIR + MODEL_QC_NAME, map_location=DEVICE)\",\n \"model_prim = torch.load(MODEL_QC_DIR + MODEL_QC_NAME, map_location=DEVICE, \"\n \"weights_only=False)\",\n )\n\n # ── 5. Progress reporting in main.py's slide loop ───────────────────────\n # Track failures and report [i/N] progress. (main.py-specific anchors — these\n # strings do not occur in wsi_tis_detect.py, so this is a no-op there.)\n text = text.replace(\n \"for slide_name in slide_names[start:end]:\\n\"\n \" try:\\n\"\n \" # Register start time\\n\"\n \" start = timeit.default_timer()\\n\"\n \"\\n\"\n \" print(\\\"\\\")\\n\"\n \" print(\\\"Processing:\\\", slide_name)\",\n \"_slides_to_run = slide_names[start:end]\\n\"\n \"_failures = []\\n\"\n \"print(f\\\"Found {len(_slides_to_run)} slide(s) to process.\\\")\\n\"\n \"for _idx, slide_name in enumerate(_slides_to_run, 1):\\n\"\n \" try:\\n\"\n \" # Register start time\\n\"\n \" start = timeit.default_timer()\\n\"\n \"\\n\"\n \" print(\\\"\\\")\\n\"\n \" print(f\\\"[{_idx}/{len(_slides_to_run)}] Processing:\\\", slide_name)\",\n )\n\n # ── 6. Record per-slide failures and exit non-zero if any occurred ──────\n # The upstream loop swallows every per-slide error and the script exits 0,\n # so the notebook can't tell a slide failed. Collect failures and sys.exit(1)\n # after the loop so the caller surfaces them.\n text = text.replace(\n \" except Exception as e:\\n\"\n \" print(f\\\"There was some problem with the slide. The error is: {e}\\\")\",\n \" except Exception as e:\\n\"\n \" print(f\\\"There was some problem with the slide. The error is: {e}\\\")\\n\"\n \" _failures.append((slide_name, str(e)))\\n\"\n \"\\n\"\n \"# Surface per-slide failures with a non-zero exit so the caller sees them\\n\"\n \"if _failures:\\n\"\n \" print(f\\\"\\\\n{len(_failures)} of {len(_slides_to_run)} slide(s) FAILED:\\\")\\n\"\n \" for _name, _err in _failures:\\n\"\n \" print(f\\\" - {_name}: {_err}\\\")\\n\"\n \" import sys as _sys; _sys.exit(1)\\n\"\n \"print(f\\\"\\\\nAll {len(_slides_to_run)} slide(s) processed successfully.\\\")\",\n )\n\n if text == original:\n print(f\" WARNING: no changes applied to {script_path}\")\n else:\n Path(script_path).write_text(text)\n n = sum(1 for a, b in zip(original.splitlines(), text.splitlines()) if a != b)\n n += abs(len(text.splitlines()) - len(original.splitlines()))\n print(f\" Patched {script_path} (+{n} lines changed)\")\n\n\nfor script in [\n f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\",\n f\"{GRANDQC_SCRIPTS}/main.py\",\n]:\n patch_grandqc_for_dicom(script)\nprint(\"Done.\")" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "# Sanity-check: confirm the DICOM resolver was inserted in both scripts\n", - "for script in [f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\", f\"{GRANDQC_SCRIPTS}/main.py\"]:\n", - " text = Path(script).read_text()\n", - " has_dcm = \"_iter_slides\" in text\n", - " has_res = \"_dcms\" in text\n", - " has_dev = \"cuda.is_available\" in text\n", - " print(f\"{script.split('/')[-1]:30s} dicom_iter={has_dcm} dcm_resolve={has_res} auto_device={has_dev}\")" - ] + "source": "# Sanity-check: confirm the patches were inserted in both scripts\nfor script in [f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\", f\"{GRANDQC_SCRIPTS}/main.py\"]:\n text = Path(script).read_text()\n has_dcm = \"_iter_slides\" in text\n has_res = \"_dcms\" in text\n has_dev = \"cuda.is_available\" in text\n # weights_only=False only applies to main.py's full-model load\n has_wo = (\"weights_only=False\" in text) if script.endswith(\"main.py\") else \"n/a\"\n print(f\"{script.split('/')[-1]:30s} dicom_iter={has_dcm} dcm_resolve={has_res} \"\n f\"auto_device={has_dev} weights_only_fix={has_wo}\")" }, { "cell_type": "markdown", @@ -472,6 +1407,7 @@ "idc_client.download_from_selection(\n", " downloadDir=SLIDE_DIR,\n", " seriesInstanceUID=demo_slides[\"SeriesInstanceUID\"].tolist(),\n", + " dirTemplate=\"%SeriesInstanceUID\",\n", ")\n", "print(\"\\nDownload complete.\")" ] @@ -559,20 +1495,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## Part 4: Run GrandQC Quality Control\n", - "\n", - "GrandQC runs as two sequential scripts, both inside `grandqc/01_WSI_inference_OPENSLIDE_QC/`:\n", - "\n", - "| Step | Script | MPP | Time (GPU) | Output |\n", - "|------|--------|-----|------------|--------|\n", - "| 1 | `wsi_tis_detect.py` | 10 | ~0.4 s/slide | Binary tissue masks |\n", - "| 2 | `main.py` | 1.5 | ~30–45 s/slide | 7-class artifact maps, GeoJSON, TSV report |\n", - "\n", - "The 7 artifact classes are: **1** clean tissue · **2** folds · **3** dark spots · **4** pen marks · **5** air bubbles/edges · **6** out-of-focus · **7** background.\n", - "\n", - "Both scripts load models via relative paths (`./models/…`) and must therefore be invoked with their directory as the working directory. We use `subprocess` with `cwd=GRANDQC_SCRIPTS` to achieve this while keeping output streaming to the cell." - ] + "source": "## Part 4: Run GrandQC Quality Control\n\nGrandQC runs as two sequential scripts, both inside `grandqc/01_WSI_inference_OPENSLIDE_QC/`:\n\n| Step | Script | MPP | Time (GPU) | Output |\n|------|--------|-----|------------|--------|\n| 1 | `wsi_tis_detect.py` | 10 | ~0.4 s/slide | Binary tissue masks |\n| 2 | `main.py` | 1.5 | ~30–45 s/slide | 7-class artifact maps, GeoJSON, TSV report |\n\nThe 7 artifact classes are: **1** clean tissue · **2** folds · **3** dark spots · **4** pen marks · **5** air bubbles/edges · **6** out-of-focus · **7** background.\n\nBoth scripts load models via relative paths (`./models/…`) and must therefore be invoked with their directory as the working directory. We use `subprocess` with `cwd=GRANDQC_SCRIPTS` to achieve this while keeping output streaming to the cell.\n\n> **On inference resolution**: the artifact model runs at MPP 1.5 (≈7× objective), *not* at the slide's native 20×/40×, and the per-slide MPP is read from OpenSlide's `openslide.mpp-x` property (derived from the DICOM `PixelSpacing` for IDC slides). For how this is controlled, the patch-size math, and an empirical confirmation on an IDC slide, see [`grandqc_slide_quality_resolution_notes.md`](grandqc_slide_quality_resolution_notes.md)." }, { "cell_type": "markdown", @@ -1123,13 +2046,23 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", + "language": "python", "name": "python3" }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 0 -} +} \ No newline at end of file From 475a24098849d7952f1701b76979e2c950b5e308 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Mon, 15 Jun 2026 16:05:40 -0400 Subject: [PATCH 3/9] Use patched GrandQC fork instead of runtime monkey-patching Clone the fedorov/grandqc idc-dicom-fixes branch, which carries the DICOM-directory, device-detection, PyTorch >=2.6 load, small-slide overlay, and fail-loud fixes as proper commits. Replace the in-notebook patch_grandqc_for_dicom() cell with a verification cell that asserts the fixes are present in the cloned scripts, and update Part 2 prose plus the resolution notes accordingly. Co-Authored-By: Claude Opus 4.8 --- .../grandqc_slide_quality_resolution_notes.md | 4 +- .../grandqc_slide_quality_with_idc.ipynb | 42 +++---------------- 2 files changed, 7 insertions(+), 39 deletions(-) diff --git a/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md index 0206cf2..51f585c 100644 --- a/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md +++ b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md @@ -129,8 +129,8 @@ Key points: ``` If `openslide.mpp-x` is missing/`None`, `float(...)` raises. With the - failure-guard patch added in the notebook, this is caught and reported per - slide rather than silently mis-scaling. + fail-loud error handling in the `idc-dicom-fixes` fork branch, this is caught + and reported per slide (with a non-zero exit) rather than silently mis-scaling. - `openslide.mpp-x` is the **single point of failure** for scaling any new slide: if it is wrong or absent, `p_s` is sized wrong and the model sees the slide at the wrong scale. diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb index 2f7be61..a68cddf 100644 --- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -1266,17 +1266,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## Part 2: GrandQC Setup\n", - "\n", - "Before running GrandQC we need to:\n", - "1. Clone the repository from GitHub\n", - "2. Install its Python dependencies\n", - "3. Download the pre-trained models from Zenodo\n", - "4. Apply a small patch so both scripts accept IDC's DICOM series directories alongside conventional WSI files\n", - "\n", - "> **Note**: GrandQC's scripts currently iterate over files in a flat folder. IDC downloads each DICOM series into its own subdirectory. The patch below (≤ 10 lines changed across 2 files) teaches GrandQC to recognise a directory whose contents end in `.dcm` as a valid slide entry and resolves the OpenSlide path accordingly. No model weights or algorithm logic are changed." - ] + "source": "## Part 2: GrandQC Setup\n\nBefore running GrandQC we need to:\n1. Clone the repository from GitHub — we use the [`fedorov/grandqc`](https://github.com/fedorov/grandqc/tree/idc-dicom-fixes) fork, whose `idc-dicom-fixes` branch lets the inference scripts run directly on IDC DICOM series\n2. Install its Python dependencies\n3. Download the pre-trained models from Zenodo\n\n> **Note**: The `idc-dicom-fixes` branch carries a handful of small, self-contained fixes on top of upstream `cpath-ukk/grandqc` — DICOM-directory input, compute-device auto-detection, a PyTorch ≥ 2.6 model-load fix, a small-slide overlay fix, and fail-loud error handling. None of them change model weights or algorithm logic (see Part 2.3 for the full list). Working from a fork keeps the notebook clean instead of monkey-patching the scripts at runtime." }, { "cell_type": "markdown", @@ -1287,25 +1277,10 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "%%capture\n", - "# Clone GrandQC (shallow clone — we only need the latest commit)\n", - "!git clone --depth 1 https://github.com/cpath-ukk/grandqc.git\n", - "\n", - "# Install GrandQC Python dependencies.\n", - "# PyTorch is already present in Colab; we install the remaining packages.\n", - "# segmentation-models-pytorch 0.3.1 is pinned because its model.predict() API\n", - "# was removed in 0.4.0 and GrandQC relies on it.\n", - "!pip install -q \\\n", - " \"segmentation-models-pytorch==0.3.1\" \\\n", - " \"rasterio>=1.3\" \\\n", - " \"imagecodecs\" \\\n", - " \"zarr<3\" \\\n", - " \"tifffile\"" - ] + "source": "%%capture\n# Clone the IDC-patched GrandQC fork (branch idc-dicom-fixes, shallow clone).\n# This branch applies small, self-contained fixes on top of upstream\n# cpath-ukk/grandqc so the inference scripts run directly on IDC DICOM series.\n# See the verification cell in Part 2.3 for the list of fixes.\n!git clone --depth 1 --branch idc-dicom-fixes https://github.com/fedorov/grandqc.git\n\n# Install GrandQC Python dependencies.\n# PyTorch is already present in Colab; we install the remaining packages.\n# segmentation-models-pytorch 0.3.1 is pinned because its model.predict() API\n# was removed in 0.4.0 and GrandQC relies on it.\n!pip install -q \\\n \"segmentation-models-pytorch==0.3.1\" \\\n \"rasterio>=1.3\" \\\n \"imagecodecs\" \\\n \"zarr<3\" \\\n \"tifffile\"" }, { "cell_type": "markdown", @@ -1358,21 +1333,14 @@ { "cell_type": "markdown", "metadata": {}, - "source": "### 2.3 Patch Scripts for DICOM Directory Support\n\nIDC downloads each series as a directory of tiled DICOM files (`*.dcm`). OpenSlide 4.0+ can open any one of those files and reconstruct the full pyramid from the same directory.\n\nThe changes per script are:\n\n| Script | Change |\n|--------|--------|\n| `wsi_tis_detect.py` | Replace file-only `os.listdir` filter with one that also yields subdirectories containing `.dcm` files; resolve the path before calling `OpenSlide()` |\n| `main.py` | Same two changes using `open_slide()`; plus `torch.load(..., weights_only=False)` so the pickled QC model loads under PyTorch ≥ 2.6 |\n\nBoth scripts also get automatic CUDA/MPS/CPU device selection instead of the hardcoded `'cuda'`.\n\n> **Note**: GrandQC's scripts iterate over files in a flat folder; IDC downloads each DICOM series into its own subdirectory, so the patch teaches them to recognise a directory whose contents end in `.dcm` as a valid slide entry and resolves the OpenSlide path accordingly. The `weights_only=False` change is required because PyTorch 2.6 made `weights_only=True` the default for `torch.load`, which refuses to unpickle the `segmentation_models_pytorch` model object that `main.py` loads. No model weights or algorithm logic are changed." - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": "from pathlib import Path\n\ndef patch_grandqc_for_dicom(script_path: str) -> None:\n text = Path(script_path).read_text()\n original = text\n\n # ── 1. Auto-detect compute device ──────────────────────────────────────\n text = text.replace(\n \"DEVICE = 'cuda'\",\n \"import torch as _t; DEVICE = ('cuda' if _t.cuda.is_available() \"\n \"else ('mps' if _t.backends.mps.is_available() else 'cpu'))\",\n )\n\n # ── 2. Replace slide-list builder to also accept DICOM directories ──────\n DICOM_ITER = (\n \"def _iter_slides(d):\\n\"\n \" for n in sorted(os.listdir(d)):\\n\"\n \" p = os.path.join(d, n)\\n\"\n \" if os.path.isfile(p) or (\\n\"\n \" os.path.isdir(p) and\\n\"\n \" any(x.lower().endswith('.dcm') for x in os.listdir(p))):\\n\"\n \" yield n\\n\"\n )\n\n # wsi_tis_detect.py pattern\n OLD_LIST_TIS = (\n \"slide_names = sorted([f for f in os.listdir(SLIDE_DIR) \"\n \"if os.path.isfile(os.path.join(SLIDE_DIR, f))])\"\n )\n NEW_LIST_TIS = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n\n # main.py pattern\n OLD_LIST_MAIN = \"slide_names = sorted(os.listdir(SLIDE_DIR))\"\n NEW_LIST_MAIN = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n\n text = text.replace(OLD_LIST_TIS, NEW_LIST_TIS)\n text = text.replace(OLD_LIST_MAIN, NEW_LIST_MAIN)\n\n # ── 3. Resolve DICOM directory to first .dcm file before open ───────────\n RESOLVE = (\n \" if os.path.isdir(path_slide):\\n\"\n \" _dcms = sorted(x for x in os.listdir(path_slide)\"\n \" if x.lower().endswith('.dcm'))\\n\"\n \" path_slide = os.path.join(path_slide, _dcms[0])\\n\"\n )\n\n for open_call in (\"slide = OpenSlide(path_slide)\", \"slide = open_slide(path_slide)\"):\n old = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n {open_call}\"\n new = f\" path_slide = os.path.join(SLIDE_DIR, slide_name)\\n{RESOLVE} {open_call}\"\n text = text.replace(old, new)\n\n # ── 4. Load the full QC model with weights_only=False ───────────────────\n # main.py loads a pickled nn.Module via torch.load(). Since PyTorch 2.6 the\n # default weights_only=True refuses to unpickle the segmentation_models_pytorch\n # classes inside, raising UnpicklingError before the per-slide try/except and\n # killing the process with exit code 1. The Zenodo weights are trusted, so we\n # opt back into the full unpickling path. (wsi_tis_detect.py is unaffected —\n # it loads a plain state_dict, which weights_only=True allows.)\n text = text.replace(\n \"model_prim = torch.load(MODEL_QC_DIR + MODEL_QC_NAME, map_location=DEVICE)\",\n \"model_prim = torch.load(MODEL_QC_DIR + MODEL_QC_NAME, map_location=DEVICE, \"\n \"weights_only=False)\",\n )\n\n # ── 5. Progress reporting in main.py's slide loop ───────────────────────\n # Track failures and report [i/N] progress. (main.py-specific anchors — these\n # strings do not occur in wsi_tis_detect.py, so this is a no-op there.)\n text = text.replace(\n \"for slide_name in slide_names[start:end]:\\n\"\n \" try:\\n\"\n \" # Register start time\\n\"\n \" start = timeit.default_timer()\\n\"\n \"\\n\"\n \" print(\\\"\\\")\\n\"\n \" print(\\\"Processing:\\\", slide_name)\",\n \"_slides_to_run = slide_names[start:end]\\n\"\n \"_failures = []\\n\"\n \"print(f\\\"Found {len(_slides_to_run)} slide(s) to process.\\\")\\n\"\n \"for _idx, slide_name in enumerate(_slides_to_run, 1):\\n\"\n \" try:\\n\"\n \" # Register start time\\n\"\n \" start = timeit.default_timer()\\n\"\n \"\\n\"\n \" print(\\\"\\\")\\n\"\n \" print(f\\\"[{_idx}/{len(_slides_to_run)}] Processing:\\\", slide_name)\",\n )\n\n # ── 6. Record per-slide failures and exit non-zero if any occurred ──────\n # The upstream loop swallows every per-slide error and the script exits 0,\n # so the notebook can't tell a slide failed. Collect failures and sys.exit(1)\n # after the loop so the caller surfaces them.\n text = text.replace(\n \" except Exception as e:\\n\"\n \" print(f\\\"There was some problem with the slide. The error is: {e}\\\")\",\n \" except Exception as e:\\n\"\n \" print(f\\\"There was some problem with the slide. The error is: {e}\\\")\\n\"\n \" _failures.append((slide_name, str(e)))\\n\"\n \"\\n\"\n \"# Surface per-slide failures with a non-zero exit so the caller sees them\\n\"\n \"if _failures:\\n\"\n \" print(f\\\"\\\\n{len(_failures)} of {len(_slides_to_run)} slide(s) FAILED:\\\")\\n\"\n \" for _name, _err in _failures:\\n\"\n \" print(f\\\" - {_name}: {_err}\\\")\\n\"\n \" import sys as _sys; _sys.exit(1)\\n\"\n \"print(f\\\"\\\\nAll {len(_slides_to_run)} slide(s) processed successfully.\\\")\",\n )\n\n if text == original:\n print(f\" WARNING: no changes applied to {script_path}\")\n else:\n Path(script_path).write_text(text)\n n = sum(1 for a, b in zip(original.splitlines(), text.splitlines()) if a != b)\n n += abs(len(text.splitlines()) - len(original.splitlines()))\n print(f\" Patched {script_path} (+{n} lines changed)\")\n\n\nfor script in [\n f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\",\n f\"{GRANDQC_SCRIPTS}/main.py\",\n]:\n patch_grandqc_for_dicom(script)\nprint(\"Done.\")" + "source": "### 2.3 Fixes Included in the Fork\n\nThe upstream GrandQC inference scripts assume conventional single-file WSIs on an NVIDIA GPU. Running them directly on IDC DICOM series needs a few small changes, already applied in the [`idc-dicom-fixes`](https://github.com/fedorov/grandqc/tree/idc-dicom-fixes) branch cloned above. **No model weights or algorithm logic are changed.**\n\n| Fix | Scripts | Why it is needed |\n|-----|---------|------------------|\n| Accept DICOM **series directories** as slides | both | IDC stores each slide as a directory of `.dcm` files; upstream only accepted single files, so the slide list came out empty and nothing was processed |\n| Auto-detect compute device (CUDA / MPS / CPU) | both | upstream hardcodes `'cuda'`, which crashes on CPU/Apple Silicon |\n| `torch.load(..., weights_only=False)` for the QC model | `main.py` | PyTorch ≥ 2.6 changed the default and refuses to unpickle the model object |\n| Crop the tissue-detection mask to the thumbnail size | `wsi_tis_detect.py` | small slides produce a sub-512 px MPP-10 thumbnail, which crashed `cv2.addWeighted` in the overlay step (left `tis_det_overlay/` empty) |\n| Surface per-slide failures with a non-zero exit | both | upstream swallowed errors and exited `0`, so empty output looked like a successful run |\n\nThe cell below confirms these fixes are present in the cloned scripts." }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": "# Sanity-check: confirm the patches were inserted in both scripts\nfor script in [f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\", f\"{GRANDQC_SCRIPTS}/main.py\"]:\n text = Path(script).read_text()\n has_dcm = \"_iter_slides\" in text\n has_res = \"_dcms\" in text\n has_dev = \"cuda.is_available\" in text\n # weights_only=False only applies to main.py's full-model load\n has_wo = (\"weights_only=False\" in text) if script.endswith(\"main.py\") else \"n/a\"\n print(f\"{script.split('/')[-1]:30s} dicom_iter={has_dcm} dcm_resolve={has_res} \"\n f\"auto_device={has_dev} weights_only_fix={has_wo}\")" + "source": "from pathlib import Path\n\n# Confirm the IDC fixes shipped in the cloned fork branch are present.\ntis = Path(f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\").read_text()\nmn = Path(f\"{GRANDQC_SCRIPTS}/main.py\").read_text()\n\nchecks = {\n \"DICOM-directory input (wsi_tis_detect.py)\": \"_iter_slides\" in tis,\n \"DICOM-directory input (main.py)\": \"_iter_slides\" in mn,\n \"auto device select (wsi_tis_detect.py)\": \"cuda.is_available\" in tis,\n \"auto device select (main.py)\": \"cuda.is_available\" in mn,\n \"weights_only=False (main.py)\": \"weights_only=False\" in mn,\n \"small-slide overlay crop (wsi_tis_detect.py)\":\"end_image[-height:, -width:]\" in tis,\n \"fail-loud exit (wsi_tis_detect.py)\": \"sys.exit(1)\" in tis,\n \"fail-loud exit (main.py)\": \"sys.exit(1)\" in mn,\n}\nfor name, ok in checks.items():\n print(f\" [{'OK ' if ok else 'MISSING'}] {name}\")\n\nassert all(checks.values()), (\n \"Expected GrandQC fixes are not all present — check that the clone used \"\n \"branch 'idc-dicom-fixes' of github.com/fedorov/grandqc.\"\n)\nprint(\"\\nAll expected fixes present — scripts are ready to run on IDC DICOM slides.\")" }, { "cell_type": "markdown", From 11e62c2d5b2daadebac4b431c074a34dd290309e Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Tue, 16 Jun 2026 10:19:09 -0400 Subject: [PATCH 4/9] Select BRCA slides present in GrandQC reference masks MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The demo selection picked the smallest TCGA-BRCA slides, which are frozen-section (TS/BS) slides absent from GrandQC's Zenodo reference archive (DX diagnostic slides only) — so Part 6 validation matched nothing. Switch to five verified DX barcodes (variety by magnification, not tissue type, since all DX slides are primary tumor). Part 6 now downloads just those 5 reference masks (~700 KB, re-hosted) instead of the full 2 GB BRCA.tar, and fixes the mask-name matching: reference masks are {barcode}.{UUID}.svs_mask.png while local masks are keyed by the bare ContainerIdentifier, so a barcode->filename map is built. Drop the cohort-wide statistics that required the full archive. Co-Authored-By: Claude Opus 4.8 --- .../grandqc_slide_quality_with_idc.ipynb | 424 +----------------- 1 file changed, 13 insertions(+), 411 deletions(-) diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb index a68cddf..2e5a3cd 100644 --- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -10,24 +10,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n", - "\n", - "Automated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n", - "\n", - "[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n", - "\n", - "> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n", - "\n", - "[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n", - "\n", - "This notebook demonstrates how to:\n", - "1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n", - "2. **Set up** GrandQC and download its pre-trained models\n", - "3. **Download** slides from IDC and run GrandQC directly on DICOM files\n", - "4. **Visualize** tissue detection and artifact segmentation results\n", - "5. **Leverage** GrandQC's pre-computed QC masks for all 32 TCGA cohorts" - ] + "source": "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n\nAutomated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n\n[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n\n> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n\n[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n\nThis notebook demonstrates how to:\n1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n2. **Set up** GrandQC and download its pre-trained models\n3. **Download** slides from IDC and run GrandQC directly on DICOM files\n4. **Visualize** tissue detection and artifact segmentation results\n5. **Validate** the DICOM-based pipeline against GrandQC's pre-computed QC masks for the TCGA cohorts" }, { "cell_type": "markdown", @@ -687,17 +670,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "### 1.3 Select Slides for Quality Control\n", - "\n", - "For this tutorial we work with slides from the [TCGA-BRCA](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=tcga_brca) collection (The Cancer Genome Atlas — Breast Cancer). This collection is a good choice because:\n", - "\n", - "- It contains over 3,000 H&E-stained slides at 40× magnification\n", - "- GrandQC provides **pre-computed QC masks for all TCGA cohorts** (demonstrated in Part 6)\n", - "- Slides span both tumor and adjacent normal tissue, providing variety in tissue composition\n", - "\n", - "We query the `sm_index` to retrieve per-slide metadata including the `ContainerIdentifier` (the original TCGA slide barcode), which we will later use to match IDC series against the pre-computed GrandQC masks." - ] + "source": "### 1.3 Select Slides for Quality Control\n\nFor this tutorial we work with slides from the [TCGA-BRCA](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=tcga_brca) collection (The Cancer Genome Atlas — Breast Cancer). This collection is a good choice because:\n\n- It contains over 3,000 H&E-stained slides\n- GrandQC provides **pre-computed QC masks for the TCGA cohorts**, which we use to validate our DICOM-based run in Part 6\n\nThose reference masks cover only the **diagnostic FFPE slides** (TCGA `DX` barcodes), not the frozen-section slides (`TS`/`BS`), so we deliberately select diagnostic slides here. We query the `sm_index` to retrieve per-slide metadata including the `ContainerIdentifier` (the original TCGA slide barcode), which we use to match IDC series against the pre-computed GrandQC masks." }, { "cell_type": "code", @@ -1027,215 +1000,14 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "### 1.4 Select a Representative Set of Slides\n", - "\n", - "For the hands-on GrandQC run we select a small set of slides that covers:\n", - "- Different tissue types (tumor vs. normal)\n", - "- Different magnifications (20× and 40×)\n", - "- Manageable file sizes for a notebook environment\n", - "\n", - "We cap the selection at slides under 200 MB to keep download and processing times reasonable." - ] + "source": "### 1.4 Select a Representative Set of Slides\n\nFor the hands-on GrandQC run we use a small set of **diagnostic (DX) H&E slides** that:\n- are confirmed to have pre-computed GrandQC reference masks in the Zenodo `BRCA.tar` archive (so Part 6 can validate against them),\n- span both magnifications GrandQC encounters in TCGA-BRCA (20× and 40×), and\n- are small enough (≤ ~150 MB) for quick download and processing in a notebook.\n\nAll TCGA-BRCA diagnostic slides are primary-tumor sections, so the variety here comes from magnification rather than tissue type. The five barcodes below were verified against the Zenodo archive." }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Selected 3 slides for GrandQC processing:\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
PatientIDContainerIdentifiertissue_typeobjective_powerwidth_pxheight_pxsize_MB
0TCGA-A7-A26JTCGA-A7-A26J-01B-02-BS2Neoplasm, Primary40861872439.1
1TCGA-AC-A2QJTCGA-AC-A2QJ-11A-02-TS2Normal2033993692315.5
2TCGA-E2-A15KTCGA-E2-A15K-06A-01-TS1None40333191996285.7
\n", - "
\n", - "
\n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - " \n", - "
\n", - "
\n", - " " - ], - "text/plain": [ - " PatientID ContainerIdentifier tissue_type objective_power \\\n", - "0 TCGA-A7-A26J TCGA-A7-A26J-01B-02-BS2 Neoplasm, Primary 40 \n", - "1 TCGA-AC-A2QJ TCGA-AC-A2QJ-11A-02-TS2 Normal 20 \n", - "2 TCGA-E2-A15K TCGA-E2-A15K-06A-01-TS1 None 40 \n", - "\n", - " width_px height_px size_MB \n", - "0 8618 7243 9.1 \n", - "1 33993 6923 15.5 \n", - "2 33319 19962 85.7 " - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Pick up to 5 small slides with variety in tissue type and magnification\n", - "demo_slides = (\n", - " tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200]\n", - " .drop_duplicates(subset=\"tissue_type\")\n", - " .head(5)\n", - " .reset_index(drop=True)\n", - ")\n", - "\n", - "# Fall back to the 5 smallest slides if tissue_type is missing\n", - "if len(demo_slides) < 3:\n", - " demo_slides = tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200].head(5).reset_index(drop=True)\n", - "\n", - "print(f\"Selected {len(demo_slides)} slides for GrandQC processing:\")\n", - "display(\n", - " demo_slides[[\n", - " \"PatientID\", \"ContainerIdentifier\", \"tissue_type\",\n", - " \"objective_power\", \"width_px\", \"height_px\", \"size_MB\"\n", - " ]]\n", - ")" - ] + "outputs": [], + "source": "# Diagnostic (DX) H&E slides verified to have pre-computed GrandQC reference\n# masks in the TCGA-BRCA archive on Zenodo (record 14041578). Every DX slide is\n# primary tumor, so the variety here is in objective power (40x and 20x).\nDEMO_BARCODES = [\n \"TCGA-AC-A23G-01Z-00-DX1\", # 40x, 44 MB\n \"TCGA-AC-A23C-01Z-00-DX1\", # 40x, 57 MB\n \"TCGA-AC-A62V-01Z-00-DX1\", # 40x, 60 MB\n \"TCGA-MS-A51U-01Z-00-DX1\", # 20x, 76 MB\n \"TCGA-A8-A0AB-01Z-00-DX1\", # 20x, 151 MB\n]\n\navailable = set(tcga_brca_he[\"ContainerIdentifier\"])\npresent = [b for b in DEMO_BARCODES if b in available]\nmissing = [b for b in DEMO_BARCODES if b not in available]\nif missing:\n print(f\"WARNING: {len(missing)} barcode(s) not found in this IDC version: {missing}\")\n\ndemo_slides = (\n tcga_brca_he[tcga_brca_he[\"ContainerIdentifier\"].isin(present)]\n .drop_duplicates(subset=\"ContainerIdentifier\")\n .set_index(\"ContainerIdentifier\")\n .loc[present] # preserve our chosen order\n .reset_index()\n)\n\nprint(f\"Selected {len(demo_slides)} slides for GrandQC processing:\")\ndisplay(\n demo_slides[[\n \"PatientID\", \"ContainerIdentifier\", \"tissue_type\",\n \"objective_power\", \"width_px\", \"height_px\", \"size_MB\"\n ]]\n)" }, { "cell_type": "code", @@ -1383,14 +1155,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "### 3.2 Rename Series Directories to TCGA Barcodes\n", - "\n", - "GrandQC uses the directory/file name as the key for all output files (tissue mask, QC map, overlay, TSV report row). By renaming each downloaded series directory from its `SeriesInstanceUID` to its `ContainerIdentifier` (TCGA barcode), we get:\n", - "\n", - "- Human-readable output filenames (`TCGA-A7-A26J-01B-02-BS2_MASK.png` instead of a UID)\n", - "- Automatic matching with GrandQC's pre-computed TCGA masks (Part 6), which are keyed by the same barcode" - ] + "source": "### 3.2 Rename Series Directories to TCGA Barcodes\n\nGrandQC uses the directory/file name as the key for all output files (tissue mask, QC map, overlay, TSV report row). By renaming each downloaded series directory from its `SeriesInstanceUID` to its `ContainerIdentifier` (TCGA barcode), we get:\n\n- Human-readable output filenames (`TCGA-AC-A23G-01Z-00-DX1_mask.png` instead of a UID)\n- Straightforward matching with GrandQC's pre-computed TCGA masks (Part 6), which embed the same barcode in their filename" }, { "cell_type": "code", @@ -1751,27 +1516,12 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## Part 6: Validating the Pipeline Against Pre-computed TCGA Masks\n", - "\n", - "In Parts 3–5 we ran GrandQC locally on IDC DICOM slides using OpenSlide's native DICOM reader — a path the GrandQC authors did not test in their original publication. Before using the pipeline on new data, it is good practice to verify that our results are consistent with the authors' reference outputs.\n", - "\n", - "The GrandQC authors published pre-computed QC masks for all 32 TCGA cohorts on Zenodo:\n", - "\n", - "> **GrandQC pre-computed masks**: [zenodo.org/records/14041578](https://zenodo.org/records/14041578) \n", - "> 36 tar archives — one per TCGA cohort (some large cohorts split into two parts) — totalling ~18.3 GB.\n", - "\n", - "Each archive unpacks to PNG files named `{ContainerIdentifier}_mask.png` with integer labels 1–7 — the same format and naming convention produced by `main.py`. By downloading the TCGA-BRCA archive and comparing artifact fractions for our demo slides against the reference, we can confirm that the DICOM-based pipeline is producing equivalent results." - ] + "source": "## Part 6: Validating the Pipeline Against Pre-computed TCGA Masks\n\nIn Parts 3–5 we ran GrandQC locally on IDC DICOM slides using OpenSlide's native DICOM reader — a path the GrandQC authors did not test in their original publication. Before using the pipeline on new data, it is good practice to verify that our results are consistent with the authors' reference outputs.\n\nThe GrandQC authors published pre-computed QC masks for the TCGA cohorts on Zenodo:\n\n> **GrandQC pre-computed masks**: [zenodo.org/records/14041578](https://zenodo.org/records/14041578) \n> One tar archive per TCGA cohort (a few large cohorts are split into two parts), ~18 GB in total. The TCGA-BRCA archive (`BRCA.tar`) alone is ~2 GB and holds masks for 1,105 **diagnostic (DX) slides**.\n\nEach mask is a PNG named after the original slide file — `{barcode}.{UUID}.svs_mask.png` — with integer labels 1–7, the same classes `main.py` produces. Rather than download the full 2 GB cohort archive, we pull just the reference masks for our five demo slides (~700 KB, extracted from `BRCA.tar` and re-hosted) and compare artifact fractions against our local DICOM run." }, { "cell_type": "markdown", "metadata": {}, - "source": [ - "### 6.1 Available Cohort Archives\n", - "\n", - "The Zenodo record contains one tar file per TCGA cohort. Large cohorts (BRCA, THCA, GBM, LUAD, UCEC, LGG, TGCT) are split into two parts." - ] + "source": "### 6.1 Available Cohort Archives\n\nThe Zenodo record contains one tar file per TCGA cohort; a few large cohorts (LUAD, LUSC, THCA, TGCT) are split into two parts. The cell below lists them straight from the Zenodo API — no large download." }, { "cell_type": "code", @@ -1801,36 +1551,14 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "### 6.2 Download the TCGA-BRCA Reference Masks\n", - "\n", - "`BRCA.tar` (~2 GB) contains masks for all TCGA-BRCA slides. After extraction, each file is named `{ContainerIdentifier}_mask.png` — identical to what GrandQC writes into `mask_qc/` locally." - ] + "source": "### 6.2 Download the Reference Masks for Our Demo Slides\n\nWe download just the pre-computed masks for the five demo slides (~700 KB), extracted from `BRCA.tar` and re-hosted as a small folder. Each reference mask is named `{barcode}.{UUID}.svs_mask.png`, whereas our local masks in `qc_output/mask_qc/` are keyed by the bare `ContainerIdentifier` barcode — so we also build a lookup from barcode to the full reference filename.\n\n> To reproduce population-level statistics across the whole cohort, download the full `BRCA.tar` (~2 GB) from the Zenodo record above and point `REF_MASKS_DIR` at the extracted folder; the comparison loop below works unchanged on any number of masks." }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "REF_MASKS_DIR = \"brca_ref_masks\"\n", - "os.makedirs(REF_MASKS_DIR, exist_ok=True)\n", - "\n", - "# Download BRCA.tar (~2 GB) — takes 5-10 min on Colab\n", - "!wget -q --show-progress \\\n", - " \"https://zenodo.org/records/14041578/files/BRCA.tar?download=1\" \\\n", - " -O BRCA.tar\n", - "\n", - "# Extract — PNG files land directly in the archive root\n", - "!tar xf BRCA.tar -C {REF_MASKS_DIR}\n", - "\n", - "# Report what we got\n", - "ref_masks = sorted(f for f in os.listdir(REF_MASKS_DIR) if f.endswith(\"_mask.png\"))\n", - "print(f\"\\nExtracted {len(ref_masks)} reference masks to {REF_MASKS_DIR}/\")\n", - "print(\"Sample filenames:\")\n", - "for f in ref_masks[:5]:\n", - " print(f\" {f}\")" - ] + "source": "import io, zipfile, requests\n\nREF_MASKS_DIR = \"brca_ref_masks\"\nos.makedirs(REF_MASKS_DIR, exist_ok=True)\n\n# Pre-computed GrandQC masks for our five demo slides, extracted from the\n# TCGA-BRCA Zenodo archive (record 14041578) and re-hosted as a small folder\n# (~700 KB) so we don't have to download the full ~2 GB BRCA.tar. The link is a\n# Dropbox shared folder; appending '&dl=1' returns the whole folder as a zip.\nREF_MASKS_URL = (\n \"https://www.dropbox.com/scl/fo/z5jtldr5ed384s8nntiwe/\"\n \"ALUgs22f_0PI8oGO6XJ5eG0?rlkey=b92k8rt6jvqllwkad31obvbyx&dl=1\"\n)\n\nresp = requests.get(REF_MASKS_URL)\nresp.raise_for_status()\nwith zipfile.ZipFile(io.BytesIO(resp.content)) as zf:\n for member in zf.namelist():\n if member.endswith(\"_mask.png\"):\n target = os.path.join(REF_MASKS_DIR, os.path.basename(member))\n with zf.open(member) as src, open(target, \"wb\") as dst:\n dst.write(src.read())\n\nref_masks = sorted(f for f in os.listdir(REF_MASKS_DIR) if f.endswith(\"_mask.png\"))\n\n# Reference masks are named '{barcode}.{UUID}.svs_mask.png'; our local masks are\n# keyed by the bare ContainerIdentifier barcode (no dots). Map barcode -> filename.\nref_mask_by_barcode = {f.split(\".\")[0]: f for f in ref_masks}\n\nprint(f\"Downloaded {len(ref_masks)} reference masks to {REF_MASKS_DIR}/\")\nfor f in ref_masks:\n print(f\" {f}\")" }, { "cell_type": "markdown", @@ -1848,48 +1576,7 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "def mask_fractions(mask_path):\n", - " mask = np.array(Image.open(mask_path))\n", - " tissue_px = np.sum(mask != 7)\n", - " if tissue_px == 0:\n", - " return None\n", - " return {name: np.sum(mask == cls) / tissue_px * 100\n", - " for cls, name in enumerate(CLASS_NAMES_SHORT, start=1)}\n", - "\n", - "\n", - "# Build comparison table for each demo slide\n", - "compare_rows = []\n", - "for barcode in uid_to_barcode.values():\n", - " local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n", - " ref_path = os.path.join(REF_MASKS_DIR, f\"{barcode}_mask.png\")\n", - "\n", - " if not os.path.exists(local_path):\n", - " print(f\" MISSING local mask: {barcode}\")\n", - " continue\n", - " if not os.path.exists(ref_path):\n", - " print(f\" MISSING reference mask: {barcode} — slide may not be in BRCA.tar\")\n", - " continue\n", - "\n", - " local_frac = mask_fractions(local_path)\n", - " ref_frac = mask_fractions(ref_path)\n", - " if local_frac is None or ref_frac is None:\n", - " continue\n", - "\n", - " for name in CLASS_NAMES_SHORT:\n", - " compare_rows.append({\n", - " \"slide\": barcode,\n", - " \"class\": name,\n", - " \"local_%\": round(local_frac[name], 2),\n", - " \"ref_%\": round(ref_frac[name], 2),\n", - " \"delta_pp\": round(local_frac[name] - ref_frac[name], 2),\n", - " })\n", - "\n", - "compare_df = pd.DataFrame(compare_rows)\n", - "print(\"Per-class fraction comparison (local DICOM run vs. Zenodo reference):\\n\")\n", - "display(compare_df.pivot_table(index=\"slide\", columns=\"class\", values=\"delta_pp\").round(2))\n", - "print(\"\\nδ = local − reference (percentage points). Values near 0 confirm agreement.\")" - ] + "source": "def mask_fractions(mask_path):\n mask = np.array(Image.open(mask_path))\n tissue_px = np.sum(mask != 7)\n if tissue_px == 0:\n return None\n return {name: np.sum(mask == cls) / tissue_px * 100\n for cls, name in enumerate(CLASS_NAMES_SHORT, start=1)}\n\n\n# Build comparison table for each demo slide\ncompare_rows = []\nfor barcode in uid_to_barcode.values():\n local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n ref_fname = ref_mask_by_barcode.get(barcode)\n ref_path = os.path.join(REF_MASKS_DIR, ref_fname) if ref_fname else None\n\n if not os.path.exists(local_path):\n print(f\" MISSING local mask: {barcode}\")\n continue\n if ref_path is None or not os.path.exists(ref_path):\n print(f\" MISSING reference mask: {barcode} — slide not in the reference set\")\n continue\n\n local_frac = mask_fractions(local_path)\n ref_frac = mask_fractions(ref_path)\n if local_frac is None or ref_frac is None:\n continue\n\n for name in CLASS_NAMES_SHORT:\n compare_rows.append({\n \"slide\": barcode,\n \"class\": name,\n \"local_%\": round(local_frac[name], 2),\n \"ref_%\": round(ref_frac[name], 2),\n \"delta_pp\": round(local_frac[name] - ref_frac[name], 2),\n })\n\ncompare_df = pd.DataFrame(compare_rows)\nprint(\"Per-class fraction comparison (local DICOM run vs. Zenodo reference):\\n\")\ndisplay(compare_df.pivot_table(index=\"slide\", columns=\"class\", values=\"delta_pp\").round(2))\nprint(\"\\nδ = local − reference (percentage points). Values near 0 confirm agreement.\")" }, { "cell_type": "code", @@ -1924,92 +1611,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "### 6.4 Cohort-level QC Statistics\n", - "\n", - "With validation confirmed, we compute per-slide artifact fractions across all TCGA-BRCA slides in the reference archive. This gives a population-level view of slide quality without running any inference locally — each mask is read once and discarded, keeping memory usage constant." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from tqdm.notebook import tqdm\n", - "\n", - "cohort_rows = []\n", - "for fname in tqdm(ref_masks, desc=\"Computing stats\"):\n", - " barcode = fname.replace(\"_mask.png\", \"\")\n", - " fracs = mask_fractions(os.path.join(REF_MASKS_DIR, fname))\n", - " if fracs is not None:\n", - " fracs[\"slide\"] = barcode\n", - " cohort_rows.append(fracs)\n", - "\n", - "cohort_df = pd.DataFrame(cohort_rows).set_index(\"slide\")\n", - "\n", - "print(f\"Cohort: TCGA-BRCA | {len(cohort_df)} slides\\n\")\n", - "print(\"Median artifact fractions (% of tissue area):\")\n", - "display(cohort_df[CLASS_NAMES_SHORT].median().rename(\"median_%\").to_frame().T.round(1))\n", - "print()\n", - "print(\"90th-percentile artifact fractions:\")\n", - "display(cohort_df[CLASS_NAMES_SHORT].quantile(0.9).rename(\"p90_%\").to_frame().T.round(1))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Violin plot of per-slide artifact fractions across TCGA-BRCA\n", - "# Exclude Clean tissue (class 1) — typically >80%, dominates the y-axis\n", - "artifact_classes = [c for c in CLASS_NAMES_SHORT if c != \"Clean tissue\"]\n", - "artifact_colors = [mpl_colors[i] for i, c in enumerate(CLASS_NAMES_SHORT) if c != \"Clean tissue\"]\n", - "\n", - "fig, ax = plt.subplots(figsize=(9, 4))\n", - "data = [cohort_df[cls].dropna().values for cls in artifact_classes]\n", - "parts = ax.violinplot(data, positions=range(len(artifact_classes)),\n", - " showmedians=True, showextrema=False)\n", - "\n", - "for pc, color in zip(parts[\"bodies\"], artifact_colors):\n", - " pc.set_facecolor(color)\n", - " pc.set_alpha(0.75)\n", - "parts[\"cmedians\"].set_color(\"black\")\n", - "parts[\"cmedians\"].set_linewidth(1.5)\n", - "\n", - "ax.set_xticks(range(len(artifact_classes)))\n", - "ax.set_xticklabels(artifact_classes, rotation=20, ha=\"right\")\n", - "ax.set_ylabel(\"% of tissue area\")\n", - "ax.set_title(f\"Artifact fraction distribution — TCGA-BRCA ({len(cohort_df)} slides)\",\n", - " fontweight=\"bold\")\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Join QC stats with IDC sm_index metadata via ContainerIdentifier\n", - "cohort_meta = cohort_df.reset_index().rename(columns={\"slide\": \"ContainerIdentifier\"})\n", - "brca_meta = tcga_brca_he[[\"ContainerIdentifier\", \"tissue_type\"]].dropna(subset=[\"tissue_type\"])\n", - "\n", - "merged = cohort_meta.merge(brca_meta, on=\"ContainerIdentifier\", how=\"inner\")\n", - "\n", - "print(f\"Slides with tissue-type annotation: {len(merged)} / {len(cohort_df)}\\n\")\n", - "\n", - "# Median artifact fractions by tissue type\n", - "by_tissue = (\n", - " merged.groupby(\"tissue_type\")[CLASS_NAMES_SHORT]\n", - " .median()\n", - " .round(1)\n", - ")\n", - "print(\"Median artifact fractions (%) by tissue type:\")\n", - "display(by_tissue)" - ] + "source": "### 6.4 Scaling to Cohort-level QC\n\nBecause each reference mask is just a small PNG, the comparison above scales to an entire cohort. Downloading the full `BRCA.tar` (~2 GB) from the Zenodo record gives pre-computed masks for all 1,105 TCGA-BRCA diagnostic slides — enough to compute population-level artifact statistics (medians, percentiles, distributions) **without running any inference locally**. We keep this notebook lightweight by validating on the five demo slides only; the loop in section 6.3 generalizes directly to the full mask set, reading one mask at a time so memory stays constant." } ], "metadata": { From df0c20c2ea3864390cda44cf368b41a3774426a0 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Tue, 16 Jun 2026 10:52:23 -0400 Subject: [PATCH 5/9] Add artifact-boundary comparison to GrandQC validation New section 6.4 overlays per-class artifact region boundaries for the reference (Zenodo) and local DICOM runs on a shared grid: solid outline for the reference, dashed for the local run, in the matching artifact colour. This exposes boundary-level localisation differences that the aggregate per-class fractions in 6.3 can hide. The reference mask is resized onto the local grid (nearest-neighbour) since the two can differ slightly in pixel dimensions. Existing cohort-scaling note renumbered to 6.5. Co-Authored-By: Claude Opus 4.8 --- .../grandqc_slide_quality_with_idc.ipynb | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb index 2e5a3cd..fac1ee8 100644 --- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -1608,10 +1608,22 @@ "plt.show()" ] }, + { + "cell_type": "markdown", + "source": "### 6.4 Compare Artifact Region Boundaries\n\nPer-class fractions (6.3) tell us *how much* of each artifact the two runs agree on, but not *where*. Here we overlay the **boundaries** of each artifact region for both runs on the same grid: a **solid** outline for the reference (Zenodo) mask and a **dashed** outline for our local DICOM run, drawn in the matching artifact colour. Where the two outlines hug each other the pipelines localise the artifact identically; visible gaps reveal boundary-level differences that aggregate fractions can hide.\n\nThe reference and local masks are both produced at MPP 1.5, but their pixel dimensions can differ slightly, so the reference is resized onto the local grid (nearest-neighbour) before overlaying. Clean tissue (class 1) and background (class 7) are omitted — only the five artifact classes are outlined.", + "metadata": {} + }, + { + "cell_type": "code", + "source": "from matplotlib.lines import Line2D\n\n# Artifact classes to outline (skip clean tissue = 1 and background = 7)\nCONTOUR_CLASSES = [2, 3, 4, 5, 6]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\nfig, axes = plt.subplots(1, n, figsize=(4.5 * n, 4.8))\nif n == 1:\n axes = [axes]\n\nfor ax, barcode in zip(axes, slides):\n local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n ref_fname = ref_mask_by_barcode.get(barcode)\n if not os.path.exists(local_path) or not ref_fname:\n ax.set_title(f\"{barcode}\\n(mask missing)\", fontsize=7)\n ax.axis(\"off\")\n continue\n\n local = np.array(Image.open(local_path))\n # Both masks are produced at MPP 1.5, but the exact pixel dimensions can\n # differ slightly; resize the reference onto the local grid with nearest-\n # neighbour interpolation (preserves the integer class labels) so the two\n # sets of contours share one coordinate system.\n ref_img = Image.open(os.path.join(REF_MASKS_DIR, ref_fname))\n if ref_img.size != (local.shape[1], local.shape[0]):\n ref_img = ref_img.resize((local.shape[1], local.shape[0]), Image.NEAREST)\n ref = np.array(ref_img)\n\n # Light-grey tissue silhouette for spatial context (white = background)\n ax.imshow(np.where(local != 7, 0.9, 1.0), cmap=\"gray\", vmin=0, vmax=1)\n\n for cls in CONTOUR_CLASSES:\n color = mpl_colors[cls - 1]\n if np.any(ref == cls):\n ax.contour((ref == cls).astype(float), levels=[0.5], colors=[color],\n linewidths=1.1, linestyles=\"solid\", origin=\"upper\")\n if np.any(local == cls):\n ax.contour((local == cls).astype(float), levels=[0.5], colors=[color],\n linewidths=1.1, linestyles=\"dashed\", origin=\"upper\")\n ax.set_title(barcode, fontsize=7)\n ax.axis(\"off\")\n\n# Legend: colour encodes the artifact class, line style encodes the source\nclass_handles = [Line2D([0], [0], color=mpl_colors[c - 1], lw=2,\n label=CLASS_NAMES_SHORT[c - 1]) for c in CONTOUR_CLASSES]\nstyle_handles = [\n Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"solid\", label=\"Reference (Zenodo)\"),\n Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"dashed\", label=\"Local (DICOM)\"),\n]\nfig.legend(handles=class_handles + style_handles, loc=\"lower center\",\n ncol=len(class_handles) + len(style_handles), fontsize=7,\n framealpha=0.9, bbox_to_anchor=(0.5, -0.02))\nplt.suptitle(\"Artifact region boundaries — reference (solid) vs. local DICOM (dashed)\",\n fontweight=\"bold\", fontsize=10)\nplt.tight_layout(rect=[0, 0.04, 1, 0.97])\nplt.show()", + "metadata": {}, + "execution_count": null, + "outputs": [] + }, { "cell_type": "markdown", "metadata": {}, - "source": "### 6.4 Scaling to Cohort-level QC\n\nBecause each reference mask is just a small PNG, the comparison above scales to an entire cohort. Downloading the full `BRCA.tar` (~2 GB) from the Zenodo record gives pre-computed masks for all 1,105 TCGA-BRCA diagnostic slides — enough to compute population-level artifact statistics (medians, percentiles, distributions) **without running any inference locally**. We keep this notebook lightweight by validating on the five demo slides only; the loop in section 6.3 generalizes directly to the full mask set, reading one mask at a time so memory stays constant." + "source": "### 6.5 Scaling to Cohort-level QC\n\nBecause each reference mask is just a small PNG, the comparison above scales to an entire cohort. Downloading the full `BRCA.tar` (~2 GB) from the Zenodo record gives pre-computed masks for all 1,105 TCGA-BRCA diagnostic slides — enough to compute population-level artifact statistics (medians, percentiles, distributions) **without running any inference locally**. We keep this notebook lightweight by validating on the five demo slides only; the loop in section 6.3 generalizes directly to the full mask set, reading one mask at a time so memory stays constant." } ], "metadata": { @@ -1635,4 +1647,4 @@ }, "nbformat": 4, "nbformat_minor": 0 -} \ No newline at end of file +} From c6231c45679d52aaaadbbba5aae28631da56d425 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Tue, 16 Jun 2026 13:50:54 -0400 Subject: [PATCH 6/9] Make slide-directory rename idempotent and re-run safe If the download cell is re-run after a rename, both the UID directory and the renamed barcode directory exist, and os.rename(src, dst) onto a non-empty dst raised OSError. Handle the both-present case by dropping the redundant UID copy and keeping the barcode directory, so the cell is safe to re-run in any order. Co-Authored-By: Claude Opus 4.8 --- .../grandqc_slide_quality_with_idc.ipynb | 25 ++----------------- 1 file changed, 2 insertions(+), 23 deletions(-) diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb index fac1ee8..6f3e206 100644 --- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -1162,28 +1162,7 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "# Build SeriesInstanceUID → ContainerIdentifier map\n", - "uid_to_barcode = dict(zip(demo_slides[\"SeriesInstanceUID\"], demo_slides[\"ContainerIdentifier\"]))\n", - "\n", - "for uid, barcode in uid_to_barcode.items():\n", - " src = os.path.join(SLIDE_DIR, uid)\n", - " dst = os.path.join(SLIDE_DIR, barcode)\n", - " if os.path.isdir(src):\n", - " os.rename(src, dst)\n", - " print(f\" {uid} → {barcode}\")\n", - " elif os.path.isdir(dst):\n", - " print(f\" Already renamed: {barcode}\")\n", - " else:\n", - " print(f\" WARNING: directory not found: {uid}\")\n", - "\n", - "print(\"\\nSlide directories after renaming:\")\n", - "for d in sorted(os.listdir(SLIDE_DIR)):\n", - " dpath = os.path.join(SLIDE_DIR, d)\n", - " if os.path.isdir(dpath):\n", - " n = sum(1 for f in os.listdir(dpath) if f.lower().endswith(\".dcm\"))\n", - " print(f\" {d}/ ({n} .dcm files)\")" - ] + "source": "import shutil\n\n# Build SeriesInstanceUID → ContainerIdentifier map\nuid_to_barcode = dict(zip(demo_slides[\"SeriesInstanceUID\"], demo_slides[\"ContainerIdentifier\"]))\n\n# Rename each downloaded UID directory to its barcode. This is written to be\n# safe to re-run, and to tolerate the download cell being re-run (which would\n# re-create the UID directory next to an already-renamed barcode directory).\nfor uid, barcode in uid_to_barcode.items():\n src = os.path.join(SLIDE_DIR, uid) # freshly downloaded, named by UID\n dst = os.path.join(SLIDE_DIR, barcode) # renamed target, named by barcode\n\n if os.path.isdir(src) and os.path.isdir(dst):\n # Both present → the download cell was re-run after a previous rename.\n # The UID directory is a redundant copy of the same (immutable) series,\n # so drop it and keep the barcode directory downstream code expects.\n shutil.rmtree(src)\n print(f\" Removed duplicate UID dir; kept {barcode}\")\n elif os.path.isdir(src):\n os.rename(src, dst)\n print(f\" {uid} → {barcode}\")\n elif os.path.isdir(dst):\n print(f\" Already renamed: {barcode}\")\n else:\n print(f\" WARNING: directory not found: {uid}\")\n\nprint(\"\\nSlide directories after renaming:\")\nfor d in sorted(os.listdir(SLIDE_DIR)):\n dpath = os.path.join(SLIDE_DIR, d)\n if os.path.isdir(dpath):\n n = sum(1 for f in os.listdir(dpath) if f.lower().endswith(\".dcm\"))\n print(f\" {d}/ ({n} .dcm files)\")" }, { "cell_type": "markdown", @@ -1647,4 +1626,4 @@ }, "nbformat": 4, "nbformat_minor": 0 -} +} \ No newline at end of file From a46a901e1a1215291c83d6e3977c3e9e3d15ba06 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Tue, 16 Jun 2026 13:55:18 -0400 Subject: [PATCH 7/9] Lay review panels out 2 per row for larger images With five demo slides the single-row figures made each panel tiny. Switch the tissue-detection overlays, artifact overlays, and the artifact-boundary comparison to a 2-column grid with larger per-panel sizes, hiding any unused trailing panel. Co-Authored-By: Claude Opus 4.8 --- .../grandqc_slide_quality_with_idc.ipynb | 37 ++----------------- 1 file changed, 3 insertions(+), 34 deletions(-) diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb index 6f3e206..829ccaa 100644 --- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -1336,24 +1336,7 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "import matplotlib.image as mpimg\n", - "\n", - "overlay_dir = os.path.join(OUTPUT_DIR, \"tis_det_overlay\")\n", - "fnames = sorted(os.listdir(overlay_dir))\n", - "\n", - "fig, axes = plt.subplots(1, len(fnames), figsize=(6 * len(fnames), 5))\n", - "if len(fnames) == 1:\n", - " axes = [axes]\n", - "for ax, fname in zip(axes, fnames):\n", - " ax.imshow(mpimg.imread(os.path.join(overlay_dir, fname)))\n", - " ax.set_title(fname.replace(\"_OVERLAY.jpg\", \"\"), fontsize=7)\n", - " ax.axis(\"off\")\n", - "plt.suptitle(\"Step 1 — Tissue Detection\", fontsize=11, fontweight=\"bold\")\n", - "plt.tight_layout()\n", - "plt.show()" - ] + "source": "import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\n\noverlay_dir = os.path.join(OUTPUT_DIR, \"tis_det_overlay\")\nfnames = sorted(os.listdir(overlay_dir))\n\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if len(fnames) > 1 else 1\nnrows = (len(fnames) + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(8 * ncols, 6 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, fname in zip(axes, fnames):\n ax.imshow(mpimg.imread(os.path.join(overlay_dir, fname)))\n ax.set_title(fname.replace(\"_OVERLAY.jpg\", \"\"), fontsize=9)\n ax.axis(\"off\")\nfor ax in axes[len(fnames):]: # hide unused panels in the last row\n ax.axis(\"off\")\n\nplt.suptitle(\"Step 1 — Tissue Detection\", fontsize=13, fontweight=\"bold\")\nplt.tight_layout()\nplt.show()" }, { "cell_type": "markdown", @@ -1379,21 +1362,7 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "overlay_qc_dir = os.path.join(OUTPUT_DIR, \"overlays_qc\")\n", - "fnames_qc = sorted(os.listdir(overlay_qc_dir))\n", - "\n", - "fig, axes = plt.subplots(1, len(fnames_qc), figsize=(6 * len(fnames_qc), 5))\n", - "if len(fnames_qc) == 1:\n", - " axes = [axes]\n", - "for ax, fname in zip(axes, fnames_qc):\n", - " ax.imshow(mpimg.imread(os.path.join(overlay_qc_dir, fname)))\n", - " ax.set_title(fname.replace(\"_overlay_QC.jpg\", \"\"), fontsize=7)\n", - " ax.axis(\"off\")\n", - "plt.suptitle(\"Step 2 — Artifact Segmentation Overlays\", fontsize=11, fontweight=\"bold\")\n", - "plt.tight_layout()\n", - "plt.show()" - ] + "source": "overlay_qc_dir = os.path.join(OUTPUT_DIR, \"overlays_qc\")\nfnames_qc = sorted(os.listdir(overlay_qc_dir))\n\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if len(fnames_qc) > 1 else 1\nnrows = (len(fnames_qc) + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(8 * ncols, 6 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, fname in zip(axes, fnames_qc):\n ax.imshow(mpimg.imread(os.path.join(overlay_qc_dir, fname)))\n ax.set_title(fname.replace(\"_overlay_QC.jpg\", \"\"), fontsize=9)\n ax.axis(\"off\")\nfor ax in axes[len(fnames_qc):]: # hide unused panels in the last row\n ax.axis(\"off\")\n\nplt.suptitle(\"Step 2 — Artifact Segmentation Overlays\", fontsize=13, fontweight=\"bold\")\nplt.tight_layout()\nplt.show()" }, { "cell_type": "markdown", @@ -1594,7 +1563,7 @@ }, { "cell_type": "code", - "source": "from matplotlib.lines import Line2D\n\n# Artifact classes to outline (skip clean tissue = 1 and background = 7)\nCONTOUR_CLASSES = [2, 3, 4, 5, 6]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\nfig, axes = plt.subplots(1, n, figsize=(4.5 * n, 4.8))\nif n == 1:\n axes = [axes]\n\nfor ax, barcode in zip(axes, slides):\n local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n ref_fname = ref_mask_by_barcode.get(barcode)\n if not os.path.exists(local_path) or not ref_fname:\n ax.set_title(f\"{barcode}\\n(mask missing)\", fontsize=7)\n ax.axis(\"off\")\n continue\n\n local = np.array(Image.open(local_path))\n # Both masks are produced at MPP 1.5, but the exact pixel dimensions can\n # differ slightly; resize the reference onto the local grid with nearest-\n # neighbour interpolation (preserves the integer class labels) so the two\n # sets of contours share one coordinate system.\n ref_img = Image.open(os.path.join(REF_MASKS_DIR, ref_fname))\n if ref_img.size != (local.shape[1], local.shape[0]):\n ref_img = ref_img.resize((local.shape[1], local.shape[0]), Image.NEAREST)\n ref = np.array(ref_img)\n\n # Light-grey tissue silhouette for spatial context (white = background)\n ax.imshow(np.where(local != 7, 0.9, 1.0), cmap=\"gray\", vmin=0, vmax=1)\n\n for cls in CONTOUR_CLASSES:\n color = mpl_colors[cls - 1]\n if np.any(ref == cls):\n ax.contour((ref == cls).astype(float), levels=[0.5], colors=[color],\n linewidths=1.1, linestyles=\"solid\", origin=\"upper\")\n if np.any(local == cls):\n ax.contour((local == cls).astype(float), levels=[0.5], colors=[color],\n linewidths=1.1, linestyles=\"dashed\", origin=\"upper\")\n ax.set_title(barcode, fontsize=7)\n ax.axis(\"off\")\n\n# Legend: colour encodes the artifact class, line style encodes the source\nclass_handles = [Line2D([0], [0], color=mpl_colors[c - 1], lw=2,\n label=CLASS_NAMES_SHORT[c - 1]) for c in CONTOUR_CLASSES]\nstyle_handles = [\n Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"solid\", label=\"Reference (Zenodo)\"),\n Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"dashed\", label=\"Local (DICOM)\"),\n]\nfig.legend(handles=class_handles + style_handles, loc=\"lower center\",\n ncol=len(class_handles) + len(style_handles), fontsize=7,\n framealpha=0.9, bbox_to_anchor=(0.5, -0.02))\nplt.suptitle(\"Artifact region boundaries — reference (solid) vs. local DICOM (dashed)\",\n fontweight=\"bold\", fontsize=10)\nplt.tight_layout(rect=[0, 0.04, 1, 0.97])\nplt.show()", + "source": "from matplotlib.lines import Line2D\n\n# Artifact classes to outline (skip clean tissue = 1 and background = 7)\nCONTOUR_CLASSES = [2, 3, 4, 5, 6]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if n > 1 else 1\nnrows = (n + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(7 * ncols, 6.5 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, barcode in zip(axes, slides):\n local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n ref_fname = ref_mask_by_barcode.get(barcode)\n if not os.path.exists(local_path) or not ref_fname:\n ax.set_title(f\"{barcode}\\n(mask missing)\", fontsize=9)\n ax.axis(\"off\")\n continue\n\n local = np.array(Image.open(local_path))\n # Both masks are produced at MPP 1.5, but the exact pixel dimensions can\n # differ slightly; resize the reference onto the local grid with nearest-\n # neighbour interpolation (preserves the integer class labels) so the two\n # sets of contours share one coordinate system.\n ref_img = Image.open(os.path.join(REF_MASKS_DIR, ref_fname))\n if ref_img.size != (local.shape[1], local.shape[0]):\n ref_img = ref_img.resize((local.shape[1], local.shape[0]), Image.NEAREST)\n ref = np.array(ref_img)\n\n # Light-grey tissue silhouette for spatial context (white = background)\n ax.imshow(np.where(local != 7, 0.9, 1.0), cmap=\"gray\", vmin=0, vmax=1)\n\n for cls in CONTOUR_CLASSES:\n color = mpl_colors[cls - 1]\n if np.any(ref == cls):\n ax.contour((ref == cls).astype(float), levels=[0.5], colors=[color],\n linewidths=1.1, linestyles=\"solid\", origin=\"upper\")\n if np.any(local == cls):\n ax.contour((local == cls).astype(float), levels=[0.5], colors=[color],\n linewidths=1.1, linestyles=\"dashed\", origin=\"upper\")\n ax.set_title(barcode, fontsize=9)\n ax.axis(\"off\")\n\nfor ax in axes[n:]: # hide unused panels in the last row\n ax.axis(\"off\")\n\n# Legend: colour encodes the artifact class, line style encodes the source\nclass_handles = [Line2D([0], [0], color=mpl_colors[c - 1], lw=2,\n label=CLASS_NAMES_SHORT[c - 1]) for c in CONTOUR_CLASSES]\nstyle_handles = [\n Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"solid\", label=\"Reference (Zenodo)\"),\n Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"dashed\", label=\"Local (DICOM)\"),\n]\nfig.legend(handles=class_handles + style_handles, loc=\"lower center\",\n ncol=len(class_handles) + len(style_handles), fontsize=8,\n framealpha=0.9, bbox_to_anchor=(0.5, 0.0))\nplt.suptitle(\"Artifact region boundaries — reference (solid) vs. local DICOM (dashed)\",\n fontweight=\"bold\", fontsize=12)\nplt.tight_layout(rect=[0, 0.03, 1, 0.97])\nplt.show()", "metadata": {}, "execution_count": null, "outputs": [] From 9c96d99c532258d646618f82f0eab345ea4801b7 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Tue, 16 Jun 2026 14:32:20 -0400 Subject: [PATCH 8/9] Replace contour overlay with side-by-side mask comparison The overlaid solid/dashed contours were hard to read. Instead show the reference (Zenodo) and local DICOM artifact maps side by side, one slide per row, each colourised with the same GrandQC palette and a shared class legend. No resizing needed since the panels are not overlaid. Co-Authored-By: Claude Opus 4.8 --- notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb index 829ccaa..b9dd4f7 100644 --- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -1558,12 +1558,12 @@ }, { "cell_type": "markdown", - "source": "### 6.4 Compare Artifact Region Boundaries\n\nPer-class fractions (6.3) tell us *how much* of each artifact the two runs agree on, but not *where*. Here we overlay the **boundaries** of each artifact region for both runs on the same grid: a **solid** outline for the reference (Zenodo) mask and a **dashed** outline for our local DICOM run, drawn in the matching artifact colour. Where the two outlines hug each other the pipelines localise the artifact identically; visible gaps reveal boundary-level differences that aggregate fractions can hide.\n\nThe reference and local masks are both produced at MPP 1.5, but their pixel dimensions can differ slightly, so the reference is resized onto the local grid (nearest-neighbour) before overlaying. Clean tissue (class 1) and background (class 7) are omitted — only the five artifact classes are outlined.", + "source": "### 6.4 Compare Reference and Local Results Side by Side\n\nPer-class fractions (6.3) summarise agreement as numbers. Here we place the two artifact maps next to each other — the reference (Zenodo) result on the left and our local DICOM result on the right, one slide per row — so differences in *where* each artifact class is detected are easy to spot.\n\nBoth masks are colourised with the same GrandQC palette: clean tissue (grey) plus the five artifact classes; background (class 7) is shown white. The reference and local masks can differ slightly in pixel dimensions, but since the panels are shown side by side rather than overlaid, no resizing is needed.", "metadata": {} }, { "cell_type": "code", - "source": "from matplotlib.lines import Line2D\n\n# Artifact classes to outline (skip clean tissue = 1 and background = 7)\nCONTOUR_CLASSES = [2, 3, 4, 5, 6]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if n > 1 else 1\nnrows = (n + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(7 * ncols, 6.5 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, barcode in zip(axes, slides):\n local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n ref_fname = ref_mask_by_barcode.get(barcode)\n if not os.path.exists(local_path) or not ref_fname:\n ax.set_title(f\"{barcode}\\n(mask missing)\", fontsize=9)\n ax.axis(\"off\")\n continue\n\n local = np.array(Image.open(local_path))\n # Both masks are produced at MPP 1.5, but the exact pixel dimensions can\n # differ slightly; resize the reference onto the local grid with nearest-\n # neighbour interpolation (preserves the integer class labels) so the two\n # sets of contours share one coordinate system.\n ref_img = Image.open(os.path.join(REF_MASKS_DIR, ref_fname))\n if ref_img.size != (local.shape[1], local.shape[0]):\n ref_img = ref_img.resize((local.shape[1], local.shape[0]), Image.NEAREST)\n ref = np.array(ref_img)\n\n # Light-grey tissue silhouette for spatial context (white = background)\n ax.imshow(np.where(local != 7, 0.9, 1.0), cmap=\"gray\", vmin=0, vmax=1)\n\n for cls in CONTOUR_CLASSES:\n color = mpl_colors[cls - 1]\n if np.any(ref == cls):\n ax.contour((ref == cls).astype(float), levels=[0.5], colors=[color],\n linewidths=1.1, linestyles=\"solid\", origin=\"upper\")\n if np.any(local == cls):\n ax.contour((local == cls).astype(float), levels=[0.5], colors=[color],\n linewidths=1.1, linestyles=\"dashed\", origin=\"upper\")\n ax.set_title(barcode, fontsize=9)\n ax.axis(\"off\")\n\nfor ax in axes[n:]: # hide unused panels in the last row\n ax.axis(\"off\")\n\n# Legend: colour encodes the artifact class, line style encodes the source\nclass_handles = [Line2D([0], [0], color=mpl_colors[c - 1], lw=2,\n label=CLASS_NAMES_SHORT[c - 1]) for c in CONTOUR_CLASSES]\nstyle_handles = [\n Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"solid\", label=\"Reference (Zenodo)\"),\n Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"dashed\", label=\"Local (DICOM)\"),\n]\nfig.legend(handles=class_handles + style_handles, loc=\"lower center\",\n ncol=len(class_handles) + len(style_handles), fontsize=8,\n framealpha=0.9, bbox_to_anchor=(0.5, 0.0))\nplt.suptitle(\"Artifact region boundaries — reference (solid) vs. local DICOM (dashed)\",\n fontweight=\"bold\", fontsize=12)\nplt.tight_layout(rect=[0, 0.03, 1, 0.97])\nplt.show()", + "source": "from matplotlib.patches import Patch\n\n# Colour lookup for GrandQC class labels: 1-6 use the artifact palette, while\n# 0 (padding) and 7 (background) are shown white.\npalette = np.ones((8, 3))\nfor cls in range(1, 7):\n palette[cls] = mpl_colors[cls - 1]\n\ndef colorize(mask):\n return palette[np.clip(mask, 0, 7)]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\nfig, axes = plt.subplots(n, 2, figsize=(13, 5.0 * n))\naxes = np.atleast_2d(axes)\n\nfor row, barcode in enumerate(slides):\n local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n ref_fname = ref_mask_by_barcode.get(barcode)\n ref_path = os.path.join(REF_MASKS_DIR, ref_fname) if ref_fname else None\n\n ax_ref, ax_local = axes[row, 0], axes[row, 1]\n ax_ref.axis(\"off\")\n ax_local.axis(\"off\")\n\n if ref_path and os.path.exists(ref_path):\n ax_ref.imshow(colorize(np.array(Image.open(ref_path))))\n else:\n ax_ref.text(0.5, 0.5, \"reference mask missing\", ha=\"center\", va=\"center\")\n if os.path.exists(local_path):\n ax_local.imshow(colorize(np.array(Image.open(local_path))))\n else:\n ax_local.text(0.5, 0.5, \"local mask missing\", ha=\"center\", va=\"center\")\n\n ax_ref.set_title(f\"{barcode}\\nReference (Zenodo)\", fontsize=10)\n ax_local.set_title(\"Local (DICOM)\", fontsize=10)\n\n# Shared legend of artifact classes\nlegend_handles = [Patch(facecolor=mpl_colors[i], edgecolor=\"black\",\n label=CLASS_NAMES_SHORT[i]) for i in range(len(CLASS_NAMES_SHORT))]\nfig.legend(handles=legend_handles, loc=\"lower center\", ncol=len(legend_handles),\n fontsize=9, framealpha=0.9, bbox_to_anchor=(0.5, 0.0))\nplt.suptitle(\"Artifact maps — reference vs. local DICOM run\", fontweight=\"bold\", fontsize=13)\nplt.tight_layout(rect=[0, 0.03, 1, 0.97])\nplt.show()", "metadata": {}, "execution_count": null, "outputs": [] From 6a54113e12477906641b4369d9b55b663f0684e6 Mon Sep 17 00:00:00 2001 From: Andrey Fedorov Date: Tue, 16 Jun 2026 14:33:59 -0400 Subject: [PATCH 9/9] Note T4 high-RAM runtime requirement in the intro Co-Authored-By: Claude Opus 4.8 --- notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb index b9dd4f7..03c0bc6 100644 --- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb +++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb @@ -10,7 +10,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n\nAutomated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n\n[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n\n> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n\n[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n\nThis notebook demonstrates how to:\n1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n2. **Set up** GrandQC and download its pre-trained models\n3. **Download** slides from IDC and run GrandQC directly on DICOM files\n4. **Visualize** tissue detection and artifact segmentation results\n5. **Validate** the DICOM-based pipeline against GrandQC's pre-computed QC masks for the TCGA cohorts" + "source": "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n\nAutomated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n\n[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n\n> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n\n[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n\nThis notebook demonstrates how to:\n1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n2. **Set up** GrandQC and download its pre-trained models\n3. **Download** slides from IDC and run GrandQC directly on DICOM files\n4. **Visualize** tissue detection and artifact segmentation results\n5. **Validate** the DICOM-based pipeline against GrandQC's pre-computed QC masks for the TCGA cohorts\n\n> ⚠️ **Runtime requirement**: Run this notebook on a **T4 high-RAM** runtime. In Colab, select **Runtime → Change runtime type → T4 GPU** and enable the **High-RAM** option. The artifact-segmentation step needs the GPU for reasonable speed (~30–45 s/slide vs. 10–30 min/slide on CPU), and the high-RAM shape avoids out-of-memory errors when loading the larger whole-slide images." }, { "cell_type": "markdown",