From 6700478177f6322a17c6faa10ba8eb01882a4417 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Thu, 11 Jun 2026 11:06:53 -0400
Subject: [PATCH 1/9] Add GrandQC slide quality control notebook for IDC DICOM
 WSI

Demonstrates end-to-end use of GrandQC with NCI Imaging Data Commons:
- Discover H&E slides via idc-index sm_index metadata
- Patch GrandQC scripts for native DICOM directory support via OpenSlide 4.0
- Download TCGA-BRCA DICOM series and run tissue detection + artifact segmentation
- Visualize results using GrandQC's own wsi_colors.py palette
- Validate the DICOM-based pipeline against GrandQC's pre-computed TCGA masks
  (zenodo.org/records/14041578) by comparing per-class artifact fractions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 .../grandqc_slide_quality_with_idc.ipynb      | 1135 +++++++++++++++++
 1 file changed, 1135 insertions(+)
 create mode 100644 notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
new file mode 100644
index 0000000..d505c19
--- /dev/null
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -0,0 +1,1135 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/master/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n",
+    "\n",
+    "Automated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n",
+    "\n",
+    "[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n",
+    "\n",
+    "> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n",
+    "\n",
+    "[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n",
+    "\n",
+    "This notebook demonstrates how to:\n",
+    "1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n",
+    "2. **Set up** GrandQC and download its pre-trained models\n",
+    "3. **Download** slides from IDC and run GrandQC directly on DICOM files\n",
+    "4. **Visualize** tissue detection and artifact segmentation results\n",
+    "5. **Leverage** GrandQC's pre-computed QC masks for all 32 TCGA cohorts"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Disclaimer\n",
+    "\n",
+    "The code and data of this repository are provided to promote reproducible research. They are not intended for clinical care or commercial use.\n",
+    "\n",
+    "The software is provided \"as is\", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.\n",
+    "\n",
+    "**GrandQC license**: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0). Non-commercial research use only; citation of the GrandQC paper is required."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Part 1: Discover H&E Slides in IDC\n",
+    "\n",
+    "IDC provides the `idc-index` Python package for querying slide metadata and downloading DICOM files. We start by installing it and exploring the available whole slide image collections."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "!pip install --upgrade idc-index openslide-python openslide-bin"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "import matplotlib.patches as mpatches\n",
+    "import numpy as np\n",
+    "from IPython.display import IFrame, display\n",
+    "from idc_index import IDCClient\n",
+    "\n",
+    "idc_client = IDCClient()\n",
+    "print(f\"IDC data version: {idc_client.get_idc_version()}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 1.1 Overview of Slide Microscopy Collections\n",
+    "\n",
+    "IDC stores whole slide images using the DICOM Slide Microscopy (SM) standard. The `sm_index` table extends the main metadata index with pathology-specific attributes including staining protocol, tissue type, pixel spacing, and image dimensions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the slide microscopy index\n",
+    "idc_client.fetch_index(\"sm_index\")\n",
+    "\n",
+    "# Summary of all SM collections\n",
+    "sm_collections = idc_client.sql_query(\"\"\"\n",
+    "    SELECT\n",
+    "        i.collection_id,\n",
+    "        COUNT(DISTINCT i.PatientID)      AS patients,\n",
+    "        COUNT(DISTINCT i.SeriesInstanceUID) AS series,\n",
+    "        ROUND(SUM(i.series_size_MB) / 1024.0, 1) AS size_GB,\n",
+    "        i.license_short_name\n",
+    "    FROM index i\n",
+    "    WHERE i.Modality = 'SM'\n",
+    "    GROUP BY i.collection_id, i.license_short_name\n",
+    "    ORDER BY patients DESC\n",
+    "\"\"\")\n",
+    "\n",
+    "print(f\"Total SM collections: {len(sm_collections)}\")\n",
+    "print(f\"Total SM series:      {sm_collections['series'].sum():,}\")\n",
+    "print(f\"Total size:           {sm_collections['size_GB'].sum():.0f} GB\")\n",
+    "print()\n",
+    "sm_collections.head(15)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 1.2 Filter for H&E-Stained Slides\n",
+    "\n",
+    "The `sm_index` table includes a `staining_usingSubstance_CodeMeaning` column (stored as an array) that records the staining protocol for each slide. We filter for slides stained with hematoxylin, which identifies H&E preparations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Count H&E slides per collection\n",
+    "he_by_collection = idc_client.sql_query(\"\"\"\n",
+    "    SELECT\n",
+    "        i.collection_id,\n",
+    "        COUNT(DISTINCT i.PatientID)         AS patients,\n",
+    "        COUNT(DISTINCT i.SeriesInstanceUID) AS he_series,\n",
+    "        ROUND(SUM(i.series_size_MB) / 1024.0, 1) AS size_GB\n",
+    "    FROM index i\n",
+    "    JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID\n",
+    "    WHERE array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'\n",
+    "    GROUP BY i.collection_id\n",
+    "    ORDER BY he_series DESC\n",
+    "\"\"\")\n",
+    "\n",
+    "print(f\"Collections with H&E slides: {len(he_by_collection)}\")\n",
+    "print(f\"Total H&E series: {he_by_collection['he_series'].sum():,}\")\n",
+    "he_by_collection.head(15)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 1.3 Select Slides for Quality Control\n",
+    "\n",
+    "For this tutorial we work with slides from the [TCGA-BRCA](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=tcga_brca) collection (The Cancer Genome Atlas — Breast Cancer). This collection is a good choice because:\n",
+    "\n",
+    "- It contains over 3,000 H&E-stained slides at 40× magnification\n",
+    "- GrandQC provides **pre-computed QC masks for all TCGA cohorts** (demonstrated in Part 6)\n",
+    "- Slides span both tumor and adjacent normal tissue, providing variety in tissue composition\n",
+    "\n",
+    "We query the `sm_index` to retrieve per-slide metadata including the `ContainerIdentifier` (the original TCGA slide barcode), which we will later use to match IDC series against the pre-computed GrandQC masks."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Query TCGA-BRCA H&E slides with key metadata\n",
+    "tcga_brca_he = idc_client.sql_query(\"\"\"\n",
+    "    SELECT\n",
+    "        i.SeriesInstanceUID,\n",
+    "        i.PatientID,\n",
+    "        s.ContainerIdentifier,\n",
+    "        s.primaryAnatomicStructureModifier_CodeMeaning AS tissue_type,\n",
+    "        s.max_TotalPixelMatrixColumns                  AS width_px,\n",
+    "        s.max_TotalPixelMatrixRows                     AS height_px,\n",
+    "        s.min_PixelSpacing_2sf                         AS pixel_spacing_mm,\n",
+    "        s.ObjectiveLensPower                           AS objective_power,\n",
+    "        ROUND(i.series_size_MB, 1)                     AS size_MB\n",
+    "    FROM index i\n",
+    "    JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID\n",
+    "    WHERE i.collection_id = 'tcga_brca'\n",
+    "      AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'\n",
+    "    ORDER BY i.series_size_MB ASC\n",
+    "\"\"\")\n",
+    "\n",
+    "print(f\"TCGA-BRCA H&E slides: {len(tcga_brca_he)}\")\n",
+    "print(f\"Tissue types: {tcga_brca_he['tissue_type'].value_counts().to_dict()}\")\n",
+    "print(f\"\\nSize range: {tcga_brca_he['size_MB'].min()} – {tcga_brca_he['size_MB'].max()} MB\")\n",
+    "tcga_brca_he.head(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 1.4 Select a Representative Set of Slides\n",
+    "\n",
+    "For the hands-on GrandQC run we select a small set of slides that covers:\n",
+    "- Different tissue types (tumor vs. normal)\n",
+    "- Different magnifications (20× and 40×)\n",
+    "- Manageable file sizes for a notebook environment\n",
+    "\n",
+    "We cap the selection at slides under 200 MB to keep download and processing times reasonable."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Pick up to 5 small slides with variety in tissue type and magnification\n",
+    "demo_slides = (\n",
+    "    tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200]\n",
+    "    .drop_duplicates(subset=\"tissue_type\")\n",
+    "    .head(5)\n",
+    "    .reset_index(drop=True)\n",
+    ")\n",
+    "\n",
+    "# Fall back to the 5 smallest slides if tissue_type is missing\n",
+    "if len(demo_slides) < 3:\n",
+    "    demo_slides = tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200].head(5).reset_index(drop=True)\n",
+    "\n",
+    "print(f\"Selected {len(demo_slides)} slides for GrandQC processing:\")\n",
+    "display(\n",
+    "    demo_slides[[\n",
+    "        \"PatientID\", \"ContainerIdentifier\", \"tissue_type\",\n",
+    "        \"objective_power\", \"width_px\", \"height_px\", \"size_MB\"\n",
+    "    ]]\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Preview slides in the IDC SLIM viewer (pathology viewer)\n",
+    "for _, row in demo_slides.iterrows():\n",
+    "    url = idc_client.get_viewer_URL(seriesInstanceUID=row[\"SeriesInstanceUID\"])\n",
+    "    print(f\"{row['ContainerIdentifier']}  ({row['tissue_type']}, {row['size_MB']} MB)\")\n",
+    "    print(f\"  {url}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Part 2: GrandQC Setup\n",
+    "\n",
+    "Before running GrandQC we need to:\n",
+    "1. Clone the repository from GitHub\n",
+    "2. Install its Python dependencies\n",
+    "3. Download the pre-trained models from Zenodo\n",
+    "4. Apply a small patch so both scripts accept IDC's DICOM series directories alongside conventional WSI files\n",
+    "\n",
+    "> **Note**: GrandQC's scripts currently iterate over files in a flat folder. IDC downloads each DICOM series into its own subdirectory. The patch below (≤ 10 lines changed across 2 files) teaches GrandQC to recognise a directory whose contents end in `.dcm` as a valid slide entry and resolves the OpenSlide path accordingly. No model weights or algorithm logic are changed."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2.1 Clone and Install Dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "# Clone GrandQC (shallow clone — we only need the latest commit)\n",
+    "!git clone --depth 1 https://github.com/cpath-ukk/grandqc.git\n",
+    "\n",
+    "# Install GrandQC Python dependencies.\n",
+    "# PyTorch is already present in Colab; we install the remaining packages.\n",
+    "# segmentation-models-pytorch 0.3.1 is pinned because its model.predict() API\n",
+    "# was removed in 0.4.0 and GrandQC relies on it.\n",
+    "!pip install -q \\\n",
+    "    \"segmentation-models-pytorch==0.3.1\" \\\n",
+    "    \"rasterio>=1.3\" \\\n",
+    "    \"imagecodecs\" \\\n",
+    "    \"zarr<3\" \\\n",
+    "    \"tifffile\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2.2 Download Pre-trained Models"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "GRANDQC_SCRIPTS = \"grandqc/01_WSI_inference_OPENSLIDE_QC\"\n",
+    "MODELS_TD = os.path.join(GRANDQC_SCRIPTS, \"models\", \"td\")\n",
+    "MODELS_QC = os.path.join(GRANDQC_SCRIPTS, \"models\", \"qc\")\n",
+    "os.makedirs(MODELS_TD, exist_ok=True)\n",
+    "os.makedirs(MODELS_QC, exist_ok=True)\n",
+    "\n",
+    "# Tissue detection model (~27 MB) — Zenodo record 14507273\n",
+    "!wget -q --show-progress -O {MODELS_TD}/Tissue_Detection_MPP10.pth \\\n",
+    "    \"https://zenodo.org/records/14507273/files/Tissue_Detection_MPP10.pth?download=1\"\n",
+    "\n",
+    "# Artifact segmentation model at MPP=1.5 (~25 MB) — Zenodo record 14041538\n",
+    "# MPP=1.5 (≈7× objective) is the recommended default for most scanners.\n",
+    "!wget -q --show-progress -O {MODELS_QC}/GrandQC_MPP15.pth \\\n",
+    "    \"https://zenodo.org/records/14041538/files/GrandQC_MPP15.pth?download=1\"\n",
+    "\n",
+    "print(\"Models downloaded:\")\n",
+    "for f in [f\"{MODELS_TD}/Tissue_Detection_MPP10.pth\", f\"{MODELS_QC}/GrandQC_MPP15.pth\"]:\n",
+    "    size_mb = os.path.getsize(f) / 1024**2\n",
+    "    print(f\"  {f}  ({size_mb:.1f} MB)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2.3 Patch Scripts for DICOM Directory Support\n",
+    "\n",
+    "IDC downloads each series as a directory of tiled DICOM files (`*.dcm`). OpenSlide 4.0+ can open any one of those files and reconstruct the full pyramid from the same directory.\n",
+    "\n",
+    "The two changes per script are:\n",
+    "\n",
+    "| Script | Change |\n",
+    "|--------|--------|\n",
+    "| `wsi_tis_detect.py` | Replace file-only `os.listdir` filter with one that also yields subdirectories containing `.dcm` files; resolve the path before calling `OpenSlide()` |\n",
+    "| `main.py` | Same two changes using `open_slide()` |\n",
+    "\n",
+    "Both scripts also get automatic CUDA/MPS/CPU device selection instead of the hardcoded `'cuda'`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "def patch_grandqc_for_dicom(script_path: str) -> None:\n",
+    "    text = Path(script_path).read_text()\n",
+    "    original = text\n",
+    "\n",
+    "    # ── 1. Auto-detect compute device ──────────────────────────────────────\n",
+    "    text = text.replace(\n",
+    "        \"DEVICE = 'cuda'\",\n",
+    "        \"import torch as _t; DEVICE = ('cuda' if _t.cuda.is_available() \"\n",
+    "        \"else ('mps' if _t.backends.mps.is_available() else 'cpu'))\",\n",
+    "    )\n",
+    "\n",
+    "    # ── 2. Replace slide-list builder to also accept DICOM directories ──────\n",
+    "    DICOM_ITER = (\n",
+    "        \"def _iter_slides(d):\\n\"\n",
+    "        \"    for n in sorted(os.listdir(d)):\\n\"\n",
+    "        \"        p = os.path.join(d, n)\\n\"\n",
+    "        \"        if os.path.isfile(p) or (\\n\"\n",
+    "        \"                os.path.isdir(p) and\\n\"\n",
+    "        \"                any(x.lower().endswith('.dcm') for x in os.listdir(p))):\\n\"\n",
+    "        \"            yield n\\n\"\n",
+    "    )\n",
+    "\n",
+    "    # wsi_tis_detect.py pattern\n",
+    "    OLD_LIST_TIS = (\n",
+    "        \"slide_names = sorted([f for f in os.listdir(SLIDE_DIR) \"\n",
+    "        \"if os.path.isfile(os.path.join(SLIDE_DIR, f))])\"\n",
+    "    )\n",
+    "    NEW_LIST_TIS = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n",
+    "\n",
+    "    # main.py pattern\n",
+    "    OLD_LIST_MAIN = \"slide_names = sorted(os.listdir(SLIDE_DIR))\"\n",
+    "    NEW_LIST_MAIN = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n",
+    "\n",
+    "    text = text.replace(OLD_LIST_TIS, NEW_LIST_TIS)\n",
+    "    text = text.replace(OLD_LIST_MAIN, NEW_LIST_MAIN)\n",
+    "\n",
+    "    # ── 3. Resolve DICOM directory to first .dcm file before open ───────────\n",
+    "    RESOLVE = (\n",
+    "        \"        if os.path.isdir(path_slide):\\n\"\n",
+    "        \"            _dcms = sorted(x for x in os.listdir(path_slide)\"\n",
+    "        \" if x.lower().endswith('.dcm'))\\n\"\n",
+    "        \"            path_slide = os.path.join(path_slide, _dcms[0])\\n\"\n",
+    "    )\n",
+    "\n",
+    "    for open_call in (\"slide = OpenSlide(path_slide)\", \"slide = open_slide(path_slide)\"):\n",
+    "        old = f\"        path_slide = os.path.join(SLIDE_DIR, slide_name)\\n        {open_call}\"\n",
+    "        new = f\"        path_slide = os.path.join(SLIDE_DIR, slide_name)\\n{RESOLVE}        {open_call}\"\n",
+    "        text = text.replace(old, new)\n",
+    "\n",
+    "    if text == original:\n",
+    "        print(f\"  WARNING: no changes applied to {script_path}\")\n",
+    "    else:\n",
+    "        Path(script_path).write_text(text)\n",
+    "        n = sum(1 for a, b in zip(original.splitlines(), text.splitlines()) if a != b)\n",
+    "        n += abs(len(text.splitlines()) - len(original.splitlines()))\n",
+    "        print(f\"  Patched {script_path}  (+{n} lines changed)\")\n",
+    "\n",
+    "\n",
+    "for script in [\n",
+    "    f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\",\n",
+    "    f\"{GRANDQC_SCRIPTS}/main.py\",\n",
+    "]:\n",
+    "    patch_grandqc_for_dicom(script)\n",
+    "print(\"Done.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sanity-check: confirm the DICOM resolver was inserted in both scripts\n",
+    "for script in [f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\", f\"{GRANDQC_SCRIPTS}/main.py\"]:\n",
+    "    text = Path(script).read_text()\n",
+    "    has_dcm  = \"_iter_slides\" in text\n",
+    "    has_res  = \"_dcms\" in text\n",
+    "    has_dev  = \"cuda.is_available\" in text\n",
+    "    print(f\"{script.split('/')[-1]:30s}  dicom_iter={has_dcm}  dcm_resolve={has_res}  auto_device={has_dev}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Part 3: Download Slides from IDC\n",
+    "\n",
+    "`idc_client.download_from_selection()` fetches DICOM files for the requested series and places each series in its own subdirectory named by `SeriesInstanceUID`.\n",
+    "\n",
+    "After downloading we rename each subdirectory to its `ContainerIdentifier` (the original TCGA slide barcode). This ensures that GrandQC's output files and the pre-computed TCGA mask filenames are consistently keyed by the human-readable barcode rather than an opaque UID."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3.1 Download DICOM Series"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "SLIDE_DIR = \"slides\"\n",
+    "os.makedirs(SLIDE_DIR, exist_ok=True)\n",
+    "\n",
+    "print(f\"Downloading {len(demo_slides)} slides  →  {SLIDE_DIR}/\")\n",
+    "print(\"(This may take a few minutes depending on file sizes and network speed)\\n\")\n",
+    "\n",
+    "idc_client.download_from_selection(\n",
+    "    downloadDir=SLIDE_DIR,\n",
+    "    seriesInstanceUID=demo_slides[\"SeriesInstanceUID\"].tolist(),\n",
+    ")\n",
+    "print(\"\\nDownload complete.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3.2 Rename Series Directories to TCGA Barcodes\n",
+    "\n",
+    "GrandQC uses the directory/file name as the key for all output files (tissue mask, QC map, overlay, TSV report row). By renaming each downloaded series directory from its `SeriesInstanceUID` to its `ContainerIdentifier` (TCGA barcode), we get:\n",
+    "\n",
+    "- Human-readable output filenames (`TCGA-A7-A26J-01B-02-BS2_MASK.png` instead of a UID)\n",
+    "- Automatic matching with GrandQC's pre-computed TCGA masks (Part 6), which are keyed by the same barcode"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Build SeriesInstanceUID → ContainerIdentifier map\n",
+    "uid_to_barcode = dict(zip(demo_slides[\"SeriesInstanceUID\"], demo_slides[\"ContainerIdentifier\"]))\n",
+    "\n",
+    "for uid, barcode in uid_to_barcode.items():\n",
+    "    src = os.path.join(SLIDE_DIR, uid)\n",
+    "    dst = os.path.join(SLIDE_DIR, barcode)\n",
+    "    if os.path.isdir(src):\n",
+    "        os.rename(src, dst)\n",
+    "        print(f\"  {uid}  →  {barcode}\")\n",
+    "    elif os.path.isdir(dst):\n",
+    "        print(f\"  Already renamed: {barcode}\")\n",
+    "    else:\n",
+    "        print(f\"  WARNING: directory not found: {uid}\")\n",
+    "\n",
+    "print(\"\\nSlide directories after renaming:\")\n",
+    "for d in sorted(os.listdir(SLIDE_DIR)):\n",
+    "    dpath = os.path.join(SLIDE_DIR, d)\n",
+    "    if os.path.isdir(dpath):\n",
+    "        n = sum(1 for f in os.listdir(dpath) if f.lower().endswith(\".dcm\"))\n",
+    "        print(f\"  {d}/  ({n} .dcm files)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3.3 Verify OpenSlide Can Read the DICOM Files\n",
+    "\n",
+    "OpenSlide 4.0 (bundled in `openslide-bin`) reads DICOM WSI files natively. Opening any one `.dcm` file from a series directory is sufficient — OpenSlide discovers the remaining pyramid levels from the other files in the same directory that share the same `SeriesInstanceUID`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import openslide\n",
+    "\n",
+    "print(f\"OpenSlide version: {openslide.__library_version__}\\n\")\n",
+    "\n",
+    "for barcode in uid_to_barcode.values():\n",
+    "    slide_dir = os.path.join(SLIDE_DIR, barcode)\n",
+    "    dcm_files = sorted(f for f in os.listdir(slide_dir) if f.lower().endswith(\".dcm\"))\n",
+    "    if not dcm_files:\n",
+    "        print(f\"  {barcode}: no .dcm files found\")\n",
+    "        continue\n",
+    "    entry = os.path.join(slide_dir, dcm_files[0])\n",
+    "    try:\n",
+    "        slide = openslide.OpenSlide(entry)\n",
+    "        w, h = slide.level_dimensions[0]\n",
+    "        mpp = slide.properties.get(\"openslide.mpp-x\", \"N/A\")\n",
+    "        vendor = slide.properties.get(\"openslide.vendor\", \"N/A\")\n",
+    "        print(f\"  {barcode}\")\n",
+    "        print(f\"    {w:,} × {h:,} px  |  MPP={mpp}  |  vendor={vendor}  |  levels={slide.level_count}\")\n",
+    "        slide.close()\n",
+    "    except Exception as exc:\n",
+    "        print(f\"  {barcode}: ERROR — {exc}\")\n",
+    "\n",
+    "print(\"\\nAll slides verified.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Part 4: Run GrandQC Quality Control\n",
+    "\n",
+    "GrandQC runs as two sequential scripts, both inside `grandqc/01_WSI_inference_OPENSLIDE_QC/`:\n",
+    "\n",
+    "| Step | Script | MPP | Time (GPU) | Output |\n",
+    "|------|--------|-----|------------|--------|\n",
+    "| 1 | `wsi_tis_detect.py` | 10 | ~0.4 s/slide | Binary tissue masks |\n",
+    "| 2 | `main.py` | 1.5 | ~30–45 s/slide | 7-class artifact maps, GeoJSON, TSV report |\n",
+    "\n",
+    "The 7 artifact classes are: **1** clean tissue · **2** folds · **3** dark spots · **4** pen marks · **5** air bubbles/edges · **6** out-of-focus · **7** background.\n",
+    "\n",
+    "Both scripts load models via relative paths (`./models/…`) and must therefore be invoked with their directory as the working directory. We use `subprocess` with `cwd=GRANDQC_SCRIPTS` to achieve this while keeping output streaming to the cell."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 4.1 Step 1 — Tissue Detection\n",
+    "\n",
+    "Produces a low-resolution binary mask (tissue vs. background) for each slide. The mask is saved as a PNG and re-used by Step 2 to skip background patches and cut inference time."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import subprocess, sys\n",
+    "\n",
+    "OUTPUT_DIR = \"qc_output\"\n",
+    "os.makedirs(OUTPUT_DIR, exist_ok=True)\n",
+    "\n",
+    "slide_dir_abs  = os.path.abspath(SLIDE_DIR)\n",
+    "output_dir_abs = os.path.abspath(OUTPUT_DIR)\n",
+    "scripts_abs    = os.path.abspath(GRANDQC_SCRIPTS)\n",
+    "\n",
+    "print(f\"Slides:          {slide_dir_abs}\")\n",
+    "print(f\"Output:          {output_dir_abs}\")\n",
+    "print(f\"Scripts dir:     {scripts_abs}\")\n",
+    "print()\n",
+    "print(\"Running wsi_tis_detect.py ...\")\n",
+    "\n",
+    "result = subprocess.run(\n",
+    "    [sys.executable, \"wsi_tis_detect.py\",\n",
+    "     \"--slide_folder\", slide_dir_abs,\n",
+    "     \"--output_dir\",   output_dir_abs],\n",
+    "    cwd=scripts_abs,\n",
+    ")\n",
+    "\n",
+    "if result.returncode != 0:\n",
+    "    print(f\"\\nERROR: wsi_tis_detect.py exited with code {result.returncode}\")\n",
+    "else:\n",
+    "    mask_dir = os.path.join(output_dir_abs, \"tis_det_mask\")\n",
+    "    masks = sorted(os.listdir(mask_dir))\n",
+    "    print(f\"\\nTissue detection complete — {len(masks)} mask(s) written to {mask_dir}/\")\n",
+    "    for m in masks:\n",
+    "        print(f\"  {m}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 4.2 Step 2 — Artifact Segmentation\n",
+    "\n",
+    "Runs the 7-class artifact model at MPP=1.5. For each slide it produces:\n",
+    "\n",
+    "- `maps_qc/<barcode>_map_QC.png` — color-coded artifact map (same resolution as tissue mask)\n",
+    "- `overlays_qc/<barcode>_overlay_QC.jpg` — artifact map overlaid on a downscaled slide thumbnail\n",
+    "- `mask_qc/<barcode>_mask.png` — raw integer mask (class 1–7 per pixel)\n",
+    "- `geojson_qc/<barcode>.geojson` — polygon annotations compatible with QuPath and SLIM\n",
+    "- `report_qc_output_0_N_stats_per_slide.txt` — TSV with per-slide dimensions and processing time\n",
+    "\n",
+    "> **GPU note**: On a T4 GPU (Colab free tier), expect ~30–45 s per slide. On CPU only, plan for 10–30 min per slide depending on size."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"Running main.py (artifact segmentation) ...\")\n",
+    "\n",
+    "result = subprocess.run(\n",
+    "    [sys.executable, \"main.py\",\n",
+    "     \"--slide_folder\", slide_dir_abs,\n",
+    "     \"--output_dir\",   output_dir_abs,\n",
+    "     \"--mpp_model\",    \"1.5\",\n",
+    "     \"--create_geojson\", \"Y\"],\n",
+    "    cwd=scripts_abs,\n",
+    ")\n",
+    "\n",
+    "if result.returncode != 0:\n",
+    "    print(f\"\\nERROR: main.py exited with code {result.returncode}\")\n",
+    "else:\n",
+    "    maps_dir = os.path.join(output_dir_abs, \"maps_qc\")\n",
+    "    maps = sorted(os.listdir(maps_dir))\n",
+    "    print(f\"\\nArtifact segmentation complete — {len(maps)} map(s) in {maps_dir}/\")\n",
+    "    for m in maps:\n",
+    "        print(f\"  {m}\")\n",
+    "\n",
+    "    # Locate the TSV report (name includes the slide count)\n",
+    "    reports = [f for f in os.listdir(output_dir_abs) if f.endswith(\"_stats_per_slide.txt\")]\n",
+    "    if reports:\n",
+    "        import pandas as pd\n",
+    "        report_path = os.path.join(output_dir_abs, reports[0])\n",
+    "        df = pd.read_csv(report_path, sep=\"\\t\")\n",
+    "        print(f\"\\nPer-slide report ({reports[0]}):\")\n",
+    "        display(df[[\"slide_name\", \"mpp\", \"height\", \"width\", \"time\"]])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Part 5: Visualize and Analyze Results\n",
+    "\n",
+    "GrandQC writes several output types per slide. We look at three of them:\n",
+    "\n",
+    "1. **Tissue detection overlays** — confirms that tissue vs. background separation was correct\n",
+    "2. **Artifact segmentation maps** — color-coded composite of the 7-class artifact model overlaid on a slide thumbnail\n",
+    "3. **Raw integer masks** (`mask_qc/`) — pixel-level class labels (1–7) used to compute per-slide artifact fractions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.1 Tissue Detection Overlays\n",
+    "\n",
+    "Each overlay blends the original thumbnail (70 %) with a two-class color mask (30 %): blue = tissue, gray = background."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import matplotlib.image as mpimg\n",
+    "\n",
+    "overlay_dir = os.path.join(OUTPUT_DIR, \"tis_det_overlay\")\n",
+    "fnames = sorted(os.listdir(overlay_dir))\n",
+    "\n",
+    "fig, axes = plt.subplots(1, len(fnames), figsize=(6 * len(fnames), 5))\n",
+    "if len(fnames) == 1:\n",
+    "    axes = [axes]\n",
+    "for ax, fname in zip(axes, fnames):\n",
+    "    ax.imshow(mpimg.imread(os.path.join(overlay_dir, fname)))\n",
+    "    ax.set_title(fname.replace(\"_OVERLAY.jpg\", \"\"), fontsize=7)\n",
+    "    ax.axis(\"off\")\n",
+    "plt.suptitle(\"Step 1 — Tissue Detection\", fontsize=11, fontweight=\"bold\")\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.2 Artifact Segmentation Maps\n",
+    "\n",
+    "The QC overlay composites the artifact color map onto a downscaled thumbnail of the original slide. Colors match the `wsi_colors.py` palette used by GrandQC:\n",
+    "\n",
+    "| Class | Color | Artifact |\n",
+    "|-------|-------|---------|\n",
+    "| 1 | ■ Gray `#808080` | Clean tissue |\n",
+    "| 2 | ■ Tomato `#FF6347` | Folds |\n",
+    "| 3 | ■ Lime `#00FF00` | Dark spots |\n",
+    "| 4 | ■ Red `#FF0000` | Pen marks |\n",
+    "| 5 | ■ Magenta `#FF00FF` | Air bubbles / slide edges |\n",
+    "| 6 | ■ Indigo `#4B0082` | Out-of-focus |\n",
+    "| 7 | □ White `#FFFFFF` | Background |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "overlay_qc_dir = os.path.join(OUTPUT_DIR, \"overlays_qc\")\n",
+    "fnames_qc = sorted(os.listdir(overlay_qc_dir))\n",
+    "\n",
+    "fig, axes = plt.subplots(1, len(fnames_qc), figsize=(6 * len(fnames_qc), 5))\n",
+    "if len(fnames_qc) == 1:\n",
+    "    axes = [axes]\n",
+    "for ax, fname in zip(axes, fnames_qc):\n",
+    "    ax.imshow(mpimg.imread(os.path.join(overlay_qc_dir, fname)))\n",
+    "    ax.set_title(fname.replace(\"_overlay_QC.jpg\", \"\"), fontsize=7)\n",
+    "    ax.axis(\"off\")\n",
+    "plt.suptitle(\"Step 2 — Artifact Segmentation Overlays\", fontsize=11, fontweight=\"bold\")\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.3 Artifact Fractions per Slide\n",
+    "\n",
+    "We read the raw integer masks from `mask_qc/` (pixel values 1–7) and compute what fraction of the **tissue area** (excluding background, class 7) belongs to each artifact class. This is the standard metric reported in the GrandQC paper."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from PIL import Image\n",
+    "\n",
+    "# GrandQC wsi_colors.py palette — RGB, indices 0-5 correspond to classes 1-6\n",
+    "GRANDQC_COLORS_RGB = [\n",
+    "    [128, 128, 128],  # 1 Clean tissue\n",
+    "    [255,  99,  71],  # 2 Folds\n",
+    "    [  0, 255,   0],  # 3 Dark spots\n",
+    "    [255,   0,   0],  # 4 Pen marks\n",
+    "    [255,   0, 255],  # 5 Bubbles/edges\n",
+    "    [ 75,   0, 130],  # 6 Out-of-focus\n",
+    "]\n",
+    "CLASS_NAMES_SHORT = [\n",
+    "    \"Clean tissue\", \"Folds\", \"Dark spots\",\n",
+    "    \"Pen marks\", \"Bubbles/edges\", \"Out-of-focus\",\n",
+    "]\n",
+    "# Normalize to [0,1] for matplotlib\n",
+    "mpl_colors = [[r/255, g/255, b/255] for r, g, b in GRANDQC_COLORS_RGB]\n",
+    "\n",
+    "mask_dir = os.path.join(OUTPUT_DIR, \"mask_qc\")\n",
+    "rows = []\n",
+    "for fname in sorted(os.listdir(mask_dir)):\n",
+    "    barcode = fname.replace(\"_mask.png\", \"\")\n",
+    "    mask = np.array(Image.open(os.path.join(mask_dir, fname)))\n",
+    "    tissue_px = np.sum(mask != 7)  # exclude background\n",
+    "    if tissue_px == 0:\n",
+    "        continue\n",
+    "    row = {\"slide\": barcode}\n",
+    "    for cls_idx, name in enumerate(CLASS_NAMES_SHORT, start=1):\n",
+    "        row[name] = np.sum(mask == cls_idx) / tissue_px * 100\n",
+    "    rows.append(row)\n",
+    "\n",
+    "stats_df = pd.DataFrame(rows).set_index(\"slide\")\n",
+    "print(\"Artifact class fractions (% of tissue area):\")\n",
+    "display(stats_df.round(1))\n",
+    "\n",
+    "# ── Stacked bar chart ────────────────────────────────────────────────────────\n",
+    "ax = stats_df[CLASS_NAMES_SHORT].plot(\n",
+    "    kind=\"bar\",\n",
+    "    stacked=True,\n",
+    "    color=mpl_colors,\n",
+    "    figsize=(max(6, 2.5 * len(stats_df)), 5),\n",
+    "    edgecolor=\"none\",\n",
+    ")\n",
+    "ax.set_ylabel(\"% of tissue area\")\n",
+    "ax.set_xlabel(\"\")\n",
+    "ax.set_title(\"GrandQC Artifact Fractions per Slide\", fontweight=\"bold\")\n",
+    "ax.set_xticklabels(ax.get_xticklabels(), rotation=25, ha=\"right\", fontsize=8)\n",
+    "ax.legend(loc=\"upper right\", fontsize=8, framealpha=0.8)\n",
+    "ax.set_ylim(0, 110)\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.4 View Original Slides in IDC SLIM Viewer\n",
+    "\n",
+    "The IDC SLIM viewer is a web-based pathology viewer for DICOM whole slide images. The links below open each slide at full resolution. The GeoJSON annotations produced by GrandQC (`geojson_qc/`) can be loaded into QuPath or other tools that accept GeoJSON overlays for interactive artifact review."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from IPython.display import IFrame, display\n",
+    "\n",
+    "print(\"IDC SLIM viewer links (open in browser for full-resolution viewing):\\n\")\n",
+    "for _, row in demo_slides.iterrows():\n",
+    "    url = idc_client.get_viewer_URL(seriesInstanceUID=row[\"SeriesInstanceUID\"])\n",
+    "    print(f\"  {row['ContainerIdentifier']}\")\n",
+    "    print(f\"    {url}\\n\")\n",
+    "\n",
+    "# Embed the first slide inline (750 px height)\n",
+    "first_url = idc_client.get_viewer_URL(seriesInstanceUID=demo_slides.iloc[0][\"SeriesInstanceUID\"])\n",
+    "print(f\"Embedding: {demo_slides.iloc[0]['ContainerIdentifier']}\")\n",
+    "display(IFrame(first_url, width=\"100%\", height=750))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Part 6: Validating the Pipeline Against Pre-computed TCGA Masks\n",
+    "\n",
+    "In Parts 3–5 we ran GrandQC locally on IDC DICOM slides using OpenSlide's native DICOM reader — a path the GrandQC authors did not test in their original publication. Before using the pipeline on new data, it is good practice to verify that our results are consistent with the authors' reference outputs.\n",
+    "\n",
+    "The GrandQC authors published pre-computed QC masks for all 32 TCGA cohorts on Zenodo:\n",
+    "\n",
+    "> **GrandQC pre-computed masks**: [zenodo.org/records/14041578](https://zenodo.org/records/14041578)  \n",
+    "> 36 tar archives — one per TCGA cohort (some large cohorts split into two parts) — totalling ~18.3 GB.\n",
+    "\n",
+    "Each archive unpacks to PNG files named `{ContainerIdentifier}_mask.png` with integer labels 1–7 — the same format and naming convention produced by `main.py`. By downloading the TCGA-BRCA archive and comparing artifact fractions for our demo slides against the reference, we can confirm that the DICOM-based pipeline is producing equivalent results."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 6.1 Available Cohort Archives\n",
+    "\n",
+    "The Zenodo record contains one tar file per TCGA cohort. Large cohorts (BRCA, THCA, GBM, LUAD, UCEC, LGG, TGCT) are split into two parts."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "\n",
+    "# Fetch the Zenodo record metadata to get the authoritative file list\n",
+    "rec = requests.get(\"https://zenodo.org/api/records/14041578\").json()\n",
+    "files = rec[\"files\"]\n",
+    "\n",
+    "rows_z = []\n",
+    "for f in sorted(files, key=lambda x: x[\"key\"]):\n",
+    "    rows_z.append({\n",
+    "        \"archive\": f[\"key\"],\n",
+    "        \"size_MB\": round(f[\"size\"] / 1024**2, 0),\n",
+    "    })\n",
+    "\n",
+    "zenodo_df = pd.DataFrame(rows_z)\n",
+    "total_gb  = zenodo_df[\"size_MB\"].sum() / 1024\n",
+    "print(f\"{len(zenodo_df)} archives  |  total: {total_gb:.1f} GB\\n\")\n",
+    "display(zenodo_df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 6.2 Download the TCGA-BRCA Reference Masks\n",
+    "\n",
+    "`BRCA.tar` (~2 GB) contains masks for all TCGA-BRCA slides. After extraction, each file is named `{ContainerIdentifier}_mask.png` — identical to what GrandQC writes into `mask_qc/` locally."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "REF_MASKS_DIR = \"brca_ref_masks\"\n",
+    "os.makedirs(REF_MASKS_DIR, exist_ok=True)\n",
+    "\n",
+    "# Download BRCA.tar (~2 GB) — takes 5-10 min on Colab\n",
+    "!wget -q --show-progress \\\n",
+    "    \"https://zenodo.org/records/14041578/files/BRCA.tar?download=1\" \\\n",
+    "    -O BRCA.tar\n",
+    "\n",
+    "# Extract — PNG files land directly in the archive root\n",
+    "!tar xf BRCA.tar -C {REF_MASKS_DIR}\n",
+    "\n",
+    "# Report what we got\n",
+    "ref_masks = sorted(f for f in os.listdir(REF_MASKS_DIR) if f.endswith(\"_mask.png\"))\n",
+    "print(f\"\\nExtracted {len(ref_masks)} reference masks to {REF_MASKS_DIR}/\")\n",
+    "print(\"Sample filenames:\")\n",
+    "for f in ref_masks[:5]:\n",
+    "    print(f\"  {f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 6.3 Pipeline Validation: Compare Local vs. Reference Masks\n",
+    "\n",
+    "The reference masks in `brca_ref_masks/` were produced by the GrandQC authors running on TIFF/SVS files. Our local masks in `qc_output/mask_qc/` were produced by the same model running on IDC DICOM files via OpenSlide's native DICOM reader.\n",
+    "\n",
+    "We compare per-class artifact fractions for each demo slide. Close agreement (< 2 percentage points per class) confirms that the DICOM-based path does not introduce bias relative to the authors' original pipeline."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def mask_fractions(mask_path):\n",
+    "    mask = np.array(Image.open(mask_path))\n",
+    "    tissue_px = np.sum(mask != 7)\n",
+    "    if tissue_px == 0:\n",
+    "        return None\n",
+    "    return {name: np.sum(mask == cls) / tissue_px * 100\n",
+    "            for cls, name in enumerate(CLASS_NAMES_SHORT, start=1)}\n",
+    "\n",
+    "\n",
+    "# Build comparison table for each demo slide\n",
+    "compare_rows = []\n",
+    "for barcode in uid_to_barcode.values():\n",
+    "    local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n",
+    "    ref_path   = os.path.join(REF_MASKS_DIR, f\"{barcode}_mask.png\")\n",
+    "\n",
+    "    if not os.path.exists(local_path):\n",
+    "        print(f\"  MISSING local mask: {barcode}\")\n",
+    "        continue\n",
+    "    if not os.path.exists(ref_path):\n",
+    "        print(f\"  MISSING reference mask: {barcode} — slide may not be in BRCA.tar\")\n",
+    "        continue\n",
+    "\n",
+    "    local_frac = mask_fractions(local_path)\n",
+    "    ref_frac   = mask_fractions(ref_path)\n",
+    "    if local_frac is None or ref_frac is None:\n",
+    "        continue\n",
+    "\n",
+    "    for name in CLASS_NAMES_SHORT:\n",
+    "        compare_rows.append({\n",
+    "            \"slide\":    barcode,\n",
+    "            \"class\":    name,\n",
+    "            \"local_%\":  round(local_frac[name], 2),\n",
+    "            \"ref_%\":    round(ref_frac[name], 2),\n",
+    "            \"delta_pp\": round(local_frac[name] - ref_frac[name], 2),\n",
+    "        })\n",
+    "\n",
+    "compare_df = pd.DataFrame(compare_rows)\n",
+    "print(\"Per-class fraction comparison (local DICOM run vs. Zenodo reference):\\n\")\n",
+    "display(compare_df.pivot_table(index=\"slide\", columns=\"class\", values=\"delta_pp\").round(2))\n",
+    "print(\"\\nδ = local − reference (percentage points).  Values near 0 confirm agreement.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Scatter plot: local fraction vs. reference fraction for all classes × slides\n",
+    "fig, ax = plt.subplots(figsize=(5, 5))\n",
+    "\n",
+    "class_colors = {name: mpl_colors[i] for i, name in enumerate(CLASS_NAMES_SHORT)}\n",
+    "\n",
+    "for cls_name, grp in compare_df.groupby(\"class\"):\n",
+    "    ax.scatter(grp[\"ref_%\"], grp[\"local_%\"],\n",
+    "               label=cls_name, color=class_colors[cls_name],\n",
+    "               s=60, edgecolors=\"black\", linewidths=0.4, zorder=3)\n",
+    "\n",
+    "# Identity line\n",
+    "lim_max = max(compare_df[[\"local_%\", \"ref_%\"]].max()) * 1.05\n",
+    "ax.plot([0, lim_max], [0, lim_max], \"k--\", linewidth=0.8, label=\"y = x\")\n",
+    "ax.set_xlabel(\"Reference fraction (%) — Zenodo\")\n",
+    "ax.set_ylabel(\"Local fraction (%) — DICOM run\")\n",
+    "ax.set_title(\"Pipeline Validation: Local vs. Reference Artifact Fractions\",\n",
+    "             fontweight=\"bold\", fontsize=9)\n",
+    "ax.legend(fontsize=7, framealpha=0.8)\n",
+    "ax.set_xlim(0, lim_max)\n",
+    "ax.set_ylim(0, lim_max)\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 6.4 Cohort-level QC Statistics\n",
+    "\n",
+    "With validation confirmed, we compute per-slide artifact fractions across all TCGA-BRCA slides in the reference archive. This gives a population-level view of slide quality without running any inference locally — each mask is read once and discarded, keeping memory usage constant."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tqdm.notebook import tqdm\n",
+    "\n",
+    "cohort_rows = []\n",
+    "for fname in tqdm(ref_masks, desc=\"Computing stats\"):\n",
+    "    barcode = fname.replace(\"_mask.png\", \"\")\n",
+    "    fracs = mask_fractions(os.path.join(REF_MASKS_DIR, fname))\n",
+    "    if fracs is not None:\n",
+    "        fracs[\"slide\"] = barcode\n",
+    "        cohort_rows.append(fracs)\n",
+    "\n",
+    "cohort_df = pd.DataFrame(cohort_rows).set_index(\"slide\")\n",
+    "\n",
+    "print(f\"Cohort: TCGA-BRCA  |  {len(cohort_df)} slides\\n\")\n",
+    "print(\"Median artifact fractions (% of tissue area):\")\n",
+    "display(cohort_df[CLASS_NAMES_SHORT].median().rename(\"median_%\").to_frame().T.round(1))\n",
+    "print()\n",
+    "print(\"90th-percentile artifact fractions:\")\n",
+    "display(cohort_df[CLASS_NAMES_SHORT].quantile(0.9).rename(\"p90_%\").to_frame().T.round(1))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Violin plot of per-slide artifact fractions across TCGA-BRCA\n",
+    "# Exclude Clean tissue (class 1) — typically >80%, dominates the y-axis\n",
+    "artifact_classes = [c for c in CLASS_NAMES_SHORT if c != \"Clean tissue\"]\n",
+    "artifact_colors  = [mpl_colors[i] for i, c in enumerate(CLASS_NAMES_SHORT) if c != \"Clean tissue\"]\n",
+    "\n",
+    "fig, ax = plt.subplots(figsize=(9, 4))\n",
+    "data    = [cohort_df[cls].dropna().values for cls in artifact_classes]\n",
+    "parts   = ax.violinplot(data, positions=range(len(artifact_classes)),\n",
+    "                        showmedians=True, showextrema=False)\n",
+    "\n",
+    "for pc, color in zip(parts[\"bodies\"], artifact_colors):\n",
+    "    pc.set_facecolor(color)\n",
+    "    pc.set_alpha(0.75)\n",
+    "parts[\"cmedians\"].set_color(\"black\")\n",
+    "parts[\"cmedians\"].set_linewidth(1.5)\n",
+    "\n",
+    "ax.set_xticks(range(len(artifact_classes)))\n",
+    "ax.set_xticklabels(artifact_classes, rotation=20, ha=\"right\")\n",
+    "ax.set_ylabel(\"% of tissue area\")\n",
+    "ax.set_title(f\"Artifact fraction distribution — TCGA-BRCA ({len(cohort_df)} slides)\",\n",
+    "             fontweight=\"bold\")\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Join QC stats with IDC sm_index metadata via ContainerIdentifier\n",
+    "cohort_meta = cohort_df.reset_index().rename(columns={\"slide\": \"ContainerIdentifier\"})\n",
+    "brca_meta   = tcga_brca_he[[\"ContainerIdentifier\", \"tissue_type\"]].dropna(subset=[\"tissue_type\"])\n",
+    "\n",
+    "merged = cohort_meta.merge(brca_meta, on=\"ContainerIdentifier\", how=\"inner\")\n",
+    "\n",
+    "print(f\"Slides with tissue-type annotation: {len(merged)} / {len(cohort_df)}\\n\")\n",
+    "\n",
+    "# Median artifact fractions by tissue type\n",
+    "by_tissue = (\n",
+    "    merged.groupby(\"tissue_type\")[CLASS_NAMES_SHORT]\n",
+    "    .median()\n",
+    "    .round(1)\n",
+    ")\n",
+    "print(\"Median artifact fractions (%) by tissue type:\")\n",
+    "display(by_tissue)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}

From c410ab19f3d692251eb2e68725c8fb424b545885 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Mon, 15 Jun 2026 14:25:42 -0400
Subject: [PATCH 2/9] Fix GrandQC artifact segmentation step and document
 inference resolution

main.py loads a fully-pickled QC model via torch.load(), which fails with
exit code 1 under PyTorch >=2.6 (weights_only=True default). Patch the load
to weights_only=False so the trusted Zenodo weights unpickle. Also add
[i/N] progress reporting and a failure guard that records per-slide errors
and exits non-zero, so the notebook surfaces failures the loop used to swallow.

Add grandqc_slide_quality_resolution_notes.md documenting the model working
resolution (MPP 1.5) vs. the fixed 512x512 input tile, how MPP is read from
openslide.mpp-x (DICOM PixelSpacing), and an empirical confirmation on an IDC
slide; link it from the Part 4 intro.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../grandqc_slide_quality_resolution_notes.md |  178 +++
 .../grandqc_slide_quality_with_idc.ipynb      | 1181 +++++++++++++++--
 2 files changed, 1235 insertions(+), 124 deletions(-)
 create mode 100644 notebooks/pathomics/grandqc_slide_quality_resolution_notes.md

diff --git a/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md
new file mode 100644
index 0000000..0206cf2
--- /dev/null
+++ b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md
@@ -0,0 +1,178 @@
+# GrandQC + IDC: Notes on Inference Resolution
+
+Supplementary notes for [`grandqc_slide_quality_with_idc.ipynb`](grandqc_slide_quality_with_idc.ipynb).
+
+This document explains, for the GrandQC inference scripts in
+`grandqc/01_WSI_inference_OPENSLIDE_QC/`, **what resolution the model runs at,
+where that is controlled, and how the per-slide microns-per-pixel (MPP) value is
+determined** when slides are read from NCI Imaging Data Commons (IDC) as DICOM.
+
+---
+
+## Two distinct resolutions
+
+There are two separate resolution concepts in the pipeline. They are easy to
+conflate but are set in different places.
+
+| Resolution concept | Value (this tutorial) | Where it is set |
+|---|---|---|
+| **Model working resolution** (MPP the network "sees") | `1.5` µm/px (≈ 7× objective) | `--mpp_model` CLI flag → `MPP_MODEL` in `main.py` |
+| **Tissue-detection resolution** (Step 1) | `10` µm/px | hardcoded in `wsi_tis_detect.py` |
+| **Model input tile size** | `512 × 512` px | `M_P_S_MODEL = 512`, hardcoded in `main.py` |
+| **Native read size per tile** | `mpp_model / mpp × 512` px (level 0) | computed in `wsi_slide_info.py:slide_info()` |
+
+The model **never** runs at the slide's native 20×/40× magnification. The tissue
+detector runs coarsest (MPP 10); the artifact model runs at MPP 1.5.
+
+---
+
+## 1. Model working resolution — `--mpp_model`
+
+The artifact-segmentation step (notebook Part 4.2) is invoked as:
+
+```python
+subprocess.run([sys.executable, "main.py",
+    "--slide_folder", slide_dir_abs,
+    "--output_dir",   output_dir_abs,
+    "--mpp_model",    "1.5",
+    "--create_geojson", "Y"], cwd=scripts_abs)
+```
+
+`--mpp_model` sets `MPP_MODEL` in `main.py` and does two things:
+
+1. **Selects the model weights** — only three values are valid:
+
+   | `--mpp_model` | weights file |
+   |---|---|
+   | `1.0` | `GrandQC_MPP1.pth` |
+   | `1.5` | `GrandQC_MPP15.pth` |
+   | `2.0` | `GrandQC_MPP2.pth` |
+
+   Any other value raises `Exception("mpp of the model can only be 1.0, 1.5, 2.0")`.
+
+2. **Sets the magnification slides are read at** (see §3).
+
+> **To change the inference resolution** you must set `--mpp_model` to `1.0` or
+> `2.0` **and** download the matching weights. The notebook's model-download cell
+> (Part 2.2) only fetches `GrandQC_MPP15.pth`, so switching MPP requires adding the
+> corresponding Zenodo download.
+
+---
+
+## 2. Model input tile size — fixed at 512×512
+
+Hardcoded in `main.py`; not a CLI flag:
+
+```python
+M_P_S_MODEL = 512   # the network always receives 512×512 px tiles
+```
+
+Every tile is downsampled to 512×512 before inference (§3). Changing this would
+require retraining and should not be altered.
+
+---
+
+## 3. How the two combine
+
+`wsi_slide_info.py:slide_info()` converts the model MPP into a patch size measured
+in the slide's **native level-0 pixels**:
+
+```python
+mpp = float(slide.properties["openslide.mpp-x"])   # slide's native MPP
+p_s = int(mpp_model / mpp * m_p_s)                  # patch size in level-0 px
+```
+
+Then `wsi_process.py:slide_process_single()` reads that region at full resolution
+and downsamples it to the fixed model input size:
+
+```python
+work_patch = slide.read_region((w, h), 0, (p_s, p_s))                  # read p_s × p_s from level 0
+work_patch = work_patch.resize((m_p_s, m_p_s), Image.Resampling.LANCZOS)  # → 512×512
+predictions = model.predict(x_tensor)                                  # inference at 512×512 (= MPP 1.5)
+```
+
+**Worked example** — a 40× TCGA-BRCA slide with native MPP ≈ `0.2485`, `mpp_model = 1.5`:
+
+```
+p_s = int(1.5 / 0.2485 * 512) = int(3091.0) = 3091 px   # read from level 0
+     → downsampled to 512×512 → fed to the network
+```
+
+The downsampling makes the effective resolution seen by the network MPP 1.5,
+regardless of whether the slide was scanned at 20× or 40×.
+
+---
+
+## 4. How MPP is determined
+
+MPP is **read from OpenSlide**, not computed. The only place it is obtained is the
+single line in `wsi_slide_info.py:slide_info()`:
+
+```python
+mpp = float(slide.properties["openslide.mpp-x"])
+```
+
+Key points:
+
+- It is **read, not derived** from objective power. Only the X axis is used
+  (square pixels are assumed; `mpp-y` is ignored).
+- **No fallback for MPP.** Note the asymmetry in the same function — objective
+  power has a fallback, MPP does not:
+
+  ```python
+  try:
+      obj_power = slide.properties["openslide.objective-power"]
+  except:
+      obj_power = 99            # objective power: falls back to 99
+
+  mpp = float(slide.properties["openslide.mpp-x"])   # MPP: no fallback
+  ```
+
+  If `openslide.mpp-x` is missing/`None`, `float(...)` raises. With the
+  failure-guard patch added in the notebook, this is caught and reported per
+  slide rather than silently mis-scaling.
+- `openslide.mpp-x` is the **single point of failure** for scaling any new slide:
+  if it is wrong or absent, `p_s` is sized wrong and the model sees the slide at
+  the wrong scale.
+
+### For IDC DICOM slides specifically
+
+OpenSlide's DICOM driver (added in OpenSlide 4.0.0) populates `openslide.mpp-x`
+from the DICOM `PixelSpacing` attribute (mm → µm).
+
+**Verified empirically** on IDC slide `TCGA-A7-A26J-01B-02-BS2`
+(the 9.1 MB series used in the notebook), downloaded with `idc-index` and opened
+with OpenSlide 4.0.1:
+
+```
+openslide.vendor          = dicom
+openslide.mpp-x           = 0.24850000000000003
+openslide.mpp-y           = 0.24850000000000003
+openslide.objective-power = 40
+```
+
+The raw DICOM source attribute is exposed on the same slide:
+
+```
+dicom.SharedFunctionalGroupsSequence[0].PixelMeasuresSequence[0].PixelSpacing[0] = .0002485   # mm
+```
+
+and `0.0002485 mm × 1000 = 0.2485 µm/px`, which matches `openslide.mpp-x` exactly.
+This also matches the `pixel_spacing_mm = 0.00025` shown in the notebook's
+metadata query (the 2-significant-figure rounded form).
+
+The notebook's Part 3.3 verification cell already surfaces this value with
+`slide.properties.get("openslide.mpp-x", "N/A")`, so you can inspect MPP per slide
+before running Step 2.
+
+---
+
+## Quick reference: changing resolution safely
+
+1. Pick a supported MPP: `1.0`, `1.5`, or `2.0`.
+2. Download the matching weights from Zenodo into `models/qc/`
+   (`GrandQC_MPP1.pth` / `GrandQC_MPP15.pth` / `GrandQC_MPP2.pth`).
+3. Pass it via `--mpp_model` in the Step 2 `subprocess.run` call.
+4. Leave `M_P_S_MODEL = 512` unchanged.
+5. Confirm each slide reports a valid `openslide.mpp-x` (Part 3.3) before running —
+   it has no fallback.
diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
index d505c19..2f7be61 100644
--- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -53,7 +53,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -63,9 +63,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "IDC data version: v24\n"
+     ]
+    }
+   ],
    "source": [
     "import os\n",
     "import pandas as pd\n",
@@ -90,9 +98,280 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 3,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Total SM collections: 73\n",
+      "Total SM series:      76,299\n",
+      "Total size:           47871 GB\n",
+      "\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "  <div id=\"df-0be3dee0-bb7c-449b-8c89-d4c7e39402fe\" class=\"colab-df-container\">\n",
+       "    <div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>collection_id</th>\n",
+       "      <th>patients</th>\n",
+       "      <th>series</th>\n",
+       "      <th>size_GB</th>\n",
+       "      <th>license_short_name</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>ccdi_mci</td>\n",
+       "      <td>4407</td>\n",
+       "      <td>4576</td>\n",
+       "      <td>4548.0</td>\n",
+       "      <td>CC BY 4.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>tcga_brca</td>\n",
+       "      <td>1098</td>\n",
+       "      <td>3111</td>\n",
+       "      <td>1678.8</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>gtex</td>\n",
+       "      <td>971</td>\n",
+       "      <td>25503</td>\n",
+       "      <td>8354.2</td>\n",
+       "      <td>CC BY 4.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>pdxnet</td>\n",
+       "      <td>919</td>\n",
+       "      <td>919</td>\n",
+       "      <td>122.7</td>\n",
+       "      <td>CC BY 4.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>tcga_gbm</td>\n",
+       "      <td>607</td>\n",
+       "      <td>2053</td>\n",
+       "      <td>639.9</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>tcga_ov</td>\n",
+       "      <td>590</td>\n",
+       "      <td>1481</td>\n",
+       "      <td>476.5</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>tcga_ucec</td>\n",
+       "      <td>560</td>\n",
+       "      <td>1371</td>\n",
+       "      <td>1078.5</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>tcga_kirc</td>\n",
+       "      <td>537</td>\n",
+       "      <td>2173</td>\n",
+       "      <td>811.7</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>tcga_hnsc</td>\n",
+       "      <td>523</td>\n",
+       "      <td>1263</td>\n",
+       "      <td>560.5</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>tcga_luad</td>\n",
+       "      <td>522</td>\n",
+       "      <td>1608</td>\n",
+       "      <td>631.1</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>10</th>\n",
+       "      <td>tcga_lgg</td>\n",
+       "      <td>516</td>\n",
+       "      <td>1572</td>\n",
+       "      <td>1166.6</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>11</th>\n",
+       "      <td>tcga_thca</td>\n",
+       "      <td>507</td>\n",
+       "      <td>1158</td>\n",
+       "      <td>747.1</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12</th>\n",
+       "      <td>tcga_lusc</td>\n",
+       "      <td>504</td>\n",
+       "      <td>1612</td>\n",
+       "      <td>610.4</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>13</th>\n",
+       "      <td>tcga_prad</td>\n",
+       "      <td>500</td>\n",
+       "      <td>1172</td>\n",
+       "      <td>596.4</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>14</th>\n",
+       "      <td>tcga_skcm</td>\n",
+       "      <td>470</td>\n",
+       "      <td>950</td>\n",
+       "      <td>721.5</td>\n",
+       "      <td>CC BY 3.0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>\n",
+       "    <div class=\"colab-df-buttons\">\n",
+       "      \n",
+       "  <div class=\"colab-df-container\">\n",
+       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0be3dee0-bb7c-449b-8c89-d4c7e39402fe')\"\n",
+       "            title=\"Convert this dataframe to an interactive table.\"\n",
+       "            style=\"display:none;\">\n",
+       "      \n",
+       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
+       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
+       "  </svg>\n",
+       "    </button>\n",
+       "    \n",
+       "  <style>\n",
+       "    .colab-df-container {\n",
+       "      display:flex;\n",
+       "      gap: 12px;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-convert {\n",
+       "      background-color: #E8F0FE;\n",
+       "      border: none;\n",
+       "      border-radius: 50%;\n",
+       "      cursor: pointer;\n",
+       "      display: none;\n",
+       "      fill: #1967D2;\n",
+       "      height: 32px;\n",
+       "      padding: 0 0 0 0;\n",
+       "      width: 32px;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-convert:hover {\n",
+       "      background-color: #E2EBFA;\n",
+       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+       "      fill: #174EA6;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-buttons div {\n",
+       "      margin-bottom: 4px;\n",
+       "    }\n",
+       "\n",
+       "    [theme=dark] .colab-df-convert {\n",
+       "      background-color: #3B4455;\n",
+       "      fill: #D2E3FC;\n",
+       "    }\n",
+       "\n",
+       "    [theme=dark] .colab-df-convert:hover {\n",
+       "      background-color: #434B5C;\n",
+       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+       "      fill: #FFFFFF;\n",
+       "    }\n",
+       "  </style>\n",
+       "\n",
+       "    <script>\n",
+       "      const buttonEl =\n",
+       "        document.querySelector('#df-0be3dee0-bb7c-449b-8c89-d4c7e39402fe button.colab-df-convert');\n",
+       "      buttonEl.style.display =\n",
+       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+       "\n",
+       "      async function convertToInteractive(key) {\n",
+       "        const element = document.querySelector('#df-0be3dee0-bb7c-449b-8c89-d4c7e39402fe');\n",
+       "        const dataTable =\n",
+       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+       "                                                    [key], {});\n",
+       "        if (!dataTable) return;\n",
+       "\n",
+       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
+       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+       "          + ' to learn more about interactive tables.';\n",
+       "        element.innerHTML = '';\n",
+       "        dataTable['output_type'] = 'display_data';\n",
+       "        await google.colab.output.renderOutput(dataTable, element);\n",
+       "        const docLink = document.createElement('div');\n",
+       "        docLink.innerHTML = docLinkHtml;\n",
+       "        element.appendChild(docLink);\n",
+       "      }\n",
+       "    </script>\n",
+       "  </div>\n",
+       "  \n",
+       "    </div>\n",
+       "  </div>\n",
+       "  "
+      ],
+      "text/plain": [
+       "   collection_id  patients  series  size_GB license_short_name\n",
+       "0       ccdi_mci      4407    4576   4548.0          CC BY 4.0\n",
+       "1      tcga_brca      1098    3111   1678.8          CC BY 3.0\n",
+       "2           gtex       971   25503   8354.2          CC BY 4.0\n",
+       "3         pdxnet       919     919    122.7          CC BY 4.0\n",
+       "4       tcga_gbm       607    2053    639.9          CC BY 3.0\n",
+       "5        tcga_ov       590    1481    476.5          CC BY 3.0\n",
+       "6      tcga_ucec       560    1371   1078.5          CC BY 3.0\n",
+       "7      tcga_kirc       537    2173    811.7          CC BY 3.0\n",
+       "8      tcga_hnsc       523    1263    560.5          CC BY 3.0\n",
+       "9      tcga_luad       522    1608    631.1          CC BY 3.0\n",
+       "10      tcga_lgg       516    1572   1166.6          CC BY 3.0\n",
+       "11     tcga_thca       507    1158    747.1          CC BY 3.0\n",
+       "12     tcga_lusc       504    1612    610.4          CC BY 3.0\n",
+       "13     tcga_prad       500    1172    596.4          CC BY 3.0\n",
+       "14     tcga_skcm       470     950    721.5          CC BY 3.0"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "# Load the slide microscopy index\n",
     "idc_client.fetch_index(\"sm_index\")\n",
@@ -129,9 +408,262 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 4,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Collections with H&E slides: 70\n",
+      "Total H&E series: 73,422\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "  <div id=\"df-21fe0668-deb3-49ee-a101-eba663e364d5\" class=\"colab-df-container\">\n",
+       "    <div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>collection_id</th>\n",
+       "      <th>patients</th>\n",
+       "      <th>he_series</th>\n",
+       "      <th>size_GB</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>gtex</td>\n",
+       "      <td>971</td>\n",
+       "      <td>25503</td>\n",
+       "      <td>8354.2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>ccdi_mci</td>\n",
+       "      <td>4401</td>\n",
+       "      <td>4569</td>\n",
+       "      <td>4545.8</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>tcga_brca</td>\n",
+       "      <td>1098</td>\n",
+       "      <td>3111</td>\n",
+       "      <td>1678.8</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>tcga_kirc</td>\n",
+       "      <td>537</td>\n",
+       "      <td>2173</td>\n",
+       "      <td>811.7</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>tcga_gbm</td>\n",
+       "      <td>607</td>\n",
+       "      <td>2053</td>\n",
+       "      <td>639.9</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>tcga_lusc</td>\n",
+       "      <td>504</td>\n",
+       "      <td>1612</td>\n",
+       "      <td>610.4</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>tcga_luad</td>\n",
+       "      <td>522</td>\n",
+       "      <td>1608</td>\n",
+       "      <td>631.1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>tcga_lgg</td>\n",
+       "      <td>516</td>\n",
+       "      <td>1572</td>\n",
+       "      <td>1166.6</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>tcga_ov</td>\n",
+       "      <td>590</td>\n",
+       "      <td>1481</td>\n",
+       "      <td>476.5</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>tcga_coad</td>\n",
+       "      <td>460</td>\n",
+       "      <td>1442</td>\n",
+       "      <td>446.5</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>10</th>\n",
+       "      <td>tcga_ucec</td>\n",
+       "      <td>560</td>\n",
+       "      <td>1371</td>\n",
+       "      <td>1078.5</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>11</th>\n",
+       "      <td>tcga_hnsc</td>\n",
+       "      <td>523</td>\n",
+       "      <td>1263</td>\n",
+       "      <td>560.5</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>12</th>\n",
+       "      <td>nlst</td>\n",
+       "      <td>452</td>\n",
+       "      <td>1259</td>\n",
+       "      <td>847.9</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>13</th>\n",
+       "      <td>tcga_stad</td>\n",
+       "      <td>443</td>\n",
+       "      <td>1197</td>\n",
+       "      <td>670.4</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>14</th>\n",
+       "      <td>tcga_prad</td>\n",
+       "      <td>500</td>\n",
+       "      <td>1172</td>\n",
+       "      <td>596.4</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>\n",
+       "    <div class=\"colab-df-buttons\">\n",
+       "      \n",
+       "  <div class=\"colab-df-container\">\n",
+       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-21fe0668-deb3-49ee-a101-eba663e364d5')\"\n",
+       "            title=\"Convert this dataframe to an interactive table.\"\n",
+       "            style=\"display:none;\">\n",
+       "      \n",
+       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
+       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
+       "  </svg>\n",
+       "    </button>\n",
+       "    \n",
+       "  <style>\n",
+       "    .colab-df-container {\n",
+       "      display:flex;\n",
+       "      gap: 12px;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-convert {\n",
+       "      background-color: #E8F0FE;\n",
+       "      border: none;\n",
+       "      border-radius: 50%;\n",
+       "      cursor: pointer;\n",
+       "      display: none;\n",
+       "      fill: #1967D2;\n",
+       "      height: 32px;\n",
+       "      padding: 0 0 0 0;\n",
+       "      width: 32px;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-convert:hover {\n",
+       "      background-color: #E2EBFA;\n",
+       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+       "      fill: #174EA6;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-buttons div {\n",
+       "      margin-bottom: 4px;\n",
+       "    }\n",
+       "\n",
+       "    [theme=dark] .colab-df-convert {\n",
+       "      background-color: #3B4455;\n",
+       "      fill: #D2E3FC;\n",
+       "    }\n",
+       "\n",
+       "    [theme=dark] .colab-df-convert:hover {\n",
+       "      background-color: #434B5C;\n",
+       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+       "      fill: #FFFFFF;\n",
+       "    }\n",
+       "  </style>\n",
+       "\n",
+       "    <script>\n",
+       "      const buttonEl =\n",
+       "        document.querySelector('#df-21fe0668-deb3-49ee-a101-eba663e364d5 button.colab-df-convert');\n",
+       "      buttonEl.style.display =\n",
+       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+       "\n",
+       "      async function convertToInteractive(key) {\n",
+       "        const element = document.querySelector('#df-21fe0668-deb3-49ee-a101-eba663e364d5');\n",
+       "        const dataTable =\n",
+       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+       "                                                    [key], {});\n",
+       "        if (!dataTable) return;\n",
+       "\n",
+       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
+       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+       "          + ' to learn more about interactive tables.';\n",
+       "        element.innerHTML = '';\n",
+       "        dataTable['output_type'] = 'display_data';\n",
+       "        await google.colab.output.renderOutput(dataTable, element);\n",
+       "        const docLink = document.createElement('div');\n",
+       "        docLink.innerHTML = docLinkHtml;\n",
+       "        element.appendChild(docLink);\n",
+       "      }\n",
+       "    </script>\n",
+       "  </div>\n",
+       "  \n",
+       "    </div>\n",
+       "  </div>\n",
+       "  "
+      ],
+      "text/plain": [
+       "   collection_id  patients  he_series  size_GB\n",
+       "0           gtex       971      25503   8354.2\n",
+       "1       ccdi_mci      4401       4569   4545.8\n",
+       "2      tcga_brca      1098       3111   1678.8\n",
+       "3      tcga_kirc       537       2173    811.7\n",
+       "4       tcga_gbm       607       2053    639.9\n",
+       "5      tcga_lusc       504       1612    610.4\n",
+       "6      tcga_luad       522       1608    631.1\n",
+       "7       tcga_lgg       516       1572   1166.6\n",
+       "8        tcga_ov       590       1481    476.5\n",
+       "9      tcga_coad       460       1442    446.5\n",
+       "10     tcga_ucec       560       1371   1078.5\n",
+       "11     tcga_hnsc       523       1263    560.5\n",
+       "12          nlst       452       1259    847.9\n",
+       "13     tcga_stad       443       1197    670.4\n",
+       "14     tcga_prad       500       1172    596.4"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "# Count H&E slides per collection\n",
     "he_by_collection = idc_client.sql_query(\"\"\"\n",
@@ -169,9 +701,303 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 5,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "TCGA-BRCA H&E slides: 3111\n",
+      "Tissue types: {'Neoplasm, Primary': 2704, 'Normal': 399}\n",
+      "\n",
+      "Size range: 9.1 – 3586.3 MB\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "  <div id=\"df-2e85a5c3-dcf0-4755-bc48-1a5aac50f58a\" class=\"colab-df-container\">\n",
+       "    <div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>SeriesInstanceUID</th>\n",
+       "      <th>PatientID</th>\n",
+       "      <th>ContainerIdentifier</th>\n",
+       "      <th>tissue_type</th>\n",
+       "      <th>width_px</th>\n",
+       "      <th>height_px</th>\n",
+       "      <th>pixel_spacing_mm</th>\n",
+       "      <th>objective_power</th>\n",
+       "      <th>size_MB</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1247079754.248369703.163...</td>\n",
+       "      <td>TCGA-A7-A26J</td>\n",
+       "      <td>TCGA-A7-A26J-01B-02-BS2</td>\n",
+       "      <td>Neoplasm, Primary</td>\n",
+       "      <td>8618</td>\n",
+       "      <td>7243</td>\n",
+       "      <td>0.00025</td>\n",
+       "      <td>40</td>\n",
+       "      <td>9.1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1279025007.575797294.163...</td>\n",
+       "      <td>TCGA-AC-A2QJ</td>\n",
+       "      <td>TCGA-AC-A2QJ-11A-02-TS2</td>\n",
+       "      <td>Normal</td>\n",
+       "      <td>33993</td>\n",
+       "      <td>6923</td>\n",
+       "      <td>0.00050</td>\n",
+       "      <td>20</td>\n",
+       "      <td>15.5</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1306193840.1223239355.16...</td>\n",
+       "      <td>TCGA-AC-A2FB</td>\n",
+       "      <td>TCGA-AC-A2FB-11A-01-TSA</td>\n",
+       "      <td>Normal</td>\n",
+       "      <td>31993</td>\n",
+       "      <td>7690</td>\n",
+       "      <td>0.00050</td>\n",
+       "      <td>20</td>\n",
+       "      <td>16.5</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1263739509.328647659.163...</td>\n",
+       "      <td>TCGA-BH-A5IZ</td>\n",
+       "      <td>TCGA-BH-A5IZ-11A-01-TS1</td>\n",
+       "      <td>Normal</td>\n",
+       "      <td>20041</td>\n",
+       "      <td>15048</td>\n",
+       "      <td>0.00050</td>\n",
+       "      <td>20</td>\n",
+       "      <td>18.7</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1290706954.176399971.163...</td>\n",
+       "      <td>TCGA-AC-A2QJ</td>\n",
+       "      <td>TCGA-AC-A2QJ-11A-01-TS1</td>\n",
+       "      <td>Normal</td>\n",
+       "      <td>41991</td>\n",
+       "      <td>9316</td>\n",
+       "      <td>0.00050</td>\n",
+       "      <td>20</td>\n",
+       "      <td>22.8</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>5</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1315471301.865934712.163...</td>\n",
+       "      <td>TCGA-E9-A1RF</td>\n",
+       "      <td>TCGA-E9-A1RF-11A-03-TSC</td>\n",
+       "      <td>Normal</td>\n",
+       "      <td>27888</td>\n",
+       "      <td>9784</td>\n",
+       "      <td>0.00050</td>\n",
+       "      <td>20</td>\n",
+       "      <td>23.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>6</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1322835661.1509329658.16...</td>\n",
+       "      <td>TCGA-GM-A2D9</td>\n",
+       "      <td>TCGA-GM-A2D9-11A-02-TS2</td>\n",
+       "      <td>Normal</td>\n",
+       "      <td>35992</td>\n",
+       "      <td>8945</td>\n",
+       "      <td>0.00050</td>\n",
+       "      <td>20</td>\n",
+       "      <td>25.0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>7</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1255105493.445850269.163...</td>\n",
+       "      <td>TCGA-E9-A1RF</td>\n",
+       "      <td>TCGA-E9-A1RF-11A-01-TSA</td>\n",
+       "      <td>Normal</td>\n",
+       "      <td>23904</td>\n",
+       "      <td>14929</td>\n",
+       "      <td>0.00050</td>\n",
+       "      <td>20</td>\n",
+       "      <td>26.9</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>8</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1247939226.656219549.163...</td>\n",
+       "      <td>TCGA-OL-A5RW</td>\n",
+       "      <td>TCGA-OL-A5RW-01Z-00-DX1</td>\n",
+       "      <td>Neoplasm, Primary</td>\n",
+       "      <td>12879</td>\n",
+       "      <td>13618</td>\n",
+       "      <td>0.00050</td>\n",
+       "      <td>&lt;NA&gt;</td>\n",
+       "      <td>27.1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>9</th>\n",
+       "      <td>1.3.6.1.4.1.5962.99.1.1244105102.1853648157.16...</td>\n",
+       "      <td>TCGA-E9-A1NF</td>\n",
+       "      <td>TCGA-E9-A1NF-11A-05-TSE</td>\n",
+       "      <td>Normal</td>\n",
+       "      <td>13943</td>\n",
+       "      <td>17330</td>\n",
+       "      <td>0.00049</td>\n",
+       "      <td>20</td>\n",
+       "      <td>27.3</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>\n",
+       "    <div class=\"colab-df-buttons\">\n",
+       "      \n",
+       "  <div class=\"colab-df-container\">\n",
+       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-2e85a5c3-dcf0-4755-bc48-1a5aac50f58a')\"\n",
+       "            title=\"Convert this dataframe to an interactive table.\"\n",
+       "            style=\"display:none;\">\n",
+       "      \n",
+       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
+       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
+       "  </svg>\n",
+       "    </button>\n",
+       "    \n",
+       "  <style>\n",
+       "    .colab-df-container {\n",
+       "      display:flex;\n",
+       "      gap: 12px;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-convert {\n",
+       "      background-color: #E8F0FE;\n",
+       "      border: none;\n",
+       "      border-radius: 50%;\n",
+       "      cursor: pointer;\n",
+       "      display: none;\n",
+       "      fill: #1967D2;\n",
+       "      height: 32px;\n",
+       "      padding: 0 0 0 0;\n",
+       "      width: 32px;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-convert:hover {\n",
+       "      background-color: #E2EBFA;\n",
+       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+       "      fill: #174EA6;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-buttons div {\n",
+       "      margin-bottom: 4px;\n",
+       "    }\n",
+       "\n",
+       "    [theme=dark] .colab-df-convert {\n",
+       "      background-color: #3B4455;\n",
+       "      fill: #D2E3FC;\n",
+       "    }\n",
+       "\n",
+       "    [theme=dark] .colab-df-convert:hover {\n",
+       "      background-color: #434B5C;\n",
+       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+       "      fill: #FFFFFF;\n",
+       "    }\n",
+       "  </style>\n",
+       "\n",
+       "    <script>\n",
+       "      const buttonEl =\n",
+       "        document.querySelector('#df-2e85a5c3-dcf0-4755-bc48-1a5aac50f58a button.colab-df-convert');\n",
+       "      buttonEl.style.display =\n",
+       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+       "\n",
+       "      async function convertToInteractive(key) {\n",
+       "        const element = document.querySelector('#df-2e85a5c3-dcf0-4755-bc48-1a5aac50f58a');\n",
+       "        const dataTable =\n",
+       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+       "                                                    [key], {});\n",
+       "        if (!dataTable) return;\n",
+       "\n",
+       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
+       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+       "          + ' to learn more about interactive tables.';\n",
+       "        element.innerHTML = '';\n",
+       "        dataTable['output_type'] = 'display_data';\n",
+       "        await google.colab.output.renderOutput(dataTable, element);\n",
+       "        const docLink = document.createElement('div');\n",
+       "        docLink.innerHTML = docLinkHtml;\n",
+       "        element.appendChild(docLink);\n",
+       "      }\n",
+       "    </script>\n",
+       "  </div>\n",
+       "  \n",
+       "    </div>\n",
+       "  </div>\n",
+       "  "
+      ],
+      "text/plain": [
+       "                                   SeriesInstanceUID     PatientID  \\\n",
+       "0  1.3.6.1.4.1.5962.99.1.1247079754.248369703.163...  TCGA-A7-A26J   \n",
+       "1  1.3.6.1.4.1.5962.99.1.1279025007.575797294.163...  TCGA-AC-A2QJ   \n",
+       "2  1.3.6.1.4.1.5962.99.1.1306193840.1223239355.16...  TCGA-AC-A2FB   \n",
+       "3  1.3.6.1.4.1.5962.99.1.1263739509.328647659.163...  TCGA-BH-A5IZ   \n",
+       "4  1.3.6.1.4.1.5962.99.1.1290706954.176399971.163...  TCGA-AC-A2QJ   \n",
+       "5  1.3.6.1.4.1.5962.99.1.1315471301.865934712.163...  TCGA-E9-A1RF   \n",
+       "6  1.3.6.1.4.1.5962.99.1.1322835661.1509329658.16...  TCGA-GM-A2D9   \n",
+       "7  1.3.6.1.4.1.5962.99.1.1255105493.445850269.163...  TCGA-E9-A1RF   \n",
+       "8  1.3.6.1.4.1.5962.99.1.1247939226.656219549.163...  TCGA-OL-A5RW   \n",
+       "9  1.3.6.1.4.1.5962.99.1.1244105102.1853648157.16...  TCGA-E9-A1NF   \n",
+       "\n",
+       "       ContainerIdentifier        tissue_type  width_px  height_px  \\\n",
+       "0  TCGA-A7-A26J-01B-02-BS2  Neoplasm, Primary      8618       7243   \n",
+       "1  TCGA-AC-A2QJ-11A-02-TS2             Normal     33993       6923   \n",
+       "2  TCGA-AC-A2FB-11A-01-TSA             Normal     31993       7690   \n",
+       "3  TCGA-BH-A5IZ-11A-01-TS1             Normal     20041      15048   \n",
+       "4  TCGA-AC-A2QJ-11A-01-TS1             Normal     41991       9316   \n",
+       "5  TCGA-E9-A1RF-11A-03-TSC             Normal     27888       9784   \n",
+       "6  TCGA-GM-A2D9-11A-02-TS2             Normal     35992       8945   \n",
+       "7  TCGA-E9-A1RF-11A-01-TSA             Normal     23904      14929   \n",
+       "8  TCGA-OL-A5RW-01Z-00-DX1  Neoplasm, Primary     12879      13618   \n",
+       "9  TCGA-E9-A1NF-11A-05-TSE             Normal     13943      17330   \n",
+       "\n",
+       "   pixel_spacing_mm  objective_power  size_MB  \n",
+       "0           0.00025               40      9.1  \n",
+       "1           0.00050               20     15.5  \n",
+       "2           0.00050               20     16.5  \n",
+       "3           0.00050               20     18.7  \n",
+       "4           0.00050               20     22.8  \n",
+       "5           0.00050               20     23.0  \n",
+       "6           0.00050               20     25.0  \n",
+       "7           0.00050               20     26.9  \n",
+       "8           0.00050             <NA>     27.1  \n",
+       "9           0.00049               20     27.3  "
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "# Query TCGA-BRCA H&E slides with key metadata\n",
     "tcga_brca_he = idc_client.sql_query(\"\"\"\n",
@@ -214,9 +1040,181 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 6,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Selected 3 slides for GrandQC processing:\n"
+     ]
+    },
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "  <div id=\"df-b42db108-ec3b-4b58-b20f-013dba0c6cde\" class=\"colab-df-container\">\n",
+       "    <div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>PatientID</th>\n",
+       "      <th>ContainerIdentifier</th>\n",
+       "      <th>tissue_type</th>\n",
+       "      <th>objective_power</th>\n",
+       "      <th>width_px</th>\n",
+       "      <th>height_px</th>\n",
+       "      <th>size_MB</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>TCGA-A7-A26J</td>\n",
+       "      <td>TCGA-A7-A26J-01B-02-BS2</td>\n",
+       "      <td>Neoplasm, Primary</td>\n",
+       "      <td>40</td>\n",
+       "      <td>8618</td>\n",
+       "      <td>7243</td>\n",
+       "      <td>9.1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>TCGA-AC-A2QJ</td>\n",
+       "      <td>TCGA-AC-A2QJ-11A-02-TS2</td>\n",
+       "      <td>Normal</td>\n",
+       "      <td>20</td>\n",
+       "      <td>33993</td>\n",
+       "      <td>6923</td>\n",
+       "      <td>15.5</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>TCGA-E2-A15K</td>\n",
+       "      <td>TCGA-E2-A15K-06A-01-TS1</td>\n",
+       "      <td>None</td>\n",
+       "      <td>40</td>\n",
+       "      <td>33319</td>\n",
+       "      <td>19962</td>\n",
+       "      <td>85.7</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>\n",
+       "    <div class=\"colab-df-buttons\">\n",
+       "      \n",
+       "  <div class=\"colab-df-container\">\n",
+       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b42db108-ec3b-4b58-b20f-013dba0c6cde')\"\n",
+       "            title=\"Convert this dataframe to an interactive table.\"\n",
+       "            style=\"display:none;\">\n",
+       "      \n",
+       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
+       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
+       "  </svg>\n",
+       "    </button>\n",
+       "    \n",
+       "  <style>\n",
+       "    .colab-df-container {\n",
+       "      display:flex;\n",
+       "      gap: 12px;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-convert {\n",
+       "      background-color: #E8F0FE;\n",
+       "      border: none;\n",
+       "      border-radius: 50%;\n",
+       "      cursor: pointer;\n",
+       "      display: none;\n",
+       "      fill: #1967D2;\n",
+       "      height: 32px;\n",
+       "      padding: 0 0 0 0;\n",
+       "      width: 32px;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-convert:hover {\n",
+       "      background-color: #E2EBFA;\n",
+       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+       "      fill: #174EA6;\n",
+       "    }\n",
+       "\n",
+       "    .colab-df-buttons div {\n",
+       "      margin-bottom: 4px;\n",
+       "    }\n",
+       "\n",
+       "    [theme=dark] .colab-df-convert {\n",
+       "      background-color: #3B4455;\n",
+       "      fill: #D2E3FC;\n",
+       "    }\n",
+       "\n",
+       "    [theme=dark] .colab-df-convert:hover {\n",
+       "      background-color: #434B5C;\n",
+       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+       "      fill: #FFFFFF;\n",
+       "    }\n",
+       "  </style>\n",
+       "\n",
+       "    <script>\n",
+       "      const buttonEl =\n",
+       "        document.querySelector('#df-b42db108-ec3b-4b58-b20f-013dba0c6cde button.colab-df-convert');\n",
+       "      buttonEl.style.display =\n",
+       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+       "\n",
+       "      async function convertToInteractive(key) {\n",
+       "        const element = document.querySelector('#df-b42db108-ec3b-4b58-b20f-013dba0c6cde');\n",
+       "        const dataTable =\n",
+       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+       "                                                    [key], {});\n",
+       "        if (!dataTable) return;\n",
+       "\n",
+       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
+       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+       "          + ' to learn more about interactive tables.';\n",
+       "        element.innerHTML = '';\n",
+       "        dataTable['output_type'] = 'display_data';\n",
+       "        await google.colab.output.renderOutput(dataTable, element);\n",
+       "        const docLink = document.createElement('div');\n",
+       "        docLink.innerHTML = docLinkHtml;\n",
+       "        element.appendChild(docLink);\n",
+       "      }\n",
+       "    </script>\n",
+       "  </div>\n",
+       "  \n",
+       "    </div>\n",
+       "  </div>\n",
+       "  "
+      ],
+      "text/plain": [
+       "      PatientID      ContainerIdentifier        tissue_type  objective_power  \\\n",
+       "0  TCGA-A7-A26J  TCGA-A7-A26J-01B-02-BS2  Neoplasm, Primary               40   \n",
+       "1  TCGA-AC-A2QJ  TCGA-AC-A2QJ-11A-02-TS2             Normal               20   \n",
+       "2  TCGA-E2-A15K  TCGA-E2-A15K-06A-01-TS1               None               40   \n",
+       "\n",
+       "   width_px  height_px  size_MB  \n",
+       "0      8618       7243      9.1  \n",
+       "1     33993       6923     15.5  \n",
+       "2     33319      19962     85.7  "
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
    "source": [
     "# Pick up to 5 small slides with variety in tissue type and magnification\n",
     "demo_slides = (\n",
@@ -241,9 +1239,22 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 7,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "TCGA-A7-A26J-01B-02-BS2  (Neoplasm, Primary, 9.1 MB)\n",
+      "  https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.313270449313106073714723178854675013710/series/1.3.6.1.4.1.5962.99.1.1247079754.248369703.1637629652298.2.0\n",
+      "TCGA-AC-A2QJ-11A-02-TS2  (Normal, 15.5 MB)\n",
+      "  https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.337996839190358918582393097516172189888/series/1.3.6.1.4.1.5962.99.1.1279025007.575797294.1637661597551.2.0\n",
+      "TCGA-E2-A15K-06A-01-TS1  (None, 85.7 MB)\n",
+      "  https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.54614374037792363265218330907954861388/series/1.3.6.1.4.1.5962.99.1.1231257225.1365336880.1637613829769.2.0\n"
+     ]
+    }
+   ],
    "source": [
     "# Preview slides in the IDC SLIM viewer (pathology viewer)\n",
     "for _, row in demo_slides.iterrows():\n",
@@ -276,7 +1287,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 8,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -305,9 +1316,21 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 9,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "grandqc/01_WSI_infe 100%[===================>]  25.41M  1.06MB/s    in 26s     \n",
+      "grandqc/01_WSI_infe 100%[===================>]  24.21M  1.08MB/s    in 25s     \n",
+      "Models downloaded:\n",
+      "  grandqc/01_WSI_inference_OPENSLIDE_QC/models/td/Tissue_Detection_MPP10.pth  (25.4 MB)\n",
+      "  grandqc/01_WSI_inference_OPENSLIDE_QC/models/qc/GrandQC_MPP15.pth  (24.2 MB)\n"
+     ]
+    }
+   ],
    "source": [
     "import os\n",
     "\n",
@@ -335,109 +1358,21 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "### 2.3 Patch Scripts for DICOM Directory Support\n",
-    "\n",
-    "IDC downloads each series as a directory of tiled DICOM files (`*.dcm`). OpenSlide 4.0+ can open any one of those files and reconstruct the full pyramid from the same directory.\n",
-    "\n",
-    "The two changes per script are:\n",
-    "\n",
-    "| Script | Change |\n",
-    "|--------|--------|\n",
-    "| `wsi_tis_detect.py` | Replace file-only `os.listdir` filter with one that also yields subdirectories containing `.dcm` files; resolve the path before calling `OpenSlide()` |\n",
-    "| `main.py` | Same two changes using `open_slide()` |\n",
-    "\n",
-    "Both scripts also get automatic CUDA/MPS/CPU device selection instead of the hardcoded `'cuda'`."
-   ]
+   "source": "### 2.3 Patch Scripts for DICOM Directory Support\n\nIDC downloads each series as a directory of tiled DICOM files (`*.dcm`). OpenSlide 4.0+ can open any one of those files and reconstruct the full pyramid from the same directory.\n\nThe changes per script are:\n\n| Script | Change |\n|--------|--------|\n| `wsi_tis_detect.py` | Replace file-only `os.listdir` filter with one that also yields subdirectories containing `.dcm` files; resolve the path before calling `OpenSlide()` |\n| `main.py` | Same two changes using `open_slide()`; plus `torch.load(..., weights_only=False)` so the pickled QC model loads under PyTorch ≥ 2.6 |\n\nBoth scripts also get automatic CUDA/MPS/CPU device selection instead of the hardcoded `'cuda'`.\n\n> **Note**: GrandQC's scripts iterate over files in a flat folder; IDC downloads each DICOM series into its own subdirectory, so the patch teaches them to recognise a directory whose contents end in `.dcm` as a valid slide entry and resolves the OpenSlide path accordingly. The `weights_only=False` change is required because PyTorch 2.6 made `weights_only=True` the default for `torch.load`, which refuses to unpickle the `segmentation_models_pytorch` model object that `main.py` loads. No model weights or algorithm logic are changed."
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "from pathlib import Path\n",
-    "\n",
-    "def patch_grandqc_for_dicom(script_path: str) -> None:\n",
-    "    text = Path(script_path).read_text()\n",
-    "    original = text\n",
-    "\n",
-    "    # ── 1. Auto-detect compute device ──────────────────────────────────────\n",
-    "    text = text.replace(\n",
-    "        \"DEVICE = 'cuda'\",\n",
-    "        \"import torch as _t; DEVICE = ('cuda' if _t.cuda.is_available() \"\n",
-    "        \"else ('mps' if _t.backends.mps.is_available() else 'cpu'))\",\n",
-    "    )\n",
-    "\n",
-    "    # ── 2. Replace slide-list builder to also accept DICOM directories ──────\n",
-    "    DICOM_ITER = (\n",
-    "        \"def _iter_slides(d):\\n\"\n",
-    "        \"    for n in sorted(os.listdir(d)):\\n\"\n",
-    "        \"        p = os.path.join(d, n)\\n\"\n",
-    "        \"        if os.path.isfile(p) or (\\n\"\n",
-    "        \"                os.path.isdir(p) and\\n\"\n",
-    "        \"                any(x.lower().endswith('.dcm') for x in os.listdir(p))):\\n\"\n",
-    "        \"            yield n\\n\"\n",
-    "    )\n",
-    "\n",
-    "    # wsi_tis_detect.py pattern\n",
-    "    OLD_LIST_TIS = (\n",
-    "        \"slide_names = sorted([f for f in os.listdir(SLIDE_DIR) \"\n",
-    "        \"if os.path.isfile(os.path.join(SLIDE_DIR, f))])\"\n",
-    "    )\n",
-    "    NEW_LIST_TIS = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n",
-    "\n",
-    "    # main.py pattern\n",
-    "    OLD_LIST_MAIN = \"slide_names = sorted(os.listdir(SLIDE_DIR))\"\n",
-    "    NEW_LIST_MAIN = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n",
-    "\n",
-    "    text = text.replace(OLD_LIST_TIS, NEW_LIST_TIS)\n",
-    "    text = text.replace(OLD_LIST_MAIN, NEW_LIST_MAIN)\n",
-    "\n",
-    "    # ── 3. Resolve DICOM directory to first .dcm file before open ───────────\n",
-    "    RESOLVE = (\n",
-    "        \"        if os.path.isdir(path_slide):\\n\"\n",
-    "        \"            _dcms = sorted(x for x in os.listdir(path_slide)\"\n",
-    "        \" if x.lower().endswith('.dcm'))\\n\"\n",
-    "        \"            path_slide = os.path.join(path_slide, _dcms[0])\\n\"\n",
-    "    )\n",
-    "\n",
-    "    for open_call in (\"slide = OpenSlide(path_slide)\", \"slide = open_slide(path_slide)\"):\n",
-    "        old = f\"        path_slide = os.path.join(SLIDE_DIR, slide_name)\\n        {open_call}\"\n",
-    "        new = f\"        path_slide = os.path.join(SLIDE_DIR, slide_name)\\n{RESOLVE}        {open_call}\"\n",
-    "        text = text.replace(old, new)\n",
-    "\n",
-    "    if text == original:\n",
-    "        print(f\"  WARNING: no changes applied to {script_path}\")\n",
-    "    else:\n",
-    "        Path(script_path).write_text(text)\n",
-    "        n = sum(1 for a, b in zip(original.splitlines(), text.splitlines()) if a != b)\n",
-    "        n += abs(len(text.splitlines()) - len(original.splitlines()))\n",
-    "        print(f\"  Patched {script_path}  (+{n} lines changed)\")\n",
-    "\n",
-    "\n",
-    "for script in [\n",
-    "    f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\",\n",
-    "    f\"{GRANDQC_SCRIPTS}/main.py\",\n",
-    "]:\n",
-    "    patch_grandqc_for_dicom(script)\n",
-    "print(\"Done.\")"
-   ]
+   "source": "from pathlib import Path\n\ndef patch_grandqc_for_dicom(script_path: str) -> None:\n    text = Path(script_path).read_text()\n    original = text\n\n    # ── 1. Auto-detect compute device ──────────────────────────────────────\n    text = text.replace(\n        \"DEVICE = 'cuda'\",\n        \"import torch as _t; DEVICE = ('cuda' if _t.cuda.is_available() \"\n        \"else ('mps' if _t.backends.mps.is_available() else 'cpu'))\",\n    )\n\n    # ── 2. Replace slide-list builder to also accept DICOM directories ──────\n    DICOM_ITER = (\n        \"def _iter_slides(d):\\n\"\n        \"    for n in sorted(os.listdir(d)):\\n\"\n        \"        p = os.path.join(d, n)\\n\"\n        \"        if os.path.isfile(p) or (\\n\"\n        \"                os.path.isdir(p) and\\n\"\n        \"                any(x.lower().endswith('.dcm') for x in os.listdir(p))):\\n\"\n        \"            yield n\\n\"\n    )\n\n    # wsi_tis_detect.py pattern\n    OLD_LIST_TIS = (\n        \"slide_names = sorted([f for f in os.listdir(SLIDE_DIR) \"\n        \"if os.path.isfile(os.path.join(SLIDE_DIR, f))])\"\n    )\n    NEW_LIST_TIS = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n\n    # main.py pattern\n    OLD_LIST_MAIN = \"slide_names = sorted(os.listdir(SLIDE_DIR))\"\n    NEW_LIST_MAIN = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n\n    text = text.replace(OLD_LIST_TIS, NEW_LIST_TIS)\n    text = text.replace(OLD_LIST_MAIN, NEW_LIST_MAIN)\n\n    # ── 3. Resolve DICOM directory to first .dcm file before open ───────────\n    RESOLVE = (\n        \"        if os.path.isdir(path_slide):\\n\"\n        \"            _dcms = sorted(x for x in os.listdir(path_slide)\"\n        \" if x.lower().endswith('.dcm'))\\n\"\n        \"            path_slide = os.path.join(path_slide, _dcms[0])\\n\"\n    )\n\n    for open_call in (\"slide = OpenSlide(path_slide)\", \"slide = open_slide(path_slide)\"):\n        old = f\"        path_slide = os.path.join(SLIDE_DIR, slide_name)\\n        {open_call}\"\n        new = f\"        path_slide = os.path.join(SLIDE_DIR, slide_name)\\n{RESOLVE}        {open_call}\"\n        text = text.replace(old, new)\n\n    # ── 4. Load the full QC model with weights_only=False ───────────────────\n    # main.py loads a pickled nn.Module via torch.load(). Since PyTorch 2.6 the\n    # default weights_only=True refuses to unpickle the segmentation_models_pytorch\n    # classes inside, raising UnpicklingError before the per-slide try/except and\n    # killing the process with exit code 1. The Zenodo weights are trusted, so we\n    # opt back into the full unpickling path. (wsi_tis_detect.py is unaffected —\n    # it loads a plain state_dict, which weights_only=True allows.)\n    text = text.replace(\n        \"model_prim = torch.load(MODEL_QC_DIR + MODEL_QC_NAME, map_location=DEVICE)\",\n        \"model_prim = torch.load(MODEL_QC_DIR + MODEL_QC_NAME, map_location=DEVICE, \"\n        \"weights_only=False)\",\n    )\n\n    # ── 5. Progress reporting in main.py's slide loop ───────────────────────\n    # Track failures and report [i/N] progress. (main.py-specific anchors — these\n    # strings do not occur in wsi_tis_detect.py, so this is a no-op there.)\n    text = text.replace(\n        \"for slide_name in slide_names[start:end]:\\n\"\n        \"    try:\\n\"\n        \"        # Register start time\\n\"\n        \"        start = timeit.default_timer()\\n\"\n        \"\\n\"\n        \"        print(\\\"\\\")\\n\"\n        \"        print(\\\"Processing:\\\", slide_name)\",\n        \"_slides_to_run = slide_names[start:end]\\n\"\n        \"_failures = []\\n\"\n        \"print(f\\\"Found {len(_slides_to_run)} slide(s) to process.\\\")\\n\"\n        \"for _idx, slide_name in enumerate(_slides_to_run, 1):\\n\"\n        \"    try:\\n\"\n        \"        # Register start time\\n\"\n        \"        start = timeit.default_timer()\\n\"\n        \"\\n\"\n        \"        print(\\\"\\\")\\n\"\n        \"        print(f\\\"[{_idx}/{len(_slides_to_run)}] Processing:\\\", slide_name)\",\n    )\n\n    # ── 6. Record per-slide failures and exit non-zero if any occurred ──────\n    # The upstream loop swallows every per-slide error and the script exits 0,\n    # so the notebook can't tell a slide failed. Collect failures and sys.exit(1)\n    # after the loop so the caller surfaces them.\n    text = text.replace(\n        \"    except Exception as e:\\n\"\n        \"        print(f\\\"There was some problem with the slide. The error is: {e}\\\")\",\n        \"    except Exception as e:\\n\"\n        \"        print(f\\\"There was some problem with the slide. The error is: {e}\\\")\\n\"\n        \"        _failures.append((slide_name, str(e)))\\n\"\n        \"\\n\"\n        \"# Surface per-slide failures with a non-zero exit so the caller sees them\\n\"\n        \"if _failures:\\n\"\n        \"    print(f\\\"\\\\n{len(_failures)} of {len(_slides_to_run)} slide(s) FAILED:\\\")\\n\"\n        \"    for _name, _err in _failures:\\n\"\n        \"        print(f\\\"  - {_name}: {_err}\\\")\\n\"\n        \"    import sys as _sys; _sys.exit(1)\\n\"\n        \"print(f\\\"\\\\nAll {len(_slides_to_run)} slide(s) processed successfully.\\\")\",\n    )\n\n    if text == original:\n        print(f\"  WARNING: no changes applied to {script_path}\")\n    else:\n        Path(script_path).write_text(text)\n        n = sum(1 for a, b in zip(original.splitlines(), text.splitlines()) if a != b)\n        n += abs(len(text.splitlines()) - len(original.splitlines()))\n        print(f\"  Patched {script_path}  (+{n} lines changed)\")\n\n\nfor script in [\n    f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\",\n    f\"{GRANDQC_SCRIPTS}/main.py\",\n]:\n    patch_grandqc_for_dicom(script)\nprint(\"Done.\")"
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# Sanity-check: confirm the DICOM resolver was inserted in both scripts\n",
-    "for script in [f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\", f\"{GRANDQC_SCRIPTS}/main.py\"]:\n",
-    "    text = Path(script).read_text()\n",
-    "    has_dcm  = \"_iter_slides\" in text\n",
-    "    has_res  = \"_dcms\" in text\n",
-    "    has_dev  = \"cuda.is_available\" in text\n",
-    "    print(f\"{script.split('/')[-1]:30s}  dicom_iter={has_dcm}  dcm_resolve={has_res}  auto_device={has_dev}\")"
-   ]
+   "source": "# Sanity-check: confirm the patches were inserted in both scripts\nfor script in [f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\", f\"{GRANDQC_SCRIPTS}/main.py\"]:\n    text = Path(script).read_text()\n    has_dcm  = \"_iter_slides\" in text\n    has_res  = \"_dcms\" in text\n    has_dev  = \"cuda.is_available\" in text\n    # weights_only=False only applies to main.py's full-model load\n    has_wo   = (\"weights_only=False\" in text) if script.endswith(\"main.py\") else \"n/a\"\n    print(f\"{script.split('/')[-1]:30s}  dicom_iter={has_dcm}  dcm_resolve={has_res}  \"\n          f\"auto_device={has_dev}  weights_only_fix={has_wo}\")"
   },
   {
    "cell_type": "markdown",
@@ -472,6 +1407,7 @@
     "idc_client.download_from_selection(\n",
     "    downloadDir=SLIDE_DIR,\n",
     "    seriesInstanceUID=demo_slides[\"SeriesInstanceUID\"].tolist(),\n",
+    "    dirTemplate=\"%SeriesInstanceUID\",\n",
     ")\n",
     "print(\"\\nDownload complete.\")"
    ]
@@ -559,20 +1495,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "## Part 4: Run GrandQC Quality Control\n",
-    "\n",
-    "GrandQC runs as two sequential scripts, both inside `grandqc/01_WSI_inference_OPENSLIDE_QC/`:\n",
-    "\n",
-    "| Step | Script | MPP | Time (GPU) | Output |\n",
-    "|------|--------|-----|------------|--------|\n",
-    "| 1 | `wsi_tis_detect.py` | 10 | ~0.4 s/slide | Binary tissue masks |\n",
-    "| 2 | `main.py` | 1.5 | ~30–45 s/slide | 7-class artifact maps, GeoJSON, TSV report |\n",
-    "\n",
-    "The 7 artifact classes are: **1** clean tissue · **2** folds · **3** dark spots · **4** pen marks · **5** air bubbles/edges · **6** out-of-focus · **7** background.\n",
-    "\n",
-    "Both scripts load models via relative paths (`./models/…`) and must therefore be invoked with their directory as the working directory. We use `subprocess` with `cwd=GRANDQC_SCRIPTS` to achieve this while keeping output streaming to the cell."
-   ]
+   "source": "## Part 4: Run GrandQC Quality Control\n\nGrandQC runs as two sequential scripts, both inside `grandqc/01_WSI_inference_OPENSLIDE_QC/`:\n\n| Step | Script | MPP | Time (GPU) | Output |\n|------|--------|-----|------------|--------|\n| 1 | `wsi_tis_detect.py` | 10 | ~0.4 s/slide | Binary tissue masks |\n| 2 | `main.py` | 1.5 | ~30–45 s/slide | 7-class artifact maps, GeoJSON, TSV report |\n\nThe 7 artifact classes are: **1** clean tissue · **2** folds · **3** dark spots · **4** pen marks · **5** air bubbles/edges · **6** out-of-focus · **7** background.\n\nBoth scripts load models via relative paths (`./models/…`) and must therefore be invoked with their directory as the working directory. We use `subprocess` with `cwd=GRANDQC_SCRIPTS` to achieve this while keeping output streaming to the cell.\n\n> **On inference resolution**: the artifact model runs at MPP 1.5 (≈7× objective), *not* at the slide's native 20×/40×, and the per-slide MPP is read from OpenSlide's `openslide.mpp-x` property (derived from the DICOM `PixelSpacing` for IDC slides). For how this is controlled, the patch-size math, and an empirical confirmation on an IDC slide, see [`grandqc_slide_quality_resolution_notes.md`](grandqc_slide_quality_resolution_notes.md)."
   },
   {
    "cell_type": "markdown",
@@ -1123,13 +2046,23 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
    "name": "python3"
   },
   "language_info": {
-   "name": "python"
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.13"
   }
  },
  "nbformat": 4,
  "nbformat_minor": 0
-}
+}
\ No newline at end of file

From 475a24098849d7952f1701b76979e2c950b5e308 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Mon, 15 Jun 2026 16:05:40 -0400
Subject: [PATCH 3/9] Use patched GrandQC fork instead of runtime
 monkey-patching

Clone the fedorov/grandqc idc-dicom-fixes branch, which carries the
DICOM-directory, device-detection, PyTorch >=2.6 load, small-slide
overlay, and fail-loud fixes as proper commits. Replace the in-notebook
patch_grandqc_for_dicom() cell with a verification cell that asserts the
fixes are present in the cloned scripts, and update Part 2 prose plus the
resolution notes accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../grandqc_slide_quality_resolution_notes.md |  4 +-
 .../grandqc_slide_quality_with_idc.ipynb      | 42 +++----------------
 2 files changed, 7 insertions(+), 39 deletions(-)

diff --git a/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md
index 0206cf2..51f585c 100644
--- a/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md
+++ b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md
@@ -129,8 +129,8 @@ Key points:
   ```
 
   If `openslide.mpp-x` is missing/`None`, `float(...)` raises. With the
-  failure-guard patch added in the notebook, this is caught and reported per
-  slide rather than silently mis-scaling.
+  fail-loud error handling in the `idc-dicom-fixes` fork branch, this is caught
+  and reported per slide (with a non-zero exit) rather than silently mis-scaling.
 - `openslide.mpp-x` is the **single point of failure** for scaling any new slide:
   if it is wrong or absent, `p_s` is sized wrong and the model sees the slide at
   the wrong scale.
diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
index 2f7be61..a68cddf 100644
--- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -1266,17 +1266,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "## Part 2: GrandQC Setup\n",
-    "\n",
-    "Before running GrandQC we need to:\n",
-    "1. Clone the repository from GitHub\n",
-    "2. Install its Python dependencies\n",
-    "3. Download the pre-trained models from Zenodo\n",
-    "4. Apply a small patch so both scripts accept IDC's DICOM series directories alongside conventional WSI files\n",
-    "\n",
-    "> **Note**: GrandQC's scripts currently iterate over files in a flat folder. IDC downloads each DICOM series into its own subdirectory. The patch below (≤ 10 lines changed across 2 files) teaches GrandQC to recognise a directory whose contents end in `.dcm` as a valid slide entry and resolves the OpenSlide path accordingly. No model weights or algorithm logic are changed."
-   ]
+   "source": "## Part 2: GrandQC Setup\n\nBefore running GrandQC we need to:\n1. Clone the repository from GitHub — we use the [`fedorov/grandqc`](https://github.com/fedorov/grandqc/tree/idc-dicom-fixes) fork, whose `idc-dicom-fixes` branch lets the inference scripts run directly on IDC DICOM series\n2. Install its Python dependencies\n3. Download the pre-trained models from Zenodo\n\n> **Note**: The `idc-dicom-fixes` branch carries a handful of small, self-contained fixes on top of upstream `cpath-ukk/grandqc` — DICOM-directory input, compute-device auto-detection, a PyTorch ≥ 2.6 model-load fix, a small-slide overlay fix, and fail-loud error handling. None of them change model weights or algorithm logic (see Part 2.3 for the full list). Working from a fork keeps the notebook clean instead of monkey-patching the scripts at runtime."
   },
   {
    "cell_type": "markdown",
@@ -1287,25 +1277,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "%%capture\n",
-    "# Clone GrandQC (shallow clone — we only need the latest commit)\n",
-    "!git clone --depth 1 https://github.com/cpath-ukk/grandqc.git\n",
-    "\n",
-    "# Install GrandQC Python dependencies.\n",
-    "# PyTorch is already present in Colab; we install the remaining packages.\n",
-    "# segmentation-models-pytorch 0.3.1 is pinned because its model.predict() API\n",
-    "# was removed in 0.4.0 and GrandQC relies on it.\n",
-    "!pip install -q \\\n",
-    "    \"segmentation-models-pytorch==0.3.1\" \\\n",
-    "    \"rasterio>=1.3\" \\\n",
-    "    \"imagecodecs\" \\\n",
-    "    \"zarr<3\" \\\n",
-    "    \"tifffile\""
-   ]
+   "source": "%%capture\n# Clone the IDC-patched GrandQC fork (branch idc-dicom-fixes, shallow clone).\n# This branch applies small, self-contained fixes on top of upstream\n# cpath-ukk/grandqc so the inference scripts run directly on IDC DICOM series.\n# See the verification cell in Part 2.3 for the list of fixes.\n!git clone --depth 1 --branch idc-dicom-fixes https://github.com/fedorov/grandqc.git\n\n# Install GrandQC Python dependencies.\n# PyTorch is already present in Colab; we install the remaining packages.\n# segmentation-models-pytorch 0.3.1 is pinned because its model.predict() API\n# was removed in 0.4.0 and GrandQC relies on it.\n!pip install -q \\\n    \"segmentation-models-pytorch==0.3.1\" \\\n    \"rasterio>=1.3\" \\\n    \"imagecodecs\" \\\n    \"zarr<3\" \\\n    \"tifffile\""
   },
   {
    "cell_type": "markdown",
@@ -1358,21 +1333,14 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "### 2.3 Patch Scripts for DICOM Directory Support\n\nIDC downloads each series as a directory of tiled DICOM files (`*.dcm`). OpenSlide 4.0+ can open any one of those files and reconstruct the full pyramid from the same directory.\n\nThe changes per script are:\n\n| Script | Change |\n|--------|--------|\n| `wsi_tis_detect.py` | Replace file-only `os.listdir` filter with one that also yields subdirectories containing `.dcm` files; resolve the path before calling `OpenSlide()` |\n| `main.py` | Same two changes using `open_slide()`; plus `torch.load(..., weights_only=False)` so the pickled QC model loads under PyTorch ≥ 2.6 |\n\nBoth scripts also get automatic CUDA/MPS/CPU device selection instead of the hardcoded `'cuda'`.\n\n> **Note**: GrandQC's scripts iterate over files in a flat folder; IDC downloads each DICOM series into its own subdirectory, so the patch teaches them to recognise a directory whose contents end in `.dcm` as a valid slide entry and resolves the OpenSlide path accordingly. The `weights_only=False` change is required because PyTorch 2.6 made `weights_only=True` the default for `torch.load`, which refuses to unpickle the `segmentation_models_pytorch` model object that `main.py` loads. No model weights or algorithm logic are changed."
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": "from pathlib import Path\n\ndef patch_grandqc_for_dicom(script_path: str) -> None:\n    text = Path(script_path).read_text()\n    original = text\n\n    # ── 1. Auto-detect compute device ──────────────────────────────────────\n    text = text.replace(\n        \"DEVICE = 'cuda'\",\n        \"import torch as _t; DEVICE = ('cuda' if _t.cuda.is_available() \"\n        \"else ('mps' if _t.backends.mps.is_available() else 'cpu'))\",\n    )\n\n    # ── 2. Replace slide-list builder to also accept DICOM directories ──────\n    DICOM_ITER = (\n        \"def _iter_slides(d):\\n\"\n        \"    for n in sorted(os.listdir(d)):\\n\"\n        \"        p = os.path.join(d, n)\\n\"\n        \"        if os.path.isfile(p) or (\\n\"\n        \"                os.path.isdir(p) and\\n\"\n        \"                any(x.lower().endswith('.dcm') for x in os.listdir(p))):\\n\"\n        \"            yield n\\n\"\n    )\n\n    # wsi_tis_detect.py pattern\n    OLD_LIST_TIS = (\n        \"slide_names = sorted([f for f in os.listdir(SLIDE_DIR) \"\n        \"if os.path.isfile(os.path.join(SLIDE_DIR, f))])\"\n    )\n    NEW_LIST_TIS = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n\n    # main.py pattern\n    OLD_LIST_MAIN = \"slide_names = sorted(os.listdir(SLIDE_DIR))\"\n    NEW_LIST_MAIN = DICOM_ITER + \"slide_names = list(_iter_slides(SLIDE_DIR))\"\n\n    text = text.replace(OLD_LIST_TIS, NEW_LIST_TIS)\n    text = text.replace(OLD_LIST_MAIN, NEW_LIST_MAIN)\n\n    # ── 3. Resolve DICOM directory to first .dcm file before open ───────────\n    RESOLVE = (\n        \"        if os.path.isdir(path_slide):\\n\"\n        \"            _dcms = sorted(x for x in os.listdir(path_slide)\"\n        \" if x.lower().endswith('.dcm'))\\n\"\n        \"            path_slide = os.path.join(path_slide, _dcms[0])\\n\"\n    )\n\n    for open_call in (\"slide = OpenSlide(path_slide)\", \"slide = open_slide(path_slide)\"):\n        old = f\"        path_slide = os.path.join(SLIDE_DIR, slide_name)\\n        {open_call}\"\n        new = f\"        path_slide = os.path.join(SLIDE_DIR, slide_name)\\n{RESOLVE}        {open_call}\"\n        text = text.replace(old, new)\n\n    # ── 4. Load the full QC model with weights_only=False ───────────────────\n    # main.py loads a pickled nn.Module via torch.load(). Since PyTorch 2.6 the\n    # default weights_only=True refuses to unpickle the segmentation_models_pytorch\n    # classes inside, raising UnpicklingError before the per-slide try/except and\n    # killing the process with exit code 1. The Zenodo weights are trusted, so we\n    # opt back into the full unpickling path. (wsi_tis_detect.py is unaffected —\n    # it loads a plain state_dict, which weights_only=True allows.)\n    text = text.replace(\n        \"model_prim = torch.load(MODEL_QC_DIR + MODEL_QC_NAME, map_location=DEVICE)\",\n        \"model_prim = torch.load(MODEL_QC_DIR + MODEL_QC_NAME, map_location=DEVICE, \"\n        \"weights_only=False)\",\n    )\n\n    # ── 5. Progress reporting in main.py's slide loop ───────────────────────\n    # Track failures and report [i/N] progress. (main.py-specific anchors — these\n    # strings do not occur in wsi_tis_detect.py, so this is a no-op there.)\n    text = text.replace(\n        \"for slide_name in slide_names[start:end]:\\n\"\n        \"    try:\\n\"\n        \"        # Register start time\\n\"\n        \"        start = timeit.default_timer()\\n\"\n        \"\\n\"\n        \"        print(\\\"\\\")\\n\"\n        \"        print(\\\"Processing:\\\", slide_name)\",\n        \"_slides_to_run = slide_names[start:end]\\n\"\n        \"_failures = []\\n\"\n        \"print(f\\\"Found {len(_slides_to_run)} slide(s) to process.\\\")\\n\"\n        \"for _idx, slide_name in enumerate(_slides_to_run, 1):\\n\"\n        \"    try:\\n\"\n        \"        # Register start time\\n\"\n        \"        start = timeit.default_timer()\\n\"\n        \"\\n\"\n        \"        print(\\\"\\\")\\n\"\n        \"        print(f\\\"[{_idx}/{len(_slides_to_run)}] Processing:\\\", slide_name)\",\n    )\n\n    # ── 6. Record per-slide failures and exit non-zero if any occurred ──────\n    # The upstream loop swallows every per-slide error and the script exits 0,\n    # so the notebook can't tell a slide failed. Collect failures and sys.exit(1)\n    # after the loop so the caller surfaces them.\n    text = text.replace(\n        \"    except Exception as e:\\n\"\n        \"        print(f\\\"There was some problem with the slide. The error is: {e}\\\")\",\n        \"    except Exception as e:\\n\"\n        \"        print(f\\\"There was some problem with the slide. The error is: {e}\\\")\\n\"\n        \"        _failures.append((slide_name, str(e)))\\n\"\n        \"\\n\"\n        \"# Surface per-slide failures with a non-zero exit so the caller sees them\\n\"\n        \"if _failures:\\n\"\n        \"    print(f\\\"\\\\n{len(_failures)} of {len(_slides_to_run)} slide(s) FAILED:\\\")\\n\"\n        \"    for _name, _err in _failures:\\n\"\n        \"        print(f\\\"  - {_name}: {_err}\\\")\\n\"\n        \"    import sys as _sys; _sys.exit(1)\\n\"\n        \"print(f\\\"\\\\nAll {len(_slides_to_run)} slide(s) processed successfully.\\\")\",\n    )\n\n    if text == original:\n        print(f\"  WARNING: no changes applied to {script_path}\")\n    else:\n        Path(script_path).write_text(text)\n        n = sum(1 for a, b in zip(original.splitlines(), text.splitlines()) if a != b)\n        n += abs(len(text.splitlines()) - len(original.splitlines()))\n        print(f\"  Patched {script_path}  (+{n} lines changed)\")\n\n\nfor script in [\n    f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\",\n    f\"{GRANDQC_SCRIPTS}/main.py\",\n]:\n    patch_grandqc_for_dicom(script)\nprint(\"Done.\")"
+   "source": "### 2.3 Fixes Included in the Fork\n\nThe upstream GrandQC inference scripts assume conventional single-file WSIs on an NVIDIA GPU. Running them directly on IDC DICOM series needs a few small changes, already applied in the [`idc-dicom-fixes`](https://github.com/fedorov/grandqc/tree/idc-dicom-fixes) branch cloned above. **No model weights or algorithm logic are changed.**\n\n| Fix | Scripts | Why it is needed |\n|-----|---------|------------------|\n| Accept DICOM **series directories** as slides | both | IDC stores each slide as a directory of `.dcm` files; upstream only accepted single files, so the slide list came out empty and nothing was processed |\n| Auto-detect compute device (CUDA / MPS / CPU) | both | upstream hardcodes `'cuda'`, which crashes on CPU/Apple Silicon |\n| `torch.load(..., weights_only=False)` for the QC model | `main.py` | PyTorch ≥ 2.6 changed the default and refuses to unpickle the model object |\n| Crop the tissue-detection mask to the thumbnail size | `wsi_tis_detect.py` | small slides produce a sub-512 px MPP-10 thumbnail, which crashed `cv2.addWeighted` in the overlay step (left `tis_det_overlay/` empty) |\n| Surface per-slide failures with a non-zero exit | both | upstream swallowed errors and exited `0`, so empty output looked like a successful run |\n\nThe cell below confirms these fixes are present in the cloned scripts."
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": "# Sanity-check: confirm the patches were inserted in both scripts\nfor script in [f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\", f\"{GRANDQC_SCRIPTS}/main.py\"]:\n    text = Path(script).read_text()\n    has_dcm  = \"_iter_slides\" in text\n    has_res  = \"_dcms\" in text\n    has_dev  = \"cuda.is_available\" in text\n    # weights_only=False only applies to main.py's full-model load\n    has_wo   = (\"weights_only=False\" in text) if script.endswith(\"main.py\") else \"n/a\"\n    print(f\"{script.split('/')[-1]:30s}  dicom_iter={has_dcm}  dcm_resolve={has_res}  \"\n          f\"auto_device={has_dev}  weights_only_fix={has_wo}\")"
+   "source": "from pathlib import Path\n\n# Confirm the IDC fixes shipped in the cloned fork branch are present.\ntis = Path(f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\").read_text()\nmn  = Path(f\"{GRANDQC_SCRIPTS}/main.py\").read_text()\n\nchecks = {\n    \"DICOM-directory input (wsi_tis_detect.py)\":   \"_iter_slides\" in tis,\n    \"DICOM-directory input (main.py)\":             \"_iter_slides\" in mn,\n    \"auto device select (wsi_tis_detect.py)\":      \"cuda.is_available\" in tis,\n    \"auto device select (main.py)\":                \"cuda.is_available\" in mn,\n    \"weights_only=False (main.py)\":                \"weights_only=False\" in mn,\n    \"small-slide overlay crop (wsi_tis_detect.py)\":\"end_image[-height:, -width:]\" in tis,\n    \"fail-loud exit (wsi_tis_detect.py)\":          \"sys.exit(1)\" in tis,\n    \"fail-loud exit (main.py)\":                    \"sys.exit(1)\" in mn,\n}\nfor name, ok in checks.items():\n    print(f\"  [{'OK ' if ok else 'MISSING'}] {name}\")\n\nassert all(checks.values()), (\n    \"Expected GrandQC fixes are not all present — check that the clone used \"\n    \"branch 'idc-dicom-fixes' of github.com/fedorov/grandqc.\"\n)\nprint(\"\\nAll expected fixes present — scripts are ready to run on IDC DICOM slides.\")"
   },
   {
    "cell_type": "markdown",

From 11e62c2d5b2daadebac4b431c074a34dd290309e Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Tue, 16 Jun 2026 10:19:09 -0400
Subject: [PATCH 4/9] Select BRCA slides present in GrandQC reference masks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The demo selection picked the smallest TCGA-BRCA slides, which are
frozen-section (TS/BS) slides absent from GrandQC's Zenodo reference
archive (DX diagnostic slides only) — so Part 6 validation matched
nothing. Switch to five verified DX barcodes (variety by magnification,
not tissue type, since all DX slides are primary tumor).

Part 6 now downloads just those 5 reference masks (~700 KB, re-hosted)
instead of the full 2 GB BRCA.tar, and fixes the mask-name matching:
reference masks are {barcode}.{UUID}.svs_mask.png while local masks are
keyed by the bare ContainerIdentifier, so a barcode->filename map is
built. Drop the cohort-wide statistics that required the full archive.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../grandqc_slide_quality_with_idc.ipynb      | 424 +-----------------
 1 file changed, 13 insertions(+), 411 deletions(-)

diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
index a68cddf..2e5a3cd 100644
--- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -10,24 +10,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n",
-    "\n",
-    "Automated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n",
-    "\n",
-    "[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n",
-    "\n",
-    "> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n",
-    "\n",
-    "[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n",
-    "\n",
-    "This notebook demonstrates how to:\n",
-    "1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n",
-    "2. **Set up** GrandQC and download its pre-trained models\n",
-    "3. **Download** slides from IDC and run GrandQC directly on DICOM files\n",
-    "4. **Visualize** tissue detection and artifact segmentation results\n",
-    "5. **Leverage** GrandQC's pre-computed QC masks for all 32 TCGA cohorts"
-   ]
+   "source": "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n\nAutomated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n\n[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n\n> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n\n[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n\nThis notebook demonstrates how to:\n1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n2. **Set up** GrandQC and download its pre-trained models\n3. **Download** slides from IDC and run GrandQC directly on DICOM files\n4. **Visualize** tissue detection and artifact segmentation results\n5. **Validate** the DICOM-based pipeline against GrandQC's pre-computed QC masks for the TCGA cohorts"
   },
   {
    "cell_type": "markdown",
@@ -687,17 +670,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "### 1.3 Select Slides for Quality Control\n",
-    "\n",
-    "For this tutorial we work with slides from the [TCGA-BRCA](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=tcga_brca) collection (The Cancer Genome Atlas — Breast Cancer). This collection is a good choice because:\n",
-    "\n",
-    "- It contains over 3,000 H&E-stained slides at 40× magnification\n",
-    "- GrandQC provides **pre-computed QC masks for all TCGA cohorts** (demonstrated in Part 6)\n",
-    "- Slides span both tumor and adjacent normal tissue, providing variety in tissue composition\n",
-    "\n",
-    "We query the `sm_index` to retrieve per-slide metadata including the `ContainerIdentifier` (the original TCGA slide barcode), which we will later use to match IDC series against the pre-computed GrandQC masks."
-   ]
+   "source": "### 1.3 Select Slides for Quality Control\n\nFor this tutorial we work with slides from the [TCGA-BRCA](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=tcga_brca) collection (The Cancer Genome Atlas — Breast Cancer). This collection is a good choice because:\n\n- It contains over 3,000 H&E-stained slides\n- GrandQC provides **pre-computed QC masks for the TCGA cohorts**, which we use to validate our DICOM-based run in Part 6\n\nThose reference masks cover only the **diagnostic FFPE slides** (TCGA `DX` barcodes), not the frozen-section slides (`TS`/`BS`), so we deliberately select diagnostic slides here. We query the `sm_index` to retrieve per-slide metadata including the `ContainerIdentifier` (the original TCGA slide barcode), which we use to match IDC series against the pre-computed GrandQC masks."
   },
   {
    "cell_type": "code",
@@ -1027,215 +1000,14 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "### 1.4 Select a Representative Set of Slides\n",
-    "\n",
-    "For the hands-on GrandQC run we select a small set of slides that covers:\n",
-    "- Different tissue types (tumor vs. normal)\n",
-    "- Different magnifications (20× and 40×)\n",
-    "- Manageable file sizes for a notebook environment\n",
-    "\n",
-    "We cap the selection at slides under 200 MB to keep download and processing times reasonable."
-   ]
+   "source": "### 1.4 Select a Representative Set of Slides\n\nFor the hands-on GrandQC run we use a small set of **diagnostic (DX) H&E slides** that:\n- are confirmed to have pre-computed GrandQC reference masks in the Zenodo `BRCA.tar` archive (so Part 6 can validate against them),\n- span both magnifications GrandQC encounters in TCGA-BRCA (20× and 40×), and\n- are small enough (≤ ~150 MB) for quick download and processing in a notebook.\n\nAll TCGA-BRCA diagnostic slides are primary-tumor sections, so the variety here comes from magnification rather than tissue type. The five barcodes below were verified against the Zenodo archive."
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Selected 3 slides for GrandQC processing:\n"
-     ]
-    },
-    {
-     "data": {
-      "text/html": [
-       "\n",
-       "  <div id=\"df-b42db108-ec3b-4b58-b20f-013dba0c6cde\" class=\"colab-df-container\">\n",
-       "    <div>\n",
-       "<style scoped>\n",
-       "    .dataframe tbody tr th:only-of-type {\n",
-       "        vertical-align: middle;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe tbody tr th {\n",
-       "        vertical-align: top;\n",
-       "    }\n",
-       "\n",
-       "    .dataframe thead th {\n",
-       "        text-align: right;\n",
-       "    }\n",
-       "</style>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "  <thead>\n",
-       "    <tr style=\"text-align: right;\">\n",
-       "      <th></th>\n",
-       "      <th>PatientID</th>\n",
-       "      <th>ContainerIdentifier</th>\n",
-       "      <th>tissue_type</th>\n",
-       "      <th>objective_power</th>\n",
-       "      <th>width_px</th>\n",
-       "      <th>height_px</th>\n",
-       "      <th>size_MB</th>\n",
-       "    </tr>\n",
-       "  </thead>\n",
-       "  <tbody>\n",
-       "    <tr>\n",
-       "      <th>0</th>\n",
-       "      <td>TCGA-A7-A26J</td>\n",
-       "      <td>TCGA-A7-A26J-01B-02-BS2</td>\n",
-       "      <td>Neoplasm, Primary</td>\n",
-       "      <td>40</td>\n",
-       "      <td>8618</td>\n",
-       "      <td>7243</td>\n",
-       "      <td>9.1</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>1</th>\n",
-       "      <td>TCGA-AC-A2QJ</td>\n",
-       "      <td>TCGA-AC-A2QJ-11A-02-TS2</td>\n",
-       "      <td>Normal</td>\n",
-       "      <td>20</td>\n",
-       "      <td>33993</td>\n",
-       "      <td>6923</td>\n",
-       "      <td>15.5</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>2</th>\n",
-       "      <td>TCGA-E2-A15K</td>\n",
-       "      <td>TCGA-E2-A15K-06A-01-TS1</td>\n",
-       "      <td>None</td>\n",
-       "      <td>40</td>\n",
-       "      <td>33319</td>\n",
-       "      <td>19962</td>\n",
-       "      <td>85.7</td>\n",
-       "    </tr>\n",
-       "  </tbody>\n",
-       "</table>\n",
-       "</div>\n",
-       "    <div class=\"colab-df-buttons\">\n",
-       "      \n",
-       "  <div class=\"colab-df-container\">\n",
-       "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b42db108-ec3b-4b58-b20f-013dba0c6cde')\"\n",
-       "            title=\"Convert this dataframe to an interactive table.\"\n",
-       "            style=\"display:none;\">\n",
-       "      \n",
-       "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
-       "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
-       "  </svg>\n",
-       "    </button>\n",
-       "    \n",
-       "  <style>\n",
-       "    .colab-df-container {\n",
-       "      display:flex;\n",
-       "      gap: 12px;\n",
-       "    }\n",
-       "\n",
-       "    .colab-df-convert {\n",
-       "      background-color: #E8F0FE;\n",
-       "      border: none;\n",
-       "      border-radius: 50%;\n",
-       "      cursor: pointer;\n",
-       "      display: none;\n",
-       "      fill: #1967D2;\n",
-       "      height: 32px;\n",
-       "      padding: 0 0 0 0;\n",
-       "      width: 32px;\n",
-       "    }\n",
-       "\n",
-       "    .colab-df-convert:hover {\n",
-       "      background-color: #E2EBFA;\n",
-       "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
-       "      fill: #174EA6;\n",
-       "    }\n",
-       "\n",
-       "    .colab-df-buttons div {\n",
-       "      margin-bottom: 4px;\n",
-       "    }\n",
-       "\n",
-       "    [theme=dark] .colab-df-convert {\n",
-       "      background-color: #3B4455;\n",
-       "      fill: #D2E3FC;\n",
-       "    }\n",
-       "\n",
-       "    [theme=dark] .colab-df-convert:hover {\n",
-       "      background-color: #434B5C;\n",
-       "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
-       "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
-       "      fill: #FFFFFF;\n",
-       "    }\n",
-       "  </style>\n",
-       "\n",
-       "    <script>\n",
-       "      const buttonEl =\n",
-       "        document.querySelector('#df-b42db108-ec3b-4b58-b20f-013dba0c6cde button.colab-df-convert');\n",
-       "      buttonEl.style.display =\n",
-       "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
-       "\n",
-       "      async function convertToInteractive(key) {\n",
-       "        const element = document.querySelector('#df-b42db108-ec3b-4b58-b20f-013dba0c6cde');\n",
-       "        const dataTable =\n",
-       "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
-       "                                                    [key], {});\n",
-       "        if (!dataTable) return;\n",
-       "\n",
-       "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
-       "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
-       "          + ' to learn more about interactive tables.';\n",
-       "        element.innerHTML = '';\n",
-       "        dataTable['output_type'] = 'display_data';\n",
-       "        await google.colab.output.renderOutput(dataTable, element);\n",
-       "        const docLink = document.createElement('div');\n",
-       "        docLink.innerHTML = docLinkHtml;\n",
-       "        element.appendChild(docLink);\n",
-       "      }\n",
-       "    </script>\n",
-       "  </div>\n",
-       "  \n",
-       "    </div>\n",
-       "  </div>\n",
-       "  "
-      ],
-      "text/plain": [
-       "      PatientID      ContainerIdentifier        tissue_type  objective_power  \\\n",
-       "0  TCGA-A7-A26J  TCGA-A7-A26J-01B-02-BS2  Neoplasm, Primary               40   \n",
-       "1  TCGA-AC-A2QJ  TCGA-AC-A2QJ-11A-02-TS2             Normal               20   \n",
-       "2  TCGA-E2-A15K  TCGA-E2-A15K-06A-01-TS1               None               40   \n",
-       "\n",
-       "   width_px  height_px  size_MB  \n",
-       "0      8618       7243      9.1  \n",
-       "1     33993       6923     15.5  \n",
-       "2     33319      19962     85.7  "
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
-   "source": [
-    "# Pick up to 5 small slides with variety in tissue type and magnification\n",
-    "demo_slides = (\n",
-    "    tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200]\n",
-    "    .drop_duplicates(subset=\"tissue_type\")\n",
-    "    .head(5)\n",
-    "    .reset_index(drop=True)\n",
-    ")\n",
-    "\n",
-    "# Fall back to the 5 smallest slides if tissue_type is missing\n",
-    "if len(demo_slides) < 3:\n",
-    "    demo_slides = tcga_brca_he[tcga_brca_he[\"size_MB\"] < 200].head(5).reset_index(drop=True)\n",
-    "\n",
-    "print(f\"Selected {len(demo_slides)} slides for GrandQC processing:\")\n",
-    "display(\n",
-    "    demo_slides[[\n",
-    "        \"PatientID\", \"ContainerIdentifier\", \"tissue_type\",\n",
-    "        \"objective_power\", \"width_px\", \"height_px\", \"size_MB\"\n",
-    "    ]]\n",
-    ")"
-   ]
+   "outputs": [],
+   "source": "# Diagnostic (DX) H&E slides verified to have pre-computed GrandQC reference\n# masks in the TCGA-BRCA archive on Zenodo (record 14041578). Every DX slide is\n# primary tumor, so the variety here is in objective power (40x and 20x).\nDEMO_BARCODES = [\n    \"TCGA-AC-A23G-01Z-00-DX1\",  # 40x,  44 MB\n    \"TCGA-AC-A23C-01Z-00-DX1\",  # 40x,  57 MB\n    \"TCGA-AC-A62V-01Z-00-DX1\",  # 40x,  60 MB\n    \"TCGA-MS-A51U-01Z-00-DX1\",  # 20x,  76 MB\n    \"TCGA-A8-A0AB-01Z-00-DX1\",  # 20x, 151 MB\n]\n\navailable = set(tcga_brca_he[\"ContainerIdentifier\"])\npresent   = [b for b in DEMO_BARCODES if b in available]\nmissing   = [b for b in DEMO_BARCODES if b not in available]\nif missing:\n    print(f\"WARNING: {len(missing)} barcode(s) not found in this IDC version: {missing}\")\n\ndemo_slides = (\n    tcga_brca_he[tcga_brca_he[\"ContainerIdentifier\"].isin(present)]\n    .drop_duplicates(subset=\"ContainerIdentifier\")\n    .set_index(\"ContainerIdentifier\")\n    .loc[present]                  # preserve our chosen order\n    .reset_index()\n)\n\nprint(f\"Selected {len(demo_slides)} slides for GrandQC processing:\")\ndisplay(\n    demo_slides[[\n        \"PatientID\", \"ContainerIdentifier\", \"tissue_type\",\n        \"objective_power\", \"width_px\", \"height_px\", \"size_MB\"\n    ]]\n)"
   },
   {
    "cell_type": "code",
@@ -1383,14 +1155,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "### 3.2 Rename Series Directories to TCGA Barcodes\n",
-    "\n",
-    "GrandQC uses the directory/file name as the key for all output files (tissue mask, QC map, overlay, TSV report row). By renaming each downloaded series directory from its `SeriesInstanceUID` to its `ContainerIdentifier` (TCGA barcode), we get:\n",
-    "\n",
-    "- Human-readable output filenames (`TCGA-A7-A26J-01B-02-BS2_MASK.png` instead of a UID)\n",
-    "- Automatic matching with GrandQC's pre-computed TCGA masks (Part 6), which are keyed by the same barcode"
-   ]
+   "source": "### 3.2 Rename Series Directories to TCGA Barcodes\n\nGrandQC uses the directory/file name as the key for all output files (tissue mask, QC map, overlay, TSV report row). By renaming each downloaded series directory from its `SeriesInstanceUID` to its `ContainerIdentifier` (TCGA barcode), we get:\n\n- Human-readable output filenames (`TCGA-AC-A23G-01Z-00-DX1_mask.png` instead of a UID)\n- Straightforward matching with GrandQC's pre-computed TCGA masks (Part 6), which embed the same barcode in their filename"
   },
   {
    "cell_type": "code",
@@ -1751,27 +1516,12 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "## Part 6: Validating the Pipeline Against Pre-computed TCGA Masks\n",
-    "\n",
-    "In Parts 3–5 we ran GrandQC locally on IDC DICOM slides using OpenSlide's native DICOM reader — a path the GrandQC authors did not test in their original publication. Before using the pipeline on new data, it is good practice to verify that our results are consistent with the authors' reference outputs.\n",
-    "\n",
-    "The GrandQC authors published pre-computed QC masks for all 32 TCGA cohorts on Zenodo:\n",
-    "\n",
-    "> **GrandQC pre-computed masks**: [zenodo.org/records/14041578](https://zenodo.org/records/14041578)  \n",
-    "> 36 tar archives — one per TCGA cohort (some large cohorts split into two parts) — totalling ~18.3 GB.\n",
-    "\n",
-    "Each archive unpacks to PNG files named `{ContainerIdentifier}_mask.png` with integer labels 1–7 — the same format and naming convention produced by `main.py`. By downloading the TCGA-BRCA archive and comparing artifact fractions for our demo slides against the reference, we can confirm that the DICOM-based pipeline is producing equivalent results."
-   ]
+   "source": "## Part 6: Validating the Pipeline Against Pre-computed TCGA Masks\n\nIn Parts 3–5 we ran GrandQC locally on IDC DICOM slides using OpenSlide's native DICOM reader — a path the GrandQC authors did not test in their original publication. Before using the pipeline on new data, it is good practice to verify that our results are consistent with the authors' reference outputs.\n\nThe GrandQC authors published pre-computed QC masks for the TCGA cohorts on Zenodo:\n\n> **GrandQC pre-computed masks**: [zenodo.org/records/14041578](https://zenodo.org/records/14041578)  \n> One tar archive per TCGA cohort (a few large cohorts are split into two parts), ~18 GB in total. The TCGA-BRCA archive (`BRCA.tar`) alone is ~2 GB and holds masks for 1,105 **diagnostic (DX) slides**.\n\nEach mask is a PNG named after the original slide file — `{barcode}.{UUID}.svs_mask.png` — with integer labels 1–7, the same classes `main.py` produces. Rather than download the full 2 GB cohort archive, we pull just the reference masks for our five demo slides (~700 KB, extracted from `BRCA.tar` and re-hosted) and compare artifact fractions against our local DICOM run."
   },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "### 6.1 Available Cohort Archives\n",
-    "\n",
-    "The Zenodo record contains one tar file per TCGA cohort. Large cohorts (BRCA, THCA, GBM, LUAD, UCEC, LGG, TGCT) are split into two parts."
-   ]
+   "source": "### 6.1 Available Cohort Archives\n\nThe Zenodo record contains one tar file per TCGA cohort; a few large cohorts (LUAD, LUSC, THCA, TGCT) are split into two parts. The cell below lists them straight from the Zenodo API — no large download."
   },
   {
    "cell_type": "code",
@@ -1801,36 +1551,14 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "### 6.2 Download the TCGA-BRCA Reference Masks\n",
-    "\n",
-    "`BRCA.tar` (~2 GB) contains masks for all TCGA-BRCA slides. After extraction, each file is named `{ContainerIdentifier}_mask.png` — identical to what GrandQC writes into `mask_qc/` locally."
-   ]
+   "source": "### 6.2 Download the Reference Masks for Our Demo Slides\n\nWe download just the pre-computed masks for the five demo slides (~700 KB), extracted from `BRCA.tar` and re-hosted as a small folder. Each reference mask is named `{barcode}.{UUID}.svs_mask.png`, whereas our local masks in `qc_output/mask_qc/` are keyed by the bare `ContainerIdentifier` barcode — so we also build a lookup from barcode to the full reference filename.\n\n> To reproduce population-level statistics across the whole cohort, download the full `BRCA.tar` (~2 GB) from the Zenodo record above and point `REF_MASKS_DIR` at the extracted folder; the comparison loop below works unchanged on any number of masks."
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "REF_MASKS_DIR = \"brca_ref_masks\"\n",
-    "os.makedirs(REF_MASKS_DIR, exist_ok=True)\n",
-    "\n",
-    "# Download BRCA.tar (~2 GB) — takes 5-10 min on Colab\n",
-    "!wget -q --show-progress \\\n",
-    "    \"https://zenodo.org/records/14041578/files/BRCA.tar?download=1\" \\\n",
-    "    -O BRCA.tar\n",
-    "\n",
-    "# Extract — PNG files land directly in the archive root\n",
-    "!tar xf BRCA.tar -C {REF_MASKS_DIR}\n",
-    "\n",
-    "# Report what we got\n",
-    "ref_masks = sorted(f for f in os.listdir(REF_MASKS_DIR) if f.endswith(\"_mask.png\"))\n",
-    "print(f\"\\nExtracted {len(ref_masks)} reference masks to {REF_MASKS_DIR}/\")\n",
-    "print(\"Sample filenames:\")\n",
-    "for f in ref_masks[:5]:\n",
-    "    print(f\"  {f}\")"
-   ]
+   "source": "import io, zipfile, requests\n\nREF_MASKS_DIR = \"brca_ref_masks\"\nos.makedirs(REF_MASKS_DIR, exist_ok=True)\n\n# Pre-computed GrandQC masks for our five demo slides, extracted from the\n# TCGA-BRCA Zenodo archive (record 14041578) and re-hosted as a small folder\n# (~700 KB) so we don't have to download the full ~2 GB BRCA.tar. The link is a\n# Dropbox shared folder; appending '&dl=1' returns the whole folder as a zip.\nREF_MASKS_URL = (\n    \"https://www.dropbox.com/scl/fo/z5jtldr5ed384s8nntiwe/\"\n    \"ALUgs22f_0PI8oGO6XJ5eG0?rlkey=b92k8rt6jvqllwkad31obvbyx&dl=1\"\n)\n\nresp = requests.get(REF_MASKS_URL)\nresp.raise_for_status()\nwith zipfile.ZipFile(io.BytesIO(resp.content)) as zf:\n    for member in zf.namelist():\n        if member.endswith(\"_mask.png\"):\n            target = os.path.join(REF_MASKS_DIR, os.path.basename(member))\n            with zf.open(member) as src, open(target, \"wb\") as dst:\n                dst.write(src.read())\n\nref_masks = sorted(f for f in os.listdir(REF_MASKS_DIR) if f.endswith(\"_mask.png\"))\n\n# Reference masks are named '{barcode}.{UUID}.svs_mask.png'; our local masks are\n# keyed by the bare ContainerIdentifier barcode (no dots). Map barcode -> filename.\nref_mask_by_barcode = {f.split(\".\")[0]: f for f in ref_masks}\n\nprint(f\"Downloaded {len(ref_masks)} reference masks to {REF_MASKS_DIR}/\")\nfor f in ref_masks:\n    print(f\"  {f}\")"
   },
   {
    "cell_type": "markdown",
@@ -1848,48 +1576,7 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "def mask_fractions(mask_path):\n",
-    "    mask = np.array(Image.open(mask_path))\n",
-    "    tissue_px = np.sum(mask != 7)\n",
-    "    if tissue_px == 0:\n",
-    "        return None\n",
-    "    return {name: np.sum(mask == cls) / tissue_px * 100\n",
-    "            for cls, name in enumerate(CLASS_NAMES_SHORT, start=1)}\n",
-    "\n",
-    "\n",
-    "# Build comparison table for each demo slide\n",
-    "compare_rows = []\n",
-    "for barcode in uid_to_barcode.values():\n",
-    "    local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n",
-    "    ref_path   = os.path.join(REF_MASKS_DIR, f\"{barcode}_mask.png\")\n",
-    "\n",
-    "    if not os.path.exists(local_path):\n",
-    "        print(f\"  MISSING local mask: {barcode}\")\n",
-    "        continue\n",
-    "    if not os.path.exists(ref_path):\n",
-    "        print(f\"  MISSING reference mask: {barcode} — slide may not be in BRCA.tar\")\n",
-    "        continue\n",
-    "\n",
-    "    local_frac = mask_fractions(local_path)\n",
-    "    ref_frac   = mask_fractions(ref_path)\n",
-    "    if local_frac is None or ref_frac is None:\n",
-    "        continue\n",
-    "\n",
-    "    for name in CLASS_NAMES_SHORT:\n",
-    "        compare_rows.append({\n",
-    "            \"slide\":    barcode,\n",
-    "            \"class\":    name,\n",
-    "            \"local_%\":  round(local_frac[name], 2),\n",
-    "            \"ref_%\":    round(ref_frac[name], 2),\n",
-    "            \"delta_pp\": round(local_frac[name] - ref_frac[name], 2),\n",
-    "        })\n",
-    "\n",
-    "compare_df = pd.DataFrame(compare_rows)\n",
-    "print(\"Per-class fraction comparison (local DICOM run vs. Zenodo reference):\\n\")\n",
-    "display(compare_df.pivot_table(index=\"slide\", columns=\"class\", values=\"delta_pp\").round(2))\n",
-    "print(\"\\nδ = local − reference (percentage points).  Values near 0 confirm agreement.\")"
-   ]
+   "source": "def mask_fractions(mask_path):\n    mask = np.array(Image.open(mask_path))\n    tissue_px = np.sum(mask != 7)\n    if tissue_px == 0:\n        return None\n    return {name: np.sum(mask == cls) / tissue_px * 100\n            for cls, name in enumerate(CLASS_NAMES_SHORT, start=1)}\n\n\n# Build comparison table for each demo slide\ncompare_rows = []\nfor barcode in uid_to_barcode.values():\n    local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n    ref_fname  = ref_mask_by_barcode.get(barcode)\n    ref_path   = os.path.join(REF_MASKS_DIR, ref_fname) if ref_fname else None\n\n    if not os.path.exists(local_path):\n        print(f\"  MISSING local mask: {barcode}\")\n        continue\n    if ref_path is None or not os.path.exists(ref_path):\n        print(f\"  MISSING reference mask: {barcode} — slide not in the reference set\")\n        continue\n\n    local_frac = mask_fractions(local_path)\n    ref_frac   = mask_fractions(ref_path)\n    if local_frac is None or ref_frac is None:\n        continue\n\n    for name in CLASS_NAMES_SHORT:\n        compare_rows.append({\n            \"slide\":    barcode,\n            \"class\":    name,\n            \"local_%\":  round(local_frac[name], 2),\n            \"ref_%\":    round(ref_frac[name], 2),\n            \"delta_pp\": round(local_frac[name] - ref_frac[name], 2),\n        })\n\ncompare_df = pd.DataFrame(compare_rows)\nprint(\"Per-class fraction comparison (local DICOM run vs. Zenodo reference):\\n\")\ndisplay(compare_df.pivot_table(index=\"slide\", columns=\"class\", values=\"delta_pp\").round(2))\nprint(\"\\nδ = local − reference (percentage points).  Values near 0 confirm agreement.\")"
   },
   {
    "cell_type": "code",
@@ -1924,92 +1611,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": [
-    "### 6.4 Cohort-level QC Statistics\n",
-    "\n",
-    "With validation confirmed, we compute per-slide artifact fractions across all TCGA-BRCA slides in the reference archive. This gives a population-level view of slide quality without running any inference locally — each mask is read once and discarded, keeping memory usage constant."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from tqdm.notebook import tqdm\n",
-    "\n",
-    "cohort_rows = []\n",
-    "for fname in tqdm(ref_masks, desc=\"Computing stats\"):\n",
-    "    barcode = fname.replace(\"_mask.png\", \"\")\n",
-    "    fracs = mask_fractions(os.path.join(REF_MASKS_DIR, fname))\n",
-    "    if fracs is not None:\n",
-    "        fracs[\"slide\"] = barcode\n",
-    "        cohort_rows.append(fracs)\n",
-    "\n",
-    "cohort_df = pd.DataFrame(cohort_rows).set_index(\"slide\")\n",
-    "\n",
-    "print(f\"Cohort: TCGA-BRCA  |  {len(cohort_df)} slides\\n\")\n",
-    "print(\"Median artifact fractions (% of tissue area):\")\n",
-    "display(cohort_df[CLASS_NAMES_SHORT].median().rename(\"median_%\").to_frame().T.round(1))\n",
-    "print()\n",
-    "print(\"90th-percentile artifact fractions:\")\n",
-    "display(cohort_df[CLASS_NAMES_SHORT].quantile(0.9).rename(\"p90_%\").to_frame().T.round(1))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Violin plot of per-slide artifact fractions across TCGA-BRCA\n",
-    "# Exclude Clean tissue (class 1) — typically >80%, dominates the y-axis\n",
-    "artifact_classes = [c for c in CLASS_NAMES_SHORT if c != \"Clean tissue\"]\n",
-    "artifact_colors  = [mpl_colors[i] for i, c in enumerate(CLASS_NAMES_SHORT) if c != \"Clean tissue\"]\n",
-    "\n",
-    "fig, ax = plt.subplots(figsize=(9, 4))\n",
-    "data    = [cohort_df[cls].dropna().values for cls in artifact_classes]\n",
-    "parts   = ax.violinplot(data, positions=range(len(artifact_classes)),\n",
-    "                        showmedians=True, showextrema=False)\n",
-    "\n",
-    "for pc, color in zip(parts[\"bodies\"], artifact_colors):\n",
-    "    pc.set_facecolor(color)\n",
-    "    pc.set_alpha(0.75)\n",
-    "parts[\"cmedians\"].set_color(\"black\")\n",
-    "parts[\"cmedians\"].set_linewidth(1.5)\n",
-    "\n",
-    "ax.set_xticks(range(len(artifact_classes)))\n",
-    "ax.set_xticklabels(artifact_classes, rotation=20, ha=\"right\")\n",
-    "ax.set_ylabel(\"% of tissue area\")\n",
-    "ax.set_title(f\"Artifact fraction distribution — TCGA-BRCA ({len(cohort_df)} slides)\",\n",
-    "             fontweight=\"bold\")\n",
-    "plt.tight_layout()\n",
-    "plt.show()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Join QC stats with IDC sm_index metadata via ContainerIdentifier\n",
-    "cohort_meta = cohort_df.reset_index().rename(columns={\"slide\": \"ContainerIdentifier\"})\n",
-    "brca_meta   = tcga_brca_he[[\"ContainerIdentifier\", \"tissue_type\"]].dropna(subset=[\"tissue_type\"])\n",
-    "\n",
-    "merged = cohort_meta.merge(brca_meta, on=\"ContainerIdentifier\", how=\"inner\")\n",
-    "\n",
-    "print(f\"Slides with tissue-type annotation: {len(merged)} / {len(cohort_df)}\\n\")\n",
-    "\n",
-    "# Median artifact fractions by tissue type\n",
-    "by_tissue = (\n",
-    "    merged.groupby(\"tissue_type\")[CLASS_NAMES_SHORT]\n",
-    "    .median()\n",
-    "    .round(1)\n",
-    ")\n",
-    "print(\"Median artifact fractions (%) by tissue type:\")\n",
-    "display(by_tissue)"
-   ]
+   "source": "### 6.4 Scaling to Cohort-level QC\n\nBecause each reference mask is just a small PNG, the comparison above scales to an entire cohort. Downloading the full `BRCA.tar` (~2 GB) from the Zenodo record gives pre-computed masks for all 1,105 TCGA-BRCA diagnostic slides — enough to compute population-level artifact statistics (medians, percentiles, distributions) **without running any inference locally**. We keep this notebook lightweight by validating on the five demo slides only; the loop in section 6.3 generalizes directly to the full mask set, reading one mask at a time so memory stays constant."
   }
  ],
  "metadata": {

From df0c20c2ea3864390cda44cf368b41a3774426a0 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Tue, 16 Jun 2026 10:52:23 -0400
Subject: [PATCH 5/9] Add artifact-boundary comparison to GrandQC validation

New section 6.4 overlays per-class artifact region boundaries for the
reference (Zenodo) and local DICOM runs on a shared grid: solid outline
for the reference, dashed for the local run, in the matching artifact
colour. This exposes boundary-level localisation differences that the
aggregate per-class fractions in 6.3 can hide. The reference mask is
resized onto the local grid (nearest-neighbour) since the two can differ
slightly in pixel dimensions. Existing cohort-scaling note renumbered to 6.5.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../grandqc_slide_quality_with_idc.ipynb         | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
index 2e5a3cd..fac1ee8 100644
--- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -1608,10 +1608,22 @@
     "plt.show()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "source": "### 6.4 Compare Artifact Region Boundaries\n\nPer-class fractions (6.3) tell us *how much* of each artifact the two runs agree on, but not *where*. Here we overlay the **boundaries** of each artifact region for both runs on the same grid: a **solid** outline for the reference (Zenodo) mask and a **dashed** outline for our local DICOM run, drawn in the matching artifact colour. Where the two outlines hug each other the pipelines localise the artifact identically; visible gaps reveal boundary-level differences that aggregate fractions can hide.\n\nThe reference and local masks are both produced at MPP 1.5, but their pixel dimensions can differ slightly, so the reference is resized onto the local grid (nearest-neighbour) before overlaying. Clean tissue (class 1) and background (class 7) are omitted — only the five artifact classes are outlined.",
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "source": "from matplotlib.lines import Line2D\n\n# Artifact classes to outline (skip clean tissue = 1 and background = 7)\nCONTOUR_CLASSES = [2, 3, 4, 5, 6]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\nfig, axes = plt.subplots(1, n, figsize=(4.5 * n, 4.8))\nif n == 1:\n    axes = [axes]\n\nfor ax, barcode in zip(axes, slides):\n    local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n    ref_fname  = ref_mask_by_barcode.get(barcode)\n    if not os.path.exists(local_path) or not ref_fname:\n        ax.set_title(f\"{barcode}\\n(mask missing)\", fontsize=7)\n        ax.axis(\"off\")\n        continue\n\n    local = np.array(Image.open(local_path))\n    # Both masks are produced at MPP 1.5, but the exact pixel dimensions can\n    # differ slightly; resize the reference onto the local grid with nearest-\n    # neighbour interpolation (preserves the integer class labels) so the two\n    # sets of contours share one coordinate system.\n    ref_img = Image.open(os.path.join(REF_MASKS_DIR, ref_fname))\n    if ref_img.size != (local.shape[1], local.shape[0]):\n        ref_img = ref_img.resize((local.shape[1], local.shape[0]), Image.NEAREST)\n    ref = np.array(ref_img)\n\n    # Light-grey tissue silhouette for spatial context (white = background)\n    ax.imshow(np.where(local != 7, 0.9, 1.0), cmap=\"gray\", vmin=0, vmax=1)\n\n    for cls in CONTOUR_CLASSES:\n        color = mpl_colors[cls - 1]\n        if np.any(ref == cls):\n            ax.contour((ref == cls).astype(float), levels=[0.5], colors=[color],\n                       linewidths=1.1, linestyles=\"solid\", origin=\"upper\")\n        if np.any(local == cls):\n            ax.contour((local == cls).astype(float), levels=[0.5], colors=[color],\n                       linewidths=1.1, linestyles=\"dashed\", origin=\"upper\")\n    ax.set_title(barcode, fontsize=7)\n    ax.axis(\"off\")\n\n# Legend: colour encodes the artifact class, line style encodes the source\nclass_handles = [Line2D([0], [0], color=mpl_colors[c - 1], lw=2,\n                        label=CLASS_NAMES_SHORT[c - 1]) for c in CONTOUR_CLASSES]\nstyle_handles = [\n    Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"solid\",  label=\"Reference (Zenodo)\"),\n    Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"dashed\", label=\"Local (DICOM)\"),\n]\nfig.legend(handles=class_handles + style_handles, loc=\"lower center\",\n           ncol=len(class_handles) + len(style_handles), fontsize=7,\n           framealpha=0.9, bbox_to_anchor=(0.5, -0.02))\nplt.suptitle(\"Artifact region boundaries — reference (solid) vs. local DICOM (dashed)\",\n             fontweight=\"bold\", fontsize=10)\nplt.tight_layout(rect=[0, 0.04, 1, 0.97])\nplt.show()",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "### 6.4 Scaling to Cohort-level QC\n\nBecause each reference mask is just a small PNG, the comparison above scales to an entire cohort. Downloading the full `BRCA.tar` (~2 GB) from the Zenodo record gives pre-computed masks for all 1,105 TCGA-BRCA diagnostic slides — enough to compute population-level artifact statistics (medians, percentiles, distributions) **without running any inference locally**. We keep this notebook lightweight by validating on the five demo slides only; the loop in section 6.3 generalizes directly to the full mask set, reading one mask at a time so memory stays constant."
+   "source": "### 6.5 Scaling to Cohort-level QC\n\nBecause each reference mask is just a small PNG, the comparison above scales to an entire cohort. Downloading the full `BRCA.tar` (~2 GB) from the Zenodo record gives pre-computed masks for all 1,105 TCGA-BRCA diagnostic slides — enough to compute population-level artifact statistics (medians, percentiles, distributions) **without running any inference locally**. We keep this notebook lightweight by validating on the five demo slides only; the loop in section 6.3 generalizes directly to the full mask set, reading one mask at a time so memory stays constant."
   }
  ],
  "metadata": {
@@ -1635,4 +1647,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 0
-}
\ No newline at end of file
+}

From c6231c45679d52aaaadbbba5aae28631da56d425 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Tue, 16 Jun 2026 13:50:54 -0400
Subject: [PATCH 6/9] Make slide-directory rename idempotent and re-run safe

If the download cell is re-run after a rename, both the UID directory
and the renamed barcode directory exist, and os.rename(src, dst) onto a
non-empty dst raised OSError. Handle the both-present case by dropping
the redundant UID copy and keeping the barcode directory, so the cell is
safe to re-run in any order.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../grandqc_slide_quality_with_idc.ipynb      | 25 ++-----------------
 1 file changed, 2 insertions(+), 23 deletions(-)

diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
index fac1ee8..6f3e206 100644
--- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -1162,28 +1162,7 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# Build SeriesInstanceUID → ContainerIdentifier map\n",
-    "uid_to_barcode = dict(zip(demo_slides[\"SeriesInstanceUID\"], demo_slides[\"ContainerIdentifier\"]))\n",
-    "\n",
-    "for uid, barcode in uid_to_barcode.items():\n",
-    "    src = os.path.join(SLIDE_DIR, uid)\n",
-    "    dst = os.path.join(SLIDE_DIR, barcode)\n",
-    "    if os.path.isdir(src):\n",
-    "        os.rename(src, dst)\n",
-    "        print(f\"  {uid}  →  {barcode}\")\n",
-    "    elif os.path.isdir(dst):\n",
-    "        print(f\"  Already renamed: {barcode}\")\n",
-    "    else:\n",
-    "        print(f\"  WARNING: directory not found: {uid}\")\n",
-    "\n",
-    "print(\"\\nSlide directories after renaming:\")\n",
-    "for d in sorted(os.listdir(SLIDE_DIR)):\n",
-    "    dpath = os.path.join(SLIDE_DIR, d)\n",
-    "    if os.path.isdir(dpath):\n",
-    "        n = sum(1 for f in os.listdir(dpath) if f.lower().endswith(\".dcm\"))\n",
-    "        print(f\"  {d}/  ({n} .dcm files)\")"
-   ]
+   "source": "import shutil\n\n# Build SeriesInstanceUID → ContainerIdentifier map\nuid_to_barcode = dict(zip(demo_slides[\"SeriesInstanceUID\"], demo_slides[\"ContainerIdentifier\"]))\n\n# Rename each downloaded UID directory to its barcode. This is written to be\n# safe to re-run, and to tolerate the download cell being re-run (which would\n# re-create the UID directory next to an already-renamed barcode directory).\nfor uid, barcode in uid_to_barcode.items():\n    src = os.path.join(SLIDE_DIR, uid)      # freshly downloaded, named by UID\n    dst = os.path.join(SLIDE_DIR, barcode)  # renamed target, named by barcode\n\n    if os.path.isdir(src) and os.path.isdir(dst):\n        # Both present → the download cell was re-run after a previous rename.\n        # The UID directory is a redundant copy of the same (immutable) series,\n        # so drop it and keep the barcode directory downstream code expects.\n        shutil.rmtree(src)\n        print(f\"  Removed duplicate UID dir; kept {barcode}\")\n    elif os.path.isdir(src):\n        os.rename(src, dst)\n        print(f\"  {uid}  →  {barcode}\")\n    elif os.path.isdir(dst):\n        print(f\"  Already renamed: {barcode}\")\n    else:\n        print(f\"  WARNING: directory not found: {uid}\")\n\nprint(\"\\nSlide directories after renaming:\")\nfor d in sorted(os.listdir(SLIDE_DIR)):\n    dpath = os.path.join(SLIDE_DIR, d)\n    if os.path.isdir(dpath):\n        n = sum(1 for f in os.listdir(dpath) if f.lower().endswith(\".dcm\"))\n        print(f\"  {d}/  ({n} .dcm files)\")"
   },
   {
    "cell_type": "markdown",
@@ -1647,4 +1626,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 0
-}
+}
\ No newline at end of file

From a46a901e1a1215291c83d6e3977c3e9e3d15ba06 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Tue, 16 Jun 2026 13:55:18 -0400
Subject: [PATCH 7/9] Lay review panels out 2 per row for larger images

With five demo slides the single-row figures made each panel tiny.
Switch the tissue-detection overlays, artifact overlays, and the
artifact-boundary comparison to a 2-column grid with larger per-panel
sizes, hiding any unused trailing panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .../grandqc_slide_quality_with_idc.ipynb      | 37 ++-----------------
 1 file changed, 3 insertions(+), 34 deletions(-)

diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
index 6f3e206..829ccaa 100644
--- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -1336,24 +1336,7 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "import matplotlib.pyplot as plt\n",
-    "import matplotlib.image as mpimg\n",
-    "\n",
-    "overlay_dir = os.path.join(OUTPUT_DIR, \"tis_det_overlay\")\n",
-    "fnames = sorted(os.listdir(overlay_dir))\n",
-    "\n",
-    "fig, axes = plt.subplots(1, len(fnames), figsize=(6 * len(fnames), 5))\n",
-    "if len(fnames) == 1:\n",
-    "    axes = [axes]\n",
-    "for ax, fname in zip(axes, fnames):\n",
-    "    ax.imshow(mpimg.imread(os.path.join(overlay_dir, fname)))\n",
-    "    ax.set_title(fname.replace(\"_OVERLAY.jpg\", \"\"), fontsize=7)\n",
-    "    ax.axis(\"off\")\n",
-    "plt.suptitle(\"Step 1 — Tissue Detection\", fontsize=11, fontweight=\"bold\")\n",
-    "plt.tight_layout()\n",
-    "plt.show()"
-   ]
+   "source": "import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\n\noverlay_dir = os.path.join(OUTPUT_DIR, \"tis_det_overlay\")\nfnames = sorted(os.listdir(overlay_dir))\n\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if len(fnames) > 1 else 1\nnrows = (len(fnames) + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(8 * ncols, 6 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, fname in zip(axes, fnames):\n    ax.imshow(mpimg.imread(os.path.join(overlay_dir, fname)))\n    ax.set_title(fname.replace(\"_OVERLAY.jpg\", \"\"), fontsize=9)\n    ax.axis(\"off\")\nfor ax in axes[len(fnames):]:   # hide unused panels in the last row\n    ax.axis(\"off\")\n\nplt.suptitle(\"Step 1 — Tissue Detection\", fontsize=13, fontweight=\"bold\")\nplt.tight_layout()\nplt.show()"
   },
   {
    "cell_type": "markdown",
@@ -1379,21 +1362,7 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "overlay_qc_dir = os.path.join(OUTPUT_DIR, \"overlays_qc\")\n",
-    "fnames_qc = sorted(os.listdir(overlay_qc_dir))\n",
-    "\n",
-    "fig, axes = plt.subplots(1, len(fnames_qc), figsize=(6 * len(fnames_qc), 5))\n",
-    "if len(fnames_qc) == 1:\n",
-    "    axes = [axes]\n",
-    "for ax, fname in zip(axes, fnames_qc):\n",
-    "    ax.imshow(mpimg.imread(os.path.join(overlay_qc_dir, fname)))\n",
-    "    ax.set_title(fname.replace(\"_overlay_QC.jpg\", \"\"), fontsize=7)\n",
-    "    ax.axis(\"off\")\n",
-    "plt.suptitle(\"Step 2 — Artifact Segmentation Overlays\", fontsize=11, fontweight=\"bold\")\n",
-    "plt.tight_layout()\n",
-    "plt.show()"
-   ]
+   "source": "overlay_qc_dir = os.path.join(OUTPUT_DIR, \"overlays_qc\")\nfnames_qc = sorted(os.listdir(overlay_qc_dir))\n\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if len(fnames_qc) > 1 else 1\nnrows = (len(fnames_qc) + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(8 * ncols, 6 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, fname in zip(axes, fnames_qc):\n    ax.imshow(mpimg.imread(os.path.join(overlay_qc_dir, fname)))\n    ax.set_title(fname.replace(\"_overlay_QC.jpg\", \"\"), fontsize=9)\n    ax.axis(\"off\")\nfor ax in axes[len(fnames_qc):]:   # hide unused panels in the last row\n    ax.axis(\"off\")\n\nplt.suptitle(\"Step 2 — Artifact Segmentation Overlays\", fontsize=13, fontweight=\"bold\")\nplt.tight_layout()\nplt.show()"
   },
   {
    "cell_type": "markdown",
@@ -1594,7 +1563,7 @@
   },
   {
    "cell_type": "code",
-   "source": "from matplotlib.lines import Line2D\n\n# Artifact classes to outline (skip clean tissue = 1 and background = 7)\nCONTOUR_CLASSES = [2, 3, 4, 5, 6]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\nfig, axes = plt.subplots(1, n, figsize=(4.5 * n, 4.8))\nif n == 1:\n    axes = [axes]\n\nfor ax, barcode in zip(axes, slides):\n    local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n    ref_fname  = ref_mask_by_barcode.get(barcode)\n    if not os.path.exists(local_path) or not ref_fname:\n        ax.set_title(f\"{barcode}\\n(mask missing)\", fontsize=7)\n        ax.axis(\"off\")\n        continue\n\n    local = np.array(Image.open(local_path))\n    # Both masks are produced at MPP 1.5, but the exact pixel dimensions can\n    # differ slightly; resize the reference onto the local grid with nearest-\n    # neighbour interpolation (preserves the integer class labels) so the two\n    # sets of contours share one coordinate system.\n    ref_img = Image.open(os.path.join(REF_MASKS_DIR, ref_fname))\n    if ref_img.size != (local.shape[1], local.shape[0]):\n        ref_img = ref_img.resize((local.shape[1], local.shape[0]), Image.NEAREST)\n    ref = np.array(ref_img)\n\n    # Light-grey tissue silhouette for spatial context (white = background)\n    ax.imshow(np.where(local != 7, 0.9, 1.0), cmap=\"gray\", vmin=0, vmax=1)\n\n    for cls in CONTOUR_CLASSES:\n        color = mpl_colors[cls - 1]\n        if np.any(ref == cls):\n            ax.contour((ref == cls).astype(float), levels=[0.5], colors=[color],\n                       linewidths=1.1, linestyles=\"solid\", origin=\"upper\")\n        if np.any(local == cls):\n            ax.contour((local == cls).astype(float), levels=[0.5], colors=[color],\n                       linewidths=1.1, linestyles=\"dashed\", origin=\"upper\")\n    ax.set_title(barcode, fontsize=7)\n    ax.axis(\"off\")\n\n# Legend: colour encodes the artifact class, line style encodes the source\nclass_handles = [Line2D([0], [0], color=mpl_colors[c - 1], lw=2,\n                        label=CLASS_NAMES_SHORT[c - 1]) for c in CONTOUR_CLASSES]\nstyle_handles = [\n    Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"solid\",  label=\"Reference (Zenodo)\"),\n    Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"dashed\", label=\"Local (DICOM)\"),\n]\nfig.legend(handles=class_handles + style_handles, loc=\"lower center\",\n           ncol=len(class_handles) + len(style_handles), fontsize=7,\n           framealpha=0.9, bbox_to_anchor=(0.5, -0.02))\nplt.suptitle(\"Artifact region boundaries — reference (solid) vs. local DICOM (dashed)\",\n             fontweight=\"bold\", fontsize=10)\nplt.tight_layout(rect=[0, 0.04, 1, 0.97])\nplt.show()",
+   "source": "from matplotlib.lines import Line2D\n\n# Artifact classes to outline (skip clean tissue = 1 and background = 7)\nCONTOUR_CLASSES = [2, 3, 4, 5, 6]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if n > 1 else 1\nnrows = (n + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(7 * ncols, 6.5 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, barcode in zip(axes, slides):\n    local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n    ref_fname  = ref_mask_by_barcode.get(barcode)\n    if not os.path.exists(local_path) or not ref_fname:\n        ax.set_title(f\"{barcode}\\n(mask missing)\", fontsize=9)\n        ax.axis(\"off\")\n        continue\n\n    local = np.array(Image.open(local_path))\n    # Both masks are produced at MPP 1.5, but the exact pixel dimensions can\n    # differ slightly; resize the reference onto the local grid with nearest-\n    # neighbour interpolation (preserves the integer class labels) so the two\n    # sets of contours share one coordinate system.\n    ref_img = Image.open(os.path.join(REF_MASKS_DIR, ref_fname))\n    if ref_img.size != (local.shape[1], local.shape[0]):\n        ref_img = ref_img.resize((local.shape[1], local.shape[0]), Image.NEAREST)\n    ref = np.array(ref_img)\n\n    # Light-grey tissue silhouette for spatial context (white = background)\n    ax.imshow(np.where(local != 7, 0.9, 1.0), cmap=\"gray\", vmin=0, vmax=1)\n\n    for cls in CONTOUR_CLASSES:\n        color = mpl_colors[cls - 1]\n        if np.any(ref == cls):\n            ax.contour((ref == cls).astype(float), levels=[0.5], colors=[color],\n                       linewidths=1.1, linestyles=\"solid\", origin=\"upper\")\n        if np.any(local == cls):\n            ax.contour((local == cls).astype(float), levels=[0.5], colors=[color],\n                       linewidths=1.1, linestyles=\"dashed\", origin=\"upper\")\n    ax.set_title(barcode, fontsize=9)\n    ax.axis(\"off\")\n\nfor ax in axes[n:]:   # hide unused panels in the last row\n    ax.axis(\"off\")\n\n# Legend: colour encodes the artifact class, line style encodes the source\nclass_handles = [Line2D([0], [0], color=mpl_colors[c - 1], lw=2,\n                        label=CLASS_NAMES_SHORT[c - 1]) for c in CONTOUR_CLASSES]\nstyle_handles = [\n    Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"solid\",  label=\"Reference (Zenodo)\"),\n    Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"dashed\", label=\"Local (DICOM)\"),\n]\nfig.legend(handles=class_handles + style_handles, loc=\"lower center\",\n           ncol=len(class_handles) + len(style_handles), fontsize=8,\n           framealpha=0.9, bbox_to_anchor=(0.5, 0.0))\nplt.suptitle(\"Artifact region boundaries — reference (solid) vs. local DICOM (dashed)\",\n             fontweight=\"bold\", fontsize=12)\nplt.tight_layout(rect=[0, 0.03, 1, 0.97])\nplt.show()",
    "metadata": {},
    "execution_count": null,
    "outputs": []

From 9c96d99c532258d646618f82f0eab345ea4801b7 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Tue, 16 Jun 2026 14:32:20 -0400
Subject: [PATCH 8/9] Replace contour overlay with side-by-side mask comparison

The overlaid solid/dashed contours were hard to read. Instead show the
reference (Zenodo) and local DICOM artifact maps side by side, one slide
per row, each colourised with the same GrandQC palette and a shared
class legend. No resizing needed since the panels are not overlaid.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
index 829ccaa..b9dd4f7 100644
--- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -1558,12 +1558,12 @@
   },
   {
    "cell_type": "markdown",
-   "source": "### 6.4 Compare Artifact Region Boundaries\n\nPer-class fractions (6.3) tell us *how much* of each artifact the two runs agree on, but not *where*. Here we overlay the **boundaries** of each artifact region for both runs on the same grid: a **solid** outline for the reference (Zenodo) mask and a **dashed** outline for our local DICOM run, drawn in the matching artifact colour. Where the two outlines hug each other the pipelines localise the artifact identically; visible gaps reveal boundary-level differences that aggregate fractions can hide.\n\nThe reference and local masks are both produced at MPP 1.5, but their pixel dimensions can differ slightly, so the reference is resized onto the local grid (nearest-neighbour) before overlaying. Clean tissue (class 1) and background (class 7) are omitted — only the five artifact classes are outlined.",
+   "source": "### 6.4 Compare Reference and Local Results Side by Side\n\nPer-class fractions (6.3) summarise agreement as numbers. Here we place the two artifact maps next to each other — the reference (Zenodo) result on the left and our local DICOM result on the right, one slide per row — so differences in *where* each artifact class is detected are easy to spot.\n\nBoth masks are colourised with the same GrandQC palette: clean tissue (grey) plus the five artifact classes; background (class 7) is shown white. The reference and local masks can differ slightly in pixel dimensions, but since the panels are shown side by side rather than overlaid, no resizing is needed.",
    "metadata": {}
   },
   {
    "cell_type": "code",
-   "source": "from matplotlib.lines import Line2D\n\n# Artifact classes to outline (skip clean tissue = 1 and background = 7)\nCONTOUR_CLASSES = [2, 3, 4, 5, 6]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if n > 1 else 1\nnrows = (n + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(7 * ncols, 6.5 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, barcode in zip(axes, slides):\n    local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n    ref_fname  = ref_mask_by_barcode.get(barcode)\n    if not os.path.exists(local_path) or not ref_fname:\n        ax.set_title(f\"{barcode}\\n(mask missing)\", fontsize=9)\n        ax.axis(\"off\")\n        continue\n\n    local = np.array(Image.open(local_path))\n    # Both masks are produced at MPP 1.5, but the exact pixel dimensions can\n    # differ slightly; resize the reference onto the local grid with nearest-\n    # neighbour interpolation (preserves the integer class labels) so the two\n    # sets of contours share one coordinate system.\n    ref_img = Image.open(os.path.join(REF_MASKS_DIR, ref_fname))\n    if ref_img.size != (local.shape[1], local.shape[0]):\n        ref_img = ref_img.resize((local.shape[1], local.shape[0]), Image.NEAREST)\n    ref = np.array(ref_img)\n\n    # Light-grey tissue silhouette for spatial context (white = background)\n    ax.imshow(np.where(local != 7, 0.9, 1.0), cmap=\"gray\", vmin=0, vmax=1)\n\n    for cls in CONTOUR_CLASSES:\n        color = mpl_colors[cls - 1]\n        if np.any(ref == cls):\n            ax.contour((ref == cls).astype(float), levels=[0.5], colors=[color],\n                       linewidths=1.1, linestyles=\"solid\", origin=\"upper\")\n        if np.any(local == cls):\n            ax.contour((local == cls).astype(float), levels=[0.5], colors=[color],\n                       linewidths=1.1, linestyles=\"dashed\", origin=\"upper\")\n    ax.set_title(barcode, fontsize=9)\n    ax.axis(\"off\")\n\nfor ax in axes[n:]:   # hide unused panels in the last row\n    ax.axis(\"off\")\n\n# Legend: colour encodes the artifact class, line style encodes the source\nclass_handles = [Line2D([0], [0], color=mpl_colors[c - 1], lw=2,\n                        label=CLASS_NAMES_SHORT[c - 1]) for c in CONTOUR_CLASSES]\nstyle_handles = [\n    Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"solid\",  label=\"Reference (Zenodo)\"),\n    Line2D([0], [0], color=\"black\", lw=1.4, linestyle=\"dashed\", label=\"Local (DICOM)\"),\n]\nfig.legend(handles=class_handles + style_handles, loc=\"lower center\",\n           ncol=len(class_handles) + len(style_handles), fontsize=8,\n           framealpha=0.9, bbox_to_anchor=(0.5, 0.0))\nplt.suptitle(\"Artifact region boundaries — reference (solid) vs. local DICOM (dashed)\",\n             fontweight=\"bold\", fontsize=12)\nplt.tight_layout(rect=[0, 0.03, 1, 0.97])\nplt.show()",
+   "source": "from matplotlib.patches import Patch\n\n# Colour lookup for GrandQC class labels: 1-6 use the artifact palette, while\n# 0 (padding) and 7 (background) are shown white.\npalette = np.ones((8, 3))\nfor cls in range(1, 7):\n    palette[cls] = mpl_colors[cls - 1]\n\ndef colorize(mask):\n    return palette[np.clip(mask, 0, 7)]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\nfig, axes = plt.subplots(n, 2, figsize=(13, 5.0 * n))\naxes = np.atleast_2d(axes)\n\nfor row, barcode in enumerate(slides):\n    local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n    ref_fname  = ref_mask_by_barcode.get(barcode)\n    ref_path   = os.path.join(REF_MASKS_DIR, ref_fname) if ref_fname else None\n\n    ax_ref, ax_local = axes[row, 0], axes[row, 1]\n    ax_ref.axis(\"off\")\n    ax_local.axis(\"off\")\n\n    if ref_path and os.path.exists(ref_path):\n        ax_ref.imshow(colorize(np.array(Image.open(ref_path))))\n    else:\n        ax_ref.text(0.5, 0.5, \"reference mask missing\", ha=\"center\", va=\"center\")\n    if os.path.exists(local_path):\n        ax_local.imshow(colorize(np.array(Image.open(local_path))))\n    else:\n        ax_local.text(0.5, 0.5, \"local mask missing\", ha=\"center\", va=\"center\")\n\n    ax_ref.set_title(f\"{barcode}\\nReference (Zenodo)\", fontsize=10)\n    ax_local.set_title(\"Local (DICOM)\", fontsize=10)\n\n# Shared legend of artifact classes\nlegend_handles = [Patch(facecolor=mpl_colors[i], edgecolor=\"black\",\n                        label=CLASS_NAMES_SHORT[i]) for i in range(len(CLASS_NAMES_SHORT))]\nfig.legend(handles=legend_handles, loc=\"lower center\", ncol=len(legend_handles),\n           fontsize=9, framealpha=0.9, bbox_to_anchor=(0.5, 0.0))\nplt.suptitle(\"Artifact maps — reference vs. local DICOM run\", fontweight=\"bold\", fontsize=13)\nplt.tight_layout(rect=[0, 0.03, 1, 0.97])\nplt.show()",
    "metadata": {},
    "execution_count": null,
    "outputs": []

From 6a54113e12477906641b4369d9b55b663f0684e6 Mon Sep 17 00:00:00 2001
From: Andrey Fedorov <andrey.fedorov@gmail.com>
Date: Tue, 16 Jun 2026 14:33:59 -0400
Subject: [PATCH 9/9] Note T4 high-RAM runtime requirement in the intro

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
index b9dd4f7..03c0bc6 100644
--- a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -10,7 +10,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n\nAutomated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n\n[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n\n> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n\n[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n\nThis notebook demonstrates how to:\n1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n2. **Set up** GrandQC and download its pre-trained models\n3. **Download** slides from IDC and run GrandQC directly on DICOM files\n4. **Visualize** tissue detection and artifact segmentation results\n5. **Validate** the DICOM-based pipeline against GrandQC's pre-computed QC masks for the TCGA cohorts"
+   "source": "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n\nAutomated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n\n[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n\n> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n\n[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n\nThis notebook demonstrates how to:\n1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n2. **Set up** GrandQC and download its pre-trained models\n3. **Download** slides from IDC and run GrandQC directly on DICOM files\n4. **Visualize** tissue detection and artifact segmentation results\n5. **Validate** the DICOM-based pipeline against GrandQC's pre-computed QC masks for the TCGA cohorts\n\n> ⚠️ **Runtime requirement**: Run this notebook on a **T4 high-RAM** runtime. In Colab, select **Runtime → Change runtime type → T4 GPU** and enable the **High-RAM** option. The artifact-segmentation step needs the GPU for reasonable speed (~30–45 s/slide vs. 10–30 min/slide on CPU), and the high-RAM shape avoids out-of-memory errors when loading the larger whole-slide images."
   },
   {
    "cell_type": "markdown",

	collection_id	patients	series	size_GB	license_short_name
0	ccdi_mci	4407	4576	4548.0	CC BY 4.0
1	tcga_brca	1098	3111	1678.8	CC BY 3.0
2	gtex	971	25503	8354.2	CC BY 4.0
3	pdxnet	919	919	122.7	CC BY 4.0
4	tcga_gbm	607	2053	639.9	CC BY 3.0
5	tcga_ov	590	1481	476.5	CC BY 3.0
6	tcga_ucec	560	1371	1078.5	CC BY 3.0
7	tcga_kirc	537	2173	811.7	CC BY 3.0
8	tcga_hnsc	523	1263	560.5	CC BY 3.0
9	tcga_luad	522	1608	631.1	CC BY 3.0
10	tcga_lgg	516	1572	1166.6	CC BY 3.0
11	tcga_thca	507	1158	747.1	CC BY 3.0
12	tcga_lusc	504	1612	610.4	CC BY 3.0
13	tcga_prad	500	1172	596.4	CC BY 3.0
14	tcga_skcm	470	950	721.5	CC BY 3.0
	SeriesInstanceUID	PatientID	ContainerIdentifier	tissue_type	width_px	height_px	pixel_spacing_mm	objective_power	size_MB
0	1.3.6.1.4.1.5962.99.1.1247079754.248369703.163...	TCGA-A7-A26J	TCGA-A7-A26J-01B-02-BS2	Neoplasm, Primary	8618	7243	0.00025	40	9.1
1	1.3.6.1.4.1.5962.99.1.1279025007.575797294.163...	TCGA-AC-A2QJ	TCGA-AC-A2QJ-11A-02-TS2	Normal	33993	6923	0.00050	20	15.5
2	1.3.6.1.4.1.5962.99.1.1306193840.1223239355.16...	TCGA-AC-A2FB	TCGA-AC-A2FB-11A-01-TSA	Normal	31993	7690	0.00050	20	16.5
3	1.3.6.1.4.1.5962.99.1.1263739509.328647659.163...	TCGA-BH-A5IZ	TCGA-BH-A5IZ-11A-01-TS1	Normal	20041	15048	0.00050	20	18.7
4	1.3.6.1.4.1.5962.99.1.1290706954.176399971.163...	TCGA-AC-A2QJ	TCGA-AC-A2QJ-11A-01-TS1	Normal	41991	9316	0.00050	20	22.8
5	1.3.6.1.4.1.5962.99.1.1315471301.865934712.163...	TCGA-E9-A1RF	TCGA-E9-A1RF-11A-03-TSC	Normal	27888	9784	0.00050	20	23.0
6	1.3.6.1.4.1.5962.99.1.1322835661.1509329658.16...	TCGA-GM-A2D9	TCGA-GM-A2D9-11A-02-TS2	Normal	35992	8945	0.00050	20	25.0
7	1.3.6.1.4.1.5962.99.1.1255105493.445850269.163...	TCGA-E9-A1RF	TCGA-E9-A1RF-11A-01-TSA	Normal	23904	14929	0.00050	20	26.9
8	1.3.6.1.4.1.5962.99.1.1247939226.656219549.163...	TCGA-OL-A5RW	TCGA-OL-A5RW-01Z-00-DX1	Neoplasm, Primary	12879	13618	0.00050	<NA>	27.1
9	1.3.6.1.4.1.5962.99.1.1244105102.1853648157.16...	TCGA-E9-A1NF	TCGA-E9-A1NF-11A-05-TSE	Normal	13943	17330	0.00049	20	27.3