diff --git a/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md
new file mode 100644
index 0000000..51f585c
--- /dev/null
+++ b/notebooks/pathomics/grandqc_slide_quality_resolution_notes.md
@@ -0,0 +1,178 @@
+# GrandQC + IDC: Notes on Inference Resolution
+
+Supplementary notes for [`grandqc_slide_quality_with_idc.ipynb`](grandqc_slide_quality_with_idc.ipynb).
+
+This document explains, for the GrandQC inference scripts in
+`grandqc/01_WSI_inference_OPENSLIDE_QC/`, **what resolution the model runs at,
+where that is controlled, and how the per-slide microns-per-pixel (MPP) value is
+determined** when slides are read from NCI Imaging Data Commons (IDC) as DICOM.
+
+---
+
+## Two distinct resolutions
+
+There are two separate resolution concepts in the pipeline. They are easy to
+conflate but are set in different places.
+
+| Resolution concept | Value (this tutorial) | Where it is set |
+|---|---|---|
+| **Model working resolution** (MPP the network "sees") | `1.5` µm/px (≈ 7× objective) | `--mpp_model` CLI flag → `MPP_MODEL` in `main.py` |
+| **Tissue-detection resolution** (Step 1) | `10` µm/px | hardcoded in `wsi_tis_detect.py` |
+| **Model input tile size** | `512 × 512` px | `M_P_S_MODEL = 512`, hardcoded in `main.py` |
+| **Native read size per tile** | `mpp_model / mpp × 512` px (level 0) | computed in `wsi_slide_info.py:slide_info()` |
+
+The model **never** runs at the slide's native 20×/40× magnification. The tissue
+detector runs coarsest (MPP 10); the artifact model runs at MPP 1.5.
+
+---
+
+## 1. Model working resolution — `--mpp_model`
+
+The artifact-segmentation step (notebook Part 4.2) is invoked as:
+
+```python
+subprocess.run([sys.executable, "main.py",
+ "--slide_folder", slide_dir_abs,
+ "--output_dir", output_dir_abs,
+ "--mpp_model", "1.5",
+ "--create_geojson", "Y"], cwd=scripts_abs)
+```
+
+`--mpp_model` sets `MPP_MODEL` in `main.py` and does two things:
+
+1. **Selects the model weights** — only three values are valid:
+
+ | `--mpp_model` | weights file |
+ |---|---|
+ | `1.0` | `GrandQC_MPP1.pth` |
+ | `1.5` | `GrandQC_MPP15.pth` |
+ | `2.0` | `GrandQC_MPP2.pth` |
+
+ Any other value raises `Exception("mpp of the model can only be 1.0, 1.5, 2.0")`.
+
+2. **Sets the magnification slides are read at** (see §3).
+
+> **To change the inference resolution** you must set `--mpp_model` to `1.0` or
+> `2.0` **and** download the matching weights. The notebook's model-download cell
+> (Part 2.2) only fetches `GrandQC_MPP15.pth`, so switching MPP requires adding the
+> corresponding Zenodo download.
+
+---
+
+## 2. Model input tile size — fixed at 512×512
+
+Hardcoded in `main.py`; not a CLI flag:
+
+```python
+M_P_S_MODEL = 512 # the network always receives 512×512 px tiles
+```
+
+Every tile is downsampled to 512×512 before inference (§3). Changing this would
+require retraining and should not be altered.
+
+---
+
+## 3. How the two combine
+
+`wsi_slide_info.py:slide_info()` converts the model MPP into a patch size measured
+in the slide's **native level-0 pixels**:
+
+```python
+mpp = float(slide.properties["openslide.mpp-x"]) # slide's native MPP
+p_s = int(mpp_model / mpp * m_p_s) # patch size in level-0 px
+```
+
+Then `wsi_process.py:slide_process_single()` reads that region at full resolution
+and downsamples it to the fixed model input size:
+
+```python
+work_patch = slide.read_region((w, h), 0, (p_s, p_s)) # read p_s × p_s from level 0
+work_patch = work_patch.resize((m_p_s, m_p_s), Image.Resampling.LANCZOS) # → 512×512
+predictions = model.predict(x_tensor) # inference at 512×512 (= MPP 1.5)
+```
+
+**Worked example** — a 40× TCGA-BRCA slide with native MPP ≈ `0.2485`, `mpp_model = 1.5`:
+
+```
+p_s = int(1.5 / 0.2485 * 512) = int(3091.0) = 3091 px # read from level 0
+ → downsampled to 512×512 → fed to the network
+```
+
+The downsampling makes the effective resolution seen by the network MPP 1.5,
+regardless of whether the slide was scanned at 20× or 40×.
+
+---
+
+## 4. How MPP is determined
+
+MPP is **read from OpenSlide**, not computed. The only place it is obtained is the
+single line in `wsi_slide_info.py:slide_info()`:
+
+```python
+mpp = float(slide.properties["openslide.mpp-x"])
+```
+
+Key points:
+
+- It is **read, not derived** from objective power. Only the X axis is used
+ (square pixels are assumed; `mpp-y` is ignored).
+- **No fallback for MPP.** Note the asymmetry in the same function — objective
+ power has a fallback, MPP does not:
+
+ ```python
+ try:
+ obj_power = slide.properties["openslide.objective-power"]
+ except:
+ obj_power = 99 # objective power: falls back to 99
+
+ mpp = float(slide.properties["openslide.mpp-x"]) # MPP: no fallback
+ ```
+
+ If `openslide.mpp-x` is missing/`None`, `float(...)` raises. With the
+ fail-loud error handling in the `idc-dicom-fixes` fork branch, this is caught
+ and reported per slide (with a non-zero exit) rather than silently mis-scaling.
+- `openslide.mpp-x` is the **single point of failure** for scaling any new slide:
+ if it is wrong or absent, `p_s` is sized wrong and the model sees the slide at
+ the wrong scale.
+
+### For IDC DICOM slides specifically
+
+OpenSlide's DICOM driver (added in OpenSlide 4.0.0) populates `openslide.mpp-x`
+from the DICOM `PixelSpacing` attribute (mm → µm).
+
+**Verified empirically** on IDC slide `TCGA-A7-A26J-01B-02-BS2`
+(the 9.1 MB series used in the notebook), downloaded with `idc-index` and opened
+with OpenSlide 4.0.1:
+
+```
+openslide.vendor = dicom
+openslide.mpp-x = 0.24850000000000003
+openslide.mpp-y = 0.24850000000000003
+openslide.objective-power = 40
+```
+
+The raw DICOM source attribute is exposed on the same slide:
+
+```
+dicom.SharedFunctionalGroupsSequence[0].PixelMeasuresSequence[0].PixelSpacing[0] = .0002485 # mm
+```
+
+and `0.0002485 mm × 1000 = 0.2485 µm/px`, which matches `openslide.mpp-x` exactly.
+This also matches the `pixel_spacing_mm = 0.00025` shown in the notebook's
+metadata query (the 2-significant-figure rounded form).
+
+The notebook's Part 3.3 verification cell already surfaces this value with
+`slide.properties.get("openslide.mpp-x", "N/A")`, so you can inspect MPP per slide
+before running Step 2.
+
+---
+
+## Quick reference: changing resolution safely
+
+1. Pick a supported MPP: `1.0`, `1.5`, or `2.0`.
+2. Download the matching weights from Zenodo into `models/qc/`
+ (`GrandQC_MPP1.pth` / `GrandQC_MPP15.pth` / `GrandQC_MPP2.pth`).
+3. Pass it via `--mpp_model` in the Step 2 `subprocess.run` call.
+4. Leave `M_P_S_MODEL = 512` unchanged.
+5. Confirm each slide reports a valid `openslide.mpp-x` (Part 3.3) before running —
+ it has no fallback.
diff --git a/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
new file mode 100644
index 0000000..03c0bc6
--- /dev/null
+++ b/notebooks/pathomics/grandqc_slide_quality_with_idc.ipynb
@@ -0,0 +1,1598 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "# Slide Quality Control with GrandQC and NCI Imaging Data Commons\n\nAutomated quality control (QC) is an essential preprocessing step in computational pathology pipelines. Artifacts such as tissue folds, out-of-focus regions, pen markings, and air bubbles can introduce noise that degrades the performance of downstream AI models.\n\n[**GrandQC**](https://github.com/cpath-ukk/grandqc) is an open-source tool for automated tissue detection and multi-class artifact segmentation in whole slide images (WSIs). It was validated across slides from 19 international pathology departments and published in:\n\n> Weng Z., Seper A., Pryalukhin A., et al. *GrandQC: A comprehensive solution to quality control problem in digital pathology.* Nature Communications 15, 10685 (2024). https://doi.org/10.1038/s41467-024-54769-y\n\n[**NCI Imaging Data Commons (IDC)**](https://portal.imaging.datacommons.cancer.gov/) hosts ~100 TB of cancer imaging data, including thousands of H&E-stained whole slide images from TCGA, CPTAC, HTAN, and other programs — all stored as DICOM and freely accessible without authentication.\n\nThis notebook demonstrates how to:\n1. **Discover** H&E-stained whole slide images in IDC using `idc-index`\n2. **Set up** GrandQC and download its pre-trained models\n3. **Download** slides from IDC and run GrandQC directly on DICOM files\n4. **Visualize** tissue detection and artifact segmentation results\n5. **Validate** the DICOM-based pipeline against GrandQC's pre-computed QC masks for the TCGA cohorts\n\n> ⚠️ **Runtime requirement**: Run this notebook on a **T4 high-RAM** runtime. In Colab, select **Runtime → Change runtime type → T4 GPU** and enable the **High-RAM** option. The artifact-segmentation step needs the GPU for reasonable speed (~30–45 s/slide vs. 10–30 min/slide on CPU), and the high-RAM shape avoids out-of-memory errors when loading the larger whole-slide images."
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Disclaimer\n",
+ "\n",
+ "The code and data of this repository are provided to promote reproducible research. They are not intended for clinical care or commercial use.\n",
+ "\n",
+ "The software is provided \"as is\", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.\n",
+ "\n",
+ "**GrandQC license**: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0). Non-commercial research use only; citation of the GrandQC paper is required."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Part 1: Discover H&E Slides in IDC\n",
+ "\n",
+ "IDC provides the `idc-index` Python package for querying slide metadata and downloading DICOM files. We start by installing it and exploring the available whole slide image collections."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%capture\n",
+ "!pip install --upgrade idc-index openslide-python openslide-bin"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "IDC data version: v24\n"
+ ]
+ }
+ ],
+ "source": [
+ "import os\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "import matplotlib.patches as mpatches\n",
+ "import numpy as np\n",
+ "from IPython.display import IFrame, display\n",
+ "from idc_index import IDCClient\n",
+ "\n",
+ "idc_client = IDCClient()\n",
+ "print(f\"IDC data version: {idc_client.get_idc_version()}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1.1 Overview of Slide Microscopy Collections\n",
+ "\n",
+ "IDC stores whole slide images using the DICOM Slide Microscopy (SM) standard. The `sm_index` table extends the main metadata index with pathology-specific attributes including staining protocol, tissue type, pixel spacing, and image dimensions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Total SM collections: 73\n",
+ "Total SM series: 76,299\n",
+ "Total size: 47871 GB\n",
+ "\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "
\n",
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " collection_id | \n",
+ " patients | \n",
+ " series | \n",
+ " size_GB | \n",
+ " license_short_name | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " ccdi_mci | \n",
+ " 4407 | \n",
+ " 4576 | \n",
+ " 4548.0 | \n",
+ " CC BY 4.0 | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " tcga_brca | \n",
+ " 1098 | \n",
+ " 3111 | \n",
+ " 1678.8 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " gtex | \n",
+ " 971 | \n",
+ " 25503 | \n",
+ " 8354.2 | \n",
+ " CC BY 4.0 | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " pdxnet | \n",
+ " 919 | \n",
+ " 919 | \n",
+ " 122.7 | \n",
+ " CC BY 4.0 | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " tcga_gbm | \n",
+ " 607 | \n",
+ " 2053 | \n",
+ " 639.9 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 5 | \n",
+ " tcga_ov | \n",
+ " 590 | \n",
+ " 1481 | \n",
+ " 476.5 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 6 | \n",
+ " tcga_ucec | \n",
+ " 560 | \n",
+ " 1371 | \n",
+ " 1078.5 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 7 | \n",
+ " tcga_kirc | \n",
+ " 537 | \n",
+ " 2173 | \n",
+ " 811.7 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 8 | \n",
+ " tcga_hnsc | \n",
+ " 523 | \n",
+ " 1263 | \n",
+ " 560.5 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 9 | \n",
+ " tcga_luad | \n",
+ " 522 | \n",
+ " 1608 | \n",
+ " 631.1 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 10 | \n",
+ " tcga_lgg | \n",
+ " 516 | \n",
+ " 1572 | \n",
+ " 1166.6 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 11 | \n",
+ " tcga_thca | \n",
+ " 507 | \n",
+ " 1158 | \n",
+ " 747.1 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 12 | \n",
+ " tcga_lusc | \n",
+ " 504 | \n",
+ " 1612 | \n",
+ " 610.4 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 13 | \n",
+ " tcga_prad | \n",
+ " 500 | \n",
+ " 1172 | \n",
+ " 596.4 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ " | 14 | \n",
+ " tcga_skcm | \n",
+ " 470 | \n",
+ " 950 | \n",
+ " 721.5 | \n",
+ " CC BY 3.0 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ " "
+ ],
+ "text/plain": [
+ " collection_id patients series size_GB license_short_name\n",
+ "0 ccdi_mci 4407 4576 4548.0 CC BY 4.0\n",
+ "1 tcga_brca 1098 3111 1678.8 CC BY 3.0\n",
+ "2 gtex 971 25503 8354.2 CC BY 4.0\n",
+ "3 pdxnet 919 919 122.7 CC BY 4.0\n",
+ "4 tcga_gbm 607 2053 639.9 CC BY 3.0\n",
+ "5 tcga_ov 590 1481 476.5 CC BY 3.0\n",
+ "6 tcga_ucec 560 1371 1078.5 CC BY 3.0\n",
+ "7 tcga_kirc 537 2173 811.7 CC BY 3.0\n",
+ "8 tcga_hnsc 523 1263 560.5 CC BY 3.0\n",
+ "9 tcga_luad 522 1608 631.1 CC BY 3.0\n",
+ "10 tcga_lgg 516 1572 1166.6 CC BY 3.0\n",
+ "11 tcga_thca 507 1158 747.1 CC BY 3.0\n",
+ "12 tcga_lusc 504 1612 610.4 CC BY 3.0\n",
+ "13 tcga_prad 500 1172 596.4 CC BY 3.0\n",
+ "14 tcga_skcm 470 950 721.5 CC BY 3.0"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Load the slide microscopy index\n",
+ "idc_client.fetch_index(\"sm_index\")\n",
+ "\n",
+ "# Summary of all SM collections\n",
+ "sm_collections = idc_client.sql_query(\"\"\"\n",
+ " SELECT\n",
+ " i.collection_id,\n",
+ " COUNT(DISTINCT i.PatientID) AS patients,\n",
+ " COUNT(DISTINCT i.SeriesInstanceUID) AS series,\n",
+ " ROUND(SUM(i.series_size_MB) / 1024.0, 1) AS size_GB,\n",
+ " i.license_short_name\n",
+ " FROM index i\n",
+ " WHERE i.Modality = 'SM'\n",
+ " GROUP BY i.collection_id, i.license_short_name\n",
+ " ORDER BY patients DESC\n",
+ "\"\"\")\n",
+ "\n",
+ "print(f\"Total SM collections: {len(sm_collections)}\")\n",
+ "print(f\"Total SM series: {sm_collections['series'].sum():,}\")\n",
+ "print(f\"Total size: {sm_collections['size_GB'].sum():.0f} GB\")\n",
+ "print()\n",
+ "sm_collections.head(15)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 1.2 Filter for H&E-Stained Slides\n",
+ "\n",
+ "The `sm_index` table includes a `staining_usingSubstance_CodeMeaning` column (stored as an array) that records the staining protocol for each slide. We filter for slides stained with hematoxylin, which identifies H&E preparations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Collections with H&E slides: 70\n",
+ "Total H&E series: 73,422\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " collection_id | \n",
+ " patients | \n",
+ " he_series | \n",
+ " size_GB | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " gtex | \n",
+ " 971 | \n",
+ " 25503 | \n",
+ " 8354.2 | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " ccdi_mci | \n",
+ " 4401 | \n",
+ " 4569 | \n",
+ " 4545.8 | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " tcga_brca | \n",
+ " 1098 | \n",
+ " 3111 | \n",
+ " 1678.8 | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " tcga_kirc | \n",
+ " 537 | \n",
+ " 2173 | \n",
+ " 811.7 | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " tcga_gbm | \n",
+ " 607 | \n",
+ " 2053 | \n",
+ " 639.9 | \n",
+ "
\n",
+ " \n",
+ " | 5 | \n",
+ " tcga_lusc | \n",
+ " 504 | \n",
+ " 1612 | \n",
+ " 610.4 | \n",
+ "
\n",
+ " \n",
+ " | 6 | \n",
+ " tcga_luad | \n",
+ " 522 | \n",
+ " 1608 | \n",
+ " 631.1 | \n",
+ "
\n",
+ " \n",
+ " | 7 | \n",
+ " tcga_lgg | \n",
+ " 516 | \n",
+ " 1572 | \n",
+ " 1166.6 | \n",
+ "
\n",
+ " \n",
+ " | 8 | \n",
+ " tcga_ov | \n",
+ " 590 | \n",
+ " 1481 | \n",
+ " 476.5 | \n",
+ "
\n",
+ " \n",
+ " | 9 | \n",
+ " tcga_coad | \n",
+ " 460 | \n",
+ " 1442 | \n",
+ " 446.5 | \n",
+ "
\n",
+ " \n",
+ " | 10 | \n",
+ " tcga_ucec | \n",
+ " 560 | \n",
+ " 1371 | \n",
+ " 1078.5 | \n",
+ "
\n",
+ " \n",
+ " | 11 | \n",
+ " tcga_hnsc | \n",
+ " 523 | \n",
+ " 1263 | \n",
+ " 560.5 | \n",
+ "
\n",
+ " \n",
+ " | 12 | \n",
+ " nlst | \n",
+ " 452 | \n",
+ " 1259 | \n",
+ " 847.9 | \n",
+ "
\n",
+ " \n",
+ " | 13 | \n",
+ " tcga_stad | \n",
+ " 443 | \n",
+ " 1197 | \n",
+ " 670.4 | \n",
+ "
\n",
+ " \n",
+ " | 14 | \n",
+ " tcga_prad | \n",
+ " 500 | \n",
+ " 1172 | \n",
+ " 596.4 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ " "
+ ],
+ "text/plain": [
+ " collection_id patients he_series size_GB\n",
+ "0 gtex 971 25503 8354.2\n",
+ "1 ccdi_mci 4401 4569 4545.8\n",
+ "2 tcga_brca 1098 3111 1678.8\n",
+ "3 tcga_kirc 537 2173 811.7\n",
+ "4 tcga_gbm 607 2053 639.9\n",
+ "5 tcga_lusc 504 1612 610.4\n",
+ "6 tcga_luad 522 1608 631.1\n",
+ "7 tcga_lgg 516 1572 1166.6\n",
+ "8 tcga_ov 590 1481 476.5\n",
+ "9 tcga_coad 460 1442 446.5\n",
+ "10 tcga_ucec 560 1371 1078.5\n",
+ "11 tcga_hnsc 523 1263 560.5\n",
+ "12 nlst 452 1259 847.9\n",
+ "13 tcga_stad 443 1197 670.4\n",
+ "14 tcga_prad 500 1172 596.4"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Count H&E slides per collection\n",
+ "he_by_collection = idc_client.sql_query(\"\"\"\n",
+ " SELECT\n",
+ " i.collection_id,\n",
+ " COUNT(DISTINCT i.PatientID) AS patients,\n",
+ " COUNT(DISTINCT i.SeriesInstanceUID) AS he_series,\n",
+ " ROUND(SUM(i.series_size_MB) / 1024.0, 1) AS size_GB\n",
+ " FROM index i\n",
+ " JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID\n",
+ " WHERE array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'\n",
+ " GROUP BY i.collection_id\n",
+ " ORDER BY he_series DESC\n",
+ "\"\"\")\n",
+ "\n",
+ "print(f\"Collections with H&E slides: {len(he_by_collection)}\")\n",
+ "print(f\"Total H&E series: {he_by_collection['he_series'].sum():,}\")\n",
+ "he_by_collection.head(15)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "### 1.3 Select Slides for Quality Control\n\nFor this tutorial we work with slides from the [TCGA-BRCA](https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=tcga_brca) collection (The Cancer Genome Atlas — Breast Cancer). This collection is a good choice because:\n\n- It contains over 3,000 H&E-stained slides\n- GrandQC provides **pre-computed QC masks for the TCGA cohorts**, which we use to validate our DICOM-based run in Part 6\n\nThose reference masks cover only the **diagnostic FFPE slides** (TCGA `DX` barcodes), not the frozen-section slides (`TS`/`BS`), so we deliberately select diagnostic slides here. We query the `sm_index` to retrieve per-slide metadata including the `ContainerIdentifier` (the original TCGA slide barcode), which we use to match IDC series against the pre-computed GrandQC masks."
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "TCGA-BRCA H&E slides: 3111\n",
+ "Tissue types: {'Neoplasm, Primary': 2704, 'Normal': 399}\n",
+ "\n",
+ "Size range: 9.1 – 3586.3 MB\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " \n",
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " SeriesInstanceUID | \n",
+ " PatientID | \n",
+ " ContainerIdentifier | \n",
+ " tissue_type | \n",
+ " width_px | \n",
+ " height_px | \n",
+ " pixel_spacing_mm | \n",
+ " objective_power | \n",
+ " size_MB | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | 0 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1247079754.248369703.163... | \n",
+ " TCGA-A7-A26J | \n",
+ " TCGA-A7-A26J-01B-02-BS2 | \n",
+ " Neoplasm, Primary | \n",
+ " 8618 | \n",
+ " 7243 | \n",
+ " 0.00025 | \n",
+ " 40 | \n",
+ " 9.1 | \n",
+ "
\n",
+ " \n",
+ " | 1 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1279025007.575797294.163... | \n",
+ " TCGA-AC-A2QJ | \n",
+ " TCGA-AC-A2QJ-11A-02-TS2 | \n",
+ " Normal | \n",
+ " 33993 | \n",
+ " 6923 | \n",
+ " 0.00050 | \n",
+ " 20 | \n",
+ " 15.5 | \n",
+ "
\n",
+ " \n",
+ " | 2 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1306193840.1223239355.16... | \n",
+ " TCGA-AC-A2FB | \n",
+ " TCGA-AC-A2FB-11A-01-TSA | \n",
+ " Normal | \n",
+ " 31993 | \n",
+ " 7690 | \n",
+ " 0.00050 | \n",
+ " 20 | \n",
+ " 16.5 | \n",
+ "
\n",
+ " \n",
+ " | 3 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1263739509.328647659.163... | \n",
+ " TCGA-BH-A5IZ | \n",
+ " TCGA-BH-A5IZ-11A-01-TS1 | \n",
+ " Normal | \n",
+ " 20041 | \n",
+ " 15048 | \n",
+ " 0.00050 | \n",
+ " 20 | \n",
+ " 18.7 | \n",
+ "
\n",
+ " \n",
+ " | 4 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1290706954.176399971.163... | \n",
+ " TCGA-AC-A2QJ | \n",
+ " TCGA-AC-A2QJ-11A-01-TS1 | \n",
+ " Normal | \n",
+ " 41991 | \n",
+ " 9316 | \n",
+ " 0.00050 | \n",
+ " 20 | \n",
+ " 22.8 | \n",
+ "
\n",
+ " \n",
+ " | 5 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1315471301.865934712.163... | \n",
+ " TCGA-E9-A1RF | \n",
+ " TCGA-E9-A1RF-11A-03-TSC | \n",
+ " Normal | \n",
+ " 27888 | \n",
+ " 9784 | \n",
+ " 0.00050 | \n",
+ " 20 | \n",
+ " 23.0 | \n",
+ "
\n",
+ " \n",
+ " | 6 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1322835661.1509329658.16... | \n",
+ " TCGA-GM-A2D9 | \n",
+ " TCGA-GM-A2D9-11A-02-TS2 | \n",
+ " Normal | \n",
+ " 35992 | \n",
+ " 8945 | \n",
+ " 0.00050 | \n",
+ " 20 | \n",
+ " 25.0 | \n",
+ "
\n",
+ " \n",
+ " | 7 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1255105493.445850269.163... | \n",
+ " TCGA-E9-A1RF | \n",
+ " TCGA-E9-A1RF-11A-01-TSA | \n",
+ " Normal | \n",
+ " 23904 | \n",
+ " 14929 | \n",
+ " 0.00050 | \n",
+ " 20 | \n",
+ " 26.9 | \n",
+ "
\n",
+ " \n",
+ " | 8 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1247939226.656219549.163... | \n",
+ " TCGA-OL-A5RW | \n",
+ " TCGA-OL-A5RW-01Z-00-DX1 | \n",
+ " Neoplasm, Primary | \n",
+ " 12879 | \n",
+ " 13618 | \n",
+ " 0.00050 | \n",
+ " <NA> | \n",
+ " 27.1 | \n",
+ "
\n",
+ " \n",
+ " | 9 | \n",
+ " 1.3.6.1.4.1.5962.99.1.1244105102.1853648157.16... | \n",
+ " TCGA-E9-A1NF | \n",
+ " TCGA-E9-A1NF-11A-05-TSE | \n",
+ " Normal | \n",
+ " 13943 | \n",
+ " 17330 | \n",
+ " 0.00049 | \n",
+ " 20 | \n",
+ " 27.3 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
\n",
+ "
\n",
+ " "
+ ],
+ "text/plain": [
+ " SeriesInstanceUID PatientID \\\n",
+ "0 1.3.6.1.4.1.5962.99.1.1247079754.248369703.163... TCGA-A7-A26J \n",
+ "1 1.3.6.1.4.1.5962.99.1.1279025007.575797294.163... TCGA-AC-A2QJ \n",
+ "2 1.3.6.1.4.1.5962.99.1.1306193840.1223239355.16... TCGA-AC-A2FB \n",
+ "3 1.3.6.1.4.1.5962.99.1.1263739509.328647659.163... TCGA-BH-A5IZ \n",
+ "4 1.3.6.1.4.1.5962.99.1.1290706954.176399971.163... TCGA-AC-A2QJ \n",
+ "5 1.3.6.1.4.1.5962.99.1.1315471301.865934712.163... TCGA-E9-A1RF \n",
+ "6 1.3.6.1.4.1.5962.99.1.1322835661.1509329658.16... TCGA-GM-A2D9 \n",
+ "7 1.3.6.1.4.1.5962.99.1.1255105493.445850269.163... TCGA-E9-A1RF \n",
+ "8 1.3.6.1.4.1.5962.99.1.1247939226.656219549.163... TCGA-OL-A5RW \n",
+ "9 1.3.6.1.4.1.5962.99.1.1244105102.1853648157.16... TCGA-E9-A1NF \n",
+ "\n",
+ " ContainerIdentifier tissue_type width_px height_px \\\n",
+ "0 TCGA-A7-A26J-01B-02-BS2 Neoplasm, Primary 8618 7243 \n",
+ "1 TCGA-AC-A2QJ-11A-02-TS2 Normal 33993 6923 \n",
+ "2 TCGA-AC-A2FB-11A-01-TSA Normal 31993 7690 \n",
+ "3 TCGA-BH-A5IZ-11A-01-TS1 Normal 20041 15048 \n",
+ "4 TCGA-AC-A2QJ-11A-01-TS1 Normal 41991 9316 \n",
+ "5 TCGA-E9-A1RF-11A-03-TSC Normal 27888 9784 \n",
+ "6 TCGA-GM-A2D9-11A-02-TS2 Normal 35992 8945 \n",
+ "7 TCGA-E9-A1RF-11A-01-TSA Normal 23904 14929 \n",
+ "8 TCGA-OL-A5RW-01Z-00-DX1 Neoplasm, Primary 12879 13618 \n",
+ "9 TCGA-E9-A1NF-11A-05-TSE Normal 13943 17330 \n",
+ "\n",
+ " pixel_spacing_mm objective_power size_MB \n",
+ "0 0.00025 40 9.1 \n",
+ "1 0.00050 20 15.5 \n",
+ "2 0.00050 20 16.5 \n",
+ "3 0.00050 20 18.7 \n",
+ "4 0.00050 20 22.8 \n",
+ "5 0.00050 20 23.0 \n",
+ "6 0.00050 20 25.0 \n",
+ "7 0.00050 20 26.9 \n",
+ "8 0.00050 27.1 \n",
+ "9 0.00049 20 27.3 "
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Query TCGA-BRCA H&E slides with key metadata\n",
+ "tcga_brca_he = idc_client.sql_query(\"\"\"\n",
+ " SELECT\n",
+ " i.SeriesInstanceUID,\n",
+ " i.PatientID,\n",
+ " s.ContainerIdentifier,\n",
+ " s.primaryAnatomicStructureModifier_CodeMeaning AS tissue_type,\n",
+ " s.max_TotalPixelMatrixColumns AS width_px,\n",
+ " s.max_TotalPixelMatrixRows AS height_px,\n",
+ " s.min_PixelSpacing_2sf AS pixel_spacing_mm,\n",
+ " s.ObjectiveLensPower AS objective_power,\n",
+ " ROUND(i.series_size_MB, 1) AS size_MB\n",
+ " FROM index i\n",
+ " JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID\n",
+ " WHERE i.collection_id = 'tcga_brca'\n",
+ " AND array_to_string(s.staining_usingSubstance_CodeMeaning, ', ') LIKE '%hematoxylin%'\n",
+ " ORDER BY i.series_size_MB ASC\n",
+ "\"\"\")\n",
+ "\n",
+ "print(f\"TCGA-BRCA H&E slides: {len(tcga_brca_he)}\")\n",
+ "print(f\"Tissue types: {tcga_brca_he['tissue_type'].value_counts().to_dict()}\")\n",
+ "print(f\"\\nSize range: {tcga_brca_he['size_MB'].min()} – {tcga_brca_he['size_MB'].max()} MB\")\n",
+ "tcga_brca_he.head(10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "### 1.4 Select a Representative Set of Slides\n\nFor the hands-on GrandQC run we use a small set of **diagnostic (DX) H&E slides** that:\n- are confirmed to have pre-computed GrandQC reference masks in the Zenodo `BRCA.tar` archive (so Part 6 can validate against them),\n- span both magnifications GrandQC encounters in TCGA-BRCA (20× and 40×), and\n- are small enough (≤ ~150 MB) for quick download and processing in a notebook.\n\nAll TCGA-BRCA diagnostic slides are primary-tumor sections, so the variety here comes from magnification rather than tissue type. The five barcodes below were verified against the Zenodo archive."
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "# Diagnostic (DX) H&E slides verified to have pre-computed GrandQC reference\n# masks in the TCGA-BRCA archive on Zenodo (record 14041578). Every DX slide is\n# primary tumor, so the variety here is in objective power (40x and 20x).\nDEMO_BARCODES = [\n \"TCGA-AC-A23G-01Z-00-DX1\", # 40x, 44 MB\n \"TCGA-AC-A23C-01Z-00-DX1\", # 40x, 57 MB\n \"TCGA-AC-A62V-01Z-00-DX1\", # 40x, 60 MB\n \"TCGA-MS-A51U-01Z-00-DX1\", # 20x, 76 MB\n \"TCGA-A8-A0AB-01Z-00-DX1\", # 20x, 151 MB\n]\n\navailable = set(tcga_brca_he[\"ContainerIdentifier\"])\npresent = [b for b in DEMO_BARCODES if b in available]\nmissing = [b for b in DEMO_BARCODES if b not in available]\nif missing:\n print(f\"WARNING: {len(missing)} barcode(s) not found in this IDC version: {missing}\")\n\ndemo_slides = (\n tcga_brca_he[tcga_brca_he[\"ContainerIdentifier\"].isin(present)]\n .drop_duplicates(subset=\"ContainerIdentifier\")\n .set_index(\"ContainerIdentifier\")\n .loc[present] # preserve our chosen order\n .reset_index()\n)\n\nprint(f\"Selected {len(demo_slides)} slides for GrandQC processing:\")\ndisplay(\n demo_slides[[\n \"PatientID\", \"ContainerIdentifier\", \"tissue_type\",\n \"objective_power\", \"width_px\", \"height_px\", \"size_MB\"\n ]]\n)"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "TCGA-A7-A26J-01B-02-BS2 (Neoplasm, Primary, 9.1 MB)\n",
+ " https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.313270449313106073714723178854675013710/series/1.3.6.1.4.1.5962.99.1.1247079754.248369703.1637629652298.2.0\n",
+ "TCGA-AC-A2QJ-11A-02-TS2 (Normal, 15.5 MB)\n",
+ " https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.337996839190358918582393097516172189888/series/1.3.6.1.4.1.5962.99.1.1279025007.575797294.1637661597551.2.0\n",
+ "TCGA-E2-A15K-06A-01-TS1 (None, 85.7 MB)\n",
+ " https://viewer.imaging.datacommons.cancer.gov/slim/studies/2.25.54614374037792363265218330907954861388/series/1.3.6.1.4.1.5962.99.1.1231257225.1365336880.1637613829769.2.0\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Preview slides in the IDC SLIM viewer (pathology viewer)\n",
+ "for _, row in demo_slides.iterrows():\n",
+ " url = idc_client.get_viewer_URL(seriesInstanceUID=row[\"SeriesInstanceUID\"])\n",
+ " print(f\"{row['ContainerIdentifier']} ({row['tissue_type']}, {row['size_MB']} MB)\")\n",
+ " print(f\" {url}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Part 2: GrandQC Setup\n\nBefore running GrandQC we need to:\n1. Clone the repository from GitHub — we use the [`fedorov/grandqc`](https://github.com/fedorov/grandqc/tree/idc-dicom-fixes) fork, whose `idc-dicom-fixes` branch lets the inference scripts run directly on IDC DICOM series\n2. Install its Python dependencies\n3. Download the pre-trained models from Zenodo\n\n> **Note**: The `idc-dicom-fixes` branch carries a handful of small, self-contained fixes on top of upstream `cpath-ukk/grandqc` — DICOM-directory input, compute-device auto-detection, a PyTorch ≥ 2.6 model-load fix, a small-slide overlay fix, and fail-loud error handling. None of them change model weights or algorithm logic (see Part 2.3 for the full list). Working from a fork keeps the notebook clean instead of monkey-patching the scripts at runtime."
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2.1 Clone and Install Dependencies"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "%%capture\n# Clone the IDC-patched GrandQC fork (branch idc-dicom-fixes, shallow clone).\n# This branch applies small, self-contained fixes on top of upstream\n# cpath-ukk/grandqc so the inference scripts run directly on IDC DICOM series.\n# See the verification cell in Part 2.3 for the list of fixes.\n!git clone --depth 1 --branch idc-dicom-fixes https://github.com/fedorov/grandqc.git\n\n# Install GrandQC Python dependencies.\n# PyTorch is already present in Colab; we install the remaining packages.\n# segmentation-models-pytorch 0.3.1 is pinned because its model.predict() API\n# was removed in 0.4.0 and GrandQC relies on it.\n!pip install -q \\\n \"segmentation-models-pytorch==0.3.1\" \\\n \"rasterio>=1.3\" \\\n \"imagecodecs\" \\\n \"zarr<3\" \\\n \"tifffile\""
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2.2 Download Pre-trained Models"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "grandqc/01_WSI_infe 100%[===================>] 25.41M 1.06MB/s in 26s \n",
+ "grandqc/01_WSI_infe 100%[===================>] 24.21M 1.08MB/s in 25s \n",
+ "Models downloaded:\n",
+ " grandqc/01_WSI_inference_OPENSLIDE_QC/models/td/Tissue_Detection_MPP10.pth (25.4 MB)\n",
+ " grandqc/01_WSI_inference_OPENSLIDE_QC/models/qc/GrandQC_MPP15.pth (24.2 MB)\n"
+ ]
+ }
+ ],
+ "source": [
+ "import os\n",
+ "\n",
+ "GRANDQC_SCRIPTS = \"grandqc/01_WSI_inference_OPENSLIDE_QC\"\n",
+ "MODELS_TD = os.path.join(GRANDQC_SCRIPTS, \"models\", \"td\")\n",
+ "MODELS_QC = os.path.join(GRANDQC_SCRIPTS, \"models\", \"qc\")\n",
+ "os.makedirs(MODELS_TD, exist_ok=True)\n",
+ "os.makedirs(MODELS_QC, exist_ok=True)\n",
+ "\n",
+ "# Tissue detection model (~27 MB) — Zenodo record 14507273\n",
+ "!wget -q --show-progress -O {MODELS_TD}/Tissue_Detection_MPP10.pth \\\n",
+ " \"https://zenodo.org/records/14507273/files/Tissue_Detection_MPP10.pth?download=1\"\n",
+ "\n",
+ "# Artifact segmentation model at MPP=1.5 (~25 MB) — Zenodo record 14041538\n",
+ "# MPP=1.5 (≈7× objective) is the recommended default for most scanners.\n",
+ "!wget -q --show-progress -O {MODELS_QC}/GrandQC_MPP15.pth \\\n",
+ " \"https://zenodo.org/records/14041538/files/GrandQC_MPP15.pth?download=1\"\n",
+ "\n",
+ "print(\"Models downloaded:\")\n",
+ "for f in [f\"{MODELS_TD}/Tissue_Detection_MPP10.pth\", f\"{MODELS_QC}/GrandQC_MPP15.pth\"]:\n",
+ " size_mb = os.path.getsize(f) / 1024**2\n",
+ " print(f\" {f} ({size_mb:.1f} MB)\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "### 2.3 Fixes Included in the Fork\n\nThe upstream GrandQC inference scripts assume conventional single-file WSIs on an NVIDIA GPU. Running them directly on IDC DICOM series needs a few small changes, already applied in the [`idc-dicom-fixes`](https://github.com/fedorov/grandqc/tree/idc-dicom-fixes) branch cloned above. **No model weights or algorithm logic are changed.**\n\n| Fix | Scripts | Why it is needed |\n|-----|---------|------------------|\n| Accept DICOM **series directories** as slides | both | IDC stores each slide as a directory of `.dcm` files; upstream only accepted single files, so the slide list came out empty and nothing was processed |\n| Auto-detect compute device (CUDA / MPS / CPU) | both | upstream hardcodes `'cuda'`, which crashes on CPU/Apple Silicon |\n| `torch.load(..., weights_only=False)` for the QC model | `main.py` | PyTorch ≥ 2.6 changed the default and refuses to unpickle the model object |\n| Crop the tissue-detection mask to the thumbnail size | `wsi_tis_detect.py` | small slides produce a sub-512 px MPP-10 thumbnail, which crashed `cv2.addWeighted` in the overlay step (left `tis_det_overlay/` empty) |\n| Surface per-slide failures with a non-zero exit | both | upstream swallowed errors and exited `0`, so empty output looked like a successful run |\n\nThe cell below confirms these fixes are present in the cloned scripts."
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "from pathlib import Path\n\n# Confirm the IDC fixes shipped in the cloned fork branch are present.\ntis = Path(f\"{GRANDQC_SCRIPTS}/wsi_tis_detect.py\").read_text()\nmn = Path(f\"{GRANDQC_SCRIPTS}/main.py\").read_text()\n\nchecks = {\n \"DICOM-directory input (wsi_tis_detect.py)\": \"_iter_slides\" in tis,\n \"DICOM-directory input (main.py)\": \"_iter_slides\" in mn,\n \"auto device select (wsi_tis_detect.py)\": \"cuda.is_available\" in tis,\n \"auto device select (main.py)\": \"cuda.is_available\" in mn,\n \"weights_only=False (main.py)\": \"weights_only=False\" in mn,\n \"small-slide overlay crop (wsi_tis_detect.py)\":\"end_image[-height:, -width:]\" in tis,\n \"fail-loud exit (wsi_tis_detect.py)\": \"sys.exit(1)\" in tis,\n \"fail-loud exit (main.py)\": \"sys.exit(1)\" in mn,\n}\nfor name, ok in checks.items():\n print(f\" [{'OK ' if ok else 'MISSING'}] {name}\")\n\nassert all(checks.values()), (\n \"Expected GrandQC fixes are not all present — check that the clone used \"\n \"branch 'idc-dicom-fixes' of github.com/fedorov/grandqc.\"\n)\nprint(\"\\nAll expected fixes present — scripts are ready to run on IDC DICOM slides.\")"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Part 3: Download Slides from IDC\n",
+ "\n",
+ "`idc_client.download_from_selection()` fetches DICOM files for the requested series and places each series in its own subdirectory named by `SeriesInstanceUID`.\n",
+ "\n",
+ "After downloading we rename each subdirectory to its `ContainerIdentifier` (the original TCGA slide barcode). This ensures that GrandQC's output files and the pre-computed TCGA mask filenames are consistently keyed by the human-readable barcode rather than an opaque UID."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3.1 Download DICOM Series"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "SLIDE_DIR = \"slides\"\n",
+ "os.makedirs(SLIDE_DIR, exist_ok=True)\n",
+ "\n",
+ "print(f\"Downloading {len(demo_slides)} slides → {SLIDE_DIR}/\")\n",
+ "print(\"(This may take a few minutes depending on file sizes and network speed)\\n\")\n",
+ "\n",
+ "idc_client.download_from_selection(\n",
+ " downloadDir=SLIDE_DIR,\n",
+ " seriesInstanceUID=demo_slides[\"SeriesInstanceUID\"].tolist(),\n",
+ " dirTemplate=\"%SeriesInstanceUID\",\n",
+ ")\n",
+ "print(\"\\nDownload complete.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "### 3.2 Rename Series Directories to TCGA Barcodes\n\nGrandQC uses the directory/file name as the key for all output files (tissue mask, QC map, overlay, TSV report row). By renaming each downloaded series directory from its `SeriesInstanceUID` to its `ContainerIdentifier` (TCGA barcode), we get:\n\n- Human-readable output filenames (`TCGA-AC-A23G-01Z-00-DX1_mask.png` instead of a UID)\n- Straightforward matching with GrandQC's pre-computed TCGA masks (Part 6), which embed the same barcode in their filename"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "import shutil\n\n# Build SeriesInstanceUID → ContainerIdentifier map\nuid_to_barcode = dict(zip(demo_slides[\"SeriesInstanceUID\"], demo_slides[\"ContainerIdentifier\"]))\n\n# Rename each downloaded UID directory to its barcode. This is written to be\n# safe to re-run, and to tolerate the download cell being re-run (which would\n# re-create the UID directory next to an already-renamed barcode directory).\nfor uid, barcode in uid_to_barcode.items():\n src = os.path.join(SLIDE_DIR, uid) # freshly downloaded, named by UID\n dst = os.path.join(SLIDE_DIR, barcode) # renamed target, named by barcode\n\n if os.path.isdir(src) and os.path.isdir(dst):\n # Both present → the download cell was re-run after a previous rename.\n # The UID directory is a redundant copy of the same (immutable) series,\n # so drop it and keep the barcode directory downstream code expects.\n shutil.rmtree(src)\n print(f\" Removed duplicate UID dir; kept {barcode}\")\n elif os.path.isdir(src):\n os.rename(src, dst)\n print(f\" {uid} → {barcode}\")\n elif os.path.isdir(dst):\n print(f\" Already renamed: {barcode}\")\n else:\n print(f\" WARNING: directory not found: {uid}\")\n\nprint(\"\\nSlide directories after renaming:\")\nfor d in sorted(os.listdir(SLIDE_DIR)):\n dpath = os.path.join(SLIDE_DIR, d)\n if os.path.isdir(dpath):\n n = sum(1 for f in os.listdir(dpath) if f.lower().endswith(\".dcm\"))\n print(f\" {d}/ ({n} .dcm files)\")"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3.3 Verify OpenSlide Can Read the DICOM Files\n",
+ "\n",
+ "OpenSlide 4.0 (bundled in `openslide-bin`) reads DICOM WSI files natively. Opening any one `.dcm` file from a series directory is sufficient — OpenSlide discovers the remaining pyramid levels from the other files in the same directory that share the same `SeriesInstanceUID`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import openslide\n",
+ "\n",
+ "print(f\"OpenSlide version: {openslide.__library_version__}\\n\")\n",
+ "\n",
+ "for barcode in uid_to_barcode.values():\n",
+ " slide_dir = os.path.join(SLIDE_DIR, barcode)\n",
+ " dcm_files = sorted(f for f in os.listdir(slide_dir) if f.lower().endswith(\".dcm\"))\n",
+ " if not dcm_files:\n",
+ " print(f\" {barcode}: no .dcm files found\")\n",
+ " continue\n",
+ " entry = os.path.join(slide_dir, dcm_files[0])\n",
+ " try:\n",
+ " slide = openslide.OpenSlide(entry)\n",
+ " w, h = slide.level_dimensions[0]\n",
+ " mpp = slide.properties.get(\"openslide.mpp-x\", \"N/A\")\n",
+ " vendor = slide.properties.get(\"openslide.vendor\", \"N/A\")\n",
+ " print(f\" {barcode}\")\n",
+ " print(f\" {w:,} × {h:,} px | MPP={mpp} | vendor={vendor} | levels={slide.level_count}\")\n",
+ " slide.close()\n",
+ " except Exception as exc:\n",
+ " print(f\" {barcode}: ERROR — {exc}\")\n",
+ "\n",
+ "print(\"\\nAll slides verified.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Part 4: Run GrandQC Quality Control\n\nGrandQC runs as two sequential scripts, both inside `grandqc/01_WSI_inference_OPENSLIDE_QC/`:\n\n| Step | Script | MPP | Time (GPU) | Output |\n|------|--------|-----|------------|--------|\n| 1 | `wsi_tis_detect.py` | 10 | ~0.4 s/slide | Binary tissue masks |\n| 2 | `main.py` | 1.5 | ~30–45 s/slide | 7-class artifact maps, GeoJSON, TSV report |\n\nThe 7 artifact classes are: **1** clean tissue · **2** folds · **3** dark spots · **4** pen marks · **5** air bubbles/edges · **6** out-of-focus · **7** background.\n\nBoth scripts load models via relative paths (`./models/…`) and must therefore be invoked with their directory as the working directory. We use `subprocess` with `cwd=GRANDQC_SCRIPTS` to achieve this while keeping output streaming to the cell.\n\n> **On inference resolution**: the artifact model runs at MPP 1.5 (≈7× objective), *not* at the slide's native 20×/40×, and the per-slide MPP is read from OpenSlide's `openslide.mpp-x` property (derived from the DICOM `PixelSpacing` for IDC slides). For how this is controlled, the patch-size math, and an empirical confirmation on an IDC slide, see [`grandqc_slide_quality_resolution_notes.md`](grandqc_slide_quality_resolution_notes.md)."
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 4.1 Step 1 — Tissue Detection\n",
+ "\n",
+ "Produces a low-resolution binary mask (tissue vs. background) for each slide. The mask is saved as a PNG and re-used by Step 2 to skip background patches and cut inference time."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import subprocess, sys\n",
+ "\n",
+ "OUTPUT_DIR = \"qc_output\"\n",
+ "os.makedirs(OUTPUT_DIR, exist_ok=True)\n",
+ "\n",
+ "slide_dir_abs = os.path.abspath(SLIDE_DIR)\n",
+ "output_dir_abs = os.path.abspath(OUTPUT_DIR)\n",
+ "scripts_abs = os.path.abspath(GRANDQC_SCRIPTS)\n",
+ "\n",
+ "print(f\"Slides: {slide_dir_abs}\")\n",
+ "print(f\"Output: {output_dir_abs}\")\n",
+ "print(f\"Scripts dir: {scripts_abs}\")\n",
+ "print()\n",
+ "print(\"Running wsi_tis_detect.py ...\")\n",
+ "\n",
+ "result = subprocess.run(\n",
+ " [sys.executable, \"wsi_tis_detect.py\",\n",
+ " \"--slide_folder\", slide_dir_abs,\n",
+ " \"--output_dir\", output_dir_abs],\n",
+ " cwd=scripts_abs,\n",
+ ")\n",
+ "\n",
+ "if result.returncode != 0:\n",
+ " print(f\"\\nERROR: wsi_tis_detect.py exited with code {result.returncode}\")\n",
+ "else:\n",
+ " mask_dir = os.path.join(output_dir_abs, \"tis_det_mask\")\n",
+ " masks = sorted(os.listdir(mask_dir))\n",
+ " print(f\"\\nTissue detection complete — {len(masks)} mask(s) written to {mask_dir}/\")\n",
+ " for m in masks:\n",
+ " print(f\" {m}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 4.2 Step 2 — Artifact Segmentation\n",
+ "\n",
+ "Runs the 7-class artifact model at MPP=1.5. For each slide it produces:\n",
+ "\n",
+ "- `maps_qc/_map_QC.png` — color-coded artifact map (same resolution as tissue mask)\n",
+ "- `overlays_qc/_overlay_QC.jpg` — artifact map overlaid on a downscaled slide thumbnail\n",
+ "- `mask_qc/_mask.png` — raw integer mask (class 1–7 per pixel)\n",
+ "- `geojson_qc/.geojson` — polygon annotations compatible with QuPath and SLIM\n",
+ "- `report_qc_output_0_N_stats_per_slide.txt` — TSV with per-slide dimensions and processing time\n",
+ "\n",
+ "> **GPU note**: On a T4 GPU (Colab free tier), expect ~30–45 s per slide. On CPU only, plan for 10–30 min per slide depending on size."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(\"Running main.py (artifact segmentation) ...\")\n",
+ "\n",
+ "result = subprocess.run(\n",
+ " [sys.executable, \"main.py\",\n",
+ " \"--slide_folder\", slide_dir_abs,\n",
+ " \"--output_dir\", output_dir_abs,\n",
+ " \"--mpp_model\", \"1.5\",\n",
+ " \"--create_geojson\", \"Y\"],\n",
+ " cwd=scripts_abs,\n",
+ ")\n",
+ "\n",
+ "if result.returncode != 0:\n",
+ " print(f\"\\nERROR: main.py exited with code {result.returncode}\")\n",
+ "else:\n",
+ " maps_dir = os.path.join(output_dir_abs, \"maps_qc\")\n",
+ " maps = sorted(os.listdir(maps_dir))\n",
+ " print(f\"\\nArtifact segmentation complete — {len(maps)} map(s) in {maps_dir}/\")\n",
+ " for m in maps:\n",
+ " print(f\" {m}\")\n",
+ "\n",
+ " # Locate the TSV report (name includes the slide count)\n",
+ " reports = [f for f in os.listdir(output_dir_abs) if f.endswith(\"_stats_per_slide.txt\")]\n",
+ " if reports:\n",
+ " import pandas as pd\n",
+ " report_path = os.path.join(output_dir_abs, reports[0])\n",
+ " df = pd.read_csv(report_path, sep=\"\\t\")\n",
+ " print(f\"\\nPer-slide report ({reports[0]}):\")\n",
+ " display(df[[\"slide_name\", \"mpp\", \"height\", \"width\", \"time\"]])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Part 5: Visualize and Analyze Results\n",
+ "\n",
+ "GrandQC writes several output types per slide. We look at three of them:\n",
+ "\n",
+ "1. **Tissue detection overlays** — confirms that tissue vs. background separation was correct\n",
+ "2. **Artifact segmentation maps** — color-coded composite of the 7-class artifact model overlaid on a slide thumbnail\n",
+ "3. **Raw integer masks** (`mask_qc/`) — pixel-level class labels (1–7) used to compute per-slide artifact fractions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 5.1 Tissue Detection Overlays\n",
+ "\n",
+ "Each overlay blends the original thumbnail (70 %) with a two-class color mask (30 %): blue = tissue, gray = background."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "import matplotlib.pyplot as plt\nimport matplotlib.image as mpimg\n\noverlay_dir = os.path.join(OUTPUT_DIR, \"tis_det_overlay\")\nfnames = sorted(os.listdir(overlay_dir))\n\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if len(fnames) > 1 else 1\nnrows = (len(fnames) + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(8 * ncols, 6 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, fname in zip(axes, fnames):\n ax.imshow(mpimg.imread(os.path.join(overlay_dir, fname)))\n ax.set_title(fname.replace(\"_OVERLAY.jpg\", \"\"), fontsize=9)\n ax.axis(\"off\")\nfor ax in axes[len(fnames):]: # hide unused panels in the last row\n ax.axis(\"off\")\n\nplt.suptitle(\"Step 1 — Tissue Detection\", fontsize=13, fontweight=\"bold\")\nplt.tight_layout()\nplt.show()"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 5.2 Artifact Segmentation Maps\n",
+ "\n",
+ "The QC overlay composites the artifact color map onto a downscaled thumbnail of the original slide. Colors match the `wsi_colors.py` palette used by GrandQC:\n",
+ "\n",
+ "| Class | Color | Artifact |\n",
+ "|-------|-------|---------|\n",
+ "| 1 | ■ Gray `#808080` | Clean tissue |\n",
+ "| 2 | ■ Tomato `#FF6347` | Folds |\n",
+ "| 3 | ■ Lime `#00FF00` | Dark spots |\n",
+ "| 4 | ■ Red `#FF0000` | Pen marks |\n",
+ "| 5 | ■ Magenta `#FF00FF` | Air bubbles / slide edges |\n",
+ "| 6 | ■ Indigo `#4B0082` | Out-of-focus |\n",
+ "| 7 | □ White `#FFFFFF` | Background |"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "overlay_qc_dir = os.path.join(OUTPUT_DIR, \"overlays_qc\")\nfnames_qc = sorted(os.listdir(overlay_qc_dir))\n\n# Lay panels out 2 per row so each image is large enough to inspect\nncols = 2 if len(fnames_qc) > 1 else 1\nnrows = (len(fnames_qc) + ncols - 1) // ncols\nfig, axes = plt.subplots(nrows, ncols, figsize=(8 * ncols, 6 * nrows))\naxes = np.array(axes).reshape(-1)\n\nfor ax, fname in zip(axes, fnames_qc):\n ax.imshow(mpimg.imread(os.path.join(overlay_qc_dir, fname)))\n ax.set_title(fname.replace(\"_overlay_QC.jpg\", \"\"), fontsize=9)\n ax.axis(\"off\")\nfor ax in axes[len(fnames_qc):]: # hide unused panels in the last row\n ax.axis(\"off\")\n\nplt.suptitle(\"Step 2 — Artifact Segmentation Overlays\", fontsize=13, fontweight=\"bold\")\nplt.tight_layout()\nplt.show()"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 5.3 Artifact Fractions per Slide\n",
+ "\n",
+ "We read the raw integer masks from `mask_qc/` (pixel values 1–7) and compute what fraction of the **tissue area** (excluding background, class 7) belongs to each artifact class. This is the standard metric reported in the GrandQC paper."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from PIL import Image\n",
+ "\n",
+ "# GrandQC wsi_colors.py palette — RGB, indices 0-5 correspond to classes 1-6\n",
+ "GRANDQC_COLORS_RGB = [\n",
+ " [128, 128, 128], # 1 Clean tissue\n",
+ " [255, 99, 71], # 2 Folds\n",
+ " [ 0, 255, 0], # 3 Dark spots\n",
+ " [255, 0, 0], # 4 Pen marks\n",
+ " [255, 0, 255], # 5 Bubbles/edges\n",
+ " [ 75, 0, 130], # 6 Out-of-focus\n",
+ "]\n",
+ "CLASS_NAMES_SHORT = [\n",
+ " \"Clean tissue\", \"Folds\", \"Dark spots\",\n",
+ " \"Pen marks\", \"Bubbles/edges\", \"Out-of-focus\",\n",
+ "]\n",
+ "# Normalize to [0,1] for matplotlib\n",
+ "mpl_colors = [[r/255, g/255, b/255] for r, g, b in GRANDQC_COLORS_RGB]\n",
+ "\n",
+ "mask_dir = os.path.join(OUTPUT_DIR, \"mask_qc\")\n",
+ "rows = []\n",
+ "for fname in sorted(os.listdir(mask_dir)):\n",
+ " barcode = fname.replace(\"_mask.png\", \"\")\n",
+ " mask = np.array(Image.open(os.path.join(mask_dir, fname)))\n",
+ " tissue_px = np.sum(mask != 7) # exclude background\n",
+ " if tissue_px == 0:\n",
+ " continue\n",
+ " row = {\"slide\": barcode}\n",
+ " for cls_idx, name in enumerate(CLASS_NAMES_SHORT, start=1):\n",
+ " row[name] = np.sum(mask == cls_idx) / tissue_px * 100\n",
+ " rows.append(row)\n",
+ "\n",
+ "stats_df = pd.DataFrame(rows).set_index(\"slide\")\n",
+ "print(\"Artifact class fractions (% of tissue area):\")\n",
+ "display(stats_df.round(1))\n",
+ "\n",
+ "# ── Stacked bar chart ────────────────────────────────────────────────────────\n",
+ "ax = stats_df[CLASS_NAMES_SHORT].plot(\n",
+ " kind=\"bar\",\n",
+ " stacked=True,\n",
+ " color=mpl_colors,\n",
+ " figsize=(max(6, 2.5 * len(stats_df)), 5),\n",
+ " edgecolor=\"none\",\n",
+ ")\n",
+ "ax.set_ylabel(\"% of tissue area\")\n",
+ "ax.set_xlabel(\"\")\n",
+ "ax.set_title(\"GrandQC Artifact Fractions per Slide\", fontweight=\"bold\")\n",
+ "ax.set_xticklabels(ax.get_xticklabels(), rotation=25, ha=\"right\", fontsize=8)\n",
+ "ax.legend(loc=\"upper right\", fontsize=8, framealpha=0.8)\n",
+ "ax.set_ylim(0, 110)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 5.4 View Original Slides in IDC SLIM Viewer\n",
+ "\n",
+ "The IDC SLIM viewer is a web-based pathology viewer for DICOM whole slide images. The links below open each slide at full resolution. The GeoJSON annotations produced by GrandQC (`geojson_qc/`) can be loaded into QuPath or other tools that accept GeoJSON overlays for interactive artifact review."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from IPython.display import IFrame, display\n",
+ "\n",
+ "print(\"IDC SLIM viewer links (open in browser for full-resolution viewing):\\n\")\n",
+ "for _, row in demo_slides.iterrows():\n",
+ " url = idc_client.get_viewer_URL(seriesInstanceUID=row[\"SeriesInstanceUID\"])\n",
+ " print(f\" {row['ContainerIdentifier']}\")\n",
+ " print(f\" {url}\\n\")\n",
+ "\n",
+ "# Embed the first slide inline (750 px height)\n",
+ "first_url = idc_client.get_viewer_URL(seriesInstanceUID=demo_slides.iloc[0][\"SeriesInstanceUID\"])\n",
+ "print(f\"Embedding: {demo_slides.iloc[0]['ContainerIdentifier']}\")\n",
+ "display(IFrame(first_url, width=\"100%\", height=750))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "## Part 6: Validating the Pipeline Against Pre-computed TCGA Masks\n\nIn Parts 3–5 we ran GrandQC locally on IDC DICOM slides using OpenSlide's native DICOM reader — a path the GrandQC authors did not test in their original publication. Before using the pipeline on new data, it is good practice to verify that our results are consistent with the authors' reference outputs.\n\nThe GrandQC authors published pre-computed QC masks for the TCGA cohorts on Zenodo:\n\n> **GrandQC pre-computed masks**: [zenodo.org/records/14041578](https://zenodo.org/records/14041578) \n> One tar archive per TCGA cohort (a few large cohorts are split into two parts), ~18 GB in total. The TCGA-BRCA archive (`BRCA.tar`) alone is ~2 GB and holds masks for 1,105 **diagnostic (DX) slides**.\n\nEach mask is a PNG named after the original slide file — `{barcode}.{UUID}.svs_mask.png` — with integer labels 1–7, the same classes `main.py` produces. Rather than download the full 2 GB cohort archive, we pull just the reference masks for our five demo slides (~700 KB, extracted from `BRCA.tar` and re-hosted) and compare artifact fractions against our local DICOM run."
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "### 6.1 Available Cohort Archives\n\nThe Zenodo record contains one tar file per TCGA cohort; a few large cohorts (LUAD, LUSC, THCA, TGCT) are split into two parts. The cell below lists them straight from the Zenodo API — no large download."
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "\n",
+ "# Fetch the Zenodo record metadata to get the authoritative file list\n",
+ "rec = requests.get(\"https://zenodo.org/api/records/14041578\").json()\n",
+ "files = rec[\"files\"]\n",
+ "\n",
+ "rows_z = []\n",
+ "for f in sorted(files, key=lambda x: x[\"key\"]):\n",
+ " rows_z.append({\n",
+ " \"archive\": f[\"key\"],\n",
+ " \"size_MB\": round(f[\"size\"] / 1024**2, 0),\n",
+ " })\n",
+ "\n",
+ "zenodo_df = pd.DataFrame(rows_z)\n",
+ "total_gb = zenodo_df[\"size_MB\"].sum() / 1024\n",
+ "print(f\"{len(zenodo_df)} archives | total: {total_gb:.1f} GB\\n\")\n",
+ "display(zenodo_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "### 6.2 Download the Reference Masks for Our Demo Slides\n\nWe download just the pre-computed masks for the five demo slides (~700 KB), extracted from `BRCA.tar` and re-hosted as a small folder. Each reference mask is named `{barcode}.{UUID}.svs_mask.png`, whereas our local masks in `qc_output/mask_qc/` are keyed by the bare `ContainerIdentifier` barcode — so we also build a lookup from barcode to the full reference filename.\n\n> To reproduce population-level statistics across the whole cohort, download the full `BRCA.tar` (~2 GB) from the Zenodo record above and point `REF_MASKS_DIR` at the extracted folder; the comparison loop below works unchanged on any number of masks."
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "import io, zipfile, requests\n\nREF_MASKS_DIR = \"brca_ref_masks\"\nos.makedirs(REF_MASKS_DIR, exist_ok=True)\n\n# Pre-computed GrandQC masks for our five demo slides, extracted from the\n# TCGA-BRCA Zenodo archive (record 14041578) and re-hosted as a small folder\n# (~700 KB) so we don't have to download the full ~2 GB BRCA.tar. The link is a\n# Dropbox shared folder; appending '&dl=1' returns the whole folder as a zip.\nREF_MASKS_URL = (\n \"https://www.dropbox.com/scl/fo/z5jtldr5ed384s8nntiwe/\"\n \"ALUgs22f_0PI8oGO6XJ5eG0?rlkey=b92k8rt6jvqllwkad31obvbyx&dl=1\"\n)\n\nresp = requests.get(REF_MASKS_URL)\nresp.raise_for_status()\nwith zipfile.ZipFile(io.BytesIO(resp.content)) as zf:\n for member in zf.namelist():\n if member.endswith(\"_mask.png\"):\n target = os.path.join(REF_MASKS_DIR, os.path.basename(member))\n with zf.open(member) as src, open(target, \"wb\") as dst:\n dst.write(src.read())\n\nref_masks = sorted(f for f in os.listdir(REF_MASKS_DIR) if f.endswith(\"_mask.png\"))\n\n# Reference masks are named '{barcode}.{UUID}.svs_mask.png'; our local masks are\n# keyed by the bare ContainerIdentifier barcode (no dots). Map barcode -> filename.\nref_mask_by_barcode = {f.split(\".\")[0]: f for f in ref_masks}\n\nprint(f\"Downloaded {len(ref_masks)} reference masks to {REF_MASKS_DIR}/\")\nfor f in ref_masks:\n print(f\" {f}\")"
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 6.3 Pipeline Validation: Compare Local vs. Reference Masks\n",
+ "\n",
+ "The reference masks in `brca_ref_masks/` were produced by the GrandQC authors running on TIFF/SVS files. Our local masks in `qc_output/mask_qc/` were produced by the same model running on IDC DICOM files via OpenSlide's native DICOM reader.\n",
+ "\n",
+ "We compare per-class artifact fractions for each demo slide. Close agreement (< 2 percentage points per class) confirms that the DICOM-based path does not introduce bias relative to the authors' original pipeline."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": "def mask_fractions(mask_path):\n mask = np.array(Image.open(mask_path))\n tissue_px = np.sum(mask != 7)\n if tissue_px == 0:\n return None\n return {name: np.sum(mask == cls) / tissue_px * 100\n for cls, name in enumerate(CLASS_NAMES_SHORT, start=1)}\n\n\n# Build comparison table for each demo slide\ncompare_rows = []\nfor barcode in uid_to_barcode.values():\n local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n ref_fname = ref_mask_by_barcode.get(barcode)\n ref_path = os.path.join(REF_MASKS_DIR, ref_fname) if ref_fname else None\n\n if not os.path.exists(local_path):\n print(f\" MISSING local mask: {barcode}\")\n continue\n if ref_path is None or not os.path.exists(ref_path):\n print(f\" MISSING reference mask: {barcode} — slide not in the reference set\")\n continue\n\n local_frac = mask_fractions(local_path)\n ref_frac = mask_fractions(ref_path)\n if local_frac is None or ref_frac is None:\n continue\n\n for name in CLASS_NAMES_SHORT:\n compare_rows.append({\n \"slide\": barcode,\n \"class\": name,\n \"local_%\": round(local_frac[name], 2),\n \"ref_%\": round(ref_frac[name], 2),\n \"delta_pp\": round(local_frac[name] - ref_frac[name], 2),\n })\n\ncompare_df = pd.DataFrame(compare_rows)\nprint(\"Per-class fraction comparison (local DICOM run vs. Zenodo reference):\\n\")\ndisplay(compare_df.pivot_table(index=\"slide\", columns=\"class\", values=\"delta_pp\").round(2))\nprint(\"\\nδ = local − reference (percentage points). Values near 0 confirm agreement.\")"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Scatter plot: local fraction vs. reference fraction for all classes × slides\n",
+ "fig, ax = plt.subplots(figsize=(5, 5))\n",
+ "\n",
+ "class_colors = {name: mpl_colors[i] for i, name in enumerate(CLASS_NAMES_SHORT)}\n",
+ "\n",
+ "for cls_name, grp in compare_df.groupby(\"class\"):\n",
+ " ax.scatter(grp[\"ref_%\"], grp[\"local_%\"],\n",
+ " label=cls_name, color=class_colors[cls_name],\n",
+ " s=60, edgecolors=\"black\", linewidths=0.4, zorder=3)\n",
+ "\n",
+ "# Identity line\n",
+ "lim_max = max(compare_df[[\"local_%\", \"ref_%\"]].max()) * 1.05\n",
+ "ax.plot([0, lim_max], [0, lim_max], \"k--\", linewidth=0.8, label=\"y = x\")\n",
+ "ax.set_xlabel(\"Reference fraction (%) — Zenodo\")\n",
+ "ax.set_ylabel(\"Local fraction (%) — DICOM run\")\n",
+ "ax.set_title(\"Pipeline Validation: Local vs. Reference Artifact Fractions\",\n",
+ " fontweight=\"bold\", fontsize=9)\n",
+ "ax.legend(fontsize=7, framealpha=0.8)\n",
+ "ax.set_xlim(0, lim_max)\n",
+ "ax.set_ylim(0, lim_max)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": "### 6.4 Compare Reference and Local Results Side by Side\n\nPer-class fractions (6.3) summarise agreement as numbers. Here we place the two artifact maps next to each other — the reference (Zenodo) result on the left and our local DICOM result on the right, one slide per row — so differences in *where* each artifact class is detected are easy to spot.\n\nBoth masks are colourised with the same GrandQC palette: clean tissue (grey) plus the five artifact classes; background (class 7) is shown white. The reference and local masks can differ slightly in pixel dimensions, but since the panels are shown side by side rather than overlaid, no resizing is needed.",
+ "metadata": {}
+ },
+ {
+ "cell_type": "code",
+ "source": "from matplotlib.patches import Patch\n\n# Colour lookup for GrandQC class labels: 1-6 use the artifact palette, while\n# 0 (padding) and 7 (background) are shown white.\npalette = np.ones((8, 3))\nfor cls in range(1, 7):\n palette[cls] = mpl_colors[cls - 1]\n\ndef colorize(mask):\n return palette[np.clip(mask, 0, 7)]\n\nslides = list(uid_to_barcode.values())\nn = len(slides)\nfig, axes = plt.subplots(n, 2, figsize=(13, 5.0 * n))\naxes = np.atleast_2d(axes)\n\nfor row, barcode in enumerate(slides):\n local_path = os.path.join(OUTPUT_DIR, \"mask_qc\", f\"{barcode}_mask.png\")\n ref_fname = ref_mask_by_barcode.get(barcode)\n ref_path = os.path.join(REF_MASKS_DIR, ref_fname) if ref_fname else None\n\n ax_ref, ax_local = axes[row, 0], axes[row, 1]\n ax_ref.axis(\"off\")\n ax_local.axis(\"off\")\n\n if ref_path and os.path.exists(ref_path):\n ax_ref.imshow(colorize(np.array(Image.open(ref_path))))\n else:\n ax_ref.text(0.5, 0.5, \"reference mask missing\", ha=\"center\", va=\"center\")\n if os.path.exists(local_path):\n ax_local.imshow(colorize(np.array(Image.open(local_path))))\n else:\n ax_local.text(0.5, 0.5, \"local mask missing\", ha=\"center\", va=\"center\")\n\n ax_ref.set_title(f\"{barcode}\\nReference (Zenodo)\", fontsize=10)\n ax_local.set_title(\"Local (DICOM)\", fontsize=10)\n\n# Shared legend of artifact classes\nlegend_handles = [Patch(facecolor=mpl_colors[i], edgecolor=\"black\",\n label=CLASS_NAMES_SHORT[i]) for i in range(len(CLASS_NAMES_SHORT))]\nfig.legend(handles=legend_handles, loc=\"lower center\", ncol=len(legend_handles),\n fontsize=9, framealpha=0.9, bbox_to_anchor=(0.5, 0.0))\nplt.suptitle(\"Artifact maps — reference vs. local DICOM run\", fontweight=\"bold\", fontsize=13)\nplt.tight_layout(rect=[0, 0.03, 1, 0.97])\nplt.show()",
+ "metadata": {},
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": "### 6.5 Scaling to Cohort-level QC\n\nBecause each reference mask is just a small PNG, the comparison above scales to an entire cohort. Downloading the full `BRCA.tar` (~2 GB) from the Zenodo record gives pre-computed masks for all 1,105 TCGA-BRCA diagnostic slides — enough to compute population-level artifact statistics (medians, percentiles, distributions) **without running any inference locally**. We keep this notebook lightweight by validating on the five demo slides only; the loop in section 6.3 generalizes directly to the full mask set, reading one mask at a time so memory stays constant."
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file