
data import script #203


Open · wants to merge 9 commits into master
60 changes: 56 additions & 4 deletions README.rst
@@ -2,13 +2,15 @@
WSI Superpixel Guided Labeling
==============================

WSI Superpixel Guided Labeling is a `Girder 3 <https://github.com/girder>`_ plugin designed to be used in conjunction with `HistomicsUI <https://github.com/DigitalSlideArchive/HistomicsUI>`_ and `HistomicsTK <https://github.com/DigitalSlideArchive/HistomicsTK>`_ to facilitate active learning on whole slide images.
WSI Superpixel Guided Labeling is a `Girder 3 <https://github.com/girder>`_ plugin for interactive development of image classifiers. It is designed to be used in conjunction with `HistomicsUI <https://github.com/DigitalSlideArchive/HistomicsUI>`_ and `HistomicsTK <https://github.com/DigitalSlideArchive/HistomicsTK>`_ and enables rapid development of classifiers with whole slide images using active learning.

This plugin leverages the output of certain HistomicsTK/SlicerCLI jobs to allow end users to label superpixel regions of whole slide images to be used as input for machine learning algorithms.
This plugin can be used to classify objects ranging from cell nuclei to high-power fields, and can operate on user-provided data or data from a built-in pipeline that parcellates a whole-slide image into superpixels (see ``dsarchive/superpixel:latest``).

An example algorithm is contained within the ``dsarchive/superpixel:latest`` docker image. This can be used to generate superpixels, features, and machine learning models for active learning on a directory of images. See the installation instructions below for how to include the image as part of your Digital Slide Archive deployment.
The `Installation`_ instructions below describe how to install the plugin for your existing `Digital Slide Archive deployment <https://github.com/DigitalSlideArchive/digital_slide_archive/tree/master/devops/dsa>`_.

Once the appropriate data is generated, a new view becomes available for labeling and retraining.
See the `Data Import`_ section for details on the data import script and data formatting.

.. contents:: Table of Contents:

Installation
------------
@@ -133,6 +135,56 @@ You can review trained or predicted superpixels via the review mode. This allows
.. image:: docs/screenshots/reviewmode.png
:alt: The review mode

Data Import
-----------
Users can provide their own data for use with the platform, allowing flexibility in the type of objects and in the methods of detection/segmentation and encoding. A command-line import script is provided to upload or import this data. The required file formats and script details are described below.

Data formats
~~~~~~~~~~~~~~~~~~~~~~~~
Each slide in the dataset requires four files:

whole-slide image (various formats)
Any format that is supported by `large image <https://girder.github.io/large_image/formats.html>`_ can be used.
feature (.h5)
This file contains a single array where each row is a feature embedding for the object. A single blank row should be prepended if the image contains non-object background pixels.
Collaborator review comment: instead of "the object" we could say "each superpixel"

Collaborator review comment: also for the rest of the paragraph

pixelmap (.tiff)
This image is used as a `pixelmap overlay <https://girder.github.io/large_image/annotations.html#tiled-pixelmap-overlays>`_ to define object locations for visualization and interactivity. Pixel values reflect the position of each object's embedding in the feature file. For an object whose embedding is in row 'i' of the feature array (zero-indexed), the pixels of that object should have the value 2i and its border pixels the value 2i+1. Non-object background pixels should be encoded with the value zero (a short sketch of this encoding follows this list).
An example pixelmap is below:

.. image:: docs/screenshots/pixelmap.png
:alt: Pixelmap example
bounding boxes (.csv)
Each row of this .csv defines the left, top, right, and bottom pixel for a single object. Objects should be listed in the same order as they appear in the feature.h5 file.
For the pixelmap example above, assuming (0,0) is the top left, the .csv file would contain the following line:

.. code-block:: csv

1,1,4,4
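
The value encoding can be sketched in a few lines of Python. The snippet below is an illustrative toy example only (it is not part of the plugin and assumes ``numpy`` is available): it builds a 6x6 pixelmap containing a single object whose embedding sits in row 1 of the feature array, with row 0 being the prepended blank background row.

.. code-block:: python

    import numpy as np

    i = 1                            # the object's row in the feature array
    pixelmap = np.zeros((6, 6), dtype=np.uint32)
    pixelmap[1:5, 1:5] = 2 * i + 1   # border pixels of the object get value 2i+1
    pixelmap[2:4, 2:4] = 2 * i       # interior pixels of the object get value 2i
    # Background pixels stay 0. With (0, 0) at the top left and inclusive
    # right/bottom coordinates, the matching bounding box CSV line is "1,1,4,4".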


Command-line Import Tool
~~~~~~~~~~~~~~~~~~~~~~~~
``data_import.py`` is provided to import or upload user-generated data into the plugin.

Import requires a csv file defining the paths to input files, an API key for your DSA instance, and a project name: ::

> python data_import.py inputs.csv UI65ixMezye0LpBOyYozArB9czPu3PLNpq0RGlGn new_project

Here, inputs.csv lists the whole-slide image, feature h5 file, pixelmap .tiff image, bounding box csv, and pixelmap downscale factor on each row: ::

> more inputs.csv
/remote/a.svs,/remote/a.svs.feature.h5,/remote/a.svs.pixelmap.tiff,/local/a.svs.boxes.csv,4
/remote/b.svs,/remote/b.svs.feature.h5,/remote/b.svs.pixelmap.tiff,/local/b.svs.boxes.csv,4

Feature h5 filenames should follow the pattern [slide_filename].*.feature.h5, but other filenames are unrestricted.

If importing data from DSA mounted storage, provide an identifier for the assetstore where the files are mounted using the -a option. This
identifier can be determined from the DSA Admin console.

-a, --assetstore Identifier for storage assetstore if importing files
-u, --url URL for server. Defaults to http://localhost:8080/api/v1
-r, --replace Replace existing wsis, features, or pixelmaps
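
For example, a hypothetical import from mounted storage on a remote deployment, replacing any existing items, might look like the following (the API key, assetstore identifier, and URL are placeholders): ::

    > python data_import.py inputs.csv <api_key> new_project -a <assetstore_id> -u http://dsa.example.com/api/v1 -r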

Features
--------

275 changes: 275 additions & 0 deletions data_import.py
@@ -0,0 +1,275 @@
import argparse
from girder_client import GirderClient
import hashlib
import os
from tqdm import tqdm


def _import_str(path, file, destination):
'''Format the arguments for the assetstore/id/import endpoint to
import a single file.'''
return dict(
importPath=path,
destinationId=destination,
destinationType='folder',
fileIncludeRegex=f'^{file}$',
progress=True,
)


def _import_file(client, assetstore, folder, file, replace=False):
'''Import a single file given the assetstore, filename and destination.
Optionally replace the item if an item with the same name exists.'''
match = list(client.listItem(folder, name=os.path.split(file)[1]))
import_args = _import_str(
os.path.split(file)[0], os.path.split(file)[1], folder
)
if len(match) and not replace:
return match[0]['_id']
elif len(match) and replace:
client.delete(f'item/{match[0]["_id"]}')
client.post(f'assetstore/{assetstore}/import', import_args)
match = list(client.listItem(folder, name=os.path.split(file)[1]))
assert len(match) == 1
return match[0]['_id']


def _upload_file(client, folder, file, replace=False):
    '''Upload a single file given the destination folder and filename.
Optionally replace the item if an item with the same name exists.'''
match = list(client.listItem(folder, name=os.path.split(file)[1]))
if len(match) and replace:
client.delete(f'item/{match[0]["_id"]}')
return client.uploadFileToFolder(folder, file)['_id']
elif len(match) and not replace:
return match[0]['_id']
else:
return client.uploadFileToFolder(folder, file)['_id']


def _feature_h5filename(wsi_id, boxes, patchsize=100):
'''Generate h5 feature filename.
wsi_id : str
The girder identifier of the associated whole slide image.
boxes : list of floats
Bounding boxes of objects in the order they appear in the feature array.
This is a 1D array/list with left, top, right, bottom in sequence for
each box at scan magnification.
patchsize : int
The size of the patches used during feature extraction.
'''
hashval = repr(
dict(itemId=wsi_id, bbox=[int(v) for v in boxes], patchSize=patchsize)
)
hash = hashlib.new('sha256', hashval.encode()).hexdigest()
return f'feature-{hash}.h5'


def pixelmap_annotation(pixelmap_id, scale, boxes):
'''Generate JSON format pixelmap annotation to attach to whole-slide image.
pixelmap_id : string
The girder identifier of the pixelmap image (not the whole-slide image).
scale : float
        Scaling ratio between the whole-slide and pixelmap image resolutions. For
        example, the scale for a WSI at 20X and a pixelmap at 5X would be 4.
    boxes : list of floats
        Flattened bounding boxes (left, top, right, bottom in sequence for each
        object) in the order the objects appear in the feature array.
    '''

    # one entry per bounding box; every object starts in the single 'default' category
    values = [0] * (len(boxes) // 4)
categories = [dict(
label="default", fillColor="rgba(0, 0, 0, 0)", strokeColor="rgba(0, 0, 0, 1)"
)]
transform = dict(
xoffset=0, yoffset=0, matrix=[[scale, 0], [0, scale]]
)
pixelmap = dict(
type='pixelmap',
girderId=pixelmap_id,
boundaries=True,
transform=transform,
values=values,
categories=categories,
        user={'bbox': boxes},
    )
    attr = dict(
        cli=None,
        metadata={},
        params={"scale_x": scale, "scale_y": scale},
        version=None
    )
    annotation = dict(
        name='Superpixel Epoch 0',
        elements=[pixelmap],
        attributes=attr
    )
return annotation
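
# For illustration only (the identifiers and numbers below are hypothetical): calling
# pixelmap_annotation('<pixelmap girder id>', 4, [1, 1, 4, 4, 6, 6, 9, 9]) for two
# objects returns an annotation named 'Superpixel Epoch 0' holding one 'pixelmap'
# element with values == [0, 0] (both objects start in the 'default' category), a
# 4x scaling transform, and the flattened boxes stored under element['user']['bbox'].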


def guided_label_import(
client, collection, wsis, features, pixelmaps, boxes, scales, assetstore=None, replace=False
):
'''Import or upload a guided labeling dataset to a digital slide archive instance.

:param client: An authenticated GirderClient object.
:param collection: Name of the project to create. If a collection with this name exists,
it will be used.
:param wsis: A list of paths to wsi filenames on local (upload) or mounted (import)
storage.
:param features: A list of paths to h5 feature files in the same order as `wsis`.
:param pixelmaps: A list of paths to tiff pixelmap image files in the same order as
`wsis`.
:param boxes: A list of 2D arrays containing the left, top, right, and bottom of the
bounding box for each object in each pixelmap. Coordinates should be listed at
native scan magnification. The order of objects in each 2D array should follow
the order of values in the corresponding pixelmap.
:param scales: The float ratios of resolutions between the whole-slide images and the
corresponding pixelmap images.
:param assetstore: The girder id of the assetstore if data will be imported.
Default value of `None` means data will be uploaded and that all paths in `wsis`,
`features`, and `pixelmaps` are local file paths.
    :param replace: If True, replace items during import or upload where filenames match.
'''

# if collection does not exist, create it, otherwise get collection id
match = client.get('collection', dict(text=collection, limit=0))
if len(match):
collection = match[0]['_id']
else:
collection = client.post('collection', dict(name=collection))['_id']

# construct folders if necessary
data_folder = client.loadOrCreateFolder('Data', collection, 'collection')['_id']
client.addMetadataToFolder(data_folder, {'active_learning': True})
feature_folder = client.loadOrCreateFolder('Features', data_folder, 'folder')['_id']
pixelmap_folder = client.loadOrCreateFolder('Annotations', data_folder, 'folder')['_id']
client.loadOrCreateFolder('Models', data_folder, 'folder')['_id']

# import if data assetstore provided, otherwise upload
wsi_ids = {}
feature_ids = {}
pixelmap_ids = {}
for w, f, p, b, s in tqdm(
zip(wsis, features, pixelmaps, boxes, scales), total=len(wsis),
desc='Importing' if assetstore else 'Uploading'
):
if assetstore:
wsi_ids[w] = _import_file(client, assetstore, data_folder, w, replace)
feature_ids[f] = _import_file(client, assetstore, feature_folder, f, replace)
pixelmap_ids[p] = _import_file(client, assetstore, pixelmap_folder, p, replace)
else:
wsi_ids[w] = _upload_file(client, data_folder, w, replace)
feature_ids[f] = _upload_file(client, feature_folder, f, replace)
            pixelmap_ids[p] = _upload_file(client, pixelmap_folder, p, replace)

# check for existing pixelmap annotation in wsi
# generate and post annotation if necessary
existing = client.get(f'/annotation/item/{wsi_ids[w]}')
document = pixelmap_annotation(
pixelmap_ids[p], s, [x for box in b for x in box]
)
if len(existing) and replace:
for annotation in existing:
client.delete(f'/annotation/{annotation["_id"]}')
client.post(f'/annotation?itemId={wsi_ids[w]}', json=document)
else:
skip = [
element['type'] == 'pixelmap'
for annotation in existing
for element in annotation['annotation']['elements']
]
if not any(skip):
client.post(f'/annotation?itemId={wsi_ids[w]}', json=document)
return collection
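
# Illustrative direct call (paths and ids are hypothetical; main() below normally
# builds these lists from the inputs CSV):
#
#   client = GirderClient(apiUrl='http://localhost:8080/api/v1')
#   client.authenticate(apiKey='<api key>')
#   guided_label_import(
#       client, 'new_project',
#       wsis=['/remote/a.svs'], features=['/remote/a.svs.feature.h5'],
#       pixelmaps=['/remote/a.svs.pixelmap.tiff'],
#       boxes=[[[1, 1, 4, 4]]], scales=[4.0],
#       assetstore='<assetstore id>', replace=False,
#   )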


def main():
parser = argparse.ArgumentParser(
description=(
'Import or upload a guided labeling dataset to a digital slide archive instance.'
)
)
parser.add_argument(
'input',
type=str,
help=(
'Comma separated file listing wsi, feature (h5), pixelmap (tiff), and bounding box '
'(csv - local) input files.'
),
)
parser.add_argument(
'key',
type=str,
help=(
'API key for the server (see /api/v1#/api_key/api_key_createKey_post_api_key).'
),
)
parser.add_argument(
'collection',
type=str,
help=(
'Name of the created collection / project.'
),
)
parser.add_argument(
'-u',
'--url',
type=str,
default='http://localhost:8080/api/v1',
help=(
            'Optional URL for the DSA API. Defaults to http://localhost:8080/api/v1.'
),
)
parser.add_argument(
'-a',
'--assetstore',
type=str,
help=(
'Optional identifier of the assetstore for file import. Defaults to upload (None).'
),
)
parser.add_argument(
'-r',
'--replace',
dest='replace',
action='store_true',
help=(
            'Optionally replace existing wsis, features, or pixelmaps. Defaults to no replacement.'
),
)
args = parser.parse_args()

# create and authenticate client
client = GirderClient(apiUrl=args.url)
client.authenticate(apiKey=args.key)

# parse input to build lists of files
inputs = []
with open(args.input, 'r') as f:
        for line in f:
            if not line.strip():
                continue
            wsi, feature, pixelmap, box, scales = line.strip().split(',')
            inputs.append((wsi, feature, pixelmap, box, scales))
wsis = [row[0] for row in inputs]
features = [row[1] for row in inputs]
pixelmaps = [row[2] for row in inputs]
boxes = [row[3] for row in inputs]
scales = [float(row[4]) for row in inputs]

# build list of bounding boxes
bounding = []
for b in boxes:
with open(b, 'r') as f:
box = [
[int(x) for x in line.strip().split(',')]
for line in f
]
bounding.append(box)

# import if assetstore defined, otherwise upload
guided_label_import(
client, args.collection,
wsis, features, pixelmaps, bounding, scales,
args.assetstore, args.replace
)


if __name__ == "__main__":
main()
Binary file added docs/screenshots/pixelmap.png