dana-farber-aios / pathml
Tools for computational pathology
Home Page: https://pathml.org
License: GNU General Public License v2.0
I am able to import bioformats and javabridge, but javabridge.start_vm(class_path=bioformats.JARS) fails.
This error should be caught so that we can give the user a message explaining how to resolve it.
On MacOS 10.15.4
>>> from pathml.preprocessing.multiparametricslide import MultiparametricSlide
Could not find Java JRE compatible with x86_64 architecture
>>> wsi = MultiparametricSlide(path = "tests/testdata/smalltif.tif")
Could not find Java JRE compatible with x86_64 architecture
Could not find Java JRE compatible with x86_64 architecture
Traceback (most recent call last):
File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 114, in _find_mac_lib
cmd = ["find", os.path.dirname(jvm_dir), "-name", library+extension]
File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/posixpath.py", line 156, in dirname
p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 278, in start_thread
library_path = _find_mac_lib("libjvm")
File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 125, in _find_mac_lib
(cmd, library), exc_info=1)
UnboundLocalError: local variable 'cmd' referenced before assignment
Failed to create Java VM
Traceback (most recent call last):
File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-81b83c29b5ca>", line 1, in <module>
wsi = MultiparametricSlide(path = "tests/testdata/smalltif.tif")
File "/Users/jacobrosenthal/PycharmProjects/pathml/pathml/preprocessing/multiparametricslide.py", line 46, in __init__
javabridge.start_vm(class_path=bioformats.JARS)
File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 319, in start_vm
raise RuntimeError("Failed to start Java VM")
RuntimeError: Failed to start Java VM
Each transform should check that the input image is compatible. For example, colorspace conversions are not applicable to multiplex images, though blurring transforms that operate on each channel may still be. This should probably just be an assert statement at the beginning of the apply() method of each transform.
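A minimal sketch of such a check, assuming images are indexed as image[row][col][channel]. RGBToGreyscale is a hypothetical transform used only for illustration; it is not an existing PathML class.

```python
class RGBToGreyscale:
    """Hypothetical colorspace transform; only meaningful for 3-channel input."""

    def apply(self, image):
        # image is a nested list indexed as image[row][col][channel]
        n_channels = len(image[0][0])
        # Fail early, with a clear message, on incompatible input.
        assert n_channels == 3, f"RGBToGreyscale expects 3 channels, got {n_channels}"
        # Unweighted channel mean, chosen only to keep the sketch short.
        return [[sum(px) / 3 for px in row] for row in image]


rgb = [[[30, 60, 90], [0, 0, 0]]]   # 1 x 2 RGB image
multiplex = [[[1, 2, 3, 4, 5]]]     # 1 x 1 image with 5 channels
grey = RGBToGreyscale().apply(rgb)
```

Calling RGBToGreyscale().apply(multiplex) raises an AssertionError instead of silently producing a meaningless result. Note that assert statements are skipped when Python runs with -O, so an explicit exception may be preferable for user-facing checks.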
Add support for tissue microarray (TMA) images.
This probably means adding functionality to take an input image and identify the separate cores.
We may be able to use TMAJ software, either directly through javabridge or as inspiration:
Example of TMA slides (source here):
Tile repr incorrectly shows "i=None" or "j=None" when i or j = 0.
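One plausible cause, stated here as an assumption because the actual repr code is not shown, is a truthiness check such as `i if i else None`, which treats 0 the same as None. The two hypothetical functions below contrast that pattern with a version that formats the value directly.

```python
def buggy_repr(i, j):
    # 0 is falsy, so `i if i else None` turns a valid index of 0 into None
    return f"Tile(i={i if i else None}, j={j if j else None})"


def fixed_repr(i, j):
    # Formatting the value directly keeps 0 and still prints None when unset
    return f"Tile(i={i}, j={j})"


print(buggy_repr(0, 3))  # Tile(i=None, j=3)
print(fixed_repr(0, 3))  # Tile(i=0, j=3)
```

If a branch on "unset" is needed, comparing with `is not None` avoids the problem.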
I just tried installing PathML and running the tests on a new VM and ran into problems with multiparametricslide.
(pathml) jupyter@shared-dxvm-gpu:~/pathml$ python -m pytest
============================================= test session starts =============================================
platform linux -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jupyter/pathml
collected 74 items / 1 error / 73 selected
=================================================== ERRORS ====================================================
___________________ ERROR collecting tests/preprocessing_tests/test_multiparametricslide.py ___________________
ImportError while importing test module '/home/jupyter/pathml/tests/preprocessing_tests/test_multiparametricslide.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
pathml/preprocessing/multiparametricslide.py:10: in <module>
import bioformats
E ModuleNotFoundError: No module named 'bioformats'
During handling of the above exception, another exception occurred:
/opt/conda/envs/pathml/lib/python3.8/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/preprocessing_tests/test_multiparametricslide.py:5: in <module>
from pathml.preprocessing.multiparametricslide import MultiparametricSlide2d
pathml/preprocessing/multiparametricslide.py:25: in <module>
raise ImportError("MultiparametricSlide2d requires javabridge and bioformats")
E ImportError: MultiparametricSlide2d requires javabridge and bioformats
============================================== warnings summary ===============================================
pathml/preprocessing/multiparametricslide.py:16
/home/jupyter/pathml/pathml/preprocessing/multiparametricslide.py:16: UserWarning: MultiparametricSlide2d requires a jvm to interface with java bioformats library.
See: https://pythonhosted.org/javabridge/installation.html. You can install using:
sudo apt-get install openjdk-8-jdk
pip install javabridge
pip install python-bioformats
warn(
-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================================== short test summary info ===========================================
ERROR tests/preprocessing_tests/test_multiparametricslide.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================= 1 warning, 1 error in 0.62s =========================================
So then I tried to install openjdk using the instructions, but that didn't work:
(pathml) jupyter@shared-dxvm-gpu:~/pathml$ sudo apt-get install openjdk-8-jdk
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package openjdk-8-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'openjdk-8-jdk' has no installation candidate
OS info:
(pathml) jupyter@shared-dxvm-gpu:~/pathml$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
(pathml) jupyter@shared-dxvm-gpu:~/pathml$ uname -a
Linux shared-dxvm-gpu 4.19.0-12-cloud-amd64 #1 SMP Debian 4.19.152-1 (2020-10-18) x86_64 GNU/Linux
After some searching online, it seems like openjdk-8-jdk is not supported anymore (see here for example). I think the issue is that python-javabridge is not really being actively developed (see here). We need to find a better solution - either give users very detailed instructions for how to install openjdk-8 (which doesn't seem like a great solution since it isn't officially supported anymore) or drop bioformats/javabridge dependency and use a different tool to support multiparametric slides.
I am also confused about why the tests pass successfully, since they also run sudo apt-get install openjdk-8-jdk.
Create templates for certain types of issues (e.g. feature request, bug report, etc.)
Currently SlideData design emphasizes methods required to run pipeline.
Implement convenience functions for plotting, slicing, otherwise manipulating slides.
This will allow people to pip install pathml
We should look into using release tags: https://docs.github.com/en/free-pro-team@latest/github/administering-a-repository/about-releases
Implement a method to run a preprocessing pipeline on a dataset.
Should basically be a convenience function for running pipeline on each individual image.
Pseudocode:
mydataset = pathml.datasets.PESO.download()
mypipeline = pathml.preprocessing.default_pipeline()
mypipeline.run(mydataset)
### should be equivalent to:
for wsi in mydataset:
    mypipeline.run(wsi)
Classes which aren't meant to be instantiated should be abstract classes (i.e. inherit from abc.ABC).
This is probably cleaner than current implementation of raising NotImplementedError, since abstract classes can't be initialized by users (they will get an error).
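A minimal sketch of the idea, using hypothetical class names rather than the actual PathML classes. Note that inheriting from abc.ABC blocks instantiation only when the class declares at least one abstract method.

```python
from abc import ABC, abstractmethod


class BaseTransform(ABC):
    """Hypothetical abstract base class; cannot be instantiated directly."""

    @abstractmethod
    def apply(self, image):
        """Subclasses must implement this."""


class Identity(BaseTransform):
    """Concrete subclass that implements the abstract method."""

    def apply(self, image):
        return image


try:
    BaseTransform()
    error_message = None
except TypeError as err:
    # Python refuses to instantiate a class with unimplemented abstract methods
    error_message = str(err)
```

With this pattern the error appears when the object is created, rather than later when an unimplemented method happens to be called.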
Add support for tile-level
wrap dict of masks
each mask stores pixel-wise int8
repr method (keys, dimensions)
getitem method
len method
by default should be multiparametric single plane
subclass volumetric
Pipeline.run() uses concurrent.futures.ProcessPoolExecutor to implement multiprocessing when run on a dataset.
I'm not sure how to write a test for this though. Everything I've tried so far has caused Pytest to hang.
We need to support multiple instance learning. For example, if we only have slide-level labels, we can treat each slide as a bag of tiles and use the slide label as the bag label.
Need to do more research on what the best way to implement this in pytorch is.
TODO weekend sprint:
repo structure:
pathml
utils.py
-> core
-> preprocessing
-> ml
-> datasets
refactor:
We can define a SlideData._repr_html_() method (or maybe SlideData._repr_jpeg_()) which would let us produce pretty outputs in Jupyter notebooks.
For example we could make this method display a thumbnail of the image by default, along with some text describing it.
This would be nice for users since you could see the slide without having to call any methods.
This is lower priority but seems straightforward to implement
see: https://ipython.readthedocs.io/en/stable/config/integrating.html#rich-display
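A minimal sketch of the hook, assuming a stripped-down stand-in for SlideData with only a name and a shape (both hypothetical attributes here). Only the text description is shown; the thumbnail is left out.

```python
import html


class SlideData:
    """Hypothetical minimal stand-in for the real SlideData class."""

    def __init__(self, name, shape):
        self.name = name
        self.shape = shape

    def _repr_html_(self):
        # Jupyter calls this hook, when present, to render the object as HTML.
        # Values are escaped so that odd file names cannot break the markup.
        return (
            f"<b>SlideData</b>: {html.escape(self.name)}"
            f"<br>shape: {html.escape(str(self.shape))}"
        )


slide = SlideData("example<1>.svs", (1024, 2048))
rendered = slide._repr_html_()
```

In a notebook, evaluating `slide` as the last expression of a cell would display the rendered HTML. A thumbnail could be added by embedding an image tag in the returned string.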
The masks in PanNuke dataloaders don't match the images.
This is obviously a big problem for training models...
import matplotlib.pyplot as plt
import numpy as np

from pathml.datasets.pannuke import PanNukeDataModule

pannuke = PanNukeDataModule(
    data_dir="../data/pannuke/",
    download=False,
    nucleus_type_labels=True,
    batch_size=8,
    hovernet_preprocess=True,
    split=1
)
train = pannuke.train_dataloader
images, masks, hvs, types = next(iter(train))

# show the first image of the batch next to its mask
fig, ax = plt.subplots(nrows=1, ncols=2)
im = np.moveaxis(images[0, ...].numpy(), 0, 2)  # (C, H, W) -> (H, W, C)
ax[0].imshow(im)
mask = masks.argmax(dim=1)[0, ...]
ax[1].imshow(mask)
plt.show()
I think this may be happening because the lists of filepaths for masks and images are created separately using pathlib.Path.glob(), but glob is unordered.
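If that diagnosis is right, one possible fix is to sort both listings and verify that they line up before pairing them. The sketch below assumes that an image and its mask share the same file name in two separate directories, which may not match the actual PanNuke layout; paired_paths is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def paired_paths(images_dir, masks_dir, pattern="*"):
    """Return (image, mask) path pairs, matched by sorted file name."""
    images = sorted(Path(images_dir).glob(pattern))
    masks = sorted(Path(masks_dir).glob(pattern))
    # Fail loudly instead of silently pairing an image with the wrong mask
    if [p.name for p in images] != [p.name for p in masks]:
        raise ValueError("image and mask file names do not line up")
    return list(zip(images, masks))


# Small demonstration on empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sub in ("images", "masks"):
        (root / sub).mkdir()
        for name in ("b.npy", "a.npy", "c.npy"):
            (root / sub / name).touch()
    pairs = paired_paths(root / "images", root / "masks")
    pair_names = [(im.name, mask.name) for im, mask in pairs]
```

Sorting makes the order deterministic, and the name check guards against a missing or extra file shifting every later pair.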
Digital Imaging and Communications in Medicine (DICOM) is the standard for the representation, storage, and communication of medical images and related information. A DICOM file format and communication protocol for pathology have been defined. Whole slide image data can be encoded together with relevant patient and specimen-related metadata as DICOM objects.
As DICOM becomes more widely adopted in digital pathology, support for this file format may need to be included in PathML. One approach is to create a class that inherits from BaseSlide and can ingest DICOM files. The class could also implement methods specific to DICOM, like reading metadata.
DataSet object for whole-slide images.
This should be:
When a dataset is downloaded from the datasets module, it should return a DataSet object. Users should also be able to create a DataSet object from files that they have locally.
see: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class
Currently bioformats limits file size to 2GB because of java array size limitations.
Two options:
Implement HoVer-Net (https://arxiv.org/pdf/1812.06499.pdf)
Guidelines for code
Guidelines for commits
Some bugs I ran into with PanNuke dataset implementation.
Putting here to track fixing and also adding new tests.
pannuke_dset = PanNukeDataset(
    data_dir = "../data/pannuke",
    fold_ix = None,
    hovernet_preprocess = True,
    nucleus_type_labels = True,
)
im, mask, hv, t = pannuke_dset[0]
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-14-c5513ce6bd51> in <module>
----> 1 im, mask, hv, t = pannuke_dset[0]
~/pathml/pathml/datasets/pannuke.py in __getitem__(self, ix)
110 # sum across mask channels to squash mask channel dim to size 1
111 # don't sum the last channel, which is background!
--> 112 mask_1c = pannuke_multiclass_mask_to_nucleus_mask(mask)
113 else:
114 mask_1c = mask
~/pathml/pathml/datasets/pannuke.py in pannuke_multiclass_mask_to_nucleus_mask(multiclass_mask)
135 """
136 # verify shape of input
--> 137 assert multiclass_mask.ndim == 3 and multiclass_mask.shape[0] == 6, \
138 f"Expecting a batch of masks with dims (6, 256, 256). Got input of shape {multiclass_mask.shape}"
139 assert multiclass_mask.shape[1] == 256 and multiclass_mask.shape[2] == 256, \
AssertionError: Expecting a batch of masks with dims (6, 256, 256). Got input of shape (256, 6, 256)
2. _clean_up_download_pannuke() problem
pannuke = PanNukeDataModule(
    data_dir="../data/pannuke/",
    download=True,
    nucleus_type_labels=True,
    batch_size=8,
    hovernet_preprocess=True,
    split=1,
    transforms=None,
)
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-4-426a1c4fd670> in <module>
----> 1 pannuke = PanNukeDataModule(
2 data_dir="../data/pannuke/",
3 download=True,
4 nucleus_type_labels=True,
5 batch_size=8,
~/pathml/pathml/datasets/pannuke.py in __init__(self, data_dir, download, shuffle, transforms, nucleus_type_labels, split, batch_size, hovernet_preprocess)
198 self.download = download
199 if download:
--> 200 self._download_pannuke(self.data_dir)
201 else:
202 # make sure that subdirectories exist
~/pathml/pathml/datasets/pannuke.py in _download_pannuke(self, download_dir)
241
242 self._process_downloaded_pannuke(download_dir)
--> 243 self._clean_up_download_pannuke(download_dir)
244
245 @staticmethod
~/pathml/pathml/datasets/pannuke.py in _clean_up_download_pannuke(pannuke_dir)
306 downloaded_dir = p / f"Fold {fold_ix}"
307 zip_file.unlink()
--> 308 downloaded_dir.rmdir()
309
310
/opt/conda/envs/pathml/lib/python3.8/pathlib.py in rmdir(self)
1333 if self._closed:
1334 self._raise_closed()
-> 1335 self._accessor.rmdir(self)
1336
1337 def lstat(self):
OSError: [Errno 39] Directory not empty: '../data/pannuke/Fold 1'
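The traceback is consistent with how pathlib.Path.rmdir() behaves: it only removes empty directories. A minimal sketch of one possible workaround uses shutil.rmtree, under the assumption that anything left in the downloaded folder is disposable. remove_downloaded_dir is a hypothetical helper, not part of PathML.

```python
import shutil
import tempfile
from pathlib import Path


def remove_downloaded_dir(downloaded_dir):
    """Delete a directory and everything in it (assumes contents are disposable)."""
    downloaded_dir = Path(downloaded_dir)
    if downloaded_dir.exists():
        shutil.rmtree(downloaded_dir)


with tempfile.TemporaryDirectory() as tmp:
    fold_dir = Path(tmp) / "Fold 1"
    (fold_dir / "leftover").mkdir(parents=True)
    try:
        fold_dir.rmdir()  # fails because the directory is not empty
        rmdir_failed = False
    except OSError:
        rmdir_failed = True
    remove_downloaded_dir(fold_dir)
    still_exists = fold_dir.exists()
```

Before adopting this, it would be worth checking what is left in the folder, since unexpected leftovers may indicate that the processing step missed some files.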
These should be informative and clear
Rewrite SlideData Class.
name: str
shape <- dict of shapes of slide, masks, tiles
slide: Slide (if slide loaded using Bioformats)
masks: pathml.Masks
tiles: pathml.Tiles
labels: (Masks, str, int, floats)
history: list
slidetype: str (e.g. "HE" or "IHC"). Set when SlideData class is initialized
init - use appropriate backend (openslide or bioformats)
repr
read_region(level)
make_tiles(Pipeline, optional)
chunks(shape, stride) --> generator of chunk objects
plot() --> matplotlib (also handle masks in plot)
save()
This is a good best-practice
The CONTRIBUTING file gives instructions and guidelines for contributors. We can start with something basic and expand as needed down the road.
There is a contributing.rst file in the documentation, but we should have it in the root directory so it is easily accessible. Also, if it is in the root directory, GitHub will automatically link to this file when a contributor creates an issue or opens a pull request.
Helpful resources:
Different slides may have different microns per pixel (MPP) depending on the physical parameters of the scanner.
This means that for any two slides, level 0 may be at different pixel resolution.
We should provide a way to standardize pixel resolution of slides, so that we know that all images in a dataset are the same resolution.
Openslide objects have slide.properties["openslide.mpp-x"] and slide.properties["openslide.mpp-y"], which we can use.
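A minimal sketch of how those two properties could be turned into resampling factors. mpp_scale_factors and target_mpp are hypothetical names, and the function takes any mapping with the two OpenSlide keys so that it can be tried without a slide file. OpenSlide stores property values as strings, and the MPP properties may be absent for some slides, in which case this sketch raises KeyError.

```python
def mpp_scale_factors(properties, target_mpp):
    """Return (scale_x, scale_y) for resampling level 0 to target_mpp.

    A factor below 1 means downsampling, above 1 means upsampling.
    """
    mpp_x = float(properties["openslide.mpp-x"])
    mpp_y = float(properties["openslide.mpp-y"])
    return mpp_x / target_mpp, mpp_y / target_mpp


# Example: a slide scanned at 0.25 microns per pixel, standardized to 0.5
props = {"openslide.mpp-x": "0.25", "openslide.mpp-y": "0.25"}
scale = mpp_scale_factors(props, target_mpp=0.5)  # (0.5, 0.5): halve each dimension
```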
Pip is currently not listed in environment.yml file.
Conda gives the following warning:
(base) jupyter@rosenthal-dxvm:~/pathml$ conda env create -f environment.yml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip
itself as one of your conda dependencies. Conda may not use the correct pip to install your
packages, and they may end up in the wrong place. Please add an explicit pip dependency.
I'm adding one for you, but still nagging you.
Writing tiles to disk should be a method of the SlideData class. This makes more sense than writing tiles as part of the tile-level preprocessor in a Pipeline. The directory to write tiles to should be specified as an argument.
Pseudocode:
data = HESlide("/path/to/image.svs").load_data()
data = my_pipeline.run(data)
data.write_tiles("/path/to/tiled/images/")
When the stride is small, the last few tiles, which lie on the edge of the slide, would be smaller than (tile size, tile size). OpenSlide zero-pads these automatically, so with pad=False the output would include undesired tiles with black edges. The number of tiles is now recalculated to solve this issue.
Haven't got a chance to use this specific svs data yet. Randomly picked tile size and a small stride for now. Will double check this.
Pseudocode:
example_image_path = "../data/CMU-1.svs"
class MySlideLoader(BaseSlideLoader):
    def apply(self, path):
        return HESlide(path).chunks(level=0, size=1024, stride=400)
data = MySlideLoader().apply(example_image_path)
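The tile-count recalculation described above can be sketched as follows. This is an illustration of the arithmetic only, with hypothetical function names, and is not the code from the actual change: a tile is kept only if it fits entirely inside the slide.

```python
def n_full_tiles(dim, size, stride):
    """Number of tiles of length `size` that fit entirely within `dim` pixels."""
    if dim < size:
        return 0
    return (dim - size) // stride + 1


def full_tile_origins(width, height, size, stride):
    """Top-left (x, y) of every tile that needs no padding, in row-major order."""
    nx = n_full_tiles(width, size, stride)
    ny = n_full_tiles(height, size, stride)
    return [(i * stride, j * stride) for j in range(ny) for i in range(nx)]


# 2500 px wide, 1024 px tall, tiles of 1024 px with stride 400
origins = full_tile_origins(2500, 1024, size=1024, stride=400)
# -> [(0, 0), (400, 0), (800, 0), (1200, 0)]
```

The tile that would start at x=1600 is dropped because it would end at 2624, past the 2500 px edge.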
Add some tests to make sure that the ML models are working correctly.
For example, this may involve overfitting on a toy dataset and verifying that performance is above some threshold.
Create a dataloader class similar to torch.utils.data.DataLoader
Adding https://github.com/choosehappy/HistoQC module to perform rigorous removal of unwanted artifacts in the data
Need to fix bugs in documentation. Should also add tests to make sure that docs compile successfully.
(pathml) jupyter@shared-dxvm-gpu:~/pathml/docs$ make html
Running Sphinx v3.4.2
making output directory... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 17 source files that are out of date
updating environment: [new config] 17 added, 0 changed, 0 removed
/home/jupyter/pathml/pathml/preprocessing/multiparametricslide.py:16: UserWarning: MultiparametricSlide2d requires a jvm to interface with java bioformats library.
See: https://pythonhosted.org/javabridge/installation.html. You can install using:
sudo apt-get install openjdk-8-jdk
pip install javabridge
pip install python-bioformats
warn(
WARNING: autodoc: failed to import module 'multiparametricslide' from module 'pathml.preprocessing'; the following exception was raised:
MultiparametricSlide2d requires javabridge and bioformats
Notebook error:
Problems with linked notebook "examples/link_advanced_HE_chunks" path:
InputError: [Errno 2] No such file or directory: '../examples/advanced_HE_chunks.ipynb'.
make: *** [Makefile:20: html] Error 2
Note that this issue is about the error in compilation. The javabridge warning should be fixed when we fix #48
The example notebooks should be more comprehensive and better organized. For many people, the example notebooks will be their first experience with PathML, so we want them to be really good.
Some ideas:
Should we use PyTorch Lightning in PathML?
Pros:
Cons:
Other thoughts:
The argument of Pipeline.run() should be an object inheriting from BaseSlide, rather than a file path.
This means that whenever we run a pipeline, we can trust that it implements everything from BaseSlide.
If we just pass a path, it may be ambiguous how to read it (is it a H&E slide, or a multiplex slide, or...?). All the work in reading the file, etc. should happen when creating the slide object, not in the pipeline object.
This will let us do things like tile.Blur(kernel_size = 7) for arbitrary transforms.
Here's a code snippet that I was trying but couldn't get to work:
class Transform:
    def __init__(self, test):
        self.test = test
    def apply(self, target):
        print(f"applying on target of type {type(target)}. kwargs: {self.test}")
class Target:
    def __init__(self, name):
        self.name = name
    def __getattr__(self, item):
        print(f"type of item: {type(item)}")
        print(str(item))
        t = item(**kwargs)
        t.apply(self)
target = Target(name = "testtarget")
target.Transform(test = "testitem")
See: https://rosettacode.org/wiki/Respond_to_an_unknown_method_call#Python
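One reason the snippet above cannot work is that __getattr__ receives only the attribute name as a string; the keyword arguments arrive later, when the returned object is called. A minimal sketch of a working variant looks the name up in a registry and returns a callable. The TRANSFORMS registry is a hypothetical addition for illustration.

```python
class Transform:
    def __init__(self, test):
        self.test = test

    def apply(self, target):
        return f"applied to {target.name} with test={self.test}"


# Hypothetical registry mapping attribute names to transform classes
TRANSFORMS = {"Transform": Transform}


class Target:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Called only when normal attribute lookup fails; `item` is a str
        try:
            transform_cls = TRANSFORMS[item]
        except KeyError:
            raise AttributeError(item) from None

        def call(**kwargs):
            # kwargs become available here, when the attribute is called
            return transform_cls(**kwargs).apply(self)

        return call


target = Target(name="testtarget")
result = target.Transform(test="testitem")
```

Raising AttributeError for unknown names keeps hasattr() and similar tools behaving normally.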
Currently, tiles are written to disk in the tile_level_preprocessor component of the Pipeline.
It would be better to pass a path to the output directory when running the Pipeline object, and then write all tiles to that directory. This would allow for better integration with the DataModule class, since the entire DataModule could be initialized pointing to one directory and can then:
call Pipeline.run() and write all the tiles there
create the dataset and dataloader objects, since the full filepath is known.
Pseudocode:
# initialize pipeline
my_pipeline = Pipeline(
    slide_loader = MySlideLoader(),
    slide_preprocessor = MySlidePreprocessor(),
    tile_extractor = SimpleTileExtractor(tile_size=224),
    tile_preprocessor = MyTilePreprocessor()
)
# initialize slide
slide = HESlide("/path/to/image.svs")
# run pipeline on slide
my_pipeline.run(slide, out_dir = "./data/preprocessed")
Provided instructions:
conda install sphinx # install sphinx package for generating docs
cd docs # enter docs directory
make html # build docs in html format
fail on Linux (tested on Linux Mint 19.2). Additionally required:
pip install nbsphinx
pip install nbsphinx_link
pip install sphinx_rtd_theme
pandoc https://pandoc.org/installing.html
Use GitHub actions to automatically run tests when code is pushed
See:
We can set up an automated workflow to measure code coverage and add it in a badge on the project readme.
https://github.com/codecov/codecov-action
This is not high priority at the moment but filing here to do later
Slide objects should have a method that returns an iterator over "chunks" so that the image can be processed chunk-wise instead of loading the entire thing into memory.
The abstract method should be declared in BaseSlide, but each slide type (e.g. HESlide, MultiparametricSlide) may have to implement it differently based on the backend (e.g. openslide or bioformats).
Pseudocode:
slide = HESlide("/path/to/image.svs")
for chunk in slide.generate_chunks(level=0, size=1024, ...):
    # operate on each 1024x1024 chunk
    preprocess(chunk)
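The backend-independent part of such a method could be a generator of chunk coordinates, which each slide type would then pass to its own region reader. A minimal sketch, assuming non-overlapping chunks and a hypothetical function name:

```python
def generate_chunk_coords(width, height, size):
    """Lazily yield (x, y, w, h) boxes that cover the image without overlap.

    Chunks on the right and bottom edges are clipped to the image bounds.
    """
    for y in range(0, height, size):
        for x in range(0, width, size):
            yield x, y, min(size, width - x), min(size, height - y)


coords = list(generate_chunk_coords(width=2500, height=1024, size=1024))
# -> [(0, 0, 1024, 1024), (1024, 0, 1024, 1024), (2048, 0, 452, 1024)]
```

Because this is a generator, only one set of coordinates exists at a time, so a caller can read and process one chunk before moving to the next.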
We want to be able to share pre-trained models. The trained model weights can be saved to disk, e.g. in .pth files for pytorch. However, these files can be quite big, too big to put in the GitHub repo itself.
We need to find a solution for hosting these large files of model parameters.
E.g. we could have a GCP bucket, or S3 bucket.
Need to evaluate the costs of different options.
We need a way to share pipelines by writing them to a file
Pseudocode:
my_pipeline = Pipeline(**kwargs)
my_pipeline.save("/path/to/disk/pipeline.pickle")
## someone else can then load and use:
pipeline = load("/path/to/local/downloads/pipeline.pickle")
pipeline.run(local_slide)
Make SlideData the core pathml object, combine pipeline and transforms into methods in SlideData
Preprocessing has become a catch-all directory, improve directory structure
Slide classes should be reorganized based on dimensions and slide type.
This hierarchical class structure is more logical and will also help with making sure that the transforms work properly (#18). For example, some transforms may work for all 2d images regardless of number of channels, but others may only be applicable for RGB images, and others may be specific to certain types (e.g. H&E stain deconvolution).
Datamodule and Dataloader for https://camelyon17.grand-challenge.org/
Docstrings are currently written in basic Sphinx format. However, basic Sphinx doesn't support a References section so I had to start using the Napoleon extension. Since we are already using Napoleon, we may as well stick with Google or numpy docstring format moving forward, since it is more readable for humans.