dana-farber-aios / pathml

Tools for computational pathology

License: GNU General Public License v2.0

Python 99.49% Dockerfile 0.51%

machine-learning digital-pathology computational-pathology biomedical-image-processing pathology histopathology spatial-transcriptomics image-analysis microscopy fluorescence-microscopy-imaging

pathml's Issues

MultiparametricSlide java

I am able to import bioformats and javabridge, but javabridge.start_vm(class_path=bioformats.JARS) fails.
This error should be caught so that we can show the user a message explaining how to resolve it.

On MacOS 10.15.4

>>> from pathml.preprocessing.multiparametricslide import MultiparametricSlide
Could not find Java JRE compatible with x86_64 architecture
>>> wsi = MultiparametricSlide(path = "tests/testdata/smalltif.tif")
Could not find Java JRE compatible with x86_64 architecture
Could not find Java JRE compatible with x86_64 architecture
Traceback (most recent call last):
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 114, in _find_mac_lib
    cmd = ["find", os.path.dirname(jvm_dir), "-name", library+extension]
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/posixpath.py", line 156, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 278, in start_thread
    library_path = _find_mac_lib("libjvm")
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 125, in _find_mac_lib
    (cmd, library), exc_info=1)
UnboundLocalError: local variable 'cmd' referenced before assignment
Failed to create Java VM
Traceback (most recent call last):
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-81b83c29b5ca>", line 1, in <module>
    wsi = MultiparametricSlide(path = "tests/testdata/smalltif.tif")
  File "/Users/jacobrosenthal/PycharmProjects/pathml/pathml/preprocessing/multiparametricslide.py", line 46, in __init__
    javabridge.start_vm(class_path=bioformats.JARS)
  File "/Users/jacobrosenthal/.conda/envs/pathml/lib/python3.7/site-packages/javabridge/jutil.py", line 319, in start_vm
    raise RuntimeError("Failed to start Java VM")
RuntimeError: Failed to start Java VM
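A minimal sketch of how the failure could be caught and re-raised with guidance. The names JVMStartupError and start_jvm_or_explain are hypothetical, and the start function is passed in as an argument (for example javabridge.start_vm) so that the sketch does not depend on javabridge being installed. The advice about JAVA_HOME in the message is an assumption about the likely fix, not something confirmed in this issue.

```python
class JVMStartupError(RuntimeError):
    """Raised when the Java VM cannot be started (hypothetical helper class)."""


def start_jvm_or_explain(start_vm, class_path):
    """Call start_vm(class_path=...) and re-raise failures with guidance.

    start_vm is injected by the caller, e.g. javabridge.start_vm.
    """
    try:
        start_vm(class_path=class_path)
    except RuntimeError as err:
        # Keep the original exception as the cause so no detail is lost.
        raise JVMStartupError(
            "Could not start the Java VM, which MultiparametricSlide needs "
            "for bioformats. Check that a Java JDK/JRE is installed and that "
            f"JAVA_HOME points to it. Original error: {err}"
        ) from err
```

In MultiparametricSlide.__init__ this would replace the bare javabridge.start_vm(...) call, so users see an actionable message instead of three stacked tracebacks.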

Check image compatibility for transforms

Each transform should check that the input image is compatible. For example, colorspace conversions are not applicable for multiplex images, though blurring transforms that operate on each channel may still be. This should probably just be an assert statement at the beginning of apply() method of each transform.
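A rough sketch of what such a check could look like. Image, RGBToGrayscale, and Blur here are hypothetical stand-ins written for illustration, not the actual PathML classes.

```python
class Image:
    """Hypothetical image container; only tracks channel information."""

    def __init__(self, n_channels, is_rgb):
        self.n_channels = n_channels
        self.is_rgb = is_rgb


class RGBToGrayscale:
    """Hypothetical colorspace transform; only meaningful for RGB input."""

    def apply(self, image):
        # Fail early with a clear message if the input is not RGB.
        assert image.is_rgb and image.n_channels == 3, (
            f"{type(self).__name__} requires a 3-channel RGB image, "
            f"got {image.n_channels} channels"
        )
        return Image(n_channels=1, is_rgb=False)


class Blur:
    """Hypothetical per-channel transform; works for any channel count."""

    def apply(self, image):
        assert image.n_channels >= 1, "Blur requires at least one channel"
        return image
```

With this, applying RGBToGrayscale to a 40-channel multiplex image raises an AssertionError immediately, while Blur accepts it.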

TMA support

Add support for tissue microarray (TMA) images.
This probably means adding functionality to take an input image and identify the separate cores.

We may be able to use TMAJ software, either directly through javabridge or as inspiration.

Tile repr wrong when i or j = 0

Tile repr incorrectly shows "i=None" or "j=None" when i or j = 0.
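The issue does not show the offending code, so the following is only a guess at the cause: a truthiness test on i and j, which treats 0 like None. The Tile class below is a hypothetical minimal version that reproduces the symptom and shows one way to avoid it.

```python
class Tile:
    """Hypothetical minimal Tile with only the fields used in the repr."""

    def __init__(self, i=None, j=None):
        self.i = i
        self.j = j

    def buggy_repr(self):
        # Truthiness test: 0 is falsy, so a valid index 0 prints as None.
        i = self.i if self.i else None
        j = self.j if self.j else None
        return f"Tile(i={i}, j={j})"

    def __repr__(self):
        # Formatting the value directly keeps 0 intact; None still prints as None.
        return f"Tile(i={self.i}, j={self.j})"
```

If a conditional is needed, testing `self.i is not None` instead of `self.i` avoids the problem.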

Trouble installing dependencies for multiparametricslide

I just tried installing PathML and running the tests on a new VM and ran into problems with multiparametricslide.

(pathml) jupyter@shared-dxvm-gpu:~/pathml$ python -m pytest
============================================= test session starts =============================================
platform linux -- Python 3.8.6, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/jupyter/pathml
collected 74 items / 1 error / 73 selected                                                                    

=================================================== ERRORS ====================================================
___________________ ERROR collecting tests/preprocessing_tests/test_multiparametricslide.py ___________________
ImportError while importing test module '/home/jupyter/pathml/tests/preprocessing_tests/test_multiparametricslide.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
pathml/preprocessing/multiparametricslide.py:10: in <module>
    import bioformats
E   ModuleNotFoundError: No module named 'bioformats'

During handling of the above exception, another exception occurred:
/opt/conda/envs/pathml/lib/python3.8/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/preprocessing_tests/test_multiparametricslide.py:5: in <module>
    from pathml.preprocessing.multiparametricslide import MultiparametricSlide2d
pathml/preprocessing/multiparametricslide.py:25: in <module>
    raise ImportError("MultiparametricSlide2d requires javabridge and bioformats")
E   ImportError: MultiparametricSlide2d requires javabridge and bioformats
============================================== warnings summary ===============================================
pathml/preprocessing/multiparametricslide.py:16
  /home/jupyter/pathml/pathml/preprocessing/multiparametricslide.py:16: UserWarning: MultiparametricSlide2d requires a jvm to interface with java bioformats library.
              See: https://pythonhosted.org/javabridge/installation.html. You can install using:
                  
                  sudo apt-get install openjdk-8-jdk
                  pip install javabridge
                  pip install python-bioformats
          
    warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================================== short test summary info ===========================================
ERROR tests/preprocessing_tests/test_multiparametricslide.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================= 1 warning, 1 error in 0.62s =========================================

So then I tried to install openjdk using the instructions, but that didn't work:

(pathml) jupyter@shared-dxvm-gpu:~/pathml$ sudo apt-get install openjdk-8-jdk
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package openjdk-8-jdk is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'openjdk-8-jdk' has no installation candidate

OS info:

(pathml) jupyter@shared-dxvm-gpu:~/pathml$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
(pathml) jupyter@shared-dxvm-gpu:~/pathml$ uname -a
Linux shared-dxvm-gpu 4.19.0-12-cloud-amd64 #1 SMP Debian 4.19.152-1 (2020-10-18) x86_64 GNU/Linux

After some searching online, it seems like openjdk-8-jdk is not supported anymore (see here for example). I think the issue is that python-javabridge is not really being actively developed (see here). We need to find a better solution - either give users very detailed instructions for how to install openjdk-8 (which doesn't seem like a great solution since it isn't officially supported anymore) or drop bioformats/javabridge dependency and use a different tool to support multiparametric slides.

I am also confused about why the tests pass, since they also run sudo apt-get install openjdk-8-jdk.

Create Issue Templates

Create templates for certain types of issues (e.g. feature request, bug report, etc.)

Convenience methods for slide manipulation

Currently SlideData design emphasizes methods required to run pipeline.

Implement convenience functions for plotting, slicing, otherwise manipulating slides.

Publish package on PyPI

This will allow people to pip install pathml

Review tutorial here: https://packaging.python.org/tutorials/packaging-projects/
Make sure that README is correctly formatted: https://packaging.python.org/guides/making-a-pypi-friendly-readme/
Set up a github actions workflow to automatically prepare package and upload to PyPI: https://packaging.python.org/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/

We should look into using release tags: https://docs.github.com/en/free-pro-team@latest/github/administering-a-repository/about-releases

Run pipelines on datasets

Implement a method to run a preprocessing pipeline on a dataset.
Should basically be a convenience function for running pipeline on each individual image.

Pseudocode:

mydataset = pathml.datasets.PESO.download()
mypipeline = pathml.preprocessing.default_pipeline()

mypipeline.run(mydataset)
### should be equivalent to:
for wsi in mydataset:
    mypipeline.run(wsi)
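One possible shape for this, as a sketch only. Dataset and Pipeline below are hypothetical toy classes, and run_single stands in for whatever the real per-slide processing is.

```python
class Dataset:
    """Hypothetical dataset: an iterable collection of slides."""

    def __init__(self, slides):
        self.slides = list(slides)

    def __iter__(self):
        return iter(self.slides)


class Pipeline:
    """Hypothetical pipeline whose per-slide work is a simple placeholder."""

    def run_single(self, slide):
        # Placeholder for the real preprocessing of one slide.
        return f"processed:{slide}"

    def run(self, target):
        # Dispatch: a Dataset is processed slide by slide,
        # anything else is treated as a single slide.
        if isinstance(target, Dataset):
            return [self.run_single(slide) for slide in target]
        return self.run_single(target)
```

Keeping the dispatch inside run() means callers use the same method for a single slide and for a whole dataset.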

Use abstract classes

Classes which aren't meant to be instantiated should be abstract classes (i.e. inherit from abc.ABC).
This is probably cleaner than current implementation of raising NotImplementedError, since abstract classes can't be initialized by users (they will get an error).
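A short illustration using the standard library abc module. BaseTransform and Identity are hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod


class BaseTransform(ABC):
    """Hypothetical abstract base class; cannot be instantiated directly."""

    @abstractmethod
    def apply(self, image):
        """Apply the transform to an image and return the result."""


class Identity(BaseTransform):
    """Concrete subclass: implements every abstract method."""

    def apply(self, image):
        return image
```

Calling BaseTransform() raises TypeError at construction time, whereas the NotImplementedError approach only fails later, when the unimplemented method is first called.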

Fold and out-of-focus detection

Add support for tile-level fold detection and out-of-focus detection.

create class Mask

wrap dict of masks
each mask stores pixel-wise int8
repr method (keys, dimensions)
getitem method
len method

by default should be multiparametric single plane
subclass volumetric

Add test for multiprocessing

Pipeline.run() uses concurrent.futures.ProcessPoolExecutor to implement multiprocessing when run on a dataset.
I'm not sure how to write a test for this though. Everything I've tried so far has caused Pytest to hang.

Multiple Instance Learning

We need to support multiple instance learning. For example, if we only have slide-level labels, we can treat each slide as a bag of tiles and use the slide label as the bag label.
Need to do more research on what the best way to implement this in pytorch is.

Sprint TODO

TODO weekend sprint:
repo structure:
pathml

utils.py
-> core
-> preprocessing
-> ml
-> datasets

refactor:

repr for notebooks

We can define a SlideData._repr_html_() method (or maybe SlideData._repr_jpeg_()) which would let us produce pretty outputs in Jupyter Notebook.
For example we could make this method display a thumbnail of the image by default, along with some text describing it.
This would be nice for users since you could see the slide without having to call any methods.

This is lower priority but seems straightforward to implement

see: https://ipython.readthedocs.io/en/stable/config/integrating.html#rich-display
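A minimal text-only sketch of the idea. This SlideData is a hypothetical stripped-down class with just a name and a shape; a real version could also embed a thumbnail.

```python
import html


class SlideData:
    """Hypothetical minimal SlideData holding only a name and a shape."""

    def __init__(self, name, shape):
        self.name = name
        self.shape = shape

    def _repr_html_(self):
        # Jupyter calls this method, when present, to render the object
        # as HTML in a notebook cell. Escape values to keep the HTML valid.
        return (
            f"<b>SlideData</b>: {html.escape(self.name)}<br>"
            f"shape: {html.escape(str(self.shape))}"
        )
```

In a notebook, evaluating a SlideData instance as the last expression of a cell would then show the formatted summary instead of the default repr.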

Pannuke masks and images don't match

The masks in PanNuke dataloaders don't match the images.
This is obviously a big problem for training models...

import matplotlib.pyplot as plt
import numpy as np

from pathml.datasets.pannuke import PanNukeDataModule

pannuke = PanNukeDataModule(
    data_dir="../data/pannuke/", 
    download=False,
    nucleus_type_labels=True, 
    batch_size=8, 
    hovernet_preprocess=True,
    split=1
)

train = pannuke.train_dataloader

images, masks, hvs, types = next(iter(train))

fig, ax = plt.subplots(nrows=1, ncols=2)
im = np.moveaxis(images[0, ...].numpy(), 0, 2)
ax[0].imshow(im)
mask = masks.argmax(dim=1)[0, ...]
ax[1].imshow(mask)
plt.show()

I think this may be happening because the lists of filepaths for masks and images are created separately using pathlib.Path.glob(), but glob is unordered.
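If that is the cause, sorting both lists and checking that the names line up should fix it. The sketch below assumes that each image and its mask share the same file name in separate directories, and the "*.npy" pattern and file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def paired_paths(image_dir, mask_dir, pattern="*.npy"):
    """Return (image, mask) path pairs matched by file name."""
    # glob() makes no ordering guarantee, so sort both lists explicitly.
    images = sorted(Path(image_dir).glob(pattern))
    masks = sorted(Path(mask_dir).glob(pattern))
    if [p.name for p in images] != [p.name for p in masks]:
        raise ValueError("image and mask file names do not match")
    return list(zip(images, masks))


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp) / "images"
    mask_dir = Path(tmp) / "masks"
    for d in (image_dir, mask_dir):
        d.mkdir()
        for name in ("img_2.npy", "img_0.npy", "img_1.npy"):
            (d / name).touch()
    pairs = [(im.name, mk.name) for im, mk in paired_paths(image_dir, mask_dir)]
```

The explicit name comparison also turns a silent mismatch into an immediate error, which would be a useful test for the dataloader.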

Support for DICOM Integration

Digital Imaging and Communications in Medicine (DICOM) is the standard for the representation, storage, and communication of medical images and related information. A DICOM file format and communication protocol for pathology have been defined. Whole slide image data can be encoded together with relevant patient and specimen-related metadata as DICOM objects.

As DICOM becomes more widely adopted in digital pathology, PathML may need to support this file format. One approach is to create a class that inherits from BaseSlide and can ingest DICOM files. The class could also implement methods specific to DICOM, such as reading metadata.

Create DataSet class

DataSet object for whole-slide images.

This should be:

Lightweight, i.e. only holding paths to the images rather than the entire images themselves.
Also link paths to corresponding tiles, after preprocessing is applied and the tiles are saved to disk.
Also support masks and other types of labels.

When a dataset is downloaded from the datasets module, it should return a DataSet object. Users should also be able to create a DataSet object from files that they have locally.

see: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class

Extend multiparametric support to large images

Currently bioformats limits file size to 2GB because of java array size limitations.

Two options:

Instantiate multiple 2GB chunks and build numpy array piecewise
Support common filetypes (like .tif) with python dependencies, revert to java only when user provides a rare proprietary microscope file format.

HoVer-Net

Implement HoVer-Net (https://arxiv.org/pdf/1812.06499.pdf)

Create Style Guidelines

Guidelines for code

Guidelines for commits

pannuke bugs

Some bugs I ran into with PanNuke dataset implementation.
Putting here to track fixing and also adding new tests.

1. dimensions wrong

pannuke_dset = PanNukeDataset(
    data_dir = "../data/pannuke",
    fold_ix = None,
    hovernet_preprocess = True,
    nucleus_type_labels = True,
)

im, mask, hv, t = pannuke_dset[0]

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-14-c5513ce6bd51> in <module>
----> 1 im, mask, hv, t = pannuke_dset[0]

~/pathml/pathml/datasets/pannuke.py in __getitem__(self, ix)
    110                 # sum across mask channels to squash mask channel dim to size 1
    111                 # don't sum the last channel, which is background!
--> 112                 mask_1c = pannuke_multiclass_mask_to_nucleus_mask(mask)
    113             else:
    114                 mask_1c = mask

~/pathml/pathml/datasets/pannuke.py in pannuke_multiclass_mask_to_nucleus_mask(multiclass_mask)
    135     """
    136     # verify shape of input
--> 137     assert multiclass_mask.ndim == 3 and multiclass_mask.shape[0] == 6, \
    138         f"Expecting a batch of masks with dims (6, 256, 256). Got input of shape {multiclass_mask.shape}"
    139     assert multiclass_mask.shape[1] == 256 and multiclass_mask.shape[2] == 256, \

AssertionError: Expecting a batch of masks with dims (6, 256, 256). Got input of shape (256, 6, 256)

2. _clean_up_download_pannuke() problem

pannuke = PanNukeDataModule(
    data_dir="../data/pannuke/", 
    download=True,
    nucleus_type_labels=True, 
    batch_size=8, 
    hovernet_preprocess=True,
    split=1,
    transforms=None,
)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-426a1c4fd670> in <module>
----> 1 pannuke = PanNukeDataModule(
      2     data_dir="../data/pannuke/",
      3     download=True,
      4     nucleus_type_labels=True,
      5     batch_size=8,

~/pathml/pathml/datasets/pannuke.py in __init__(self, data_dir, download, shuffle, transforms, nucleus_type_labels, split, batch_size, hovernet_preprocess)
    198         self.download = download
    199         if download:
--> 200             self._download_pannuke(self.data_dir)
    201         else:
    202             # make sure that subdirectories exist

~/pathml/pathml/datasets/pannuke.py in _download_pannuke(self, download_dir)
    241 
    242         self._process_downloaded_pannuke(download_dir)
--> 243         self._clean_up_download_pannuke(download_dir)
    244 
    245     @staticmethod

~/pathml/pathml/datasets/pannuke.py in _clean_up_download_pannuke(pannuke_dir)
    306             downloaded_dir = p / f"Fold {fold_ix}"
    307             zip_file.unlink()
--> 308             downloaded_dir.rmdir()
    309 
    310 

/opt/conda/envs/pathml/lib/python3.8/pathlib.py in rmdir(self)
   1333         if self._closed:
   1334             self._raise_closed()
-> 1335         self._accessor.rmdir(self)
   1336 
   1337     def lstat(self):

OSError: [Errno 39] Directory not empty: '../data/pannuke/Fold 1'
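Path.rmdir() only removes empty directories, which matches the error above. A possible fix, assuming it is acceptable to delete whatever is left in the downloaded fold directory, is shutil.rmtree(). The helper name remove_download_dir and the file names in the demonstration are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def remove_download_dir(downloaded_dir):
    """Delete a directory together with any files left inside it."""
    # Path.rmdir() raises OSError on a non-empty directory;
    # shutil.rmtree() removes the directory and its contents.
    shutil.rmtree(downloaded_dir)


# Demonstration in a temporary location.
base = Path(tempfile.mkdtemp())
fold_dir = base / "Fold 1"
fold_dir.mkdir()
(fold_dir / "leftover.txt").write_text("not cleaned up")

try:
    fold_dir.rmdir()
    rmdir_failed = False
except OSError:
    rmdir_failed = True

remove_download_dir(fold_dir)
still_exists = fold_dir.exists()
shutil.rmtree(base)
```

It would still be worth finding out which files are left behind, since that may point to a step in _process_downloaded_pannuke() that does not move everything it should.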

Add repr for every class

These should be informative and clear

Sprint SlideData Class

Rewrite SlideData Class.

name: str
shape <- dict of shapes of slide, masks, tiles
slide: Slide (if slide loaded using Bioformats)
masks: pathml.Masks
tiles: pathml.Tiles
labels: (Masks, str, int, floats)
history: list
slidetype: str (e.g. "HE" or "IHC"). Set when SlideData class is initialized

init - use appropriate backend (openslide or bioformats)
repr
read_region(level)
make_tiles(Pipeline, optional)
chunks(shape, stride) --> generator of chunk objects
plot() --> matplotlib (also handle masks in plot)
save()

Type hinting for everything

This is a good best-practice

https://www.python.org/dev/peps/pep-0484/

Add CONTRIBUTING file

The CONTRIBUTING file gives instructions and guidelines for contributors. We can start with something basic and expand as needed down the road.

There is a contributing.rst file in the documentation, but we should have it in the root directory so it is easily accessible. Also, if it is in the root directory, GitHub will automatically link to this file when a contributor creates an issue or opens a pull request.

Helpful resources:

Standardize Openslide pixel resolution level

Different slides may have different microns per pixel (MPP) depending on the physical parameters of the scanner.
This means that for any two slides, level 0 may be at different pixel resolution.
We should provide a way to standardize pixel resolution of slides, so that we know that all images in a dataset are the same resolution.

Openslide objects have slide.properties["openslide.mpp-x"] and slide.properties["openslide.mpp-y"], which we can use.
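As a sketch of the arithmetic involved: the target size in pixels is the level-0 size multiplied by slide MPP divided by target MPP. Both helper names below are hypothetical, and the properties argument is any mapping with the two OpenSlide keys, so the example runs without a slide file.

```python
def mpp_from_properties(properties):
    """Read (mpp_x, mpp_y) from an OpenSlide-style properties mapping."""
    # OpenSlide stores property values as strings, so convert to float.
    return (
        float(properties["openslide.mpp-x"]),
        float(properties["openslide.mpp-y"]),
    )


def resized_dimensions(width, height, slide_mpp, target_mpp):
    """Pixel dimensions of the level-0 image after resampling to target_mpp."""
    mpp_x, mpp_y = slide_mpp
    if min(mpp_x, mpp_y, target_mpp) <= 0:
        raise ValueError("MPP values must be positive")
    # A larger MPP means each pixel covers more tissue, so fewer pixels.
    return (
        round(width * mpp_x / target_mpp),
        round(height * mpp_y / target_mpp),
    )
```

For example, a 1000 x 800 pixel region scanned at 0.25 MPP becomes 500 x 400 pixels at a target of 0.5 MPP.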

Add pip to environment.yml

Pip is currently not listed in environment.yml file.
Conda gives the following warning:

(base) jupyter@rosenthal-dxvm:~/pathml$ conda env create -f environment.yml

Warning: you have pip-installed dependencies in your environment file, but you do not list pip 
itself as one of your conda dependencies. Conda may not use the correct pip to install your 
packages, and they may end up in the wrong place.  Please add an explicit pip dependency. 
I'm adding one for you, but still nagging you.

Add save_tiles method to SlideData class

Writing tiles to disk should be a method for SlideData class. This makes more sense than writing tiles as part of the tile-level preprocessor in a Pipeline. Directory to write tiles to should be specified in argument

Pseudocode:

data = HESlide("/path/to/image.svs").load_data()
data = my_pipeline.run(data)
data.write_tiles("/path/to/tiled/images/")

wsi.py: pad and black edge issue

When the stride is small, the last few tiles at the edge of the slide are smaller than (tile size, tile size). Openslide zero-pads these automatically, so with pad=False the output contains undesired tiles with black edges. I recalculated the number of tiles to solve this issue.

Haven't got a chance to use this specific svs data yet. Randomly picked tile size and a small stride for now. Will double check this.

Pseudocode:

example_image_path = "../data/CMU-1.svs"

class MySlideLoader(BaseSlideLoader):
    def apply(self, path):
        return HESlide(path).chunks(level=0, size=1024, stride=400)

data = MySlideLoader().apply(example_image_path)
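For reference, a sketch of the tile-count calculation when only full-size tiles are wanted. This is the standard sliding-window formula and is not necessarily the exact change made in wsi.py; the function names are hypothetical.

```python
def n_full_tiles(dim, tile_size, stride):
    """Number of full-size tiles that fit along one axis of length dim."""
    if dim < tile_size:
        return 0
    # The last valid origin is dim - tile_size; origins advance by stride.
    return (dim - tile_size) // stride + 1


def tile_origins(dim, tile_size, stride):
    """Start coordinates of every full-size tile along one axis."""
    return [k * stride for k in range(n_full_tiles(dim, tile_size, stride))]
```

With dim=2048, size=1024, and stride=400, the origins are 0, 400, and 800. A fourth tile starting at 1200 would end at 2224, past the slide edge, and is the kind of tile that would come back zero-padded.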

Test ML models

Add some tests to make sure that the ML models are working correctly.
For example, this may involve overfitting on a toy dataset and verifying that performance is above some threshold.

Create DataLoader class

Create a dataloader class similar to torch.utils.data.DataLoader

Adding HistoQC

Adding https://github.com/choosehappy/HistoQC module to perform rigorous removal of unwanted artifacts in the data

Documentation doesn't compile

Need to fix bugs in documentation. Should also add tests to make sure that docs compile successfully.

(pathml) jupyter@shared-dxvm-gpu:~/pathml/docs$ make html
Running Sphinx v3.4.2
making output directory... done
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 17 source files that are out of date
updating environment: [new config] 17 added, 0 changed, 0 removed
/home/jupyter/pathml/pathml/preprocessing/multiparametricslide.py:16: UserWarning: MultiparametricSlide2d requires a jvm to interface with java bioformats library.
            See: https://pythonhosted.org/javabridge/installation.html. You can install using:
                
                sudo apt-get install openjdk-8-jdk
                pip install javabridge
                pip install python-bioformats
        
  warn(
WARNING: autodoc: failed to import module 'multiparametricslide' from module 'pathml.preprocessing'; the following exception was raised:
MultiparametricSlide2d requires javabridge and bioformats

Notebook error:
Problems with linked notebook "examples/link_advanced_HE_chunks" path:
InputError: [Errno 2] No such file or directory: '../examples/advanced_HE_chunks.ipynb'.
make: *** [Makefile:20: html] Error 2

Note that this issue is about the error in compilation. The javabridge warning should be fixed when we fix #48

Improve example notebooks

The example notebooks should be more comprehensive and better organized. For many people, the example notebooks will be their first experience with PathML, so we want them to be really good.

Some ideas:

Update the example pipeline notebooks to use chunk processing (much more efficient than the current basic example notebook which loads the whole slide into memory)
Create a notebook that shows examples of every transform
Create a notebook showing how to use DataModules with Pytorch

Should we use PyTorchLightning? [discussion]

Should we use PyTorch Lightning in PathML?

Pros:

More logical code organization structure
May be easier for less technical users
- Don't need to write training loops
- Automatically handles multiGPU
- Automatically handles mixed precision training
Popular framework actively being developed

Cons:

Overhead to refactor code to be compatible
One more external dependency
Committing to a specific framework may make PathML less flexible, decreasing utility

Other thoughts:

Would it be easy to support both? i.e. have models in base PyTorch, but also have lightning-compatible versions?
I haven't used pytorch-lightning myself, but @jmnyman has made the switch and it sounds like he has really benefited from it
If we decide to go the pytorch-lightning route, it probably won't be in v0.1 initial release. Pushing to first open-source release is top priority at the moment

Pipeline.run() argument

The argument of Pipeline.run() should be an object inheriting from BaseSlide, rather than a file path.
This means that whenever we run a pipeline, we can trust that it implements everything from BaseSlide.

If we just pass a path, it may be ambiguous how to read it (is it a H&E slide, or a multiplex slide, or...?). All the work in reading the file, etc. should happen when creating the slide object, not in the pipeline object.

Call transforms on tiles by getattr

This will let us do things like tile.Blur(kernel_size = 7) for arbitrary transforms

Here's a code snippet that I tried but couldn't get to work:

class Transform:
    def __init__(self, test):
        self.test = test

    def apply(self, target):
        print(f"applying on target of type {type(target)}. kwargs: {self.test}")


class Target:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        print(f"type of item: {type(item)}")
        print(str(item))
        t = item(**kwargs)
        t.apply(self)

target = Target(name = "testtarget")

target.Transform(test = "testitem")

See: https://rosettacode.org/wiki/Respond_to_an_unknown_method_call#Python
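The snippet above fails for two reasons: __getattr__ receives the attribute name as a string rather than the class, and the call arguments are passed to whatever __getattr__ returns, not to __getattr__ itself, so kwargs is undefined there. A working variant, as a sketch, looks the name up in a registry and returns a callable. The TRANSFORMS registry and the history attribute are assumptions added for the example.

```python
class Blur:
    """Hypothetical transform that records its application on the target."""

    def __init__(self, kernel_size):
        self.kernel_size = kernel_size

    def apply(self, target):
        target.history.append(f"Blur(kernel_size={self.kernel_size})")
        return target


# Hypothetical registry mapping attribute names to transform classes.
TRANSFORMS = {"Blur": Blur}


class Tile:
    def __init__(self, name):
        self.name = name
        self.history = []

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails; item is a str.
        try:
            transform_cls = TRANSFORMS[item]
        except KeyError:
            raise AttributeError(item) from None

        def call(**kwargs):
            # The keyword arguments arrive here, when the result is called.
            return transform_cls(**kwargs).apply(self)

        return call
```

Raising AttributeError for unknown names matters, because hasattr(), copy, and pickle all rely on it when probing for attributes.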

Specify output directory in Pipeline init

Currently, tiles are written to disk in the tile_level_preprocessor component of the Pipeline.
It would be better to pass a path to the output directory when running the Pipeline object, and then write all tiles to that directory. This would allow for better integration with the DataModule class, since the entire DataModule could be initialized pointing to one directory and can then:

download images there
pass the directory path as input to Pipeline.run() and write all the tiles there
create dataset and dataloader objects, since the full filepath is known.

Pseudocode:

# initialize pipeline
my_pipeline = Pipeline(
    slide_loader       = MySlideLoader(),
    slide_preprocessor = MySlidePreprocessor(),
    tile_extractor     = SimpleTileExtractor(tile_size=224),
    tile_preprocessor  = MyTilePreprocessor()
)

# initialize slide
slide = HESlide("/path/to/image.svs")

# run pipeline on slide
my_pipeline.run(slide, out_dir = "./data/preprocessed")

Making docs in Linux fails without additional dependencies

Provided instructions:
conda install sphinx # install sphinx package for generating docs
cd docs # enter docs directory
make html # build docs in html format

fail in Linux (tested Linux Mint 19.2). Additionally required:
pip install nbsphinx
pip install nbsphinx_link
pip install sphinx_rtd_theme
pandoc https://pandoc.org/installing.html

Set up automated testing

Use GitHub actions to automatically run tests when code is pushed

See:

Codecov badge

We can set up an automated workflow to measure code coverage and add it in a badge on the project readme.

https://github.com/codecov/codecov-action

This is not high priority at the moment but filing here to do later

Chunk generator

Slide objects should have a method that returns an iterator over "chunks" so that the image can be processed chunk-wise instead of loading the entire thing into memory.
An abstract method should be defined in BaseSlide, but each slide type (e.g. HESlide, MultiparametricSlide) may have to implement it differently depending on its backend (e.g. openslide or bioformats).

Pseudocode:

slide = HESlide("/path/to/image.svs")

for chunk in slide.generate_chunks(level=0, size=1024, ...):
  # operate on each 1024x1024 chunk
  preprocess(chunk)
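A toy version of this design, to make the structure concrete. InMemorySlide is a hypothetical backend that holds pixels as a list of rows; a real backend would read each region from disk instead.

```python
from abc import ABC, abstractmethod


class BaseSlide(ABC):
    """Hypothetical base class declaring the chunk interface."""

    @abstractmethod
    def generate_chunks(self, size, stride=None):
        """Yield square chunks of the image one at a time."""


class InMemorySlide(BaseSlide):
    """Hypothetical backend: pixels stored as a list of equal-length rows."""

    def __init__(self, pixels):
        self.pixels = pixels

    def generate_chunks(self, size, stride=None):
        stride = stride or size
        height, width = len(self.pixels), len(self.pixels[0])
        # Only full-size chunks are produced; partial edge chunks are skipped.
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield [row[x:x + size] for row in self.pixels[y:y + size]]
```

Because generate_chunks is a generator, each chunk is built only when the loop asks for it, which is what keeps memory use bounded.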

Hosting Model Weights

We want to be able to share pre-trained models. The trained model weights can be saved to disk, e.g. in .pth files for pytorch. However, these files can be quite big, too big to put in the GitHub repo itself.

We need to find a solution for hosting these large files of model parameters.
E.g. we could have a GCP bucket, or S3 bucket.
Need to evaluate the costs of different options.

Pipeline save method

We need a way to share pipelines by writing them to a file

Pseudocode:

my_pipeline = Pipeline(**kwargs)
my_pipeline.save("/path/to/disk/pipeline.pickle")

## someone else can then load and use:

pipeline = load("/path/to/local/downloads/pipeline.pickle")
pipeline.run(local_slide)
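One caveat with pickle for sharing: unpickling a file from someone else can execute arbitrary code. The sketch below therefore swaps in JSON and stores only transform names and parameters, which are rebuilt from a registry on load. The TRANSFORMS registry, the toy numeric transforms, and this Pipeline class are all hypothetical.

```python
import json
import os
import tempfile

# Hypothetical registry of available transforms, keyed by name.
TRANSFORMS = {
    "scale": lambda x, factor: x * factor,
    "offset": lambda x, amount: x + amount,
}


class Pipeline:
    def __init__(self, steps):
        # steps: sequence of (transform_name, kwargs) pairs
        self.steps = [(name, dict(kwargs)) for name, kwargs in steps]

    def run(self, value):
        for name, kwargs in self.steps:
            value = TRANSFORMS[name](value, **kwargs)
        return value

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.steps, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(json.load(f))


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.json")
    Pipeline([("scale", {"factor": 2}), ("offset", {"amount": 1})]).save(path)
    restored = Pipeline.load(path)
    result = restored.run(10)
```

The trade-off is that only transforms known to the registry can be shared, whereas pickle would allow arbitrary user-defined objects.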

Refactor

Make SlideData the core pathml object, combine pipeline and transforms into methods in SlideData

Preprocessing has become a catch-all directory, improve directory structure

Reorganize slide classes

Slide classes should be reorganized based on dimensions and slide type.
This hierarchical class structure is more logical and will also help with making sure that the transforms work properly (#18). For example, some transforms may work for all 2d images regardless of number of channels, but others may only be applicable for RGB images, and others may be specific to certain types (e.g. H&E stain deconvolution).

CAMELYON Datasets

Datamodule and Dataloader for https://camelyon17.grand-challenge.org/

Transition to Google-/numpy-style docstrings

Docstrings are currently written in basic Sphinx format. However, basic Sphinx doesn't support a References section so I had to start using the Napoleon extension. Since we are already using Napoleon, we may as well stick with Google or numpy docstring format moving forward, since it is more readable for humans.
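For illustration, a numpy-style docstring on a hypothetical helper function, including a References section with a placeholder entry rather than a real citation.

```python
def normalize_intensity(value, low, high):
    """Rescale a value from the range [low, high] to [0, 1].

    Parameters
    ----------
    value : float
        Value to rescale.
    low : float
        Lower bound of the input range.
    high : float
        Upper bound of the input range. Must be greater than ``low``.

    Returns
    -------
    float
        The rescaled value.

    Raises
    ------
    ValueError
        If ``high`` is not greater than ``low``.

    References
    ----------
    .. [1] Placeholder entry; the citation for the method would go here.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    return (value - low) / (high - low)
```

The section headers with dashed underlines are what Napoleon parses, and they remain readable when viewed as plain text with help().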
