codait / deep-histopath

A deep learning approach to predicting breast tumor proliferation scores for the TUPAC16 challenge

License: Apache License 2.0

Languages: Jupyter Notebook 62.42%, Python 37.36%, Shell 0.22%
Topics: deep-learning, machine-learning, medicine, medical-imaging, cancer-research

deep-histopath's Introduction

Predicting Breast Cancer Proliferation Scores with TensorFlow, Keras, and Apache Spark

Note: This project is still a work in progress. There is also an experimental branch with additional files and experiments.

Overview

The Tumor Proliferation Assessment Challenge 2016 (TUPAC16) is a "Grand Challenge" that was created for the 2016 Medical Image Computing and Computer Assisted Intervention (MICCAI 2016) conference. In this challenge, the goal is to develop state-of-the-art algorithms for automatic prediction of tumor proliferation scores from whole-slide histopathology images of breast tumors.

Background

Breast cancer is the leading cause of cancer death in women in less-developed countries, and the second leading cause of cancer death in developed countries, accounting for 29% of all cancers in women within the U.S. [1]. Survival rates increase with earlier detection, giving pathologists and the medical world at large an incentive to develop improved methods for even earlier detection [2]. There are many forms of breast cancer, including Ductal Carcinoma in Situ (DCIS), Invasive Ductal Carcinoma (IDC), Tubular Carcinoma of the Breast, Medullary Carcinoma of the Breast, Invasive Lobular Carcinoma, Inflammatory Breast Cancer, and several others [3]. Across all of these forms of breast cancer, the rate at which breast cancer cells grow (proliferation) is a strong indicator of a patient's prognosis. Although there are many means of determining the presence of breast cancer, tumor proliferation speed has been proven to help pathologists determine the best treatment for the patient. The most common technique for determining proliferation speed is the mitotic count (mitotic index) estimate, in which a pathologist counts the dividing cell nuclei in hematoxylin and eosin (H&E) stained slide preparations to determine the number of mitotic bodies. Given this, the pathologist produces a proliferation score of 1, 2, or 3, ranging from better to worse prognosis [4]. Unfortunately, this approach is known to have reproducibility problems due to variability in counting, as well as the difficulty of distinguishing between different grades.

References:
[1] http://emedicine.medscape.com/article/1947145-overview#a3
[2] http://emedicine.medscape.com/article/1947145-overview#a7
[3] http://emedicine.medscape.com/article/1954658-overview
[4] http://emedicine.medscape.com/article/1947145-workup#c12

Goal & Approach

In an effort to automate the process of classification, this project aims to develop a large-scale deep learning approach for predicting tumor scores directly from the pixels of whole-slide histopathology images (WSIs). Our proposed approach is based on a recent research paper from Stanford [1]. Starting with 500 extremely high-resolution tumor slide images [2] with accompanying score labels, we make use of Apache Spark in a preprocessing step to cut and filter the images into smaller square samples, generating 4.7 million samples for a total of ~7TB of data [3]. We then utilize TensorFlow and Keras to train a deep convolutional neural network on these samples, making use of transfer learning by fine-tuning a modified ResNet50 model [4]. Our model takes as input the pixel values of the individual samples and is trained to predict the correct tumor score classification for each one. We also explore an alternative approach of first training a mitosis detection model [5] on an auxiliary mitosis dataset and then applying it to the WSIs, based on an approach from Paeng et al. [6]. Ultimately, we aim to develop a model that improves upon existing approaches for breast cancer tumor proliferation score classification.

References:
[1] https://web.stanford.edu/group/rubinlab/pubs/2243353.pdf
[2] http://tupac.tue-image.nl/node/3
[3] preprocess.py, breastcancer/preprocessing.py
[4] MachineLearning-Keras-ResNet50.ipynb
[5] preprocess_mitoses.py, train_mitoses.py
[6] https://arxiv.org/abs/1612.07180
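
For illustration, the transfer-learning step described above can be sketched as follows. This is a minimal, assumed setup (the input size, optimizer, loss, and fully frozen base are illustrative choices), not the project's actual model, which lives in MachineLearning-Keras-ResNet50.ipynb [4]:

    from keras.applications.resnet50 import ResNet50
    from keras.layers import Dense, GlobalAveragePooling2D
    from keras.models import Model

    # Load ResNet50 pretrained on ImageNet, without its final classifier.
    base = ResNet50(weights="imagenet", include_top=False,
                    input_shape=(224, 224, 3))

    # Freeze the pretrained layers for an initial fine-tuning phase.
    for layer in base.layers:
        layer.trainable = False

    # New head: 3 classes for tumor proliferation scores 1, 2, and 3.
    x = GlobalAveragePooling2D()(base.output)
    preds = Dense(3, activation="softmax")(x)
    model = Model(inputs=base.input, outputs=preds)

    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(...) on the preprocessed square samples.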

Approach


Setup (all nodes unless otherwise specified):

  • System Packages:

    • openslide
  • Python packages:

    • Basics
      • pip3 install -U matplotlib numpy pandas scipy jupyter ipython scikit-learn scikit-image openslide-python
    • TensorFlow (only on driver):
      • pip3 install tensorflow-gpu (or pip3 install tensorflow for CPU-only)
    • Keras (bleeding-edge; only on driver):
      • pip3 install git+https://github.com/fchollet/keras.git
  • Spark 2.x (ideally bleeding-edge)

  • Add the following to the data folder (same location on all nodes):

    • training_image_data folder with the training slides.
    • testing_image_data folder with the testing slides.
    • training_ground_truth.csv file containing the tumor & molecular scores for each slide.
    • mitoses folder with the following from the mitosis detection auxiliary dataset:
      • mitoses_test_image_data folder with the folders of testing images
      • mitoses_train_image_data folder with the folders of training images
      • mitoses_train_ground_truth folder with the folders of training csv files
  • Layout (a sanity-check snippet for this layout follows the setup list below):

    - MachineLearning-Keras-ResNet50.ipynb
    - breastcancer/
      - preprocessing.py
      - visualization.py
    - ...
    - data/
      - mitoses
        - mitoses_test_image_data
          - 01
            - 01.tif
          - 02
            - 01.tif
          ...
        - mitoses_train_ground_truth
          - 01
            - 01.csv
            - 02.csv
            ...
          - 02
            - 01.csv
            - 02.csv
            ...
          ...
        - mitoses_train_image_data
          - 01
            - 01.tif
            - 02.tif
            ...
          - 02
            - 01.tif
            - 02.tif
            ...
          ...
      - training_ground_truth.csv
      - training_image_data
        - TUPAC-TR-001.svs
        - TUPAC-TR-002.svs
        - ...
      - testing_image_data
        - TUPAC-TE-001.svs
        - TUPAC-TE-002.svs
        - ...
    - preprocess.py
    - preprocess_mitoses.py
    - train_mitoses.py
    
  • Adjust the Spark settings in $SPARK_HOME/conf/spark-defaults.conf using the following examples, depending on the job being executed:

    • All jobs:

      # Use most of the driver memory.
      spark.driver.memory 70g
      # Remove the max result size constraint.
      spark.driver.maxResultSize 0
      # Increase the message size.
      spark.rpc.message.maxSize 128
      # Extend the network timeout threshold.
      spark.network.timeout 1000s
      # Setup some extra Java options for performance.
      spark.driver.extraJavaOptions -server -Xmn12G
      spark.executor.extraJavaOptions -server -Xmn12G
      # Setup local directories on separate disks for intermediate read/write performance, if running
      # on Spark Standalone clusters.
      spark.local.dirs /disk2/local,/disk3/local,/disk4/local,/disk5/local,/disk6/local,/disk7/local,/disk8/local,/disk9/local,/disk10/local,/disk11/local,/disk12/local
      
    • Preprocessing:

      # Save 1/2 executor memory for Python processes
      spark.executor.memory 50g
      
  • To execute the WSI preprocessing script, use spark-submit as follows (one could also use YARN in client mode with --master yarn --deploy-mode client):

    PYSPARK_PYTHON=python3 spark-submit --master spark://MASTER_URL:7077 preprocess.py
    
  • To execute the mitoses preprocessing script, use the following:

    python3 preprocess_mitoses.py --help
    
  • To execute the mitoses training script, use the following:

    python3 train_mitoses.py --help
    
  • To use the Jupyter notebooks, start Jupyter as usual with jupyter notebook and run the desired notebook.
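
As a quick sanity check of the data layout described above, a small snippet like the following (hypothetical, not part of the repo) can confirm that the expected paths exist on a node before launching any jobs:

    # Hypothetical sanity check for the expected data layout.
    import os

    expected = [
        "data/training_ground_truth.csv",
        "data/training_image_data",
        "data/testing_image_data",
        "data/mitoses/mitoses_train_image_data",
        "data/mitoses/mitoses_train_ground_truth",
        "data/mitoses/mitoses_test_image_data",
    ]
    for path in expected:
        print(("OK       " if os.path.exists(path) else "MISSING  ") + path)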

Create a histopath slide “lab” to view the slides (driver only):

  • git clone https://github.com/openslide/openslide-python.git
  • Host locally:
    • python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py -Q 100 path/to/data/
  • Host on server:
    • python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py -Q 100 -l HOSTING_URL_HERE path/to/data/
    • Open a local browser to HOSTING_URL_HERE:5000.

deep-histopath's People

Contributors

deroneriksson, dusenberrymw, feihugis, nakul02

deep-histopath's Issues

TypeError: map() got an unexpected keyword argument 'num_parallel_calls' && No training configuration found in save file

Traceback (most recent call last):
  File "train_mitoses.py", line 869, in <module>
    args.threads, args.prefetch_batches, args.log_interval, args.checkpoint, args.resume)
  File "train_mitoses.py", line 516, in train
    augmentation, False, threads, prefetch_batches)
  File "train_mitoses.py", line 230, in create_dataset
    num_parallel_calls=threads)
TypeError: map() got an unexpected keyword argument 'num_parallel_calls'

Hi @dusenberrymw, I tried to build and run the freshly updated train_mitoses.py, but it produces an error. The previous version (Fix regularization bug - 1bbb964) was working fine.
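
For context: num_parallel_calls was only added to Dataset.map in TensorFlow 1.4; TF 1.3 used num_threads and output_buffer_size instead, so this error is consistent with running an older TensorFlow. A minimal version-guarded sketch (an assumed workaround, not a confirmed fix for this repo):

    import tensorflow as tf

    # Assumed workaround: dispatch on the TF version, since Dataset.map
    # took `num_threads`/`output_buffer_size` before TF 1.4 introduced
    # `num_parallel_calls`.
    def parallel_map(dataset, fn, threads):
      version = tuple(int(v) for v in tf.__version__.split(".")[:2])
      if version >= (1, 4):
        return dataset.map(fn, num_parallel_calls=threads)
      return dataset.map(fn, num_threads=threads, output_buffer_size=threads)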

In addition, when I tried to run predict_mitoses.py with a model created by the old version of train_mitoses.py, I ran into this error:

/anaconda3/envs/tensorflow-cpu/lib/python3.6/site-packages/keras/models.py:251: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: ' "

My model directory includes:

0.020121_f1_0.94866_loss_1_epoch_model.hdf5
checkpoint
global_step_epoch.pickle
model.ckpt.index
train(md)
val(md)
args.txt
events.out.tfevents.1507540126.burak-pc
model.ckpt.data-00000-of-00001
model.ckpt.meta

As far as I know, this is about the checkpoints file being absent. I rechecked and saw that the model configuration should be saved there. I really want to execute the whole project once without any complications. By the way, I am sorry to be asking so many questions when you are already busy with this highly sensitive and amazing project.
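
On the "No training configuration found" warning: Keras emits it when the saved HDF5 file contains the architecture and weights but no optimizer or loss, so the loaded model must be compiled manually before training or evaluation (prediction alone works without compiling). A minimal sketch, with an assumed optimizer and loss rather than this project's actual settings:

    from keras.models import load_model

    # The HDF5 file holds weights + architecture but no training config.
    model = load_model("0.020121_f1_0.94866_loss_1_epoch_model.hdf5")
    # Compile manually before fit/evaluate; predict works uncompiled.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])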

Could not find the training data sets?

Hello,

In the README, it's mentioned that the training/validation data used include high-resolution tumor slide images with accompanying score labels, for which the reference given is [2]:

[2] http://emedicine.medscape.com/article/1947145-overview#a7

When I followed the link, I unfortunately could not find the datasets. Would it be possible to directly download the training/validation data (histology slides with the score labels) from some publicly available archive?

Regards
Jithu

Save tiled images to disk

Is it possible to save the transformed images to disk? I see you use the PIL library to save, but you also mention a Hadoop-supported directory, and I don't understand how your save_jpeg_help saves either to local disk or to HDFS.
I have a Hadoop instance running at hdfs://localhost:9000/data, but nothing is saved to the Hadoop-supported directory, and if I try a local path nothing happens either.

Is there any way to save the processed images to a local directory?
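
For a plain local directory, the saving step reduces to standard PIL calls; here is a minimal sketch assuming tiles arrive as H x W x 3 uint8 NumPy arrays (the function name and paths are hypothetical). Note that ordinary Python file I/O cannot write to hdfs:// URLs, so HDFS output needs a Hadoop-aware client or a Spark write instead:

    import os
    import numpy as np
    from PIL import Image

    def save_tile_locally(tile, out_dir, name, quality=90):
        """Save one RGB tile (H x W x 3 uint8 array) as a local JPEG."""
        os.makedirs(out_dir, exist_ok=True)
        Image.fromarray(tile.astype(np.uint8)).save(
            os.path.join(out_dir, name), "JPEG", quality=quality)

    # e.g. save_tile_locally(sample, "data/tiles", "tile_0001.jpg")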

pretrained model weights

Hello and thank you for sharing your work!
Could you please provide the last checkpoint for the pretrained model?
Thank you in advance, Lucia

feed_dict={K.learning_phase(): 1} while executing train_mitoses.py

Traceback (most recent call last):
  File "E:/deep-histopath/train_mitoses.py", line 751, in <module>
    args.log_interval, args.threads, args.checkpoint, args.resume)
  File "E:/deep-histopath/train_mitoses.py", line 600, in train
    feed_dict={K.learning_phase(): 1})
  File "C:\Users\Burak\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
    run_metadata_ptr)
  File "C:\Users\Burak\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\Burak\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
    options, run_metadata)
  File "C:\Users\Burak\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index -1 of dimension 0 out of bounds.
	 [[Node: strided_slice = StridedSlice[Index=DT_INT32, T=DT_STRING, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1](StringSplit:1, strided_slice/stack, strided_slice/stack_1, strided_slice/stack_2)]]
	 [[Node: data/IteratorGetNext = IteratorGetNext[output_shapes=[[?,64,64,3], [?], [?]], output_types=[DT_FLOAT, DT_FLOAT, DT_STRING], _device="/job:localhost/replica:0/task:0/cpu:0"](data/Iterator)]]
2017-10-03 11:37:55.026389: W C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: slice index -1 of dimension 0 out of bounds.
2017-10-03 11:37:55.036743: W C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Invalid argument: slice index -1 of dimension 0 out of bounds.

I extracted training and validation patches using python preprocess_mitoses.py. When I then tried to train the default VGG model on those patches, I got the error shown above.

I do not really know how to deal with this problem, since I have no clue what the error itself means. Would you mind if I ask for your help? Thank you.

OS: Windows
tensorflow == 1.3.0 (CPU)
Keras == 2.0.8
anaconda == 3.5.2
conda == 4.3.27
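
One plausible reading of the trace (an assumption, not a confirmed diagnosis): the failing StridedSlice follows a StringSplit, and "slice index -1 of dimension 0 out of bounds" means the split produced zero tokens, which can happen if an empty filename string reaches the dataset, for example when a glob pattern matches no files on Windows. A small hypothetical pre-flight check along these lines may help:

    import glob
    import os

    # Hypothetical pre-flight check: drop empty entries and normalize
    # Windows backslashes to forward slashes before the TF pipeline
    # splits paths on "/".
    def collect_patches(pattern):
        filenames = [f.replace(os.sep, "/") for f in glob.glob(pattern) if f]
        assert filenames, "no patches matched %r" % pattern
        return filenames

    # e.g. collect_patches("data/mitoses/patches/train/*/*.png")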
