aws-samples / sagemaker-distributed-training-digital-pathology-images

Distributed training of digital pathology tissue slide images using SageMaker and Horovod.

License: MIT No Attribution

Distributed training of digital pathology tissue slide images using SageMaker and Horovod

In this tutorial, using the detection of cancer from tissue slide images as an example, we explain how to build a highly scalable machine learning pipeline to:

  • Pre-process gigapixel images by tiling, zooming, and sorting them into train and test splits using Amazon SageMaker Processing.
  • Train an image classifier on the pre-processed tiled images using Amazon SageMaker, Horovod, and SageMaker Pipe mode.
  • Deploy a pre-trained model as an API using Amazon SageMaker.
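
The tiling and train/test sorting in the first step can be sketched in plain Python. This is a minimal illustration, not the repository's processing script: the function names, the 512-pixel tile size, and the hash-based split are all assumptions made for the example. It only computes tile coordinates and split labels; reading the pixels from an SVS file would need an image library and is left out.

```python
import hashlib
from typing import Iterator, Tuple


def tile_origins(width: int, height: int, tile: int = 512,
                 overlap: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield the (x, y) top-left corner of every full tile in a slide.

    Partial tiles at the right and bottom edges are skipped.
    """
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    step = tile - overlap
    for y in range(0, height - tile + 1, step):
        for x in range(0, width - tile + 1, step):
            yield (x, y)


def assign_split(slide_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a slide ID to 'train' or 'test'.

    Hashing the slide ID (hypothetical choice) keeps every tile of a
    slide in the same split across runs and machines.
    """
    digest = hashlib.md5(slide_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"
```

Assigning the split per slide rather than per tile is a design choice in this sketch: it avoids placing neighbouring tiles of the same slide in both the train and test sets.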

Data

In this blog, we use a dataset of whole-slide images obtained from The Cancer Genome Atlas (TCGA) and automatically classify each image as LUAD (adenocarcinoma), LUSC (squamous cell carcinoma), or normal lung tissue. LUAD and LUSC are the two most prevalent subtypes of lung cancer. The dataset is made available for public use by NIH and NCI. Instructions for downloading the data are provided below. The raw high-resolution images are in SVS format, which is used for archiving and analyzing Aperio microscope images. The techniques and tools used in this blog can be applied to any gigapixel image dataset, such as satellite images.

Instructions to download TCGA dataset

  • Download and install gdc-client: a command-line interface that supports data downloads from TCGA. Follow the instructions to download a binary distribution.

  • Initiate the data download: run the following command.

    gdc-client download -m gdc_manifest_20170302_003611.txt
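
After the download finishes, it can be useful to check which manifest entries actually arrived. The sketch below is an illustration under stated assumptions, not part of the repository: it assumes the manifest is a tab-separated file with `id`, `filename`, and `md5` columns, and that each file is saved as `<download_dir>/<id>/<filename>`. The function names are hypothetical.

```python
import csv
import hashlib
from pathlib import Path
from typing import Dict, List


def read_manifest(path: str) -> List[Dict[str, str]]:
    """Read a tab-separated manifest into a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_problems(manifest_path: str, download_dir: str) -> List[str]:
    """List manifest entries that are missing or fail the checksum."""
    problems = []
    for row in read_manifest(manifest_path):
        # Assumed layout: one sub-directory per file ID.
        target = Path(download_dir) / row["id"] / row["filename"]
        if not target.is_file():
            problems.append(f"{row['id']}: missing")
        elif md5_of(target) != row["md5"]:
            problems.append(f"{row['id']}: checksum mismatch")
    return problems
```

Calling `find_problems("gdc_manifest_20170302_003611.txt", ".")` from the download directory would return an empty list if every file is present and intact under these assumptions.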

Architecture Overview

The following figure shows the overall end-to-end architecture, from the original raw images to inference. First, we use SageMaker Processing to tile, zoom, and sort the images into train and test splits, and then package them into the necessary number of shards for distributed SageMaker training. Second, a SageMaker training job loads the Docker container from ECR (Elastic Container Registry) and uses Pipe Mode to read the data from the prepared shards of images, trains the model, and stores the final model artifact in S3. Finally, we deploy the trained model on a real-time inference endpoint that loads the appropriate Docker container (from ECR) and model (from S3) to process inference requests with low latency.
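
The packaging of tiles into shards can be sketched as follows. This is a minimal illustration that assumes one shard per training worker and a simple round-robin assignment; the function name is hypothetical and the repository may shard differently.

```python
from typing import List, Sequence


def shard_round_robin(items: Sequence[str], num_shards: int) -> List[List[str]]:
    """Distribute items across num_shards lists in round-robin order.

    Shard sizes differ by at most one item, so each worker receives
    roughly the same amount of data.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    for index, item in enumerate(items):
        shards[index % num_shards].append(item)
    return shards
```

For example, `shard_round_robin(tile_paths, 4)` would prepare four shards for a job with four workers, where `tile_paths` is a hypothetical list of tile file names. Keeping the shards balanced matters in this setting because the workers synchronize during training, so a worker with much less data would finish its shard early.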

References

  1. Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyö, Andre L. Moreira, Narges Razavian, Aristotelis Tsirigos. "Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning". Nature Medicine, 2018; DOI: 10.1038/s41591-018-0177-5
  2. https://github.com/ncoudray/DeepPATH/tree/master/DeepPATH_code
  3. https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
  4. https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html
  5. https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Contributors

amazon-auto, ryanbrand, vinayhanumaiah

Issues

TCGA images referenced in the dataset are no longer available

It seems that The Cancer Genome Atlas (TCGA) has redacted the files used in this example. Trying to download them with gdc-client returns the following error:

    ERROR: 5bf1a265-e53d-4f7d-8439-a00157e626f4: 451 Client Error: UNAVAILABLE FOR LEGAL REASONS for url: https://api.gdc.cancer.gov/data/5bf1a265-e53d-4f7d-8439-a00157e626f4: {"message":"Request contains a redacted file(s): ['5bf1a265-e53d-4f7d-8439-a00157e626f4'], action not allowed"}
    ERROR: Unable to download file https://api.gdc.cancer.gov/data/5bf1a265-e53d-4f7d-8439-a00157e626f4

and nothing gets downloaded.

Alternative SVS pathology slide datasets can be found on the Internet, but using them also requires changes to the label files.
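
To see how much of the manifest is affected, the redacted file IDs can be extracted from the gdc-client output and dropped from the manifest. The sketch below is a rough illustration: the regular expressions are written against the error text quoted in this issue, the manifest is assumed to be tab-separated with the file ID in the first column, and the function names are hypothetical. Whether the remaining files would download is not something this sketch can confirm.

```python
import re
from typing import List, Set

# Matches the bracketed list in "... a redacted file(s): ['<uuid>'], ..."
REDACTED_PATTERN = re.compile(r"redacted file\(s\): \[([^\]]*)\]")
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def redacted_ids(log_text: str) -> Set[str]:
    """Collect the file IDs reported as redacted in gdc-client output."""
    ids: Set[str] = set()
    for match in REDACTED_PATTERN.finditer(log_text):
        ids.update(UUID_PATTERN.findall(match.group(1)))
    return ids


def filter_manifest(manifest_lines: List[str], skip_ids: Set[str]) -> List[str]:
    """Keep the header line and every row whose ID is not in skip_ids."""
    if not manifest_lines:
        return []
    header, rows = manifest_lines[0], manifest_lines[1:]
    kept = [row for row in rows if row.split("\t", 1)[0] not in skip_ids]
    return [header] + kept
```

If the filtered manifest contains only the header line, every file in it has been redacted and an alternative dataset is needed.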
