dchaley / deepcell-imaging

Tools & guidance to scale DeepCell imaging on Google Cloud Batch

ai bioinformatics cancer-research cloud gpu tensorflow

deepcell-imaging's Introduction

Cloud DeepCell - Scaling Image Analysis

This working repo contains our notes, utilities, and info for our cloud DeepCell imaging project.

Here is the high-level workflow for using DeepCell:

High-level workflow diagram (Lucidchart source)

Note that DeepCell itself does not process TIFF files. The TIFF channels must first be extracted into NumPy arrays.
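For example, a minimal sketch of that extraction step, assuming tifffile is available; the file name and channel indices are hypothetical:

import numpy as np
import tifffile

# Read the multi-channel TIFF, e.g. shaped (channels, height, width).
raw = tifffile.imread("input.tiff")

# Mesmer expects channels last: (height, width, 2) for nuclear + membrane.
input_channels = np.stack([raw[0], raw[1]], axis=-1)
np.savez("input_channels.npz", input_channels=input_channels)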

Also note that DeepCell performs its own pre- and post-processing around the TensorFlow prediction. In particular, DeepCell divides the input into 512x512 tiles which it predicts in batches, then reconstructs the overall image.

tiling process
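A rough sketch of the tiling idea (not DeepCell's actual implementation, which lives in its pre/post-processing code): split the image into 512x512 tiles, zero-padding the edges.

import numpy as np

TILE = 512

def split_into_tiles(image):
    """Yield (row, col, tile) for each 512x512 tile; edges are zero-padded."""
    h, w = image.shape[:2]
    padded_shape = (-(-h // TILE) * TILE, -(-w // TILE) * TILE) + image.shape[2:]
    padded = np.zeros(padded_shape, dtype=image.dtype)
    padded[:h, :w] = image
    for r in range(0, padded.shape[0], TILE):
        for c in range(0, padded.shape[1], TILE):
            yield r, c, padded[r:r + TILE, c:c + TILE]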

Goal and Key Links

  • GOAL: Understand and optimize using DeepCell to perform cellular image segmentation on GCP at scale.

Findings

GPU makes a dramatic difference in model inference time.

Pixels vs inference time

Memory usage increases linearly with number of pixels.

Pixels vs mem usage

Optimization opportunities

Here are some areas we've identified:

Local development

Mac OS x86_64

Nothing special. You just need Python 3.10 at the latest.

python3.10 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Mac OS arm64

Some incantations are needed to work on Apple silicon computers. You also need Python 3.9.

DeepCell depends on tensorflow, not tensorflow-macos. Unfortunately we need tensorflow-macos specifically to provide TF2.8 on arm64 chips.

The solution is to install the packages one at a time so that the DeepCell failure doesn't impact the other packages.

python3.9 -m venv venv
source venv/bin/activate
pip install -r requirements-mac-arm64.txt
cat requirements.txt | xargs -n 1 pip install

# Let it fail to install DeepCell, then:
pip install -r requirements.txt --no-deps

# Lastly install our own library. Note --no-deps
pip install --editable . --no-deps

I think, but am not sure, that the first --no-deps invocation is unnecessary, since the per-package pip install step already installs the dependencies.

deepcell-imaging's People

Contributors

dchaley · lynnlangit · weihaoge1009 · langitlynn


deepcell-imaging's Issues

Run benchmark for `mesmer-sample-3` (1 MB)

Input channels path: gs://davids-genomics-data-public/cellular-segmentation/deep-cell/vanvalenlab-multiplex-20200810_tissue_dataset/mesmer-sample-3/input_channels.npz

Result spreadsheet.

  • n1-standard-1

  • n1-standard-1 + 1 P100

  • n1-standard-1 + 1 T4

  • n1-standard-1 + 1 V100

  • n1-standard-2

  • n1-standard-2 + 1 P100

  • n1-standard-2 + 1 T4

  • n1-standard-2 + 1 V100

  • n1-standard-4

  • n1-standard-4 + 1 P100

  • n1-standard-4 + 1 T4

  • n1-standard-4 + 1 V100

  • n1-standard-8

  • n1-standard-8 + 1 P100

  • n1-standard-8 + 1 T4

  • n1-standard-8 + 1 V100

  • n1-standard-16

  • n1-standard-16 + 1 P100

  • n1-standard-16 + 1 T4

  • n1-standard-16 + 1 V100

  • n1-standard-32

  • n1-standard-32 + 1 P100

  • n1-standard-32 + 1 T4

  • n1-standard-32 + 1 V100

  • n1-standard-64

  • n1-standard-64 + 1 P100

  • n1-standard-64 + 1 T4

  • n1-standard-64 + 1 V100

  • n1-standard-96

  • n1-standard-96 + 1 P100

  • n1-standard-96 + 1 T4

  • n1-standard-96 + 1 V100

  • n1-highmem-2

  • n1-highmem-2 + 1 P100

  • n1-highmem-2 + 1 T4

  • n1-highmem-2 + 1 V100

  • n1-highmem-4

  • n1-highmem-4 + 1 P100

  • n1-highmem-4 + 1 T4

  • n1-highmem-4 + 1 V100

  • n1-highmem-8

  • n1-highmem-8 + 1 P100

  • n1-highmem-8 + 1 T4

  • n1-highmem-8 + 1 V100

  • n1-highmem-16

  • n1-highmem-16 + 1 P100

  • n1-highmem-16 + 1 T4

  • n1-highmem-16 + 1 V100

  • n1-highmem-32

  • n1-highmem-32 + 1 P100

  • n1-highmem-32 + 1 T4

  • n1-highmem-32 + 1 V100

  • n1-highmem-64

  • n1-highmem-64 + 1 P100

  • n1-highmem-64 + 1 T4

  • n1-highmem-64 + 1 V100

  • n1-highmem-96

  • n1-highmem-96 + 1 P100

  • n1-highmem-96 + 1 T4

  • n1-highmem-96 + 1 V100

Build setup/instructions for cython fast-hybrid

The Cython fast-hybrid implementation is a bit "raw", requiring manually running cythonize in the right directory, etc.

The file should be repackaged into a proper module in deepcell-imaging, with an appropriate setup.py etc., so that pip knows to build the extension as part of installation (see the sketch below).

This could also be accomplished by publishing the fast-hybrid implementation as its own library, and including that as a dependency to this repo.
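A minimal sketch of that setup.py, following the standard Cython build pattern; the source path deepcell_imaging/fast_hybrid.pyx is a hypothetical location:

import numpy as np
from setuptools import setup
from Cython.Build import cythonize

setup(
    # Compile the Cython extension as part of pip install.
    ext_modules=cythonize("deepcell_imaging/fast_hybrid.pyx"),
    include_dirs=[np.get_include()],
)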

Parameterize batch_size for predict method

Adding GPUs is not improving performance. This is a bit surprising, considering how much a single GPU improves performance.

Image

Are we not leveraging multiple GPUs? Possibly. Even with 1 GPU, we aren't maxing out the GPU:

Image

The batch_size parameter defaults to 4; it controls the number of images we send to TensorFlow in parallel.

  • Add batch size to notebook parameters
  • Add batch size to benchmark output

Then we can run some benchmarks with increased batch size (with 1 and with several GPUs); make follow-up issues for these. A sketch of the parameterization follows.
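This sketch assumes the benchmark uses DeepCell's Mesmer application, whose predict() accepts a batch_size keyword; the input data here is a random placeholder:

import numpy as np
from deepcell.applications import Mesmer

app = Mesmer()

# 4D input: (num images, x, y, 2 channels); random placeholder data.
input_channels = np.random.rand(1, 512, 512, 2)

batch_size = 16  # notebook parameter, instead of the default of 4
labels = app.predict(input_channels, batch_size=batch_size)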

Build csv to chart

Now that we have the benchmarking from PR #42, which generates CSV output, gather a bunch of local data on various images and work out how to visualize it (see the sketch below).

  • test data
  • build visualization
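One possible visualization sketch, assuming the benchmark CSV has columns like "pixels" and "inference_time_s" (hypothetical names):

import pandas as pd
import matplotlib.pyplot as plt

# Load the benchmark CSV and scatter-plot input size against inference time.
df = pd.read_csv("benchmark_results.csv")
df.plot.scatter(x="pixels", y="inference_time_s")
plt.title("Pixels vs inference time")
plt.savefig("pixels_vs_inference_time.png")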

Run benchmark for human-prostate-cancer-20210727-725mb

Input channels path: gs://davids-genomics-data-public/cellular-segmentation/10x-genomics/human-prostate-cancer-20210727-725mb/input_channels.npz

Results spreadsheet.

  • n1-standard-1
    SKIP, expect out-of-memory

  • n1-standard-1 + 1 P100
    SKIP, expect out-of-memory

  • n1-standard-1 + 1 T4
    SKIP, expect out-of-memory

  • n1-standard-1 + 1 V100
    SKIP, expect out-of-memory

  • n1-standard-2
    SKIP, expect out-of-memory

  • n1-standard-2 + 1 P100
    SKIP, expect out-of-memory

  • n1-standard-2 + 1 T4
    SKIP, expect out-of-memory

  • n1-standard-2 + 1 V100
    SKIP, expect out-of-memory

  • n1-standard-4
    SKIP, expect out-of-memory

  • n1-standard-4 + 1 P100
    SKIP, expect out-of-memory

  • n1-standard-4 + 1 T4
    SKIP, expect out-of-memory

  • n1-standard-4 + 1 V100
    SKIP, expect out-of-memory

  • n1-standard-8
    SKIP, expect out-of-memory

  • n1-standard-8 + 1 P100
    SKIP, expect out-of-memory

  • n1-standard-8 + 1 T4
    FAIL, out of memory

  • n1-standard-8 + 1 V100
    SKIP, expect out-of-memory

  • n1-standard-16
    SKIP, expect out-of-memory

  • n1-standard-16 + 1 P100
    SKIP, expect out-of-memory

  • n1-standard-16 + 1 T4
    SKIP, expect out-of-memory

  • n1-standard-16 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-standard-32
    SKIP, more CPU not helpful

  • n1-standard-32 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-standard-32 + 1 T4

  • n1-standard-32 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-standard-64
    SKIP, more CPU not helpful

  • n1-standard-64 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-standard-64 + 1 T4
    NOT AVAILABLE, 48 vCPUs at maximum per 1 T4

  • n1-standard-64 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-standard-96
    SKIP, more CPU not helpful

  • n1-standard-96 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-standard-96 + 1 T4
    NOT AVAILABLE, 48 vCPUs at maximum per 1 T4

  • n1-standard-96 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-highmem-2
    SKIP, expect out-of-memory

  • n1-highmem-2 + 1 P100
    SKIP, expect out-of-memory

  • n1-highmem-2 + 1 T4
    SKIP, expect out-of-memory

  • n1-highmem-2 + 1 V100
    SKIP, expect out-of-memory

  • n1-highmem-4
    SKIP, expect out-of-memory

  • n1-highmem-4 + 1 P100
    SKIP, expect out-of-memory

  • n1-highmem-4 + 1 T4
    SKIP, expect out-of-memory

  • n1-highmem-4 + 1 V100
    SKIP, expect out-of-memory

  • n1-highmem-8
    FAIL, out of memory

  • n1-highmem-8 + 1 P100
    SKIP, expect out-of-memory

  • n1-highmem-8 + 1 T4
    SKIP, expect out-of-memory

  • n1-highmem-8 + 1 V100
    SKIP, expect out-of-memory

  • n1-highmem-16

  • n1-highmem-16 + 1 P100

  • n1-highmem-16 + 1 T4

  • n1-highmem-16 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-highmem-32
    SKIP, more CPU not helpful

  • n1-highmem-32 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-highmem-32 + 1 T4
    SKIP, more CPU not helpful

  • n1-highmem-32 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-highmem-64
    SKIP, more CPU not helpful

  • n1-highmem-64 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-highmem-64 + 1 T4
    NOT AVAILABLE, 48 vCPUs at maximum per 1 T4

  • n1-highmem-64 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-highmem-96
    SKIP, more CPU not helpful

  • n1-highmem-96 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-highmem-96 + 1 T4
    NOT AVAILABLE, 48 vCPUs at maximum per 1 T4

  • n1-highmem-96 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

Run benchmark for `preview-human-breast-20221103-418mb`

Input channels path: gs://davids-genomics-data-public/cellular-segmentation/10x-genomics/preview-human-breast-20221103-418mb/input_channels.npz

Result spreadsheet.

  • n1-standard-1
    SKIP, expect OOM

  • n1-standard-1 + 1 P100
    SKIP, expect OOM

  • n1-standard-1 + 1 T4
    SKIP, expect OOM

  • n1-standard-1 + 1 V100
    SKIP, expect OOM

  • n1-standard-2
    SKIP, expect OOM

  • n1-standard-2 + 1 P100
    SKIP, expect OOM

  • n1-standard-2 + 1 T4
    SKIP, expect OOM

  • n1-standard-2 + 1 V100
    SKIP, expect OOM

  • n1-standard-4
    FAIL, out of memory

  • n1-standard-4 + 1 P100
    SKIP, expect OOM

  • n1-standard-4 + 1 T4
    FAIL, out of memory

  • n1-standard-4 + 1 V100
    SKIP, expect OOM

  • n1-standard-8

  • n1-standard-8 + 1 P100

  • n1-standard-8 + 1 T4

  • n1-standard-8 + 1 V100

  • n1-standard-16
    SKIP, more CPU not helpful

  • n1-standard-16 + 1 P100
    SKIP, more CPU not helpful

  • n1-standard-16 + 1 T4

  • n1-standard-16 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-standard-32
    SKIP, more CPU not helpful

  • n1-standard-32 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-standard-32 + 1 T4
    SKIP, more CPU not helpful

  • n1-standard-32 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-standard-64
    SKIP, more CPU not helpful

  • n1-standard-64 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-standard-64 + 1 T4
    NOT AVAILABLE, 48 vCPUs at maximum per 1 T4

  • n1-standard-64 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-standard-96
    SKIP, more CPU not helpful

  • n1-standard-96 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-standard-96 + 1 T4
    NOT AVAILABLE, 48 vCPUs at maximum per 1 T4

  • n1-standard-96 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-highmem-2
    SKIP, expect OOM

  • n1-highmem-2 + 1 P100
    SKIP, expect OOM

  • n1-highmem-2 + 1 T4
    SKIP, expect OOM

  • n1-highmem-2 + 1 V100
    SKIP, expect OOM

  • n1-highmem-4

  • n1-highmem-4 + 1 P100

  • n1-highmem-4 + 1 T4

  • n1-highmem-4 + 1 V100

  • n1-highmem-8
    SKIP, more CPU not helpful

  • n1-highmem-8 + 1 P100
    SKIP, more CPU not helpful

  • n1-highmem-8 + 1 T4
    SKIP, more CPU not helpful

  • n1-highmem-8 + 1 V100
    SKIP, more CPU not helpful

  • n1-highmem-16
    SKIP, more CPU not helpful

  • n1-highmem-16 + 1 P100
    SKIP, more CPU not helpful

  • n1-highmem-16 + 1 T4
    SKIP, more CPU not helpful

  • n1-highmem-16 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-highmem-32
    SKIP, more CPU not helpful

  • n1-highmem-32 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-highmem-32 + 1 T4
    SKIP, more CPU not helpful

  • n1-highmem-32 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-highmem-64
    SKIP, more CPU not helpful

  • n1-highmem-64 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-highmem-64 + 1 T4
    NOT AVAILABLE, 48 vCPUs at maximum per 1 T4

  • n1-highmem-64 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

  • n1-highmem-96
    SKIP, more CPU not helpful

  • n1-highmem-96 + 1 P100
    NOT AVAILABLE, 24 vCPUs at maximum per 1 P100

  • n1-highmem-96 + 1 T4
    NOT AVAILABLE, 48 vCPUs at maximum per 1 T4

  • n1-highmem-96 + 1 V100
    NOT AVAILABLE, 12 vCPUs at maximum per 1 V100

Fix peak RAM metric for Vertex AI

The current method for getting peak RAM usage locally does not work on Vertex AI notebooks:

peak_mem_b = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

Figure out a fix, or just abandon the metric.
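One possible fix, assuming the Vertex AI notebook runs on Linux: read the peak resident set size (VmHWM) from /proc/self/status instead of getrusage. (Note also that ru_maxrss is reported in kilobytes on Linux but bytes on macOS.)

def peak_rss_bytes():
    """Return peak resident set size in bytes, or None if unavailable."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmHWM:"):
                return int(line.split()[1]) * 1024  # VmHWM is reported in kB
    return None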

Add Mesmer samples to repo

To support #10, let's at least add DeepCell's Mesmer data (multiplex_tissue) to the repo. This gives us an easily accessible starting point for test data (albeit quite small at 512 x 512).

See the DeepCell mesmer sample notebook

  • numpy inputs
  • rgb inputs
  • raw predictions
  • output predictions & rgb image

Get cost tables

Fill out the cost tables in this sheet.

Image

Method:

  • Start creating a notebook.
  • Select hardware configuration.
  • You should see a cost table like this:

Image

  • Copy these fields into the sheet:
    • TOTAL (put this into discounted $/mo)
    • Sustained use discount (put this into discount $/mo)
  • Then copy the other computed columns from previous rows

Define test configurations per sample file

For each sample file, specify the configurations to run for the benchmark.

  • mesmer sample 3
  • pick an ark-angelo sample
  • preview-human-breast-20221103-418mb
  • human-prostate-cancer-20210727-725mb blocked on #38

Expected output: a GitHub issue with a checklist of configurations (machine type, GPU type, GPU count).

See this list for machine types:
https://cloud.google.com/compute/docs/general-purpose-machines#n1_machine_types

Note that we can't run "interesting" GPU configurations (more than one GPU, or fancier GPUs); we're "calling a friend" (see also #6).

Create runnable 'predict' notebook

As a user I can: go to the GitHub repo, download it locally, get the IPython notebook and sample data, upload the notebook to a test environment (Vertex AI), and provide config info (instance type/size, GPUs, ...) to verify. (Part of the test is figuring out the config parameters.)

  • Notebook that runs prediction on parameterized input file
  • Script to create notebook execution w/ specified machine parameters (instance type, GPU)

Result of running the notebook: a timing matrix of input size vs. config parameters.

The benchmark will be: the whole-cell compartment. (Potentially add nuclear/both later; opportunity for parallelization later.)

Develop test suite for optimizations

I want to iterate rapidly while prepping my optimizations for merge. That means automated testing, not manually running notebooks.

Rather than write my own tests, let's reuse the tests already in scikit-image 😎

The initial version will need unsupported tests disabled; these will be created as issues in the milestone.

Document BigQuery benchmarks table

The end-to-end benchmark uploads its results to the BigQuery table: benchmarking.results

The table has a schema but otherwise no documentation. That should be changed 😎

Address result type mismatch

The test test_two_image_peaks asserts that out.dtype == _supported_float_type(mask.dtype). Meanwhile, the current reconstruct implementation does indeed create the result image as a float, even if the inputs were ints to begin with.

I'm not sure we need to (always?) do this. The core of the algorithm is to adjust to the neighborhood max. The max can't have more precision than any of the starting numbers, and the max can't be capped to more precision than the mask precision. So if ints are masking ints, why not have int results?

However, floats masked by ints are floats, and arguably ints masked by floats should be floats: for example, 10 masked by 0.51 should be 5.1, not 5. (Right?)

The current behavior is to always return floats. This is undesirable for performance, as it precludes updating in place. I wonder if we can simply ship this as new behavior; downstream usages could be affected if they assume floats and start getting ints. Can we control it via "Yet Another Parameter"™️?
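A sketch of the int-preserving rule proposed above, using NumPy's promotion rules (an assumption about desired behavior, not the current implementation):

import numpy as np

def result_dtype(image, mask):
    # int masked by int stays int; any float input promotes to float,
    # matching the "10 masked by 0.51" example above.
    return np.result_type(image.dtype, mask.dtype)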

Unexpected segments in prostate cancer results

Running these benchmarks: #73

We noticed that the output file looks a bit strange. Here's an example:

Screenshot 2023-12-13 at 1 52 43 PM

It appears to be circling artifacts outside the tissue, as well as not circling cells as we expect within the tissue.

DEBUG APPROACH:

Inconsistent array shape: preview-human-breast-20221103-418mb

The samples located here:

gs://davids-genomics-data-public/cellular-segmentation/10x-genomics/preview-human-breast-20221103-418mb

were generated with a previous convention, following the DeepCell API (one file == 4D array starting with num samples).

The other samples are 3D: x, y, channel. (One array == one input)

This creates problems because worksheets & people don't know which shape to expect, and therefore whether or not they need to add a new axis.

We should normalize one way or the other. My general thinking is that a thing is a single thing until it is a group of things, which we could represent as either a list of things or a NumPy vector of things. In other words, the shape of a single data example is not a list.
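A minimal sketch of normalizing to the 3D convention (x, y, channel), stripping a leading batch axis when present; the helper name is hypothetical:

import numpy as np

def normalize_channels(arr):
    # Squeeze a singleton batch axis: (1, x, y, c) -> (x, y, c).
    if arr.ndim == 4 and arr.shape[0] == 1:
        return np.squeeze(arr, axis=0)
    if arr.ndim != 3:
        raise ValueError(f"expected 3D or 4D array, got shape {arr.shape}")
    return arr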

Figure out kernel version mismatch

When running the e2e benchmark notebook on Vertex AI, there was a kernel warning:

2023-12-03 07:19:17.937327: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/jupyter/.local/lib/python3.10/site-packages/cv2/../../lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-12-03 07:19:17.937385: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-12-03 07:19:17.937413: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (72a01191f8f9): /proc/driver/nvidia/version does not exist

I think this is because we're using a TF 2.10 kernel but have installed TF 2.8 (DeepCell's dependency).

If the kernel is relevant: how can we fix this?

If the kernel is irrelevant: can we use something different? Like a basic python kernel?

I'm not sure how much to worry about this; perhaps it means we (and/or DeepCell??) aren't using modern Vertex AI kernels optimally…

Experiment with "warming up" benchmark script

Problem we observed: the first run is consistently ~30s slower than subsequent runs, even though we're restarting the kernel in between.

Prediction time (s)   First run?   Machine config
104.48                y            n1-highmem-4 + 1x Tesla T4
103.17                y            n1-highmem-4 + 1x Tesla T4
74.09                 n            n1-highmem-4 + 1x Tesla T4
73.41                 n            n1-highmem-4 + 1x Tesla T4
74.08                 n            n1-highmem-4 + 1x Tesla T4
78.2                  y            n1-highmem-4 + 1x Tesla P100-PCIE-16GB
78.75                 y            n1-highmem-4 + 1x Tesla P100-PCIE-16GB
43.55                 n            n1-highmem-4 + 1x Tesla P100-PCIE-16GB
44.4                  n            n1-highmem-4 + 1x Tesla P100-PCIE-16GB
44.17                 n            n1-highmem-4 + 1x Tesla P100-PCIE-16GB
43.47                 n            n1-highmem-4 + 1x Tesla P100-PCIE-16GB

MORE OBSERVATIONS:

  • GPU memory goes back to 0% after kernel restart. Probably no caching in GPU memory.
  • All documentation + stackoverflow posts suggest that GPU is cleared after process shutdown (kernel restart).
  • It seems quite consistently faster afterward.

IDEA:

  • Add a dummy 512x512 prediction in the benchmark notebook, but outside the timed portion (sketched after this list).
  • Does this warm up whatever needs to be warmed up, for the main prediction loop?
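A sketch of this idea, assuming `app` is the Mesmer application object used in the benchmark; the dummy data is a random placeholder:

import numpy as np
from deepcell.applications import Mesmer

app = Mesmer()

# Warm-up: one dummy 512x512 prediction, outside the timed portion.
dummy = np.random.rand(1, 512, 512, 2)
app.predict(dummy)

# ... the timed benchmark prediction runs after this point ...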

Create setup notebook

Create a separate setup notebook to install dependencies. This avoids restarting notebooks.

  • create setup.ipynb in the root directory that does one thing: the pip install from the top of the benchmark file
  • in the benchmark notebook, change the pip install to an attempt to import deepcell; if that fails, refer the user to the setup notebook (see the sketch below)
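A minimal sketch of that import-check cell (the setup notebook name follows the suggestion above):

try:
    import deepcell
except ImportError as e:
    # Don't pip install here; point the user at the setup notebook instead.
    raise RuntimeError("deepcell is not installed; run setup.ipynb first") from e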

Determine if CPU prediction is affected by warm-up

Relating to #94, we need to determine whether CPU prediction is affected by first vs. subsequent runs.

Task: run the ~230 MB sample through the benchmark on n1-standard-8 (no GPU, batch size 16), then restart the kernel (NOT create a new instance) and run it again.

Determine what happened to previous mesmer samples

The Mesmer samples we extracted used a previous commit's dataset, which is no longer available in the deepcell-tf main branch.

From the commit history it looks like the dataset may have been replaced with the tissue_net dataset. The expected hash values don't match, but this could just be due to the naming inside the .npz file.

Objective of this work: determine the difference between the old commit data (which is still available on s3 as of 2023-11-17 at least) and the newly available tissue net data.

Add cached model download to notebook

The model file is relatively large (100 MB). Cache the download to disk to avoid refetching. Also, the notebook doesn't support the model download in the first place 😬

Use this gs uri:
gs://davids-genomics-data-public/cellular-segmentation/deep-cell/vanvalenlab-tf-model-multiplex-downloaded-20230706/MultiplexSegmentation.tgz
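A hedged sketch of the caching logic, using the gs:// URI above; the local cache path and the use of gsutil are assumptions:

import os
import subprocess

MODEL_URI = (
    "gs://davids-genomics-data-public/cellular-segmentation/deep-cell/"
    "vanvalenlab-tf-model-multiplex-downloaded-20230706/MultiplexSegmentation.tgz"
)
MODEL_PATH = os.path.expanduser("~/.cache/deepcell-imaging/MultiplexSegmentation.tgz")

# Only download if we don't already have the file on disk.
if not os.path.exists(MODEL_PATH):
    os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
    subprocess.run(["gsutil", "cp", MODEL_URI, MODEL_PATH], check=True)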

Investigate grayreconstruct edge case test_zero_image_one_mask

This test is failing:

    def test_zero_image_one_mask():
        """Test reconstruction with an image of all zeros and a mask that's not"""
        result = reconstruction(np.zeros((10, 10)), np.ones((10, 10)))
>       assert_array_almost_equal(result, 0)
E       AssertionError:
E       Arrays are not almost equal to 6 decimals
E
E       Mismatched elements: 100 / 100 (100%)
E       Max absolute difference: 1.
E       Max relative difference: inf
E        x: array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
E              [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
E              [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],...
E        y: array(0)

test_reconstruction.py:113: AssertionError

I'm surprised to see max abs = 1 and max rel = inf; also, 100% of the elements differ, so something weird is happening.

Determine way to speed up benchmarks of large files

Larger-data testing is tedious because post-processing is Hella Slow™: 8 min or more for 1.3 GB inputs.

Note that infrastructure doesn’t seem to make a big difference for post-processing time. And GPU is not used at all during this phase (per observation of monitoring charts + knowledge of implementation).

This represents post-processing time broken down by machine type, GPU (or not), and input size. Note that post-processing doesn't vary too much.
Image

It would be really nice if we didn't have to wait for this.

(1) Skip post-processing in benchmarks.

  • Note in benchmark data if post-processing was run.
  • The output is a bit meaningless in terms of correctness.

(2) Speed up post-processing. #28

Option 1 could be something like: skip the post-processing by passing a no-op function as the postprocessing_fn in the constructor of the Application object (we may need to create a subclass of the Mesmer class to override the constructor), as sketched below.
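A sketch of that approach; it assumes the base Application stores the post-processing step as self.postprocessing_fn, which is an assumption about deepcell's internals:

from deepcell.applications import Mesmer

class MesmerNoPostprocess(Mesmer):
    """Mesmer variant that skips post-processing (for benchmarking only)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Replace the post-processing step with a no-op. As noted above,
        # the resulting output is not meaningful for correctness.
        self.postprocessing_fn = lambda model_output, **kw: model_output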

Add Simple Diagram

Create and add a simple architecture diagram explaining our perf-testing work plan for DeepCell on GCP, at the top of the README file.

Attempt benchmark with small persistent disk

The persistent disk is a relatively small expenditure ($0.14 daily for a forgotten 100 GB persistent disk). We probably don't need to worry too much about this.

Still, it would be nice to know if we're vastly over-provisioned. Let's try running a benchmark with ~10GB persistent disk, or 50GB. Use one of the larger files.

Also consider simply not caring for now, assuming DevOps processes & cost monitoring would find the issue. (Really though?) It's still just a few cents.

Document exact benchmark process

Update the top-level readme, and/or e2e-deepcell benchmark readme, with precise benchmark steps.

Something like:

  • open notebook in specific kernel (but see also #59)
  • select input file, update in notebook
  • select hardware type (clicking in notebook)
  • restart kernel & run all cells
  • copy the csv from the bottom into a sheet (this one?)
