Comments (14)
✨ interesting finding ✨ the Downhill Filter paper doesn't mention sorting.
Sorting appears to be expensive in the scikit implementation.
Note that rank_order performs another argsort: `sort_order = flat_image.argsort().astype(unsigned_dtype, copy=False)` (source code). For some reason it's not pictured in the profile here.
Two potential solutions
1: Limited impact, but maybe easier?
We're sorting twice; do we actually have to? Can we reuse the sort from rank_order?
2: Rewrite algorithm
In principle we don't need to sort: we just need to group the pixels by intensity. Would it be faster to do it more literally, as the algorithm says, and create a list per intensity value? What input are we using: byte (0-255) or floating-point (0.0..1.0) grayscale? Byte is probably much faster b/c we can leverage direct arrays.
Can we quantize down to 256 levels ... heck, even 1024 ... (or something)?
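For the byte case, the per-intensity grouping can be a single counting pass instead of a sort. A minimal sketch of the idea (the function name is mine, not from any library):

```python
import numpy as np

def height_lists(image):
    """Group flat pixel indices by intensity, replacing an argsort.

    Assumes a uint8 image: the list at index v holds the flat indices of
    all pixels with value v. Building the buckets is a single O(n) pass,
    versus O(n log n) for a comparison sort.
    """
    buckets = [[] for _ in range(256)]
    for idx, v in enumerate(image.ravel()):
        buckets[v].append(idx)
    return buckets
```

Iterating the buckets from 255 down to 0 then visits pixels in exactly the descending-intensity order the algorithm needs, without ever sorting.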
from deepcell-imaging.
I rewrote the grayscale reconstruction part of h_maxima using OpenCV, which I understand to have better performance here than scikit-image, based on this GitHub issue.
question from pratapvardhan:
wondering - if scipy.ndimage is already used in skimage, why not use faster versions of scipy.ndimage.{grey_dilation, .grey_operations}?
answer from jni:
to not break the API. Our versions are slower and mess with the data types! Grrr. I've been campaigning to do just this but haven't had time to implement, but when I do (should be soon actually!), we have to be careful not to break legacy code.
Rest of thread shows improvements in other algorithms but not the ones we're using (h_maxima/grayreconstruct).
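For context, the scipy primitives jni refers to can be called directly; a minimal sketch (my own example, not from the thread):

```python
import numpy as np
from scipy import ndimage

# Calling the scipy primitive directly, as jni suggests: grey_dilation
# with a flat 3x3 structuring element. Unlike the skimage wrappers this
# thread complains about, this should preserve the input dtype.
image = np.array([[0, 0, 0],
                  [0, 9, 0],
                  [0, 0, 0]], dtype=np.uint8)
dilated = ndimage.grey_dilation(image, size=(3, 3))
# The single bright pixel spreads to the whole 3x3 neighborhood.
```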
```python
import cv2
import numpy as np
from skimage.morphology import disk


def opencv_reconstruct(marker: np.ndarray, mask: np.ndarray, radius: int = 2):
    kernel = disk(radius)
    # .item() converts the numpy scalar to a python scalar
    pad_value = np.min(marker).item()

    image = marker
    expanded = np.ndarray.copy(marker)
    while True:
        expanded = cv2.dilate(
            src=image,
            dst=expanded,
            kernel=kernel,
            borderType=cv2.BORDER_CONSTANT,
            borderValue=pad_value,
        )
        expanded = np.fmin(expanded, mask)

        # Termination criterion: expansion didn't change the image at all
        if (image == expanded).all():
            return expanded
        np.copyto(dst=image, src=expanded)


def opencv_h_maxima(image, h=0.075, radius=2):
    resolution = 2 * np.finfo(image.dtype).resolution * np.abs(image)
    shifted_img = image - h - resolution
    reconstructed = opencv_reconstruct(shifted_img, image, radius)
    residue_img = image - reconstructed
    return (residue_img >= h).astype(np.uint8)
```
Test results vs scikit h_maxima:
(Figure: OpenCV speedup, 512x512 image.)
(Figure: OpenCV slowdown, 10k x 14k image.)
Interpretation
As stated in the papers, the "straightforward" iteration approach is very slow for larger inputs. But it's 10x as fast for 512 x 512.
(I'll put up a notebook shortly)
Open avenues:
- Tile large images into 512x512 (find the right size) & figure out boundary handling.
- Implement FastHybrid using OpenCV primitives. (Putting items into the queue isn't obvious, though.)
Good citizen work?
- scikit PR to reimplement grayreconstruct using fast scipy primitives
from deepcell-imaging.
Comments reference an academic paper as basis for implementation:
Robinson, Kevin and Whelan, Paul F. (2004) Efficient morphological reconstruction: a downhill filter. Pattern Recognition Letters, 25 (15). pp. 1759-1767. ISSN 0167-8655
Available from Dublin City University or the Vision Systems Group lab page. Full text PDF: whelan_2004_104.pdf
Paper references source code but the URL is broken. I think I found it on the internet archive: prl_DownhillFilter.zip
from deepcell-imaging.
Morphological reconstruction is a "well studied" area. I found another interesting paper:
2012: A fast parallel implementation of queue-based morphological reconstruction using gpus, by George Teodoro. ResearchGate, Full text PDF
It references a 2010 paper by Pavel Karas: Efficient Computation of Morphological Greyscale Reconstruction. Dagstuhl Research Online Publication Server, Full text pdf
Question: how does the scikit implementation compare vs these algorithms? Can we do a side-by-side of CPU algorithms, without introducing GPUs? (These papers support the ability to implement GPU algorithms if we need it.)
from deepcell-imaging.
The Downhill Filter pseudocode from whelan_2004_104.pdf, page 7, translated.
The mask image is the input image. The markers are the DeepCell model outputs.
Initialization
- Find the maximum pixel value m in the marker image (the model output).
  - In English: m is the number, among the marker pixel values, such that m >= p for every pixel value p.
- Place each non-zero marker pixel into the appropriate "height list" (one list per pixel value between 1 and the maximum).
  - In English: for every image point, if the point's marker value i is non-zero, append that point to the height list L at the height aka index i.

Algorithm loop
- For each height list in L, for height n from highest to lowest:
  - While the list of pixels at height n is not empty:
    - Operate on p, the first pixel at height n.
    - Remove p from the height list at n.
    - For each of p's neighbors q: …
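The translated pseudocode can be rendered end-to-end in plain Python. This is my own naive sketch for non-negative integer images, using 4-connectivity for brevity; it is not the paper's code:

```python
import numpy as np

def downhill_filter(marker, mask):
    """Naive sketch of the height-list pseudocode above.

    Assumes small non-negative integer intensities. Pixels are bucketed
    by value, then processed from the highest bucket down, pushing each
    value "downhill" into neighbors, clamped by the mask.
    """
    out = np.minimum(marker, mask).astype(np.int64)
    rows, cols = out.shape
    max_v = int(out.max())
    # Initialization: one list of pixel coordinates per intensity 1..max_v.
    lists = [[] for _ in range(max_v + 1)]
    for r in range(rows):
        for c in range(cols):
            if out[r, c] > 0:
                lists[out[r, c]].append((r, c))
    # Loop: from the highest list down, propagate values to neighbors.
    for h in range(max_v, 0, -1):
        todo = lists[h]
        while todo:
            r, c = todo.pop()
            if out[r, c] != h:
                continue  # stale entry: pixel was raised after queueing
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    new = min(h, int(mask[nr, nc]))
                    if new > out[nr, nc]:
                        out[nr, nc] = new
                        lists[new].append((nr, nc))
    return out
```

Note there is no sort anywhere: the bucket index plays the role the argsort plays in the scikit implementation.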
from deepcell-imaging.
Hypothesis: the Downhill Filter implementation could be optimized (eg above discussion re not needing to actually sort). There's also potential for parallelization:
- initialization: mapping the height-list construction to batches, then reducing the tiles into a combined height list. (But that doesn't actually "reduce" the result size, as we still need to categorize each pixel from the markers.)
- loop: can we possibly process each height list in batches? In principle we only add "down" the heights, never into higher levels; can we use this somehow?
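The initialization map/reduce idea could look something like this (hypothetical helper names, shown sequentially for clarity; the map step is what you'd farm out to workers per tile):

```python
import numpy as np

def tile_height_lists(tile, row_offset, n_levels):
    """Map step: bucket one tile's non-zero pixels by intensity.

    Coordinates are kept global via row_offset so tiles can be merged.
    """
    buckets = [[] for _ in range(n_levels)]
    for (r, c), v in np.ndenumerate(tile):
        if v > 0:
            buckets[v].append((r + row_offset, c))
    return buckets

def merge_height_lists(parts, n_levels):
    """Reduce step: concatenate the per-tile lists level by level."""
    merged = [[] for _ in range(n_levels)]
    for part in parts:
        for level in range(n_levels):
            merged[level].extend(part[level])
    return merged
```

As noted above, this doesn't shrink the result (every marker pixel still lands in some list), but the map step is embarrassingly parallel and the reduce is cheap list concatenation.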
Alternative algorithm: "Fast Hybrid gray scale reconstruction" as presented in the 2012 paper: A fast parallel implementation of queue-based morphological reconstruction using gpus, by George Teodoro. ResearchGate, Full text PDF
Note that this is a spelled-out version of "Approach D" in the 2004 downhill filter paper.
Here's the algorithm (attached as a screenshot, not captured here).
Given the proof of GPU parallelization, and the more straightforward algorithm structure (scan, scan, then queue), I think parallelizing Fast-Hybrid on CPU is a good option.
Here's the 2012 fast-hybrid-GPU speed comparison (attached as a screenshot, not captured here). Note that the FH CPU implementation is on a single-core CPU. Effectively, the proposal is to implement FH CPU but in parallel (multi-core or distributed).
Note also that the scan structure gives us linearity of data access. The height map, by contrast, is linear over the data during initialization but then "bounces all over" the image as it iterates over the height lists. In scan-scan-queue we only have that problem during the queue phase, so the question might also be: how many pixels get requeued? Note that the 2012 paper parallelizes the queue part on GPU.
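To make the scan-scan-queue structure concrete, here's a naive pure-Python sketch (my own rendering, 4-connectivity; not the paper's tuned implementation):

```python
import numpy as np
from collections import deque

def fast_hybrid_sketch(marker, mask):
    """Sketch of fast-hybrid reconstruction: raster scan, anti-raster
    scan, then a FIFO propagation phase. Integer images assumed."""
    out = np.minimum(marker, mask).astype(np.int64)
    rows, cols = out.shape
    # Raster scan: max over self and already-visited (N+) neighbors,
    # clamped to the mask.
    for r in range(rows):
        for c in range(cols):
            v = out[r, c]
            if r > 0:
                v = max(v, out[r - 1, c])
            if c > 0:
                v = max(v, out[r, c - 1])
            out[r, c] = min(v, mask[r, c])
    # Anti-raster scan: same with N- neighbors; queue any pixel that
    # could still push a value into a lower neighbor.
    fifo = deque()
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            v = out[r, c]
            if r < rows - 1:
                v = max(v, out[r + 1, c])
            if c < cols - 1:
                v = max(v, out[r, c + 1])
            out[r, c] = min(v, mask[r, c])
            for nr, nc in ((r + 1, c), (r, c + 1)):
                if (nr < rows and nc < cols
                        and out[nr, nc] < out[r, c]
                        and out[nr, nc] < mask[nr, nc]):
                    fifo.append((r, c))
                    break
    # Queue phase: FIFO propagation until stable. This is the part with
    # non-linear memory access, and the part requeue counts come from.
    while fifo:
        r, c = fifo.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new = min(int(out[r, c]), int(mask[nr, nc]))
                if new > out[nr, nc]:
                    out[nr, nc] = new
                    fifo.append((nr, nc))
    return out
```

The two scans touch memory strictly linearly; only the queue phase jumps around, which is exactly the requeue question raised above.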
Maybe we should just skip all this and go straight to parallelized GPU implementation lol. But GPUs haven't been readily available and personally I wouldn't know how to write GPU code to do this. I also think we haven't nearly hit the ceiling of CPU-only performance, we don't need real-time we just need not-5-minutes.
from deepcell-imaging.
Additional finding to support not pursuing the Downhill Filter. The original paper by Vincent (1993) (private link for full-text PDF), which reviews parallel and sequential approaches, says the 'classic' (aka math-derived) approach is highly parallelizable, but not on "conventional computers".
But "conventional computers" look very different in 2023 from 1993!
Scaling horizontally to many processors/computers is far easier than 20 years ago.
I have some naive ideas how to implement the above "hybrid" scan in parallel. But first I'm researching how hardware/GPU solutions work. It should be inspirational for a CPU/cloud version.
Googling led me to this paper:
Fast Morphological Image Processing on GPU using CUDA by Mugdha A. Rane at the College of Engineering, Pune
She in turn references the "van Herk / Gil & Werman" aka vHGW algorithm.
Currently reading ^^ this paper, plus looking at vHGW source code. Read this some more? https://github.com/cuekoo/Herk-Gil-Werman-Fast-Morphology
from deepcell-imaging.
I found a paper from Gil (of the original vHGW paper), together with Kimmel, that improves on the algorithm.
2003-Jan, Gil & Kimmel: Efficient Dilation, Erosion, Opening and Closing Algorithms
full text
This is a fairly detailed explanation / walkthrough of the algorithm improvements. However, it's 1-dimensional and would need to be generalized to 2D.
Reading: page 8
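For intuition, the core 1-D trick can be sketched in plain Python/NumPy (my own sketch, not the paper's code): block-wise forward and backward running maxima let every window max come from just two lookups, independently of window size.

```python
import numpy as np

def van_herk_max_1d(a, w):
    """Sliding max over all length-w windows (output length n - w + 1).

    Partition a into blocks of size w; s holds forward running maxima
    restarting at each block boundary, r holds backward running maxima.
    Each window spans at most two adjacent blocks, so its max is
    max(r[start], s[end]): two reads per window, any w.
    """
    n = len(a)
    s = np.empty(n, dtype=a.dtype)
    r = np.empty(n, dtype=a.dtype)
    for start in range(0, n, w):
        end = min(start + w, n)
        s[start] = a[start]
        for i in range(start + 1, end):
            s[i] = max(s[i - 1], a[i])
        r[end - 1] = a[end - 1]
        for i in range(end - 2, start - 1, -1):
            r[i] = max(r[i + 1], a[i])
    return np.array([max(r[i], s[i + w - 1]) for i in range(n - w + 1)])
```

A 2-D dilation with a rectangular window is then two passes of this, one over rows and one over columns; the circular (disk) footprint is what makes the generalization non-trivial.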
from deepcell-imaging.
Reconstruction algorithms refer to the "window" or "structuring element", basically the window within which we compute a min/max image processing.
DeepCell uses radius = 2, for Mesmer. source code
I think this means the window is 5x5 (2 each way from a center point).
As determined by using the disk function (source code) defined in this source code.
Note that although the window is 5x5, we only use the circular part of it (because it's a radius).
If I understand correctly, the more granular this window, the more "tasks" we have to run. Soooo a radius=2 means we're doing a lot of work. 😤
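The footprint can be sanity-checked with a few lines of NumPy. This should match what skimage's disk(2) produces (the points within Euclidean distance 2 of the center), though that's my reading of the source, not a guarantee:

```python
import numpy as np

# Rebuild the radius-2 structuring element from the distance condition.
radius = 2
y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
footprint = (x * x + y * y <= radius * radius).astype(np.uint8)
# A 5x5 window, but only the 13 cells inside the circle participate:
# corners and the (±2, ±1) cells fall outside the radius.
```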
from deepcell-imaging.
OK–
The Hybrid Algorithm for dilation is a max operation applied in raster order, then in anti-raster order, then the queue-based max filtering, as seen in A fast parallel implementation of queue-based morphological reconstruction using gpus.
The vHGW algorithm, plus the improvement described in Efficient Dilation, Erosion, Opening and Closing Algorithms (2003), performs the parallel max phases.
Our algorithm would need to parallelize these units of work. For a radius of 2, aka p=5, the "obvious" (aka embarrassing) parallelism means (img-width/5) * (img-height/5) windows (I think). Given how memory-expensive the whole thing is, that's a lot better. But can we get it down to 512px? 20k/5 = 4k, which is 8x as big as the 512x512 that seems to be "Very Fast".
from deepcell-imaging.
Benchmark implementation / notebook: #17
from deepcell-imaging.
My plan for an OpenCV implementation was to use dilation with a before/after footprint (i.e., exclude pixels as appropriate during the forward or backward scan). But I have no visibility into the iteration order; in fact I think the whole point is parallel execution, whereas FastHybrid needs a very specific iteration order.
But all hope is not lost 😅 The scikit grayreconstruct python implementation uses C under the hood to perform certain operations, after setting things up in numpy.
I believe it should be "easy" to create a similar .pyx iteration function, to perform fast-hybrid's raster scans, and defer back to python for the FIFO queue. Starting point: https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html
One concern is that if we start including C somewhere then it has to be super easy to build/deploy/install. Solving in scikit itself would be nice 🙏🏻 cause it'd be actually invisible to DeepCell (modulo a scikit version upgrade).
from deepcell-imaging.
I implemented "fast hybrid" in normal Python. As expected it's a lot slower.
```
Image size: (512, 512)
scikit h_maxima: 0.15 s
opencv h_maxima: 0.01 s
OpenCV Speedup: 9.75x
scikit == opencv : True
custom python h_maxima: 9.62 s
Python speedup: 0.02x
scikit == custom python : True
```
However, the Python code should be readily portable to Cython (well, I hope), at least the array-scan parts. This is not the worst outcome, as it's relatively common to distribute buildable Cython code.
from deepcell-imaging.
#27 reimplements the fast-hybrid scan in Cython. As hoped, it's a LOT faster. 🎉
PR TLDR:
- ~6x improvement over scikit h_maxima for 512x512 images.
- ~18x speedup for at least one 10k x 14k image.
With this proof of concept speeding up h_maxima by an order of magnitude, I'm declaring this spike complete 😤
cc @lynnlangit
As noted in the PR, there are at least 2 potential next steps:
- Submit scikit PR swapping/adding cython fast-hybrid vs existing downhill filter
- and/or submit DeepCell-toolbox PR to use new implementation (... which probably means making a python library)
The most expedient to our purposes is updating DeepCell– but it would be nice to contribute to scikit…
Followup issue here: #28
from deepcell-imaging.