Comments (14)
✨ interesting finding ✨ the Downhill Filter paper doesn't mention sorting.
Sorting appears to be expensive in the scikit implementation.
Note that rank_order performs another argsort: `sort_order = flat_image.argsort().astype(unsigned_dtype, copy=False)` (source code). For some reason it's not pictured in the profile here.
Two potential solutions
1: Limited impact, but maybe easier?
We're sorting twice; do we actually have to? Can we reuse the sort from rank_order?
2: Rewrite algorithm
In principle we don't need to sort: we just need to group the pixels by intensity. Would it be faster to do it more literally, as the algorithm says, and create a list per intensity value? What input are we using: byte (0-255) or floating-point (0.0..1.0) grayscale? Byte is probably much faster b/c we can leverage direct arrays.
Can we quantize down to 256 levels ... heck, even 1024 ... (or something)?
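For the byte case, the per-intensity grouping can be a single counting pass instead of a sort. A minimal sketch of the idea (the function name is mine, not from any library):

```python
import numpy as np

def height_lists(image):
    """Group flat pixel indices by intensity, replacing an argsort.

    Assumes a uint8 image: the list at index v holds the flat indices of
    all pixels with value v. Building the buckets is a single O(n) pass,
    versus O(n log n) for a comparison sort.
    """
    buckets = [[] for _ in range(256)]
    for idx, v in enumerate(image.ravel()):
        buckets[v].append(idx)
    return buckets
```

Iterating the buckets from 255 down to 0 then visits pixels in exactly the descending-intensity order the algorithm needs, without ever sorting.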
from deepcell-imaging.
I rewrote the grayscale reconstruction part of h_maxima using OpenCV, which I understand to have better performance here than scikit-image, based on this GitHub issue.
question from pratapvardhan:
wondering - if scipy.ndimage is already used in skimage, why not use faster versions of scipy.ndimage.{grey_dilation, .grey_operations}?
answer from jni:
to not break the API. Our versions are slower and mess with the data types! Grrr. I've been campaigning to do just this but haven't had time to implement, but when I do (should be soon actually!), we have to be careful not to break legacy code.
Rest of thread shows improvements in other algorithms but not the ones we're using (h_maxima/grayreconstruct).
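For context, the scipy primitives jni refers to can be called directly; a minimal sketch (my own example, not from the thread):

```python
import numpy as np
from scipy import ndimage

# Calling the scipy primitive directly, as jni suggests: grey_dilation
# with a flat 3x3 structuring element. Unlike the skimage wrappers this
# thread complains about, this should preserve the input dtype.
image = np.array([[0, 0, 0],
                  [0, 9, 0],
                  [0, 0, 0]], dtype=np.uint8)
dilated = ndimage.grey_dilation(image, size=(3, 3))
# The single bright pixel spreads to the whole 3x3 neighborhood.
```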
```python
import cv2
import numpy as np
from skimage.morphology import disk


def opencv_reconstruct(marker: np.ndarray, mask: np.ndarray, radius: int = 2):
    kernel = disk(radius)
    # .item() converts the numpy scalar to a python scalar
    pad_value = np.min(marker).item()

    image = marker
    expanded = np.ndarray.copy(marker)
    while True:
        expanded = cv2.dilate(
            src=image,
            dst=expanded,
            kernel=kernel,
            borderType=cv2.BORDER_CONSTANT,
            borderValue=pad_value,
        )
        expanded = np.fmin(expanded, mask)

        # Termination criterion: expansion didn't change the image at all
        if (image == expanded).all():
            return expanded
        np.copyto(dst=image, src=expanded)


def opencv_h_maxima(image, h=0.075, radius=2):
    resolution = 2 * np.finfo(image.dtype).resolution * np.abs(image)
    shifted_img = image - h - resolution
    reconstructed = opencv_reconstruct(shifted_img, image, radius)
    residue_img = image - reconstructed
    return (residue_img >= h).astype(np.uint8)
```
Test results vs scikit h_maxima:
(Figure: OpenCV speedup, 512x512 image.)
(Figure: OpenCV slowdown, 10k x 14k image.)
Interpretation
As stated in the papers, the "straightforward" iteration approach is very slow for larger inputs. But it's 10x as fast for 512 x 512.
(I'll put up a notebook shortly)
Open avenues:
- Tile large images into 512x512 (find the right size) & figure out boundary handling.
- Implement FastHybrid using OpenCV primitives. (Putting items into the queue isn't obvious, though.)
Good citizen work?
- scikit PR to reimplement grayreconstruct using fast scipy primitives
from deepcell-imaging.
Comments reference an academic paper as basis for implementation:
Robinson, Kevin and Whelan, Paul F. (2004) Efficient morphological reconstruction: a downhill filter. Pattern Recognition Letters, 25 (15). pp. 1759-1767. ISSN 0167-8655
Available from Dublin City University or the Vision Systems Group lab page. Full text PDF: whelan_2004_104.pdf
Paper references source code but the URL is broken. I think I found it on the internet archive: prl_DownhillFilter.zip
from deepcell-imaging.
Morphological reconstruction is a "well studied" area. I found another interesting paper:
2012: A fast parallel implementation of queue-based morphological reconstruction using gpus, by George Teodoro. ResearchGate, Full text PDF
It references a 2010 paper by Pavel Karas: Efficient Computation of Morphological Greyscale Reconstruction. Dagstuhl Research Online Publication Server, Full text pdf
Question: how does the scikit implementation compare vs these algorithms? Can we do a side-by-side of CPU algorithms, without introducing GPUs? (These papers support the ability to implement GPU algorithms if we need it.)
from deepcell-imaging.
The Downhill Filter pseudocode from whelan_2004_104.pdf, page 7, translated.
The mask image is the input image. The markers are the DeepCell model outputs.
Initialization
- Find the maximum pixel value m in the marker image (the model output).
  - In English: m is the number, among the marker pixel values, such that m >= p for every pixel value p.
- Place each non-zero marker pixel into the appropriate "height list" (one list per pixel value between 1 and the maximum).
  - In English: for every image point, if the point's marker value i is non-zero, append that point to the height list L at the height aka index i.

Algorithm loop
- For each height list in L, for height n from highest to lowest:
  - While the list of pixels at height n is not empty:
    - Operate on p, the first pixel at height n.
    - Remove p from the height list at n.
    - For each of p's neighbors q: …
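The translated pseudocode can be rendered end-to-end in plain Python. This is my own naive sketch for non-negative integer images, using 4-connectivity for brevity; it is not the paper's code:

```python
import numpy as np

def downhill_filter(marker, mask):
    """Naive sketch of the height-list pseudocode above.

    Assumes small non-negative integer intensities. Pixels are bucketed
    by value, then processed from the highest bucket down, pushing each
    value "downhill" into neighbors, clamped by the mask.
    """
    out = np.minimum(marker, mask).astype(np.int64)
    rows, cols = out.shape
    max_v = int(out.max())
    # Initialization: one list of pixel coordinates per intensity 1..max_v.
    lists = [[] for _ in range(max_v + 1)]
    for r in range(rows):
        for c in range(cols):
            if out[r, c] > 0:
                lists[out[r, c]].append((r, c))
    # Loop: from the highest list down, propagate values to neighbors.
    for h in range(max_v, 0, -1):
        todo = lists[h]
        while todo:
            r, c = todo.pop()
            if out[r, c] != h:
                continue  # stale entry: pixel was raised after queueing
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    new = min(h, int(mask[nr, nc]))
                    if new > out[nr, nc]:
                        out[nr, nc] = new
                        lists[new].append((nr, nc))
    return out
```

Note there is no sort anywhere: the bucket index plays the role the argsort plays in the scikit implementation.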
from deepcell-imaging.
Hypothesis: the Downhill Filter implementation could be optimized (eg above discussion re not needing to actually sort). There's also potential for parallelization:
- initialization: mapping the height-list construction to batches, then reducing the tiles into a combined height list. (But that doesn't actually "reduce" the result size, as we still need to categorize each pixel from the markers.)
- loop: can we possibly process each height list in batches? In principle we only add "down" the heights, never into higher levels; can we use this somehow?
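The initialization map/reduce idea could look something like this (hypothetical helper names, shown sequentially for clarity; the map step is what you'd farm out to workers per tile):

```python
import numpy as np

def tile_height_lists(tile, row_offset, n_levels):
    """Map step: bucket one tile's non-zero pixels by intensity.

    Coordinates are kept global via row_offset so tiles can be merged.
    """
    buckets = [[] for _ in range(n_levels)]
    for (r, c), v in np.ndenumerate(tile):
        if v > 0:
            buckets[v].append((r + row_offset, c))
    return buckets

def merge_height_lists(parts, n_levels):
    """Reduce step: concatenate the per-tile lists level by level."""
    merged = [[] for _ in range(n_levels)]
    for part in parts:
        for level in range(n_levels):
            merged[level].extend(part[level])
    return merged
```

As noted above, this doesn't shrink the result (every marker pixel still lands in some list), but the map step is embarrassingly parallel and the reduce is cheap list concatenation.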
Alternative algorithm: "Fast Hybrid gray scale reconstruction" as presented in the 2012 paper: A fast parallel implementation of queue-based morphological reconstruction using gpus, by George Teodoro. ResearchGate, Full text PDF
Note that this is a spelled-out version of "Approach D" in the 2004 downhill filter paper.
Here's the algorithm (attached as a screenshot, not captured here).
Given the proof of GPU parallelization, and the more straightforward algorithm structure (scan, scan, then queue), I think parallelizing Fast-Hybrid on CPU is a good option.
Here's the 2012 fast-hybrid-GPU speed comparison (attached as a screenshot, not captured here). Note that the FH CPU implementation is on a single-core CPU. Effectively, the proposal is to implement FH CPU but in parallel (multi-core or distributed).
Note also that the scan structure gives us linearity of data access. The height map, by contrast, is linear over the data during initialization but then "bounces all over" the image as it iterates over the height lists. In scan-scan-queue we only have that problem during the queue phase, so the question might also be: how many pixels get requeued? Note that the 2012 paper parallelizes the queue part on GPU.
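To make the scan-scan-queue structure concrete, here's a naive pure-Python sketch (my own rendering, 4-connectivity; not the paper's tuned implementation):

```python
import numpy as np
from collections import deque

def fast_hybrid_sketch(marker, mask):
    """Sketch of fast-hybrid reconstruction: raster scan, anti-raster
    scan, then a FIFO propagation phase. Integer images assumed."""
    out = np.minimum(marker, mask).astype(np.int64)
    rows, cols = out.shape
    # Raster scan: max over self and already-visited (N+) neighbors,
    # clamped to the mask.
    for r in range(rows):
        for c in range(cols):
            v = out[r, c]
            if r > 0:
                v = max(v, out[r - 1, c])
            if c > 0:
                v = max(v, out[r, c - 1])
            out[r, c] = min(v, mask[r, c])
    # Anti-raster scan: same with N- neighbors; queue any pixel that
    # could still push a value into a lower neighbor.
    fifo = deque()
    for r in range(rows - 1, -1, -1):
        for c in range(cols - 1, -1, -1):
            v = out[r, c]
            if r < rows - 1:
                v = max(v, out[r + 1, c])
            if c < cols - 1:
                v = max(v, out[r, c + 1])
            out[r, c] = min(v, mask[r, c])
            for nr, nc in ((r + 1, c), (r, c + 1)):
                if (nr < rows and nc < cols
                        and out[nr, nc] < out[r, c]
                        and out[nr, nc] < mask[nr, nc]):
                    fifo.append((r, c))
                    break
    # Queue phase: FIFO propagation until stable. This is the part with
    # non-linear memory access, and the part requeue counts come from.
    while fifo:
        r, c = fifo.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new = min(int(out[r, c]), int(mask[nr, nc]))
                if new > out[nr, nc]:
                    out[nr, nc] = new
                    fifo.append((nr, nc))
    return out
```

The two scans touch memory strictly linearly; only the queue phase jumps around, which is exactly the requeue question raised above.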
Maybe we should just skip all this and go straight to parallelized GPU implementation lol. But GPUs haven't been readily available and personally I wouldn't know how to write GPU code to do this. I also think we haven't nearly hit the ceiling of CPU-only performance, we don't need real-time we just need not-5-minutes.
from deepcell-imaging.
Additional finding to support not pursuing the Downhill Filter. The original paper by Vincent (1993) (private link for full-text PDF), which reviews parallel and sequential approaches, says the 'classic' (aka math-derived) approach is highly parallelizable, but not on "conventional computers".
But "conventional computers" look very different in 2023 from 1993!
Scaling horizontally to many processors/computers is far easier than 20 years ago.
I have some naive ideas how to implement the above "hybrid" scan in parallel. But first I'm researching how hardware/GPU solutions work. It should be inspirational for a CPU/cloud version.
Googling led me to this paper:
Fast Morphological Image Processing on GPU using CUDA by Mugdha A. Rane at the College of Engineering, Pune
She in turn references the "van Herk / Gil & Werman" aka vHGW algorithm.
Currently reading ^^ this paper, plus looking at vHGW source code. Read this some more? https://github.com/cuekoo/Herk-Gil-Werman-Fast-Morphology
from deepcell-imaging.
I found a paper from Gil (of the original vHGW paper), together with Kimmel, that improves on the algorithm.
2003-Jan, Gil & Kimmel: Efficient Dilation, Erosion, Opening and Closing Algorithms
full text
This is a fairly detailed explanation / walkthrough of the algorithm improvements. However, it's 1-dimensional and would need to be generalized to 2D.
Reading: page 8
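For intuition, the core 1-D trick can be sketched in plain Python/NumPy (my own sketch, not the paper's code): block-wise forward and backward running maxima let every window max come from just two lookups, independently of window size.

```python
import numpy as np

def van_herk_max_1d(a, w):
    """Sliding max over all length-w windows (output length n - w + 1).

    Partition a into blocks of size w; s holds forward running maxima
    restarting at each block boundary, r holds backward running maxima.
    Each window spans at most two adjacent blocks, so its max is
    max(r[start], s[end]): two reads per window, any w.
    """
    n = len(a)
    s = np.empty(n, dtype=a.dtype)
    r = np.empty(n, dtype=a.dtype)
    for start in range(0, n, w):
        end = min(start + w, n)
        s[start] = a[start]
        for i in range(start + 1, end):
            s[i] = max(s[i - 1], a[i])
        r[end - 1] = a[end - 1]
        for i in range(end - 2, start - 1, -1):
            r[i] = max(r[i + 1], a[i])
    return np.array([max(r[i], s[i + w - 1]) for i in range(n - w + 1)])
```

A 2-D dilation with a rectangular window is then two passes of this, one over rows and one over columns; the circular (disk) footprint is what makes the generalization non-trivial.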
from deepcell-imaging.
Reconstruction algorithms refer to the "window" or "structuring element", basically the window within which we compute a min/max image processing.
DeepCell uses radius = 2, for Mesmer. source code
I think this means the window is 5x5 (2 each way from a center point).
As determined by using the disk function (source code) defined in this source code.
Note that although the window is 5x5, we only use the circular part of it (because it's a radius).
If I understand correctly, the more granular this window, the more "tasks" we have to run. Soooo a radius=2 means we're doing a lot of work. 😤
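The footprint can be sanity-checked with a few lines of NumPy. This should match what skimage's disk(2) produces (the points within Euclidean distance 2 of the center), though that's my reading of the source, not a guarantee:

```python
import numpy as np

# Rebuild the radius-2 structuring element from the distance condition.
radius = 2
y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
footprint = (x * x + y * y <= radius * radius).astype(np.uint8)
# A 5x5 window, but only the 13 cells inside the circle participate:
# corners and the (±2, ±1) cells fall outside the radius.
```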
from deepcell-imaging.
OK–
The Hybrid Algorithm for dilation is a max operation applied in raster order, then in anti-raster order, then the queue-based max filtering, as seen in A fast parallel implementation of queue-based morphological reconstruction using gpus.
The vHGW algorithm, plus the improvement described in Efficient Dilation, Erosion, Opening and Closing Algorithms (2003), performs the parallel max phases.
Our algorithm would need to parallelize these units of work. For a radius of 2, aka p=5, the "obvious" (aka embarrassing) parallelism means (img-width/5) * (img-height/5) windows (I think). Given how memory-expensive the whole thing is, that's a lot better. But can we get it down to 512px? 20k/5 = 4k, which is 8x as big as the 512x512 that seems to be "Very Fast".
from deepcell-imaging.
Benchmark implementation / notebook: #17
from deepcell-imaging.
My plan for an OpenCV implementation was to use dilation with a before/after footprint (i.e., exclude pixels as appropriate during the forward or backward scan). But I have no visibility into the iteration order; in fact I think the whole point is parallel execution, whereas FastHybrid needs a very specific iteration order.
But all hope is not lost 😅 The scikit grayreconstruct python implementation uses C under the hood to perform certain operations, after setting things up in numpy.
I believe it should be "easy" to create a similar .pyx iteration function, to perform fast-hybrid's raster scans, and defer back to python for the FIFO queue. Starting point: https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html
One concern is that if we start including C somewhere then it has to be super easy to build/deploy/install. Solving in scikit itself would be nice 🙏🏻 cause it'd be actually invisible to DeepCell (modulo a scikit version upgrade).
from deepcell-imaging.
I implemented "fast hybrid" in normal Python. As expected it's a lot slower.
```
Image size: (512, 512)
scikit h_maxima: 0.15 s
opencv h_maxima: 0.01 s
OpenCV Speedup: 9.75x
scikit == opencv : True
custom python h_maxima: 9.62 s
Python speedup: 0.02x
scikit == custom python : True
```
However, the Python code should be readily portable to Cython (well, I hope), at least the array-scan parts. This is not the worst outcome, as it's relatively common to distribute buildable Cython code.
from deepcell-imaging.
#27 reimplements the fast-hybrid scan in Cython. As hoped, it's a LOT faster. 🎉
PR TLDR:
- ~6x improvement over scikit h_maxima for 512x512 images.
- ~18x speedup for at least one 10k x 14k image.
With this proof of concept speeding up h_maxima by an order of magnitude, I'm declaring this spike complete 😤
cc @lynnlangit
As noted in the PR, there are at least 2 potential next steps:
- Submit scikit PR swapping/adding cython fast-hybrid vs existing downhill filter
- and/or submit DeepCell-toolbox PR to use new implementation (... which probably means making a python library)
The most expedient to our purposes is updating DeepCell– but it would be nice to contribute to scikit…
Followup issue here: #28
from deepcell-imaging.