
canny-opencl's Introduction

#canny-opencl This is an implementation of the Canny Edge Detection algorithm using OpenCL in C++. It uses OpenCV for some utility features, such as capturing an image from a webcam, opening/writing an image file, and converting from BGR to grayscale. Note that cross-platform compatibility was considered outside the scope of this project, so it has only been tested on OSX 10.10. Additionally, because OpenCL applications can be heavily optimized (or broken) depending on the target hardware, the range of target hardware has been significantly limited. A goal was to utilize OpenCL on both the CPU and the GPU, so this supports both and contains optimized versions for each. While this should run on most modern Macs, it has only been optimized and tested on the GPU and CPU of my early 2011 MacBook Pro 15".

To see a short demonstration, view the link below:

Installation and demo video

#Table of Contents

  • The Canny Edge Detection Algorithm
  • Executables
  • Code Layout
  • Optimization Methods

#The Canny Edge Detection Algorithm As its name implies, the Canny edge detection algorithm finds edges in an image. The input is a grayscale image and the output is a black and white image with 1 pixel wide white lines denoting edges. An edge may be defined as a place of high contrast, and the rate at which the contrast changes indicates the strength of the edge. The algorithm has four stages: Gaussian blur, Sobel filtering, non-maximum suppression, and hysteresis thresholding. Below I will briefly discuss each stage.

##Gaussian Blur A Gaussian blur is performed to reduce noise in the image. This is required because noise is generally high contrast and would therefore lead to false positives. It is implemented using image convolution. For those unfamiliar with it, image convolution is an operation which essentially replaces each pixel with a weighted average of its neighbors. The choice of weights is what makes image convolution applicable to such a wide range of problems. A Gaussian blur weights nearby pixels more heavily than distant ones.
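
To make the convolution concrete, here is a minimal serial C++ sketch of this stage, assuming a 3x3 kernel with the common 1-2-1 Gaussian approximation (the actual project uses OpenCL kernels and may use a different kernel size):

```cpp
#include <vector>

// Convolve a grayscale image (rows x cols, one byte per pixel) with a 3x3
// Gaussian kernel. Border pixels are left untouched for simplicity.
void gaussian_blur(const std::vector<unsigned char>& in,
                   std::vector<unsigned char>& out,
                   int rows, int cols)
{
    static const int k[3][3] = { {1, 2, 1}, {2, 4, 2}, {1, 2, 1} };  // sums to 16
    out.assign(in.size(), 0);
    for (int y = 1; y < rows - 1; ++y) {
        for (int x = 1; x < cols - 1; ++x) {
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += k[dy + 1][dx + 1] * in[(y + dy) * cols + (x + dx)];
            out[y * cols + x] = static_cast<unsigned char>(sum / 16);
        }
    }
}
```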

##Sobel Filtering Sobel filtering replaces each pixel with a combination of the x and y derivatives of neighboring pixels. In doing so, pixels in high contrast areas become "brighter" than pixels in low contrast areas. This essentially finds areas where edges are most likely to exist, but it does not pinpoint precisely where the edges are; they instead exist as a gradient of probabilities. Similar to the Gaussian blur, this is done using image convolution, but it is performed twice: once for the x derivative and once for the y derivative. The pixel is then replaced with essentially sqrt( (di/dx)^2 + (di/dy)^2 ), where di represents the change in intensity. During this stage, the direction of the gradient is also calculated for each pixel, which is needed for non-maximum suppression.
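
A serial C++ sketch of this stage (simplified relative to the project's OpenCL kernels) might look like the following:

```cpp
#include <cmath>
#include <vector>

// Per pixel: estimate the x and y derivatives with the standard 3x3 Sobel
// kernels, then store the gradient magnitude and direction. The direction is
// what non-maximum suppression consumes in the next stage.
void sobel(const std::vector<unsigned char>& in,
           std::vector<unsigned char>& magnitude,
           std::vector<float>& direction,
           int rows, int cols)
{
    static const int gx[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    static const int gy[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };
    magnitude.assign(in.size(), 0);
    direction.assign(in.size(), 0.0f);
    for (int y = 1; y < rows - 1; ++y) {
        for (int x = 1; x < cols - 1; ++x) {
            int sumx = 0, sumy = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int p = in[(y + dy) * cols + (x + dx)];
                    sumx += gx[dy + 1][dx + 1] * p;
                    sumy += gy[dy + 1][dx + 1] * p;
                }
            int mag = static_cast<int>(std::sqrt(double(sumx * sumx + sumy * sumy)));
            magnitude[y * cols + x] = static_cast<unsigned char>(mag > 255 ? 255 : mag);
            direction[y * cols + x] = std::atan2(double(sumy), double(sumx)); // radians
        }
    }
}
```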

##Non-Maximum Suppression At this point, there exist gradients representing probable edges, but our end goal is to represent each edge as a single 1 pixel line. Non-maximum suppression rules out pixels which are part of an edge but do not define the edge, condensing these wide gradient edges into a single 1 pixel line. Note that the result produced is still not the finished product; these lines are not purely white and don't necessarily indicate an edge, they instead still represent a probability of an edge. Non-maximum suppression is performed by using the direction of the gradient found in the previous stage and comparing the current pixel with the neighboring pixels on either side along that direction. If the pixel is lower in intensity than either of these neighboring pixels, then it is not considered to be the true edge, so its value is replaced by 0. If the pixel is the highest intensity among its neighbors in the direction of the gradient, then it may be the true edge, so its value is retained.
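
A minimal serial sketch of this stage, assuming the gradient direction is stored in radians (image coordinates, y growing downward) as in the Sobel sketch above:

```cpp
#include <vector>

// Keep a pixel only if it is at least as strong as both neighbors along the
// gradient direction; otherwise zero it. Directions are rounded to one of
// four cases: 0, 45, 90, or 135 degrees.
void non_max_suppression(const std::vector<unsigned char>& mag,
                         const std::vector<float>& dir,
                         std::vector<unsigned char>& out,
                         int rows, int cols)
{
    out.assign(mag.size(), 0);
    for (int y = 1; y < rows - 1; ++y) {
        for (int x = 1; x < cols - 1; ++x) {
            // Fold the angle into [0, 180) degrees, then pick the neighbor pair.
            float deg = dir[y * cols + x] * 180.0f / 3.14159265f;
            if (deg < 0.0f) deg += 180.0f;
            int dx = 1, dy = 0;                                        // ~0: east-west
            if      (deg >= 22.5f  && deg < 67.5f)  { dx = 1;  dy = 1; } // ~45
            else if (deg >= 67.5f  && deg < 112.5f) { dx = 0;  dy = 1; } // ~90: north-south
            else if (deg >= 112.5f && deg < 157.5f) { dx = -1; dy = 1; } // ~135
            unsigned char m = mag[y * cols + x];
            if (m >= mag[(y + dy) * cols + (x + dx)] &&
                m >= mag[(y - dy) * cols + (x - dx)])
                out[y * cols + x] = m;   // possible true edge; otherwise suppressed
        }
    }
}
```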

##Hysteresis Thresholding We now have 1 pixel wide lines with values indicating the strength of the edge. In order to decide which of these should be considered an edge, we use two threshold values. The low threshold indicates that pixels below its value cannot be edges. The high threshold indicates that pixels above its value must be edges. Pixels between these values are only edges if they neighbor an edge. The low threshold can thus be thought of as affecting the length of edges, and the high threshold as affecting the number of edges. Because of this definition, the exact output of this stage cannot be found in a single pass examining only neighboring pixels, since a "maybe" pixel may or may not be connected to a definite edge further away. That said, an approximation may be made which gives vastly increased performance (since no edge traversal is needed) while keeping the output accuracy high enough for most applications. This implementation uses the approximation approach.
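
A serial sketch of that approximation (each "maybe" pixel is promoted only if an immediate neighbor is a strong edge, so no edge traversal is required):

```cpp
#include <vector>

// Single-pass hysteresis approximation: pixels above 'high' become edges,
// pixels below 'low' are discarded, and in-between pixels are kept only if
// some 8-neighbor is above 'high'. A full implementation would instead trace
// chains of connected "maybe" pixels.
void hysteresis_approx(const std::vector<unsigned char>& in,
                       std::vector<unsigned char>& out,
                       int rows, int cols,
                       unsigned char low, unsigned char high)
{
    out.assign(in.size(), 0);
    for (int y = 1; y < rows - 1; ++y) {
        for (int x = 1; x < cols - 1; ++x) {
            unsigned char v = in[y * cols + x];
            if (v >= high) { out[y * cols + x] = 255; continue; }  // definite edge
            if (v < low) continue;                                 // definitely not
            // "Maybe" pixel: promote it if any immediate neighbor is strong.
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (in[(y + dy) * cols + (x + dx)] >= high)
                        out[y * cols + x] = 255;
        }
    }
}
```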

#Executables It contains two executables: live-capture and benchmark-suite. Live capture allows the user to view the results of canny edge detection in near real time using the webcam as input. The benchmark suite uses several implementations of the canny edge detection algorithm on multiple images. Each implementation is run multiple times on each image. Next, the stages of the algorithm are timed in isolation to better understand which steps are taking the longest.

##Building This has only been tested on a handful of similar machines and likely won't build on platforms other than OSX. If someone wishes to tweak it to work on Linux and/or Windows, I would gladly accept the pull request. Building and running is pretty easy and should be roughly the following:

$ ./tools/fetch-images.sh
$ mkdir build
$ cd build
$ cmake ..
$ make
$ cd ../bin
$ ./live-capture    # or ./benchmark-suite

#Code Layout The code is separated into three logical groups: image processors, live capture, and benchmarking.

##Image Processors All Image Processors are contained within the ImageProcessors namespace and inherit from the ImageProcessor abstract base class. Each image processor implements the canny edge detection algorithm, but each one does it in a different way. The focus of this project was canny edge detection in OpenCL, so the vast majority of effort was put into the OpenclImageProcessor class. Other image processors include SerialImageProcessor and CvImageProcessor.

###OpenclImageProcessor Based upon a boolean accepted by the constructor, this class can target either the CPU or the GPU. When using the GPU, it uses 16x16 work groups so that it may copy chunks of data to faster memory on the GPU and make fewer calls to global memory. While this is advantageous on the GPU, CPUs generally don't support two dimensional work groups, so the same kernels cannot be used. As such, the kernels directory contains two sets: one optimized for the GPU and one for the CPU. These kernels are only used by the OpenclImageProcessor.
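
A minimal host-side sketch of what that looks like with the OpenCL C++ wrapper (helper names are hypothetical, and it assumes the image dimensions are padded to multiples of 16):

```cpp
#include <OpenCL/cl.hpp>   // <CL/cl.hpp> on non-Apple platforms
#include <vector>

// Hypothetical helper: the CPU/GPU switch essentially boils down to which
// device type is requested from the platform.
cl::Device select_device(cl::Platform platform, bool use_gpu)
{
    std::vector<cl::Device> devices;
    platform.getDevices(use_gpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU,
                        &devices);
    return devices.front();   // assumes at least one matching device exists
}

// Hypothetical helper: launch one kernel over the whole image with 16x16
// work groups (assumes rows and cols are multiples of 16).
void enqueue_stage(cl::CommandQueue& queue, cl::Kernel& kernel,
                   int cols, int rows)
{
    cl::NDRange global(cols, rows);   // one work-item per pixel
    cl::NDRange local(16, 16);        // each 16x16 group shares local memory
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);
    queue.finish();
}
```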

###SerialImageProcessor The serial image processor was constructed by serializing the OpenclImageProcessor. It serves as a baseline for measuring the speedup gained by using the OpenclImageProcessor.

###CvImageProcessor The Cv Image Processor simply uses OpenCV's canny edge detection algorithm. Unlike the other classes, it does not support executing the different stages of the algorithm in isolation.

##Live Capture usage: ./live-capture [cpu|serial] If no argument is given, live-capture will use the GPU.

An image processor is created based upon the arguments, if any exist. When operating in CPU or GPU mode, live-capture uses the OpenclImageProcessor. Live capture is performed using OpenCV to capture frames from the webcam. The frames are then fed into the image processor and its result is displayed on screen. The result is a near-realtime video of the canny edge detection algorithm running on the webcam's data.
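
A minimal sketch of such a capture loop (the processor calls are hypothetical placeholders, not the project's actual API):

```cpp
#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cam(0);               // default webcam
    if (!cam.isOpened()) return 1;

    cv::Mat frame, gray;
    while (true) {
        cam >> frame;                      // grab a BGR frame
        if (frame.empty()) break;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);

        // processor->LoadImage(gray); processor->Canny();   // hypothetical calls
        cv::imshow("canny", gray);         // the real program shows the processor's output
        if (cv::waitKey(1) >= 0) break;    // any key quits
    }
    return 0;
}
```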

##Benchmarking usage: ./benchmark-suite Note: This currently requires images to be fetched and on the local path ./images/. If you haven't already, run tools/fetch-images.sh to obtain these images.

Because the focus of this project was to produce a reasonably performant implementation of the canny edge detection algorithm, a fair amount of effort went into creating a flexible benchmarking system. It is desirable to know how this implementation compares to others, how it performs at different resolutions, and how much time each stage of the algorithm consumes. As such, the benchmark-suite runs several iterations of each pair of image and implementation to find the mean and standard deviation of the time taken. After benchmarking the full algorithm, each stage is run in isolation to capture the amount of time each stage consumes. The test images range in size from 0.3 to 288 megapixels.

#Optimization Methods Initially, a single set of OpenCL kernels was used that was intended to run on both the CPU and GPU. The performance on the CPU was better than expected; it gave a 7x speedup over serial on a four-core processor. However, the GPU performance was much lower than expected; it was slower than serial. The problem was pinpointed to redundant global memory accesses. Because most stages require that each pixel have knowledge of neighboring pixels, each pixel was retrieved 9 or more times from global memory. This access pattern is exactly what GPUs are weak at, so much of what should have been executed in parallel was being serialized by global memory accesses.

To solve this problem, two dimensional work groups were used. Each work group copies pixel data to local memory, which is much faster but only accessible from within that work group. This reduced global memory accesses significantly and gave a 29x speedup over the previous version. The end result was an 8-13x speedup over serial for the GPU, with peak performance on images of approximately 10 megapixels. While running on the CPU, this implementation had a 7-10x speedup on images between 0.3 and 9 megapixels, but fell off very quickly for larger images.
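
To make the tiling idea concrete, here is a simplified OpenCL C sketch of the pattern (a stand-in, not the repository's actual kernel): a 16x16 work group cooperatively loads an 18x18 tile (its own pixels plus a 1 pixel halo) into local memory, then runs a 3x3 stencil entirely from local memory.

```c
__kernel void blur_tiled(__global const uchar *in,
                         __global uchar *out,
                         int rows, int cols)
{
    __local uchar tile[18][18];
    const int gx = get_global_id(0), gy = get_global_id(1);
    const int lx = get_local_id(0),  ly = get_local_id(1);

    // Cooperative load: each work-item copies up to four tile entries, so the
    // 256 work-items fill all 18x18 entries, clamping reads at the image border.
    for (int i = ly; i < 18; i += 16)
        for (int j = lx; j < 18; j += 16) {
            int y = clamp(gy - ly + i - 1, 0, rows - 1);
            int x = clamp(gx - lx + j - 1, 0, cols - 1);
            tile[i][j] = in[y * cols + x];
        }
    barrier(CLK_LOCAL_MEM_FENCE);

    // 3x3 average (stand-in for the real Gaussian weights); every neighbor
    // read now hits fast local memory instead of global memory.
    // Assumes rows and cols are multiples of 16, so no bounds check is needed.
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tile[ly + 1 + dy][lx + 1 + dx];
    out[gy * cols + gx] = (uchar)(sum / 9);
}
```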

canny-opencl's People

Contributors

antbanks2014, smskelley


canny-opencl's Issues

atan vs atan2

So I'm running into an issue with atan2 and atan. After some testing, it is clear that the range of the C++ atan2 function is not (-90, 90) degrees or (0, 180) degrees; I've gotten values in the range (-90, 270) degrees when using printf in the Sobel kernel in our program. I thought about just implementing it myself as atan(y / x), which does give a range of (-90, 90). The only problem with this is that sumx is often 0, which means division by zero and a problem. I just wanted to keep you guys informed about where the angle detection part of the Sobel kernel is. @smskelley if you feel like starting the NMS kernel, it may take me longer than I thought to sort out the angle issues.
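
One possible workaround (a sketch of an approach, not what the repository currently does): keep atan2, which is defined even when sumx is 0, and fold its output back onto a half circle, since an edge at theta and theta + 180 is the same edge.

```cpp
#include <cmath>

// Map atan2's (-180, 180] degree output onto [0, 180] without dividing by zero.
float gradient_angle_deg(float sumy, float sumx)
{
    float deg = std::atan2(sumy, sumx) * 180.0f / 3.14159265f;  // (-180, 180]
    if (deg < 0.0f) deg += 180.0f;                              // [0, 180]
    return deg;
}
```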

Sobel Filtering should also compute direction.

The Sobel kernel currently only calculates the magnitude of an edge. It should additionally calculate the direction of the edge. This direction is needed for non-maximum suppression.

Fundamentally, only four directions are needed: North-South, Northeast-Southwest, East-West, Southeast-Northwest. It is unclear currently what the best representation of direction should be. The options I can think of are:

  1. Simply use four 'magic' integers: 0, 1, 2, 3 (using an unsigned char). This would be space efficient, since it would only require one byte per pixel, but it would be less obvious to those who read it. Additionally, the Sobel filter would be less reusable.
  2. Compass degrees stored in a float. Sobel becomes very reusable, but it comes at the cost of using 4 bytes for each pixel.
  3. Compass degrees stored in an unsigned char. They would be limited to 255 obviously, so not all compass degrees would be valid. However, it would technically work here, since 90 + 45 = 135 is the largest degree we need.
  4. Normalize compass degrees to 200 and store in an unsigned char. In this scenario, 50 would map to east, 25 to northeast. This would allow the whole range of compass degrees to be considered valid input while still occupying 1 byte. It would make Sobel slightly more reusable and retain space efficiency, but it comes at the cost of a slight decrease in compute efficiency and it introduces "magic numbers" that are less readable.

@jaredmolson84 Do you have an opinion on the subject? I'm leaning towards option 2, but I think at some point we should try 1 or 4 in the benchmarking phase to see if it creates a noticeable performance increase.

Benchmarking should include multiple images of varying sizes.

Benchmarking a single image provides a glimpse of how the algorithm performs, but it doesn't provide the full picture. It is likely that OpenCL will best harness the GPU when operating on very large images. Add additional images of varying sizes to the benchmark, including one that is 10,240px by 10,240px to match the scale of Ogawa et al.'s benchmarking.

Image Processor sometimes uses integrated GPU when discrete is available.

When initializing the ImageProcessor, OpenCL is set up using the first GPU returned. On a system which has both integrated and discrete options, this may result in the integrated GPU being chosen, which results in low performance.

Instead, the available GPUs should be enumerated and the first Nvidia or AMD/ATI GPU should be preferred. This may not be perfect; as an example of where this could fail, imagine you have an AMD CPU with an integrated GPU but also an additional discrete AMD GPU: the integrated GPU may still be chosen. If someone can find a better option, do that. Otherwise, this simple solution will work for now.
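
A sketch of that preference with the OpenCL C++ wrapper (a possible approach, not existing code):

```cpp
#include <OpenCL/cl.hpp>   // <CL/cl.hpp> on non-Apple platforms
#include <string>
#include <vector>

// Prefer an NVIDIA or AMD/ATI GPU over whatever OpenCL happens to list first.
cl::Device pick_gpu(const std::vector<cl::Device>& gpus)
{
    for (const auto& d : gpus) {
        std::string vendor = d.getInfo<CL_DEVICE_VENDOR>();
        if (vendor.find("NVIDIA") != std::string::npos ||
            vendor.find("AMD")    != std::string::npos ||
            vendor.find("ATI")    != std::string::npos)
            return d;          // first discrete-looking GPU wins
    }
    return gpus.front();       // fall back to the first GPU returned
}
```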

Add Non-maximum Suppression.

Currently, the canny edge detection algorithm is lacking its third stage: non-maximum suppression. This will examine each pixel and set its magnitude to 0 if a neighboring pixel along the gradient direction has a larger magnitude.

Linux version of this Project

Dear Sean Kelley
I am using your project for my research, but when I try to make it on Ubuntu I face the following error:
/home/xx/Desktop/canny-opencl-master_OSX/src/live-capture/live-capture.cpp: In function ‘int main(int, char**)’:
/home/xx/Desktop/canny-opencl-master_OSX/src/live-capture/live-capture.cpp:24:3: error: ‘unique_ptr’ was not declared in this scope
    unique_ptr<ImageProcessor> processor(nullptr);
/home/xx/Desktop/canny-opencl-master_OSX/src/live-capture/live-capture.cpp:24:28: error: expected primary-expression before ‘>’ token
/home/xx/Desktop/canny-opencl-master_OSX/src/live-capture/live-capture.cpp:24:30: error: ‘processor’ was not declared in this scope
/home/xx/Desktop/canny-opencl-master_OSX/src/live-capture/live-capture.cpp:24:30: note: suggested alternative: ‘glScissor’
/home/xx/Desktop/canny-opencl-master_OSX/src/live-capture/live-capture.cpp:54:5: error: ‘imshow’ was not declared in this scope
    imshow("canny", processor->output());
/home/xx/Desktop/canny-opencl-master_OSX/src/live-capture/live-capture.cpp:54:5: note: suggested alternative:
In file included from /usr/include/opencv2/highgui/highgui.hpp:48:0,
                 from /home/xx/Desktop/canny-opencl-master_OSX/src/live-capture/live-capture.cpp:4:
/usr/include/opencv2/highgui.hpp:569:17: note: ‘cv::imshow’
    CV_EXPORTS void imshow(const String& winname, const ogl::Texture2D& tex);
src/live-capture/CMakeFiles/live-capture.dir/build.make:62: recipe for target 'src/live-capture/CMakeFiles/live-capture.dir/live-capture.cpp.o' failed
make[2]: *** [src/live-capture/CMakeFiles/live-capture.dir/live-capture.cpp.o] Error 1
CMakeFiles/Makefile2:141: recipe for target 'src/live-capture/CMakeFiles/live-capture.dir/all' failed
make[1]: *** [src/live-capture/CMakeFiles/live-capture.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
Given your comment about accepting such changes, I would appreciate it if you could release a Linux version of this project.

Benchmark results should be logged.

It would be nice if, when the benchmark is run, the raw timings with some metadata were to be output to a file alongside the image. e.g. if there is an image large.jpg with resolution 10240x10240, then it could output a file named large.gpu.csv similar to the following:

10240, 10240                       #resolution: width, height                
123.12, 124.10, 125.1              #timed complete runs: trial1, trial2, trial3
20.12, 40.10, 40.1, 43.20          #timed stages: stage1, stage2, stage3, stage4
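
For illustration, a minimal sketch of a writer for this format (the function name and signature are hypothetical, not part of the current code):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Write one CSV per image/device pair in the format shown above:
// resolution, then full-run timings, then per-stage timings.
void write_benchmark_csv(const std::string& path, int width, int height,
                         const std::vector<double>& runs,
                         const std::vector<double>& stages)
{
    std::ofstream csv(path);
    csv << width << ", " << height << "\n";
    for (size_t i = 0; i < runs.size(); ++i)
        csv << runs[i] << (i + 1 < runs.size() ? ", " : "\n");
    for (size_t i = 0; i < stages.size(); ++i)
        csv << stages[i] << (i + 1 < stages.size() ? ", " : "\n");
}
```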

The intention behind this is twofold: the most obvious reason is that it would be nice to use some of this data with pyplot or something along those lines. The secondary reason is that it would create a log of how the performance is changing with the modifications we're making. For example, if we speed up the algorithm in a certain location and commit a new benchmark at the same time, it would provide a log showing what effect different tweaks have on real world performance.

A potential issue could certainly be that we all have different hardware and we don't have a standardized benchmarking platform. I see two possible resolutions: files contain a unique computer identifier within them (e.g. large.mibook.gpu.csv if your computer is named mibook). The other option is to have a designated person run the tests periodically. Note that it doesn't have to directly correlate and we can always roll back changes later if we want to better understand things.

Unless someone has a preference, I will probably move towards having them named something along the lines of large.mibook.gpu.csv and commit them myself periodically. If someone else also wishes to, they are certainly welcome to.

Speed comparison with OpenCV

I'm looking at the benchmarking results with OpenCV and it seems that the OpenCV implementation is a lot faster. Do you know why this might be the case?

Cpu vs Gpu Kernels

I assume they are the same, with the difference being that the GPU Kernels make use of local memory?

License / copyright

Hi @smskelley ,

this is a great resource! Thanks for sharing! If I wanted to reuse the code in one of my project, what open-source license would I have to respect? Would you mind adding a license file to the repository?

Thanks!
Best,
Robert

Sobel kernel uses poor naming conventions

The Sobel kernel uses a poor name for the test_kernel matrix. It needs to be renamed to something more appropriate in all files which use the test_kernel matrix.

Derivatives

Hello, I am looking for ways to differentiate a math function (obtain its first, second order derivatives, ...) with OpenCL, and I have stumbled upon your page. Do you calculate derivatives somewhere in your work?

Sobel kernel is storing angle incorrectly

The Sobel kernel is storing angle values improperly. The arctan function has a range of (-90, 90) degrees; this needs to be mapped into a range of (0, 180) degrees.
