CUDA Stream Compaction

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2

  • Yan Wu
  • Tested on: Windows 10 Education, i7-8750H @ 2.2GHz 16GB, GTX 1060 6GB (Personal Laptop)

Project Description

This project implements several versions of the Scan (Prefix Sum) algorithm:

  1. A CPU version of the algorithm (a reference sketch is shown after this list).
  2. Two GPU implementations: "naive" and "work-efficient."
  3. GPU stream compaction built on top of these scan implementations.
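
As a reference point for the GPU versions, the CPU scan and compact-with-scan steps look roughly like the sketch below. This is a minimal illustration under simple assumptions (int arrays, zero means "discard"); the function names are placeholders rather than the project's actual code.

```cpp
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<int> cpuExclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (size_t i = 1; i < in.size(); ++i) {
        out[i] = out[i - 1] + in[i - 1];
    }
    return out;
}

// Compaction with scan: map elements to 0/1 flags, scan the flags to get
// each survivor's output index, then scatter the surviving elements.
std::vector<int> cpuCompactWithScan(const std::vector<int>& in) {
    std::vector<int> flags(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        flags[i] = (in[i] != 0) ? 1 : 0;
    }
    std::vector<int> indices = cpuExclusiveScan(flags);
    size_t count = in.empty() ? 0 : (size_t)(indices.back() + flags.back());
    std::vector<int> out(count);
    for (size_t i = 0; i < in.size(); ++i) {
        if (flags[i]) {
            out[indices[i]] = in[i];
        }
    }
    return out;
}
```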
  • Algorithm Slides
  • Algorithm Demonstration (the scan steps are sketched in the kernels after this list):
    • Naive Inclusive Scan
    • Work-Efficient Scan (up-sweep)
    • Work-Efficient Scan (down-sweep)
Result and Performance Analysis

  • A sample of outcomes:

    • Scan Test:

    • Stream Compaction Test:
  • Result Sheets:

    • Scan Test Result:

    • Compaction Test Result:
  • Analysis:

    • Scan Test Charts:

    • Compaction Test Charts:

      Note that the red "cpu compact with scan" lines in the two charts above are the same data series.
    • Analysis:
      First, the GPU implementations do not perform significantly better when the test array is short. As the array grows very large, however, it becomes a burden for the CPU and the CPU's execution time rises sharply, so in that regime the GPU shows a much larger performance advantage.
      Looking more closely, my work-efficient scan is slower than the naive method for relatively short arrays, but it trends toward better performance as the array size increases. The Thrust implementation is steady: it may look slow at first, but its runtime stays nearly constant as the array grows. Comparing all of these methods, Thrust's scan is clearly the best for very large arrays.
  • Q & A:

    • Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.
      My block size was 128.
    • Why is the work-efficient method slower than the naive one?
      Implementing the algorithm straight from the course slides does reduce the amount of computation, since only half of the remaining elements participate in each round. However, the active indices are scattered across the array, so much of each warp does no useful work while still being scheduled. Restructuring the launch so that active threads are packed into as few warps as possible (reducing the number of working warps at each level) would save a lot of time here; a sketch of that idea follows.
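
A hedged sketch of that optimization for the up-sweep (hypothetical, not the submitted code): launch only as many threads as there are active nodes at the current level and map each thread index to its node, so the active threads land in contiguous, fully occupied warps.

```cuda
// Hypothetical optimization sketch: compact thread indexing for the up-sweep.
// At a level with stride = 2^(d+1) there are n / stride active nodes, so we
// launch only that many threads and let thread t own the node at t * stride.
__global__ void kernUpSweepCompact(int n, int stride, int* data) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int k = t * stride;                 // root index of this thread's subtree
    if (k + stride - 1 >= n) return;
    data[k + stride - 1] += data[k + stride / 2 - 1];
}

// Host side (illustrative): the grid shrinks at each level instead of always
// covering all n elements, e.g. with the blockSize of 128 mentioned above.
// int active = n / stride;
// dim3 grid((active + blockSize - 1) / blockSize);
// kernUpSweepCompact<<<grid, blockSize>>>(n, stride, dev_data);
```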
