CUDA Stream Compaction

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2

  • Yan Wu
  • Tested on: Windows 10 Education, i7-8750H @ 2.2GHz 16GB, GTX 1060 6GB (Personal Laptop)

Project Description

This project implements several versions of the Scan (Prefix Sum) algorithm:

  1. A CPU version of the algorithm (a reference sketch is shown after this list).
  2. Two GPU implementations: "naive" and "work-efficient."
  3. GPU stream compaction built on top of these scan implementations.
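
As a reference point for the GPU versions, the CPU scan and compact-with-scan steps look roughly like the sketch below. This is a minimal illustration under simple assumptions (int arrays, zero means "discard"); the function names are placeholders rather than the project's actual code.

```cpp
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<int> cpuExclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size(), 0);
    for (size_t i = 1; i < in.size(); ++i) {
        out[i] = out[i - 1] + in[i - 1];
    }
    return out;
}

// Compaction with scan: map elements to 0/1 flags, scan the flags to get
// each survivor's output index, then scatter the surviving elements.
std::vector<int> cpuCompactWithScan(const std::vector<int>& in) {
    std::vector<int> flags(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        flags[i] = (in[i] != 0) ? 1 : 0;
    }
    std::vector<int> indices = cpuExclusiveScan(flags);
    size_t count = in.empty() ? 0 : (size_t)(indices.back() + flags.back());
    std::vector<int> out(count);
    for (size_t i = 0; i < in.size(); ++i) {
        if (flags[i]) {
            out[indices[i]] = in[i];
        }
    }
    return out;
}
```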
  • Algorithm Slides
  • Algorithm Demonstration (the scan steps are sketched in the kernels after this list):
    • Naive Inclusive Scan
    • Work-Efficient Scan (up-sweep)
    • Work-Efficient Scan (down-sweep)
Result and Performance Analysis

  • A sample of outcomes:

    • Scan Test:

    • Stream Compaction Test:
  • Result Sheets:

    • Scan Test Result:

    • Compaction Test Result:
  • Analysis:

    • Scan Test Charts:

    • Compaction Test Charts:

      Note that the red "cpu compact with scan" lines in the two charts above are the same data series.
    • Analysis:
      First, the GPU implementations do not perform significantly better when the test array is short. As the array grows very large, however, it becomes a burden for the CPU and the CPU's execution time rises sharply, so in that regime the GPU shows a much larger performance advantage.
      Looking more closely, my work-efficient scan is slower than the naive method for relatively short arrays, but it trends toward better performance as the array size increases. The Thrust implementation is steady: it may look slow at first, but its runtime stays nearly constant as the array grows. Comparing all of these methods, Thrust's scan is clearly the best for very large arrays.
  • Q & A:

    • Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.
      My block size was 128.
    • Why is the work-efficient method slower than the naive one?
      Implementing the algorithm straight from the course slides does reduce the amount of computation, since only half of the remaining elements participate in each round. However, the active indices are scattered across the array, so much of each warp does no useful work while still being scheduled. Restructuring the launch so that active threads are packed into as few warps as possible (reducing the number of working warps at each level) would save a lot of time here; a sketch of that idea follows.
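
A hedged sketch of that optimization for the up-sweep (hypothetical, not the submitted code): launch only as many threads as there are active nodes at the current level and map each thread index to its node, so the active threads land in contiguous, fully occupied warps.

```cuda
// Hypothetical optimization sketch: compact thread indexing for the up-sweep.
// At a level with stride = 2^(d+1) there are n / stride active nodes, so we
// launch only that many threads and let thread t own the node at t * stride.
__global__ void kernUpSweepCompact(int n, int stride, int* data) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int k = t * stride;                 // root index of this thread's subtree
    if (k + stride - 1 >= n) return;
    data[k + stride - 1] += data[k + stride / 2 - 1];
}

// Host side (illustrative): the grid shrinks at each level instead of always
// covering all n elements, e.g. with the blockSize of 128 mentioned above.
// int active = n / stride;
// dim3 grid((active + blockSize - 1) / blockSize);
// kernUpSweepCompact<<<grid, blockSize>>>(n, stride, dev_data);
```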
