Coder Social home page Coder Social logo

mortimr / oclbhgs Goto Github PK

View Code? Open in Web Editor NEW
25.0 3.0 0.0 76.41 MB

🛰💫 OpenCL Barnes-Hut Galaxy Simulator | Fully exploiting the GPU to do massive data parallel computations in an n-body simulator

CMake 2.64% C 96.97% C++ 0.38%
opencl gpu nbody-simulation barnes-hut galaxy simulator c99

oclbhgs's Introduction

COMP426 Assignment 4

10K Body simulations examples, smooth 60FPS on GTX 1060

(Wait for gifs to properly load to see the simulations smoothly)

git checkout v1.0 to see this simulation

10k1g

git checkout v1.1 to see this simulation

10k2g

git checkout v1.2 to see this simulation

10k1grandom

Running

Requires gcc, cmake, OpenCL, OpenGL and GLUT to run.

mkdir build
cd build
cmake ..
make
./oclbhgs

OpenCL Barnes-Hut Galaxy Simulator

First Name Last Name Student ID
Iulian Rotaru 40102558

Introduction

The goal of this last assignement was to use OpenCL in order to compute an n-body simulation. The particularity with OpenCL is that kernels can be written for CPU or GPU devices. To complete this project, I restarted the simulation from scratch in C99, I'll explain why in the report.

How was divided the work between the CPU and GPU in order to minimize data movements and synchronization ?

So, the goal of the project was to make the less possible communication between CPU and GPU. In my approach, I saw way more potential in data parallelism with the GPU than in task parallelism with the CPU. I completely reimplemented my code in C99 in order to have an algorithm where I could do data parallelism on EVERY step. This means that there is no logic running at all on the CPU. The CPU code only contains the initial structure creation and kernel calls. Also, all the data is stored only on the device memory, and recovered once for display purposes.

To achieve this, I had to:

  • Switch from tree with pointers that are linking the nodes to a static sized array representation that could be allocated once
     /--------|-------|
|0| |1|2|3|4| |5|6|7|8|9|10|11|12|13|...
 \__|_______|

So in this array, cell 0's children are cells 1, 2, 3 & 4, cell 1's children are cells 5, 6, 7, & 8 etc ...

  • Store bodies inside a static sized array. The array is sorted by cell index (index of the cell linked to the body)
  • Bodies contains a link (index) to their cell
  • Each cell contains an index to the beginning of its body chunk, and a body count
  • Maximum depth is a key concept. A lot of computations are done layer by layer in the tree.

And now the algorithm is able to run in a data parallel fashion, all the informations needed are accessible in a direct and independant manner if computations are correctly dispatched.

How are kernel codes for the CPU and GPU performing the corresponding computations ?

I don't have any CPU kernel. All the kernels are GPU. I will describe every step and kernel, and assemble everything at the end.

init

Type Work Size Dimensions
Host Code none none

The initialisation occurs once.

  • Allocate bodies, cells and galaxy_infos in heap
  • Initialize values
  • Create device buffers
  • Copy bodies, cells and galaxy_infos in buffers
  • Keep heap allocation from bodies and cells (to reuse for recovery)
  • Initialize OpenCL context
  • Load Kernels
  • Compile Kernels

galaxy_contains_losts

Type Work Size Dimensions
GPU Kernel Body Count 2 (one for each galaxy)
  • Check if cell index of the body is 0
  • If it is, it means that the body is "lost" (not assigned to any cell), set return value to true
  • If it isn't, it means that the body is already assigned to a cell

galaxy_dispatch_losts

Type Work Size Dimensions
GPU Kernel Body Count 2 (one for each galaxy)
  • From (x,y) position, recover index of corresponding cell on the last layer of the tree (this can be done trivially without loops)
  • Go up in tree (can be done without loops, by computing parent index from current cell index)
  • When active cell found (cells have active flags), assign body to cell (set cell index on body)

galaxy_contains_sub_dispatchables

Type Work Size Dimensions
GPU Kernel Cell Count 2 (one for each galaxy)
  • Check on all cells at the same time if any contains more than one body and is not on the last layer
  • Sets return value accordingly

galaxy_dispatch_sub_dispatchables

Type Work Size Dimensions
GPU Kernel Cell Count on a layer 2 (one for each galaxy)
  • Kernel called once per layer, starting from top layer, ignoring last layer
  • Check if any cell contains more than one body
  • If it does, correctly assign bodies to children cells (and activate them if needed)
  • This will dispatch all the bodies as the bodies are "cascading" from higher levels to lower levels

galaxy_clear_inactivecells

Type Work Size Dimensions
GPU Kernel Cell Count on a layer 2 (one for each galaxy)
  • Kernel called once per layer, starting from last layer
  • Check if cell has no children (or is on last layer), no bodies, and deactivate accordingly
  • This will clear any superfluous cells, starting from the lower layers up to the top

galaxy_compute_com

Type Work Size Dimensions
GPU Kernel Cell Count on a layer 2 (one for each galaxy)
  • Kernel called once per layer, starting from last layer
  • Computes center of mass by combining children centers of masses (unless the cells contains bodies)
  • This will compute every cell's center of mass from lower layers up to the top

galaxy_compute_accelerations

Type Work Size Dimensions
GPU Kernel Cell Count on a layer, Body Count 3 (one for each galaxy, one for cells on a layer, one for all bodies)
  • Kernel called once per layer, starting from the top layer
  • Computes at the same time all the bodies x all the cells of a layer
  • Keeps track of computed cells for a specific body inside and compute history buffer
  • Prevents useless computation by checking if parent cell has been computed (sets current as true if it is the case)
  • Computes forces if cell contains bodies or if ratio tells us to consider a cell as a body and compute forces with its center of mass
  • Sets current cell as computed in compute history

body_sort

Type Work Size Dimensions
GPU Kernel Body Count 2 (one for each galaxy)
  • For every body, compute amount of different bodies that have cell indexes lower than selected body
  • Bodies with the same cell index are compared using their actual index in the body array
  • Place into new position, inside a secondary body container initialized at the beginning of the process
  • Host codes copies secondary buffer into primary

body_apply_accelerations

Type Work Size Dimensions
GPU Kernel Body Count 2 (one for each galaxy)
  • For every body, applies cached acceleration to speed, sets cache to 0, applies speed to position
  • Check if body is getting out of cell. If it does, set cell index on the body to 0 (considered as lost)

cell_clear_idxs

Type Work Size Dimensions
GPU Kernel Cell Count 2 (one for each galaxy)
  • Sets body index to 0 (body index of 0 means there are no assigned bodies)
  • Sets body count to 0

cell_set_idxs

Type Work Size Dimensions
GPU Kernel Body Count - 1 2 (one for each galaxy)
  • Every thread checks current and next body in the body array
  • If next body has different index than current body, takes the linked cell of the next body, and set its start index

cell_set_amount

Type Work Size Dimensions
GPU Kernel Body Count - 1 2 (one for each galaxy)
  • Every thread checks current and next body in the body array
  • If current body has different index than next body, takes the linked cell of the current body, and set its body count
# Initialise cells and bodies
init()

while (running)

    # While lost bodies are present, assign them to according cell
    # Bodies are then resorted, and cells are relinked to bodies
    while (galaxy_contains_losts())
        galaxy_dispatch_losts()
        body_sort()
        cell_clear_idxs()
        cell_set_idxs()
        cell_set_amount()

    # While sub dispatchables cells are present, dispatch their bodies to children cells
    while (galaxy_contains_sub_dispatchables())
        # Every time a layer is processed, bodies are resorted, cells are relinked to bodies to ensure that next layer
        # computation is correct
        for (layer_number in tree)
            galaxy_dispatch_sub_dispatchables(layer_number)
            body_sort()
            cell_clear_idxs()
            cell_set_idxs()
            cell_set_amount()

    # Check for cells to deactivate
    galaxy_clear_inactive_cells()

    # Compute centers of masses
    galaxy_compute_com()

    # Compute accelerations
    galaxy_compute_accelerations()

    # Apply computation
    body_apply_accelerations()

    # Recover bodies + cells, display

How the host program guarantees correctness and performance ?

The host program will take care of every kernel call, and will properly synchronise calls between each other. It will also take care of the display, and will limit computations per second depending on current FPS.

Conclusion

With Native Threads, I achieved smooth 700 body simulation.

With TBB, I achieved smooth 700 body simulation.

With CUDA, task parallelism and data parallelism, I achieved smooth 1.5k simulation.

With OpenCL, this "new" algorithm and only data parallelism, I achieved smooth 10k simulation on GTX 1060.

oclbhgs's People

Contributors

mortimr avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

oclbhgs's Issues

Image stalls on startup with Intel® HD Graphics 620 (Kaby Lake GT2)

Hi @mortimr I wanted to try this project on Ubuntu with the Intel® HD Graphics 620 (Kaby Lake GT2) graphics card. Had some issues during the build and compilations so I had to tweak it for my particular case. On startup it shows this still image(attached) instead of moving. Do you have an idea why it may do that or how to make it flow smoothly?

Unable to compile on ArchLinux

Hey, I wanted to try your project you wrote about on dev.to but I'm unfortunately unable to compile link on ArchLinux:

[ 14%] Building C object CMakeFiles/oclbhgs.dir/main.c.o
[ 28%] Building C object CMakeFiles/oclbhgs.dir/sources/logics/galaxy.c.o
[ 42%] Building C object CMakeFiles/oclbhgs.dir/sources/logics/body.c.o
[ 57%] Building C object CMakeFiles/oclbhgs.dir/sources/logics/cell.c.o
[ 71%] Building C object CMakeFiles/oclbhgs.dir/sources/logics/ocl.c.o
/home/terence/oclbhgs/sources/logics/ocl.c: In function ‘ocl_init’:
/home/terence/oclbhgs/sources/logics/ocl.c:92:5: warning: ‘clCreateCommandQueue’ is deprecated [-Wdeprecated-declarations]
   92 |     ocl->queue = clCreateCommandQueue(ocl->ctx, ocl->device, 0, &ocl->err);
      |     ^~~
In file included from /home/terence/oclbhgs/sources/logics/../headers/../kernel_headers/ocl.h:14,
                 from /home/terence/oclbhgs/sources/logics/../headers/ocl.h:8,
                 from /home/terence/oclbhgs/sources/logics/ocl.c:18:
/usr/include/CL/cl.h:1443:1: note: declared here
 1443 | clCreateCommandQueue(cl_context                     /* context */,
      | ^~~~~~~~~~~~~~~~~~~~
/home/terence/oclbhgs/sources/logics/ocl.c:124:66: warning: comparison of unsigned expression in ‘< 0’ is always false [-Wtype-limits]
  124 |         if (fread(source_code, source_file_size, 1, source_file) < 0) {
      |                                                                  ^
[ 85%] Building C object CMakeFiles/oclbhgs.dir/sources/logics/ogl.c.o
[100%] Linking C executable oclbhgs
/usr/bin/ld: CMakeFiles/oclbhgs.dir/sources/logics/ogl.c.o: undefined reference to symbol 'glLoadIdentity'
/usr/bin/ld: /usr/lib/libGL.so.1: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/oclbhgs.dir/build.make:183: oclbhgs] Error 1
make[1]: *** [CMakeFiles/Makefile2:96: CMakeFiles/oclbhgs.dir/all] Error 2
make: *** [Makefile:104: all] Error 2

I'm on gcc 10.1 and with https://www.archlinux.org/packages/extra/x86_64/freeglut/ for GLUT.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.