
Comments (4)

ytgui commented on August 24, 2024

OpenCL Memory Model

Description

OpenCL defines a four-level memory hierarchy for the compute device:

  • global memory: shared by all processing elements, but with high access latency (__global);
  • constant memory: smaller and lower-latency, writable by the host CPU but read-only for the compute devices (__constant);
  • local memory: shared by a group of processing elements, i.e. a work-group (__local);
  • private memory: per-element registers (__private).

Not every device needs to implement each level of this hierarchy in hardware. Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.

Devices may or may not share memory with the host CPU. The host API provides handles on device memory buffers and functions to transfer data back and forth between host and devices.
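
A rough CUDA analogue of these four levels, anticipating the OpenCL-to-CUDA mapping in a later comment (a minimal, hypothetical sketch; the kernel and names are illustrative, not from the original): __global corresponds to ordinary device pointers, __constant to __constant__, __local to __shared__, __private to per-thread locals, and barrier(CLK_LOCAL_MEM_FENCE) to __syncthreads().

__constant__ float coeff[16];                    // ~ OpenCL __constant

// Launch with up to 256 threads per block (assumed tile size).
__global__ void scale(const float *in,           // ~ OpenCL __global
                      float *out, int n) {
    __shared__ float tile[256];                  // ~ OpenCL __local (one copy per block)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = 0.0f;                              // ~ OpenCL __private (register)
    if (i < n) x = in[i] * coeff[i % 16];
    tile[threadIdx.x] = x;
    __syncthreads();                             // ~ OpenCL barrier(CLK_LOCAL_MEM_FENCE)
    if (i < n) out[i] = tile[threadIdx.x];
}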

from temp.

ytgui commented on August 24, 2024

CUDA

Description

shared memory

  1. The CUDA C compiler treats variables in shared memory differently from typical variables: it creates a copy of each such variable for every block launched on the GPU.
  2. Every thread in a block shares that block's copy, but threads cannot see or modify the copies belonging to other blocks.
  3. This gives threads within a block an excellent means to communicate and collaborate on computations.
  4. Furthermore, shared memory buffers reside physically on the GPU rather than in off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than for typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad (see the sketch after this list).
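
A minimal sketch of this per-block scratchpad pattern (the kernel name and the fixed block size of 256 are assumptions, not from the original): each block reduces its slice of the input in shared memory and writes one partial sum.

// Assumes a launch with exactly 256 threads per block (a power of two).
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float cache[256];          // one copy per block, private to its threads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                      // all loads visible before the tree reduction
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];       // one partial sum per block
}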

constant memory

  1. There are so many ALUs on graphics processors that sometimes we simply cannot keep input coming to them fast enough to sustain such high rates of computation.
  2. Constant memory therefore exists to reduce the amount of memory traffic required for a given problem.
  3. NVIDIA hardware provides 64KB of constant memory that it treats differently from standard global memory. In some situations, using constant memory rather than global memory reduces the required memory bandwidth (see the sketch after this list).
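
A hedged sketch of the broadcast pattern (the filter kernel and the 64-element filter size are assumptions, not from the original): because every thread in a warp reads the same filter[j] in a given loop iteration, the read is serviced as a broadcast from the constant cache instead of generating per-thread global loads.

__constant__ float filter[64];   // resides in the 64KB constant space

// Assumes `in` is padded with at least k extra elements past index n.
__global__ void apply_filter(const float *in, float *out, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int j = 0; j < k; ++j)
        acc += filter[j] * in[i + j];    // filter[j] is identical across the warp: broadcast
    out[i] = acc;
}

// Host side: copy coefficients into the constant symbol before launching.
// cudaMemcpyToSymbol(filter, host_filter, k * sizeof(float));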

texture memory

  1. Read-only memory used by CUDA programs.
  2. Used in general-purpose computing for accuracy and efficiency.
  3. Originally designed for the DirectX and OpenGL rendering pipelines.

more
Texture memory is optimized for 2D spatial locality (which is where it gets its name). You can think of constant memory as taking advantage of temporal locality.

The benefits of texture memory over constant memory can be summarized as follows:

  1. Spatial locality.
  2. Addressing calculations can be performed outside the kernel, by dedicated hardware.
  3. Data can be accessed by different variables in a single operation.
  4. 8-bit and 16-bit data can be automatically converted to floating-point numbers between 0 and 1.

more 2

  1. Constant memory is optimized for broadcast, i.e. when the threads in a warp all read the same memory location. Reading different locations still works, but each distinct location referenced by a warp costs additional time. When a read is broadcast to the threads, constant memory is MUCH faster than texture memory.
  2. Texture memory has high latency, even for cache hits. You can think of it as a bandwidth aggregator: if there is reuse that can be serviced out of the texture cache, the GPU does not have to go out to external memory for those reads. For 2D and 3D textures, the addressing has 2D and 3D locality, so cache-line fills pull in 2D and 3D blocks of memory instead of rows.
  3. Finally, the texture pipeline can perform "bonus" calculations: handling boundary conditions ("texture addressing") and converting 8- and 16-bit values to unitized floats are examples of operations that come "for free" (and are part of the reason texture reads have high latency). A setup sketch follows this list.
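
A hedged sketch of these points using the CUDA texture-object API (all names here are illustrative, not from the original): the texture descriptor requests clamped addressing, so boundary handling happens in the addressing hardware, and neighboring tex2D reads benefit from the cache's 2D locality.

#include <cuda_runtime.h>

__global__ void blur(cudaTextureObject_t t, float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // Neighboring fetches hit the texture cache (2D locality);
    // out-of-range coordinates are clamped by the addressing hardware.
    float s = tex2D<float>(t, x - 1, y) + tex2D<float>(t, x + 1, y)
            + tex2D<float>(t, x, y - 1) + tex2D<float>(t, x, y + 1);
    out[y * w + x] = 0.25f * s;
}

// Host-side setup (sketch): bind a CUDA array of floats to a texture object.
cudaTextureObject_t make_texture(const float *host_img, int w, int h) {
    cudaArray_t arr;
    cudaChannelFormatDesc cd = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &cd, w, h);
    cudaMemcpy2DToArray(arr, 0, 0, host_img, w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;     // boundary handling in hardware
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode     = cudaFilterModePoint;
    td.readMode       = cudaReadModeElementType;  // cudaReadModeNormalizedFloat would map
                                                  // 8/16-bit integer data into [0, 1]
    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}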

from temp.

ytgui commented on August 24, 2024
  • Thread: from a software standpoint, a thread is a computation that can be paused and resumed. In principle a thread need not have any reflection in the hardware (i.e. one can have threads on a single-core CPU). A hardware design can support fast pausing and resuming of threads by providing several sets of working registers, one per thread that the scheduler keeps in flight. When we talk about the number of GPU threads, we mean the maximum number of working-register sets each execution unit provides, multiplied by the number of execution units.
  • Warp: a set of threads that all share the same code, follow the same execution path with minimal divergence, and are expected to stall at the same places. A hardware design can exploit the commonality of the threads belonging to a warp by combining their memory accesses and by pausing and resuming all of them at the same time, rather than deciding on a per-thread basis (see the sketch after the links below).

https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
https://www.quora.com/What-is-a-warp-and-how-is-it-different-from-a-thread-block-or-wave-in-CUDA
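
A minimal sketch of warp-level lockstep execution (the kernel name is invented; assumes blockDim.x is a multiple of 32 and CUDA 9+ for __shfl_down_sync): every lane of the warp runs the same loop, and the shuffle exchanges registers directly between lanes without touching memory.

__global__ void warp_sum(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];
    // All 32 lanes of the warp run these iterations in lockstep.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);  // lane k receives lane k+offset
    if (threadIdx.x % warpSize == 0)
        out[i / warpSize] = v;                         // one sum per warp
}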

from temp.

ytgui commented on August 24, 2024
// -----
// cuda: grid -> block -> thread, __shared__
// cl:   nd_range -> work_group -> work_item, __local
// -----
// size_t tid = get_local_id(0);                        // OpenCL
// size_t tid = threadIdx.x;                            // CUDA
// -----
// size_t gid = get_global_id(0);                       // OpenCL
// size_t gid = blockIdx.x * blockDim.x + threadIdx.x;  // CUDA
// -----
// size_t window = get_local_size(0);                   // OpenCL
// size_t window = blockDim.x;                          // CUDA
// -----
// size_t stride = get_global_size(0);                  // OpenCL
// size_t stride = gridDim.x * blockDim.x;              // CUDA
// -----
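
Putting the last two mappings together gives the common grid-stride loop (a hedged sketch; the saxpy kernel is illustrative, not from the original): each thread starts at its global id and advances by the total number of launched threads, so arrays larger than the launch are still fully covered.

__global__ void saxpy(float a, const float *x, float *y, size_t n) {
    size_t gid    = blockIdx.x * blockDim.x + threadIdx.x;  // get_global_id(0)
    size_t stride = gridDim.x * blockDim.x;                 // get_global_size(0)
    for (size_t i = gid; i < n; i += stride)                // grid-stride loop
        y[i] = a * x[i] + y[i];
}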

from temp.
