OpenCL Memory Model
OpenCL defines a four-level memory hierarchy for the compute device:
- global memory (`__global`): shared by all processing elements, but with high access latency;
- constant memory (`__constant`): smaller, low-latency, read-only for the compute device but writable by the host CPU;
- local memory (`__local`): shared by a group of processing elements;
- private memory (`__private`): per-element registers.
Not every device needs to implement each level of this hierarchy in hardware. Consistency between the levels is relaxed and enforced only by explicit synchronization constructs, notably barriers.
Devices may or may not share memory with the host CPU. The host API provides handles to device memory buffers, and functions to transfer data back and forth between host and devices.
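As a sketch of how the four address spaces appear in OpenCL C (the kernel name and arguments here are hypothetical): each work-group stages a tile of global data in local memory, and the barrier makes the staged values visible to all work-items before the reduction.

```c
// Hypothetical OpenCL C kernel touching all four address spaces.
__kernel void tile_sum(__global const float *in,   // global memory
                       __constant float *scale,    // constant memory
                       __local float *tile,        // local memory (per work-group)
                       __global float *out)
{
    size_t lid = get_local_id(0);                  // lid lives in private memory
    tile[lid] = in[get_global_id(0)] * scale[0];
    barrier(CLK_LOCAL_MEM_FENCE);                  // sync within the work-group only
    if (lid == 0) {
        float sum = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += tile[i];
        out[get_group_id(0)] = sum;                // one partial sum per work-group
    }
}
```

Note that the barrier synchronizes only within a work-group; there is no cross-group barrier, which is exactly the relaxed consistency described above.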
CUDA
shared memory
- The CUDA C compiler treats variables in shared memory differently from typical variables: it creates a copy of the variable for each block launched on the GPU.
- Every thread in that block shares the copy, but threads cannot see or modify the copy belonging to other blocks.
- This provides an excellent means for threads within a block to communicate and collaborate on computations.
- Furthermore, shared memory buffers reside physically on the GPU rather than in off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than for typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad.
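A minimal sketch of the per-block scratchpad pattern (kernel name is hypothetical; assumes a launch with 256 threads per block): each block reduces its slice of the input entirely in on-chip `__shared__` memory.

```cuda
// Hypothetical CUDA kernel: each block gets its own copy of `cache`;
// threads in other blocks cannot see it.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float cache[256];              // one copy per block, on-chip
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // barrier: all writes now visible
    // Tree reduction within the block, never touching off-chip DRAM.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];           // one partial sum per block
}
```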
constant memory
- Graphics processors have so many ALUs that memory bandwidth often cannot deliver input fast enough to sustain their rate of computation; constant memory is one way to reduce the memory traffic a given problem requires.
- NVIDIA hardware provides 64 KB of constant memory that it treats differently from standard global memory. In some situations, using constant memory rather than global memory reduces the required memory bandwidth.
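A sketch of the typical use (kernel and symbol names are hypothetical): a small, read-only coefficient table lives in `__constant__` memory, and every thread in a warp reads the same entry on each loop iteration, which is the broadcast pattern constant memory serves well.

```cuda
// Hypothetical: coefficients placed in the 64 KB constant memory.
__constant__ float coeffs[8];

__global__ void poly_eval(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        // At each step, all threads in the warp read the same coeffs[k]:
        // one fetch is broadcast to the whole warp instead of n global reads.
        for (int k = 0; k < 8; ++k)
            acc = acc * x[i] + coeffs[k];
        y[i] = acc;
    }
}

// Host side: constant memory is filled with cudaMemcpyToSymbol, not cudaMemcpy:
// cudaMemcpyToSymbol(coeffs, host_coeffs, 8 * sizeof(float));
```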
texture memory
- Read-only memory used by CUDA programs.
- Used in general-purpose computing for accuracy and efficiency.
- Originally designed for the DirectX and OpenGL rendering pipelines.
more
Texture memory is optimized for 2D spatial locality (which is where it gets its name). Constant memory, by contrast, can be thought of as exploiting temporal locality.
The benefits of texture memory over constant memory can be summarized as follows:
- Spatial locality.
- Addressing calculations can be performed in hardware, outside the kernel.
- Data can be accessed by different variables in a single operation.
- 8-bit and 16-bit data can be automatically converted to floating-point values between 0.0 and 1.0.
more 2
- Constant memory is optimized for broadcast, i.e. when the threads in a warp all read the same memory location. If they are reading different locations, it will work, but each different location referenced by a warp costs more time. When a read is being broadcast to the threads, constant memory is MUCH faster than texture memory.
- Texture memory has high latency, even for cache hits. You can think of it as a bandwidth aggregator - if there's reuse that can be serviced out of the texture cache, the GPU does not have to go out to external memory for those reads. For 2D and 3D textures, the addressing has 2D and 3D locality, so cache line fills pull in 2D and 3D blocks of memory instead of rows.
- Finally, the texture pipeline can perform "bonus" calculations: dealing with boundary conditions ("texture addressing") and converting 8- and 16-bit values to unitized float are examples of operations that can be done "for free." (they are part of the reason texture reads have high latency)
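The "bonus" calculations above can be seen in a sketch using the texture-object API (kernel name is hypothetical; assumes a `cudaTextureObject_t` created from a `cudaArray` with the address mode and read mode set in its `cudaTextureDesc`). Boundary handling and 8-bit-to-float normalization are configured once in the descriptor and then applied by the hardware on every fetch.

```cuda
// Hypothetical 2D texture read: the texture cache pulls in a 2D block
// around (x, y), exploiting spatial locality in both dimensions.
__global__ void sample(cudaTextureObject_t tex, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        // +0.5f addresses the texel center; out-of-range coordinates are
        // handled "for free" by the configured texture addressing mode.
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}
```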
- Thread: from a software standpoint, a thread is a computation that can be paused and resumed. In principle a thread need not correspond to anything in hardware (one can have many threads on a single-core CPU). A hardware design can support fast pause and resume of threads by providing several sets of working registers, one per thread that the scheduler keeps in flight. When we talk about the number of GPU threads, we mean the maximum number of working-register sets each execution unit provides, multiplied by the number of execution units.
- Warp: a set of threads that all share the same code, follow the same execution path with minimal divergence, and are expected to stall at the same places. A hardware design can exploit the commonality of the threads belonging to a warp by combining their memory accesses and by pausing and resuming all of them together, rather than deciding on a per-thread basis.
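Because the 32 threads of a warp advance together, they can exchange register values directly with warp shuffles, needing neither shared memory nor a block-wide barrier. A sketch (kernel name is hypothetical; assumes blockDim.x is a multiple of 32):

```cuda
// Hypothetical warp-level reduction using shuffle intrinsics (CUDA 9+).
__global__ void warp_sum(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];
    // Shuffle-down reduction across the warp's 32 lanes; the full mask
    // 0xffffffff says every lane participates.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if ((threadIdx.x & 31) == 0)          // lane 0 of each warp holds the sum
        out[i / 32] = v;
}
```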
https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
https://www.quora.com/What-is-a-warp-and-how-is-it-different-from-a-thread-block-or-wave-in-CUDA
// -----
// cuda: grid -> block -> thread, __shared__
// cl: ndrange -> work_group -> work_item, __local
// -----
// size_t tid = get_local_id(0);
// size_t tid = threadIdx.x;
// -----
// size_t gid = get_global_id(0);
// size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
// -----
// size_t window = get_local_size(0);
// size_t window = blockDim.x;
// -----
// size_t stride = get_global_size(0);
// size_t stride = gridDim.x * blockDim.x;
// -----
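Putting the expressions above together (kernel name is hypothetical): the global-id/stride pair gives the standard grid-stride loop, which handles inputs larger than the total number of launched threads. The OpenCL version is identical with `get_global_id(0)` and `get_global_size(0)`.

```cuda
// Hypothetical grid-stride SAXPY using the CUDA index expressions above.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    size_t gid    = blockIdx.x * blockDim.x + threadIdx.x;  // get_global_id(0)
    size_t stride = gridDim.x * blockDim.x;                 // get_global_size(0)
    for (size_t i = gid; i < (size_t)n; i += stride)
        y[i] = a * x[i] + y[i];
}
```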