unisa-hpc / sycl-bench
SYCL Benchmark Suite
License: BSD 3-Clause "New" or "Revised" License
TimeMetricsProcessor::emitResults leads to a segfault if the very first benchmark run fails (e.g. doesn't verify) and that run also happens to be the warmup run. The crash occurs e.g. at the line double median = resultsSeconds[resultsSeconds.size() / 2];, which then presumably indexes into an empty vector.
This is triggered, for example, by running ./nbody --size=1024 --local=256 --warmup-run with AdaptiveCpp's -DHIPSYCL_TARGETS=generic, as hierarchical parallel_for isn't supported with this compilation flow yet. In this case, specifying --no-verification or leaving out --warmup-run works around the issue.
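A minimal sketch of a defensive fix, assuming the statistics step can report "no valid runs" instead of computing a median over an empty vector (medianSeconds is a hypothetical helper, not part of sycl-bench):

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical helper: returns std::nullopt when there are no valid
// results (e.g. the only run was a failed warmup run), instead of
// indexing out of bounds with resultsSeconds[resultsSeconds.size() / 2].
std::optional<double> medianSeconds(std::vector<double> resultsSeconds) {
  if(resultsSeconds.empty())
    return std::nullopt; // no valid runs: emit "N/A" rather than crash
  std::sort(resultsSeconds.begin(), resultsSeconds.end());
  return resultsSeconds[resultsSeconds.size() / 2];
}
```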
Reports indicate that there's an issue with nbody on ComputeCpp. But what could it be?
[ 3%] Building CXX object CMakeFiles/sobel7.dir/single-kernel/sobel7.cpp.o
Unsupported kernel parameter type
UNREACHABLE executed at /home/cc/sycl_workspace/llvm/clang/lib/Sema/SemaSYCL.cpp:1039!
Right now it is not clear which benchmarks are new, and which are ports of existing benchmarks. We should add notes in each benchmark source file, referring to the original source, if there is any.
As previously discussed, I think that the current implementation of the DRAM microbenchmark is not ideal if our goal is to simply measure throughput. The benchmark is currently based on the paper "GPGPU Power Modeling for Multi-Domain Voltage-Frequency Scaling" (Guerreiro18), in which it was used to characterize the power consumption of GPUs when stressing device memory. However, the kernel doesn't just do a memory copy; it also includes a parameter N to introduce a certain level of arithmetic intensity. I'm not certain why they have that in there, but I assume they have their reasons. For us, however, this means adding another dimension to the benchmark, which makes it harder to evaluate and compare results.
To further complicate things, the actual implementation of the DRAM benchmark used in Guerreiro18, which can be found on GitHub (here), appears to be completely different. In this version the arithmetic operations are gone and the kernel mostly consists of copy instructions. However, it still does some strange things I'm not entirely clear about.
Here's my SYCL implementation of this kernel:
cgh.parallel_for<DRAMKernel>(global_size, [=](s::id<1> gid) {
  for(size_t j = 0; j < COPY_ITERATIONS; j += UNROLL_ITERATIONS) {
    for(size_t i = 0; i < UNROLL_ITERATIONS; ++i) {
      DataT r0 = in[gid[0] + STRIDE * i];
      out[gid[0] + STRIDE * i] = r0;
    }
  }
});
As you can see, since the loop variable j is not used inside the loop body, this kernel repeatedly copies the same chunk of memory from one buffer into the other; by default COPY_ITERATIONS is 1024. In each iteration, UNROLL_ITERATIONS (default 32) copies are made. To allow for memory coalescing, each thread accesses the buffer linearly based on its global id. Each subsequent copy is offset by a multiple of STRIDE. This parameter is sensitive to cache sizes: if chosen too small, caching can occur. The implementation of Guerreiro18 uses an offset of 64 * 1024, and profiling reveals a 0% L2 hit rate on my GTX 2070. If I change this to 32 * 1024, however, I already get an 11% L2 hit rate. How large the stride has to be depends on the number of work items, cache sizes and so on. All in all, this version of the benchmark essentially introduces three additional parameters, greatly increasing its complexity without a clear benefit.
Furthermore, in order to achieve the highest throughput with this benchmark, it weirdly needs to be compiled without any assembly-level optimizations (by passing -O0 to ptxas). This is of course hard to replicate with hipSYCL, and my SYCL implementation of Guerreiro18 thus only achieves around ~300 GiB/s, i.e., 67% of the GTX 2070's 448 GiB/s theoretical maximum. This is a bit less than what the CUDA version achieves with optimizations (311 GiB/s), and substantially less than what it achieves when compiled without optimizations (325 GiB/s).
Long story short, instead of implementing Guerreiro18, I would suggest going for the simplest possible memcopy kernel, i.e.:
cgh.parallel_for<DRAMKernel>(global_size, [=](s::id<1> gid) {
  out[gid] = in[gid];
});
With this, I'm able to get ~358 GiB/s, which is about 80% of the theoretical maximum, and probably quite close to what we can reasonably expect to achieve.
Note that to get good results, the buffer size needs to be large enough, i.e. at least several hundred megabytes. If we do a plain mapping of --size to buffer size, this means that the specified sizes need to be rather large. For example, I used --size=939524096 (which equals 3.5 GiB).
This is of course quite a large number, and a much larger size than many other benchmarks can support (e.g. if it is used as the range in both dimensions of a 2D buffer). One possible solution would be to always multiply the size by some fixed factor, e.g. 1024. However, I generally feel that the mapping of the --size parameter to the actual workload is too arbitrary already. It might be convenient for running lots of benchmarks in batch, but in reality I think we'll have to hand-tune these values (as well as any additional parameters a benchmark might have) for each individual platform anyway. I'm thus wondering whether individual parameters for each benchmark would make more sense (e.g. having a buffer-size= for this one).
In the same vein, having the ability to compute custom metrics would also be great. For example, it would be cool if this benchmark could actually print its achieved throughput, instead of having to manually compute it.
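Such a custom metric could be as simple as converting the bytes moved and the measured kernel time into GiB/s. A minimal sketch under assumed names (the runtime would come e.g. from SYCL event profiling, and a memcopy moves 2x the buffer size, one read plus one write):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: achieved throughput in GiB/s, given the total
// bytes transferred (for a copy: bytes read + bytes written) and the
// kernel runtime in nanoseconds (e.g. command_end - command_start
// from SYCL event profiling).
double throughputGiBps(std::uint64_t totalBytes, std::uint64_t runtimeNs) {
  const double gib = static_cast<double>(totalBytes) / (1024.0 * 1024.0 * 1024.0);
  const double seconds = static_cast<double>(runtimeNs) * 1e-9;
  return gib / seconds;
}
```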
What do you guys think?
Hi,
I tried to compile sycl-bench using ComputeCpp 2.0.0, GCC 9.3.0 and Ubuntu, but the cmake tool returns an error: #include <CL/sycl.hpp> not found.
cmake command: cmake .. -DSYCL_IMPL=ComputeCpp -D COMPUTECPP_RUNTIME_LIBRARY=/opt/ComputeCPP/include -DComputeCpp_INFO_EXECUTABLE=/opt/ComputeCPP -D COMPUTECPP_RUNTIME_LIBRARY_DEBUG=/opt/ComputeCPP
What could I be doing wrong?
Rui
In SYCL 1.2.1, accessor::operator[] did not add the accessor offset. Users were instead expected to use parallel_for with an offset to achieve correct indexing.
In SYCL 2020, accessor::operator[] does add the accessor offset. The parallel_for offset is deprecated and should not be used.
The blocked_transform benchmark assumes the SYCL 1.2.1 semantics. Building it with a SYCL 2020 compiler causes the offset to be added twice, because the parallel_for is provided an offset, and accessor::operator[] now also adds the offset. So this benchmark is currently UB in any SYCL 2020 implementation.
We should change it to not use parallel_for with an offset, to make everything match again.
We should:
- pass -march=native when compiling with hipSYCL for CPU. Unlike ComputeCpp with the OpenCL CPU backend, hipSYCL cannot do runtime compilation, so it's important to optimize when compiling. This may also be relevant for Intel SYCL and ComputeCpp if we want to also benchmark their host fallbacks.
- define a TRISYCL preprocessor macro when compiling for triSYCL, so that we can add some workarounds in the code.

This benchmark uses a local memory accessor, which is illegal inside a basic hierarchical parallel_for invocation. This causes the benchmark to crash, at least on hipSYCL CPU.
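One possible direction, sketched with assumed names (queue `q`, sizes `globalSize`/`localSize`): move the kernel to an nd_range invocation, where a local memory accessor is well-defined:

```cpp
// Hypothetical fragment: local memory in an nd_range kernel instead of
// a basic hierarchical parallel_for.
q.submit([&](sycl::handler& cgh) {
  sycl::local_accessor<float, 1> scratch{sycl::range<1>{localSize}, cgh};
  cgh.parallel_for(sycl::nd_range<1>{globalSize, localSize},
      [=](sycl::nd_item<1> it) {
        // Each work item can legally use the local allocation here.
        scratch[it.get_local_id(0)] = 0.0f;
      });
});
```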
It used to work. But we don't know why. Now it doesn't and we know why. We should transform this to a state where it works and (ideally, but at a lower priority) we know why.
In file included from /home/cc/sycl-bench/include/common.h:14:
/home/cc/sycl-bench/include/command_line.h:244:8: error: use of undeclared identifier 'device_selector'
if(device_selector != "gpu") {
I am working on an example that I am trying to integrate and execute using the sycl-bench suite. I am getting a runtime error when testing it with AdaptiveCpp for both the CUDA and HIP backends. When the same kernel is executed as a standalone ACpp application, it runs without any error. I see similar behavior with one of my other applications; one similarity between the two applications is that both use an nd_range parallel_for.
Matrices are currently initialized as identity matrices. Please let me know if you have any idea about the error.
Application code:
#include "common.h"
#include <iostream>
#include <sycl/sycl.hpp>

using namespace sycl;

class MatMulBlocked
{
protected:
  BenchmarkArgs args;
  std::vector<double> initA;
  std::vector<double> initB;
  std::vector<double> initC;
  PrefetchedBuffer<double, 2> initA_buf;
  PrefetchedBuffer<double, 2> initB_buf;
  PrefetchedBuffer<double, 2> initC_buf;
  const size_t problem_size = 256;
  const size_t Ndim = 256;
  const size_t Mdim = 256;
  const size_t Pdim = 256;
  const size_t Bsize = 16;

public:
  MatMulBlocked(const BenchmarkArgs &_args) : args(_args) {}

  void setup() {
    // host memory initialization
    initA.resize(Ndim * Pdim);
    initB.resize(Pdim * Mdim);
    initC.resize(Ndim * Mdim);

    // Initialize matrix A to the identity
    for(size_t i = 0; i < Ndim; ++i) {
      for(size_t j = 0; j < Pdim; ++j) {
        initA[i * Pdim + j] = i == j;
      }
    }
    // Initialize matrix B to the identity
    for(size_t i = 0; i < Pdim; ++i) {
      for(size_t j = 0; j < Mdim; ++j) {
        initB[i * Mdim + j] = i == j;
      }
    }
    // Initialize matrix C to zero
    for(size_t i = 0; i < Ndim; ++i) {
      for(size_t j = 0; j < Mdim; ++j) {
        initC[i * Mdim + j] = 0;
      }
    }

    initA_buf.initialize(args.device_queue, initA.data(), range<2>(Ndim, Pdim));
    initB_buf.initialize(args.device_queue, initB.data(), range<2>(Pdim, Mdim));
    initC_buf.initialize(args.device_queue, initC.data(), range<2>(Ndim, Mdim));
  }

  void run(std::vector<event>& events) {
    events.push_back(args.device_queue.submit([&](handler& cgh) {
      auto in1 = initA_buf.template get_access<access::mode::read>(cgh);
      auto in2 = initB_buf.template get_access<access::mode::read>(cgh);
      auto out = initC_buf.template get_access<access::mode::read_write>(cgh);
      // Use local memory address space for local memory
      accessor<double, 2, access_mode::read_write, access::target::local> Awrk({Bsize, Bsize}, cgh);
      accessor<double, 2, access_mode::read_write, access::target::local> Bwrk({Bsize, Bsize}, cgh);

      cgh.parallel_for<class SYCL_Matmul_blocked_kernel>(
          nd_range<2>{{Ndim, Mdim}, {Bsize, Bsize}}, [=](nd_item<2> idx) {
            // This work-item will compute C(i,j)
            const size_t i = idx.get_global_id(0);
            const size_t j = idx.get_global_id(1);
            // Element C(i,j) is in block C(Iblk, Jblk)
            const size_t Iblk = idx.get_group(0);
            const size_t Jblk = idx.get_group(1);
            // C(i,j) is element C(iloc, jloc) of block C(Iblk, Jblk)
            const size_t iloc = idx.get_local_id(0);
            const size_t jloc = idx.get_local_id(1);
            // Number of blocks
            const size_t Nblk = Ndim / Bsize;
            const size_t Mblk = Mdim / Bsize;
            const size_t Pblk = Pdim / Bsize;

            for(size_t Kblk = 0; Kblk < Pblk; ++Kblk) {
              // Copy A and B into local memory
              Awrk[iloc][jloc] = in1[Iblk * Bsize + iloc][Kblk * Bsize + jloc];
              Bwrk[iloc][jloc] = in2[Kblk * Bsize + iloc][Jblk * Bsize + jloc];
              // Compute matmul for block
              for(size_t kloc = 0; kloc < Bsize; ++kloc) {
                out[i][j] += Awrk[iloc][kloc] * Bwrk[kloc][jloc];
              }
            }
          });
      args.device_queue.wait_and_throw();
    }));
  }

  bool verify(VerificationSetting &ver) {
    // Triggers writeback
    initC_buf.reset();
    bool pass = true;
    for(size_t i = 0; i < problem_size; ++i) {
      for(size_t j = 0; j < problem_size; ++j) {
        auto kernel_value = initC[i * Mdim + j];
        auto host_value = (i == j) ? 1.0 : 0.0;
        if(kernel_value != host_value) {
          pass = false;
          break;
        }
      }
    }
    return pass;
  }

  static std::string getBenchmarkName() {
    std::stringstream name;
    if(kernel_type_thorin)
      name << "Thorin_DGEMM_MatMulBlocked_";
    else
      name << "DGEMM_MatMulBlocked_";
    return name.str();
  }
};

int main(int argc, char** argv)
{
  BenchmarkApp app(argc, argv);
  app.run<MatMulBlocked>();
  return 0;
}
Error logs when offloading to CUDA device:
********** Results for DGEMM_MatMulBlocked_**********
problem-size: 3072
local-size: 256
device-name: NVIDIA GeForce RTX 2080 SUPER
sycl-implementation: hipSYCL
============== hipSYCL error report ==============
hipSYCL has caught the following undhandled asynchronous errors:
0. from /home/jgupta/development/opensycl/OpenSYCL/OpenSYCL_juhi/src/runtime/cuda/cuda_event.cpp:63 @ wait(): cuda_node_event: cudaEventSynchronize() failed (error code = CUDA:700)
The application will now be terminated.
terminate called without an active exception
zsh: IOT instruction (core dumped)
Error logs when offloading to HIP device:
********** Results for DGEMM_MatMulBlocked_**********
problem-size: 3072
local-size: 256
device-name: AMD Radeon Pro VII
sycl-implementation: hipSYCL
Memory access fault by GPU node-1 (Agent handle: 0x56096774a7e0) on address 0x7ffe8126f000. Reason: Page not present or supervisor privilege.
zsh: IOT instruction (core dumped)
Most benchmarks provide event-based profiling information. This is achieved by pushing the events returned by sycl::queue::submit into a vector of events. For reference, see single-kernel/sobel5.cpp:
void run(std::vector<sycl::event>& events) {
  events.push_back(args.device_queue.submit([&](sycl::handler& cgh) {
However, single-kernel/nbody is the only benchmark under single-kernel/* and polybench/* not taking this approach (see):
void submitHierarchical(sycl::buffer<particle_type>& particles, sycl::buffer<vector_type>& velocities) {
  args.device_queue.submit([&](sycl::handler& cgh) {
As this is the only benchmark failing to do so, I infer this is an unintended bug that should be fixed.
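A sketch of the fix, with the signature change assumed from the run() pattern used elsewhere: pass the event vector into submitHierarchical and push the submit result, as the other benchmarks do:

```cpp
// Hypothetical adjustment to nbody (kernel body elided):
void submitHierarchical(std::vector<sycl::event>& events,
    sycl::buffer<particle_type>& particles, sycl::buffer<vector_type>& velocities) {
  events.push_back(args.device_queue.submit([&](sycl::handler& cgh) {
    // ... existing kernel setup unchanged ...
  }));
}
```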
It seems that currently people have difficulty finding the SYCL 2020 benchmarks, because they live in the sycl2020 branch, which is not really mentioned anywhere. We should consider making sycl2020 the default branch, or merging it into main -- after all, it contains important functionality of this benchmark suite.

This line is a race condition:
https://github.com/bcosenza/sycl-bench/blob/5d05aa4193944ca2e42f6ac2e6bd29109dc8174e/single-kernel/scalar_prod.cpp#L131
It writes into memory that might be read by work items from work group 0. I should fix this.
The project looks really interesting, but I am unable to find a HOWTO on building it. Did I miss it somewhere?
... to make it easier to add/work with test profiles
While executing the test case blocked_transform which is present under runtime (https://github.com/bcosenza/sycl-bench/blob/master/runtime/blocked_transform.cpp), we noticed that we are getting a core dump error.
Command used to execute - ./blocked_transform --device=gpu
Output -
********** Results for Runtime_BlockedTransform_iter_64_blocksize_0**********
problem-size: 3072
local-size: 1024
device-name: NVIDIA RTX A4000
sycl-implementation: LLVM CUDA (Codeplay)
blocked_transform: /tmp/llvm-sycl-nightly-20220222/sycl/source/detail/scheduler/commands.cpp:1826: void cl::sycl::detail::adjustNDRangePerKernel(cl::sycl::detail::NDRDescT&, cl::sycl::detail::pi::PiKernel, const cl::sycl::detail::device_impl&): Assertion `NDR.NumWorkGroups[0] != 0 && NDR.LocalSize[0] == 0' failed.
Aborted (core dumped)
However, when we explicitly set the --local parameter to 256 (which is the default value) at runtime, it executes without any errors.
Command used to execute - ./blocked_transform --device=gpu --local=256
We would like to know if there is a fix for this issue and, if so, where we can get the revised code.