
unisa-hpc / sycl-bench


SYCL Benchmark Suite

License: BSD 3-Clause "New" or "Revised" License

C++ 82.05% C 0.31% CMake 11.03% Python 3.17% Ruby 3.44%
Topics: sycl, opencl, gpu, gpgpu, gpu-programming, spir-v

sycl-bench's People

Contributors

bcosenza, illuhad, jackakirk, lorenzo-carpentieri, luigi-crisci, naghasan, peterth, psalz, sami-hatna66, sohansharma, sommerlukas, victor-eds, whitneywhtsang


sycl-bench's Issues

`emitResults` has invalid memory accesses when used with `--warmup-run` & the first run fails

TimeMetricsProcessor::emitResults segfaults when the very first benchmark run fails (e.g. fails verification) and that same run also happens to be the warmup run.
For example, at the line `double median = resultsSeconds[resultsSeconds.size() / 2];`

This is triggered, for instance, by running ./nbody --size=1024 --local=256 --warmup-run with AdaptiveCpp's -DHIPSYCL_TARGETS=generic, since hierarchical parallel_for is not yet supported with this compilation flow.
In that case, specifying --no-verification or leaving out --warmup-run works around the issue.

Questions about the single kernel set

[ 3%] Building CXX object CMakeFiles/sobel7.dir/single-kernel/sobel7.cpp.o
Unsupported kernel parameter type
UNREACHABLE executed at /home/cc/sycl_workspace/llvm/clang/lib/Sema/SemaSYCL.cpp:1039!

Some thoughts on DRAM throughput benchmarking

As previously discussed, I think that the current implementation of the DRAM microbenchmark is not ideal, if our goal is to simply measure throughput. The benchmark is currently based on the paper "GPGPU Power Modeling for Multi-Domain Voltage-Frequency Scaling" (Guerreiro18), in which they used it to characterize the power consumption of GPUs when stressing device memory. However, the kernel itself doesn't just do a memory copy, but also includes a parameter N to introduce a certain level of arithmetic intensity. I'm not certain why they have that in there, but I would assume they have their reasons. For us however this means adding another dimension to the benchmark, which makes it harder to evaluate and compare results.

To further complicate things, the actual implementation of the DRAM benchmark used in Guerreiro18, which can be found on GitHub (here), appears to be completely different. In this version, the arithmetic operations are gone, and the kernel mostly consists of copy instructions. However, it still does some strange things I'm not entirely clear about.

Here's my SYCL implementation of this kernel:

cgh.parallel_for<DRAMKernel>(global_size, [=](s::id<1> gid) {
  for(size_t j = 0; j < COPY_ITERATIONS; j += UNROLL_ITERATIONS) {
    for(size_t i = 0; i < UNROLL_ITERATIONS; ++i) {
      DataT r0 = in[gid[0] + STRIDE * i];
      out[gid[0] + STRIDE * i] = r0;
    }
  }
});

As you can see, since the loop variable j is not used inside the loop body, this kernel repeatedly copies the same chunk of memory from one buffer into the other; by default COPY_ITERATIONS is 1024. In each iteration, UNROLL_ITERATIONS (default 32) copies are made. To allow for memory coalescing, each thread accesses the buffer linearly based on its global id. Each subsequent copy is offset by a multiple of STRIDE. This parameter is sensitive to cache sizes: if chosen too small, caching can occur. In the implementation of Guerreiro18, they use an offset of 64 * 1024, and profiling reveals a 0% L2 hit rate on my RTX 2070. If I change this to 32 * 1024, however, I already get an 11% L2 hit rate. How large this stride has to be exactly depends on the number of work items, cache sizes and so on. All in all, this version of the benchmark essentially introduces three additional parameters, greatly increasing its complexity, without a clear benefit.

Furthermore, to achieve the highest throughput with this benchmark, it oddly needs to be compiled without any assembly-level optimizations (by passing -O0 to ptxas). This is of course hard to replicate with hipSYCL, and my SYCL implementation of Guerreiro18 thus only achieves ~300 GiB/s, i.e., 67% of the RTX 2070's 448 GiB/s theoretical maximum. This is a bit less than the CUDA version achieves with optimizations (311 GiB/s), and substantially less than it achieves when compiled without optimizations (325 GiB/s).

Long story short, instead of implementing Guerreiro18, I would suggest going with the simplest possible memcopy kernel, i.e.:

cgh.parallel_for<DRAMKernel>(global_size, [=](s::id<1> gid) {
  out[gid] = in[gid];
});

With this, I'm able to get ~358 GiB/s, which is about 80% of the theoretical maximum, and probably quite close to what we can reasonably expect to achieve.

Note that to get good results, the buffer size needs to be large enough, i.e. at least several hundred megabytes. If we do a plain mapping of --size to buffer size, this means that the specified sizes need to be rather large. For example I used --size=939524096 (which equals 3.5 GiB).

This is of course quite a large number, and a much larger size than many other benchmarks can support (if, e.g., it is used as the range in both dimensions of a 2D buffer). One possible solution would be to always multiply the size by some fixed factor, e.g. 1024. However, I generally feel like the mapping of the --size parameter to the actual workload is already too arbitrary. It might be convenient for running lots of benchmarks in batch, but in practice I think we'll have to hand-tune these values (as well as any additional parameters a benchmark might have) for each individual platform anyway. I'm thus wondering whether having individual parameters for each benchmark would make more sense (e.g. a buffer-size= for this one).

In the same vein, the ability to compute custom metrics would also be great. For example, it would be cool if this benchmark could print its achieved throughput directly, instead of requiring us to compute it manually.

What do you guys think?

Problem in compilation stage with ComputeCpp 2.0.0

Hi,

I tried to compile sycl-bench using ComputeCpp 2.0.0, GCC 9.3.0, and Ubuntu, but the cmake tool returned an error: #include <CL/sycl.hpp> not found.

cmake command: cmake .. -DSYCL_IMPL=ComputeCpp -D COMPUTECPP_RUNTIME_LIBRARY=/opt/ComputeCPP/include -DComputeCpp_INFO_EXECUTABLE=/opt/ComputeCPP -D COMPUTECPP_RUNTIME_LIBRARY_DEBUG=/opt/ComputeCPP

What could I be doing wrong?

Rui

blocked_transform is broken due to SYCL 2020 offset semantics

In SYCL 1.2.1, accessor::operator[] did not add the accessor offset. Users were instead expected to use parallel_for with offset to achieve correct indexing.

In SYCL 2020, accessor::operator[] does add the accessor offset. parallel_for offset is deprecated and should not be used.

The blocked_transform benchmark assumes the SYCL 1.2.1 semantics. Building it with a SYCL 2020 compiler causes the offset to be added twice, because the parallel_for is provided an offset, and additionally the accessor::operator[] now also adds the offset. So this benchmark is currently UB in any SYCL 2020 implementation.

We should change it to not use parallel_for with offset to make everything match again.

More special treatment for implementations in CMakeLists.txt

We should:

  • Add -march=native when compiling with hipSYCL for CPU. Unlike the ComputeCpp+OpenCL CPU backend, hipSYCL cannot do runtime compilation, so it is important to optimize at compile time. This may also be relevant for Intel SYCL and ComputeCpp if we want to benchmark their host fallbacks as well.
  • Define TRISYCL preprocessor macro when compiling for triSYCL, so that we can add some workarounds in the code.

Fix the run-suite brommy.bmp not found issue

It used to work. But we don't know why. Now it doesn't and we know why. We should transform this to a state where it works and (ideally, but at a lower priority) we know why.

use of undeclared identifier 'device_selector'

In file included from /home/cc/sycl-bench/include/common.h:14:
/home/cc/sycl-bench/include/command_line.h:244:8: error: use of undeclared identifier 'device_selector'
if(device_selector != "gpu") {

Runtime failure for the DGEMM application

I am working on an example that I am trying to integrate and execute using the sycl-bench suite. I get a runtime error when testing it with AdaptiveCpp on both the CUDA and HIP backends. When the same kernel is executed as a standalone AdaptiveCpp application, it runs without any error. I see similar behavior with one of my other applications; one similarity between the two is that both use nd_range parallel_for kernels.

The matrices are currently initialized as identity matrices. Please let me know if you have any idea about the error.

Application code:

#include "common.h"

#include <iostream>
#include <sycl/sycl.hpp>

using namespace sycl;

class MatMulBlocked
{
protected:    
  
  BenchmarkArgs args;
  std::vector<double> initA;
  std::vector<double> initB;
  std::vector<double> initC;
  
  PrefetchedBuffer<double, 2> initA_buf;
  PrefetchedBuffer<double, 2> initB_buf;
  PrefetchedBuffer<double, 2> initC_buf;

  const size_t problem_size = 256;
  const size_t Ndim = 256;
  const size_t Mdim = 256;
  const size_t Pdim = 256;
  const size_t Bsize = 16;

public:
  MatMulBlocked(const BenchmarkArgs &_args) : args(_args) {}
  
  void setup() {
    // host memory initialization
    initA.resize(Ndim * Pdim);
    initB.resize(Pdim * Mdim);
    initC.resize(Ndim * Mdim);

    // Initialize matrix A to the identity
    for(size_t i = 0; i < Ndim; ++i) {
      for(size_t j = 0; j < Pdim; ++j) {
        initA[i * Pdim + j] = i == j;
      }
    }
    // Initialize matrix B to the identity
    for(size_t i = 0; i < Pdim; ++i) {
      for(size_t j = 0; j < Mdim; ++j) {
        initB[i * Mdim + j] = i == j;
      }
    }
    // Initialize matrix C to zero
    for(size_t i = 0; i < Ndim; ++i) {
      for(size_t j = 0; j < Mdim; ++j) {
        initC[i * Mdim + j] = 0;
      }
    }

    initA_buf.initialize(args.device_queue, initA.data(), range<2>(Ndim, Pdim));
    initB_buf.initialize(args.device_queue, initB.data(), range<2>(Pdim, Mdim));
    initC_buf.initialize(args.device_queue, initC.data(), range<2>(Ndim, Mdim));
  }

  void run(std::vector<event>& events) {
    events.push_back(args.device_queue.submit(
        [&](handler& cgh) {
          
      auto in1 = initA_buf.template get_access<access::mode::read>(cgh);
      auto in2 = initB_buf.template get_access<access::mode::read>(cgh);
      auto out = initC_buf.template get_access<access::mode::read_write>(cgh);

      // Use local memory address space for local memory
      accessor<double, 2, access_mode::read_write, access::target::local> Awrk({Bsize, Bsize}, cgh);
      accessor<double, 2, access_mode::read_write, access::target::local> Bwrk({Bsize, Bsize}, cgh);

      cgh.parallel_for<class SYCL_Matmul_blocked_kernel>(
        nd_range<2>{{Ndim, Mdim}, {Bsize, Bsize}}, 
          [=](nd_item<2> idx) {
            // This work-item will compute C(i,j)
            const size_t i = idx.get_global_id(0);
            const size_t j = idx.get_global_id(1);

            // Element C(i,j) is in block C(Iblk, Jblk)
            const size_t Iblk = idx.get_group(0);
            const size_t Jblk = idx.get_group(1);

            // C(i,j) is element C(iloc, jloc) of block C(Iblk, Jblk)
            const size_t iloc = idx.get_local_id(0);
            const size_t jloc = idx.get_local_id(1);

            // Number of blocks
            const size_t Nblk = Ndim / Bsize;
            const size_t Mblk = Mdim / Bsize;
            const size_t Pblk = Pdim / Bsize;

            for (size_t Kblk = 0; Kblk < Pblk; ++Kblk) {
              // Copy A and B into local memory
              Awrk[iloc][jloc] = in1[Iblk * Bsize + iloc][Kblk * Bsize + jloc];
              Bwrk[iloc][jloc] = in2[Kblk * Bsize + iloc][Jblk * Bsize + jloc];

              // Compute matmul for block
              for (size_t kloc = 0; kloc < Bsize; ++kloc) {
                out[i][j] += Awrk[iloc][kloc] * Bwrk[kloc][jloc];
              }
            }
        });
        args.device_queue.wait_and_throw();

    }));

  }

  bool verify(VerificationSetting &ver) {
    //Triggers writeback
    initC_buf.reset();
    bool pass = true;
 
    for (size_t i = 0; i < problem_size; ++i) {
        for (size_t j = 0; j < problem_size; ++j) {
            auto kernel_value = initC[i * Mdim + j];
            auto host_value = (i == j) ? 1.0 : 0.0;

            if (kernel_value != host_value) {
                pass = false;
                break;
            }
        }
    }    
    return pass;
  }
  
  static std::string getBenchmarkName() {
    std::stringstream name;
    if(kernel_type_thorin)
      name << "Thorin_DGEMM_MatMulBlocked_";
    else
      name << "DGEMM_MatMulBlocked_";    
    return name.str();
  }
};

int main(int argc, char** argv)
{
  BenchmarkApp app(argc, argv);
  app.run<MatMulBlocked>();
  return 0;
}

Error logs when offloading to CUDA device:


********** Results for DGEMM_MatMulBlocked_**********
problem-size: 3072
local-size: 256
device-name: NVIDIA GeForce RTX 2080 SUPER
sycl-implementation: hipSYCL
============== hipSYCL error report ============== 
hipSYCL has caught the following undhandled asynchronous errors: 

   0. from /home/jgupta/development/opensycl/OpenSYCL/OpenSYCL_juhi/src/runtime/cuda/cuda_event.cpp:63 @ wait(): cuda_node_event: cudaEventSynchronize() failed (error code = CUDA:700)
The application will now be terminated.
terminate called without an active exception
zsh: IOT instruction (core dumped)

Error logs when offloading to HIP device:


********** Results for DGEMM_MatMulBlocked_**********
problem-size: 3072
local-size: 256
device-name: AMD Radeon Pro VII
sycl-implementation: hipSYCL
Memory access fault by GPU node-1 (Agent handle: 0x56096774a7e0) on address 0x7ffe8126f000. Reason: Page not present or supervisor privilege.
zsh: IOT instruction (core dumped)

`single-kernel/nbody` does not provide event-based profiling

Most benchmarks provide event-based profiling information. This is achieved by pushing events returned by sycl::queue::submit into a vector of events. For reference, see single-kernel/sobel5.cpp:

void run(std::vector<sycl::event>& events) {
    events.push_back(args.device_queue.submit([&](sycl::handler& cgh) {

However, single-kernel/nbody is the only benchmark under single-kernel/* and polybench/* not taking this approach (see):

void submitHierarchical(sycl::buffer<particle_type>& particles, sycl::buffer<vector_type>& velocities) {
    args.device_queue.submit([&](sycl::handler& cgh) {

As this is the only benchmark failing to do so, I assume this is an unintended bug that should be fixed.

Decide on future of sycl2020 branch - make default branch or merge into main?

It seems that currently people have difficulty finding the SYCL 2020 benchmarks because they live in the sycl2020 branch which is not really mentioned anywhere.

We should consider

  • Either making sycl2020 the default branch
  • or just merging it into main -- after all, it contains important functionality of this benchmark suite.

build procedure

The project looks really interesting, but I am unable to find a HOWTO on building it. Did I miss it somewhere?

Issue with --local command line parameter.

While executing the blocked_transform test case, which lives under runtime (https://github.com/bcosenza/sycl-bench/blob/master/runtime/blocked_transform.cpp), we noticed a core dump.

Command used to execute - ./blocked_transform --device=gpu

Output -

********** Results for Runtime_BlockedTransform_iter_64_blocksize_0**********
problem-size: 3072
local-size: 1024
device-name: NVIDIA RTX A4000
sycl-implementation: LLVM CUDA (Codeplay)
blocked_transform: /tmp/llvm-sycl-nightly-20220222/sycl/source/detail/scheduler/commands.cpp:1826: void cl::sycl::detail::adjustNDRangePerKernel(cl::sycl::detail::NDRDescT&, cl::sycl::detail::pi::PiKernel, const cl::sycl::detail::device_impl&): Assertion `NDR.NumWorkGroups[0] != 0 && NDR.LocalSize[0] == 0' failed.
Aborted (core dumped)

However, when we explicitly set the --local parameter to 256 (which is the default value) at runtime, it executes without any errors.

Command used to execute - ./blocked_transform --device=gpu --local=256

We would like to know whether there is a fix for this issue, and if so, where we can get the revised code.
