emu-microbench's Introduction

Overview

Microbenchmarks to test the performance characteristics of the Emu Chick.

Building

Build for Emu hardware/simulator:

mkdir build-hw && cd build-hw
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=../cmake/emu-cxx-toolchain.cmake
make -j4

Build for testing on x86:

mkdir build-x86 && cd build-x86
cmake .. \
-DCMAKE_BUILD_TYPE=Debug
make -j4

Benchmarks

local_stream

Description

Allocates three arrays (A, B, C) of 2^log2_num_elements elements each on a single nodelet. Computes the vector sum C = A + B using num_threads threads and reports the average memory bandwidth.
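As a rough illustration of the kernel and the bandwidth figure, here is a minimal sketch in plain C. The function names are hypothetical, and the byte accounting (two 8-byte reads and one 8-byte write per element) is the usual STREAM "add" convention, assumed here rather than taken from the benchmark source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the C = A + B kernel over n 64-bit elements. */
void stream_add(int64_t *c, const int64_t *a, const int64_t *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* Assumption: each element moves 3 * 8 bytes (read A, read B, write C). */
double stream_bytes_per_trial(long log2_num_elements)
{
    double n = (double)((int64_t)1 << log2_num_elements);
    return 3.0 * (double)sizeof(int64_t) * n;
}

/* Bandwidth in GB/s (10^9 bytes per second) for one timed trial. */
double stream_gb_per_sec(long log2_num_elements, double seconds)
{
    return stream_bytes_per_trial(log2_num_elements) / seconds / 1e9;
}
```

Averaging this value over num_trials timed runs would give a single reported figure, if the benchmark follows this convention.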

Usage

./local_stream mode log2_num_elements num_threads num_trials

Modes

  • serial - Uses a serial for loop
  • cilk_for - Uses a cilk_for loop
  • serial_spawn - Uses a serial for loop to spawn a thread for each grain-sized chunk of the loop range
  • recursive_spawn - Recursively spawns threads to divide up the loop range
  • library - Uses emu_local_for from emu_c_utils
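The difference between serial_spawn and recursive_spawn is in how the loop range is divided. The sketch below shows both shapes in plain C so that it runs anywhere. All names are hypothetical, and each call marked "would be spawned" stands in for a cilk_spawn in the real benchmark.

```c
#include <stddef.h>

/* Hypothetical worker signature: processes the half-open range [begin, end). */
typedef void (*chunk_fn)(size_t begin, size_t end, void *ctx);

/* serial_spawn shape: a single loop hands out grain-sized chunks.
 * Precondition: grain >= 1. */
void serial_chunks(size_t begin, size_t end, size_t grain, chunk_fn fn, void *ctx)
{
    for (size_t i = begin; i < end; i += grain) {
        size_t stop = (end - i < grain) ? end : i + grain;
        fn(i, stop, ctx); /* would be spawned */
    }
}

/* recursive_spawn shape: split the range in half until it fits in one grain.
 * Precondition: grain >= 1. */
void recursive_chunks(size_t begin, size_t end, size_t grain, chunk_fn fn, void *ctx)
{
    while (end - begin > grain) {
        size_t mid = begin + (end - begin) / 2;
        recursive_chunks(mid, end, grain, fn, ctx); /* would be spawned */
        end = mid;
    }
    if (begin < end) {
        fn(begin, end, ctx);
    }
}

/* Example worker: counts the elements and chunks it was given. */
typedef struct { size_t elements; size_t chunks; } tally;

void tally_chunk(size_t begin, size_t end, void *ctx)
{
    tally *t = (tally *)ctx;
    t->elements += end - begin;
    t->chunks += 1;
}
```

With the serial shape one thread performs every spawn, while the recursive shape spreads the spawning work across a tree of depth about log2(range / grain).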

global_stream

Description

Allocates three arrays (A, B, C) of 2^log2_num_elements elements each as chunked (malloc2D) arrays distributed across all the nodelets. Computes the vector sum C = A + B using num_threads threads and reports the average memory bandwidth.

Usage

./global_stream mode log2_num_elements num_threads num_trials

Modes

  • serial - Uses a serial for loop
  • cilk_for - Uses a cilk_for loop
  • serial_spawn - Uses a serial for loop to spawn a thread for each grain-sized chunk of the loop range
  • recursive_spawn - Recursively spawns threads to divide up the loop range
  • recursive_remote_spawn - Recursively spawns threads to divide up the loop range, using remote spawns where possible
  • serial_remote_spawn - Remote spawns a thread on each nodelet, then divides up work as in serial_spawn
  • serial_remote_spawn_shallow - Like serial_remote_spawn, but all threads are remote spawned from nodelet 0
  • library - Uses emu_chunked_array_apply from emu_c_utils
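To make the chunked layout concrete, the following sketch maps a global element index to a (chunk, offset) pair. It assumes one equal-sized contiguous chunk per nodelet. The names are hypothetical and do not come from emu_c_utils.

```c
#include <stddef.h>

/* Position of an element inside a chunked array. */
typedef struct { size_t chunk; size_t offset; } chunk_pos;

/* Elements stored in each chunk, rounding up so every element has a home.
 * Precondition: num_chunks >= 1. */
size_t elements_per_chunk(size_t num_elements, size_t num_chunks)
{
    return (num_elements + num_chunks - 1) / num_chunks;
}

/* Maps global index i to its chunk (assumed to be one per nodelet) and offset.
 * Precondition: per_chunk >= 1. */
chunk_pos chunked_index(size_t i, size_t per_chunk)
{
    chunk_pos p = { i / per_chunk, i % per_chunk };
    return p;
}
```

Under this layout a thread working on consecutive indices stays on one nodelet until it crosses a chunk boundary, which is why the remote spawn modes place a thread on each nodelet first.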

global_stream_1d

Description

Allocates three arrays (A, B, C) of 2^log2_num_elements elements each as striped (malloc1dlong) arrays distributed across all the nodelets. Computes the vector sum C = A + B using num_threads threads and reports the average memory bandwidth.

Usage

./global_stream_1d mode log2_num_elements num_threads num_trials

Modes

  • serial - Uses a serial for loop
  • cilk_for - Uses a cilk_for loop
  • serial_spawn - Uses a serial for loop to spawn a thread for each grain-sized chunk of the loop range
  • library - Uses emu_1d_array_apply from emu_c_utils
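For contrast with the chunked layout, a striped array places consecutive elements on consecutive nodelets. The sketch below assumes round-robin striping at 8-byte element granularity. The names are hypothetical.

```c
#include <stddef.h>

/* Position of an element inside a striped array. */
typedef struct { size_t nodelet; size_t local_index; } stripe_pos;

/* Assumption: element i lives on nodelet i mod num_nodelets, at position
 * i / num_nodelets within that nodelet's share of the array.
 * Precondition: num_nodelets >= 1. */
stripe_pos striped_index(size_t i, size_t num_nodelets)
{
    stripe_pos p = { i % num_nodelets, i / num_nodelets };
    return p;
}
```

With this mapping a loop over consecutive indices touches a different nodelet on every iteration, so a thread that wants to stay local would step through the array with a stride of num_nodelets.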

pointer_chase

The pointer chasing benchmark is defined as follows:

  1. Allocate a contiguous array of N elements. Each element consists of an 8-byte payload and an 8-byte pointer to the next element.
  2. Form a linked list by connecting the elements in random order.
  3. Each of P threads traverses N/P nodes in the list in parallel, summing up the payloads as it goes.

The randomization of elements in step 2 can be further controlled using the block_size and sort_mode parameters. See the documentation for sort_mode below.
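The three steps above can be sketched in plain C as follows. This is an illustration under assumptions, not the benchmark's code: the names are hypothetical, the shuffle uses rand() (which has a slight modulo bias), and the traversal is shown for a single thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One list element: 8-byte payload plus 8-byte pointer on a 64-bit target. */
typedef struct node {
    int64_t payload;
    struct node *next;
} node;

/* Fisher-Yates shuffle of order[0..n). */
void shuffle_indices(size_t *order, size_t n)
{
    for (size_t i = n; i > 1; --i) {
        size_t j = (size_t)rand() % i;
        size_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}

/* Links pool[order[0]] -> pool[order[1]] -> ... and returns the head. */
node *link_in_order(node *pool, const size_t *order, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        pool[order[i]].next = (i + 1 < n) ? &pool[order[i + 1]] : NULL;
    }
    return n ? &pool[order[0]] : NULL;
}

/* Follows up to `steps` links from head, summing payloads along the way. */
int64_t chase(const node *head, size_t steps)
{
    int64_t sum = 0;
    for (; head != NULL && steps > 0; head = head->next, --steps) {
        sum += head->payload;
    }
    return sum;
}
```

For P threads, one plausible arrangement is for thread t to start at pool[order[t * (N / P)]] and call chase with N / P steps, so the per-thread sums add up to the sum over the whole list.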

Usage

./pointer_chase [OPTIONS]

    --log2_num_elements  Log2 of the number of elements in the list
    --num_threads        Number of threads traversing the list
    --block_size         Number of elements to swap at a time
    --spawn_mode         How to spawn the threads
    --sort_mode          How to shuffle the array
    --num_trials         Number of times to run the benchmark

Spawn Modes

  • serial_spawn - Uses a serial for loop to spawn a thread for each grain-sized chunk of the loop range
  • serial_remote_spawn - Remote spawns a thread on each nodelet, then divides up work as in serial_spawn

Sort Modes

This parameter controls how the elements in the list are linked together. In each example, the number refers to the index of the list element in the contiguous array. All examples use --log2_num_elements=4 and --block_size=4.

  • ordered - Each element points to the next element in the array
0->1->2->3->4->5->6->7->8->9->10->11->12->13->14->15
  • block_shuffle - The elements are grouped into blocks of size block_size, then the order of the blocks is randomized
4->5->6->7--->0->1->2->3--->12->13->14->15--->8->9->10->11
  • intra_block_shuffle - The elements are grouped into blocks of size block_size, then the order of elements within each block is randomized
2->1->3->0--->4->5->7->6--->8->10->11->9--->14->15->13->12
  • full_block_shuffle - The elements are grouped into blocks of size block_size. The order of elements within each block is randomized, and the order of the blocks is randomized.
4->5->7->6--->2->1->3->0--->14->15->13->12--->8->10->11->9
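The four sort modes differ only in whether the blocks are shuffled and whether the elements inside each block are shuffled. The sketch below builds the visiting order from those two switches. The function names and the flag-based interface are hypothetical, and rand() is used only for brevity.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fisher-Yates shuffle of v[0..n). */
static void shuffle(size_t *v, size_t n)
{
    for (size_t i = n; i > 1; --i) {
        size_t j = (size_t)rand() % i;
        size_t tmp = v[i - 1];
        v[i - 1] = v[j];
        v[j] = tmp;
    }
}

/* Fills order[0..n) with the sequence in which elements are linked.
 *   ordered:             shuffle_blocks = 0, shuffle_within = 0
 *   block_shuffle:       shuffle_blocks = 1, shuffle_within = 0
 *   intra_block_shuffle: shuffle_blocks = 0, shuffle_within = 1
 *   full_block_shuffle:  shuffle_blocks = 1, shuffle_within = 1
 * Returns 0 on success, -1 if block_size does not divide n or on
 * allocation failure. */
int make_order(size_t *order, size_t n, size_t block_size,
               int shuffle_blocks, int shuffle_within)
{
    if (block_size == 0 || n % block_size != 0) {
        return -1;
    }
    size_t num_blocks = n / block_size;
    size_t *blocks = malloc(num_blocks * sizeof *blocks);
    if (blocks == NULL && num_blocks > 0) {
        return -1;
    }
    for (size_t b = 0; b < num_blocks; ++b) {
        blocks[b] = b;
    }
    if (shuffle_blocks) {
        shuffle(blocks, num_blocks);
    }
    for (size_t b = 0; b < num_blocks; ++b) {
        size_t *dst = order + b * block_size;
        for (size_t k = 0; k < block_size; ++k) {
            dst[k] = blocks[b] * block_size + k;
        }
        if (shuffle_within) {
            shuffle(dst, block_size);
        }
    }
    free(blocks);
    return 0;
}
```

Choosing block_size to match the number of elements that fit on one nodelet would let block_shuffle vary inter-nodelet jumps while intra_block_shuffle varies only local access order. That pairing is an assumption about intent, not something stated above.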

emu-microbench's People

Contributors

ehein6

emu-microbench's Issues

Bad A register for local_stream compiled with notls

I compiled a version of local_stream from sc-driver-tests using the emu19.02 toolchain with -trace no-thread-local-storage. It ran fine for "recursive_spawn 25 65536 1" but generated the following errors for "serial_spawn". Someone else took demo1, so I tried it on xps1 and got exactly the same errors. No idea what's up with this. If it weren't repeatable, I would have blamed it on ECC errors.

Demo1:
n0:~/sc-driver-tests-v3.0.rc5_parent_19.02/tests/stress/singlenode/notls$ grep EXC *.log
mn_exec_sys.15082.n0.log:EXCEPTION=0x5, CAUSE=0x2, SRC_NDLT=0x0, A=0x0! Printing thread...
mn_exec_sys.15327.n0.log:EXCEPTION=0x6, CAUSE=0x5, SRC_NDLT=0x0, A=0x0! Printing thread...
mn_exec_usr.15082.n0.log:EXCEPTION=0x5, CAUSE=0x2, SRC_NDLT=0x0, A=0x0! Printing thread...
mn_exec_usr.15327.n0.log:EXCEPTION=0x6, CAUSE=0x5, SRC_NDLT=0x0, A=0x0! Printing thread...

XPS1:
n0:~/local_stream_parent_priority$ grep EXC *.log
mn_exec_sys.22105.n0.log:EXCEPTION=0x5, CAUSE=0x2, SRC_NDLT=0x0, A=0x0! Printing thread...
mn_exec_sys.22368.n0.log:EXCEPTION=0x6, CAUSE=0x5, SRC_NDLT=0x0, A=0x0! Printing thread...
mn_exec_usr.22105.n0.log:EXCEPTION=0x5, CAUSE=0x2, SRC_NDLT=0x0, A=0x0! Printing thread...
mn_exec_usr.22368.n0.log:EXCEPTION=0x6, CAUSE=0x5, SRC_NDLT=0x0, A=0x0! Printing thread...

STREAM over striped array is slower in C++ than in C

C++ (global_stream_cxx in striped mode) is getting about 3.1 GB/s, while C (global_stream_1d) is reaching peak (more than 8 GB/s).

Both inner loops are computing c[i] = a[i] + b[i];

In global_stream_1d.c this compiles to:

%for.body:                                       // block                   (44)
        etd       3                              // D = E3                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     7                              // A = D + E7              [3]
        lde       8                              // E8 = *A                 [3]
        etd       3                              // D = E3                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     6                              // A = D + E6              [3]
        lde       9                              // E9 = *A                 [3]
        etd       3                              // D = E3                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     5                              // A = D + E5              [3]
        etd       9                              // D = E9                  [2]
        adde      8                              // D += E8                 [3]
        wrd                                      // *A = D                  [2]
        etd       2                              // D = E2                  [2]
        dpeta     3                              // A = D + E3              [3]
        ate       3                              // E3 = A                  [2]
        etd       4                              // D = E4                  [2]
        cmpe      3                              // D ?= E3                 [3]
        td0       39, %for.end                   // E sge D                 [5]
        jmp       %for.body                      //                         [4]
%for.end:                                        // block                   (105)
        jmpe      1                              // return void             [3]

In global_stream_cxx.cc, the inner loop compiles to:

%for.body:                                       // block                   (106)
        etd       2                              // D = E2                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     8                              // A = D + E8              [3]
        ld32l                                    // D = *A                  [3]
        dte       5                              // E5 = D                  [2]
        aaimb     4                              // A += 4                  [3]
        ld32l                                    // D = *A                  [3]
        sllc      32                             // D <<= 32                [4]
        ore       5                              // D |= E5                 [3]
        dte       9                              // E9 = D                  [2]
        etd       2                              // D = E2                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     4                              // A = D + E4              [3]
        ate       5                              // E5 = A                  [2]
        etd       2                              // D = E2                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     7                              // A = D + E7              [3]
        ld32l                                    // D = *A                  [3]
        dte       10                             // E10 = D                 [2]
        aaimb     4                              // A += 4                  [3]
        ld32l                                    // D = *A                  [3]
        sllc      32                             // D <<= 32                [4]
        ore       10                             // D |= E10                [3]
        adde      9                              // D += E9                 [3]
        eta       5                              // A = E5                  [2]
        st32                                     // *A = D                  [2]
        srlc      32                             // D >>= 32                [4]
        aaimb     4                              // A += 4                  [3]
        st32                                     // *A = D                  [2]
        etd       6                              // D = E6                  [2]
        dpeta     2                              // A = D + E2              [3]
        ate       2                              // E2 = A                  [2]
        etd       3                              // D = E3                  [2]
        cmpe      2                              // D ?= E2                 [3]
        td0       39, %for.end                   // E sge D                 [5]
        jmp       %for.body                      //                         [4]
%for.end:                                        // block                   (210)
        jmpe      1                              // return void             [3]

Why is C++ using 32-bit loads and stores for a 64-bit type?
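For reference, the load and store pattern in the C++ listing (low word at A, high word at A + 4, combined with a shift and an OR) corresponds roughly to the C below. This only restates what the listing does and does not explain why the compiler chose it. The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reads a 64-bit value as two 32-bit halves: low word first, then high word. */
uint64_t load64_as_two_32(const unsigned char *p)
{
    uint32_t lo, hi;
    memcpy(&lo, p, sizeof lo);
    memcpy(&hi, p + 4, sizeof hi);
    return ((uint64_t)hi << 32) | lo;
}

/* Writes a 64-bit value as two 32-bit halves in the same layout. */
void store64_as_two_32(unsigned char *p, uint64_t v)
{
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);
    memcpy(p, &lo, sizeof lo);
    memcpy(p + 4, &hi, sizeof hi);
}
```

Per element, the C++ listing issues six memory operations (four ld32l and two st32) and 36 instructions, against three memory operations (two lde and one wrd) and 21 instructions in the C listing.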
