Overview

Microbenchmarks to test the performance characteristics of the Emu Chick.

Building

Build for Emu hardware/simulator:

mkdir build-hw && cd build-hw
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=../cmake/emu-cxx-toolchain.cmake
make -j4

Build for testing on x86:

mkdir build-x86 && cd build-x86
cmake .. \
-DCMAKE_BUILD_TYPE=Debug
make -j4

Benchmarks

`local_stream`

Description

Allocates three arrays (A, B, C) of 2^log2_num_elements elements each on a single nodelet. Computes the element-wise sum of two vectors (C = A + B) using num_threads threads, and reports the average memory bandwidth.
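To make the reported number concrete, here is a minimal sketch of the kernel and the bandwidth arithmetic in plain C. It assumes 8-byte elements and counts two reads plus one write per element; the benchmark's own accounting may differ, and the function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* STREAM "add" kernel: c[i] = a[i] + b[i] for n 64-bit elements. */
static void stream_add(const int64_t *a, const int64_t *b, int64_t *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* Bytes moved by one pass over the arrays: two reads and one write
 * per element (assumed accounting, not taken from the benchmark). */
static double stream_bytes(size_t n)
{
    return 3.0 * (double)sizeof(int64_t) * (double)n;
}

/* Bandwidth in GB/s for a given byte count and elapsed time. */
static double bandwidth_gb_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

With log2_num_elements = 20, one pass moves 3 * 8 * 2^20 = 25165824 bytes under this accounting.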

Usage

./local_stream mode log2_num_elements num_threads num_trials

Modes

serial - Uses a serial for loop
cilk_for - Uses a cilk_for loop
serial_spawn - Uses a serial for loop to spawn a thread for each grain-sized chunk of the loop range
recursive_spawn - Recursively spawns threads to divide up the loop range
library - Uses emu_local_for from emu_c_utils
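The difference between serial_spawn and recursive_spawn is only in how the loop range is divided. The sketch below shows both divisions in plain C, with ordinary calls standing in for the spawns so that it compiles without Cilk. The function names and the `grain` parameter are illustrative, not the benchmark's own.

```c
#include <assert.h>
#include <stddef.h>

/* Work done on one chunk: c[i] = a[i] + b[i] over [begin, end). */
static void add_range(const long *a, const long *b, long *c,
                      size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* serial_spawn pattern: a single loop hands out grain-sized chunks. */
static void serial_chunks(const long *a, const long *b, long *c,
                          size_t n, size_t grain)
{
    assert(grain > 0);
    for (size_t begin = 0; begin < n; begin += grain) {
        size_t end = (n - begin < grain) ? n : begin + grain;
        /* A Cilk version would put cilk_spawn in front of this call. */
        add_range(a, b, c, begin, end);
    }
}

/* recursive_spawn pattern: halve the range until it fits in one grain. */
static void recursive_chunks(const long *a, const long *b, long *c,
                             size_t begin, size_t end, size_t grain)
{
    assert(grain > 0 && begin <= end);
    if (end - begin <= grain) {
        add_range(a, b, c, begin, end);
        return;
    }
    size_t mid = begin + (end - begin) / 2;
    /* A Cilk version would spawn one half and recurse on the other. */
    recursive_chunks(a, b, c, begin, mid, grain);
    recursive_chunks(a, b, c, mid, end, grain);
}
```

In the serial pattern one thread issues every spawn, while the recursive pattern spreads the spawning across a tree of depth about log2(n / grain).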

`global_stream`

Allocates three arrays (A, B, C) of 2^log2_num_elements elements each as chunked (malloc2D) arrays distributed across all the nodelets. Computes the element-wise sum of two vectors (C = A + B) using num_threads threads, and reports the average memory bandwidth.
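As a rough picture of the chunked layout, the sketch below maps a global element index to a nodelet and an offset within that nodelet's chunk. It assumes one equally sized contiguous chunk per nodelet and an element count that divides evenly; `chunk_loc` and `chunked_locate` are hypothetical names, not part of emu_c_utils.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t nodelet;  /* which nodelet holds the element */
    size_t offset;   /* index within that nodelet's chunk */
} chunk_loc;

/* Map global index i to its location in a chunked layout of n elements
 * spread over num_nodelets nodelets (n must divide evenly). */
static chunk_loc chunked_locate(size_t i, size_t n, size_t num_nodelets)
{
    assert(num_nodelets > 0 && n % num_nodelets == 0 && i < n);
    size_t chunk_size = n / num_nodelets;
    chunk_loc loc = { i / chunk_size, i % chunk_size };
    return loc;
}
```

Under this layout neighbouring elements share a nodelet, so a thread working on one chunk can stay on one nodelet.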

Usage

./global_stream mode log2_num_elements num_threads num_trials

Modes

serial - Uses a serial for loop
cilk_for - Uses a cilk_for loop
serial_spawn - Uses a serial for loop to spawn a thread for each grain-sized chunk of the loop range
recursive_spawn - Recursively spawns threads to divide up the loop range
recursive_remote_spawn - Recursively spawns threads to divide up the loop range, using remote spawns where possible.
serial_remote_spawn - Remote spawns a thread on each nodelet, then divides up work as in serial_spawn
serial_remote_spawn_shallow - Like serial_remote_spawn, but all threads are remote spawned from nodelet 0.
library - Uses emu_chunked_array_apply from emu_c_utils.

`global_stream_1d`

Allocates three arrays (A, B, C) of 2^log2_num_elements elements each as striped arrays (malloc1dlong) distributed across all the nodelets. Computes the element-wise sum of two vectors (C = A + B) using num_threads threads, and reports the average memory bandwidth.
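For the striped layout, the index mapping can be sketched as below, assuming 8-byte elements are dealt out round-robin, one per nodelet. `stripe_loc` and `striped_locate` are hypothetical names for this example only.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t nodelet;      /* which nodelet holds the element */
    size_t local_index;  /* position among that nodelet's elements */
} stripe_loc;

/* Map global index i to its location when elements are dealt out
 * round-robin across num_nodelets nodelets. */
static stripe_loc striped_locate(size_t i, size_t num_nodelets)
{
    assert(num_nodelets > 0);
    stripe_loc loc = { i % num_nodelets, i / num_nodelets };
    return loc;
}
```

Under this mapping, consecutive indices land on consecutive nodelets, so a linear walk over the array touches a different nodelet at every step.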

Usage

./global_stream_1d mode log2_num_elements num_threads num_trials

Modes

serial - Uses a serial for loop
cilk_for - Uses a cilk_for loop
serial_spawn - Uses a serial for loop to spawn a thread for each grain-sized chunk of the loop range
library - Uses emu_1d_array_apply from emu_c_utils.

`pointer_chase`

The pointer chasing benchmark is defined as follows:

1. Allocate a contiguous array of N elements. Each element consists of an 8-byte payload and an 8-byte pointer to the next element.
2. Form a linked list by connecting the elements in random order.
3. Each of P threads traverses N/P nodes in the list in parallel, summing up the payloads as it goes.

The randomization of elements in step 2 can be further controlled using the block_size and sort_mode parameters. See the documentation for sort_mode below.
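The steps above can be sketched in plain C as follows. This is a single-threaded illustration, not the benchmark's code: the `node` type, `link_in_order`, and `chase` are names invented here, and the idea that thread t starts at position t * N/P in the traversal order is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    int64_t payload;    /* 8-byte payload */
    struct node *next;  /* pointer to the next element (8 bytes on a 64-bit target) */
} node;

/* Link nodes[] into one list that visits them in the order given by
 * order[0..n). Returns the head of the list. */
static node *link_in_order(node *nodes, const size_t *order, size_t n)
{
    assert(n > 0);
    for (size_t k = 0; k + 1 < n; ++k) {
        nodes[order[k]].next = &nodes[order[k + 1]];
    }
    nodes[order[n - 1]].next = NULL;
    return &nodes[order[0]];
}

/* Walk at most count nodes starting at head, summing the payloads. */
static int64_t chase(const node *head, size_t count)
{
    int64_t sum = 0;
    for (const node *p = head; p != NULL && count > 0; p = p->next, --count) {
        sum += p->payload;
    }
    return sum;
}
```

With P threads, thread t would call `chase(&nodes[order[t * (N / P)]], N / P)` under the assumption stated above, and the per-thread sums add up to the sum over the whole list.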

Usage

./pointer_chase [OPTIONS]

    --log2_num_elements  Base-2 logarithm of the number of elements in the list
    --num_threads        Number of threads traversing the list
    --block_size         Number of elements to swap at a time
    --spawn_mode         How to spawn the threads
    --sort_mode          How to shuffle the array
    --num_trials         Number of times to run the benchmark

Spawn Modes

serial_spawn - Uses a serial for loop to spawn a thread for each grain-sized chunk of the loop range
serial_remote_spawn - Remote spawns a thread on each nodelet, then divides up work as in serial_spawn

Sort Modes

This parameter controls how the elements in the list are linked together. In each example, the numbers are the indices of the list elements in the contiguous array, and the examples assume --log2_num_elements=4 and --block_size=4.

ordered - Each element points to the next element in the array

0->1->2->3->4->5->6->7->8->9->10->11->12->13->14->15

block_shuffle - The elements are grouped into blocks of size block_size, then the order of the blocks is randomized

4->5->6->7--->0->1->2->3--->12->13->14->15--->8->9->10->11

intra_block_shuffle - The elements are grouped into blocks of size block_size, then the order of elements within each block is randomized

2->1->3->0--->4->5->7->6--->8->10->11->9--->14->15->13->12

full_block_shuffle - The elements are grouped into blocks of size block_size. The order of elements within each block is randomized and the order of each block is randomized.

4->5->7->6--->2->1->3->0--->14->15->13->12--->8->10->11->9
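The four sort modes can be expressed as one routine that fills in the traversal order. The sketch below is an illustration under assumptions: the enum names mirror the mode names above, the small linear congruential generator is a stand-in chosen for reproducibility, and none of it is taken from the benchmark source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum sort_mode { ORDERED, BLOCK_SHUFFLE, INTRA_BLOCK_SHUFFLE, FULL_BLOCK_SHUFFLE };

/* Tiny LCG so the sketch is reproducible; a stand-in generator only. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Fisher-Yates shuffle of v[0..n). */
static void shuffle(size_t *v, size_t n, uint32_t *state)
{
    for (size_t i = n; i > 1; --i) {
        size_t j = next_rand(state) % i;
        size_t tmp = v[i - 1];
        v[i - 1] = v[j];
        v[j] = tmp;
    }
}

/* Exchange the contents of block a and block b, element by element. */
static void swap_blocks(size_t *order, size_t a, size_t b, size_t block_size)
{
    for (size_t k = 0; k < block_size; ++k) {
        size_t tmp = order[a * block_size + k];
        order[a * block_size + k] = order[b * block_size + k];
        order[b * block_size + k] = tmp;
    }
}

/* Fill order[0..n) with the traversal order for the given mode. */
static void make_order(size_t *order, size_t n, size_t block_size,
                       enum sort_mode mode, uint32_t *state)
{
    assert(block_size > 0 && n % block_size == 0);
    size_t num_blocks = n / block_size;
    for (size_t i = 0; i < n; ++i) {
        order[i] = i;
    }
    if (mode == INTRA_BLOCK_SHUFFLE || mode == FULL_BLOCK_SHUFFLE) {
        /* Randomize the elements inside each block. */
        for (size_t b = 0; b < num_blocks; ++b) {
            shuffle(order + b * block_size, block_size, state);
        }
    }
    if (mode == BLOCK_SHUFFLE || mode == FULL_BLOCK_SHUFFLE) {
        /* Randomize the order of whole blocks (Fisher-Yates over blocks). */
        for (size_t b = num_blocks; b > 1; --b) {
            swap_blocks(order, b - 1, next_rand(state) % b, block_size);
        }
    }
}
```

The resulting `order` array lists element indices in the order the list visits them, which is the same notation the examples above use.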

STREAM over striped array is slower in C++ than in C

The C++ version (global_stream_cxx in striped mode) reaches about 3.1 GB/s, while the C version (global_stream_1d) reaches peak bandwidth (more than 8 GB/s).

Both inner loops compute c[i] = a[i] + b[i].

In global_stream_1d.c this compiles to:

%for.body:                                       // block                   (44)
        etd       3                              // D = E3                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     7                              // A = D + E7              [3]
        lde       8                              // E8 = *A                 [3]
        etd       3                              // D = E3                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     6                              // A = D + E6              [3]
        lde       9                              // E9 = *A                 [3]
        etd       3                              // D = E3                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     5                              // A = D + E5              [3]
        etd       9                              // D = E9                  [2]
        adde      8                              // D += E8                 [3]
        wrd                                      // *A = D                  [2]
        etd       2                              // D = E2                  [2]
        dpeta     3                              // A = D + E3              [3]
        ate       3                              // E3 = A                  [2]
        etd       4                              // D = E4                  [2]
        cmpe      3                              // D ?= E3                 [3]
        td0       39, %for.end                   // E sge D                 [5]
        jmp       %for.body                      //                         [4]
%for.end:                                        // block                   (105)
        jmpe      1                              // return void             [3]

In global_stream_cxx.cc, the inner loop compiles to:

%for.body:                                       // block                   (106)
        etd       2                              // D = E2                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     8                              // A = D + E8              [3]
        ld32l                                    // D = *A                  [3]
        dte       5                              // E5 = D                  [2]
        aaimb     4                              // A += 4                  [3]
        ld32l                                    // D = *A                  [3]
        sllc      32                             // D <<= 32                [4]
        ore       5                              // D |= E5                 [3]
        dte       9                              // E9 = D                  [2]
        etd       2                              // D = E2                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     4                              // A = D + E4              [3]
        ate       5                              // E5 = A                  [2]
        etd       2                              // D = E2                  [2]
        sllc      3                              // D <<= 3                 [4]
        dpeta     7                              // A = D + E7              [3]
        ld32l                                    // D = *A                  [3]
        dte       10                             // E10 = D                 [2]
        aaimb     4                              // A += 4                  [3]
        ld32l                                    // D = *A                  [3]
        sllc      32                             // D <<= 32                [4]
        ore       10                             // D |= E10                [3]
        adde      9                              // D += E9                 [3]
        eta       5                              // A = E5                  [2]
        st32                                     // *A = D                  [2]
        srlc      32                             // D >>= 32                [4]
        aaimb     4                              // A += 4                  [3]
        st32                                     // *A = D                  [2]
        etd       6                              // D = E6                  [2]
        dpeta     2                              // A = D + E2              [3]
        ate       2                              // E2 = A                  [2]
        etd       3                              // D = E3                  [2]
        cmpe      2                              // D ?= E2                 [3]
        td0       39, %for.end                   // E sge D                 [5]
        jmp       %for.body                      //                         [4]
%for.end:                                        // block                   (210)
        jmpe      1                              // return void             [3]

Why is C++ using 32-bit loads and stores for a 64-bit type?
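Reading the two listings side by side: the C loop body is 21 instructions with three memory operations (two lde, one wrd), while the C++ loop body is 36 instructions with six (four ld32l, two st32). The plain C sketch below mirrors what the C++ listing does with each 64-bit value, namely load the low and high 32-bit words separately and recombine them with a shift and an OR, then store the two halves separately. It assumes a little-endian layout, and the helper names are invented for this illustration. It does not explain why the compiler chose this sequence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value as two 32-bit words and recombine them, like the
 * ld32l / aaimb 4 / ld32l / sllc 32 / ore sequence in the C++ listing.
 * Assumes the low word is stored first (little-endian). */
static uint64_t load64_as_halves(const void *p)
{
    uint32_t lo, hi;
    memcpy(&lo, p, sizeof lo);
    memcpy(&hi, (const unsigned char *)p + 4, sizeof hi);
    return ((uint64_t)hi << 32) | lo;
}

/* Write a 64-bit value as two 32-bit words, like the
 * st32 / srlc 32 / aaimb 4 / st32 sequence in the C++ listing. */
static void store64_as_halves(void *p, uint64_t v)
{
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);
    assert((((uint64_t)hi << 32) | lo) == v);
    memcpy(p, &lo, sizeof lo);
    memcpy((unsigned char *)p + 4, &hi, sizeof hi);
}
```

Each element access in this form costs two memory operations instead of one, which matches the doubled load and store counts in the C++ listing.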
