
top500-benchmark's Introduction

Top500 Benchmark - HPL Linpack

A common generic benchmark for clusters (or extremely powerful single-node workstations) is Linpack, or HPL (High Performance Linpack), which is famous for its use in ranking supercomputers on the Top500 list over the past few decades.

I wanted to see where my various clusters and workstations would rank, historically (you can compare to past lists here), so I built this Ansible playbook which installs all the necessary tooling for HPL to run, connects all the nodes together via SSH, then runs the benchmark and outputs the result.

Why not PTS?

Phoronix Test Suite includes HPL Linpack and HPCC test suites. I may see how they compare in the future.

When I initially started down this journey, the PTS versions didn't play nicely with the Pi, especially when clustered. In fact, the PTS versions don't seem to support clustered usage at all!

Supported OSes

Currently supported OSes:

  • Ubuntu (20.04+)
  • Raspberry Pi OS (11+)
  • Debian (11+)
  • Rocky Linux (9+)
  • AlmaLinux (9+)
  • CentOS Stream (9+)
  • RHEL (9+)
  • Fedora (38+)
  • Arch Linux
  • Manjaro

Other OSes may need a few tweaks to work correctly. You can also run the playbook inside Docker (see the note under 'Benchmarking - Single Node'), but performance will be artificially limited.

Benchmarking - Cluster

Make sure you have Ansible installed (pip3 install ansible), then copy the following files:

  • cp example.hosts.ini hosts.ini: This is an inventory of all the hosts in your cluster (or just a single computer).
  • cp example.config.yml config.yml: This has some configuration options you may need to override, especially the ssh_* and ram_in_gb options (depending on your cluster layout).

Each host should be reachable via SSH using the username set in ansible_user. Other Ansible options can be set under [cluster:vars] to connect in more exotic clustering scenarios (e.g. via bastion/jump-host).
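For example, a minimal hosts.ini for a four-node cluster reached through a jump host might look like the following (the hostnames, username, and bastion address here are all placeholders):

```ini
[cluster]
node-01.local
node-02.local
node-03.local
node-04.local

[cluster:vars]
ansible_user=pi
# Hypothetical bastion/jump-host example:
ansible_ssh_common_args=-o ProxyJump=admin@bastion.example.com
```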

Tweak other settings inside config.yml as desired (the most important being hpl_root—this is where the compiled MPI, ATLAS/OpenBLAS/Blis, and HPL benchmarking code will live).

Note: The names of the nodes inside hosts.ini must match the hostname of their corresponding node; otherwise, the benchmark will hang when you try to run it in a cluster.

For example, if you have node-01.local in your hosts.ini, that host's hostname should be node-01 and not something else like raspberry-pi.
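As a quick sanity check, you can strip the domain from an inventory entry and compare the result against `hostname -s` on that node (node-01.local here is a hypothetical hosts.ini entry):

```shell
# Strip the domain from the inventory entry; the result should match
# the output of `hostname -s` on that node.
inventory_entry="node-01.local"   # hypothetical entry from hosts.ini
expected="${inventory_entry%%.*}"
echo "$expected"
```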

If you're testing with .local domains on Ubuntu, and local mDNS resolution isn't working, consider installing the avahi-daemon package:

sudo apt-get install avahi-daemon

Then run the benchmarking playbook inside this directory:

ansible-playbook main.yml

This will run three separate plays:

  1. Setup: downloads and compiles all the code required to run HPL. (This play takes a long time—up to many hours on a slower Raspberry Pi!)
  2. SSH: configures the nodes to be able to communicate with each other.
  3. Benchmark: creates an HPL.dat file and runs the benchmark, outputting the results in your console.

After the entire playbook is complete, you can also log directly into any of the nodes (though I generally do things on node 1), and run the following commands to kick off a benchmarking run:

cd ~/tmp/hpl-2.3/bin/top500
mpirun -f cluster-hosts ./xhpl

The configuration here was tested on smaller 1, 4, and 6-node clusters with 6-64 GB of RAM. Some settings in the config.yml file that affect the generated HPL.dat file may need different tuning for different cluster layouts!
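As a sketch of the tuning math, the playbook derives the HPL.dat problem size (Ns) from the rule of thumb sqrt((RAM in GB × 1024³ × node count) / 8) × 0.90. For a hypothetical 4-node cluster with 6 GB of usable RAM per node:

```shell
# Estimate Ns for 4 nodes with 6 GB usable RAM each (values are examples).
ram_gb=6
nodes=4
awk -v r="$ram_gb" -v n="$nodes" \
  'BEGIN { printf "%d\n", sqrt(r * 1024^3 * n / 8) * 0.90 }'
# prints 51080
```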

Benchmarking - Single Node

To run locally on a single node, clone or download this repository to the node where you want to run HPL. Make sure the hosts.ini is set up with the default options (with just one node, 127.0.0.1).

All the default configuration from example.config.yml should be copied to a config.yml file, and all the variables should scale dynamically for your node.

Run the following command so the cluster networking portion of the playbook is not run:

ansible-playbook main.yml --tags "setup,benchmark"

For testing, you can start an Ubuntu docker container:

docker run --name top500 -it -v $PWD:/code geerlingguy/docker-ubuntu2204-ansible:latest bash

Then go into the code directory (cd /code) and run the playbook using the command above.

Setting performance CPU frequency

If you get an error like CPU Throttling apparently enabled!, you may need to set the CPU frequency governor to performance (and disable any throttling or dynamic frequency scaling).

The exact method differs by OS and CPU type. So far, the automated performance setting in the main.yml playbook has only been tested on Raspberry Pi OS; you may need to look up how to disable throttling on your own system. Do that, then run the main.yml playbook again.
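On many Linux systems, one way to pin the governor is through the cpufreq sysfs interface. This is a sketch that assumes your kernel exposes scaling_governor; the root path is parameterized so you can dry-run it against a scratch directory first:

```shell
# Set the 'performance' governor on every CPU that exposes cpufreq controls.
# CPUFREQ_ROOT defaults to the real sysfs tree; override it to test safely.
CPUFREQ_ROOT="${CPUFREQ_ROOT:-/sys/devices/system/cpu}"
for gov in "$CPUFREQ_ROOT"/cpu*/cpufreq/scaling_governor; do
  if [ -w "$gov" ]; then
    echo performance > "$gov"
  fi
done
```

Where the cpupower tool is available, `cpupower frequency-set -g performance` accomplishes the same thing.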

Overclocking

Since I originally built this project for a Raspberry Pi cluster, I include a playbook to set an overclock for all the Raspberry Pis in a given cluster.

You can set a clock speed by changing the pi_arm_freq in the overclock-pi.yml playbook, then run it with:

ansible-playbook overclock-pi.yml
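For instance, to target 2.0 GHz (the value below is only an illustration; check what your board and cooling can actually sustain):

```yaml
# In overclock-pi.yml: pi_arm_freq is in MHz.
# Example value only; stock BCM2711 clocks are 1500-1800 MHz depending on revision.
pi_arm_freq: 2000
```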

Higher clock speeds require more power and generate more heat, so if your Pi cluster is cooled with heatsinks alone, you may also need a fan blowing over them when running overclocked.

Results

Here are a few of the results I've acquired in my testing (sorted by efficiency, highest to lowest):

| Configuration | Architecture | Result | Wattage | Gflops/W |
| --- | --- | --- | --- | --- |
| Radxa CM5 (RK3588S2 8-core) | Arm | 48.619 Gflops | 10W | 4.86 Gflops/W |
| Ampere Altra Q64-22 @ 2.2 GHz | Arm | 655.90 Gflops | 140W | 4.69 Gflops/W |
| Orange Pi 5 (RK3588S 8-core) | Arm | 53.333 Gflops | 11.5W | 4.64 Gflops/W |
| Radxa ROCK 5B (RK3588 8-core) | Arm | 51.382 Gflops | 12W | 4.32 Gflops/W |
| Ampere Altra Max M128-28 @ 2.8 GHz | Arm | 1,265.5 Gflops | 296W | 4.27 Gflops/W |
| Radxa ROCK 5C (RK3588S2 8-core) | Arm | 49.285 Gflops | 12W | 4.11 Gflops/W |
| Ampere Altra Max M96-28 @ 2.8 GHz | Arm | 1,188.3 Gflops | 295W | 4.01 Gflops/W |
| M1 Max Mac Studio (1x M1 Max @ 3.2 GHz, in Docker) | Arm | 264.32 Gflops | 66W | 4.00 Gflops/W |
| Ampere Altra Q32-17 @ 1.7 GHz | Arm | 332.07 Gflops | 100W | 3.32 Gflops/W |
| Turing Machines RK1 (RK3588 8-core) | Arm | 59.810 Gflops | 18.1W | 3.30 Gflops/W |
| Turing Pi 2 (4x RK1 @ 2.4 GHz) | Arm | 224.60 Gflops | 73W | 3.08 Gflops/W |
| Raspberry Pi 5 (BCM2712 @ 2.4 GHz) | Arm | 30.249 Gflops | 11W | 2.75 Gflops/W |
| LattePanda Mu (1x N100 @ 3.4 GHz) | x86 | 62.851 Gflops | 25W | 2.51 Gflops/W |
| Radxa X4 (1x N100 @ 3.4 GHz) | x86 | 37.224 Gflops | 16W | 2.33 Gflops/W |
| Raspberry Pi CM4 (BCM2711 @ 1.5 GHz) | Arm | 11.433 Gflops | 5.2W | 2.20 Gflops/W |
| Ampere Altra Max M128-30 @ 3.0 GHz | Arm | 953.47 Gflops | 500W | 1.91 Gflops/W |
| Turing Pi 2 (4x CM4 @ 1.5 GHz) | Arm | 44.942 Gflops | 24.5W | 1.83 Gflops/W |
| Lenovo M710q Tiny (1x i5-7400T @ 2.4 GHz) | x86 | 72.472 Gflops | 41W | 1.76 Gflops/W |
| Raspberry Pi 4 (BCM2711 @ 1.8 GHz) | Arm | 11.889 Gflops | 7.2W | 1.65 Gflops/W |
| Turing Pi 2 (4x CM4 @ 2.0 GHz) | Arm | 51.327 Gflops | 33W | 1.54 Gflops/W |
| DeskPi Super6c (6x CM4 @ 1.5 GHz) | Arm | 60.293 Gflops | 40W | 1.50 Gflops/W |
| Orange Pi CM4 (RK3566 4-core) | Arm | 5.604 Gflops | 4.0W | 1.40 Gflops/W |
| DeskPi Super6c (6x CM4 @ 2.0 GHz) | Arm | 70.338 Gflops | 51W | 1.38 Gflops/W |
| AMD Ryzen 5 5600x @ 3.7 GHz | x86 | 229 Gflops | 196W | 1.16 Gflops/W |
| Milk-V Mars CM (JH7110 4-core) | RISC-V | 1.99 Gflops | 3.6W | 0.55 Gflops/W |
| Lichee Console 4A (TH1520 4-core) | RISC-V | 1.99 Gflops | 3.6W | 0.55 Gflops/W |
| Milk-V Jupiter (Spacemit X60 8-core) | RISC-V | 5.66 Gflops | 10.6W | 0.55 Gflops/W |
| Milk-V Mars (JH7110 4-core) | RISC-V | 2.06 Gflops | 4.7W | 0.44 Gflops/W |
| Raspberry Pi Zero 2 W (RP3A0-AU @ 1.0 GHz) | Arm | 0.370 Gflops | 2.1W | 0.18 Gflops/W |
| M2 Pro MacBook Pro (1x M2 Pro, in Asahi Linux) | Arm | 296.93 Gflops | N/A | N/A |
| M2 MacBook Air (1x M2 @ 3.5 GHz, in Docker) | Arm | 104.68 Gflops | N/A | N/A |

You can enter the Gflops in this tool to see how it compares to historical top500 lists.

Note: My current calculation for efficiency is based on average power draw over the course of the benchmark, based on either a Kill-A-Watt (pre-2024 tests) or a ThirdReality Smart Outlet monitor. The efficiency calculations may vary depending on the specific system under test.
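The Gflops/W column is just the benchmark result divided by the average power draw; for example, the Raspberry Pi 5 row works out as:

```shell
# 30.249 Gflops at an average 11W draw:
awk -v gflops=30.249 -v watts=11 'BEGIN { printf "%.2f Gflops/W\n", gflops / watts }'
# prints 2.75 Gflops/W
```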

Other Listings

Over the years, as I find other people's listings of HPL results—especially those with power usage ratings—I will add them here:

top500-benchmark's People

Contributors: darkassassin23, geerlingguy

top500-benchmark's Issues

OpenBLAS or BLIS instead of ATLAS?

After doing some more testing with Ampere's recommended HPL setup (with an Ampere-optimized BLIS library), I would like to investigate switching away from ATLAS.

The primary motivation is build speed. I've noticed some machines can compile in an hour or two, but others take 2-3 days (especially slower systems like the Raspberry Pi 4...).

That's not especially fun, but in the past I've stuck with this method thinking it will compile ATLAS in a way that is tuned to each specific processor the best. Supposedly. (Who understands all this math that well anyway?)

I would like to compare other options like OpenBLAS or BLIS to see:

  1. If they are able to be used easily as a drop-in replacement for ATLAS
  2. If the performance result is affected too drastically (which would trigger me wanting to re-run HPL on all my machines, heh)

Benchmark Orange Pi 5

The Orange Pi 5 has a Rockchip RK3588S with 8 cores (A76 quad core + A55 quad core), and runs up to 2.4 GHz.

SSH connection problems running multiple slave nodes on a cluster

Hi Everyone!

My name is Mike Lindstedt, and I'm a senior student studying Electrical Engineering at Grove City College. I'm working on a project for a parallel computer architecture course with a group of four other people. Our project is to create a supercomputer out of 5 HP laptops and a network switch. We are connecting the laptops as nodes in a cluster on the network and need to run Linpack to measure the performance of the cluster.

So far, we removed the Windows OS from each computer and replaced it with Ubuntu Linux. Then we downloaded the code from this Top500 Benchmark - HPL Linpack GitHub page. We used one laptop as the master node distributing the workload, and the other four as slave nodes receiving the workload.

When we run the program per the instructions, we can get the benchmark to run successfully with one master node and one slave node. However, when we run the benchmark with multiple slave nodes, one or more nodes always fail while attempting to establish SSH connections, after we are prompted to enter the passcode for the host node several times on the command line. We've tried several options, including modifying some of the relevant files and researching how changing those files might help, but we haven't had any luck.

We need to finish the project by Thursday, April 20. Has anyone ever encountered this error before when attempting to benchmark a cluster with this code? Do you have any suggestions on how we might resolve it? Any help would be greatly appreciated!

Thanks in advance,

Mike Lindstedt

Benchmark Orange Pi Compute Module 4

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   14745
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       14745   256     1     4             381.41             5.6043e+00
HPL_pdgesv() start time Sat Nov  4 01:45:49 2023

HPL_pdgesv() end time   Sat Nov  4 01:52:11 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   4.07780599e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

I don't have the power usage measurement, however.

Benchmark M1 Max Mac Studio

To do this:

  1. git clone https://github.com/geerlingguy/top500-benchmark.git && cd top500-benchmark
  2. cp example.hosts.ini hosts.ini && cp example.config.yml config.yml
  3. Edit config.yml and change Qs to 10 (for 10 vCPUs in Docker)
  4. Make sure Docker is running, and increase RAM to 32 GB and CPUs to 10 in Resources settings.
  5. Start container: docker run --name top500 -it -v $PWD:/code geerlingguy/docker-ubuntu2204-ansible:latest bash
  6. Go into code directory: cd /code
  7. Run the benchmark: ansible-playbook main.yml --tags "setup,benchmark"

Benchmark Raspberry Pi 5 model B - 4 core A76 2.4 GHz

Base clock of 2.4 GHz:

30.249 Gflops at 11W = 2.75 Gflops/W

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             279.31             3.0249e+01
HPL_pdgesv() start time Fri Sep 22 14:26:10 2023

HPL_pdgesv() end time   Fri Sep 22 14:30:49 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
            1 tests completed and passed residual checks,
            0 tests completed and failed residual checks,
            0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Generalize the 'ssh' play in main.yml for any type of cluster

The main.yml playbook's 'Configure SSH connections between nodes.' play was written for a cluster of Raspberry Pis running Raspberry Pi OS. As such, it's a little inflexible in its current form.

I would like to make this more generic so it can work on any cluster.

In fact, I may consider breaking that play out into its own playbook, especially if it can be run independently of the main benchmarking setup/benchmark playbook.

Install ATLAS Task failed

hosts.ini

# For single node benchmarking (default), use this:
[cluster]
127.0.0.1 ansible_connection=local

# For cluster benchmarking, delete everything above this line and uncomment:
# [cluster]
# node-01.local
# node-02.local
# node-03.local
# node-04.local
# node-05.local
#
# [cluster:vars]
# ansible_user=username

config.yml

# Working directory where HPL and associated applications will be compiled.
hpl_root: /opt/top500

# Home directory of the user for whom SSH keys will be configured.
ssh_user: dantemp
ssh_user_home: /home/dantemp

# Specify manually if needed for mixed-RAM-capacity clusters.
ram_in_gb: "{{ ( ansible_memtotal_mb / 1024 * 0.75 ) | int | abs }}"

# Count the nodes for accurate HPL.dat calculations.
nodecount: "{{ ansible_play_hosts | length | int }}"

# HPL.dat configuration options.
# See: https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
# See also: https://hpl-calculator.sourceforge.net/HPL-HowTo.pdf
hpl_dat_opts:
  # sqrt((Memory in GB * 1024 * 1024 * 1024 * Node count) / 8) * 0.9
  Ns: "{{ (((((ram_in_gb | int) * 1024 * 1024 * 1024 * (nodecount | int)) / 8) | root) * 0.90) | int }}"
  NBs: 256
  # (P * Q) should be roughly equivalent to total core count, with Qs higher.
  # If running on a single system, Ps should be 1 and Qs should be core count.
  Ps: 1
  Qs: 192

ATLAS error:

failed: [127.0.0.1] (item=../ATLAS/configure) => changed=true
  ansible_loop_var: item
  cmd:
  - ../ATLAS/configure
  delta: '0:00:04.281680'
  end: '2022-11-22 06:54:52.777140'
  item: ../ATLAS/configure
  msg: non-zero return code
  rc: 1
  start: '2022-11-22 06:54:48.495460'
  stderr: |-
    /usr/bin/ld: /tmp/ccMxBJwV.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: probe_OS.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: probe_asm.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: probe_vec.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: probe_arch.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: /tmp/ccQqmLej.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    ERROR: enum fam=0, chip=0, model=17, mach=0
    make[3]: *** [Makefile:106: atlas_run] Error 100
    make[2]: *** [Makefile:449: IRunArchInfo_x86] Error 2
    /usr/bin/ld: /tmp/ccwT5Idd.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    ERROR: enum fam=0, chip=0, model=17, mach=0
    make[3]: *** [Makefile:106: atlas_run] Error 100
    make[2]: *** [Makefile:449: IRunArchInfo_x86] Error 2
    ERROR: enum fam=0, chip=0, model=17, mach=0
    make[3]: *** [Makefile:106: atlas_run] Error 100
    make[2]: *** [Makefile:449: IRunArchInfo_x86] Error 2
    /usr/bin/ld: probe_pmake.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    ERROR: enum fam=0, chip=0, model=17, mach=0
    make[3]: *** [Makefile:106: atlas_run] Error 100
    make[2]: *** [Makefile:449: IRunArchInfo_x86] Error 2
    It appears you have cpu throttling enabled, which makes timings
    unreliable and an ATLAS install nonsensical.

    OS-controlled CPU throttling is so course grained, that timings become
    essentially random.  What this means for an ATLAS install is that ATLAS
    cannot tell the difference between a good and bad kernel, and so the
    tuning step may result in arbitrarily bad performance.  If you don't care
    about performance, you are usually better off just using the reference BLAS.

    If you fear overheating, setting clock speed to some lower, constant
    value should give you a decent install.

    Hardware-controlled throttling is usually much finer grained, and therefore
    may result in mediocre tuning, but this will depend quite bit on luck.

    If your machine has OS throttling enabled, it is critical that you disable
    it (with something like cpufreq-set).  See INSTALL.txt for details.

    If you you do not care at all about performance, you can rerun configure
    with --cripple-atlas-performance to proceed in the face of throttling.
    Do not do this unless you really don't care about performance.
    If you are able to turn off throttling, rerun configure as normal
    once you have done so.

    Aborting due to throttling
  stderr_lines: <omitted>
  stdout: |-
    gcc -I/opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include  -g -w -c /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/src/atlconf_misc.c
    gcc -I/opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include  -g -w -o xconfig /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/src/config.c atlconf_misc.o
    ./xconfig -d s /opt/top500/tmp/atlas-build/../ATLAS/ -d b /opt/top500/tmp/atlas-build

    OS configured as Linux (1)

    Assembly configured as GAS_x8664 (2)

    Vector ISA Extension configured as  AVXMAC (4,976)

    Architecture configured as  UNKNOWNx86 (43)

    Clock rate configured as 3709Mhz

    Maximum number of threads configured as  384
    Parallel make command configured as '$(MAKE) -j 384'
    CPU Throttling apparently enabled!
    xconfig exited with 1
  stdout_lines: <omitted>
failed: [127.0.0.1] (item=make) => changed=true
  ansible_loop_var: item
  cmd:
  - make
  delta: '0:00:00.003766'
  end: '2022-11-22 06:54:52.880356'
  item: make
  msg: non-zero return code
  rc: 2
  start: '2022-11-22 06:54:52.876590'
  stderr: |-
    make[1]: Make.top: No such file or directory
    make[1]: *** No rule to make target 'Make.top'.  Stop.
    make: *** [Makefile:536: build] Error 2
  stderr_lines: <omitted>
  stdout: |-
    make -f Make.top build
    make[1]: Entering directory '/opt/top500/tmp/atlas-build'
    make[1]: Leaving directory '/opt/top500/tmp/atlas-build'
  stdout_lines: <omitted>

PLAY RECAP ***************************************************************************************************************************************
127.0.0.1                  : ok=10   changed=5    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0

dantemp@genoatest:~/top500-benchmark$

Benchmark Asus RS300-E11

The RS300-E11 we have is configured with:

1 x Xeon E-2374G @ 5.0GHz
64GB DDR4
No GPU.

TASK [Output the results.] *****************************************************************************************************
ok: [127.0.0.1] =>                                                                                                              
  mpirun_output.stdout: |-                                                                                                      
    ================================================================================                                            
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018                                                
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK                                            
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK                                                            
    Modified by Julien Langou, University of Colorado Denver                                                                    
    ================================================================================                                            
                                                                                                                                
    An explanation of the input/output parameters follows:                                                                      
    T/V    : Wall time / encoded variant.                                                                                       
    N      : The order of the coefficient matrix A.                                                                             
    NB     : The partitioning blocking factor.                                                                                  
    P      : The number of process rows.                                                                                        
    Q      : The number of process columns.                                                                                     
    Time   : Time in seconds to solve the linear system.                                                                        
    Gflops : Rate of execution for solving the linear system.                                                                   
                                                                                                                                
    The following parameter values will be used:                                                                                
                                                                                                                                
    N      :   71481                                                                                                            
    NB     :     256                                                                                                            
    PMAP   : Row-major process mapping                                                                                          
    P      :       1                                                                                                            
    Q      :       8                                                                                                            
    PFACT  :   Right                                                                                                            
    NBMIN  :       4                                                                                                            
    NDIV   :       2                                                                                                            
    RFACT  :   Crout                                                                                                            
    BCAST  :  1ringM                                                                                                            
    DEPTH  :       1                                                                                                            
    SWAP   : Mix (threshold = 64)                                                                                               
    L1     : transposed form                                                                                                    
    U      : transposed form                                                                                                    
    EQUIL  : yes                                                                                                                
    ALIGN  : 8 double precision words                                                                                           
                                                                                                                                
    --------------------------------------------------------------------------------                                            
  
    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0
                                             
    ================================================================================                                                                                                  
    T/V                N    NB     P     Q               Time                 Gflops                                                                                                  
    --------------------------------------------------------------------------------                                                                                                  
    WR11C2R4       71481   256     1     8            1241.97             1.9606e+02                                                                                                  
    HPL_pdgesv() start time Wed Nov 29 22:21:15 2023                                       
                                             
    HPL_pdgesv() end time   Wed Nov 29 22:41:57 2023                                       
                                             
    --------------------------------------------------------------------------------                                                                                                  
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.54353754e-03 ...... PASSED                                                                                                  
    ================================================================================                                                                                                  
                                             
    Finished      1 tests with the following results:                                      
                  1 tests completed and passed residual checks,                            
                  0 tests completed and failed residual checks,                            
                  0 tests skipped because of illegal input values.                                                                                                                    
    --------------------------------------------------------------------------------                                                                                                  
                                             
    End of Tests.                            
    ================================================================================

Thought it would be fun to benchmark one of our lower-specced servers (it runs a couple of web servers). A better result than I was expecting for a 4-core/8-thread machine.
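The PASSED line above is HPL's scaled residual check, using the formula printed in the run header. Here is a tiny pure-Python sketch of that same check on a small dense system (illustrative only; a naive Gaussian elimination, not HPL's distributed solver):

```python
import random, sys

def solve(A, b):
    """Gaussian elimination with partial pivoting (teaching sketch, not HPL's solver)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # partial pivot
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def scaled_residual(A, x, b):
    """||Ax-b||_oo / (eps * (||x||_oo * ||A||_oo + ||b||_oo) * N), as in the HPL output."""
    n = len(A)
    eps = sys.float_info.epsilon / 2  # unit roundoff, the 1.110223e-16 HPL reports
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    norm = lambda v: max(abs(t) for t in v)                # vector infinity norm
    a_norm = max(sum(abs(t) for t in row) for row in A)    # matrix infinity norm
    return norm(r) / (eps * (norm(x) * a_norm + norm(b)) * n)

random.seed(1)
n = 50
A = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
b = [random.uniform(-0.5, 0.5) for _ in range(n)]
res = scaled_residual(A, solve(A, b), b)
print(res < 16.0)  # HPL's pass criterion
```

On a well-conditioned random system like this, the scaled residual sits far below HPL's 16.0 threshold, which is why failed residual checks on real runs usually point at flaky hardware (RAM, overclocks) rather than the math.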

Dhrystones, come on! Phoronix, LuxMark: https://youtu.be/wl5H5rT87JE

http://www.roylongbottom.org.uk/
http://www.roylongbottom.org.uk/dhrystone%20results.htm

Source:
https://netlib.org/benchmark/dhry-c
https://netlib.org/benchmark/

Dhrystones are fun, but like Geekbench, the results are all over the place. An Amiga 1200 + PiStorm32 + Pi 4B does 1.5 million:
https://youtu.be/NDuVh-zOwdc?t=301

Much more stable: Wine + Cinebench R15/R20/R23, or LuxMark:
http://wiki.luxcorerender.org/LuxMark#Binaries
The x86_64 binaries require the Intel and AMD CPU OpenCL drivers, on all OSes.

https://www.phoronix.com/review/power9-talos-2/2
https://www.phoronix-test-suite.com/

Failures building MPI with gfortran on 5.4.0-1071-raspi #81-Ubuntu (20.04)

Tried to run this on a cluster of Pis running Ubuntu 20.04; it failed while trying to build MPI.

Is there a reason not to use the OS-provided version of MPICH, if available?

Error here:

```
failed: [wrk-0002] (item=./configure --with-device=ch3:sock FFLAGS=-fallow-argument-mismatch) => changed=true
  ansible_loop_var: item
  cmd:
  - ./configure
  - --with-device=ch3:sock
  - FFLAGS=-fallow-argument-mismatch
  delta: '0:01:58.104863'
  end: '2022-11-10 14:31:28.537145'
  item: ./configure --with-device=ch3:sock FFLAGS=-fallow-argument-mismatch
  msg: non-zero return code
  rc: 1
  start: '2022-11-10 14:29:30.432282'
  stderr: |-
    configure: WARNING: X11 not found; GL disabled
    configure: WARNING: compilation failed
    configure: error: **** Incompatible Fortran and C Object File Types! ****
    F77 Object File Type produced by "gfortran -fallow-argument-mismatch " is : : cannot open ' (No such file or directory).
    C Object File Type produced by "gcc -O2" is : : ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), not stripped.
  stderr_lines: <omitted>
  stdout: |-
    ...
```

Benchmarking RaptorCS Blackbird (POWER9, 8 cores / 32 threads @ 3.8 GHz)

$ lscpu
Architecture:             ppc64le
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Model name:               POWER9, altivec supported
  Model:                  2.3 (pvr 004e 1203)
  Thread(s) per core:     4
  Core(s) per socket:     8
  Socket(s):              1
  Frequency boost:        enabled
  CPU(s) scaling MHz:     100%
  CPU max MHz:            3800.0000
  CPU min MHz:            2166.0000
Caches (sum of all):      
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     4 MiB (8 instances)
  L3:                     80 MiB (8 instances)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-31
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Mitigation; RFI Flush, L1D private per thread
  Mds:                    Not affected
  Meltdown:               Mitigation; RFI Flush, L1D private per thread
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:             Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
  Spectre v2:             Mitigation; Software count cache flush (hardware accelerated), Software link stack flush
  Srbds:                  Not affected
  Tsx async abort:        Not affected
$ ansible-playbook main.yml --tags "setup,benchmark" --ask-become-pass
  mpirun_output.stdout: |-
    ================================================================================
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================
  
    An explanation of the input/output parameters follows:
    T/V    : Wall time / encoded variant.
    N      : The order of the coefficient matrix A.
    NB     : The partitioning blocking factor.
    P      : The number of process rows.
    Q      : The number of process columns.
    Time   : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.
  
    The following parameter values will be used:
  
    N      :   70717
    NB     :     256
    PMAP   : Row-major process mapping
    P      :       1
    Q      :      32
    PFACT  :   Right
    NBMIN  :       4
    NDIV   :       2
    RFACT  :   Crout
    BCAST  :  1ringM
    DEPTH  :       1
    SWAP   : Mix (threshold = 64)
    L1     : transposed form
    U      : transposed form
    EQUIL  : yes
    ALIGN  : 8 double precision words
  
    --------------------------------------------------------------------------------
  
    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0
  
    ================================================================================
    T/V                N    NB     P     Q               Time                 Gflops
    --------------------------------------------------------------------------------
    WR11C2R4       70717   256     1    32            1650.43             1.4286e+02
    HPL_pdgesv() start time Thu Jun 13 15:57:05 2024
  
    HPL_pdgesv() end time   Thu Jun 13 16:24:36 2024
  
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.41238455e-03 ...... PASSED
    ================================================================================
  
    Finished      1 tests with the following results:
                  1 tests completed and passed residual checks,
                  0 tests completed and failed residual checks,
                  0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------
  
    End of Tests.
    ================================================================================

PLAY RECAP *********************************************************************************************************************************************************************************************************************************************************************
127.0.0.1                  : ok=29   changed=10   unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
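The Gflops figure HPL prints follows directly from the LU operation count for an N×N system. A quick sanity-check of the numbers above, assuming HPL's standard 2/3·N³ + 2·N² flop count:

```python
def hpl_gflops(n, seconds):
    """HPL's reported rate: (2/3 * N^3 + 2 * N^2) floating-point ops / time, in Gflops."""
    return ((2.0 / 3.0) * n**3 + 2.0 * n**2) / seconds / 1e9

# The POWER9 run above: N=70717 in 1650.43 s (HPL printed 1.4286e+02)
print(round(hpl_gflops(70717, 1650.43), 2))  # 142.86
```

The same formula reproduces the earlier run too (N=71481 in 1241.97 s → ~196.06 Gflops, matching the reported 1.9606e+02), which is handy for spotting copy/paste errors when tabulating results.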

Benchmark Raspberry Pi 5 Linux kernel NUMA patch

I would like to see if the NUMA patch here: https://lore.kernel.org/lkml/[email protected]/ — has any bearing on HPL performance and/or efficiency scores. Especially if it's reproducible and significant.

The stated numbers for Geekbench 6 are 5-ish and 20-ish percent improvements for single/multicore. I would like to see if there's any impact for HPL (which is inherently multicore, and very RAM-speed-dependent). Also measure the power usage to see if this affects power draw positively, negatively, or not at all.

NOTE: I'm testing with an 8GB Raspberry Pi 5. Default clocks, Raspberry Pi 5 Active Cooler, ambient temperature 80°F/26.7°C.

Benchmark Ampere Altra Q32-17 in HL15 Chassis

I am running an Ampere Altra Q32-17 in a 45Homelab HL15 4U rackmount chassis, and currently I can get the power draw at idle down to around 50W if I limit the system to just the motherboard, CPU, 32 GB of RAM, and a 2 TB NVMe SSD.

With all the stuff I'm using to make it a NAS (6x HDD, 4x SSD, HBA), power draw rises to around 120W. So I'll have to remove all that when doing my final power efficiency calculations.

I'm also running Rocky Linux 9, so I'll need to check if this playbook even runs properly there. We'll see!

Tests to run:

  • Full system with HDDs spinning and everything, 64 GB RAM
  • Bare system with no HDDs, HBA, etc., 64 GB RAM (4x sticks)
  • Bare system with 32 GB RAM (2x sticks)
  • Bare system with 128 GB RAM (8x sticks)
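Since the point of removing the NAS hardware is a clean Gflops/W number, the efficiency calculation sketched here subtracts the unrelated draw first (all wattage and Gflops figures below are hypothetical placeholders, not measurements):

```python
def gflops_per_watt(gflops, measured_watts, unrelated_watts=0.0):
    """Efficiency score: subtract power drawn by gear unrelated to the benchmark
    (HDDs, HBA, etc.) from the measured wall draw before dividing."""
    return gflops / (measured_watts - unrelated_watts)

# e.g. a 150 Gflops run measured at 200 W at the wall, with ~70 W of NAS gear attached:
print(round(gflops_per_watt(150.0, 200.0, 70.0), 2))  # 1.15
```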

Single-node playbook execution fails during firewall configuration due to undefined 'host_ips' variable

When running the playbook on a single-node installation using the suggested `ansible-playbook main.yml --tags "setup,benchmark"` call, the playbook execution fails during firewall configuration because the variable `host_ips` is undefined:

...
TASK [include_tasks] **********************************************************************************************************************************************************************************************
included: /tmp/top500-benchmark/firewall/configure-firewall.yml for 127.0.0.1

TASK [Creating new custom firewall zone.] *************************************************************************************************************************************************************************
changed: [127.0.0.1]

TASK [Setting custom firewall zone to accept connections.] ********************************************************************************************************************************************************
changed: [127.0.0.1]

TASK [Adding nodes as trusted sources in the firewall.] ***********************************************************************************************************************************************************
fatal: [127.0.0.1]: FAILED! =>
  msg: '''host_ips'' is undefined. ''host_ips'' is undefined'

The `host_ips` fact is only set as part of the node SSH configuration playbook (tagged 'ssh'), hence it's not present if the tasks with that tag are not executed at all.
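One possible workaround is to give `host_ips` an empty default so the loop simply iterates zero times when the 'ssh'-tagged tasks didn't run. The task below is a sketch reconstructed from the log above, not the playbook's actual code; the `firewalld` module usage and the `top500` zone name are assumptions:

```yaml
# Hypothetical guard: fall back to an empty list when the 'ssh'-tagged tasks
# (which normally set the fact) were skipped, instead of failing on the
# undefined variable.
- name: Adding nodes as trusted sources in the firewall.
  ansible.posix.firewalld:
    zone: top500
    source: "{{ item }}"
    state: enabled
    permanent: true
  loop: "{{ host_ips | default([]) }}"
```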

Benchmark Radxa Rock5 model B

Running the benchmark now. Surprisingly, Radxa's apt repository key doesn't seem to be pre-shipped with their Debian OS image download, so I had to add it to the keyring manually before running the playbook.

Benchmark Ampere Altra Max M128-30

I would like to see if I can get a good HPL benchmark run from the Ampere Altra Max M128-30.

Using this project, and assuming someone is running Ubuntu 22.04 (or any modern version of Ubuntu), I believe this test can be run with:

  1. Install Ansible: pip3 install ansible
  2. Clone this project: git clone https://github.com/geerlingguy/top500-benchmark.git && cd top500-benchmark
  3. Copy config files: cp example.config.yml config.yml && cp example.hosts.ini hosts.ini
  4. Make the following change in config.yml:
    1. Set hpl_dat_opts.Qs to 128
  5. Make sure CPU frequency scaling is set to performance (disable throttling)
  6. Run the playbook: ansible-playbook main.yml --ask-become-pass --tags "setup,benchmark"

After 8-24 hours (not sure on that system, ATLAS is the main variable!), it should spit out a result.

Follow-up runs can be done by changing directories to /opt/top500/tmp/hpl-2.3/bin/rpi and running mpirun -f cluster-hosts ./xhpl (tweak the HPL.dat file in that directory as desired).
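A note on grid shape: the single-node runs logged above use a flat grid (P=1, Q = core count), which is what step 4 sets for the 128-core Altra Max. General HPL tuning advice instead prefers a near-square grid with P ≤ Q, which is worth trying when hand-editing HPL.dat. A hypothetical helper (not part of this playbook) for picking such a grid:

```python
def pq_grid(cores):
    """Most-square HPL process grid: p * q == cores with p <= q."""
    best = (1, cores)
    for p in range(1, int(cores ** 0.5) + 1):
        if cores % p == 0:
            best = (p, cores // p)  # larger p => closer to square
    return best

print(pq_grid(128))  # (8, 16)
print(pq_grid(32))   # (4, 8)
```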
