
top500-benchmark's Introduction

Top500 Benchmark - HPL Linpack

A common generic benchmark for clusters (or extremely powerful single-node workstations) is Linpack, or HPL (High Performance Linpack), which is famous for its use in ranking supercomputers on the Top500 list over the past few decades.

I wanted to see where my various clusters and workstations would rank, historically (you can compare to past lists here), so I built this Ansible playbook which installs all the necessary tooling for HPL to run, connects all the nodes together via SSH, then runs the benchmark and outputs the result.

Why not PTS?

Phoronix Test Suite includes HPL Linpack and HPCC test suites. I may see how they compare in the future.

When I initially started down this journey, the PTS versions didn't play nicely with the Pi, especially when clustered. In fact, the PTS versions don't seem to support clustered usage at all!

Supported OSes

Currently supported OSes:

  • Ubuntu (20.04+)
  • Raspberry Pi OS (11+)
  • Debian (11+)
  • Rocky Linux (9+)
  • AlmaLinux (9+)
  • CentOS Stream (9+)
  • RHEL (9+)
  • Fedora (38+)
  • Arch Linux
  • Manjaro

Other OSes may need a few tweaks to work correctly. You can also run the playbook inside Docker (see the note under 'Benchmarking - Single Node'), but performance will be artificially limited.

Benchmarking - Cluster

Make sure you have Ansible installed (pip3 install ansible), then copy the following files:

  • cp example.hosts.ini hosts.ini: This is an inventory of all the hosts in your cluster (or just a single computer).
  • cp example.config.yml config.yml: This has some configuration options you may need to override, especially the ssh_* and ram_in_gb options (depending on your cluster layout).

Each host should be reachable via SSH using the username set in ansible_user. Other Ansible options can be set under [cluster:vars] to connect in more exotic clustering scenarios (e.g. via bastion/jump-host).
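For example, a minimal hosts.ini for a four-node cluster reached through a jump host might look like the following (the hostnames, username, and bastion address here are all placeholders):

```ini
[cluster]
node-01.local
node-02.local
node-03.local
node-04.local

[cluster:vars]
ansible_user=pi
# Hypothetical bastion/jump-host example:
ansible_ssh_common_args=-o ProxyJump=admin@bastion.example.com
```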

Tweak other settings inside config.yml as desired (the most important being hpl_root—this is where the compiled MPI, ATLAS/OpenBLAS/Blis, and HPL benchmarking code will live).

Note: The names of the nodes inside hosts.ini must match the hostname of their corresponding node; otherwise, the benchmark will hang when you try to run it in a cluster.

For example, if you have node-01.local in your hosts.ini, that host's hostname should be node-01 and not something else like raspberry-pi.
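As a quick sanity check, you can strip the domain from an inventory entry and compare the result against `hostname -s` on that node (node-01.local here is a hypothetical hosts.ini entry):

```shell
# Strip the domain from the inventory entry; the result should match
# the output of `hostname -s` on that node.
inventory_entry="node-01.local"   # hypothetical entry from hosts.ini
expected="${inventory_entry%%.*}"
echo "$expected"
```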

If you're testing with .local domains on Ubuntu, and local mDNS resolution isn't working, consider installing the avahi-daemon package:

sudo apt-get install avahi-daemon

Then run the benchmarking playbook inside this directory:

ansible-playbook main.yml

This will run three separate plays:

  1. Setup: downloads and compiles all the code required to run HPL. (This play takes a long time—up to many hours on a slower Raspberry Pi!)
  2. SSH: configures the nodes to be able to communicate with each other.
  3. Benchmark: creates an HPL.dat file and runs the benchmark, outputting the results in your console.

After the entire playbook is complete, you can also log directly into any of the nodes (though I generally do things on node 1), and run the following commands to kick off a benchmarking run:

cd ~/tmp/hpl-2.3/bin/top500
mpirun -f cluster-hosts ./xhpl

The configuration here was tested on smaller 1, 4, and 6-node clusters with 6-64 GB of RAM. Some settings in the config.yml file that affect the generated HPL.dat file may need different tuning for different cluster layouts!
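As a sketch of the tuning math, the playbook derives the HPL.dat problem size (Ns) from the rule of thumb sqrt((RAM in GB × 1024³ × node count) / 8) × 0.90. For a hypothetical 4-node cluster with 6 GB of usable RAM per node:

```shell
# Estimate Ns for 4 nodes with 6 GB usable RAM each (values are examples).
ram_gb=6
nodes=4
awk -v r="$ram_gb" -v n="$nodes" \
  'BEGIN { printf "%d\n", sqrt(r * 1024^3 * n / 8) * 0.90 }'
# prints 51080
```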

Benchmarking - Single Node

To run locally on a single node, clone or download this repository to the node where you want to run HPL. Make sure the hosts.ini is set up with the default options (with just one node, 127.0.0.1).

All the default configuration from example.config.yml should be copied to a config.yml file, and all the variables should scale dynamically for your node.

Run the following command so the cluster networking portion of the playbook is not run:

ansible-playbook main.yml --tags "setup,benchmark"

For testing, you can start an Ubuntu docker container:

docker run --name top500 -it -v $PWD:/code geerlingguy/docker-ubuntu2204-ansible:latest bash

Then go into the code directory (cd /code) and run the playbook using the command above.

Setting performance CPU frequency

If you get an error like CPU Throttling apparently enabled!, you may need to set the CPU frequency governor to performance (and disable any throttling or dynamic frequency scaling).

The exact method differs by OS and CPU type. So far, the automated performance setting in the main.yml playbook has only been tested on Raspberry Pi OS; you may need to look up how to disable throttling on your own system. Do that, then run the main.yml playbook again.
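On many Linux systems, one way to pin the governor is through the cpufreq sysfs interface. This is a sketch that assumes your kernel exposes scaling_governor; the root path is parameterized so you can dry-run it against a scratch directory first:

```shell
# Set the 'performance' governor on every CPU that exposes cpufreq controls.
# CPUFREQ_ROOT defaults to the real sysfs tree; override it to test safely.
CPUFREQ_ROOT="${CPUFREQ_ROOT:-/sys/devices/system/cpu}"
for gov in "$CPUFREQ_ROOT"/cpu*/cpufreq/scaling_governor; do
  if [ -w "$gov" ]; then
    echo performance > "$gov"
  fi
done
```

Where the cpupower tool is available, `cpupower frequency-set -g performance` accomplishes the same thing.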

Overclocking

Since I originally built this project for a Raspberry Pi cluster, I include a playbook to set an overclock for all the Raspberry Pis in a given cluster.

You can set a clock speed by changing the pi_arm_freq in the overclock-pi.yml playbook, then run it with:

ansible-playbook overclock-pi.yml
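For instance, to target 2.0 GHz (the value below is only an illustration; check what your board and cooling can actually sustain):

```yaml
# In overclock-pi.yml: pi_arm_freq is in MHz.
# Example value only; stock BCM2711 clocks are 1500-1800 MHz depending on revision.
pi_arm_freq: 2000
```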

Higher clock speeds require more power and generate more heat, so if your Pi cluster is cooled with heatsinks alone, you may also need a fan blowing over them when running overclocked.

Results

Here are a few of the results I've acquired in my testing (sorted by efficiency, highest to lowest):

| Configuration | Architecture | Result | Wattage | Gflops/W |
| --- | --- | --- | --- | --- |
| Radxa CM5 (RK3588S2 8-core) | Arm | 48.619 Gflops | 10W | 4.86 Gflops/W |
| Ampere Altra Q64-22 @ 2.2 GHz | Arm | 655.90 Gflops | 140W | 4.69 Gflops/W |
| Orange Pi 5 (RK3588S 8-core) | Arm | 53.333 Gflops | 11.5W | 4.64 Gflops/W |
| Radxa ROCK 5B (RK3588 8-core) | Arm | 51.382 Gflops | 12W | 4.32 Gflops/W |
| Ampere Altra Max M128-28 @ 2.8 GHz | Arm | 1,265.5 Gflops | 296W | 4.27 Gflops/W |
| Radxa ROCK 5C (RK3588S2 8-core) | Arm | 49.285 Gflops | 12W | 4.11 Gflops/W |
| Ampere Altra Max M96-28 @ 2.8 GHz | Arm | 1,188.3 Gflops | 295W | 4.01 Gflops/W |
| M1 Max Mac Studio (1x M1 Max @ 3.2 GHz, in Docker) | Arm | 264.32 Gflops | 66W | 4.00 Gflops/W |
| Ampere Altra Q32-17 @ 1.7 GHz | Arm | 332.07 Gflops | 100W | 3.32 Gflops/W |
| Turing Machines RK1 (RK3588 8-core) | Arm | 59.810 Gflops | 18.1W | 3.30 Gflops/W |
| Turing Pi 2 (4x RK1 @ 2.4 GHz) | Arm | 224.60 Gflops | 73W | 3.08 Gflops/W |
| Raspberry Pi 5 (BCM2712 @ 2.4 GHz) | Arm | 30.249 Gflops | 11W | 2.75 Gflops/W |
| LattePanda Mu (1x N100 @ 3.4 GHz) | x86 | 62.851 Gflops | 25W | 2.51 Gflops/W |
| Radxa X4 (1x N100 @ 3.4 GHz) | x86 | 37.224 Gflops | 16W | 2.33 Gflops/W |
| Raspberry Pi CM4 (BCM2711 @ 1.5 GHz) | Arm | 11.433 Gflops | 5.2W | 2.20 Gflops/W |
| Ampere Altra Max M128-30 @ 3.0 GHz | Arm | 953.47 Gflops | 500W | 1.91 Gflops/W |
| Turing Pi 2 (4x CM4 @ 1.5 GHz) | Arm | 44.942 Gflops | 24.5W | 1.83 Gflops/W |
| Lenovo M710q Tiny (1x i5-7400T @ 2.4 GHz) | x86 | 72.472 Gflops | 41W | 1.76 Gflops/W |
| Raspberry Pi 4 (BCM2711 @ 1.8 GHz) | Arm | 11.889 Gflops | 7.2W | 1.65 Gflops/W |
| Turing Pi 2 (4x CM4 @ 2.0 GHz) | Arm | 51.327 Gflops | 33W | 1.54 Gflops/W |
| DeskPi Super6c (6x CM4 @ 1.5 GHz) | Arm | 60.293 Gflops | 40W | 1.50 Gflops/W |
| Orange Pi CM4 (RK3566 4-core) | Arm | 5.604 Gflops | 4.0W | 1.40 Gflops/W |
| DeskPi Super6c (6x CM4 @ 2.0 GHz) | Arm | 70.338 Gflops | 51W | 1.38 Gflops/W |
| AMD Ryzen 5 5600x @ 3.7 GHz | x86 | 229 Gflops | 196W | 1.16 Gflops/W |
| Milk-V Mars CM (JH7110 4-core) | RISC-V | 1.99 Gflops | 3.6W | 0.55 Gflops/W |
| Lichee Console 4A (TH1520 4-core) | RISC-V | 1.99 Gflops | 3.6W | 0.55 Gflops/W |
| Milk-V Jupiter (Spacemit X60 8-core) | RISC-V | 5.66 Gflops | 10.6W | 0.55 Gflops/W |
| Milk-V Mars (JH7110 4-core) | RISC-V | 2.06 Gflops | 4.7W | 0.44 Gflops/W |
| Raspberry Pi Zero 2 W (RP3A0-AU @ 1.0 GHz) | Arm | 0.370 Gflops | 2.1W | 0.18 Gflops/W |
| M2 Pro MacBook Pro (1x M2 Pro, in Asahi Linux) | Arm | 296.93 Gflops | N/A | N/A |
| M2 MacBook Air (1x M2 @ 3.5 GHz, in Docker) | Arm | 104.68 Gflops | N/A | N/A |

You can enter the Gflops in this tool to see how it compares to historical top500 lists.

Note: My current calculation for efficiency is based on average power draw over the course of the benchmark, based on either a Kill-A-Watt (pre-2024 tests) or a ThirdReality Smart Outlet monitor. The efficiency calculations may vary depending on the specific system under test.
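The Gflops/W column is just the benchmark result divided by the average power draw; for example, the Raspberry Pi 5 row works out as:

```shell
# 30.249 Gflops at an average 11W draw:
awk -v gflops=30.249 -v watts=11 'BEGIN { printf "%.2f Gflops/W\n", gflops / watts }'
# prints 2.75 Gflops/W
```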

Other Listings

Over the years, as I find other people's listings of HPL results—especially those with power usage ratings—I will add them here:

top500-benchmark's People

Contributors: darkassassin23, geerlingguy

top500-benchmark's Issues

OpenBLAS or BLIS instead of ATLAS?

After doing some more testing with Ampere's recommended HPL setup (with an Ampere-optimized BLIS library), I would like to investigate switching away from ATLAS.

The primary motivation is build speed. I've noticed some machines can compile in an hour or two, but others take 2-3 days (especially slower systems like the Raspberry Pi 4...).

That's not especially fun, but in the past I've stuck with this method thinking it will compile ATLAS in a way that is tuned to each specific processor the best. Supposedly. (Who understands all this math that well anyway?)

I would like to compare other options like OpenBLAS or BLIS to see:

  1. If they are able to be used easily as a drop-in replacement for ATLAS
  2. If the performance result is affected too drastically (which would trigger me wanting to re-run HPL on all my machines, heh)

Benchmark Orange Pi 5

The Orange Pi 5 has a Rockchip RK3588S with 8 cores (A76 quad core + A55 quad core), and runs up to 2.4 GHz.

SSH connection problems running multiple slave nodes on a cluster

Hi Everyone!

My name is Mike Lindstedt, and I'm a senior student studying Electrical Engineering at Grove City College. I'm working on a project for a parallel computer architecture course with a group of four other people. Our project is to create a supercomputer out of 5 HP laptops and a network switch. We are connecting the laptops as nodes in a cluster on the network and need to run Linpack to measure the performance of the cluster.

So far, we removed the Windows OS from each computer and replaced it with Ubuntu Linux. Then we downloaded the code from this Top500 Benchmark - HPL Linpack GitHub page. We used one laptop as the master node distributing the workload, and the other four as slave nodes receiving the workload.

When we run the program per the instructions, we can get the benchmark to run successfully with one master node and one slave node. However, when we run the benchmark with multiple slave nodes, one or more nodes always fail while attempting to establish SSH connections, after we are prompted to enter the passcode for the host node several times on the command line. We've tried several options, including modifying some of the relevant files and researching how changing those files might help, but we haven't had any luck.

We need to finish the project by Thursday, April 20. Has anyone ever encountered this error before when attempting to benchmark a cluster with this code? Do you have any suggestions on how we might resolve it? Any help would be greatly appreciated!

Thanks in advance,

Mike Lindstedt

Benchmark Orange Pi Compute Module 4

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   14745
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       14745   256     1     4             381.41             5.6043e+00
HPL_pdgesv() start time Sat Nov  4 01:45:49 2023

HPL_pdgesv() end time   Sat Nov  4 01:52:11 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   4.07780599e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

I don't have the power usage measurement, however.

Benchmark M1 Max Mac Studio

To do this:

  1. git clone https://github.com/geerlingguy/top500-benchmark.git && cd top500-benchmark
  2. cp example.hosts.ini hosts.ini && cp example.config.yml config.yml
  3. Edit config.yml and change Qs to 10 (for 10 vCPUs in Docker)
  4. Make sure Docker is running, and increase RAM to 32 GB and CPUs to 10 in Resources settings.
  5. Start container: docker run --name top500 -it -v $PWD:/code geerlingguy/docker-ubuntu2204-ansible:latest bash
  6. Go into code directory: cd /code
  7. Run the benchmark: ansible-playbook main.yml --tags "setup,benchmark"

Benchmark Raspberry Pi 5 model B - 4 core A76 2.4 GHz

Base clock of 2.4 GHz:

30.249 Gflops at 11W = 2.75 Gflops/W

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   23314
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       23314   256     1     4             279.31             3.0249e+01
HPL_pdgesv() start time Fri Sep 22 14:26:10 2023

HPL_pdgesv() end time   Fri Sep 22 14:30:49 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.83945609e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
            1 tests completed and passed residual checks,
            0 tests completed and failed residual checks,
            0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Generalize the 'ssh' play in main.yml for any type of cluster

The main.yml playbook's 'Configure SSH connections between nodes.' play was written for a cluster of Raspberry Pis running Raspberry Pi OS. As such, it's a little inflexible in its current form.

I would like to make this more generic so it can work on any cluster.

In fact, I may consider breaking that play out into its own playbook, especially if it can be run independently of the main benchmarking setup/benchmark playbook.

Install ATLAS Task failed

hosts.ini

# For single node benchmarking (default), use this:
[cluster]
127.0.0.1 ansible_connection=local

# For cluster benchmarking, delete everything above this line and uncomment:
# [cluster]
# node-01.local
# node-02.local
# node-03.local
# node-04.local
# node-05.local
#
# [cluster:vars]
# ansible_user=username

config.yml

# Working directory where HPL and associated applications will be compiled.
hpl_root: /opt/top500

# Home directory of the user for whom SSH keys will be configured.
ssh_user: dantemp
ssh_user_home: /home/dantemp

# Specify manually if needed for mixed-RAM-capacity clusters.
ram_in_gb: "{{ ( ansible_memtotal_mb / 1024 * 0.75 ) | int | abs }}"

# Count the nodes for accurate HPL.dat calculations.
nodecount: "{{ ansible_play_hosts | length | int }}"

# HPL.dat configuration options.
# See: https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
# See also: https://hpl-calculator.sourceforge.net/HPL-HowTo.pdf
hpl_dat_opts:
  # sqrt((Memory in GB * 1024 * 1024 * 1024 * Node count) / 8) * 0.9
  Ns: "{{ (((((ram_in_gb | int) * 1024 * 1024 * 1024 * (nodecount | int)) / 8) | root) * 0.90) | int }}"
  NBs: 256
  # (P * Q) should be roughly equivalent to total core count, with Qs higher.
  # If running on a single system, Ps should be 1 and Qs should be core count.
  Ps: 1
  Qs: 192

ATLAS error:

failed: [127.0.0.1] (item=../ATLAS/configure) => changed=true
  ansible_loop_var: item
  cmd:
  - ../ATLAS/configure
  delta: '0:00:04.281680'
  end: '2022-11-22 06:54:52.777140'
  item: ../ATLAS/configure
  msg: non-zero return code
  rc: 1
  start: '2022-11-22 06:54:48.495460'
  stderr: |-
    /usr/bin/ld: /tmp/ccMxBJwV.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: probe_OS.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: probe_asm.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: probe_vec.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: probe_arch.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    /usr/bin/ld: /tmp/ccQqmLej.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    ERROR: enum fam=0, chip=0, model=17, mach=0
    make[3]: *** [Makefile:106: atlas_run] Error 100
    make[2]: *** [Makefile:449: IRunArchInfo_x86] Error 2
    /usr/bin/ld: /tmp/ccwT5Idd.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    ERROR: enum fam=0, chip=0, model=17, mach=0
    make[3]: *** [Makefile:106: atlas_run] Error 100
    make[2]: *** [Makefile:449: IRunArchInfo_x86] Error 2
    ERROR: enum fam=0, chip=0, model=17, mach=0
    make[3]: *** [Makefile:106: atlas_run] Error 100
    make[2]: *** [Makefile:449: IRunArchInfo_x86] Error 2
    /usr/bin/ld: probe_pmake.o: in function `ATL_tmpnam':
    /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include/atlas_sys.h:224: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
    ERROR: enum fam=0, chip=0, model=17, mach=0
    make[3]: *** [Makefile:106: atlas_run] Error 100
    make[2]: *** [Makefile:449: IRunArchInfo_x86] Error 2
    It appears you have cpu throttling enabled, which makes timings
    unreliable and an ATLAS install nonsensical.

    OS-controlled CPU throttling is so course grained, that timings become
    essentially random.  What this means for an ATLAS install is that ATLAS
    cannot tell the difference between a good and bad kernel, and so the
    tuning step may result in arbitrarily bad performance.  If you don't care
    about performance, you are usually better off just using the reference BLAS.

    If you fear overheating, setting clock speed to some lower, constant
    value should give you a decent install.

    Hardware-controlled throttling is usually much finer grained, and therefore
    may result in mediocre tuning, but this will depend quite bit on luck.

    If your machine has OS throttling enabled, it is critical that you disable
    it (with something like cpufreq-set).  See INSTALL.txt for details.

    If you you do not care at all about performance, you can rerun configure
    with --cripple-atlas-performance to proceed in the face of throttling.
    Do not do this unless you really don't care about performance.
    If you are able to turn off throttling, rerun configure as normal
    once you have done so.

    Aborting due to throttling
  stderr_lines: <omitted>
  stdout: |-
    gcc -I/opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include  -g -w -c /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/src/atlconf_misc.c
    gcc -I/opt/top500/tmp/atlas-build/../ATLAS//CONFIG/include  -g -w -o xconfig /opt/top500/tmp/atlas-build/../ATLAS//CONFIG/src/config.c atlconf_misc.o
    ./xconfig -d s /opt/top500/tmp/atlas-build/../ATLAS/ -d b /opt/top500/tmp/atlas-build

    OS configured as Linux (1)

    Assembly configured as GAS_x8664 (2)

    Vector ISA Extension configured as  AVXMAC (4,976)

    Architecture configured as  UNKNOWNx86 (43)

    Clock rate configured as 3709Mhz

    Maximum number of threads configured as  384
    Parallel make command configured as '$(MAKE) -j 384'
    CPU Throttling apparently enabled!
    xconfig exited with 1
  stdout_lines: <omitted>
failed: [127.0.0.1] (item=make) => changed=true
  ansible_loop_var: item
  cmd:
  - make
  delta: '0:00:00.003766'
  end: '2022-11-22 06:54:52.880356'
  item: make
  msg: non-zero return code
  rc: 2
  start: '2022-11-22 06:54:52.876590'
  stderr: |-
    make[1]: Make.top: No such file or directory
    make[1]: *** No rule to make target 'Make.top'.  Stop.
    make: *** [Makefile:536: build] Error 2
  stderr_lines: <omitted>
  stdout: |-
    make -f Make.top build
    make[1]: Entering directory '/opt/top500/tmp/atlas-build'
    make[1]: Leaving directory '/opt/top500/tmp/atlas-build'
  stdout_lines: <omitted>

PLAY RECAP ***************************************************************************************************************************************
127.0.0.1                  : ok=10   changed=5    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0

dantemp@genoatest:~/top500-benchmark$

Benchmark Asus RS300-E11

The RS300-E11 we have is configured with:

1 x Xeon E-2374G @ 5.0GHz
64GB DDR4
No GPU.

TASK [Output the results.] *****************************************************************************************************
ok: [127.0.0.1] =>                                                                                                              
  mpirun_output.stdout: |-                                                                                                      
    ================================================================================                                            
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018                                                
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK                                            
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK                                                            
    Modified by Julien Langou, University of Colorado Denver                                                                    
    ================================================================================                                            
                                                                                                                                
    An explanation of the input/output parameters follows:                                                                      
    T/V    : Wall time / encoded variant.                                                                                       
    N      : The order of the coefficient matrix A.                                                                             
    NB     : The partitioning blocking factor.                                                                                  
    P      : The number of process rows.                                                                                        
    Q      : The number of process columns.                                                                                     
    Time   : Time in seconds to solve the linear system.                                                                        
    Gflops : Rate of execution for solving the linear system.                                                                   
                                                                                                                                
    The following parameter values will be used:                                                                                
                                                                                                                                
    N      :   71481                                                                                                            
    NB     :     256                                                                                                            
    PMAP   : Row-major process mapping                                                                                          
    P      :       1                                                                                                            
    Q      :       8                                                                                                            
    PFACT  :   Right                                                                                                            
    NBMIN  :       4                                                                                                            
    NDIV   :       2                                                                                                            
    RFACT  :   Crout                                                                                                            
    BCAST  :  1ringM                                                                                                            
    DEPTH  :       1                                                                                                            
    SWAP   : Mix (threshold = 64)                                                                                               
    L1     : transposed form                                                                                                    
    U      : transposed form                                                                                                    
    EQUIL  : yes                                                                                                                
    ALIGN  : 8 double precision words                                                                                           
                                                                                                                                
    --------------------------------------------------------------------------------                                            
  
    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0
                                             
    ================================================================================                                                                                                  
    T/V                N    NB     P     Q               Time                 Gflops                                                                                                  
    --------------------------------------------------------------------------------                                                                                                  
    WR11C2R4       71481   256     1     8            1241.97             1.9606e+02                                                                                                  
    HPL_pdgesv() start time Wed Nov 29 22:21:15 2023                                       
                                             
    HPL_pdgesv() end time   Wed Nov 29 22:41:57 2023                                       
                                             
    --------------------------------------------------------------------------------                                                                                                  
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.54353754e-03 ...... PASSED                                                                                                  
    ================================================================================                                                                                                  
                                             
    Finished      1 tests with the following results:                                      
                  1 tests completed and passed residual checks,                            
                  0 tests completed and failed residual checks,                            
                  0 tests skipped because of illegal input values.                                                                                                                    
    --------------------------------------------------------------------------------                                                                                                  
                                             
    End of Tests.                            
    ================================================================================

Thought it would be fun to benchmark one of our lower-specced servers (it runs a couple of web servers). A better result than I was expecting for a 4-core/8-thread machine.
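The PASSED line above is HPL's scaled residual check, using the formula printed in the run header. Here is a tiny pure-Python sketch of that same check on a small dense system (illustrative only; a naive Gaussian elimination, not HPL's distributed solver):

```python
import random, sys

def solve(A, b):
    """Gaussian elimination with partial pivoting (teaching sketch, not HPL's solver)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # partial pivot
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def scaled_residual(A, x, b):
    """||Ax-b||_oo / (eps * (||x||_oo * ||A||_oo + ||b||_oo) * N), as in the HPL output."""
    n = len(A)
    eps = sys.float_info.epsilon / 2  # unit roundoff, the 1.110223e-16 HPL reports
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    norm = lambda v: max(abs(t) for t in v)                # vector infinity norm
    a_norm = max(sum(abs(t) for t in row) for row in A)    # matrix infinity norm
    return norm(r) / (eps * (norm(x) * a_norm + norm(b)) * n)

random.seed(1)
n = 50
A = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
b = [random.uniform(-0.5, 0.5) for _ in range(n)]
res = scaled_residual(A, solve(A, b), b)
print(res < 16.0)  # HPL's pass criterion
```

On a well-conditioned random system like this, the scaled residual sits far below HPL's 16.0 threshold, which is why failed residual checks on real runs usually point at flaky hardware (RAM, overclocks) rather than the math.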

Dhrystones, come on! Phoronix, LuxMark: https://youtu.be/wl5H5rT87JE

http://www.roylongbottom.org.uk/
http://www.roylongbottom.org.uk/dhrystone%20results.htm

Source:
https://netlib.org/benchmark/dhry-c
https://netlib.org/benchmark/

Dhrystones are fun, but like Geekbench, the results are all over the place. An Amiga 1200 + PiStorm32 + Pi 4B does 1.5 million:
https://youtu.be/NDuVh-zOwdc?t=301

Much more stable: Wine + Cinebench R15/R20/R23, or LuxMark:
http://wiki.luxcorerender.org/LuxMark#Binaries
The x86_64 binaries require the Intel and AMD CPU OpenCL drivers, on all OSes.

https://www.phoronix.com/review/power9-talos-2/2
https://www.phoronix-test-suite.com/

Failures building MPI with gfortran on 5.4.0-1071-raspi #81-Ubuntu (20.04)

Tried to run this on a cluster of Pis running Ubuntu 20.04; it failed while trying to build MPI.

Is there a reason not to use the OS-provided version of MPICH, if available?

Error here:

```
failed: [wrk-0002] (item=./configure --with-device=ch3:sock FFLAGS=-fallow-argument-mismatch) => changed=true
  ansible_loop_var: item
  cmd:
  - ./configure
  - --with-device=ch3:sock
  - FFLAGS=-fallow-argument-mismatch
  delta: '0:01:58.104863'
  end: '2022-11-10 14:31:28.537145'
  item: ./configure --with-device=ch3:sock FFLAGS=-fallow-argument-mismatch
  msg: non-zero return code
  rc: 1
  start: '2022-11-10 14:29:30.432282'
  stderr: |-
    configure: WARNING: X11 not found; GL disabled
    configure: WARNING: compilation failed
    configure: error: **** Incompatible Fortran and C Object File Types! ****
    F77 Object File Type produced by "gfortran -fallow-argument-mismatch " is : : cannot open ' (No such file or directory).
    C Object File Type produced by "gcc -O2" is : : ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), not stripped.
  stderr_lines: <omitted>
  stdout: |-
    ...
```

Benchmarking RaptorCS Blackbird (POWER9, 8 cores / 32 threads @ 3.8 GHz)

$ lscpu
Architecture:             ppc64le
  Byte Order:             Little Endian
CPU(s):                   32
  On-line CPU(s) list:    0-31
Model name:               POWER9, altivec supported
  Model:                  2.3 (pvr 004e 1203)
  Thread(s) per core:     4
  Core(s) per socket:     8
  Socket(s):              1
  Frequency boost:        enabled
  CPU(s) scaling MHz:     100%
  CPU max MHz:            3800.0000
  CPU min MHz:            2166.0000
Caches (sum of all):      
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     4 MiB (8 instances)
  L3:                     80 MiB (8 instances)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-31
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Mitigation; RFI Flush, L1D private per thread
  Mds:                    Not affected
  Meltdown:               Mitigation; RFI Flush, L1D private per thread
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:             Mitigation; __user pointer sanitization, ori31 speculation barrier enabled
  Spectre v2:             Mitigation; Software count cache flush (hardware accelerated), Software link stack flush
  Srbds:                  Not affected
  Tsx async abort:        Not affected
$ ansible-playbook main.yml --tags "setup,benchmark" --ask-become-pass
  mpirun_output.stdout: |-
    ================================================================================
    HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
    Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
    Modified by Julien Langou, University of Colorado Denver
    ================================================================================
  
    An explanation of the input/output parameters follows:
    T/V    : Wall time / encoded variant.
    N      : The order of the coefficient matrix A.
    NB     : The partitioning blocking factor.
    P      : The number of process rows.
    Q      : The number of process columns.
    Time   : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.
  
    The following parameter values will be used:
  
    N      :   70717
    NB     :     256
    PMAP   : Row-major process mapping
    P      :       1
    Q      :      32
    PFACT  :   Right
    NBMIN  :       4
    NDIV   :       2
    RFACT  :   Crout
    BCAST  :  1ringM
    DEPTH  :       1
    SWAP   : Mix (threshold = 64)
    L1     : transposed form
    U      : transposed form
    EQUIL  : yes
    ALIGN  : 8 double precision words
  
    --------------------------------------------------------------------------------
  
    - The matrix A is randomly generated for each test.
    - The following scaled residual check will be computed:
          ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
    - The relative machine precision (eps) is taken to be               1.110223e-16
    - Computational tests pass if scaled residuals are less than                16.0
  
    ================================================================================
    T/V                N    NB     P     Q               Time                 Gflops
    --------------------------------------------------------------------------------
    WR11C2R4       70717   256     1    32            1650.43             1.4286e+02
    HPL_pdgesv() start time Thu Jun 13 15:57:05 2024
  
    HPL_pdgesv() end time   Thu Jun 13 16:24:36 2024
  
    --------------------------------------------------------------------------------
    ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.41238455e-03 ...... PASSED
    ================================================================================
  
    Finished      1 tests with the following results:
                  1 tests completed and passed residual checks,
                  0 tests completed and failed residual checks,
                  0 tests skipped because of illegal input values.
    --------------------------------------------------------------------------------
  
    End of Tests.
    ================================================================================

PLAY RECAP *********************************************************************************************************************************************************************************************************************************************************************
127.0.0.1                  : ok=29   changed=10   unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
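The Gflops figure HPL prints follows directly from the LU operation count for an N×N system. A quick sanity-check of the numbers above, assuming HPL's standard 2/3·N³ + 2·N² flop count:

```python
def hpl_gflops(n, seconds):
    """HPL's reported rate: (2/3 * N^3 + 2 * N^2) floating-point ops / time, in Gflops."""
    return ((2.0 / 3.0) * n**3 + 2.0 * n**2) / seconds / 1e9

# The POWER9 run above: N=70717 in 1650.43 s (HPL printed 1.4286e+02)
print(round(hpl_gflops(70717, 1650.43), 2))  # 142.86
```

The same formula reproduces the earlier run too (N=71481 in 1241.97 s → ~196.06 Gflops, matching the reported 1.9606e+02), which is handy for spotting copy/paste errors when tabulating results.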

Benchmark Raspberry Pi 5 Linux kernel NUMA patch

I would like to see if the NUMA patch here: https://lore.kernel.org/lkml/[email protected]/ — has any bearing on HPL performance and/or efficiency scores. Especially if it's reproducible and significant.

The stated numbers for Geekbench 6 are 5-ish and 20-ish percent improvements for single/multicore. I would like to see if there's any impact for HPL (which is inherently multicore, and very RAM-speed-dependent). Also measure the power usage to see if this affects power draw positively, negatively, or not at all.

NOTE: I'm testing with an 8GB Raspberry Pi 5. Default clocks, Raspberry Pi 5 Active Cooler, ambient temperature 80°F/26.7°C.

Benchmark Ampere Altra Q32-17 in HL15 Chassis

I am running an Ampere Altra Q32-17 in a 45Homelab HL15 4U rackmount chassis, and currently I can get the power draw at idle down to around 50W if I limit the system to just the motherboard, CPU, 32 GB of RAM, and a 2 TB NVMe SSD.

With all the stuff I'm using to make it a NAS (6x HDD, 4x SSD, HBA), power draw rises to around 120W. So I'll have to remove all that when doing my final power efficiency calculations.

I'm also running Rocky Linux 9, so I'll need to check if this playbook even runs properly there. We'll see!

Tests to run:

  • Full system with HDDs spinning and everything, 64 GB RAM
  • Bare system with no HDDs, HBA, etc., 64 GB RAM (4x sticks)
  • Bare system with 32 GB RAM (2x sticks)
  • Bare system with 128 GB RAM (8x sticks)
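Since the point of removing the NAS hardware is a clean Gflops/W number, the efficiency calculation sketched here subtracts the unrelated draw first (all wattage and Gflops figures below are hypothetical placeholders, not measurements):

```python
def gflops_per_watt(gflops, measured_watts, unrelated_watts=0.0):
    """Efficiency score: subtract power drawn by gear unrelated to the benchmark
    (HDDs, HBA, etc.) from the measured wall draw before dividing."""
    return gflops / (measured_watts - unrelated_watts)

# e.g. a 150 Gflops run measured at 200 W at the wall, with ~70 W of NAS gear attached:
print(round(gflops_per_watt(150.0, 200.0, 70.0), 2))  # 1.15
```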

Single-node playbook execution fails during firewall configuration due to undefined 'host_ips' variable

When running the playbook on a single-node installation using the suggested `ansible-playbook main.yml --tags "setup,benchmark"` call, the playbook execution fails during firewall configuration because the variable `host_ips` is undefined:

...
TASK [include_tasks] **********************************************************************************************************************************************************************************************
included: /tmp/top500-benchmark/firewall/configure-firewall.yml for 127.0.0.1

TASK [Creating new custom firewall zone.] *************************************************************************************************************************************************************************
changed: [127.0.0.1]

TASK [Setting custom firewall zone to accept connections.] ********************************************************************************************************************************************************
changed: [127.0.0.1]

TASK [Adding nodes as trusted sources in the firewall.] ***********************************************************************************************************************************************************
fatal: [127.0.0.1]: FAILED! =>
  msg: '''host_ips'' is undefined. ''host_ips'' is undefined'

The `host_ips` fact is only set as part of the node SSH configuration playbook (tagged 'ssh'), hence it's not present if the tasks with that tag are not executed at all.
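One possible workaround is to give `host_ips` an empty default so the loop simply iterates zero times when the 'ssh'-tagged tasks didn't run. The task below is a sketch reconstructed from the log above, not the playbook's actual code; the `firewalld` module usage and the `top500` zone name are assumptions:

```yaml
# Hypothetical guard: fall back to an empty list when the 'ssh'-tagged tasks
# (which normally set the fact) were skipped, instead of failing on the
# undefined variable.
- name: Adding nodes as trusted sources in the firewall.
  ansible.posix.firewalld:
    zone: top500
    source: "{{ item }}"
    state: enabled
    permanent: true
  loop: "{{ host_ips | default([]) }}"
```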

Benchmark Radxa Rock5 model B

Running the benchmark now. Surprisingly, Radxa's apt repository key doesn't seem to be pre-shipped with their Debian OS image download, so I had to add it to the keyring manually before running the playbook.

Benchmark Ampere Altra Max M128-30

I would like to see if I can get a good HPL benchmark run from the Ampere Altra Max M128-30.

Using this project, and assuming someone is running Ubuntu 22.04 (or any modern version of Ubuntu), I believe this test can be run with:

  1. Install Ansible: pip3 install ansible
  2. Clone this project: git clone https://github.com/geerlingguy/top500-benchmark.git && cd top500-benchmark
  3. Copy config files: cp example.config.yml config.yml && cp example.hosts.ini hosts.ini
  4. Make the following change in config.yml:
    1. Set hpl_dat_opts.Qs to 128
  5. Make sure CPU frequency scaling is set to performance (disable throttling)
  6. Run the playbook: ansible-playbook main.yml --ask-become-pass --tags "setup,benchmark"

After 8-24 hours (not sure on that system, ATLAS is the main variable!), it should spit out a result.

Follow-up runs can be done by changing directories to /opt/top500/tmp/hpl-2.3/bin/rpi and running mpirun -f cluster-hosts ./xhpl (tweak the HPL.dat file in that directory as desired).
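A note on grid shape: the single-node runs logged above use a flat grid (P=1, Q = core count), which is what step 4 sets for the 128-core Altra Max. General HPL tuning advice instead prefers a near-square grid with P ≤ Q, which is worth trying when hand-editing HPL.dat. A hypothetical helper (not part of this playbook) for picking such a grid:

```python
def pq_grid(cores):
    """Most-square HPL process grid: p * q == cores with p <= q."""
    best = (1, cores)
    for p in range(1, int(cores ** 0.5) + 1):
        if cores % p == 0:
            best = (p, cores // p)  # larger p => closer to square
    return best

print(pq_grid(128))  # (8, 16)
print(pq_grid(32))   # (4, 8)
```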
