I have problems running the benchmark codes. I can collect several issues here:


Problems running Benchmark Codes (grid, CLOSED)


paboyle commented on July 18, 2024

Hi Thorsten,

Please post your compiler, configure command and config log. I noticed in an earlier issue you were using GCC.

I have never, NEVER, verified AVX512 on gcc.

I know for a fact that Clang++ generates illegal instructions from my legal intrinsics.

I can only recommend ICPC for AVX512 at this stage.


coppolachan commented on July 18, 2024

The very last error occurs because you are trying to run the optimised kernel on a double-precision field, which is the default if you did not choose a precision explicitly. This is not supported at the moment (hence the assert), but it will be in the next release.


paboyle commented on July 18, 2024

So presumably --enable-precision=single will make the latter work on KNL prior to us getting the double
Peter


azrael417 commented on July 18, 2024

Hi, I am able to run the single-precision benchmarks. The flop rate I get is not great: about 450 Gflops on a single KNL node with 128 threads and a 32^4 lattice. Does someone have a preferred local lattice size (with Ls=16)?

I am using the Intel compiler. It tried to use GNU only because I had not run ./bootstrap.sh first. After that it used the Intel compiler.


coppolachan commented on July 18, 2024

Are you using the --dslash-opt flag on the command line? This enables the fastest routines.


azrael417 commented on July 18, 2024

Grid : Message : 24 ms : Grid is setup to use 128 threads
Grid : Message : 753 ms : Making s innermost grids
Grid : Message : 211896 ms : Naive wilson implementation
Grid : Message : 211896 ms : Calling Dw
Grid : Message : 217279 ms : Called Dw 100 times in 5.38234e+06 us
Grid : Message : 217279 ms : norm result 4.99943e+09
Grid : Message : 217284 ms : norm ref 4.99943e+09
Grid : Message : 217290 ms : mflop/s = 418936
Grid : Message : 217290 ms : mflop/s per node = 418936
Grid : Message : 217930 ms : norm diff 4.29343e-05
Grid : Message : 217949 ms : #### Dhop calls report
Grid : Message : 217949 ms : WilsonFermion5D Number of Dhop Calls : 100
Grid : Message : 217949 ms : WilsonFermion5D Total Communication time : 231 us
Grid : Message : 217949 ms : WilsonFermion5D CommTime/Calls : 2.31 us
Grid : Message : 217949 ms : WilsonFermion5D Total Compute time : 5.38182e+06 us
Grid : Message : 217949 ms : WilsonFermion5D ComputeTime/Calls : 53818.2 us
Grid : Message : 217949 ms : Average mflops/s per call : 418976
Grid : Message : 217949 ms : Average mflops/s per call per node : 418976
Grid : Message : 217949 ms : WilsonFermion5D Stencil
Grid : Message : 217949 ms : Stencil calls 100
Grid : Message : 217949 ms : Stencil halogtime 1.5
Grid : Message : 217949 ms : Stencil gathertime 0
Grid : Message : 217949 ms : Stencil gathermtime 0
Grid : Message : 217949 ms : Stencil mergetime 0
Grid : Message : 217950 ms : Stencil jointime 0
Grid : Message : 217950 ms : Stencil spintime 0
Grid : Message : 217950 ms : Stencil splicetime 0
Grid : Message : 217950 ms : Stencil nosplicetime 0
Grid : Message : 217950 ms : Stencil t_table 0
Grid : Message : 217950 ms : Stencil t_data 0
Grid : Message : 217950 ms : WilsonFermion5D StencilEven
Grid : Message : 217950 ms : WilsonFermion5D StencilOdd

This is what I get. I use 128 threads with spread binding, the optimised Dslash, and one rank per node. Is that expected?
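As a rough cross-check of the figures in this log, the reported rate can be reproduced from the call count and the elapsed time. The sketch below is only an estimate under assumptions that the output does not state: Ls=16 (the value mentioned earlier in the thread), a 32^4 four-dimensional lattice, and 1344 flops per five-dimensional site, which is the count conventionally quoted for a Wilson Dslash. The function name is hypothetical and is not part of Grid.

```python
def dslash_mflops(lattice_dims, ls, calls, elapsed_us, flops_per_site=1344):
    """Estimate the Dslash rate in Mflop/s.

    flops_per_site=1344 is an assumed, conventional Wilson Dslash count;
    it does not appear anywhere in the benchmark output.
    """
    sites = ls
    for extent in lattice_dims:
        sites *= extent
    total_flops = flops_per_site * sites * calls
    # Flops per microsecond is numerically the same as Mflop/s.
    return total_flops / elapsed_us


# Values taken from the log: 100 calls of Dw in 5.38234e+06 us.
rate = dslash_mflops([32, 32, 32, 32], ls=16, calls=100, elapsed_us=5.38234e6)
print(f"{rate:.0f} Mflop/s = {rate / 1000:.1f} Gflop/s")
```

Under those assumptions the estimate lands very close to the logged mflop/s = 418936, roughly 419 Gflop/s, which suggests the log line is for the whole node and not per core.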


coppolachan commented on July 18, 2024

Two things:

1. The code is designed to work better with 1 thread per core.
2. Could you post the full output and the command line? There are several implementations of the kernel being tested, and the portion you are showing is the slowest one.


azrael417 commented on July 18, 2024

The worst I found gave 100 Gflops. I will try 64 threads now.
I am going to upload some output later and maybe you can tell me if that is expected.
single_node_output.txt

This is the run script:

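#!/bin/bash
# Slurm batch script: one KNL node, quad cluster mode, flat memory mode.
#SBATCH --ntasks-per-core=4
#SBATCH -p regular_knl
#SBATCH -N 1
#SBATCH -C quad,flat
#SBATCH -t 1:00:00

# OpenMP settings: 64 threads, spread across the available places.
export OMP_NUM_THREADS=64
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

# One MPI rank; numactl -p 1 prefers memory allocations from NUMA node 1.
srun -n 1 -c 272 --cpu_bind=cores numactl -p 1 ./install/grid_sp/bin/Benchmark_dwf --grid 32.32.32.32 --mpi 1.1.1.1 --dslash-opt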


azrael417 commented on July 18, 2024

I have checked out the latest version, recompiled, and then got:

Grid : Message : Grid is setup to use 64 threads
Grid : Message : Making s innermost grids
Grid : Message : Naive wilson implementation
Grid : Message : Calling Dw
Grid : Message : Called Dw 100 times in 4.28215e+06 us
Grid : Message : norm result 4.9985e+09
Grid : Message : norm ref 4.9985e+09
Grid : Message : mflop/s = 526572
Grid : Message : mflop/s per rank = 526572
Grid : Message : norm diff 4.29227e-05
Benchmark_dwf: ../../src/benchmarks/Benchmark_dwf.cc:159: int main(int, char **): Assertion `norm2(err)< 1.0e-5' failed.
srun: error: nid11315: task 0: Aborted
srun: Terminating job step 3029178.0

Is the current develop branch bug-free?


paboyle commented on July 18, 2024

Almost certainly never bug free. Any sufficiently large code is not.

However, this is a feature of the test and not a bug.

The check is stringent: it uses the absolute norm of the rounding error summed over the volume, which naturally grows linearly with the volume.

32^4 is a large volume, larger than what I habitually run with, and so my stringent limit was exceeded.

Use a smaller volume. The problem could be fixed if we normalised relative to the source, but the check was introduced two days ago with the intent of making the CI on Travis fail if an error is committed.
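To illustrate the difference between the two kinds of check, the logged numbers can be tested both ways. This is a minimal sketch, not the check in Benchmark_dwf.cc: the function name and the relative tolerance of 1.0e-12 are hypothetical, and it assumes the logged "norm diff" and "norm ref" are norms of the same kind, so that their ratio is meaningful.

```python
ABS_TOLERANCE = 1.0e-5   # the limit quoted in the failing assertion
REL_TOLERANCE = 1.0e-12  # hypothetical limit, chosen only for illustration


def check_norms(norm_diff, norm_ref):
    """Return (absolute_ok, relative_ok, relative_error)."""
    absolute_ok = norm_diff < ABS_TOLERANCE
    # Dividing by the reference norm removes the growth with lattice volume.
    relative_error = norm_diff / norm_ref
    relative_ok = relative_error < REL_TOLERANCE
    return absolute_ok, relative_ok, relative_error


# "norm diff" and "norm ref" from the failing 32.32.32.32 run.
absolute_ok, relative_ok, relative_error = check_norms(4.29227e-05, 4.9985e+09)
print(absolute_ok, relative_ok, f"{relative_error:.3e}")
```

With these inputs the absolute check fails while the ratio is of order 1e-14, which is consistent with accumulated rounding error and not a wrong result.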

You should export ASMOPT=1 also, I believe.

--cacheblocking=8.2.2.2 is marginally faster.

Guido is right -- the assembly works best on 1 thread per core and, for weird Linux reasons, sometimes better if you leave a couple of cores free.
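The thread-count advice can be turned into a small calculation. This is a sketch under assumptions: the helper name is hypothetical, the 272 logical CPUs and 4 hardware threads per core are read off the `-c 272` and `--ntasks-per-core=4` settings in the run script, and "a couple" of spare cores is taken to mean 2.

```python
def omp_threads(logical_cpus, hw_threads_per_core, spare_cores=2, threads_per_core=1):
    """Suggest an OMP_NUM_THREADS value: one thread per core, some cores left idle."""
    if logical_cpus % hw_threads_per_core != 0:
        raise ValueError("logical_cpus must be a multiple of hw_threads_per_core")
    physical_cores = logical_cpus // hw_threads_per_core
    usable_cores = physical_cores - spare_cores
    if usable_cores < 1:
        raise ValueError("no cores left after reserving spare_cores")
    return usable_cores * threads_per_core


# 272 logical CPUs with 4 hardware threads each, leaving 2 cores free.
print(f"export OMP_NUM_THREADS={omp_threads(272, 4)}")
```

Whether 2 or 4 spare cores works best is something to measure on the machine; the function only makes the arithmetic explicit.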


azrael417 commented on July 18, 2024

OK, I will try your suggestions. What is a good local volume to run?

Also: the "Making s innermost grids" step takes very long. Is there some hidden serial portion, or is that another issue?


paboyle commented on July 18, 2024

I've been developing today on 16^4 ok.


paboyle commented on July 18, 2024

Default is 8^4 which is a bit small for a 64 core monster.
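One way to see why 8^4 is small for this machine is to count lattice sites per thread. This is a minimal sketch, assuming one thread per core on 64 cores and ignoring both the fifth dimension and the SIMD layout; the function name is hypothetical.

```python
from math import prod


def sites_per_thread(lattice_dims, threads):
    """Four-dimensional lattice sites available to each thread."""
    return prod(lattice_dims) / threads


for extent in (8, 16, 32):
    dims = [extent] * 4
    print(f"{extent}^4: {sites_per_thread(dims, 64):.0f} sites per thread")
```

Each step up in extent multiplies the work per thread by 16, so the default leaves each core with very little to do between synchronisation points.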
