Comments (13)
Hi Thorsten,
Could you post your compiler, configure command and config log? I noticed in an earlier issue you were using GCC.
I have never, NEVER, verified AVX512 on gcc.
I know for a fact that Clang++ generates illegal instructions from my legal intrinsics.
I can only recommend ICPC for AVX512 at this stage.
from grid.
The very last error occurs because you are trying to run the optimised kernel on a double-precision field, which is the default if you did not choose a precision explicitly. This is not supported at the moment (hence the assert), but it will be in the next release.
from grid.
So presumably --enable-precision=single will make the latter work on KNL until we have the double-precision kernel.
Peter
In reply to Guido Cossu's comment of 25 Oct 2016, 23:41.
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
from grid.
Hi, I am able to run the single-precision benchmarks. The flop rate I get is not great: about 450 Gflop/s on a single KNL node with 128 threads and a 32^4 lattice. Does anyone have a preferred local lattice size (with Ls=16)?
I am using the Intel compiler. It tried to use GNU because I had not run ./bootstrap.sh first; after that it picked up the Intel compiler.
from grid.
Are you using the --dslash-opt flag on the command line? This enables the fastest routines.
from grid.
Grid : Message : 24 ms : Grid is setup to use 128 threads
Grid : Message : 753 ms : Making s innermost grids
Grid : Message : 211896 ms : Naive wilson implementation
Grid : Message : 211896 ms : Calling Dw
Grid : Message : 217279 ms : Called Dw 100 times in 5.38234e+06 us
Grid : Message : 217279 ms : norm result 4.99943e+09
Grid : Message : 217284 ms : norm ref 4.99943e+09
Grid : Message : 217290 ms : mflop/s = 418936
Grid : Message : 217290 ms : mflop/s per node = 418936
Grid : Message : 217930 ms : norm diff 4.29343e-05
Grid : Message : 217949 ms : #### Dhop calls report
Grid : Message : 217949 ms : WilsonFermion5D Number of Dhop Calls : 100
Grid : Message : 217949 ms : WilsonFermion5D Total Communication time : 231 us
Grid : Message : 217949 ms : WilsonFermion5D CommTime/Calls : 2.31 us
Grid : Message : 217949 ms : WilsonFermion5D Total Compute time : 5.38182e+06 us
Grid : Message : 217949 ms : WilsonFermion5D ComputeTime/Calls : 53818.2 us
Grid : Message : 217949 ms : Average mflops/s per call : 418976
Grid : Message : 217949 ms : Average mflops/s per call per node : 418976
Grid : Message : 217949 ms : WilsonFermion5D Stencil
Grid : Message : 217949 ms : Stencil calls 100
Grid : Message : 217949 ms : Stencil halogtime 1.5
Grid : Message : 217949 ms : Stencil gathertime 0
Grid : Message : 217949 ms : Stencil gathermtime 0
Grid : Message : 217949 ms : Stencil mergetime 0
Grid : Message : 217950 ms : Stencil jointime 0
Grid : Message : 217950 ms : Stencil spintime 0
Grid : Message : 217950 ms : Stencil splicetime 0
Grid : Message : 217950 ms : Stencil nosplicetime 0
Grid : Message : 217950 ms : Stencil t_table 0
Grid : Message : 217950 ms : Stencil t_data 0
Grid : Message : 217950 ms : WilsonFermion5D StencilEven
Grid : Message : 217950 ms : WilsonFermion5D StencilOdd
This is what I get. I use 128 threads with spread binding, the optimised dslash, and one rank per node. Is that expected?
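The mflop/s figure in the log above can be cross-checked from the call count and the elapsed time. The sketch below assumes 1344 flops per five-dimensional lattice site per Dw call; that constant is not stated anywhere in this thread and is used here only because it reproduces the logged number to within rounding. The function name `dw_mflops` is hypothetical.

```python
# Cross-check of the logged mflop/s figure for the naive Wilson Dw benchmark.
# ASSUMPTION: 1344 flops per 5D lattice site per Dw call. This constant is not
# stated in the thread; it is used because it reproduces the logged value.
FLOPS_PER_SITE = 1344


def dw_mflops(lattice, ls, ncalls, elapsed_us):
    """Return the flop rate in Mflop/s for `ncalls` applications of Dw."""
    volume = ls
    for extent in lattice:
        volume *= extent
    total_flops = FLOPS_PER_SITE * volume * ncalls
    # Flops per microsecond is numerically equal to Mflop/s.
    return total_flops / elapsed_us


# Values taken from the thread: 32^4 lattice, Ls=16, 100 calls in 5.38234e+06 us.
rate = dw_mflops(lattice=(32, 32, 32, 32), ls=16, ncalls=100, elapsed_us=5.38234e6)
print(f"{rate:.0f} Mflop/s, i.e. about {rate / 1e3:.0f} Gflop/s")
```

Under that assumption the result lands within a few Mflop/s of the logged 418936, so the quoted rate of roughly 420 Gflop/s is the rate of the naive kernel section of the benchmark.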
from grid.
Two things:
- The code is designed to work better with 1 thread per core.
- Could you post the full output and the command line? There are several implementations of the kernel being tested, and the portion you are showing is the slowest one.
In reply to Thorsten Kurth's comment of Wed, 26 Oct 2016, 18:07.
from grid.
The worst I found gave 100 Gflops. I will try 64 threads now.
I am going to upload some output later and maybe you can tell me if that is expected.
single_node_output.txt
This is the run script:
#!/bin/bash
#SBATCH --ntasks-per-core=4
#SBATCH -p regular_knl
#SBATCH -N 1
#SBATCH -C quad,flat
#SBATCH -t 1:00:00
export OMP_NUM_THREADS=64
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
srun -n 1 -c 272 --cpu_bind=cores numactl -p 1 ./install/grid_sp/bin/Benchmark_dwf --grid 32.32.32.32 --mpi 1.1.1.1 --dslash-opt
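The thread settings in the run script above can be sanity-checked against the advice to use one thread per core. The sketch below is only arithmetic on the numbers in the script. It assumes that `-c 272` is the number of logical CPUs on the node, that `--ntasks-per-core=4` reflects four hardware threads per core, and that spread binding distributes threads as evenly as possible over cores. The function name `placement` is hypothetical.

```python
# Values read off the run script above.
LOGICAL_CPUS = 272         # srun -c 272 (ASSUMPTION: all logical CPUs of the node)
HW_THREADS_PER_CORE = 4    # #SBATCH --ntasks-per-core=4 (ASSUMPTION: 4-way SMT)


def placement(omp_threads):
    """Return (physical cores, max threads on any core, cores left idle).

    ASSUMPTION: threads are spread as evenly as possible across cores.
    """
    cores = LOGICAL_CPUS // HW_THREADS_PER_CORE
    max_threads_per_core = -(-omp_threads // cores)  # ceiling division
    idle_cores = max(0, cores - omp_threads)
    return cores, max_threads_per_core, idle_cores


for threads in (64, 128):
    cores, per_core, idle = placement(threads)
    print(f"OMP_NUM_THREADS={threads}: {cores} cores, "
          f"up to {per_core} thread(s) per core, {idle} core(s) idle")
```

Under these assumptions, 64 threads gives one thread per core with four cores left idle, while 128 threads forces some cores to run two threads.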
from grid.
I have checked out the latest level, recompiled and then got:
Grid : Message : Grid is setup to use 64 threads
Grid : Message : Making s innermost grids
Grid : Message : Naive wilson implementation
Grid : Message : Calling Dw
Grid : Message : Called Dw 100 times in 4.28215e+06 us
Grid : Message : norm result 4.9985e+09
Grid : Message : norm ref 4.9985e+09
Grid : Message : mflop/s = 526572
Grid : Message : mflop/s per rank = 526572
Grid : Message : norm diff 4.29227e-05
Benchmark_dwf: ../../src/benchmarks/Benchmark_dwf.cc:159: int main(int, char **): Assertion `norm2(err)< 1.0e-5' failed.
srun: error: nid11315: task 0: Aborted
srun: Terminating job step 3029178.0
Is the current develop branch bug-free?
from grid.
Almost certainly never bug-free. No sufficiently large code is.
However, this is a feature of the test and not a bug.
The check is a stringent bound on the absolute norm of the rounding error summed over the volume, which naturally grows linearly with the volume.
32^4 is a large volume and exceeds what I habitually run with, so my stringent limit was exceeded.
Use a smaller volume. The problem could be fixed by normalising relative to the source, but the check was introduced two days ago with the intent of making the Travis CI fail if an error is committed.
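A rough sketch of why the absolute check fails at 32^4 and what a relative check might look like. It uses the "norm diff" and "norm ref" values logged in the failing run and the linear-in-volume growth described above, with Ls held fixed. Dividing by "norm ref" is only one hypothetical way to normalise; the thread does not say how a relative check would actually be implemented, and the function name `scaled_norm_diff` is made up for this illustration.

```python
ABS_TOL = 1.0e-5             # threshold in the failing assertion: norm2(err) < 1.0e-5
NORM_DIFF_32 = 4.29227e-05   # "norm diff" logged for the 32^4 run
NORM_REF_32 = 4.9985e+09     # "norm ref" logged for the same run


def scaled_norm_diff(extent, ref_extent=32, ref_diff=NORM_DIFF_32):
    """Estimate norm diff for an extent^4 lattice.

    ASSUMPTION: the error norm grows linearly with the 4D volume (Ls fixed).
    """
    return ref_diff * (extent / ref_extent) ** 4


for extent in (8, 16, 24, 32):
    diff = scaled_norm_diff(extent)
    verdict = "passes" if diff < ABS_TOL else "fails"
    print(f"{extent}^4: estimated norm diff {diff:.3e} {verdict} the absolute check")

# HYPOTHETICAL relative check: normalise by the reference norm so that the
# result no longer depends on the volume.
relative = NORM_DIFF_32 / NORM_REF_32
print(f"relative error at 32^4: {relative:.2e}")
```

With these numbers the estimate stays under the threshold at 8^4 and 16^4 and exceeds it at 24^4 and 32^4, while the ratio of the two logged norms is below 1e-14.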
You should also export ASMOPT=1, I believe.
--cacheblocking=8.2.2.2 is marginally faster.
Guido is right: the assembly works best with 1 thread per core and, for weird Linux reasons, sometimes better still if you leave a couple of cores free.
from grid.
OK, I will try your suggestions. What is a good local volume to run?
Also, the "Making s innermost grids" step takes a very long time. Is there some hidden serial portion, or is that another issue?
from grid.
I've been developing on 16^4 today without problems.
from grid.
The default is 8^4, which is a bit small for a 64-core monster.
from grid.