
grid's Introduction

Grid

Data parallel C++ mathematical object library.


License: GPL v2.

Last update June 2017.

Please do not send pull requests to the master branch which is reserved for releases.

Description

This library provides data parallel C++ container classes with internal memory layout that is transformed to map efficiently to SIMD architectures. CSHIFT facilities are provided, similar to HPF and cmfortran, and user control is given over the mapping of array indices to both MPI tasks and SIMD processing elements.

  • Identically shaped arrays can then be processed with perfect data parallelisation.
  • Such identically shaped arrays are called conformable arrays.

The transformation is based on the observation that Cartesian array processing involves identical processing being performed on different regions of the Cartesian array.

The library geometrically decomposes the problem both across MPI tasks and across SIMD lanes. Local vector loops are parallelised with OpenMP pragmas.

Data parallel array operations can then be specified with a single data parallel paradigm, while optimally using MPI, OpenMP and SIMD parallelism under the hood. This is a significant simplification for most programmers.

The layout transformations are parametrised by the SIMD vector length, which adapts according to the architecture. Presently SSE4, ARM NEON (128 bits), AVX, AVX2, QPX (256 bits), IMCI and AVX512 (512 bits) targets are supported.

These are presented as vRealF, vRealD, vComplexF, and vComplexD internal vector data types. The corresponding scalar types are named RealF, RealD, ComplexF and ComplexD.

MPI, OpenMP, and SIMD parallelism are present in the library. Please see this paper for more detail.
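
As a rough illustration of this single data parallel paradigm, the sketch below is patterned on the library's test programs; the helper names (GridDefaultLatt, GridDefaultSimd, GridDefaultMpi) and the scalar-to-lattice assignments follow the test suite, but this is an assumption-laden sketch rather than quoted code.

#include <Grid/Grid.h>

using namespace Grid;

int main(int argc, char **argv) {
  Grid_init(&argc, &argv);                 // parses --grid, --mpi, --threads, ...

  // A 4d Cartesian grid, decomposed over MPI tasks and SIMD lanes.
  std::vector<int> latt = GridDefaultLatt();
  std::vector<int> simd = GridDefaultSimd(4, vComplexD::Nsimd());
  std::vector<int> mpi  = GridDefaultMpi();
  GridCartesian grid(latt, simd, mpi);

  // Three conformable arrays living on the same grid.
  Lattice<vComplexD> a(&grid), b(&grid), c(&grid);
  a = ComplexD(1.0, 0.0);
  b = ComplexD(0.0, 1.0);
  c = ComplexD(0.0, 0.0);

  // One data parallel expression; MPI, OpenMP and SIMD are used under the hood.
  c = a * b + c;

  Grid_finalize();
  return 0;
}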

Compilers

Intel ICPC v16.0.3 and later

Clang v3.5 and later (need 3.8 and later for OpenMP)

GCC v4.9.x (recommended)

GCC v6.3 and later

Specific machine compilation instructions - Summit, Tesseract

The Wiki contains specific compilation instructions for Summit, Tesseract and GPU builds.

Important:

The following versions of GCC appear to have a bug under high optimisation (-O2, -O3):

GCC v5.x

GCC v6.1, v6.2

The safety of these compiler versions cannot be guaranteed at this time. Follow Issue 100 for details and updates.

Bug report

To help us track and solve issues with Grid more efficiently, please report problems using the GitHub issue system rather than sending emails to Grid developers.

When you file an issue, please go through the following checklist:

  1. Check that the code is pointing to the HEAD of develop or any commit in master which is tagged with a version number.
  2. Give a description of the target platform (CPU, network, compiler). Please give the full CPU part description, using for example cat /proc/cpuinfo | grep 'model name' | uniq (Linux) or sysctl machdep.cpu.brand_string (macOS), and the full output of the --version option of your compiler.
  3. Give the exact configure command used.
  4. Attach config.log.
  5. Attach grid.config.summary.
  6. Attach the output of make V=1.
  7. Describe the issue and any previous attempt to solve it. If relevant, show how to reproduce the issue using a minimal working example.

Required libraries

Grid requires:

GMP,

MPFR

Bootstrapping Grid downloads the Eigen library, which is used internally for dense matrix (non-QCD) operations.

Grid optionally uses:

HDF5

LIME for ILDG and SciDAC file format support.

FFTW, either the generic version or via the Intel MKL library.

LAPACK, either the generic version or the Intel MKL library.

Quick start

First, start by cloning the repository:

git clone https://github.com/paboyle/Grid.git

Then enter the cloned directory and set up the build system:

cd Grid
./bootstrap.sh

Now you can execute the configure script to generate makefiles (here from a build directory):

mkdir build; cd build
../configure --enable-simd=AVX --enable-comms=mpi-auto --prefix=<path>

where --enable-simd= sets the SIMD type, --enable-comms= sets the communication interface, and <path> should be replaced by the prefix path where you want to install Grid. Other options are detailed in the next section; you can also use configure --help to display them. As with any other program using the GNU autotools, the CXX, CXXFLAGS, LDFLAGS, ... environment variables can be modified to customise the build.

Finally, you can build, check, and install Grid:

make; make check; make install

To minimise the build time, only the tests at the root of the tests directory are built by default. If you want to build tests in the sub-directory <subdir> you can execute:

make -C tests/<subdir> tests

If you want to build all the tests at once just use make tests.

Build configuration options

  • --prefix=<path>: installation prefix for Grid.
  • --with-gmp=<path>: look for GMP in the UNIX prefix <path>
  • --with-mpfr=<path>: look for MPFR in the UNIX prefix <path>
  • --with-fftw=<path>: look for FFTW in the UNIX prefix <path>
  • --enable-lapack[=<path>]: enable LAPACK support in Lanczos eigensolver. A UNIX prefix containing the library can be specified (optional).
  • --enable-mkl[=<path>]: use Intel MKL for FFT (and LAPACK if enabled) routines. A UNIX prefix containing the library can be specified (optional).
  • --enable-numa: enable NUMA first touch optimisation
  • --enable-simd=<code>: setup Grid for the SIMD target <code> (default: GEN). A list of possible SIMD targets is detailed in a section below.
  • --enable-gen-simd-width=<size>: select the size (in bytes) of the generic SIMD vector type (default: 64 bytes).
  • --enable-comms=<comm>: Use <comm> for message passing (default: none). A list of possible communication interfaces is detailed in a section below.
  • --enable-rng={sitmo|ranlux48|mt19937}: choose the RNG (default: sitmo).
  • --disable-timers: disable system dependent high-resolution timers.
  • --enable-chroma: enable Chroma regression tests.
  • --enable-doxygen-doc: enable the Doxygen documentation generation (build with make doxygen-doc)

Possible communication interfaces

The following options can be used with the --enable-comms= option to target different communication interfaces:

<comm> Description
none no communications
mpi[-auto] MPI communications
mpi3[-auto] MPI communications using MPI 3 shared memory
shmem Cray SHMEM communications

For the MPI interfaces the optional -auto suffix instructs the configure scripts to determine all the necessary compilation and linking flags. This is done by extracting the information from the MPI wrapper specified in the environment variable MPICXX (if not specified, configure will scan through a list of default names). The -auto suffix is not supported by the Cray environment wrapper scripts. Use the standard versions instead.

Possible SIMD types

The following options can be used with the --enable-simd= option to target different SIMD instruction sets:

<code> Description
GEN generic portable vector code
SSE4 SSE 4.2 (128 bit)
AVX AVX (256 bit)
AVXFMA AVX (256 bit) + FMA
AVXFMA4 AVX (256 bit) + FMA4
AVX2 AVX 2 (256 bit)
AVX512 AVX 512 bit
NEONv8 ARM NEON (128 bit)
QPX IBM QPX (256 bit)

Alternatively, some CPU codenames can be directly used:

<code> Description
KNL Intel Xeon Phi codename Knights Landing
SKL Intel Skylake with AVX512 extensions
BGQ Blue Gene/Q

Notes:

  • We currently support AVX512 for the Intel compiler and GCC (KNL and SKL targets). Support for clang will appear in future versions of Grid when the AVX512 support in the compiler is more advanced.
  • For BG/Q only bgclang is supported. We do not presently plan to support more compilers for this platform.
  • BG/Q performance is currently rather poor. This is being investigated for future versions.
  • The vector size for the GEN target can be specified with the configure script option --enable-gen-simd-width.

Build setup for Intel Knights Landing platform

The following configuration is recommended for the Intel Knights Landing platform:

../configure --enable-simd=KNL        \
             --enable-comms=mpi-auto  \
             --enable-mkl             \
             CXX=icpc MPICXX=mpiicpc

The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.

If you are working on a Cray machine that does not use the mpiicpc wrapper, please use:

../configure --enable-simd=KNL        \
             --enable-comms=mpi       \
             --enable-mkl             \
             CXX=CC CC=cc

If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:

               --with-gmp=<path>        \
               --with-mpfr=<path>       \

where <path> is the UNIX prefix where GMP and MPFR are installed.

Knights Landing nodes with two Intel Omni-Path adapters per node presently perform better with more than one rank per node, using shared memory for intra-node communication. This is the mpi3 communications implementation. We recommend four ranks per node for best performance, but the optimum is local-volume dependent.

../configure --enable-simd=KNL        \
             --enable-comms=mpi3-auto \
             --enable-mkl             \
             CXX=icpc MPICXX=mpiicpc

Build setup for Intel Haswell Xeon platform

The following configuration is recommended for the Intel Haswell platform:

../configure --enable-simd=AVX2       \
             --enable-comms=mpi3-auto \
             --enable-mkl             \
             CXX=icpc MPICXX=mpiicpc

The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.

If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:

               --with-gmp=<path>        \
               --with-mpfr=<path>       \

where <path> is the UNIX prefix where GMP and MPFR are installed.

If you are working on a Cray machine that does not use the mpiicpc wrapper, please use:

../configure --enable-simd=AVX2       \
             --enable-comms=mpi3      \
             --enable-mkl             \
             CXX=CC CC=cc

Since dual-socket nodes are commonplace, we recommend MPI-3 as the default, with one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using

        export I_MPI_PIN=1

This is the default.

Build setup for Intel Skylake Xeon platform

The following configuration is recommended for the Intel Skylake platform:

../configure --enable-simd=AVX512     \
             --enable-comms=mpi3      \
             --enable-mkl             \
             CXX=mpiicpc

The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.

If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:

               --with-gmp=<path>        \
               --with-mpfr=<path>       \

where <path> is the UNIX prefix where GMP and MPFR are installed.

If you are working on a Cray machine that does not use the mpiicpc wrapper, please use:

../configure --enable-simd=AVX512     \
             --enable-comms=mpi3      \
             --enable-mkl             \
             CXX=CC CC=cc

Since dual-socket nodes are commonplace, we recommend MPI-3 as the default, with one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using

        export I_MPI_PIN=1

This is the default.

Expected Skylake Gold 6148 dual socket (single prec, single node 20+20 cores) performance using NUMA MPI mapping:

mpirun -n 2 benchmarks/Benchmark_dwf --grid 16.16.16.16 --mpi 2.1.1.1 --cacheblocking 2.2.2.2 --dslash-asm --shm 1024 --threads 18

TBA

Build setup for AMD EPYC / RYZEN

The AMD EPYC is a multichip module comprising 32 cores spread over four distinct chips, each with 8 cores. So even a single-socket node contains a quad-chip module. Dual-socket nodes with 64 cores total are common. Each chip within the module exposes a separate NUMA domain, giving four NUMA domains per socket, and we recommend one MPI rank per NUMA domain. MPI-3 is recommended, with four ranks per socket and 8 threads per rank.

The following configuration is recommended for the AMD EPYC platform.

../configure --enable-simd=AVX2       \
             --enable-comms=mpi3 \
             CXX=mpicxx 

If gmp and mpfr are NOT in standard places (/usr/) these flags may be needed:

               --with-gmp=<path>        \
               --with-mpfr=<path>       \

where <path> is the UNIX prefix where GMP and MPFR are installed.

Using MPICH and g++ v4.9.2, the best performance can be obtained using explicit GOMP_CPU_AFFINITY flags for each MPI rank. This can be done by invoking MPI through a wrapper script, omp_bind.sh, that sets the affinity for each rank.

It is recommended to run 8 MPI ranks on a single dual socket AMD EPYC, with 8 threads per rank using MPI3 and shared memory to communicate within this node:

mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --mpi 2.2.2.1 --dslash-unroll --threads 8 --grid 16.16.16.16 --cacheblocking 4.4.4.4

Where omp_bind.sh does the following:

#!/bin/bash
# Map this MPI rank (from PMI_RANK) to one of the 8 NUMA domains of a dual socket
# EPYC, then pin its 8 OpenMP threads to every other core of that domain.

numanode=`expr $PMI_RANK % 8`
basecore=`expr $numanode \* 16`
core0=`expr $basecore + 0`
core1=`expr $basecore + 2`
core2=`expr $basecore + 4`
core3=`expr $basecore + 6`
core4=`expr $basecore + 8`
core5=`expr $basecore + 10`
core6=`expr $basecore + 12`
core7=`expr $basecore + 14`

export GOMP_CPU_AFFINITY="$core0 $core1 $core2 $core3 $core4 $core5 $core6 $core7"
echo GOMP_CPU_AFFINITY $GOMP_CPU_AFFINITY

# Run the command passed by mpirun with this affinity in place.
"$@"

Performance:

Expected AMD EPYC 7601 dual socket (single prec, single node 32+32 cores) performance using NUMA MPI mapping:

mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --threads 8 --mpi 2.2.2.1 --dslash-unroll --grid 16.16.16.16 --cacheblocking 4.4.4.4

TBA

Build setup for BlueGene/Q

To be written...

Build setup for ARM Neon

To be written...

Build setup for laptops, other compilers, non-cluster builds

Many versions of g++ and clang++ work with Grid; simply replace CXX (and MPICXX) accordingly and omit the --enable-mkl flag.

Single node builds are enabled with

            --enable-comms=none

FFTW support that is not in the default search path may then be enabled with

    --with-fftw=<installpath>

BLAS will not be compiled in by default, and Lanczos will default to Eigen diagonalisation.

grid's People

Contributors

aportelli, azusayamaguchi, chillenzer, chulwoo1, clarkedavida, coppolachan, danielrichtmann, dbollweg, djm2131, edbennett, fionnoh, fjosw, gfilaci, giltirn, gkanwar, goracle, guelpers, i-kanamori, jprichings, kostrzewa, lehner, lupoa, mbruno46, mmphys, mspraggs, nils-asmussen, nmeyer-ur, paboyle, rrhodgson, sunpho84


grid's Issues

Licensing

Hi,

As the code is available publicly and certainly receives attention, I think we should license it properly. There is already a GPLv3 text distributed in COPYING. It would be safer to mark all source files as prescribed by the FSF guys. I suggest that we add a license header to each individual source file. I propose the following header:

/*
 * <filename>.cc, part of Grid
 *
 * Copyright (C) 2015 <author list>
 *
 * Grid is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Grid is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with Grid.  If not, see <http://www.gnu.org/licenses/>.
 */

Please tell me if you think another license is more suitable.

MPI freeze

Just for bookkeeping for 0.6 (Peter is aware of the origin of the issue), currently MPI freezes in an eternal wait and this should be corrected for 0.6.

Problems running Benchmark Codes

I have problems running the benchmark codes. I can collect several issues here:

  1. running the comms benchmark on 16 KNL nodes, 4 ranks per node with 32 threads per rank, on a 128^4 grid with 2^2*4^2 topology, works till the summary step, there it fails:
Grid : Message        : 24906 ms : 30       4       10368000        1198.7      2397.41
Grid : Message        : 26629 ms : 30       8       20736000        1199.82     2399.64
Grid : Message        : 30097 ms : 30       16      41472000        1208.64     2417.28
Grid : Message        : 30599 ms : 32       1       3145728     710.73      1421.46
Grid : Message        : 31088 ms : 32       2       6291456     1173.55     2347.1
Grid : Message        : 32166 ms : 32       4       12582912        1159.37     2318.73
Grid : Message        : 34211 ms : 32       8       25165824        1231.09     2462.19
Grid : Message        : 38414 ms : 32       16      50331648        1210.31     2420.63
Grid : Message        : 38500 ms : ====================================================================================================
Grid : Message        : 38500 ms : = Benchmarking sequential halo exchange in 4 dimensions
Grid : Message        : 38500 ms : ====================================================================================================
Grid : Message        : 38500 ms :   L           Ls         bytes       MB/s uni        MB/s bidi
srun: error: nid12126: task 23: Floating point exception

There might be a division by zero or something.

  2. Running the DWF benchmark on multiple nodes, the code hangs in
    Grid : Message : 116 ms : Making s innermost grids
    And that's it. Waited for 30 minutes, but it got stuck. Possible deadlock?
  3. Running the DWF benchmark on a single node, the code crashes with the following assertion:
    
    

||||||||||||||__
||||||||||||||__
|_ | | | | | | | | | | | | _|
|_ _|
|_ GGGG RRRR III DDDD _|
|_ G R R I D D _|
|_ G R R I D D _|
|_ G GG RRRR I D D _|
|_ G G R R I D D _|
|_ GGGG R R III DDDD _|
|_ _|
||||||||||||||__
||||||||||||||__
| | | | | | | | | | | | | |

Copyright (C) 2015 Peter Boyle, Azusa Yamaguchi, Guido Cossu, Antonin Portelli and other authors

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

Grid : Message : 25 ms : Grid is setup to use 128 threads
Grid : Message : 116 ms : Making s innermost grids
Grid : Message : 15963 ms : Naive wilson implementation
Grid : Message : 15964 ms : Calling Dw
Benchmark_dwf: ../../src/lib/qcd/action/fermion/WilsonKernelsAsm.cc:46: void Grid::QCD::WilsonKernels::DiracOptAsmDhopSite(Impl::StencilImpl &, Grid::LebesgueOrder &, Impl::DoubledGaugeField &, std::vector<Impl::SiteHalfSpinor, Grid::alignedAllocatorImpl::SiteHalfSpinor> &, int, int, int, int, const Impl::FermionField &, Impl::FermionField &) [with Impl = Grid::QCD::WilsonImplGrid::Grid_simd<std::complex<double, __m512d>, Grid::QCD::FundamentalRep<3>, double>]: Assertion `0' failed.
srun: error: nid12151: task 0: Aborted
srun: Terminating job step 3027515.1

Can you help me here?

Installation structure

Hi Guys!

As you know, I have been experimenting with Grid over the last week.
I am using Grid as an external library, and I had a little pain in making my code able to compile against the "installed" version of Grid.

I would assume that one is expected to pass to one's own program the flags

-I$grid_prefix/include -L$grid_prefix/install/lib -lGrid

(where $grid_prefix is the prefix passed when configuring Grid), in agreement with the typical package folder structure. If I then include Grid.h as per:

#include <Grid/Grid.h>

I get the error:
$grid_prefix/include/Grid/algorithms/approx/Remez.h:19:20: fatal error: Config.h: No such file or directory #include <Config.h>

In fact, the file Config.h is installed in $grid_prefix/include/Grid, which is not in the search path.

If I pass instead

-I$grid_prefix/include/Grid

and include Grid.h as in:

#include <Grid.h>

what I obtain is the error:

$grid_prefix/include/Grid/Grid.h:63:24: fatal error: Grid/Timer.h: No such file or directory #include <Grid/Timer.h>

because now "Grid/" is already part of the search path. So ultimately I need to compile the code with both paths:

-I$grid_prefix/include -I$grid_prefix/include/Grid

Provided that this combination is kept, everything goes fine, but I find this hardly intuitive.

Another issue is the fact that, as you distribute the Config.h file, the "PACKAGE_NAME", "PACKAGE_STRING", etc. macros clash with those used in my autotools-generated header. A typical solution that I've seen used is to wrap the Config.h in a "true" included file, and then rename the package macros. For example the c-lime library does the following:

#ifndef LIME_CONFIG_H
#define LIME_CONFIG_H

/* Undef the unwanted from the environment -- eg the compiler command line */
#undef PACKAGE
#undef PACKAGE_BUGREPORT
#undef PACKAGE_NAME
#undef PACKAGE_STRING
#undef PACKAGE_TARNAME
#undef PACKAGE_VERSION
#undef VERSION

/* Include the stuff generated by autoconf */
#include "lime_config_internal.h"

/* Prefix everything with LIME_ */
static const char* const LIME_PACKAGE = PACKAGE;
static const char* const LIME_PACKAGE_BUGREPORT = PACKAGE_BUGREPORT;
static const char* const LIME_PACKAGE_NAME = PACKAGE_NAME;
static const char* const LIME_PACKAGE_STRING = PACKAGE_STRING;
static const char* const LIME_PACKAGE_TARNAME = PACKAGE_TARNAME;
static const char* const LIME_PACKAGE_VERSION = PACKAGE_VERSION;
/* LIME_VERSION is already defined in lime_defs.h */

/* Undef the unwanted */
#undef PACKAGE
#undef PACKAGE_BUGREPORT
#undef PACKAGE_NAME
#undef PACKAGE_STRING
#undef PACKAGE_TARNAME
#undef PACKAGE_VERSION
#undef VERSION
#endif

Milestones?

I have started adding the '0.6.0' milestone to some issues. I think this system is useful to plan releases. Tell me if you think it is inconvenient.

File naming conventions

I've finally got fed up with files with a Grid_ prefix. What seemed like a good idea when there were one or two files is blatantly dumb when we have a whole tree which has "Grid" in the top directory name (where it should be) anyway.

I've started switching to C++ style capitalised names like

MobiusZolotarevFermion.h

Any reasons to do lower case mobius_zolotarev_fermion.h

I plan to do a global rename exercise soon, and think that the file name reflecting the capitalised class name is the simplest, and I will do so unless there are objections persuading me otherwise.

non compiler portable syntax in indexing _m256 etc...

Intel compiler chokes on syntax used in
lib/simd/Grid_avx.h(360): error: expression must have pointer-to-object type
return v1[0];

This is one example. It is necessary to go through a "conv" union to get element-by-element access to the __mXXX vector intrinsic types in a way that works with Clang, G++ and Intel's compiler.

Shared memory buffers allocated even with standard MPI comms

I am running on Cori and it crashed with the error:

NOARCH.splanc.x: ../../../src/Grid/lib/communicator/Communicator_base.cc:49: void *Grid::CartesianCommunicator::ShmBufferMalloc(unsigned long): Assertion `heap_bytes<MAX_MPI_SHM_BYTES' failed.

I have Grid configured with standard MPI comms, so by my understanding these shared buffer allocs for the comms buffers are not needed. However, it appears that Stencil.h doesn't check the MPI mode. It would be great if Stencil checked whether Grid is configured to use the hybrid MPI and, if not, just did regular allocs.

barrier synch model for CG

I'm finding on a small test I can get a substantive speed up out
of using the BFM style long lived thread with barrier synch.

e.g. Single node 8^4:
620GF/s -> 910 GF/s on KNL 7210 (SP)
62GF/s -> 72 GF/s on BG/Q node (DP)

On 16^4 local volumes on KNL the gain is minimal though.

If we act, this pushes us into a threading nightmare however, with many routines having
to accept multiple threads entering them.

I will think a little about whether the "parallel_for" macro can work around this and the implications,
but if there is no easy common solution we are looking at having to make "DhopThread"
and "Dhop" routines and changing the CG and other solvers to run in thread / barrier mode.

This seems to be mandated by self threading being 1.5x faster than OpenMP for loops.

I'll make the comparison a little more robust though. I'm tempted to not do this even if it means
a little less performance on the sweet spot on BG/Q since the software cost is large and perhaps
we should just accept it.

Building Grid on a machine with AMD processors

Hi,
I built Grid for the Bc-cluster at Fermilab (AMD Opteron 6320) using icc v16 with impi 5.1.3. No issues when building but when executing the binary only the following message is printed:

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, POPCNT and AVX instructions.

Allowed flags from /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold bmi1

Outputs from configure and make (config.log, config.out) are attached. I'm mainly testing the configure script and cleaning-up my install.
CXX=icpc ../Grid/configure --enable-simd=AVX --enable-comms=mpi-auto

I have a working build using the same intel compilers on this machine. However, that was created by explicitly specifying compile options on September 26, 2016 (cf config.log-2016-09-26).

Thank you,
Oliver
PS: To upload files I attached suffix '.txt' to the filename.

config.log.txt
config.log-2016-09-26.txt
config.out.txt

mpi-auto fails on ARCHER

The flag mpi-auto triggers a failure in the configure step on ARCHER.
The culprit is the LX_FIND_MPI macro, which seems unsupported by the Cray wrappers on that machine.
Compilation with just --enable-comms=mpi works.

Two options

  • discard this as a localised problem on ARCHER (but not sure yet)
  • address by making the search for mpi flags more portable

large heap memory consumption in mpi mode

I have the following problem:

[tkurth@gert01 GRID]$ tail -f slurm-3046972.out

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.


Grid : Message        : Requesting 134217728 byte stencil comms buffers 
Grid : Message        : Grid is setup to use 32 threads
Grid : Message        : Making s innermost grids
^[[A^[[A^[[ABenchmark_dwf: ../../src/lib/communicator/Communicator_base.cc:49: void *Grid::CartesianCommunicator::ShmBufferMalloc(unsigned long): Assertion `heap_bytes<MAX_MPI_SHM_BYTES' failed.
 ShmBufferMalloc exceeded shared heap size -- try increasing with --shm <MB> flag
 Parameter specified in units of MB (megabytes) 
 Current value is 128
Benchmark_dwf: ../../src/lib/communicator/Communicator_base.cc:49: void *Grid::CartesianCommunicator::ShmBufferMalloc(unsigned long): Assertion `heap_bytes<MAX_MPI_SHM_BYTES' failed.
srun: error: nid02439: tasks 0-1: Aborted
srun: Terminating job step 3046972.0

The code hangs when it tries to make the innermost grids and then fails after 10 minutes. This is my run script:

[tkurth@gert01 GRID]$ cat benchmark_dwf.sh
#!/bin/bash
#SBATCH --ntasks-per-core=4
#SBATCH -N 1
#SBATCH -A mpccc
#SBATCH -p regular
#SBATCH -t 2:00:00
#SBATCH -C knl,quad,cache

export OMP_NUM_THREADS=32
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

#MPI stuff
export MPICH_NEMESIS_ASYNC_PROGRESS=MC
export MPICH_MAX_THREAD_SAFETY=multiple
export MPICH_USE_DMAPP_COLL=1

srun -n 2 -c 136 --cpu_bind=cores ./install/grid_sp_mpi/bin/Benchmark_dwf --threads 32 --grid 32.32.32.32 --mpi 1.1.1.2 --dslash-asm --cacheblocking=4.2.2.1
[config.log.txt](https://github.com/paboyle/Grid/files/572824/config.log.txt)
[config.summary.txt](https://github.com/paboyle/Grid/files/572823/config.summary.txt)


  1. commit version
commit c067051d5ff1a3f4c4dea0e72cc9b1b0ad092c7a
Merge: bc248b6 afdeb2b
Author: paboyle <[email protected]>
Date:   Wed Nov 2 13:59:18 2016 +0000

    Merge branch 'develop' into release/v0.6.0
  2. KNL bin1, cray xc-40, intel 16.0.3.210

  3. build script and configure

#!/bin/bash -l

#module loads
module unload craype-haswell
module load craype-mic-knl
module load cray-memkind

precision=single
comms=mpi

if [ "${precision}" == "single" ]; then
    installpath=$(pwd)/install/grid_sp_${comms}
else
    installpath=$(pwd)/install/grid_dp_${comms}
fi

mkdir -p build

cd build
../src/configure --prefix=${installpath} \
    --enable-simd=KNL \
    --enable-precision=${precision} \
    --enable-comms=${comms} \
    --host=x86_64-unknown-linux \
    --enable-mkl \
    CXX="CC" \
    CC="cc"
        
    #CXXFLAGS="-mkl -xMIC-AVX512 -std=c++11" \
    #CFLAGS="-mkl -xMIC-AVX512 -std=c99" \
    #LDFLAGS="-mkl -lmemkind"

make -j12 
make install

cd ..
  4. attached config.log

  5. attached config.summary

  6. no make.log, but it should hopefully not be necessary

zMobius CG convergence?

Hi,

Would you mind checking if the zMobius CG converges when omega has imaginary components? I've modified Test_dwf_cg_prec DomainWallFermionR -> ZMobiusFermionR and it fails to converge.
In case you are wondering, omega(s) = 0.25 + 0.01i.

Queries on OpenSHMEM collectives usage

I just happened to look at the OpenSHMEM usage in the library; it looks like the collectives usage is a little buggy. As per the OpenSHMEM standard, "Every element of this array (here the pSync array) must be initialized with the value SHMEM_SYNC_VALUE (in C/C++) or SHMEM_SYNC_VALUE (in Fortran) before any of the PEs in the Active set enter the reduction routine."

Some random example from the library: in CartesianCommunicator::GlobalSumVector(double *d, int N), it looks like pSync lacks initialization to SHMEM_SYNC_VALUE.
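
For reference, a minimal standalone sketch of the initialisation pattern the standard asks for (using OpenSHMEM 1.2-style constant names; older headers spell them with a leading underscore). This is an illustration of the rule, not Grid's communicator code:

#include <shmem.h>

// pSync and pWrk must be symmetric objects, and every element of pSync must be
// set to SHMEM_SYNC_VALUE on all PEs before any PE enters the reduction.
static long   pSync[SHMEM_REDUCE_SYNC_SIZE];
static double pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
static double src[4], dst[4];

int main(void) {
  shmem_init();
  for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++) pSync[i] = SHMEM_SYNC_VALUE;
  shmem_barrier_all();   // ensure every PE has initialised pSync before reducing

  for (int i = 0; i < 4; i++) src[i] = shmem_my_pe();
  shmem_double_sum_to_all(dst, src, 4, 0, 0, shmem_n_pes(), pWrk, pSync);

  shmem_finalize();
  return 0;
}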

parallel write to parallel file system

Hi,

I ran four tests on an 8^4 lattice on the summit machine at UC Boulder and similar tests on pi0 at Fermilab. All jobs were running on 2 nodes, each with 1 MPI rank and 24 threads on summit (16 threads on pi0). The jobs differ by the type of file system used for writing the ckpoints (NFS or GPFS on summit; ZFS or Lustre at Fermilab) and whether I split the T or the Z direction (1 or 2 IO nodes).

summit
mpi                    file system   SLURM-ID   write rate
1.1.1.2 (2 IO nodes)   NFS           462        280 MB/s
1.1.1.2 (2 IO nodes)   GPFS          461        0.05 MB/s
1.1.2.1 (1 IO node)    NFS           460        131 MB/s
1.1.2.1 (1 IO node)    GPFS          455        79 MB/s

pi0 Fermilab
mpi                    file system   PBS-ID (last three)   write rate
1.1.1.2 (2 IO nodes)   ZFS           628                   228 MB/s
1.1.1.2 (2 IO nodes)   lustre        635                   0.002 MB/s
1.1.2.1 (1 IO node)    ZFS           626                   110 MB/s
1.1.2.1 (1 IO node)    lustre        627                   3-20 MB/s

Unfortunately, I didn't find performance values for the single I/O writing the rng-files in the log-files;
the full log-files are however attached and carry the SLURM-ID / PBS-ID in the filename. Do I need some special flag for parallel file systems? (striping?)

The parallel read of the ckpoint at the beginning of the job seems OK for all cases, although in these tests not all jobs started from a checkpoint. On both machines Grid is compiled on NFS/ZFS.

Thank you,
Oliver

pi0.zip
summit.zip

Const-correctness & access rights

That is going to be very painful, but I think that in the long run it can really pay off in terms of clarity for users and for the way other software will build on Grid. The idea would be to do a full scan of all the declarations to ensure:

  • that function arguments & class methods are set to const when appropriate to reduce the risk of silent buggy variable changes
  • that class members & inheritance have the right level of access to avoid exposure of internals outside Grid

--enable-lapack flag problems

It seems that the configure flag --enable-lapack just checks whether liblapack is there. However, it would be good if it also checked whether MKL is there and continued if that is the case. Maybe by test-compiling a small file which makes a LAPACK call with and without trailing underscores and, if one of them succeeds, just continuing.
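
One way to implement the suggestion would be to feed a small link test such as the one below to AC_LINK_IFELSE, once with the underscored symbol and once without; the choice of dsteqr as the probed routine is purely illustrative (both MKL and the reference LAPACK export it), and none of this is existing Grid configure code:

// Hypothetical configure-time probe for a Fortran-style LAPACK symbol.
extern "C" void dsteqr_(char *compz, int *n, double *d, double *e,
                        double *z, int *ldz, double *work, int *info);

int main(void) {
  // Link-only probe: the branch is never taken, we only need the symbol to resolve.
  if (0) {
    int n = 0, ldz = 1, info = 0;
    dsteqr_((char *)"N", &n, (double *)0, (double *)0, (double *)0, &ldz, (double *)0, &info);
  }
  return 0;
}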

Branch cleaning for v0.6.0

We should try to go through the different feature branches and see which ones are fully integrated into develop and should be removed. I have spotted the following ones:

  • knl-stats
  • hirep

Please advise if the development of these features is considered finished.

Intel Compilation Error

We get the following error when we try to compile Grid on Intel

In file included from ../../../lib/qcd/action/Actions.h(44),
from ../../../lib/qcd/QCD.h(460),
from ../../../lib/Grid.h(78),
from ../../../lib/PerfCount.cc(29):
/usr/include/c++/5/bits/stl_iterator_base_types.h(154): error: name followed by "::" must be a class or namespace name
typedef typename _Iterator::iterator_category iterator_category;

Please find attached the full list of errors


$ icpc --version
icpc (ICC) 16.0.0 20150815
Copyright (C) 1985-2015 Intel Corporation. All rights reserved.

$ g++ --version
g++ (Ubuntu 5.1.1-4ubuntu12) 5.1.1 20150504
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

build_out.txt

Human-made documentation (not Doxygen)

Just to start the discussion on a real documentation. There are many possibilities, I can think of the following:

Paper-style PDF

Pro: can be put on arXiv, good for reaching the community
Con: quite "static", with a risk of the document becoming quickly obsolete

Documentation CMS
It seems that the developer community quite likes Sphinx, which was originally designed for Python documentation. A lot of examples can be found here.

Pro: nice to browse, search. More dynamical
Con: we need to learn how to use the thing (but that does not look that hard)

Suggestion: Create include directory

For more clarity in the code structure it would be better to move the .h files now in lib to an include/ directory at the root level.

G

Travis test failure on XCODE5

Travis on XCODE 5 continues to fail due to the time it takes to compile in the virtual machine.
I tried to separate the matrix of computations, splitting the single and double precision builds, but without success. The env options inside matrix: do not seem to allow splitting the compilations, and an external env: will be overridden.
Any ideas?

AVXFMA4?

Hi,

My laptop does not support AVX2 instructions. When I try to compile with AVXFMA4, the compilation crashes complaining that I am using AVX2 intrinsics unsupported by my machine.
So my question is: is that really an AVX1+FMA target, or did it become redundant with AVX2 (especially now that Peter added -mfma to AVX2)?

Colors?

Hi,

Do we really want to do this colour thing? I see many cons:

  1. in a log file (which in production is always what we will be looking at) the special characters can mess up some text editors
  2. the current setup assumes that the user has a white background; mine is black, and running a Grid program sets my terminal to black text, after which everything is unreadable and I have to kill my session

Pros: ??

Let me know what you think

Consider std::unordered_map instead of std::map in NerscField

If you're always using C++11, then unordered_map may offer some performance improvement over std::map, since the best-case lookup complexity is O(1) thanks to the underlying hash-table implementation, and you're not iterating over the header, so an ordered map isn't required.

On the other hand, loading an NERSC file is of course IO bound, so a few extra cycles spent using std::map probably won't affect things that much.
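
For what it's worth, a minimal sketch of the suggested replacement (plain C++11, with a simple string-to-string header map standing in for whatever NerscField actually stores):

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
  // Header key/value pairs as they might be parsed from a NERSC configuration header.
  std::unordered_map<std::string, std::string> header;
  header["DATATYPE"]    = "4D_SU3_GAUGE_3x3";
  header["DIMENSION_1"] = "16";

  // Average O(1) lookup; no key ordering is needed for this use case.
  std::cout << header["DATATYPE"] << std::endl;
  return 0;
}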

Compilation on cray machines [SOLVED]

I have several issues compiling GRID on a cray machine.

  1. the automatically generated Makefile in lib does not get the correct include paths:
    make[2]: Entering directory '/global/project/projectdirs/mpccc/tkurth/NESAP/GRID/build/lib'
    CXX Init.o
    CXX PerfCount.o
    CXX algorithms/approx/MultiShiftFunction.o
    CXX Log.o
    CXX qcd/action/fermion/CayleyFermion5D.o
    CXX qcd/action/fermion/ContinuedFractionFermion5D.o
    CXX qcd/action/fermion/PartialFractionFermion5D.o
    CXX qcd/action/fermion/WilsonFermion.o
    CXX qcd/action/fermion/WilsonKernels.o
    CXX qcd/action/fermion/WilsonFermion5D.o
    CXX qcd/action/fermion/WilsonKernelsAsm.o
    CXX qcd/action/fermion/WilsonKernelsHand.o
    In file included from ../../src/lib/qcd/action/fermion/CayleyFermion5D.cc:32:0:
    ../../src/lib/Grid.h:62:46: fatal error: Grid/serialisation/Serialisation.h: No such file or directory
    #include <Grid/serialisation/Serialisation.h>
    That I could fix by manually adding -I's in the generated makefile

  2. the lib compilation uses gcc/g++ and not the compiler I selected. I want to use the cray wrappers cc/CC to enable cray mpi, but I got:
    make[1]: Entering directory '/global/project/projectdirs/mpccc/tkurth/NESAP/GRID/build/lib'
    CXX Init.o
    In file included from ../../src/include/Grid/Communicator.h:31:0,
    from ../../src/lib/Grid.h:72,
    from ../../src/lib/Init.cc:44:
    ../../src/include/Grid/communicator/Communicator_base.h:35:17: fatal error: mpi.h: No such file or directory
    #include <mpi.h>
    That is expected, as gcc does not know about mpi. Please fix this so that CC/CXX is actually CC/CXX specified by the user

  3. after fixing that, the next error is:
    ../../src/include/Grid/Stencil.h(276): error: a value of type "Grid::iScalar<Grid::iVector<Grid::iVector<Grid::vComplexF, 3>, 2>> *" cannot be used to initialize an entity of type "uint64_t={unsigned long}"
    uint64_t cbase = & comm_buf[0];
    To me this looks like some implicit casting the Intel compiler does not like. I think an explicit typecast would be healthy here (see the sketch after this list).

  4. after fixing these things, I finally get:
    make[2]: Entering directory '/global/project/projectdirs/mpccc/tkurth/NESAP/GRID/build/lib'
    make[2]: *** No rule to make target 'simd/Grid_empty.h', needed by 'all-am'. Stop.
    That I don't know how to solve. Please advise.
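
Regarding point 3, a standalone illustration of the explicit cast suggested there (generic C++, with a plain std::vector standing in for the Grid buffer type):

#include <cstdint>
#include <vector>

int main() {
  std::vector<float> comm_buf(16);
  // Implicit pointer-to-integer initialisation, which icpc rejects:
  //   uint64_t cbase = & comm_buf[0];
  // Explicit cast accepted by gcc, clang and icpc alike:
  std::uint64_t cbase = reinterpret_cast<std::uint64_t>(&comm_buf[0]);
  (void)cbase;
  return 0;
}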

Best
Thorsten

Develop doesn't compile

My compiler is complaining of invalid intrinsic calls in the Grid_avx.h code:

/home/ckelly/CPS/src/grid_gitflow/Grid/include/Grid/simd/Grid_avx.h: In static member function ‘static __m256i Grid::Optimization::PrecisionChange::StoH(__m256, __m256)’:
/home/ckelly/CPS/src/grid_gitflow/Grid/include/Grid/simd/Grid_avx.h:480:36: error: cannot convert ‘__m128i {aka __vector(2) long long int}’ to ‘__m128 {aka __vector(4) float}’ for argument ‘1’ to ‘__m256 _mm256_castps128_ps256(__m128)’
h = _mm256_castps128_ps256(ha);
^
/home/ckelly/CPS/src/grid_gitflow/Grid/include/Grid/simd/Grid_avx.h:481:38: error: cannot convert ‘__m128i {aka __vector(2) long long int}’ to ‘__m128 {aka __vector(4) float}’ for argument ‘2’ to ‘__m256 _mm256_insertf128_ps(__m256, __m128, int)’
h = _mm256_insertf128_ps(h,hb,1);
^
/home/ckelly/CPS/src/grid_gitflow/Grid/include/Grid/simd/Grid_avx.h:485:14: error: cannot convert ‘__m256 {aka __vector(8) float}’ to ‘__m256i {aka __vector(4) long long int}’ in return
return h;
^
/home/ckelly/CPS/src/grid_gitflow/Grid/include/Grid/simd/Grid_avx.h: In static member function ‘static void Grid::Optimization::PrecisionChange::HtoS(__m256i, __m256&, __m256&)’:
/home/ckelly/CPS/src/grid_gitflow/Grid/include/Grid/simd/Grid_avx.h:489:53: error: cannot convert ‘__m256i {aka __vector(4) long long int}’ to ‘__m256 {aka __vector(8) float}’ for argument ‘1’ to ‘__m128 _mm256_extractf128_ps(__m256, int)’
sa = _mm256_cvtph_ps(_mm256_extractf128_ps(h,0));
^
/home/ckelly/CPS/src/grid_gitflow/Grid/include/Grid/simd/Grid_avx.h:490:53: error: cannot convert ‘__m256i {aka __vector(4) long long int}’ to ‘__m256 {aka __vector(8) float}’ for argument ‘1’ to ‘__m128 _mm256_extractf128_ps(__m256, int)’
sb = _mm256_cvtph_ps(_mm256_extractf128_ps(h,1));

I have managed to fix all the issues by modifying the code to the following (starting Grid_avx.h:475):

static inline __m256i StoH (__m256 a,__m256 b) {
__m256i hi;
#ifdef USE_FP16
__m256 h;
__m128i ha = _mm256_cvtps_ph(a,0);
__m128 hha = _mm_cvtepi32_ps(ha);
__m128i hb = _mm256_cvtps_ph(b,0);
__m128 hhb = _mm_cvtepi32_ps(hb);
h = _mm256_castps128_ps256(hha);
h = _mm256_insertf128_ps(h,hhb,1);
hi = _mm256_cvtps_epi32(h);
#else
assert(0);
#endif
return hi;
}
static inline void HtoS (__m256i h,__m256 &sa,__m256 &sb) {
#ifdef USE_FP16
__m256 hh = _mm256_cvtepi32_ps(h);
__m128 hh0 = _mm256_extractf128_ps(hh,0);
__m128i hh0i = _mm_cvtps_epi32(hh0);
__m128 hh1 = _mm256_extractf128_ps(hh,1);
__m128i hh1i = _mm_cvtps_epi32(hh1);
sa = _mm256_cvtph_ps(hh0i);
sb = _mm256_cvtph_ps(hh1i);
#else
assert(0);
#endif
}

Unfortunately this involves a lot more instructions!

max_align_t

Some versions of clang++/g++ fail to compile lib/algorithms/approx/Remez.cc.
Adding #include <stddef.h> before any other includes fixes it.

RNG management

Need a way to handle RNG management in QCD with multiple live grids.

SpaceTimeGrid class is perhaps a place holder for central lattice information,
could create a Grid hierarchy there.

We could make several simplifying assumptions:

i) 5d is never spread out; 4d RNG's suffice.
ii) Subdivided Grids make use of RNG's from the 0,0,0,0 subcell element of the finest grid.

We would only ever have to save and restore 4d RNG's then, and alternate routines
for RNG filling to index the corresponding RNG on a different grid.

Tradeoffs:

  • I want the RNG sequence to be independent of machine decomposition.
    IroIro no longer does this, and gains from the Mersenne twister skip ahead by having
    one RNG per node. This gives up both machine decomposition independence AND the
    threading of RNG generation within an MPI task.

  • Could make RNG's live on a coarser grid. (Coarsest?). Suppresses RNG state volume.
    CPS does a version of this with one RNG per hypercube.

    Making this quite general -- fill a fine grid from a coarse grid RNG that subdivides the fine
    grid, allowing for 5d/4d -- would enable tunable suppression of RNG state volume, while retaining
    ability to parallelise within a node and also retaining machine decomposition independence providing
    we do not subdivide too much.

  • I am tempted to expand lib/qcd/utils/SpaceTimeGrid.h/cc to retain a sequence of
    global Grid objects for QCD running (Fermion Grid, Gauge Grid, RNGGrid) and
    provide the subdivided RNG grid fill, save/restore etc...

  • Similarly retain the single serial RNG here.

Comments on this strategy welcome. With a Mersenne Twister implementation we can
take a single seed and skip-ahead instead of reseeding with random as is presently done with ranlux.

Travis failing for Linux clang builds

The Travis builds failing for Linux/clang come from the fact that the LLVM guys closed their APT repository because network traffic was too heavy http://lists.llvm.org/pipermail/llvm-dev/2016-May/100303.html.
This is not surprising considering the increasing number of CI bots all downloading over and over again clang from their server.
Finding a plan B would be very painful, and some say this is just temporary and that the server will reopen.
I will keep an eye on it and figure out something if it is not solved on the LLVM side.

Benchmark_dwf* fails in branch develop

As per the title. I'm now in the stage of merging the smearing branch and I'm retesting everything, but I do not have the time to address this issue immediately. If someone wants to solve it in the meantime...
It was running in my branch before the merge.

My configuration flags
../../Grid/configure --enable-precision=single --enable-simd=AVX CXXFLAGS=-mavx -fopenmp=libomp -O3 -std=c++11 LDFLAGS=-fopenmp=libomp LIBS=-lgmp -lmpfr --enable-comms=none

Autoconf files

This is not an issue - more an information message (I don't know if the github messages attached to the commits get broadcast to everyone, so I am also writing here).
In the latest commit I added a .gitignore file to ignore autoconf files in the commits, and also the compiled libraries and the makefiles created by automake. This allows contributors to skip reconfiguring the tools and keep their own environment.

For new users, I also added a simple utility called reconfigure_script (maybe it should be moved to the scripts dir) that runs all the autotools commands to set up the correct environment.

High precision norms

All norms and innerProducts are now high precision norms using an intermediate double.
The code Guido has written with a differentiated normHP should just call norm.

These should be bandwidth limited anyway, so there is no reason not to make this the default.
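
A minimal sketch of the idea in plain C++ (not Grid's actual implementation): the data is still single precision and read only once, so the routine stays bandwidth limited, but the accumulation is done in double.

#include <cstddef>
#include <vector>

double innerProductHP(const std::vector<float> &x, const std::vector<float> &y) {
  double sum = 0.0;   // intermediate double accumulator
  for (std::size_t i = 0; i < x.size() && i < y.size(); ++i)
    sum += static_cast<double>(x[i]) * static_cast<double>(y[i]);
  return sum;
}

int main() {
  std::vector<float> x(1000, 1.0f), y(1000, 2.0f);
  return innerProductHP(x, y) == 2000.0 ? 0 : 1;
}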

Develop broken

Develop is generating an internal compiler error since a commit about 5 days ago under travis and GCC5.

DO NOT REFORMAT FILES

I want to be clear on this.

My source code is written to be readable by me.

I do NOT appreciate reformatting of source files, and committing them, especially if this is done brainlessly by an automatic tool; in any case the style of the author is key. It does not matter if you do not like my style and choices.

A complete mess has been made of several key source files, and further dozens of files have been pointlessly changed in a way that violates my personal preference of not acquiring high levels of indentation when entering the Grid or QCD namespace.

This floating indent, combined with a tool based application of 80 character wrap
creates unreadable code from code that was previously easily readable by the author.

Further, readability is in the judge of the prime author of a given area of the code, and it
is rude and inappropriate to reformat without consultation and agreement.

I am now spending several hours reverting code with this waste of time created by thoughtless
action.

The worst case is the thoughtless application of automatic formatting to a critical file with braces and scopes in ifdef's that hopelessly confused the formatting.

You should NEVER be committing without first applying git diff and satisfying yourself that you are in complete control of these changes, with a very few lines deliberately modified with genuine purpose.

FMA support in clang++-3.8

Compiling Grid with clang++-3.8 and AVX2 support appears to trigger a compiler error related to FMA support in the intrinsics.

$ cd Grid
$ git rev-parse --short HEAD
5e02392

In file included from qcd/action/fermion/CayleyFermion5D.cc:31:
In file included from ./Grid.h:68:
In file included from ./Simd.h:166:
In file included from ./simd/Grid_vector_types.h:47:
./simd/Grid_avx.h:240:14: error: always_inline function '_mm256_fmaddsub_ps' requires target feature 'fma', but would be
inlined into function 'operator()' that is compiled without support for 'fma'
return _mm256_fmaddsub_ps( a_real, b, a_imag ); // Ar Br , Ar Bi +- Ai Bi = ArBr-AiBi , ArBi+AiBr
^
./simd/Grid_avx.h:286:14: error: always_inline function '_mm256_fmaddsub_pd' requires target feature 'fma', but would be
inlined into function 'operator()' that is compiled without support for 'fma'
return _mm256_fmaddsub_pd( a_real, b, a_imag ); // Ar Br , Ar Bi +- Ai Bi = ArBr-AiBi , ArBi+AiBr
^
fatal error: error in backend: Cannot select: 0x76cec80: v8f32 = X86ISD::FMADDSUB 0x787a1d0, 0x7a523a0, 0x7a52600

===== system details ========

$ cat /etc/redhat-release
Scientific Linux release 7.2 (Nitrogen)

$ uname -a
Linux yosemite.fnal.gov 3.10.0-229.20.1.el7.x86_64 #1 SMP Wed Nov 4 10:08:36 CST 2015 x86_64 x86_64 x86_64 GNU/Linux

$ less /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz

$ clang++ -v
clang version 3.8.0 (tags/RELEASE_380/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/james/installed/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.2
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.8.5
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64

$ ./configure CXX=clang++ CXXFLAGS="-std=c++11 -O3 -mavx2 -fopenmp -lomp" --enable-simd=AVX2

Summary of configuration for grid v1.0

The following features are enabled:

  • architecture (build) : x86_64
  • os (build) : linux-gnu
  • architecture (target) : x86_64
  • os (target) : linux-gnu
  • build DOXYGEN documentation : no
  • graphs and diagrams : no

- Supported SIMD flags : -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx

  • enabled simd support : AVX2 (config macro says supported: no )
  • communications type : none
  • default precision : double
  • RNG choice : ranlux48
  • LAPACK : no

Consider std::array in place of boost::array

You mention boost arrays in the TODO list, so I thought I'd mention this on the off-chance you hadn't already seen it. C++11 provides an array class template that's pretty much the same as the boost array type.
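
A quick illustration of how drop-in the change is (generic C++11, not code from Grid):

#include <array>
#include <iostream>
#include <numeric>

int main() {
  // boost::array<double, 4> coords = {{1.0, 2.0, 3.0, 4.0}};   // Boost spelling
  std::array<double, 4> coords = {{1.0, 2.0, 3.0, 4.0}};        // C++11 equivalent
  std::cout << std::accumulate(coords.begin(), coords.end(), 0.0) << std::endl;
  return 0;
}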

5D observables, 4D props

We need to think about a sensible interface providing things like conserved currents etc...
abstracting the differences between 5D formulations like DWF, mobius, ContFrac, PartFrac
and Wilson and other 4D approaches

We also need to think about the interface to 4D props and sources.

Antonin has done some of this in his measurement code, but we need to standardise and include.

Header mess needs tidying?

Hi,

Through my recent development (HDF5, Gamma) I really struggled with Grid's header structure. A lot of headers can only be included assuming a very specific sequence of previous includes done externally to the header itself. I was curious enough to try to follow include chains and in some cases it is really involved.

That scares me a bit considering that Grid is growing fast, because this can go out of control rather quickly. The main issue is that the whole structure is becoming increasingly cryptic:

  • It is very hard to know where something is defined; one would just like to look at the first lines to see the includes, but in a lot of cases there are none. One needs to understand the order of inclusion in a larger structure which is becoming increasingly complex.
  • There is a lot of duplication in standard header inclusion.
  • IDE/tools parsing the code get completely lost because, again, the correct chain of inclusion is only visible at the highest level.

I am not advocating that we should change the include strategy, but rather that we consolidate it. One possible strategy could be:

  • Concentrate all the standard headers and general purpose macros in a Global.h file.
  • Have a standard template for Grid headers, e.g. include guard, then #include <Grid/Global.h>, then #include of the thematic headers (Cartesian.h, Algorithms.h) necessary for the definitions in the current header.
    This is a change that would not alter the current header interface at all; any program just using Grid.h would be fine. It is just about adding a bunch of Grid includes on top of each header to make them self-consistent and independent.

This is what I am already doing in the measurement code: the loading order of headers can be permuted arbitrarily and any header is self-consistent in terms of definitions & declarations (i.e. it can be included alone without errors). Of course, although this is not a huge change either, it will be a complete pain to do. I would be happy to volunteer to do it, but I won't do anything without having your opinion. But I would say the code would gain quite some readability and robustness.
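
A hedged sketch of what such a self-consistent header could look like; Global.h and the other file names are the ones proposed above, not files that exist today:

// Grid/SomeFeature.h -- hypothetical example following the proposed template
#ifndef GRID_SOMEFEATURE_H
#define GRID_SOMEFEATURE_H

#include <Grid/Global.h>      // standard headers and general purpose macros
#include <Grid/Cartesian.h>   // only the thematic headers this file actually needs

// ... declarations and definitions for this header ...

#endif // GRID_SOMEFEATURE_H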

Any thoughts?

File formats support

Add new file formats:

  • ILDG is needed
  • Some people from US suggested/requested HDF5
  • other suggestions?

Development flow

Hi,

Recently my attention was attracted to the following extension of git to manage the development flow of a project (features, stable/unstable branches, ...): http://danielkummer.github.io/git-flow-cheatsheet/

I thought it was worth considering it regarding the recent discussion and Guido's suggestion (which I completely support) of some form quality control and maybe milestones.

Let me know what you think.

Segmentation fault with Benchmark_wilson

Dear contributors,

I tried to run the benchmarks, but Benchmark_wilson failed due to a segmentation fault.
Benchmark_dwf and Benchmark_zmm seem to have the same problem.
I am not sure whether this is a problem of Grid or of the gcc behind the Intel compiler.

version of the Grid:
master, as of Apr. 21 (d9b5e66)

$ ./Benchmark_wilson --debug-signals

||||||||||||||__
||||||||||||||__
|| | | | | | | | | | | | |_
| |_
|_ GGGG RRRR III DDDD _|
|_ G R R I D D _|
|_ G R R I D D _|
|_ G GG RRRR I D D _|
|_ G G R R I D D _|
|_ GGGG R R III DDDD _|
| |_
||||||||||||||__
||||||||||||||__
| | | | | | | | | | | | | |

Copyright (C) 2015 Peter Boyle, Azusa Yamaguchi, Guido Cossu, Antonin Portelli and other authors
Colours by Tadahito Boyle

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

Grid : Message : 0 ms : Grid is setup to use 1 threads
Grid : Message : 0 ms : Grid floating point word size is REALF4
Grid : Message : 0 ms : Grid floating point word size is REALD8
Grid : Message : 0 ms : Grid floating point word size is REAL4
Grid : Message : 1134 ms : Calling Dw
Caught signal 11
mem address 4
code 1
instruction 3c4c20e0fc
rdi 77af20
rsi 0
rbp 0
rbx 0
rdx 22088400
rax 0
rcx 20
rsp 7ffc40e26ed0
rip 3c4c20e0fc
r8 2eb4a67
r9 0
r10 7ffc40e26d40
r11 7ffc40e26cf0
r12 3c56cefa88
r13 0
r14 13c65c0
r15 0
BackTrace Strings: 0 ./Benchmark_wilson() [0x4366da]
BackTrace Strings: 1 /lib64/libc.so.6() [0x3c4ca326a0]
BackTrace Strings: 2 /lib64/ld-linux-x86-64.so.2() [0x3c4c20e0fc]
BackTrace Strings: 3 /lib64/ld-linux-x86-64.so.2() [0x3c4c2148f5]
BackTrace Strings: 4 /usr/lib64/libstdc++.so.6(_ZNSt6thread15_M_start_threadESt10shared_ptrINS_10_Impl_baseEE+0x97) [0x3c56ab65a7]
BackTrace Strings: 5 ./Benchmark_wilson() [0x441170]
BackTrace Strings: 6 ./Benchmark_wilson() [0x46c105]
BackTrace Strings: 7 ./Benchmark_wilson() [0x46c045]
BackTrace Strings: 8 ./Benchmark_wilson() [0x487d8e]
BackTrace Strings: 9 ./Benchmark_wilson() [0x487d00]
BackTrace Strings: 10 ./Benchmark_wilson() [0x48761b]
BackTrace Strings: 11 ./Benchmark_wilson() [0x406cbd]
BackTrace Strings: 12 /lib64/libc.so.6(__libc_start_main+0xfd) [0x3c4ca1ed5d]
BackTrace Strings: 13 ./Benchmark_wilson() [0x4032c9]


Here is my configuration:

$ ../configure CXX=icpc --enable-simd=AVX --enable-precision=single CXXFLAGS="-std=c++11 -O0 -debug inline-debug-info -g " --enable-comms=none

$ icpc -v
icpc version 16.0.1 (gcc version 4.8.2 compatibility)

The build gave plenty of warnings, like:

../../lib/simd/Grid_avx.h(521): warning #167: argument of type "__m128" is incompatible with parameter of type "__m128i"
_mm256_alignr_epi32(ret,in,tmp,n);
^
detected during instantiation of "__m256 Grid::Optimization::Rotate::tRotate(__m256) [with n=0]" at line 494

../../lib/simd/Grid_avx.h(521): warning #167: argument of type "__m128" is incompatible with parameter of type "__m128i"
_mm256_alignr_epi32(ret,in,tmp,n);
^
detected during instantiation of "__m256 Grid::Optimization::Rotate::tRotate(__m256) [with n=0]" at line 494

../../lib/simd/Grid_avx.h(521): warning #167: argument of type "__m128i" is incompatible with parameter of type "const __m128 &"
_mm256_alignr_epi32(ret,in,tmp,n);
^
detected during instantiation of "__m256 Grid::Optimization::Rotate::tRotate(__m256) [with n=0]" at line 494

(it continues for about 48000 lines)

The machine has an Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz, which supports AVX,
and 16 GB of memory.

Other benchmarks:

OK:
Benchmark_su3
Benchmark_memory_asynch
Benchmark_memory_bandwidth

Failed:
Benchmark_comms : Aborted (due to --enable-comms=none, I guess)
Benchmark_dwf : Segmentation fault (after "Calling Dw")
Benchmark_zmm : Segmentation fault (after "Calling Dw")

Best regards,

Issaku

FFTW, Eigen, directory reorganisation

See pre-development branch

feature/Ls-vectorised-actions

I've concluded we really require easy access to dense matrix functionality.

There have been a few places where it is needed:
Lanczos,
and now also Mobius with s-vectorisation.

Depending on Eigen seems the least ugly option.

Similarly, FFT is becoming important:

-- Fourier-accelerated gauge fixing
-- measurement momentum projection
-- QED
-- gauge-fixed smearing by convolution
-- etc...

  • Headers

I made include/Grid a symlink to lib/, and all includes now take the form

#include <Grid/Grid.h>

The headers remain alongside the source, side by side in the tree, but the logical include path is
include/Grid/

AC_LINK_FILES is used for the same effect in the build directory. A minimal sketch of a user program under this convention is given below.
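
As a minimal sketch (assuming the Grid_init/Grid_finalize entry points used by the bundled benchmarks), a user program under the new convention looks roughly like:

#include <Grid/Grid.h>

int main(int argc, char **argv)
{
  // Initialise and finalise the library the same way the benchmarks do.
  Grid::Grid_init(&argc, &argv);

  // ... user code built on Grid containers goes here ...

  Grid::Grid_finalize();
  return 0;
}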

  • Prerequisites

I added a prerequisites subdirectory that gets built first.
-- This downloads and caches (in the source tree) the Eigen and FFTW packages.
-- In the build directory, it untars and builds FFTW and untars Eigen.
-- The Eigen headers, the FFTW header and the compiled FFTW library are moved into
include/Grid/Eigen/ and
include/Grid/fftw3.h respectively.

These are therefore disambiguated from any system-installed versions kicking around,
as I want to avoid blowing in the wind while various clusters' module systems control
software versions.

Thus the header-only Eigen, and the fftw3 header and library, get installed along with Grid,
but are pulled from their source repositories and cached in the Grid source tree
on the first checkout & build someone does from GitHub.

  • Tests
    The number of tests was growing too large in an unmanaged flat directory.

I've added subdirectories, and only core tests get built by default.
We could make a named Travis test directory and run all tests in there.

Thoughts welcome...

bgclang no compile

CXX Application.o
In file included from ../../../extras/Hadrons/Application.cc:31:
/dirac1/work/x03/paboyle/Grid-bgq/Grid/include/Grid/Hadrons/GeneticScheduler.hpp:169:21: error: no member named 'emplace' in 'std::multimap<int, std::vector<unsigned int, std::allocator<unsigned int> >, std::less<int>, std::allocator<std::pair<const int,
std::vector<unsigned int, std::allocator<unsigned int> > > > >'
population_.emplace(func_(p), p);
~~~~~~~~~~~ ^
/dirac1/work/x03/paboyle/Grid-bgq/Grid/include/Grid/Hadrons/GeneticScheduler.hpp:132:9: note: in instantiation of member function 'Grid::Hadrons::GeneticScheduler::initPopulation' requested here
initPopulation();
^
../../../extras/Hadrons/Application.cc:210:23: note: in instantiation of member function 'Grid::Hadrons::GeneticScheduler::nextGeneration' requested here
scheduler.nextGeneration();
^
In file included from ../../../extras/Hadrons/Application.cc:31:
/dirac1/work/x03/paboyle/Grid-bgq/Grid/include/Grid/Hadrons/GeneticScheduler.hpp:203:25: error: no member named 'emplace' in 'std::multimap<int, std::vector<unsigned int, std::allocator<unsigned int> >, std::less<int>, std::allocator<std::pair<const int,
std::vector<unsigned int, std::allocator<unsigned int> > > > >'
population_.emplace(func_(m), m);
~~~~~~~~~~~ ^
/dirac1/work/x03/paboyle/Grid-bgq/Grid/include/Grid/Hadrons/GeneticScheduler.hpp:140:9: note: in instantiation of member function 'Grid::Hadrons::GeneticScheduler::doMutation' requested here
doMutation();
^
../../../extras/Hadrons/Application.cc:210:23: note: in instantiation of member function 'Grid::Hadrons::GeneticScheduler::nextGeneration' requested here
scheduler.nextGeneration();
^
In file included from ../../../extras/Hadrons/Application.cc:31:
/dirac1/work/x03/paboyle/Grid-bgq/Grid/include/Grid/Hadrons/GeneticScheduler.hpp:183:21: error: no member named 'emplace' in 'std::multimap<int, std::vector<unsigned int, std::allocator<unsigned int> >, std::less<int>, std::allocator<std::pair<const int,
std::vector<unsigned int, std::allocator<unsigned int> > > > >'
population_.emplace(func_(c1), c1);
~~~~~~~~~~~ ^
/dirac1/work/x03/paboyle/Grid-bgq/Grid/include/Grid/Hadrons/GeneticScheduler.hpp:148:9: note: in instantiation of member function 'Grid::Hadrons::GeneticScheduler::doCrossover' requested here
doCrossover();
^
../../../extras/Hadrons/Application.cc:210:23: note: in instantiation of member function 'Grid::Hadrons::GeneticScheduler::nextGeneration' requested here
scheduler.nextGeneration();
^
In file included from ../../../extras/Hadrons/Application.cc:31:
/dirac1/work/x03/paboyle/Grid-bgq/Grid/include/Grid/Hadrons/GeneticScheduler.hpp:184:21: error: no member named 'emplace' in 'std::multimap<int, std::vector<unsigned int, std::allocator<unsigned int> >, std::less<int>, std::allocator<std::pair<const int,
std::vector<unsigned int, std::allocator<unsigned int> > > > >'
population_.emplace(func_(c2), c2);
~~~~~~~~~~~ ^
4 errors generated.
make[2]: *** [Application.o] Error 1
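
For what it is worth, a possible workaround (an untested sketch, not a patch) on standard libraries that lack std::multimap::emplace, as the libstdc++ behind bgclang appears to, is to fall back to insert with an explicitly constructed pair:

// Self-contained illustration of the workaround; "Gene" and the fitness
// value are stand-ins for the GeneticScheduler types, not Grid code.
#include <map>
#include <utility>
#include <vector>

int main()
{
  typedef std::vector<unsigned int> Gene;
  std::multimap<int, Gene> population;

  Gene p(4, 0u);
  int fitness = 42;

  // population.emplace(fitness, p);              // needs a C++11 library
  population.insert(std::make_pair(fitness, p));  // works with older libstdc++
  return 0;
}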

Doxygen?

What is the developers' point of view on starting to write Doxygen annotations before the code base becomes too big? A minimal sketch of the style of annotation meant is given below.
Any other suggestions about the documentation?
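
For reference, a minimal sketch of the kind of annotation being proposed (the class and members are illustrative, not existing Grid code):

//! \brief Solver parameters for an iterative algorithm (illustrative only).
//! \tparam Field  Lattice field type the solver acts on.
template <class Field>
class SolverParams {
public:
  double tolerance;    //!< Stopping residual for the iteration.
  int maxIterations;   //!< Maximum number of iterations before giving up.

  //! Construct with a tolerance and an iteration cap.
  SolverParams(double tol, int maxIter)
    : tolerance(tol), maxIterations(maxIter) {}
};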

Add more tests to Travis

I suggest adding a few more tests to the Travis CI:

  • Test_simd
  • Test_cshift
  • Test_hmc (one of them that runs one or two trajectories)
  • Test_cayley_cg
  • Test_stencil
