ectrans's People

Contributors

alexandremary, awnawab, cxzk, djdavies2, dueben, ioanhadade, marekwlasak, marsdeno, mdiamantakis, meteoea, meteostef, msleigh, nikosokka, piotrows, pmarguinaud, prgillies, reuterbal, ryadelkhatibmf, samhatfield, shlyaeva, towil1, wdeconinck

ectrans's Issues

Compile failure for Intel

I am getting this compile failure with ifort for the develop branch:

/home/h01/frwd/cylc-run/mo-bundle-build-base-latest/share/make_mo__spice_intel_debug/ectrans/src/trans/generated/ectrans_dp/internal/ftdir_ctl_mod.F90(147): error #5558: A pointer with the CONTIGUOUS attributes is being made to a non-contiguous target
ZGTF => ZGTF_STACK(:,:)
--^
compilation aborted for /home/h01/frwd/cylc-run/mo-bundle-build-base-latest/share/make_mo__spice_intel_debug/ectrans/src/trans/generated/ectrans_dp/internal/ftdir_ctl_mod.F90 (code 1)
gmake[2]: *** [ectrans/src/trans/CMakeFiles/ectrans_dp.dir/generated/ectrans_dp/internal/ftdir_ctl_mod.F90.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [ectrans/src/trans/CMakeFiles/ectrans_dp.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....
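
For reference, here is a minimal sketch of the failing pattern together with a workaround that ifort accepts (names are borrowed from the error message; the real declarations live in ftdir_ctl_mod.F90):

    SUBROUTINE DEMO(N, M)
      INTEGER, INTENT(IN) :: N, M
      REAL, TARGET :: ZGTF_STACK(N, M)        ! local array, contiguous by construction
      REAL, POINTER, CONTIGUOUS :: ZGTF(:,:)
      ! ZGTF => ZGTF_STACK(:,:)   ! section form: rejected by ifort with error #5558
      ZGTF => ZGTF_STACK          ! whole-array association compiles cleanly
      ZGTF(:,:) = 0.0
    END SUBROUTINE DEMO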

GPU branches: Missing dependency imports in ectrans-import.cmake

For $reason I attempted a standalone build of redgreengpu (outside of a bundle) on an NVIDIA GPU machine, and then linked IFS against it in a bundle build. (I can share more details if required.)

In that scenario, ectrans is imported into ifs-source via the CMake config files. The link interfaces in ectrans-targets.cmake correctly include CUDA::cufft, CUDA::cublas and OpenACC::OpenACC_Fortran. However, the dependencies on CUDAToolkit and OpenACC are not captured in the corresponding ectrans-import.cmake.

The import file does correctly issue the dependencies on OpenMP and fiat, but from a quick glance I wasn't able to figure out how to achieve the same for OpenACC, CUDA and CUDAToolkit. Manually editing the file and adding the following worked:

    if( NOT CMAKE_CUDA_COMPILER_LOADED )
        enable_language( CUDA )
    endif()
    find_dependency( CUDAToolkit )
    find_package( OpenACC )
    if( OpenACC_Fortran_FOUND AND OpenACC_C_FOUND )
        set( OpenACC_FOUND ON )
    endif()

For some reason, using find_dependency(OpenACC) did not work and made the entire ectrans import fail. Using find_package instead and applying the usual set(OpenACC_FOUND ON) hack worked.

Presumably, similar issues would appear with HIP on AMD platforms, and are likely affecting the optimised GPU branch, too.

@wdeconinck might have an idea where the CMake needs to be changed for this?

GPU-aware MPI issues in redgreengpu

This issue is for documenting progress in implementing GPU-aware MPI communication in the redgreengpu branch.

PR #39 enabled GPU-aware MPI for the simple benchmark binary, ectrans-benchmark-gpu-sp. However, there are still problems using this option with the "IFS-like" benchmark ectrans-benchmark-ifs-gpu-sp (and presumably the IFS itself).

Firstly here is the impact of using GPU-aware MPI for the simple benchmark

ectrans-benchmark-gpu-sp -v --niter 100 --vordiv --nlev 137 --nfld 1 --scders -t 79 --norms --meminfo

on 2 GPUs/tasks:

            With GPU-aware MPI | Without GPU-aware MPI
            ================== | =====================
med  (s):   0.0550             | 0.1383
loop (s):   16.0833            | 23.7661

That's a nice 2.5x acceleration of the median time step, even with only 2 GPUs.

Now with the ectrans-benchmark-ifs-gpu-sp program:

With GPU-aware MPI

ACC:            find_in_present_table failed for 'iflds' (0x7ffebdcd2560-0x7ffebdcd2564) from ../../../lustre/orion/cli131/proj-shared/hatfield/ectrans-bundle/source/ectrans/src/trans/gpu/internal/trltog_mod.F90:512

That can probably be fixed by COPYINg IFLDS. Commit incoming.
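
A minimal sketch of that class of fix (the directive placement and names here are illustrative, not the actual trltog_mod.F90 code):

    SUBROUTINE DEMO(KFLD, IFLDS, POUT)
      INTEGER, INTENT(IN)  :: KFLD
      INTEGER, INTENT(IN)  :: IFLDS(KFLD)
      REAL,    INTENT(OUT) :: POUT(KFLD)
      INTEGER :: JF
      ! COPYIN makes IFLDS present on the device before the kernel touches it
      !$ACC PARALLEL LOOP COPYIN(IFLDS) COPYOUT(POUT)
      DO JF = 1, KFLD
        POUT(JF) = REAL(IFLDS(JF))
      ENDDO
    END SUBROUTINE DEMO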

Without GPU-aware MPI

Floating-point exception. No stack trace, but by tracing the GSTATS regions we can see there's an issue for rank 2 of 2 in computing the relative error of the spectral norms after the first direct transform. Will investigate further.

Floating-point exception for lumi-g, multi-rank CPU version on Frontier (CCE/15.0.0)

I am testing the lumi-g branch on Frontier with CCE 15, CPU-only on multiple ranks, and I find it gives a floating-point exception somewhere in SPNORMC in the final call to specnorm after the transform loop. If I comment out the write statement before this point, it proceeds past the crash.

It works fine:

  • If I run on a single rank
  • If I disable norm printing (no --norms)

My hunch is that we are doing something inconsistent regarding the vertical domain decomposition (e.g. nflevl/nflevg confusion). Will keep debugging and post updates here.

segfaults in ltinv_ctlad_mod with GCC 10.2 and 10.1 (related to OpenMP)

We see segfaults in calls to ltinv_ctlad when the code is built with GCC 10.2 or GCC 10.1. ectrans_test_adjoint (or any code calling ltinv_ctlad) fails with a segfault with the following backtrace:

2: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
2:
2: Backtrace for this error:
2: #0  0x7f92b929e33f in ???
2: #1  0x7f92bb0a887e in __ltinv_ctlad_mod_MOD_ltinv_ctlad._omp_fn.0
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/src/trans/internal/ltinv_ctlad_mod.F90:107
2: #2  0x7f92bb52ae21 in GOMP_parallel
2: 	at /home/builder/ktietz/cos6/ci_cos6/ctng-compilers_1622658800915/work/.build/x86_64-conda-linux-gnu/src/gcc/libgomp/parallel.c:171
2: #3  0x7f92bb0a9752 in __ltinv_ctlad_mod_MOD_ltinv_ctlad
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/src/trans/internal/ltinv_ctlad_mod.F90:107
2: #4  0x7f92bb09c25c in __inv_trans_ctlad_mod_MOD_inv_trans_ctlad
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/src/trans/internal/inv_trans_ctlad_mod.F90:289
2: #5  0x7f92bb101425 in inv_transad_
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/src/trans/external/inv_transad.F90:610
2: #6  0x4043de in test_adjoint
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/tests/trans/test_adjoint.F90:212
2: #7  0x4015ce in main
2: 	at /work/noaa/da/ashlyaev/jedi/jedi-bundle/ectrans/tests/trans/test_adjoint.F90:11
1/1 Test #2: ectrans_test_adjoint .............***Exception: SegFault  0.19 sec

The problem appears to be in how the compiler deals with this OpenMP-parallelized loop:

!$OMP PARALLEL DO SCHEDULE(DYNAMIC,1) PRIVATE(JM,IM)
DO JM=1,D%NUMP
  IM = D%MYMS(JM)
  CALL LTINVAD(IM,JM,KF_OUT_LT,KF_UV,KF_SCALARS,KF_SCDERS,ILEI2,IDIM1,&
   & PSPVOR,PSPDIV,PSPSCALAR,&
   & PSPSC3A,PSPSC3B,PSPSC2, &
   & KFLDPTRUV,KFLDPTRSC,FSPGL_PROC)
ENDDO
!$OMP END PARALLEL DO

If the OpenMP directives are commented out or removed, the calling code has no issues and the test succeeds. Explicitly specifying PRIVATE or SHARED attributes for all the variables in the loop doesn't solve the issue.

The same behavior is observed on a different system with GCC 10.1 (Maryam @mer-a-o ran the tests).

Any ideas on how this can be resolved would be greatly appreciated.

Log of all ectrans tests run on the system with GCC 10.2: LastTest.log

Intel failure in butterfly_alg_mod.F90

Another failure building with Intel:

ectrans_test_benchmark_dp_T47_O48_mpi0_omp1_nfld10_nlev20_flt

forrtl: severe (151): allocatable array is already allocated
Image              PC                Routine            Line     Source
ectrans-benchmark  0000000000479EB7  Unknown            Unknown  Unknown
libtrans_dp.so     00002BA9E810BAC3  butterfly_alg_mod  188      butterfly_alg_mod.F90
libiomp5.so        00002BA9EFF79A43  __kmp_invoke_micr  Unknown  Unknown
libiomp5.so        00002BA9EFF3D2C6  __kmp_fork_call    Unknown  Unknown
libiomp5.so        00002BA9EFEFCBB0  __kmpc_fork_call   Unknown  Unknown
libtrans_dp.so     00002BA9E80F979F  butterfly_alg_mod  179      butterfly_alg_mod.F90
libtrans_dp.so     00002BA9E8515801  suleg_mod_mp_sule  790      suleg_mod.F90
libtrans_dp.so     00002BA9E86BE8E0  setup_trans        406      setup_trans.F90
ectrans-benchmark  00000000004106AC  MAIN               397      ectrans-benchmark.F90
ectrans-benchmark  000000000040DC9E  Unknown            Unknown  Unknown
libc-2.11.3.so     00002BA9EF918C36  __libc_start_main  Unknown  Unknown
ectrans-benchmark  000000000040DB29  Unknown            Unknown  Unknown

This only shows up with runtime checks enabled.
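
For context, here is a hedged sketch of the usual way this error arises (whether this is really what happens at butterfly_alg_mod.F90:188 is unconfirmed): an ALLOCATE of a shared allocatable reached by more than one thread inside a parallel region.

    PROGRAM DEMO
      REAL, ALLOCATABLE :: ZWORK(:)
      !$OMP PARALLEL
      ! Every thread executes this ALLOCATE, so the second one to arrive
      ! triggers "severe (151): allocatable array is already allocated".
      ALLOCATE(ZWORK(100))
      !$OMP END PARALLEL
    END PROGRAM DEMO

Privatizing the array, or guarding the ALLOCATE inside an !$OMP SINGLE block, avoids the race.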

Format error in ectrans-benchmark.F90

Running ectrans_test_benchmark_dp_T47_O48_mpi0_omp1_nfld0 gives:

Runtime Error: /home/h01/frwd/cylc-run/mi-be984/work/1/git_clone_ectrans/ectrans/src/programs/ectrans-benchmark.F90, line 438: Unexpected end of format specification
Program terminated by I/O error on unit 6 (Output_Unit,Formatted,Sequential)
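
For reference, the class of bug behind this message is a format specification that ends before its closing parenthesis, which is often caught only at run time. A hypothetical illustration:

    PROGRAM DEMO
      ! WRITE(*,'("nfld = ",I0') 0   ! unbalanced: "Unexpected end of format specification"
      WRITE(*,'("nfld = ",I0)') 0    ! balanced parentheses work fine
    END PROGRAM DEMO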

Remove extended precision JPRH

JPRH (the extended floating-point type exported by Fiat) is referenced in a few places in ecTrans.

JPRH is actually always the same as JPRD in Fiat: it's only equal to the extended type when REALHUGE is defined, and there is no way to set this when building Fiat, so it looks like a half-deprecated feature.

Whether we keep this in Fiat is another issue, but for now perhaps we should delete all references to JPRH and replace them with JPRD, since this is what we are actually using.

@wdeconinck's upcoming PR #61 effectively does this in SUGAW anyway.
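
For illustration, the situation amounts to something like this (a hedged sketch in the style of Fiat's parkind modules; the exact kind values in Fiat may differ):

    MODULE PARKIND_SKETCH
      IMPLICIT NONE
      INTEGER, PARAMETER :: JPRD = SELECTED_REAL_KIND(13,300)   ! double precision
    #ifdef REALHUGE
      INTEGER, PARAMETER :: JPRH = SELECTED_REAL_KIND(28,2400)  ! true extended type
    #else
      INTEGER, PARAMETER :: JPRH = JPRD                         ! what you get in practice
    #endif
    END MODULE PARKIND_SKETCH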

ctest failure in ectrans_test_transi_memory

This ctest fails as follows:

144/156 Test #151: ectrans_test_transi_memory ................................................***Failed 0.12 sec
Initially allocated: 0.000000 KB
iteration 1
Allocated in iteration: 7.345215 MB
Possibly leaked in iteration: 39.234375 KB
iteration 2
Allocated in iteration: 7.305069 MB
Possibly leaked in iteration: 2.125000 KB
ERROR: Assertion `allocated()-start_iter == 0' failed @/home/h01/david.davies/cylc-run/EctransTransiFailure/share/mo-bundle/ectrans/tests/transi/transi_test_memory.c:63

This failure occurred on a Cray machine with GNU 12.1.0.

Array out of bounds in ectrans-benchmark

ectrans_test_benchmark_dp_T47_O48_mpi0_omp1_nfld0 is failing like this:

Runtime Error: /home/h01/frwd/cylc-run/mi-be984/work/1/git_clone_ectrans/ectrans/src/programs/ectrans-benchmark.F90, line 540: Subscript 3 of ZSPSC3A (value 1) is out of range (1:0)
Program terminated by fatal error

ctest failure in ectrans_test_adjoint

Running the ctests results in this failure:

    Start   8: ectrans_test_benchmark_dp_T47_O48_mpi0_omp1_nfld10_nlev20_vordiv_uvders

1/156 Test #2: ectrans_test_adjoint ......................................................***Failed 0.51 sec
NSMAX= 21
NDGL= 32
LMPOFF= T
NSPEC2= 506
NGPTOT= 2048
NFLEV= 9
SETUP FINISHED
forrtl: severe (408): fort: (3): Subscript #1 of the array A has value -26 which is less than the lower bound of 1

Image              PC                Routine            Line     Source
libifcoremt.so.5   00002B53F837E2C9  for_emit_diagnost  Unknown  Unknown
libtrans_dp.so     00002B53F9A9BA48  fft992_IP_rpassf_  483      fft992.F90
libtrans_dp.so     00002B53F9A98DF3  fft992_            235      fft992.F90
libtrans_dp.so     00002B53F9CC6B94  ftinv_mod_mp_ftin  94       ftinv_mod.F90
libtrans_dp.so     00002B53F9CBC608  ftinv_ctl_mod_mp_  193      ftinv_ctl_mod.F90
libiomp5.so        00002B53F8041A43  __kmp_invoke_micr  Unknown  Unknown
libiomp5.so        00002B53F80052C6  __kmp_fork_call    Unknown  Unknown
libiomp5.so        00002B53F7FC4BB0  __kmpc_fork_call   Unknown  Unknown
libtrans_dp.so     00002B53F9CB780E  ftinv_ctl_mod_mp   178      ftinv_ctl_mod.F90
libtrans_dp.so     00002B53F9D0E84E  inv_trans_ctl_mod  288      inv_trans_ctl_mod.F90
libtrans_dp.so     00002B53F9F5257D  inv_trans          610      inv_trans.F90
ectrans_test_adjo  000000000040AD26  MAIN               197      test_adjoint.F90
ectrans_test_adjo  0000000000401BA2  Unknown            Unknown  Unknown
libc-2.17.so       00002B53FB1C8555  __libc_start_main  Unknown  Unknown
ectrans_test_adjo  0000000000401AB9  Unknown            Unknown  Unknown

This failure occurs with the Intel compilers. GNU compilers seem to work.

Dangling symbolic links in installation include directory

I did a build and install of ectrans and saw this in the installation include/ectrans directory:

-rw-r--r--. 1 frwd users 6524 Feb 18 17:07 dir_trans.h
-rw-r--r--. 1 frwd users 6370 Feb 18 17:07 dir_transad.h
-rw-r--r--. 1 frwd users 2084 Feb 18 17:07 dist_grid.h
-rw-r--r--. 1 frwd users 2044 Feb 18 17:07 dist_grid_32.h
-rw-r--r--. 1 frwd users 2205 Feb 18 17:07 dist_spec.h
-rw-r--r--. 1 frwd users 1998 Feb 18 17:07 gath_grid.h
-rw-r--r--. 1 frwd users 2013 Feb 18 17:07 gath_grid_32.h
-rw-r--r--. 1 frwd users 2277 Feb 18 17:07 gath_spec.h
-rw-r--r--. 1 frwd users 1375 Feb 29 17:07 get_current.h
-rw-r--r--. 1 frwd users 2292 Feb 18 17:07 gpnorm_trans.h
-rw-r--r--. 1 frwd users 3030 Feb 29 17:07 ini_spec_dist.h
-rw-r--r--. 1 frwd users 8022 Feb 18 17:07 inv_trans.h
-rw-r--r--. 1 frwd users 7809 Feb 18 17:07 inv_transad.h
-rw-r--r--. 1 frwd users 4558 Feb 18 17:07 setup_trans.h
-rw-r--r--. 1 frwd users 3641 Feb 29 17:07 setup_trans0.h
-rw-r--r--. 1 frwd users 1942 Feb 18 17:07 specnorm.h
-rw-r--r--. 1 frwd users 1450 Feb 18 17:07 sugawc.h
-rw-r--r--. 1 frwd users 1134 Feb 18 17:07 trans_end.h
-rw-r--r--. 1 frwd users 8287 Feb 18 17:07 trans_inq.h
-rw-r--r--. 1 frwd users 1749 Feb 18 17:07 trans_pnm.h
-rw-r--r--. 1 frwd users 570 Feb 18 17:07 trans_release.h
lrwxrwxrwx. 1 frwd users 14 Mar 6 12:59 transi.h -> ../../transi.h
lrwxrwxrwx. 1 frwd users 15 Mar 6 12:59 version.h -> ../../version.h
-rw-r--r--. 1 frwd users 2174 Feb 18 17:07 vordiv_to_uv.h

Those symbolic links are dangling.

Runtime failure in DIST_SPEC_CONTROL

I have a program which is failing without any sort of error message. Looking through a core dump, it appears to be failing in the call to DIST_SPEC_CONTROL, and this is confirmed by adding write statements. I know this isn't much to go on, and it will be hard for me to produce a reproducer.

The compiler is GNU 12 on a Cray. It is related to https://github.com/JCSDA-internal/saber/issues/821.

Need to be able to pass more compile options to ctest

Trying to run ctests with e.g. NAG gives this:

NAG Fortran Compiler Release 7.0(Yurakucho) Build 7048
Fatal Error: /home/h01/frwd/cylc-run/mi-be984/work/1/git_clone_ectrans/ectrans/tests/test_install/main.F90, line 13: Incompatible option setting for module YOMHOOK (was compiled with the -kind=byte option)
detected at YOMHOOK@
make[2]: *** [CMakeFiles/main_dp.dir/main.F90.o] Error 2

This is because things like Fortran compile flags need to be passed through to the ctest builds, but currently that doesn't happen.

Crash when compiling with ACFL and '-O3 -mcpu=native' flags

@samhatfield

Hello,

I am playing with ecTrans on the Graviton3 system. Compiling with ACFL (Arm Compiler for Linux = armclang/armflang) led the app to crash when using some performance flags. I confirmed that the issue occurs on other systems with SVE, but not on systems without it. Below is a table summarizing my experiments.

The hardware I tested on:

  • AmpereQ8030 : Ampere Altra Q8030 (80 cores, neoverse n1, without SVE)
  • Fuji_A64FX : 48 cores, with SVE
  • Graviton 3 : 64 cores, neoverse v1, with SVE

The software stack consists of:

  • ACFL 23.04.1 with OpenMPI 4.1.5 (BLAS provided by ARMPL)
    • tested both AmazonLinux2 (Graviton3) and RHEL8 (Ampere & Fuji) distros
  • GCC 13.1.0 with OpenMPI 4.1.5 and OpenBLAS 0.3.23

And the command run is mpiexec -n 1 ./ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv. Note that similar behavior occurs with single precision.

System      | Compiler     | Config     | Flags                                                                         | Build | Exec (max error)
=========== | ============ | ========== | ============================================================================= | ===== | ================
AmpereQ8030 | gcc-13.1.0   | perf       | -O3 -ffast-math -mcpu=native -g -fno-omit-frame-pointer -pipe                 | ✅    | ✅ (0.292E-13)
AmpereQ8030 | acfl-23.04.1 | perf       | -O3 -mcpu=native -ffp-model=fast -fsimdmath -g -fno-omit-frame-pointer -pipe  | ✅    | ✅ (0.232E-13)
Fuji_A64FX  | gcc-13.1.0   | perf       | -O3 -ffast-math -mcpu=native -g -fno-omit-frame-pointer -pipe                 | ✅    | ✅ (0.576E-13)
Fuji_A64FX  | acfl-23.04.1 | perf       | -O3 -mcpu=native -ffp-model=fast -fsimdmath -g -fno-omit-frame-pointer -pipe  | ✅    | ❌ (crash)
Fuji_A64FX  | acfl-23.04.1 | nosimdmath | -O3 -mcpu=native -ffp-model=fast -DNDEBUG -pipe                               | ✅    | ❌ (crash)
Fuji_A64FX  | acfl-23.04.1 | mcpu       | -O3 -mcpu=native -DNDEBUG -pipe                                               | ✅    | ❌ (crash)
Fuji_A64FX  | acfl-23.04.1 | normal     | -O3 -DNDEBUG -pipe                                                            | ✅    | ✅ (0.232E-13)
Graviton3   | gcc-13.1.0   | perf       | -O3 -ffast-math -mcpu=native -g -fno-omit-frame-pointer -pipe                 | ✅    | ✅ (0.349E-13)
Graviton3   | acfl-23.04.1 | perf       | -O3 -mcpu=native -ffp-model=fast -fsimdmath -g -fno-omit-frame-pointer -pipe  | ✅    | ❌ (crash)
Graviton3   | acfl-23.04.1 | nosimdmath | -O3 -mcpu=native -ffp-model=fast -DNDEBUG -pipe                               | ✅    | ❌ (crash)
Graviton3   | acfl-23.04.1 | mcpu       | -O3 -mcpu=native -DNDEBUG -pipe                                               | ✅    | ❌ (crash)
Graviton3   | acfl-23.04.1 | normal     | -O3 -DNDEBUG -pipe                                                            | ✅    | ✅ (0.232E-13)

As we can see, the non-SVE AmpereQ8030 system seems unaffected by this issue, whereas both SVE systems exhibit similar behavior. We can also observe that removing the -mcpu=native flag leads to a successful run.

Typical output when crashing looks like this (this run was on Graviton3 using the double-precision benchmark):

CMD: mpiexec -n 1 ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv
 CONVERGENCE FAILED IN SUGAW 
 ALLOWED :   20NECESSARY :   21
 ABORT_TRANS CALLED
  FAILURE IN SUGAW 
 ABORT!    1  FAILURE IN SUGAW 
SDL_TRACEBACK [PROC=1,THRD=1] ...
[LinuxTraceBack] Backtrace(s) for program './bin/ectrans-benchmark-dp' : sigcontextptr=0xffffdbc0c8c0
[LinuxTraceBack] Backtrace (size = 10) with addr2line-cmd
[LinuxTraceBack] /usr/bin/addr2line -fs -e './bin/ectrans-benchmark-dp' 0xffff8a64f010 0xffff8a6a11ac 0xffff8a8e5dc4 0xffff8a91d4a8 0xffff8a91e1f0 0xffff8a95923c 0xaaaaaf7d4014 0xaaaaaf7d2510 0xffff89b3ada4 0xaaaaaf7d23f4
[LinuxTraceBack] [00]: libfiat.so(LinuxTraceBack+0x190) [0xffff8a64f010] : ??() at ??:0
[LinuxTraceBack] [01]: libfiat.so(sdl_mod_sdl_traceback_+0x16c) [0xffff8a6a11ac] : ??() at ??:0
[LinuxTraceBack] [02]: libtrans_dp.so(abort_trans_mod_abort_trans_+0x214) [0xffff8a8e5dc4] : ??() at ??:0
[LinuxTraceBack] [03]: libtrans_dp.so(sugaw_mod_sugaw_+0xe98) [0xffff8a91d4a8] : ??() at ??:0
[LinuxTraceBack] [04]: libtrans_dp.so(suleg_mod_suleg_+0xac0) [0xffff8a91e1f0] : ??() at ??:0
[LinuxTraceBack] [05]: libtrans_dp.so(setup_trans_+0x1cdc) [0xffff8a95923c] : ??() at ??:0
[LinuxTraceBack] [06]: ectrans-benchmark-dp(+0x4014) [0xaaaaaf7d4014] : ??() at ??:0
[LinuxTraceBack] [07]: ectrans-benchmark-dp(+0x2510) [0xaaaaaf7d2510] : ??() at ??:0
[LinuxTraceBack] [08]: libc.so.6(__libc_start_main+0xe4) [0xffff89b3ada4] : ??() at ??:0
[LinuxTraceBack] [09]: ectrans-benchmark-dp(+0x23f4) [0xaaaaaf7d23f4] : ??() at ??:0
[LinuxTraceBack] End of backtrace(s)
SDL_TRACEBACK [PROC=1,THRD=1] ... DONE
[ip-10-0-7-69:01665] *** Process received signal ***
[ip-10-0-7-69:01665] Signal: Aborted (6)
[ip-10-0-7-69:01665] Signal code:  (-6)
[ip-10-0-7-69:01665] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffff8ac30860]
[ip-10-0-7-69:01665] [ 1] /lib64/libpthread.so.0(raise+0xb0)[0xffff89cd54b0]
[ip-10-0-7-69:01665] [ 2] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/perf/build/fiat_prefix/lib64/libfiat.so(sdl_mod_sdl_srlabort_+0x10)[0xffff8a6a12a0]
[ip-10-0-7-69:01665] [ 3] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/perf/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(sugaw_mod_sugaw_+0xe98)[0xffff8a91d4a8]
[ip-10-0-7-69:01665] [ 4] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/perf/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(suleg_mod_suleg_+0xac0)[0xffff8a91e1f0]
[ip-10-0-7-69:01665] [ 5] /shared/efs_home/amorvan/workdir/IFS/ectrans-scripts/run_now/compute/acfl-23.04.1/perf/build/ectrans_prefix/bin/../lib64/libtrans_dp.so(setup_trans_+0x1cdc)[0xffff8a95923c]
[ip-10-0-7-69:01665] [ 6] ./bin/ectrans-benchmark-dp(+0x4014)[0xaaaaaf7d4014]
[ip-10-0-7-69:01665] [ 7] ./bin/ectrans-benchmark-dp(+0x2510)[0xaaaaaf7d2510]
[ip-10-0-7-69:01665] [ 8] /lib64/libc.so.6(__libc_start_main+0xe4)[0xffff89b3ada4]
[ip-10-0-7-69:01665] [ 9] ./bin/ectrans-benchmark-dp(+0x23f4)[0xaaaaaf7d23f4]
[ip-10-0-7-69:01665] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-10-0-7-69 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

The build uses the following parameters (excerpt from the full script: https://gist.github.com/antoine-morvan/611c4d779fd704279bb0b938598fb597):

normal)
    export CFLAGS="-O3 -DNDEBUG"
    export FCFLAGS="$CFLAGS"
    export CXXFLAGS="$CXXFLAGS"
    CMAKE_BUILD_TYPE="None"
    ;;
mcpu)
    export CFLAGS="-O3 -mcpu=native -DNDEBUG"
    export FCFLAGS="$CFLAGS"
    export CXXFLAGS="$CXXFLAGS"
    CMAKE_BUILD_TYPE="None"
    ;;

(cd ${fiat_BUILD} && cmake \
        -DCMAKE_BUILD_TYPE="$CMAKE_BUILD_TYPE" \
        -DCMAKE_Fortran_FLAGS="$FCFLAGS" \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_INSTALL_PREFIX=${fiat_ROOT} \
        -DENABLE_TESTS=OFF \
        ${fiat_SRC} && make -j && make install)

(cd ${ectrans_BUILD} && cmake \
        -DCMAKE_BUILD_TYPE="$CMAKE_BUILD_TYPE" \
        -DCMAKE_Fortran_FLAGS="$FCFLAGS" \
        -DCMAKE_C_FLAGS="$CFLAGS" \
        -DCMAKE_INSTALL_PREFIX=${ectrans_ROOT} \
        -DENABLE_TESTS=ON \
        ${ectrans_SRC} && make -j && make install)

Then this benchmark causes the run to fail:

export OMP_NUM_THREADS=4
export MPI_NUM_RANKS=1

BINARY=ectrans-benchmark-dp
PROFILEARGS="--meminfo --norms"
ARGS="-n 20 -f 5 -l 40 --vordiv"
(cd ${ectrans_ROOT} \
    && mpirun -n $MPI_NUM_RANKS ./bin/$BINARY $PROFILEARGS $ARGS )

Looking at the backtrace it feels like the problem originates from fiat, but I did not investigate further.

Also, ARM is aware of this issue.

Feel free to ask more details.

Best.

Norm computation issue

Hello,

Problem

I tried a configuration that makes the computation go way off (errors of order E+03). However, when using --norms to check the deviation, the combined max error is 0. This is because 0 is greater than -999.

Still, the computation is wrong and this should get caught.

My wrapper script was missing this deviation because it focuses solely on the combined result.

Setup

To reproduce:

  • compile with AOCC 4.1.0; OpenMPI 4.1.5; OpenBLAS 0.3.23; FFTW 3.3.10:

        export CFLAGS="-O3 -march=native -mtune=native"
        export CXXFLAGS="$CFLAGS"
        export FCFLAGS="$CFLAGS"

  • could reproduce on the latest AMD & Intel CPUs
  • run ectrans-benchmark-dp --norms -n 5 -l 137 -t 319 --vordiv --scders

Some Leads?

After some investigation, I spotted two potential causes:

  1. The max error is initialized with zmaxerr(:) = -999.0. It would be wiser to initialize it with 0.
    zmaxerr(:) = -999.0
  2. When enabling verbosity (and printing the denominator), we observed that half of the values in the arrays znormvor(:), znormdiv(:), znormt(:) and znormsp(:) come out as NaN.
    do ifld = 1, nflevg
      zerr(3) = abs(znormvor1(ifld)/znormvor(ifld) - 1.0d0)
      zmaxerr(3) = max(zmaxerr(3), zerr(3))
      if (verbosity >= 1) then
        write(nout,'("norm zspvor( ",i4,") = ",f20.15," error = ",e10.3)') ifld, znormvor1(ifld), zerr(3)
      endif
    enddo
    do ifld = 1, nflevg
      zerr(2) = abs(znormdiv1(ifld)/znormdiv(ifld) - 1.0d0)
      zmaxerr(2) = max(zmaxerr(2), zerr(2))
      if (verbosity >= 1) then
        write(nout,'("norm zspdiv( ",i4,",:) = ",f20.15," error = ",e10.3)') ifld, znormdiv1(ifld), zerr(2)
      endif
    enddo
    do ifld = 1, nflevg
      zerr(4) = abs(znormt1(ifld)/znormt(ifld) - 1.0d0)
      zmaxerr(4) = max(zmaxerr(4), zerr(4))
      if (verbosity >= 1) then
        write(nout,'("norm zspsc3a(",i4,",:,1) = ",f20.15," error = ",e10.3)') ifld, znormt1(ifld), zerr(4)
      endif
    enddo
    do ifld = 1, 1
      zerr(1) = abs(znormsp1(ifld)/znormsp(ifld) - 1.0d0)
      zmaxerr(1) = max(zmaxerr(1), zerr(1))
      if (verbosity >= 1) then
        write(nout,'("norm zspsc2( ",i4,",:) = ",f20.15," error = ",e10.3)') ifld, znormsp1(ifld), zerr(1)
      endif
    enddo

Use this to print the denominator:

  verbosity=1
  zmaxerr(:) = 0
  do ifld = 1, nflevg
    zerr(3) = abs(znormvor1(ifld)/znormvor(ifld) - 1.0d0)
    zmaxerr(3) = max(zmaxerr(3), zerr(3))
    if (verbosity >= 1) then
      write(nout,'("norm zspvor( ",i4,")     = ",f20.15,f20.15,"        error = ",e10.3)') ifld, znormvor1(ifld), znormvor(ifld), zerr(3)
    endif
  enddo
  do ifld = 1, nflevg
    zerr(2) = abs(znormdiv1(ifld)/znormdiv(ifld) - 1.0d0)
    zmaxerr(2) = max(zmaxerr(2),zerr(2))
    if (verbosity >= 1) then
      write(nout,'("norm zspdiv( ",i4,",:)   = ",f20.15,f20.15,"        error = ",e10.3)') ifld, znormdiv1(ifld), znormdiv(ifld), zerr(2)
    endif
  enddo
  do ifld = 1, nflevg
    zerr(4) = abs(znormt1(ifld)/znormt(ifld) - 1.0d0)
    zmaxerr(4) = max(zmaxerr(4), zerr(4))
    if (verbosity >= 1) then
      write(nout,'("norm zspsc3a(",i4,",:,1) = ",f20.15,f20.15,"        error = ",e10.3)') ifld, znormt1(ifld),znormt(ifld), zerr(4)
    endif
  enddo
  do ifld = 1, 1
    zerr(1) = abs(znormsp1(ifld)/znormsp(ifld) - 1.0d0)
    zmaxerr(1) = max(zmaxerr(1), zerr(1))
    if (verbosity >= 1) then
      write(nout,'("norm zspsc2( ",i4,",:)   = ",f20.15,f20.15,"        error = ",e10.3)') ifld, znormsp1(ifld),znormsp(ifld), zerr(1)
    endif
  enddo

This could come from these arrays being initialized by a function declared with C binding that iterates over non-contiguous segments.

But that would require more investigation to confirm :)

Best.

Failure to compile SIZEOF (NAG) in suleg_mod

NAG compiles are failing in suleg_mod like this:

Error: /home/h01/frwd/cylc-run/mi-be984/work/1/ecbuild_ectrans_nag/build/src/trans/generated/ectrans_dp/internal/suleg_mod.F90, line 1207: Implicit type for SIZEOF in SULEG
[NAG Fortran Compiler pass 1 error termination, 1 error]
Error: /home/h01/frwd/cylc-run/mi-be984/work/1/ecbuild_ectrans_nag/build/src/trans/generated/ectrans_sp/internal/suleg_mod.F90, line 1207: Implicit type for SIZEOF in SULEG
[NAG Fortran Compiler pass 1 error termination, 1 error]
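
SIZEOF is a vendor extension that NAG does not provide, so NAG treats the name as an implicitly typed function and errors. A hedged sketch of the standard-conforming replacement, STORAGE_SIZE (Fortran 2008, returns a size in bits):

    PROGRAM DEMO
      IMPLICIT NONE
      REAL(KIND(1.0D0)), ALLOCATABLE :: ZWORK(:)
      INTEGER :: IBYTES
      ALLOCATE(ZWORK(1000))
      ! Equivalent of the non-standard SIZEOF(ZWORK):
      IBYTES = SIZE(ZWORK) * STORAGE_SIZE(ZWORK) / 8
      PRINT *, 'bytes:', IBYTES   ! prints 8000
    END PROGRAM DEMO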

Unifying cpu/external and gpu/external

I'm thinking ahead to when redgreen-optimized is merged into develop. After this merge, src/trans will have this structure (ext: external, int: internal):

                    src/trans
                   /         \
                  /           \
                cpu           gpu
              /     \        /   \
             /       \      /     \
           ext       int  ext     int

cpu/internal and gpu/internal are of course substantially different, and probably not worth trying to combine at this stage (same for algor etc., not shown). But there is a lot of overlap between cpu/external and gpu/external so I'm wondering if we can somehow combine these two to make

                    src/trans
                   /    |   \
                  /     |    \
                 /      |     \
                /       |      \
              ext    cpu_int  gpu_int

In the relevant CMakeLists.txt, we would just have to modify the source file lists for the trans_cpu and trans_gpu libraries.

Here's a breakdown of the differences in every file between cpu/external and gpu/external:

  • dir_transad.F90: only whitespace differences.
  • dir_trans.F90: GSTATS overloaded by GSTATS_NVTX -> put the overload in a #if defined(__NVCOMPILER) block? (A rough sketch follows at the end of this issue.)
  • dist_grid_32.F90: only whitespace differences.
  • dist_grid.F90: only whitespace differences.
  • dist_spec.F90: GPU version is out of date, but could be updated.
  • gath_grid_32.F90: only whitespace differences.
  • gath_grid.F90: only whitespace differences.
  • gath_spec.F90: only whitespace differences.
  • get_current.F90: only module USE differences - trivial.
  • gpnorm_trans.F90: significant differences, however gpu/internal/gpnorm_trans_ctl.F90 could be reinstated so cpu/gpu differences are hidden by unified GPNORM_TRANS subroutine.
  • gpnorm_trans_gpu.F90: do we actually need this?
  • ini_spec_dist.F90: only whitespace differences.
  • inv_transad.F90: only whitespace differences.
  • inv_trans.F90: same issue with GSTATS_NVTX as above. Also arguments seem to be validated slightly differently between cpu and gpu, but I think these should be the same. Probably resolvable.
  • setup_trans0.F90: only meaningful differences are in the GPU pinning logic, but this could be wrapped in preprocessor statements so it's only compiled when targeting GPUs.
  • setup_trans.F90: big differences. This is where device memory is allocated. For now, wrap this in GPU-specific preprocessor regions?
  • specnorm.F90: literally identical.
  • sugawc.F90: literally identical.
  • trans_end.F90: some device deallocation code, which could be wrapped in GPU-specific preprocessor regions.
  • trans_inq.F90: some variable cast differences, but nothing meaningful.
  • trans_pnm.F90: references to JPRBT in GPU version which could be kept.
  • trans_release.F90: literally identical.
  • vordiv_to_uv.F90: only whitespace differences.

So most files are basically the same already, and only two or three would require some thought. Seems like an obvious way to reduce the complexity and code volume of the library.
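
As a very rough sketch of the GSTATS point above (the ECTRANS_GPU guard and the idea of a plain preprocessor alias are hypothetical, not an agreed design):

    ! In a unified dir_trans.F90, route timing calls to the NVTX-annotated
    ! wrapper only in GPU builds with the NVIDIA compiler:
    #if defined(__NVCOMPILER) && defined(ECTRANS_GPU)
    #define GSTATS GSTATS_NVTX
    #endif
    SUBROUTINE DEMO
      CALL GSTATS(152,0)   ! expands to GSTATS_NVTX(152,0) when the guard is active
      CALL GSTATS(152,1)
    END SUBROUTINE DEMO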

gstats labels in the benchmark

To resolve this TODO we would have to import ifs/utility/gstats_label_ifs.F90 from ifs-source, and perhaps remove the labels that aren't relevant for ecTrans. Can we think of a better way to do this, or should we just forget about gstats labels entirely?

FLUSH function non-standard

There are a few uses of the FLUSH intrinsic in the code. This function is non-standard; there is a FLUSH statement that has been standard since Fortran 2003 and is widely supported.
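
The mechanical fix, sketched on a hypothetical unit number:

    PROGRAM DEMO
      IMPLICIT NONE
      INTEGER, PARAMETER :: NOUT = 6
      WRITE(NOUT,'(A)') 'some output'
      ! CALL FLUSH(NOUT)  ! FLUSH intrinsic: vendor extension, not portable
      FLUSH(NOUT)         ! FLUSH statement: standard since Fortran 2003
    END PROGRAM DEMO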

Add ecTrans versioning information print statement to SETUP_TRANS0

It would be useful to see the Git version of ecTrans printed to STDOUT in SETUP_TRANS0. I'm often playing with multiple branches at the same time, so I want to verify that I'm using the one I think I'm using.

If we put it in SETUP_TRANS0 it will be printed to STDOUT when running the benchmark, or NODE.001_01 when run from IFS.

@wdeconinck isn't there an easy way to have CMake insert this information into the Fortran source through its preprocessor?
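
In principle, yes: CMake can obtain the SHA (e.g. with execute_process over git rev-parse) and pass it as a compile definition, and the Fortran side just prints the macro. A hedged sketch, with ECTRANS_GIT_SHA as a hypothetical macro name:

    SUBROUTINE PRINT_ECTRANS_VERSION(KNOUT)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: KNOUT
    #ifdef ECTRANS_GIT_SHA
      WRITE(KNOUT,'(A)') ' ECTRANS GIT VERSION: '//ECTRANS_GIT_SHA
    #else
      WRITE(KNOUT,'(A)') ' ECTRANS GIT VERSION: unknown'
    #endif
    END SUBROUTINE PRINT_ECTRANS_VERSION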

Compile failure in ectrans-benchmark.F90

Compiling with NAG gives:

Error: /home/h01/frwd/cylc-run/mi-be984/work/1/git_clone_ectrans/ectrans/src/programs/ectrans-benchmark.F90, line 1141: Implicit type for N
detected at N@)
Error: /home/h01/frwd/cylc-run/mi-be984/work/1/git_clone_ectrans/ectrans/src/programs/ectrans-benchmark.F90, line 1142: Symbol N has already been implicitly typed
detected at N@

Intel compile failure from illegal code

I am getting compile failures with Intel, for example:

/home/d03/frwd/cylc-run/AddEctransFiat/share/mo-bundle/ectrans/src/trans/internal/ftdirad_mod.F90(92): error #8284: If the actual argument is scalar, the dummy argument shall be scalar unless the actual argument is of type character or is an element of an array that is not assumed shape, pointer, or polymorphic. [A]
CALL FFT992(PREEL(1,IOFF),T%TRIGS(1,KGL),&
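
For reference, a minimal illustration of the rule being violated (names hypothetical): an element of an assumed-shape or pointer array may not be sequence-associated with an array dummy argument, whereas passing an array section is fine.

    MODULE DEMO_MOD
      IMPLICIT NONE
    CONTAINS
      SUBROUTINE FFT_LIKE(PREEL, N)      ! explicit-shape dummy, as in FFT992
        INTEGER, INTENT(IN) :: N
        REAL, INTENT(INOUT) :: PREEL(N)
        PREEL = 0.0
      END SUBROUTINE FFT_LIKE
      SUBROUTINE CALLER(PREEL)
        REAL, INTENT(INOUT) :: PREEL(:,:)   ! assumed-shape actual
        ! CALL FFT_LIKE(PREEL(1,1), SIZE(PREEL,1))  ! illegal: Intel error #8284
        CALL FFT_LIKE(PREEL(:,1), SIZE(PREEL,1))    ! legal: pass a section
      END SUBROUTINE CALLER
    END MODULE DEMO_MOD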
