Comments (9)
Added comment about segfaults from other benchmarks run on our local cluster
Benchmark_comms: faults after "Benchmarking concurrent STENCIL halo exchange in 1 dimensions"
Benchmark_comms_host_device: fault after "Benchmarking sequential halo exchange from GPU memory"
Benchmark_memory_asynch: runs OK
Benchmark_dwf: faults after "Setting up Cshift based reference"
from grid.
On Perlmutter, I built commit c0d56a1 according to the recipe in ./systems/Perlmutter. Benchmark_ITT segfaults
in parallel runs, while the same code runs correctly on a single GPU.
$ srun -n4 ~/grid/bench/perlmutter/bind_gpu4.sh ./benchmarks/Benchmark_ITT --mpi 1.1.2.2 --threads 8 --shm 2048 --debug-mem --debug-signals
... edited ...
Grid : Message : 3.494448 s : Benchmark DWF on 32^4 local volume
Grid : Message : 3.494451 s : * Nc : 3
Grid : Message : 3.494453 s : * Global volume : 32 32 64 64
Grid : Message : 3.494462 s : * Ls : 1
Grid : Message : 3.494464 s : * ranks : 4
Grid : Message : 3.494466 s : * nodes : 1
Grid : Message : 3.494469 s : * ranks/node : 4
Grid : Message : 3.494471 s : * ranks geom : 1 1 2 2
Grid : Message : 3.494474 s : * Using 8 threads
Grid : Message : 3.494476 s : ==================================================================================
Grid : Message : 3.708264 s : Initialised RNGs
(GTL DEBUG: 1) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
(GTL DEBUG: 3) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
(GTL DEBUG: 2) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
MPICH ERROR [Rank 1] [job id 2148669.5] [Tue May 10 07:44:32 2022] [nid001520] - Abort(70900482) (rank 1 in comm 0): Fatal error in PMPI_Sendrecv: Invalid count, error stack:
PMPI_Sendrecv(249)........................: MPI_Sendrecv(sbuf=0x7f3edfc00000, scount=589824, MPI_CHAR, dest=3, stag=1, rbuf=0x7f3d29520000, rcount=589824, MPI_CHAR, src=3, rtag=3, comm=0xc4000199, status=0x1) failed
... edited...
(unknown)(): Invalid count
BackTrace Strings: 0 ./benchmarks/Benchmark_ITT() [0x450419]
BackTrace Strings: 1 /lib64/libc.so.6(+0x4db09) [0x7fcf57d33b09]
BackTrace Strings: 2 /lib64/libc.so.6(+0x4dc9a) [0x7fcf57d33c9a]
BackTrace Strings: 3 /opt/cray/pe/lib64/libpmi2.so.0(PMI_Get_base_rank_in_app+0) [0x7fcf57408dd2]
BackTrace Strings: 4 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(+0x202cca2) [0x7fcf5b0e9ca2]
BackTrace Strings: 5 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(+0x1ebfa5c) [0x7fcf5af7ca5c]
BackTrace Strings: 6 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(MPIR_Err_return_comm+0x11b) [0x7fcf5af7cb8b]
BackTrace Strings: 7 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(MPI_Sendrecv+0x3b9) [0x7fcf59aa2729]
BackTrace Strings: 8 /global/common/software/nersc/pm-2021q4/sw/darshan/3.3.1/lib/libdarshan.so(MPI_Sendrecv+0x84) [0x7fcf5bd1ddd4]
BackTrace Strings: 9 ./benchmarks/Benchmark_ITT() [0x46fcd5]
BackTrace Strings: 10 ./benchmarks/Benchmark_ITT() [0x57168d]
BackTrace Strings: 11 ./benchmarks/Benchmark_ITT() [0x572039]
BackTrace Strings: 12 ./benchmarks/Benchmark_ITT() [0x572dd6]
BackTrace Strings: 13 ./benchmarks/Benchmark_ITT() [0x573cee]
BackTrace Strings: 14 ./benchmarks/Benchmark_ITT() [0x574baf]
BackTrace Strings: 15 ./benchmarks/Benchmark_ITT() [0x484b65]
BackTrace Strings: 16 ./benchmarks/Benchmark_ITT() [0x4445fc]
BackTrace Strings: 17 ./benchmarks/Benchmark_ITT() [0x411b51]
BackTrace Strings: 18 /lib64/libc.so.6(__libc_start_main+0xef) [0x7fcf57d1b2bd]
BackTrace Strings: 19 ./benchmarks/Benchmark_ITT() [0x4164ca]
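For anyone reading the log above: the "Global volume", "ranks" and "ranks geom" lines follow from the 32^4 local volume and the --mpi 1.1.2.2 argument. Below is a minimal sketch of that arithmetic, assuming the global extent in each dimension is simply the local extent times the number of ranks in that dimension. The helper names (parse_geometry, global_volume, total_ranks) are hypothetical and are not part of Grid.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a Grid function): split a dot-separated
// geometry string such as "1.1.2.2" into one integer per dimension.
std::vector<int> parse_geometry(const std::string& s) {
    std::vector<int> dims;
    std::stringstream ss(s);
    std::string tok;
    while (std::getline(ss, tok, '.')) {
        dims.push_back(std::stoi(tok));
    }
    return dims;
}

// Assumption: global extent = local extent * ranks, per dimension.
// Both vectors are expected to have the same length.
std::vector<int> global_volume(const std::vector<int>& local,
                               const std::vector<int>& ranks) {
    std::vector<int> g(local.size());
    for (std::size_t d = 0; d < local.size(); ++d) {
        g[d] = local[d] * ranks[d];
    }
    return g;
}

// Total number of MPI ranks is the product of the per-dimension counts.
int total_ranks(const std::vector<int>& ranks) {
    int n = 1;
    for (int r : ranks) {
        n *= r;
    }
    return n;
}
```

Calling global_volume({32, 32, 32, 32}, parse_geometry("1.1.2.2")) gives 32 32 64 64, and total_ranks gives 4, matching the "Global volume" and "ranks" lines in the log and the -n4 passed to srun.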
Unfortunately, this problem still persists at commit 042ab1a on the develop branch, dated Mon Jun 27, 2022.
Any updates on this?
Unfortunately, no updates. I see similar segfaults on systems other than Perlmutter. I suspect it is mainly a problem with the MPICH family of MPI implementations and later versions of Grid, though OpenMPI has also shown segfaults.
Are you using GPU-aware MPI? We have seen several unexplained segfaults with it that vanish when using a normal build of MPI. So far, the implementors have not been motivated to fix these.
I see the same segfaults using CUDA-aware OpenMPI, but I cannot confirm whether this is also the case with normal MPI. Do you suggest using normal OpenMPI instead?
Yes
There must be something else going on:
0 0x0000000000012ce0 __funlockfile() :0
1 0x00000000000cfcc3 __memmove_avx_unaligned_erms() :0
2 0x000000000004c194 ucp_dt_pack() ???:0
3 0x00000000000853e4 ucp_tag_offload_unexp_eager() ???:0
4 0x000000000001b962 uct_mm_ep_am_bcopy() ???:0
5 0x0000000000085a14 ucp_tag_offload_unexp_eager() ???:0
6 0x0000000000090897 ucp_tag_send_nbx() ???:0