
Comments (9)

james-simone avatar james-simone commented on June 20, 2024

Adding a note about segfaults from other benchmarks run on our local cluster:
Benchmark_comms: faults after "Benchmarking concurrent STENCIL halo exchange in 1 dimensions"
Benchmark_comms_host_device: faults after "Benchmarking sequential halo exchange from GPU memory"
Benchmark_memory_asynch: runs OK
Benchmark_dwf: faults after "Setting up Cshift based reference"

from grid.

james-simone avatar james-simone commented on June 20, 2024

On Perlmutter, I built git commit c0d56a1 according to the recipe in ./systems/Perlmutter. Benchmark_ITT segfaults
in parallel runs, while the code runs correctly on a single GPU.

$ srun -n4 ~/grid/bench/perlmutter/bind_gpu4.sh ./benchmarks/Benchmark_ITT --mpi 1.1.2.2 --threads 8 --shm 2048 --debug-mem --debug-signals
... edited ...
Grid : Message : 3.494448 s : Benchmark DWF on 32^4 local volume
Grid : Message : 3.494451 s : * Nc             : 3
Grid : Message : 3.494453 s : * Global volume  : 32 32 64 64
Grid : Message : 3.494462 s : * Ls             : 1
Grid : Message : 3.494464 s : * ranks          : 4
Grid : Message : 3.494466 s : * nodes          : 1
Grid : Message : 3.494469 s : * ranks/node     : 4
Grid : Message : 3.494471 s : * ranks geom     : 1 1 2 2
Grid : Message : 3.494474 s : * Using 8 threads
Grid : Message : 3.494476 s : ==================================================================================
Grid : Message : 3.708264 s : Initialised RNGs
(GTL DEBUG: 1) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
(GTL DEBUG: 3) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
(GTL DEBUG: 2) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
MPICH ERROR [Rank 1] [job id 2148669.5] [Tue May 10 07:44:32 2022] [nid001520] - Abort(70900482) (rank 1 in comm 0): Fatal error in PMPI_Sendrecv: Invalid count, error stack:
PMPI_Sendrecv(249)........................: MPI_Sendrecv(sbuf=0x7f3edfc00000, scount=589824, MPI_CHAR, dest=3, stag=1, rbuf=0x7f3d29520000, rcount=589824, MPI_CHAR, src=3, rtag=3, comm=0xc4000199, status=0x1) failed
... edited...
(unknown)(): Invalid count
BackTrace Strings: 0 ./benchmarks/Benchmark_ITT() [0x450419]
BackTrace Strings: 1 /lib64/libc.so.6(+0x4db09) [0x7fcf57d33b09]
BackTrace Strings: 2 /lib64/libc.so.6(+0x4dc9a) [0x7fcf57d33c9a]
BackTrace Strings: 3 /opt/cray/pe/lib64/libpmi2.so.0(PMI_Get_base_rank_in_app+0) [0x7fcf57408dd2]
BackTrace Strings: 4 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(+0x202cca2) [0x7fcf5b0e9ca2]
BackTrace Strings: 5 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(+0x1ebfa5c) [0x7fcf5af7ca5c]
BackTrace Strings: 6 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(MPIR_Err_return_comm+0x11b) [0x7fcf5af7cb8b]
BackTrace Strings: 7 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(MPI_Sendrecv+0x3b9) [0x7fcf59aa2729]
BackTrace Strings: 8 /global/common/software/nersc/pm-2021q4/sw/darshan/3.3.1/lib/libdarshan.so(MPI_Sendrecv+0x84) [0x7fcf5bd1ddd4]
BackTrace Strings: 9 ./benchmarks/Benchmark_ITT() [0x46fcd5]
BackTrace Strings: 10 ./benchmarks/Benchmark_ITT() [0x57168d]
BackTrace Strings: 11 ./benchmarks/Benchmark_ITT() [0x572039]
BackTrace Strings: 12 ./benchmarks/Benchmark_ITT() [0x572dd6]
BackTrace Strings: 13 ./benchmarks/Benchmark_ITT() [0x573cee]
BackTrace Strings: 14 ./benchmarks/Benchmark_ITT() [0x574baf]
BackTrace Strings: 15 ./benchmarks/Benchmark_ITT() [0x484b65]
BackTrace Strings: 16 ./benchmarks/Benchmark_ITT() [0x4445fc]
BackTrace Strings: 17 ./benchmarks/Benchmark_ITT() [0x411b51]
BackTrace Strings: 18 /lib64/libc.so.6(__libc_start_main+0xef) [0x7fcf57d1b2bd]
BackTrace Strings: 19 ./benchmarks/Benchmark_ITT() [0x4164ca]
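
Most frames inside Benchmark_ITT in the backtrace above carry only raw addresses. A minimal sketch of how they could be collected and handed to `addr2line` follows. It assumes the binary was built with debug information and is not position-independent, so the printed addresses can be used directly; the helper names `unresolved_frames` and `addr2line_command` are hypothetical and not part of Grid.

```python
import re
import shlex

# Matches frames such as:
#   BackTrace Strings: 9 ./benchmarks/Benchmark_ITT() [0x46fcd5]
FRAME_RE = re.compile(
    r"BackTrace Strings:\s+(\d+)\s+(\S+?)\((.*?)\)\s+\[(0x[0-9a-fA-F]+)\]"
)


def unresolved_frames(log_text, binary_suffix="Benchmark_ITT"):
    """Return (frame_index, address) pairs for frames that live in the
    benchmark binary and have no symbol name between the parentheses."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if not match:
            continue
        index, path, symbol, address = match.groups()
        if path.endswith(binary_suffix) and symbol == "":
            frames.append((int(index), address))
    return frames


def addr2line_command(binary, frames):
    """Build an addr2line command line (-f: function names, -C: demangle)."""
    addresses = " ".join(address for _, address in frames)
    return "addr2line -f -C -e " + shlex.quote(binary) + " " + addresses


if __name__ == "__main__":
    sample = (
        "BackTrace Strings: 7 /opt/cray/pe/lib64/libmpi_gnu_91.so.12"
        "(MPI_Sendrecv+0x3b9) [0x7fcf59aa2729]\n"
        "BackTrace Strings: 9 ./benchmarks/Benchmark_ITT() [0x46fcd5]\n"
        "BackTrace Strings: 10 ./benchmarks/Benchmark_ITT() [0x57168d]\n"
    )
    print(addr2line_command("./benchmarks/Benchmark_ITT", unresolved_frames(sample)))
```

Running the printed command on the machine where the benchmark was built should map each address to a function and source line, which may show which Grid communication routine issued the failing MPI_Sendrecv.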


james-simone avatar james-simone commented on June 20, 2024

Unfortunately, this problem still persists at commit 042ab1a of the develop branch, dated Mon Jun 27, 2022.


lcebaman avatar lcebaman commented on June 20, 2024

Any updates on this?


james-simone avatar james-simone commented on June 20, 2024

Unfortunately, no updates. I see similar segfaults on systems other than Perlmutter. I suspect it is more of a problem with the MPICH family of MPI implementations and later versions of Grid, though OpenMPI has also shown segfaults.


knepley avatar knepley commented on June 20, 2024

Are you using GPU-aware MPI? We have seen several unexplained segfaults with this that vanish using the normal build of MPI. So far, the implementors have not been motivated to fix these.


lcebaman avatar lcebaman commented on June 20, 2024

I see the same segfaults using CUDA-aware OpenMPI, but I cannot confirm whether this is also the case with normal MPI. Do you suggest using normal OpenMPI instead?


knepley avatar knepley commented on June 20, 2024

Yes
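
To check which kind of OpenMPI build is installed before switching, the sketch below parses the output of `ompi_info --parsable --all` for the `mpi_built_with_cuda_support` field. It assumes OpenMPI specifically (it does not apply to Cray MPICH as used on Perlmutter) and reports build-time support only; the function names are hypothetical.

```python
import subprocess


def built_with_cuda_support(ompi_info_text):
    """Return True or False from `ompi_info --parsable --all` output,
    or None when the field is not present in the text."""
    for line in ompi_info_text.splitlines():
        if "mpi_built_with_cuda_support:value:" in line:
            # The value is the last colon-separated field on the line.
            return line.rsplit(":", 1)[1].strip().lower() == "true"
    return None


def query_local_openmpi():
    """Run ompi_info on this machine; returns None if it is unavailable."""
    try:
        result = subprocess.run(
            ["ompi_info", "--parsable", "--all"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return built_with_cuda_support(result.stdout)


if __name__ == "__main__":
    print("OpenMPI built with CUDA support:", query_local_openmpi())
```

A result of `False` would indicate a normal (non-CUDA-aware) build, which is the configuration suggested above for comparison.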


lcebaman avatar lcebaman commented on June 20, 2024

There must be something else going on:

 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x00000000000cfcc3 __memmove_avx_unaligned_erms()  :0
 2 0x000000000004c194 ucp_dt_pack()  ???:0
 3 0x00000000000853e4 ucp_tag_offload_unexp_eager()  ???:0
 4 0x000000000001b962 uct_mm_ep_am_bcopy()  ???:0
 5 0x0000000000085a14 ucp_tag_offload_unexp_eager()  ???:0
 6 0x0000000000090897 ucp_tag_send_nbx()  ???:0

