@jedbrown @jrwrigh feel free to describe anything that @brtnfld and others developing CGNS might need to know that I missed in the above description
from cgns.
Do only a few ranks print that message? Some ranks may not have anything to read at that scale, and we may not have the correct bail-out for that situation.
I've run CGNS with 43k ranks with no issue, but that was with 43k ranks reading a file created with 43k ranks. Compiling CGNS with -DADFH_DEBUG_ON, or uncommenting #define ADFH_DEBUG_ON in ADFH.c, might help with the diagnostics. However, that will produce a ton of output at that rank count. It might also be helpful to determine the smallest rank count at which the problem occurs.
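For reference, a sketch of the two ways to enable those diagnostics (the define name comes from the suggestion above; the exact invocation depends on how CGNS is built, e.g. standalone CMake vs. PETSc's --download-cgns):

```shell
# Option 1: rebuild CGNS with the debug define on the compile line
cmake -DCMAKE_C_FLAGS="-DADFH_DEBUG_ON" /path/to/CGNS && make

# Option 2: edit ADFH.c, uncomment the define, and rebuild:
#   /* #define ADFH_DEBUG_ON */   ->   #define ADFH_DEBUG_ON
```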
If you can provide me access to the file on Aurora, I can look into it. If you have a simple reproducer, that would also help.
Let me know when you get to the DAOS phase, as CGNS will need the fixes mentioned in #613. I will try to get the fixes in branch CGNS218 into develop. If you continue with Lustre, you will likely want to consider using HDF5 subfiling.
from cgns.
Thanks for the advice. I did not realize that CGNS did anything differently when reading a file with m processes that was written by n processes when n is not equal to m. I thought there was no concept of a prior partition.
Answering your first question:
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "CGNS error 1 mismatch" JZ1536Nodes0515_240108.o618823 |wc
48 672 3984
suggests that only 48 of the 18432 processes that I expected to participate in reading the file (that is, at least, how I chunked out each read line) are reporting this error, but this also relies on PETSc error reporting, e.g.
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "14163" JZ1536Nodes0515_240108.o618823 |grep -v ": -" |grep -v ":-"
[14163]PETSC ERROR: Error in external library
[14163]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14163]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc!
[14163]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[14163]PETSC ERROR: Petsc Development GIT revision: v3.19.5-1858-g581ad989054 GIT Date: 2024-02-12 14:59:06 -0700
[14163]PETSC ERROR: Configure options --with-debugging=0 --with-mpiexec-tail=gpu_tile_compact.sh --with-64-bit-indices --with-cc=mpicc --with-cxx=mpicxx --with-fc=0 --COPTFLAGS=-O2 --CXXOPTFLAGS=-O2 --FOPTFLAGS=-O2 --SYCLPPFLAGS=-Wno-tautological-constant-compare --SYCLOPTFLAGS=-O2 --download-kokkos --download-kokkos-kernels --download-kokkos-commit=origin/develop --download-kokkos-kernels-commit=origin/develop --download-hdf5 --download-cgns --download-metis --download-parmetis --download-ptscotch=../scotch_7.0.4beta3.tar.gz --with-sycl --with-syclc=icpx --with-sycl-arch=pvc --PETSC_ARCH=05-15_RB240108_B_JZ
[14163]PETSC ERROR: #1 DMPlexCreateCGNSFromFile_Internal() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/cgns/plexcgns2.c:187
[14163]PETSC ERROR: #2 DMPlexCreateCGNSFromFile() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcgns.c:29
[14163]PETSC ERROR: #3 DMPlexCreateFromFile() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcreate.c:5921
[14163]PETSC ERROR: #4 DMPlexCreateFromOptions_Internal() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcreate.c:3943
[14163]PETSC ERROR: #5 DMSetFromOptions_Plex() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcreate.c:4465
[14163]PETSC ERROR: #6 DMSetFromOptions() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/interface/dm.c:905
[14163]PETSC ERROR: #7 CreateDM() at /lus/gecko/projects/PHASTA_aesp_CNDA/libCEED_0515006_240108_JZ/examples/fluids/src/setupdm.c:36
[14163]PETSC ERROR: #8 main() at /lus/gecko/projects/PHASTA_aesp_CNDA/libCEED_0515006_240108_JZ/examples/fluids/navierstokes.c:159
[14163]PETSC ERROR: PETSc Option Table entries:
Abort(76) on node 14163 (rank 0 in comm 16): application called MPI_Abort(MPI_COMM_SELF, 76) - process 0
so there are 48 processes that report the CGNS error:
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "CGNS error 1 mismatch" JZ1536Nodes0515_240108.o618823
[14163]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14174]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14164]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14175]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14167]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14179]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14168]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14180]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14182]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14171]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14172]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14160]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14173]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14161]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14162]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14176]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14177]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14178]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14165]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14181]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14166]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14183]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14169]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14170]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14184]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14185]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14186]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14196]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14187]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14197]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14188]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14198]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14189]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14199]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14190]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14200]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14191]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14201]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14192]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14202]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14193]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14203]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14194]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14204]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14195]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14205]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14206]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14207]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
I have not sorted these, but the tight rank range suggests that just 4 nodes (12 processes each) are unhappy?
@jedbrown or @jrwrigh will know better, but, for example,
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "13163" JZ1536Nodes0515_240108.o618823
Returns nothing, so it seems the other processes (the other 762 nodes, if these errors are indeed all on the same 4 nodes) are not getting this error, or at least are not reporting it through PETSc.
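A quick way to check whether the erroring ranks really form one contiguous block (and hence a handful of nodes) is to extract the bracketed rank IDs and compare the count against the min/max; a sketch, with sample lines standing in for the real log file:

```shell
# Sample input stands in for JZ1536Nodes0515_240108.o618823.
summary=$(printf '%s\n' \
  '[14163]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read' \
  '[14207]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read' \
  '[14160]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read' \
  | sed -n 's/^\[\([0-9]*\)\].*/\1/p' \
  | sort -n \
  | awk 'NR==1{min=$1} {max=$1} END{print NR, min, max}')
echo "$summary"   # prints: count min max; contiguous iff max-min+1 == count
```

On the real log, count of 48 with max-min+1 == 48 would confirm the ranks are the contiguous block 14160 through 14207, i.e. exactly 4 nodes at 12 ranks per node.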
Can you explain "Some ranks may not have something to read at that scale"? I think we spread the node and element read ranges out evenly, even for boundary elements (which were a rendezvous headache to get back onto ranks that better correspond to their node range).
Thanks for the flags to get more verbose output, even if it will be a huge haystack to sift through for the needle.
Since it works fine at 96, 192, 384, and 768 nodes, I don't know how to make a small reproducer.
If you have an account on Aurora, give me your username and I will ask support to add you to our group; these files are in our group space and readable by anyone I add.
Can you point us to information on HDF5 subfiling? This might be more promising than debugging a case that is beyond the limits of Lustre.
from cgns.
You are correct. By default, there are no rank dependencies for a CGNS file unless an application introduces such dependencies, such as different zones for each rank.
I wanted to know whether the data for the larger-scale case was partitioned such that some ranks might have nothing to read.
Do you always need to double the nodes for the next rank jump? For example, can't you run with 576 nodes?
General subfiling info is here: https://github.com/HDFGroup/hdf5doc/blob/master/RFCs/HDF5_Library/VFD_Subfiling/user_guide/HDF5_Subfiling_VFD_User_s_Guide.pdf
I've not merged the CGNS "subfiling" branch into develop. I've tested it on Summit and Frontier and will have some Aurora results shortly. I still need to document its usage and best practices.
If you run "ls -tr" on home on Aurora, my username is obvious. Otherwise, I can send it to you offline.
Which version of CGNS and HDF5 are you using?
from cgns.
Thanks again for the response.
Our file is "flat" in the sense that it is a single zone, and we expect all ranks to read a range of the data of size size/nranks.
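A minimal sketch of the even split described above (our reading of the layout, not CGNS code): rank r of n reads the half-open row range [N*r/n, N*(r+1)/n), so every rank gets a nonempty range whenever N >= n, which is why we expect no rank to be left with nothing to read:

```shell
# chunk N n r : print "start end" for rank r's half-open read range
chunk() {
  echo "$(( $1 * $3 / $2 )) $(( $1 * ($3 + 1) / $2 ))"
}
chunk 10 4 0   # -> 0 2
chunk 10 4 3   # -> 7 10
```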
No requirement to double; it's just what I usually do. In any event, 1536 is not as big as we want to go, but sure, we can try 1024 or any other number between what works at 768 and what fails at 1536.
Thanks for the link and the status update, and yes, I am eager for the documentation on its usage and best practices, as I am very much a CGNS newbie (who dove into exascale usage as a first experience).
I will find your username and request that you be added to our projects shortly.
from cgns.
The request to add you has been sent, but there's no response yet, so it might be a while. In the interim, @jedbrown suggested
lfs setstripe -c 16 .
to set the directory's Lustre striping, then copying the file so that it inherits those properties, and we are testing whether that improves things. Do you have any advice on whether those are the best settings for Aurora?
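For the record, a sketch of the re-striping procedure being described (filenames are placeholders; a Lustre file keeps the stripe layout it was created with, so it must be re-created by copying, not just moved, to pick up the directory's new striping):

```shell
lfs setstripe -c 16 .            # new files in this directory get 16 stripes
cp grid.cgns grid-striped.cgns   # the copy is created under the new layout
lfs getstripe grid-striped.cgns  # verify: stripe count should now be 16
```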
from cgns.
A stripe count of 16 is a good starting point; I've seen good results on Frontier with a stripe count of 64 and a stripe size of 16 MiB.
Which version of HDF5 are you using?
from cgns.
In the spirit of push-it-until-it-breaks mode, @jedbrown suggested a stripe count of -1 (stripe across all OSTs),
and this produced a hang at 192 nodes (each with 12 processes) reading a file originally written with a stripe count of 16 but then "copied" after setting the directory to -1:
#0 0x00001502efe3cee1 in MPIR_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
No symbol table info available.
#1 0x00001502ef445c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
No symbol table info available.
#2 0x00001502f38a88f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
No symbol table info available.
#3 0x00001502e3855c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#4 0x00001502e35fce32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#5 0x00001502e35e3430 in H5F_shared_select_read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#6 0x00001502e35902bf in H5D__contig_read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#7 0x00001502e35a4c7b in H5D__read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#8 0x00001502e380f7ec in H5VL__native_dataset_read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#9 0x00001502e37fade3 in H5VL_dataset_read_direct () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#10 0x00001502e357514e in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#11 0x00001502e3574c9d in H5Dread () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#12 0x00001502e39f4cde in readwrite_data_parallel () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libcgns.so.4.3
No symbol table info available.
#13 0x00001502e39f601a in cgp_elements_read_data () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libcgns.so.4.3
No symbol table info available.
#14 0x000015030dc421e0 in DMPlexCreateCGNS_Internal () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#15 0x000015030daaacc6 in DMPlexCreateCGNS () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#16 0x000015030dc418b5 in DMPlexCreateCGNSFromFile_Internal () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#17 0x000015030daaac76 in DMPlexCreateCGNSFromFile () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#18 0x000015030daccfd0 in DMPlexCreateFromFile () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#19 0x000015030dad4fc7 in DMSetFromOptions_Plex () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#20 0x000015030d9435f9 in DMSetFromOptions () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#21 0x000000000047d99d in CreateDM ()
No symbol table info available.
#22 0x000000000040c2b1 in main ()
No symbol table info available.
[Inferior 1 (process 4399) detached]
[New LWP 4476]
[New LWP 4488]
I have 12 of these backtraces in each of the 192 per-node output files for us to digest.
Answering your question: PETSc "chooses" the version of HDF5, and it is hdf5-1.14.3-p1,
or, from the configure log:
=============================================================================================
Trying to download
https://web.cels.anl.gov/projects/petsc/download/externalpackages/hdf5-1.14.3-p1.tar.bz2
for HDF5
install: Retrieving https://web.cels.anl.gov/projects/petsc/download/externalpackages/hdf5-1.14.3-p1.tar.bz2 as tarball to /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_B/externalpackages/_d_hdf5-1.14.3-p1.tar.bz2
from cgns.
kjansen@aurora-uan-0009:~> grep "#0 " out192ReadHang..* |grep MPIR_Allreduce |wc
1496 10472 251977
so that leaves 808 processes (12*192 - 1496) doing something else.
from cgns.
For reasons we are still sorting out, we also seem to get 12 control processes; filtering those out by what they all seem to be doing at backtrace frame #0, which is wait4:
kjansen@aurora-uan-0009:~> grep "#0 " out192ReadHang..* |grep -v MPIR_Allreduce | grep -v wait4 |head
out192ReadHang..0:#0 0x00001471d09030a9 in poll () from /lib64/libc.so.6
out192ReadHang..0:#0 ofi_genlock_lock (lock=0x4d63490) at ./include/ofi_lock.h:359
out192ReadHang..0:#0 0x0000148ddc7403e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..0:#0 0x000015230a8dc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..1:#0 0x000014abd7887d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..1:#0 0x0000150ec7027af4 in cxip_ep_ctrl_eq_progress (ep_obj=0x4981620, ctrl_evtq=0x4947618, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x4ec9900, ctrl_evtq=0x4ed3698, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0 0x0000152af605d0ed in ofi_cq_readfrom (cq_fid=0x5ce1290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..1:#0 cxip_ep_ctrl_progress (ep_obj=0x5583110) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..10:#0 0x0000146f67dd2b90 in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
kjansen@aurora-uan-0009:~> grep "#0 " out192ReadHang..* |grep -v MPIR_Allreduce | grep -v wait4 |wc
809 6030 100729
Diving in on the variation of states for node 0 to see if that tells us anything (here "us" means somebody else, because if I understood it I would not be dumping all this stuff here):
kjansen@aurora-uan-0009:~> grep "#1 " out192ReadHang..0
#1 0x00001471d0a5d0dd in zmq_poll () from /usr/lib64/libzmq.so.5
#1 0x0000560229cbcce6 in ?? ()
#1 0x00005593736bcce6 in ?? ()
#1 0x000055bf552bcce6 in ?? ()
#1 0x000055807fabcce6 in ?? ()
#1 0x0000564af9abcce6 in ?? ()
#1 0x000055e712ebcce6 in ?? ()
#1 0x000055f4986bcce6 in ?? ()
#1 0x0000561f4f2bcce6 in ?? ()
#1 0x0000557751cbcce6 in ?? ()
#1 0x00005637044bcce6 in ?? ()
#1 0x00005582ecebcce6 in ?? ()
#1 0x000055a2240bcce6 in ?? ()
#1 0x00001502ef445c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1 0x000015532c9b7c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1 0x00001502c757bc7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1 0x00001554066e9c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1 0x0000152e8cd46c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1 0x000014d1302c9c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1 0x0000148e12f7ec7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1 ofi_cq_readfrom (cq_fid=0x4d63420, buf=0x7ffee0d1f2e0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:229
#1 0x0000148ddc75627b in ofi_genlock_unlock (lock=0x4764660) at ./include/ofi_lock.h:364
#1 0x00001520e3342c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1 0x00001524c1fa1c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1 0x00001523066c84d0 in cxip_cq_progress (cq=0x45b4610) at prov/cxi/src/cxip_cq.c:545
kjansen@aurora-uan-0009:~> grep "#2 " out192ReadHang..0
#2 0x0000562828c04fc4 in event_loop () at src/mpiexec/mpiexec.c:949
#2 0x0000560229cbf526 in wait_for ()
#2 0x00005593736bf526 in wait_for ()
#2 0x000055bf552bf526 in wait_for ()
#2 0x000055807fabf526 in wait_for ()
#2 0x0000564af9abf526 in wait_for ()
#2 0x000055e712ebf526 in wait_for ()
#2 0x000055f4986bf526 in wait_for ()
#2 0x0000561f4f2bf526 in wait_for ()
#2 0x0000557751cbf526 in wait_for ()
#2 0x00005637044bf526 in wait_for ()
#2 0x00005582ecebf526 in wait_for ()
#2 0x000055a2240bf526 in wait_for ()
#2 0x00001502f38a88f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 0x0000155330e1a8f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 0x00001502cb9de8f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 0x000015540ab4c8f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 0x0000152e911a98f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 0x000014d13472c8f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 0x0000148e173e18f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 0x000014da51320929 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 ofi_cq_readfrom (cq_fid=0x47645f0, buf=<optimized out>, count=<optimized out>, src_addr=0x0) at prov/util/src/util_cq.c:280
#2 0x00001520e77a58f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 0x00001524c64048f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2 0x00001523066c8c79 in cxip_util_cq_progress (util_cq=0x45b4610) at prov/cxi/src/cxip_cq.c:563
kjansen@aurora-uan-0009:~> grep "#3 " out192ReadHang..0
#3 launch_apps () at src/mpiexec/mpiexec.c:1026
#3 0x0000560229c83e23 in execute_command_internal ()
#3 0x0000559373683e23 in execute_command_internal ()
#3 0x000055bf55283e23 in execute_command_internal ()
#3 0x000055807fa83e23 in execute_command_internal ()
#3 0x0000564af9a83e23 in execute_command_internal ()
#3 0x000055e712e83e23 in execute_command_internal ()
#3 0x000055f498683e23 in execute_command_internal ()
#3 0x0000561f4f283e23 in execute_command_internal ()
#3 0x0000557751c83e23 in execute_command_internal ()
#3 0x0000563704483e23 in execute_command_internal ()
#3 0x00005582ece83e23 in execute_command_internal ()
#3 0x000055a224083e23 in execute_command_internal ()
#3 0x00001502e3855c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3 0x0000155320dc7c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3 0x00001502bb98bc61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3 0x00001553faaf9c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3 0x0000152e81156c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3 0x000014d1246d9c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3 0x0000148e0738ec61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3 0x000014da512fdeda in MPIR_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#3 0x0000148dece92559 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#3 0x00001520d7752c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3 0x00001524b63b1c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3 0x00001523066a4111 in ofi_cq_readfrom (cq_fid=0x45b4610, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
kjansen@aurora-uan-0009:~> grep "#4 " out192ReadHang..0
#4 launch_loop () at src/mpiexec/mpiexec.c:1050
#4 0x0000560229c846d1 in execute_command ()
#4 0x00005593736846d1 in execute_command ()
#4 0x000055bf552846d1 in execute_command ()
#4 0x000055807fa846d1 in execute_command ()
#4 0x0000564af9a846d1 in execute_command ()
#4 0x000055e712e846d1 in execute_command ()
#4 0x000055f4986846d1 in execute_command ()
#4 0x0000561f4f2846d1 in execute_command ()
#4 0x0000557751c846d1 in execute_command ()
#4 0x00005637044846d1 in execute_command ()
#4 0x00005582ece846d1 in execute_command ()
#4 0x000055a2240846d1 in execute_command ()
#4 0x00001502e35fce32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4 0x0000155320b6ee32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4 0x00001502bb732e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4 0x00001553fa8a0e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4 0x0000152e80efde32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4 0x000014d124480e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4 0x0000148e07135e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4 0x000014da50906c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#4 0x0000148dece823fd in MPIR_Wait_state () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#4 0x00001520d74f9e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4 0x00001524b6158e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4 0x0000152316c1d929 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
from cgns.
Slicing the other way, here are the frame-#0 locations for the first 200 or so of the 809 processes that are not at the MPIR_Allreduce,
in case that tells anyone anything (I can provide more, obviously, but I am unsure how helpful this is):
grep "#0 " out192ReadHang..* |grep -v MPIR_Allreduce | grep -v wait4 > 809DoingWhat.log
Pasted lines from 809DoingWhat.log:
out192ReadHang..0:#0 0x00001471d09030a9 in poll () from /lib64/libc.so.6
out192ReadHang..0:#0 ofi_genlock_lock (lock=0x4d63490) at ./include/ofi_lock.h:359
out192ReadHang..0:#0 0x0000148ddc7403e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..0:#0 0x000015230a8dc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..1:#0 0x000014abd7887d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..1:#0 0x0000150ec7027af4 in cxip_ep_ctrl_eq_progress (ep_obj=0x4981620, ctrl_evtq=0x4947618, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x4ec9900, ctrl_evtq=0x4ed3698, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0 0x0000152af605d0ed in ofi_cq_readfrom (cq_fid=0x5ce1290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..1:#0 cxip_ep_ctrl_progress (ep_obj=0x5583110) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..10:#0 0x0000146f67dd2b90 in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..10:#0 ofi_mutex_lock_noop (lock=0x5ec5668) at ./include/ofi_lock.h:295
out192ReadHang..10:#0 0x000014aaec80c0bf in ofi_cq_readfrom (cq_fid=0x50af420, buf=0x7ffc37947580, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..10:#0 0x0000145cac7e3566 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..10:#0 0x000014c0aebf10f1 in ofi_cq_readfrom (cq_fid=0x519d460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..10:#0 0x00001539b2d949b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..100:#0 cxip_cq_progress (cq=0x409d290) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..100:#0 cxip_util_cq_progress (util_cq=0x57e7460) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..100:#0 0x0000145664c7a919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..100:#0 0x000014d4af3acc30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..100:#0 0x000014d9a26a40ed in ofi_cq_readfrom (cq_fid=0x56df250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..100:#0 0x0000153bdb63a9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..101:#0 0x0000150e3e649d10 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..101:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x42d5660, ctrl_evtq=0x42a0d98, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..101:#0 0x0000146f7d5038b4 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..101:#0 0x0000147d3d6120f1 in ofi_cq_readfrom (cq_fid=0x47a1330, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..102:#0 0x00001525be2920bf in ofi_cq_readfrom (cq_fid=0x5088420, buf=0x7ffdd9447b20, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..102:#0 0x000014899009b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..102:#0 0x000015489fce19b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..102:#0 cxip_cq_progress (cq=0x5d019a0) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..102:#0 0x000014f4800f59b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..102:#0 0x000014ee3525c9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..103:#0 0x0000154218fdf9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..103:#0 0x0000150b35b54346 in MPIDI_OFI_gpu_progress_task () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..103:#0 0x00001478d5cc9c30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..103:#0 0x000014603229e0ed in ofi_cq_readfrom (cq_fid=0x4d25460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..103:#0 0x000014b9f43b19b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..104:#0 ofi_mutex_lock_noop (lock=0x49286a8) at ./include/ofi_lock.h:295
out192ReadHang..104:#0 0x0000149a9598a9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..104:#0 0x000014b804b85539 in cxip_cq_eq_progress (eq=0x5e90710, cq=0x5e905f0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..104:#0 ofi_cq_read (cq_fid=0x58ed630, buf=0x7fff68146540, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..104:#0 0x00001464285c1539 in cxip_cq_eq_progress (eq=0x4b92750, cq=0x4b92630) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..104:#0 0x00001528fa0759b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..105:#0 cxip_util_cq_progress (util_cq=0x5816460) at prov/cxi/src/cxip_cq.c:560
out192ReadHang..105:#0 cxip_cq_progress (cq=0x5c85630) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..105:#0 0x00001509c72714e7 in cxip_cq_progress (cq=0x4804290) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..106:#0 0x0000150f8fd1bd19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..106:#0 0x000014bf20773919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..106:#0 0x00001481f772a9a0 in __tls_get_addr_slow () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..106:#0 0x0000153ab4496915 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..106:#0 0x00001553d92119b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..107:#0 0x0000152762f5e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..107:#0 0x00001483c395d274 in ofi_genlock_unlock (lock=0x5589490) at ./include/ofi_lock.h:364
out192ReadHang..107:#0 0x00001532304864e7 in cxip_cq_progress (cq=0x5410460) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..107:#0 cxip_ep_ctrl_progress (ep_obj=0x5ae37f0) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..107:#0 0x00001482ce11228a in ofi_cq_readfrom (cq_fid=0x5948290, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..108:#0 0x00001515dad2b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..108:#0 0x00001520cb3bd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..108:#0 0x0000150ff06289b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..108:#0 0x0000151eb4ef9539 in cxip_cq_eq_progress (eq=0x4aca3b0, cq=0x4aca290) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..109:#0 0x0000146ca368f566 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..109:#0 0x0000145cd506d9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..109:#0 0x0000150f6a5bf9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..11:#0 0x0000147b94c080ed in ofi_cq_readfrom (cq_fid=0x5a25b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..110:#0 0x0000149dc25eec30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..110:#0 0x000014c9c7dc1cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..110:#0 0x00001518d72fa9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..110:#0 0x0000145be20c353d in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..111:#0 ofi_genlock_lock (lock=0x45c9660) at ./include/ofi_lock.h:359
out192ReadHang..111:#0 0x0000147f9d6259b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..111:#0 0x000014faf1c6528a in ofi_cq_readfrom (cq_fid=0x462c250, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..111:#0 0x00001466859449a0 in MPIDI_OFI_gpu_progress_task () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..112:#0 cxip_cq_progress (cq=0x5712420) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..112:#0 0x0000149e042559b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..113:#0 0x0000154229929d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..113:#0 0x0000154391bde9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..113:#0 0x0000154cd50149b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..114:#0 0x00001456782a04e7 in cxip_cq_progress (cq=0x520f420) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..114:#0 0x000014e5f6b7c0ed in ofi_cq_readfrom (cq_fid=0x49c5b00, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..114:#0 ofi_mutex_lock_noop (lock=0x507bb58) at ./include/ofi_lock.h:295
out192ReadHang..114:#0 0x000014e873c7a0ed in ofi_cq_readfrom (cq_fid=0x4e12250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..114:#0 0x00001489c764e0ed in ofi_cq_readfrom (cq_fid=0x5c06290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..114:#0 0x000014894e1dd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..115:#0 0x0000147366ba46e8 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..115:#0 0x00001523097139b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..115:#0 0x000014d5ff3cc234 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..115:#0 0x0000151568a4a236 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..116:#0 0x000014e099b210bf in ofi_cq_readfrom (cq_fid=0x47cf290, buf=0x7ffd19693650, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..116:#0 0x000014721257b4e7 in cxip_cq_progress (cq=0x5d84a60) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..116:#0 0x000014e0591139b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..117:#0 0x0000150573ad150a in cxip_cq_eq_progress (eq=0x48c3c20, cq=0x48c3b00) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..117:#0 ofi_cq_read (cq_fid=0x43d3b00, buf=0x7fffafe2ee60, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..117:#0 ofi_genlock_unlock (lock=0x483d300) at ./include/ofi_lock.h:364
out192ReadHang..117:#0 0x000014c51ee5a9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..117:#0 0x0000150e027f3d2a in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..117:#0 0x0000151f2d4ae0ed in ofi_cq_readfrom (cq_fid=0x4c98290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..118:#0 0x000014d1e28e39b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..118:#0 0x00001552246ce109 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..118:#0 ofi_mutex_lock_noop (lock=0x48e24d8) at ./include/ofi_lock.h:295
out192ReadHang..118:#0 0x000014a6c8c95d10 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..118:#0 cxip_cq_eq_progress (eq=0x4352920, cq=0x4352800) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..118:#0 0x00001468bc18dd19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..119:#0 0x000014564b3160ed in ofi_cq_readfrom (cq_fid=0x49e95f0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..119:#0 0x0000147c53dfd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..119:#0 0x000014cb837cb9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..119:#0 ofi_cq_read (cq_fid=0x519a630, buf=0x7ffe6be63ac0, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..119:#0 0x000014965f1889b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..12:#0 0x000014dd6c0f79b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..12:#0 0x000014c89cc6aca1 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..12:#0 0x00001457997e89b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..12:#0 ofi_cq_read (cq_fid=0x41e9610, buf=0x7ffcb0f42160, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..120:#0 0x0000149836cc99b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..120:#0 cxip_ep_ctrl_progress (ep_obj=0x402b660) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..120:#0 0x0000145f480fb549 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..120:#0 0x0000151a30fe4919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..121:#0 ofi_mutex_lock_noop (lock=0x40fc6a8) at ./include/ofi_lock.h:295
out192ReadHang..121:#0 ofi_genlock_lock (lock=0x5ae0660) at ./include/ofi_lock.h:359
out192ReadHang..121:#0 0x000014f9853c2539 in cxip_cq_eq_progress (eq=0x507a580, cq=0x507a460) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..121:#0 cxip_cq_eq_progress (eq=0x445a710, cq=0x445a5f0) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..122:#0 0x000014559bf929b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..122:#0 0x000014643c6bc0ed in ofi_cq_readfrom (cq_fid=0x4523460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..123:#0 0x000014e5bb4d2512 in cxip_cq_eq_progress (eq=0x4a7cbe0, cq=0x4a7cac0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..123:#0 0x000014fdf760a543 in cxi_eq_peek_event (eq=0x5430a58) at /usr/include/cxi_prov_hw.h:1537
out192ReadHang..123:#0 0x000014b1ee9a4539 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..123:#0 0x00001497fc97f284 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..124:#0 0x0000150054cd10f1 in ofi_cq_readfrom (cq_fid=0x49cc460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..124:#0 0x000014b9823040ed in ofi_cq_readfrom (cq_fid=0x5ba4630, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..124:#0 0x000014ae3df3b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..125:#0 0x000014a674ae3b12 in cxi_eq_peek_event (eq=0x5bd8c58) at /usr/include/cxi_prov_hw.h:1540
out192ReadHang..125:#0 cxip_ep_ctrl_progress (ep_obj=0x58d4e20) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..125:#0 ofi_mutex_lock_noop (lock=0x47b7b58) at ./include/ofi_lock.h:295
out192ReadHang..125:#0 0x000014c08b0609b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..125:#0 0x00001458c92e728a in ofi_cq_readfrom (cq_fid=0x3fdd460, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..125:#0 0x000014ec3e0fc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..126:#0 0x00001542132729b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..126:#0 0x000014807f36e0bf in ofi_cq_readfrom (cq_fid=0x4ab8420, buf=0x7ffe9110a7e0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..126:#0 0x000014bc0410d0ed in ofi_cq_readfrom (cq_fid=0x4c4a250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..126:#0 0x0000152dcfe4d0ed in ofi_cq_readfrom (cq_fid=0x4f89290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..126:#0 0x0000151fa6123d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..127:#0 0x000014590b29f0ed in ofi_cq_readfrom (cq_fid=0x46f1250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..127:#0 0x00001461aaf4809b in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..127:#0 0x000014d96066f234 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..127:#0 0x000014c7812ce0ed in ofi_cq_readfrom (cq_fid=0x41f6ae0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..127:#0 0x0000149b6f0399b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..128:#0 0x000014788a0910ed in ofi_cq_readfrom (cq_fid=0x5af1290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..128:#0 0x0000153a352b60ed in ofi_cq_readfrom (cq_fid=0x58d8420, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..128:#0 0x000014bb79165b12 in cxi_eq_peek_event (eq=0x4197de8) at /usr/include/cxi_prov_hw.h:1540
out192ReadHang..128:#0 cxip_util_cq_progress (util_cq=0x3e58aa0) at prov/cxi/src/cxip_cq.c:560
out192ReadHang..128:#0 0x0000152ea5d70301 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..128:#0 cxi_eq_peek_event (eq=0x43eb948) at /usr/include/cxi_prov_hw.h:1532
out192ReadHang..129:#0 0x00001493da6640bf in ofi_cq_readfrom (cq_fid=0x3e1aac0, buf=0x7fff5e0fb9c0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..129:#0 0x0000149f09d6b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..129:#0 0x00001552d6b419b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..129:#0 0x0000151613b123e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..13:#0 0x000014c05360f9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..13:#0 0x0000147d213680fd in ofi_genlock_unlock (lock=0x4210660) at ./include/ofi_lock.h:364
out192ReadHang..13:#0 0x0000152a1dd95506 in cxip_cq_eq_progress (eq=0x54f23b0, cq=0x54f2290) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..13:#0 0x00001467aed7b0ed in ofi_cq_readfrom (cq_fid=0x4cbb250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..130:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x5669660, ctrl_evtq=0x5634d78, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..130:#0 0x000014e57a5deb8e in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=false, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..130:#0 0x000015160ec5b0fd in ofi_genlock_unlock (lock=0x48882c0) at ./include/ofi_lock.h:364
out192ReadHang..130:#0 0x0000146e742e5c30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..131:#0 cxip_ep_ctrl_progress (ep_obj=0x5aa5490) at prov/cxi/src/cxip_ctrl.c:372
out192ReadHang..131:#0 cxip_util_cq_progress (util_cq=0x581c250) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..131:#0 0x000014af9e537915 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..131:#0 0x000014725d5070ed in ofi_cq_readfrom (cq_fid=0x5a793a0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..132:#0 0x00001522fe65ab8e in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=false, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..132:#0 0x0000148ea6706506 in cxip_cq_eq_progress (eq=0x43ba540, cq=0x43ba420) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..132:#0 0x0000151bbd27e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..132:#0 0x00001461fa023b12 in cxi_eq_peek_event (eq=0x460d9a8) at /usr/include/cxi_prov_hw.h:1540
out192ReadHang..133:#0 0x00001499d046309b in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..133:#0 0x00001463b556e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..133:#0 cxip_ep_ctrl_progress (ep_obj=0x5956660) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..133:#0 0x0000148aa41da6e9 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..134:#0 0x000014b4f89f4d2a in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..134:#0 0x00001553cb443d1f in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..134:#0 0x00001495091089b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..134:#0 0x000014f919324af4 in cxip_ep_ctrl_eq_progress (ep_obj=0x468c620, ctrl_evtq=0x4657bf8, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..134:#0 0x000014a449eadd2a in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..134:#0 0x000014b30d5a7274 in ofi_genlock_unlock (lock=0x559e490) at ./include/ofi_lock.h:364
out192ReadHang..135:#0 0x0000146b816d4919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..135:#0 0x0000149bc645d0bf in ofi_cq_readfrom (cq_fid=0x5bfe460, buf=0x7ffd30b561a0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..135:#0 cxi_eq_peek_event (eq=0x54f7aa8) at /usr/include/cxi_prov_hw.h:1532
out192ReadHang..135:#0 0x000014f845c67fec in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..135:#0 0x000014d66fc9e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..136:#0 0x0000150caa4159b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..136:#0 cxip_cq_progress (cq=0x5162460) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..136:#0 ofi_mutex_lock_noop (lock=0x448f308) at ./include/ofi_lock.h:295
out192ReadHang..136:#0 cxip_cq_progress (cq=0x4f18250) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..137:#0 0x00001525f13e6270 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..137:#0 0x00001485304f29b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..137:#0 0x000014f8069c3506 in cxip_cq_eq_progress (eq=0x4d6b710, cq=0x4d6b5f0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..137:#0 0x000014c55fbe09b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..138:#0 0x0000145f0b07df7c in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..138:#0 0x0000148be6cd7284 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..138:#0 0x000014820c7669b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0 ofi_mutex_lock_noop (lock=0x4f54308) at ./include/ofi_lock.h:295
out192ReadHang..139:#0 0x00001507266179b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0 ofi_mutex_lock_noop (lock=0x51954d8) at ./include/ofi_lock.h:295
out192ReadHang..139:#0 0x000014ebddbea9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0 0x000014584c42b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0 ofi_cq_read (cq_fid=0x51f7420, buf=0x7ffc6d748f40, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..14:#0 0x00001504aef730ed in ofi_cq_readfrom (cq_fid=0x521d5b0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..14:#0 0x000014abcff8b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..14:#0 0x000014a9f79359b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..140:#0 0x000014601de458a5 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..140:#0 0x000014d2a0981506 in cxip_cq_eq_progress (eq=0x402f370, cq=0x402f250) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..140:#0 0x0000148847010d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..140:#0 0x0000151cca47baf2 in cxip_ep_ctrl_eq_progress (ep_obj=0x4a36e20, ctrl_evtq=0x4a1e3b8, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..140:#0 0x00001517b777e0ed in ofi_cq_readfrom (cq_fid=0x4883290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..141:#0 cxip_util_cq_progress (util_cq=0x49f5250) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..141:#0 0x000014c7e60772c1 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..141:#0 0x00001553157600ed in ofi_cq_readfrom (cq_fid=0x52315f0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..141:#0 0x000014854741ad19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..142:#0 0x000014c9e471d563 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..142:#0 cxip_ep_ctrl_progress (ep_obj=0x5155490) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..143:#0 0x000014b92e597b90 in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..143:#0 0x000014d6cbadd0ed in ofi_cq_readfrom (cq_fid=0x59f9420, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..143:#0 0x00001495a5b2c28a in ofi_cq_readfrom (cq_fid=0x5ab1290, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..144:#0 0x000014a0ed85477b in update_get_addr () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..144:#0 ofi_genlock_lock (lock=0x555c300) at ./include/ofi_lock.h:359
out192ReadHang..144:#0 0x00001492108854e7 in cxip_cq_progress (cq=0x47f2290) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..145:#0 0x0000148d9aadb274 in ofi_genlock_unlock (lock=0x42bb2c0) at ./include/ofi_lock.h:364
out192ReadHang..145:#0 0x00001490925b9da8 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..146:#0 cxip_cq_eq_progress (eq=0x52c6710, cq=0x52c65f0) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..146:#0 0x00001537bf0cac30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..146:#0 0x0000151f8dfbe9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..147:#0 0x0000154842f493e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..147:#0 0x000014b3381cf230 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..147:#0 0x00001537706640ed in ofi_cq_readfrom (cq_fid=0x5386b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..147:#0 cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..147:#0 0x0000145fb2dbd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..147:#0 0x0000151c8ca649b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..147:#0 cxip_util_cq_progress (util_cq=0x4700290) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..147:#0 0x000014dcbd38f919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..148:#0 0x00001525e3c75919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..148:#0 0x0000154e907c2506 in cxip_cq_eq_progress (eq=0x5535710, cq=0x55355f0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..148:#0 ofi_cq_read (cq_fid=0x560cb40, buf=0x7fff4a0ce600, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..148:#0 0x0000145bc14469b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..148:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x47a9490, ctrl_evtq=0x4774338, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..148:#0 0x00001540e99bf0ed in ofi_cq_readfrom (cq_fid=0x45a7b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..149:#0 0x000015076feb7506 in cxip_cq_eq_progress (eq=0x5dc8be0, cq=0x5dc8ac0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..149:#0 0x000014fc837bf0ed in ofi_cq_readfrom (cq_fid=0x52f4460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..149:#0 0x00001521f95079b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..15:#0 0x000014bdf437b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..15:#0 0x00001522c0d26539 in cxip_cq_eq_progress (eq=0x5dd95f0, cq=0x5dd94d0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..150:#0 0x000014ffc7fe8175 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..150:#0 0x00001514ff30d9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..150:#0 0x000014f9650f2919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..151:#0 0x000014f2c85f99b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..151:#0 0x0000150652a20da8 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..151:#0 0x0000149a60fbbaf4 in cxip_ep_ctrl_eq_progress (ep_obj=0x5709de0, ctrl_evtq=0x56f1378, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..151:#0 0x000014cdafa660ed in ofi_cq_readfrom (cq_fid=0x50a3ae0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..151:#0 0x000015092ec909b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..152:#0 0x000014c7f60d8517 in cxip_cq_eq_progress (eq=0x53903b0, cq=0x5390290) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..152:#0 0x00001526ae2886e2 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..152:#0 0x0000153b51553301 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..152:#0 ofi_cq_read (cq_fid=0x46bc3a0, buf=0x7ffe05a57400, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..152:#0 0x000014906d3800ed in ofi_cq_readfrom (cq_fid=0x4ff0420, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
from cgns.
Filtering the traces down to a page of what I perceive to be the most likely suspects (the grep -v chain shows which frames I have decided to ignore):
kjansen@aurora-uan-0009:~> grep -v pthread 809DoingWhat.log |grep -v ofi_cq|grep -v MPIDI_progress_test |grep -v cxip_ep_ctrl_progress |grep -v cxi_eq_peek_event |grep -v cxip |grep -v noop |grep -v lib64 |grep -v ofi_ |grep -v MPIDU_genq_shmem_queue_dequeue |grep -v MPIDI_OFI_gpu_progress_task
out192ReadHang..110:#0 0x000014c9c7dc1cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..12:#0 0x000014c89cc6aca1 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..153:#0 0x0000148f14a3cc90 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..175:#0 0x000014fc86d2eca1 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..176:#0 0x000014d58561abb0 in __tls_get_addr@plt () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..180:#0 0x000014685f2c9cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..24:#0 0x000014dae7854b34 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..38:#0 0x000014ca202b6b49 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..44:#0 0x000014d74b4efcce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..46:#0 0x0000151c2f05dc90 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..47:#0 0x000015276af2ab34 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..50:#0 0x000014cbc3d84ad0 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..51:#0 0x000014d545d91b53 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..55:#0 0x00001501093b2bb0 in __tls_get_addr@plt () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..58:#0 0x000014f9492dabb0 in __tls_get_addr@plt () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..73:#0 0x000014fd5f5e1cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..85:#0 0x0000153b53336b30 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
from cgns.
__tls_get_addr has #2 0x0000150109daceda in MPIR_Allreduce in its backtrace, as do MPIDI_POSIX_eager_recv_begin and MPIR_Progress_hook_exec_all, so perhaps that means one of the grep -v terms is the villain.
I don't know these functions, so I'm giving up and letting more knowledgeable people sift the haystack for the needle.
from cgns.
The shmem and pthreads parts in the backtraces stick out to me right now. That seems like some weird race condition to me, but within some kind of multithreading scheme, not rank-to-rank MPI. I wonder if it's possible to turn off multithreading or the shared memory parallelism, although I'm not sure what performance hit that will cause.
from cgns.
96 nodes were able to read that same file, run 1000 steps, and write a new one (297 GB in 12.4 seconds according to PETSc VecView timers).
Currently running a second attempt at 192 nodes to read the file that 96 nodes just wrote. It looks like it succeeded this time (past all the CGNS reads, I think). Not sure if this is because CGNS really wrote the file this time (last time it was a cp of a CGNS-written file within a directory whose stripe count had been changed from 16, which the file was originally written with, to -1), or if we are just up against the reliability limits of CGNS + Lustre and hitting the "locking and contention" issues that motivated subfiling in the papers.
from cgns.
But we won't get write performance numbers from that run because....
ping failed on x4217c6s6b0n0: No reply from x4309c7s1b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov after 97s
from cgns.
The shmem and pthreads parts in the backtraces stick out to me right now. That seems like some weird race condition to me, but within some kind of multithreading scheme, not rank-to-rank MPI. I wonder if it's possible to turn off multithreading or the shared memory parallelism. Although I'm not sure what performance hit that will cause.
This raises an interesting question. My Aurora runs have been with 12 ranks per node because the flow solve is entirely on the GPU and the CPUs are "assisting" that phase, but they are of course primary for IO and problem setup.
When one of those 12 CPU processes on a given node calls CGNS, and on into HDF5, is it trying to use threads to gain more parallelism in the read/write, or is it sticking to one thread per MPI process?
A related question: is it relevant to debug this with the solver in CPU mode (not taking many steps, because it will be slower) and use the full 104 Sapphire Rapids cores? This would get us into interesting (failing???) MPI process counts with far fewer nodes, as there are 8.67 processes for every "tile" (104 vs 12), so we can get to 10k ranks with 96 nodes or 20k with 192. That said, earlier when we were debugging, I was getting hangs when I tried to use 48 processes per node on this problem. The nodes have a TB of memory, so I don't think I was exhausting that.
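A quick sanity check on that arithmetic (assuming 104 CPU cores and 12 GPU tiles per node, as stated above):

```python
cores_per_node = 104  # Sapphire Rapids cores per Aurora node
tiles_per_node = 12   # ranks per node in the current GPU runs

print(round(cores_per_node / tiles_per_node, 2))  # 8.67 processes per "tile"
print(96 * cores_per_node)    # 9984  ranks, i.e. ~10k with 96 nodes
print(192 * cores_per_node)   # 19968 ranks, i.e. ~20k with 192 nodes
```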
from cgns.
Back to the original error in this ticket, just documenting a brief code dive:
mismatch in number of children and child IDs read
comes from src/cgns_internals.c:cgi_get_nodes, which is called from cg_open to search for the CGNSLibraryVersion_t node. The mismatch it's talking about is the difference between src/cgns_io.c:cgio_number_children (which counts the number of child nodes of the root node) and src/cgns_io.c:cgio_children_ids (which gets the actual ID numbers for those children).
Both cgio_number_children and cgio_children_ids call H5Literate2 (key word there is "iterate", not "literate"), which simply loops over the child nodes of the HDF5 group given to it and runs a callback on each. I'm not sure how these could disagree with each other, frankly. But it'd be interesting to augment the error message to see what it thinks those child counts are and compare them with the actual file (which, iirc, should have only 2 or 3 root children).
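For intuition only, here is a Python stand-in (not the actual C code) for the invariant cgi_get_nodes checks: one iteration pass counts the children, a second pass collects their IDs, and the error fires when the two passes disagree, which should only happen if the file's view changes between the passes (e.g. an inconsistent read). It also sketches the augmented diagnostic suggested above, reporting both numbers instead of just "mismatch":

```python
def number_children(root_children):
    # stand-in for cgio_number_children: first iteration pass, just counts
    return sum(1 for _ in root_children)

def children_ids(root_children, expected):
    # stand-in for cgio_children_ids: second iteration pass, collects IDs
    ids = list(range(1, len(root_children) + 1))
    if len(ids) != expected:
        # augmented message: include both counts to aid debugging
        raise RuntimeError(f"mismatch in number of children and child IDs read "
                           f"(counted {expected}, read {len(ids)})")
    return ids

root = ["CGNSLibraryVersion_t", "Base"]  # a healthy root has only 2-3 children
n = number_children(root)
print(n, children_ids(root, n))  # 2 [1, 2]
```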
from cgns.
Second attempt at 192 nodes with the 96-node-written input revives our original error for this thread:
[939]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "child IDs" JZ192Nodes1215_240108.o622110 |wc
24 336 1944
so 24 ranks, i.e. likely 2 nodes at 12 ranks per node, found a bad Lustre connection?
from cgns.
When one of those 12 CPU processes on a given node calls CGNS, and on into HDF5, is it trying to use threads to gain more parallelism in the read/write, or is it sticking to one thread per MPI process?
I believe it's running multiple threads per MPI process. I can't think of another reason why pthread_spin_lock would be called instead of MPI_Wait if it were a single thread per process.
from cgns.
I am hopefully not jinxing it, but so far larger process counts are more successful with the minus-one striping choice. The 1536-node case has not run yet, but the 768-node case read and wrote correctly all four times:
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> du -sh *-4[6-9]000.cgns |head
197G Q2fromQ1_21k-46000.cgns
197G Q2fromQ1_21k-47000.cgns
197G Q2fromQ1_21k-48000.cgns
197G Q2fromQ1_21k-49000.cgns
425M stats-46000.cgns
425M stats-47000.cgns
425M stats-48000.cgns
425M stats-49000.cgns
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep VecView JZ768Nodes1215_240108.o*
JZ768Nodes1215_240108.o620924:VecView 2 1.0 1.7643e+01 1.0 1.43e+06 2.0 3.5e+05 2.1e+04 3.0e+01 1 0 0 0 0 1 0 0 0 0 720 7269759248 0 0.00e+00 0 0.00e+00 97
JZ768Nodes1215_240108.o620925:VecView 2 1.0 1.8186e+01 1.0 1.43e+06 2.1 3.5e+05 2.1e+04 3.0e+01 1 0 0 0 0 1 0 0 0 0 699 7277891282 0 0.00e+00 0 0.00e+00 97
JZ768Nodes1215_240108.o620926:VecView 2 1.0 1.3742e+01 1.0 1.43e+06 1.7 3.5e+05 2.1e+04 3.0e+01 1 0 0 0 0 1 0 0 0 0 925 7634409062 0 0.00e+00 0 0.00e+00 97
JZ768Nodes1215_240108.o620927:VecView 2 1.0 1.3611e+01 1.0 1.43e+06 2.1 3.5e+05 2.1e+04 3.0e+01 1 0 0 0 0 1 0 0 0 0 933 7258857984 0 0.00e+00 0 0.00e+00 97
We don't have timers on the reader yet, but you can see the writer is called twice, once for a big file and once for a small file, and there is some variation in the performance; 13 to 18 seconds is more than acceptable, I think, for a large plus a small file. I am not sure when the 1536-node case will be picked up, as there are only 2048 nodes as far as I know.
from cgns.
My second battery of jobs is running, and still so far so good, with no read or write failures. Since we didn't really change code and only changed the Lustre striping, I think we have to attribute this behavior to the Lustre striping, or perhaps to luck in not getting bad nodes. Will keep you posted as more data is obtained.
That said, we still don't have any data from 1536 nodes. I am wondering if there are even 1536 nodes up.
from cgns.
With help from Tim Williams, the mystery of why my 1536-node jobs are not running is resolved:
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ> /home/zippy/bin/pu_nodeStat EarlyAppAccess
PARTITION: LustreApps (EarlyAppAccess)
------------------------
Nodes Status
----- ------
1 down
17 free
1417 job-exclusive
417 offline
0 state-unknown
18 state-unknown,down
2 state-unknown,down,offline
0 state-unknown,offline
0 broken
----- --------------
1872 Total Nodes
I have queued four 1152-node jobs, but of course they will take a while to gain enough priority to run (and the machine is emptying out; I am not sure they have a drain for large jobs with priority at this point anyway).
from cgns.
WOOHOOO. We are running on 1124 nodes, 13488 tiles, and thus have finally broken the 10k-GPU barrier (previously CGNS+HDF5+Lustre were erroring out on the read of our inputs).
No code change. It is either the Lustre striping to -1 (32 stripes is the max, I think) OR they finally pulled the bad nodes that could not talk properly to the Lustre file system out of service.
I had to qalter my job's node request down to what Tim's script said was available this morning to get it to go. They have a large-job queue problem: I suspect the machine drained all night trying to assemble the mysteriously missing 200+ nodes in the job-exclusive category of that script. I have seen this many times on new machines, so I just got around it with qalter.
The job appears to have run the requested 1k steps and written to CGNS correctly as well.
from cgns.
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep VecView JZ1536Nodes1215_240108.o621462
VecView 2 1.0 2.2446e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01 2 0 0 0 0 2 0 0 0 0 574 7316723234 0 0.00e+00 0 0.00e+00 97
VecView 2 1.0 2.2786e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01 2 0 0 0 0 2 0 0 0 0 565 7383870184 0 0.00e+00 0 0.00e+00 97
VecView 2 1.0 2.4723e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01 2 0 0 0 0 2 0 0 0 0 521 7377604385 0 0.00e+00 0 0.00e+00 97
so the job ran three times with the same inputs and ran out of time on the 4th. That is 4/4 successful reads and 3/3 successful writes. Note that the log file is misnamed, since I qaltered the node count to what was available (1124).
22-24 seconds is about half the rate we were getting at lower node counts (O(12) seconds) but still not bad.
from cgns.
What stripe size are you using?
You might try setting the alignment in HDF5 to the Lustre stripe size.
http://cgns.github.io/CGNS_docs_current/midlevel/fileops.html, CG_CONFIG_HDF5_ALIGNMENT
Are you doing independent or collective IO?
What are those numbers in terms of bandwidth?
Do you have the darshan logs?
from cgns.
@jedbrown and @jrwrigh might have better answers, but I will share what I know.
Stripe size:
We have only set the stripe count. We initially did
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs setstripe -c 16 .
but then went to -1. No direct setting of stripe size, but I guess we can see what I got:
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs getstripe Q2fromQ1_21k-42000.cgns
Q2fromQ1_21k-42000.cgns
lmm_stripe_count: 16
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 0
obdidx objid objid group
0 1168972 0x11d64c 0xa80000405
3 1167709 0x11d15d 0x980000405
1 1169071 0x11d6af 0xc80000403
11 1172051 0x11e253 0x900000405
103 4342256 0x4241f0 0x440000bd1
14 1172388 0x11e3a4 0xbc0000404
7 1168387 0x11d403 0xb40000405
10 1167154 0x11cf32 0xa40000405
15 1169510 0x11d866 0xb00000405
5 1168023 0x11d297 0x940000405
2 1170305 0x11db81 0xac0000405
4 1168529 0x11d491 0xc00000404
12 1167890 0x11d212 0x9c0000405
6 1169015 0x11d677 0xcc0000405
8 1166910 0x11ce3e 0xa00000405
109 3762900 0x396ad4 0x600000bd2
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs setstripe -c -1 .
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> ls -alt |head -5
total 9900951924
-rw-r--r-- 1 kjansen PHASTA_aesp_CNDA 656391 Feb 24 19:01 JZ192Nodes1215_240108.o622004
-rw-r--r-- 1 kjansen PHASTA_aesp_CNDA 445378519 Feb 24 19:01 stats-44000.cgns
drwxr-sr-x 79 kjansen PHASTA_aesp_CNDA 73728 Feb 24 19:01 .
-rw-r--r-- 1 kjansen PHASTA_aesp_CNDA 211362088776 Feb 24 19:01 Q2fromQ1_21k-44000.cgns
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> mv Q2fromQ1_21k-44000.cgns Q2fromQ1_21k-44000.cgns_asWritten
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> cp Q2fromQ1_21k-44000.cgns_asWritten Q2fromQ1_21k-44000.cgns
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs getstripe Q2fromQ1_21k-44000.cgns
Q2fromQ1_21k-44000.cgns
lmm_stripe_count: 32
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 1
obdidx objid objid group
1 1169095 0x11d6c7 0xc80000403
110 3835167 0x3a851f 0x680000bd4
3 1167735 0x11d177 0x980000405
13 1170984 0x11de28 0xc40000404
15 1169535 0x11d87f 0xb00000405
0 1168998 0x11d666 0xa80000405
7 1168413 0x11d41d 0xb40000405
5 1168047 0x11d2af 0x940000405
11 1172072 0x11e268 0x900000405
10 1167181 0x11cf4d 0xa40000405
102 4182240 0x3fd0e0 0x4c0000bd1
9 1167850 0x11d1ea 0xb80000405
14 1172410 0x11e3ba 0xbc0000404
4 1168554 0x11d4aa 0xc00000404
2 1170329 0x11db99 0xac0000405
6 1169039 0x11d68f 0xcc0000405
8 1166937 0x11ce59 0xa00000405
104 3737545 0x3907c9 0x500000bd1
12 1167916 0x11d22c 0x9c0000405
105 3952709 0x3c5045 0x640000bd1
106 3894464 0x3b6cc0 0x6c0000bd1
112 3842721 0x3aa2a1 0x400000bd4
107 3878829 0x3b2fad 0x5c0000bd1
108 3862786 0x3af102 0x540000bd2
113 3777100 0x39a24c 0x480000bd4
100 3753931 0x3947cb 0x780000bd1
115 3744821 0x392435 0x740000bd4
114 3779125 0x39aa35 0x7c0000405
111 3672239 0x3808af 0x580000bd1
103 4342896 0x424470 0x440000bd1
109 3763538 0x396d52 0x600000bd2
101 3689664 0x384cc0 0x700000bd1
Jed wrote the writer. I think it is independent, if by that you mean each rank writes its own segment through the parallel routines. I modified the existing reader to read in parallel (assuming that is what you mean by independent?).
Two files are written. The first is 197 GB; the second is much smaller, but I suspect it adds latency that distorts the bandwidth numbers a tad. If I do the math right, 197 GB / 22 sec is about 9 GB/s.
I don't know where to find or how to access the Darshan logs, but Jed said Rob Latham would help us take a look at them. I think he is trying to schedule a meeting for that.
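That estimate can be checked against the exact byte count from the ls -alt listing above (du's "197G" is GiB; this assumes the 22-25 s VecView times at 1124 nodes are dominated by the big-file write):

```python
size_bytes = 211_362_088_776    # Q2fromQ1_21k-44000.cgns, from ls -alt above
for t in (22.4, 22.8, 24.7):    # VecView times (s) from the 1124-node run
    print(f"{size_bytes / t / 1e9:.1f} GB/s")
# 9.4, 9.3, 8.6 GB/s -- consistent with "about 9 GB/s"
```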
from cgns.
I calculated 15-20 GB/s on the smaller node counts (96 or 192), so we're seeing about half that here on 1124 nodes. The stripe size is default, which looks like 1MiB. We use collective IO. I'm working with Rob Latham to get Darshan logs (it's not "supported" on Aurora yet).
from cgns.
According to ALCF, @brtnfld now has access to these project directories. Let me know if you need any orientation beyond the directories and file names I pasted above.
from cgns.