Problem
It has been puzzling me that ~40% of GCHP simulation time is spent on MPI_Barrier, as shown by the IPM profiler (https://github.com/nerscadmin/IPM).
For example, here is the IPM profiling result of a 7-day c180 benchmark on 288 cores (version 12.3.2, run on AWS):
##IPMv2.0.6########################################################
#
# command : ./geos
# start : Thu Sep 26 03:23:50 2019 host : ip-172-31-0-86
# stop : Thu Sep 26 08:47:40 2019 wallclock : 19430.50
# mpi_tasks : 288 on 8 nodes %comm : 48.44
# mem [GB] : 905.40 gflop/sec : 0.00
#
# : [total] <avg> min max
# wallclock : 5595949.52 19430.38 19430.11 19430.50
# MPI : 2710587.40 9411.76 7384.92 11767.68
# %wall :
# MPI : 48.44 38.01 60.56
# #calls :
# MPI : 6023502175 20914938 16331252 21806384
# mem [GB] : 905.40 3.14 3.04 11.42
#
# [time] [count] <%wall>
# MPI_Barrier 2280341.77 11672064 40.75
# MPI_Bcast 242708.94 48370752 4.34
# MPI_Allreduce 89036.54 49428288 1.59
# MPI_Wait 73996.00 2953775418 1.32
# MPI_Scatterv 21071.88 185184 0.38
# MPI_Isend 2117.89 1476895338 0.04
# MPI_Gatherv 969.00 5689728 0.02
# MPI_Irecv 266.05 1476880080 0.00
# MPI_Comm_create 30.24 576 0.00
# MPI_Recv 30.11 15258 0.00
# MPI_Comm_split 17.11 8064 0.00
# MPI_Allgather 1.17 864 0.00
# MPI_Reduce 0.45 1728 0.00
# MPI_Comm_rank 0.24 503073 0.00
# MPI_Comm_size 0.02 74600 0.00
# MPI_Comm_free 0.01 296 0.00
# MPI_Comm_group 0.00 576 0.00
# MPI_Init 0.00 288 0.00
#
###################################################################
Here the total wall time is 19430.50 seconds; the MPI_Barrier time is 2280341.77 / 288 = 7918 seconds (IPM prints the time summed over all ranks), accounting for ~40% of the total time. The fraction of MPI_Barrier drops to ~30% at 576 cores and ~20% at 1152 cores, but that is still far above a normal value (being blocked 20%~40% of the time seems a bit ridiculous).
Full log:
Visualize profiling results
I wrote a Python script to parse IPM results (https://github.com/JiaweiZhuang/ipm_util) so I can easily analyze & visualize MPI time.
Averaging over all ranks, MPI_Barrier takes much longer than any other MPI call.
Breaking it down into per-rank time (similar to the "Communication balance by task" plot in IPM's default HTML report):
Same data but on individual panels:
Full notebook: https://gist.github.com/JiaweiZhuang/587a17fbb2b757182c5e49dcd3d1f8a9
The notebook reads these IPM XML log files: gchp_ipm_logs.zip
Possible explanations
I originally expected that the MPI_Barrier time comes from old MAPL's serial I/O, where the other ranks wait for the master rank to read data from disk. However, I/O can't explain the problem because:
- Time spent on EXTDATA is only 1154 seconds, much smaller than the MPI_Barrier time (7918 seconds). There must be other components with load imbalance that cause this long blocking, maybe advection or gas-phase chemistry (say, different spatial regions requiring different numbers of inner solver steps?).
- All ranks spend roughly the same time on MPI_Barrier (7900 ± 790 seconds) -- this cannot come from blocking I/O, where the master process should have near-zero barrier time.
A typical serial, blocking I/O pattern would look like this toy MPI_Barrier example, whose core body is:
call MPI_Barrier(MPI_COMM_WORLD, ierror)
if (rank .eq. 0) then
    call SLEEP(3)  ! delaying everyone else
end if
call MPI_Barrier(MPI_COMM_WORLD, ierror)
in which case rank 0 will have zero MPI_Barrier time, while the other ranks will each have 3 seconds of MPI_Barrier time. IPM shows very accurate results on this toy program.
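For completeness, a self-contained version of that toy program could look like the sketch below (my own minimal reconstruction; it assumes gfortran's non-standard SLEEP intrinsic):

program barrier_toy
  use mpi
  implicit none
  integer :: rank, ierror

  call MPI_Init(ierror)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)

  call MPI_Barrier(MPI_COMM_WORLD, ierror)
  if (rank .eq. 0) then
      call SLEEP(3)  ! rank 0 mimics serial I/O; all other ranks wait here
  end if
  call MPI_Barrier(MPI_COMM_WORLD, ierror)

  call MPI_Finalize(ierror)
end program barrier_toy

Running it under IPM and checking that rank 0 reports ~0 seconds of MPI_Barrier while every other rank reports ~3 seconds is exactly the sanity check described above.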
As a comparison, WRF doesn't show such a long MPI_Barrier time, as seen in this WRF profiling result:
More clues
An intriguing observation is that MPI_Barrier time decreases with the number of cores, while other MPI calls (especially MPI_Bcast) generally take longer with more cores (due to increased communication, obviously).
One hypothesis is that MPI_Barrier time comes from the load imbalance of photochemistry, as the KPP solver time varies a lot between day and night. Since the chemistry component scales almost perfectly with core count, its time (and hence the imbalance hiding behind the barrier) drops quickly with more cores.
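A quick way to test this hypothesis would be to time the chemistry call on each rank and look at the spread across ranks. The sketch below uses a hypothetical Do_Chemistry() placeholder (not an actual GCHP routine name) and assumes rank and the MPI setup already exist in scope:

real(8) :: t0, t_chem, t_min, t_max
integer :: ierror

t0 = MPI_Wtime()
call Do_Chemistry()  ! hypothetical placeholder for the KPP solver call
t_chem = MPI_Wtime() - t0

! The gap between the fastest and slowest rank is roughly the time
! everyone else would sit in a barrier right after chemistry.
call MPI_Reduce(t_chem, t_min, 1, MPI_DOUBLE_PRECISION, MPI_MIN, 0, MPI_COMM_WORLD, ierror)
call MPI_Reduce(t_chem, t_max, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, MPI_COMM_WORLD, ierror)
if (rank .eq. 0) write(*,*) 'chemistry time per rank, min/max [s]:', t_min, t_max

Accumulated over the whole run, a large min/max gap here would point to chemistry as the main source of the barrier time.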
Suggestions
We can locate those time-consuming MPI_Barrier calls by either
- mark code regions with MPI_Pcontrol(...), as shown in the IPM User Guide (see the sketch after this list);
- or use a heavy-weight HPC profiler like HPCToolkit or TAU to collect the full call stack.
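For the first option, a minimal sketch of region marking might look like the lines below. The two-argument form with a region label is IPM's extension of MPI_Pcontrol (check the IPM User Guide for the exact Fortran syntax), and "chemistry" / Do_Chemistry() are made-up names:

! Bracket a suspected code region so IPM reports its MPI time separately.
call MPI_Pcontrol( 1, "chemistry"//char(0))  ! enter region
call Do_Chemistry()                          ! hypothetical placeholder
call MPI_Pcontrol(-1, "chemistry"//char(0))  ! leave region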
If such load imbalance does come from the chemical solver, then people should pay extra attention to the slowest spatial regions when trying to speed up chemistry solvers. This might be a GCHP-only problem; GC-classic would be OK if OpenMP dynamic thread scheduling is used.
cc @yantosca @lizziel -- something worth investigating if you want to profile the code.