Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH)
Home Page: https://asc.llnl.gov/codes/proxy-apps/lulesh
This is the README for LULESH 2.0

More information, including LULESH 1.0, can be found at https://codesign.llnl.gov/lulesh.php

If you have any questions or problems please contact:

Ian Karlin <[email protected]> or
Rob Neely <[email protected]>

Also please send any notable results to Ian Karlin <[email protected]> as we are still evaluating the performance of this code.

A Makefile and a CMake build system are provided.

*** Building with CMake ***

Create a build directory and run cmake. Example:

  $ mkdir build; cd build; cmake -DCMAKE_BUILD_TYPE=Release -DMPI_CXX_COMPILER=`which mpicxx` ..

CMake variables:

  CMAKE_BUILD_TYPE    "Debug", "Release", or "RelWithDebInfo"
  CMAKE_CXX_COMPILER  Path to the C++ compiler
  MPI_CXX_COMPILER    Path to the MPI C++ compiler

  WITH_MPI=On|Off     Build with MPI (Default: On)
  WITH_OPENMP=On|Off  Build with OpenMP support (Default: On)
  WITH_SILO=On|Off    Build with support for SILO (Default: Off)
  SILO_DIR            Path to SILO library (only needed when WITH_SILO is "On")

*** Notable changes in LULESH 2.0 ***

Split functionality into different files:

  lulesh.cc      - where most (all?) of the timed functionality lies
  lulesh-comm.cc - MPI functionality
  lulesh-init.cc - Setup code
  lulesh-viz.cc  - Support for visualization option
  lulesh-util.cc - Non-timed functions

The concept of "regions" was added, although every region is the same ideal gas material, and the same Sedov blast wave problem is still the only problem it is hardcoded to solve. Regions allow two things important to making this proxy app more representative:

* Four of the LULESH routines are now performed on a region-by-region basis, making the memory access patterns non-unit stride
* Artificial load imbalances can be easily introduced that could impact parallelization strategies:
  * The load balance flag changes region assignment. Region number is raised to the power entered for assignment probability. The most likely region changes with MPI process id.
  * The cost flag raises the cost of ~45% of the regions to evaluate EOS by the entered multiple. The cost of 5% of the regions is 10x the entered multiple.

MPI and OpenMP were added, and coalesced into a single version of the source that can support serial builds, MPI-only, OpenMP-only, and MPI+OpenMP.

Added support to write plot files using "poor man's parallel I/O" when linked with the SILO library, which in turn can be read by VisIt.

Enabled variable timestep calculation by default (Courant condition), which results in an additional reduction. Also seeded the initial timestep based on an analytical equation to allow scaling to arbitrary size. Therefore steps to solution will differ from LULESH 1.0.

Default domain (mesh) size reduced from 45^3 to 30^3.

Command line options allow for numerous test cases without needing to recompile.

Performance optimizations and code cleanup uncovered during study of LULESH 1.0.

Added a "Figure of Merit" calculation (elements solved per microsecond) and output in support of using LULESH 2.0 for the 2017 CORAL procurement.

*** Notable changes in LULESH 2.1 ***

* Minor bug fixes.
* Code cleanup to add consistency to variable names, loop indexing, memory allocation/deallocation, etc.
* Destructor added to main class to clean up when code exits.

Possible future 2.0 minor updates (other changes possible as discovered):

* Different default parameters
* Minor code performance changes and cleanup

TODO in future versions:

* Add reader for (truly) unstructured meshes, probably serial only
What are the values I should aim for to validate my runs?
I cannot recreate any of the 'Final Origin Energy' values of 4.3 in:
https://asc.llnl.gov/sites/asc/files/2021-01/lulesh2.0_changes1.pdf
Tried running with ./lulesh2.0 -s 5 -i 72
My output:
Running problem size 5^3 per domain until completion
Num processors: 8
Total number of elements: 1000
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
Run completed:
Problem size = 5
MPI tasks = 8
Iteration count = 72
Final Origin Energy = 6.232687e+04
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 4.547474e-12
TotalAbsDiff = 1.360249e-11
MaxRelDiff = 4.855244e-15
Elapsed time = 0.01 (s)
Grind time (us/z/c) = 1.1185698 (per dom) (0.010067128 overall)
FOM = 7151.9901 (z/s)
All tests were done using the version on the main branch.
I know DOE has mastered the governing equations of the universe, but mortals like me struggle with the fact that version 2.0.3 appeared 10 months before version 2.0.2.
What is the most recent version of LULESH in the sense that a developer with a modest understanding of Newtonian physics would understand it?
I am currently working on a port of LULESH and noticed a strange usage of the combination of std::vector::clear and std::vector::resize, which is used for managing temporary buffers for strains and gradients. As an example, the comment on this line https://github.com/LLNL/LULESH/blob/master/lulesh.cc#L2018 states "Free up memory", but the following call to std::vector::clear does not actually free the memory. The standard mandates that the number of elements in the vector is set to 0 and the capacity is left unchanged [1]. The next call to std::vector::resize will find the capacity to be sufficient (as numElem does not change) and initialize all elements in the vector to 0.0 (essentially acting as memset), something that does not seem to be needed as the values are later overwritten anyway. In my benchmarks, this leads to noticeable overhead (e.g., ~15% on 8 ranks / 24 threads on our XC40, 256**3 elements per rank) that seems easily avoidable.
Hence my question: is the reset of the gradient and strain temporaries an essential part of the benchmark, or is it a valid optimization to either make these buffers static or use Allocate/Release on them instead?
[1] https://en.cppreference.com/w/cpp/container/vector/clear
We are scaling LULESH to large numbers of nodes (around 2000 nodes) with 8 MPI ranks per node and a problem size of 90. The result is that FOMs given go negative.
Elapsed time = 55.50 (s)
Grind time (us/z/c) = 0.38064213 (per dom) (-0.00015125329 overall)
FOM = -6611426.6 (z/s)
This probably shouldn't happen.
I'm not familiar with LULESH, I've just seen it used as one of many recent benchmarks for the new AMD 5000 series processors.
A 6-core / 12-thread AMD 5600X scores 993 z/sec, while the larger models (with the next being 8-core / 16-thread) fall down to 11 z/sec.
This appears to be the command the benchmarking software is using:
if [ -z ${NUM_CPU_PHYSICAL_CORES_CUBE+x} ]; then NUM_CPU_PHYSICAL_CORES_CUBE=$NUM_CPU_PHYSICAL_CORES; fi
mpirun --allow-run-as-root -np $NUM_CPU_PHYSICAL_CORES_CUBE ./lulesh2.0 -s 36 -i 1 > $LOG_FILE 2>&1
The linked page shows that many CPUs with higher core counts have a dramatic drop in performance. Is it linked to hitting a bottleneck when scaling cores? CPU cache or system memory, perhaps?
I am trying to run Lulesh on larger scales on a Cray XC40 (2x12C Haswell, one process per node, 24 OpenMP threads) using the Intel 18.0.1 compiler, but run into the following error at s=400
on >=512 processes:
mpirun -n 512 -N 1 -bind-to none /zhome/academic/HLRS/hlrs/hpcjschu/src/LULESH-2.0.3/lulesh_mpi -s 400 -i 100 -p -b 0
Num threads: 24
Total number of elements: 32768000000
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
cycle = 1, time = 3.298540e-11, dt=3.298540e-11
cycle = 2, time = 7.256788e-11, dt=3.958248e-11
cycle = 3, time = 8.613524e-11, dt=1.356736e-11
cycle = 4, time = 9.746505e-11, dt=1.132980e-11
cycle = 5, time = 1.075651e-10, dt=1.010008e-11
cycle = 6, time = 1.169435e-10, dt=9.378386e-12
cycle = 7, time = 1.258599e-10, dt=8.916427e-12
ERROR: domain.q(1) = 1026443320729.883911
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -2.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
The ERROR line was added by me to track down where the abort happens and why. I tried different combinations of Cray MPICH and Open MPI (3.1.2), and compiling with -O2 or -O3, all showing the same behavior. Interestingly, smaller runs succeed: s=400 on 256 processes works, as does s=300 on 512 processes. The error occurs both with the latest git commit and the 2.0.3 release. Any idea what might be causing this problem? I don't think that I am running into an integer overflow (see #7), but I might be wrong (8*(400**3) = 512000000 is still well within the bounds of 32-bit integers).
I will try to run with 64-bit integers nevertheless, just to make sure (it just takes some time on that machine). In general, are such large runs supported by LULESH?
Hi there,
I am having a very strange issue running the CUDA variant of LULESH (release of 2.0.2).
I'm compiling using CUDA compilation tools release 9.0, V9.0.176, and setting either the flag -arch=sm_35 or, to avoid compilation warnings, the flag -arch=sm_70.
When running the code on a Tesla V100-SXM2-32GB, the program crashes as follows:
$ ./lulesh -s 10
Host compute1-exec-206.ris.wustl.edu using GPU 0: Tesla V100-SXM2-32GB
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: invalid argument
[compute1-exec-206:00204] *** Process received signal ***
[compute1-exec-206:00204] Signal: Aborted (6)
[compute1-exec-206:00204] Signal code: (-6)
[compute1-exec-206:00204] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f7e38bda390]
[compute1-exec-206:00204] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f7e37d8f428]
[compute1-exec-206:00204] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f7e37d9102a]
[compute1-exec-206:00204] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x16d)[0x7f7e386d284d]
[compute1-exec-206:00204] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d6b6)[0x7f7e386d06b6]
[compute1-exec-206:00204] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d701)[0x7f7e386d0701]
[compute1-exec-206:00204] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d919)[0x7f7e386d0919]
[compute1-exec-206:00204] [ 7] ./lulesh[0x41f252]
[compute1-exec-206:00204] [ 8] ./lulesh[0x417330]
[compute1-exec-206:00204] [ 9] ./lulesh[0x41ade5]
[compute1-exec-206:00204] [10] ./lulesh[0x405cff]
[compute1-exec-206:00204] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f7e37d7a830]
[compute1-exec-206:00204] [12] ./lulesh[0x409bf9]
[compute1-exec-206:00204] *** End of error message ***
Aborted (core dumped)
Has anyone else observed or reported something similar?
What version of CUDA do you usually use to compile LULESH?
Thank you in advance,
Umberto
I am running LULESH on a single node with 160 CPUs and 4 GPUs (Tesla V100-SXM2).
I am using OpenMPI 3.0.0 with CUDA 9.1. I execute the following command:
mpirun -n 27 ./lulesh -s 60
and I get the following error:
Rank 22: Volume Error in cell 211619 at iteration 14
The error appears at a different iteration on each execution.
Any idea what is causing this error?
Is the OpenMP 4.5 implementation described in Section 4.1 of this paper available? Is the version labeled omp_4.0 in this repository the version used in the paper?