Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH)
Home Page: https://asc.llnl.gov/codes/proxy-apps/lulesh
This is the README for LULESH 2.0

More information, including LULESH 1.0, can be found at https://codesign.llnl.gov/lulesh.php

If you have any questions or problems please contact:

Ian Karlin <[email protected]> or
Rob Neely <[email protected]>

Also please send any notable results to Ian Karlin <[email protected]> as we are still evaluating the performance of this code.

A Makefile and a CMake build system are provided.

*** Building with CMake ***

Create a build directory and run cmake. Example:

  $ mkdir build; cd build; cmake -DCMAKE_BUILD_TYPE=Release -DMPI_CXX_COMPILER=`which mpicxx` ..

CMake variables:

  CMAKE_BUILD_TYPE    "Debug", "Release", or "RelWithDebInfo"
  CMAKE_CXX_COMPILER  Path to the C++ compiler
  MPI_CXX_COMPILER    Path to the MPI C++ compiler

  WITH_MPI=On|Off     Build with MPI (Default: On)
  WITH_OPENMP=On|Off  Build with OpenMP support (Default: On)
  WITH_SILO=On|Off    Build with support for SILO (Default: Off)
  SILO_DIR            Path to SILO library (only needed when WITH_SILO is "On")

*** Notable changes in LULESH 2.0 ***

Split functionality into different files:

  lulesh.cc      - where most (all?) of the timed functionality lies
  lulesh-comm.cc - MPI functionality
  lulesh-init.cc - Setup code
  lulesh-viz.cc  - Support for visualization option
  lulesh-util.cc - Non-timed functions

The concept of "regions" was added, although every region is the same ideal gas material, and the same Sedov blast wave problem is still the only problem it is hardcoded to solve. Regions allow two things important to making this proxy app more representative:

* Four of the LULESH routines are now performed on a region-by-region basis, making the memory access patterns non-unit stride
* Artificial load imbalances can be easily introduced that could impact parallelization strategies:
  * The load balance flag changes region assignment. Region number is raised to the power entered for assignment probability. The most likely region changes with MPI process id.
  * The cost flag raises the cost of ~45% of the regions to evaluate EOS by the entered multiple. The cost of 5% of the regions is 10x the entered multiple.

MPI and OpenMP were added, and coalesced into a single version of the source that can support serial builds, MPI-only, OpenMP-only, and MPI+OpenMP.

Added support to write plot files using "poor man's parallel I/O" when linked with the SILO library, which in turn can be read by VisIt.

Enabled variable timestep calculation by default (Courant condition), which results in an additional reduction. Also seeded the initial timestep based on an analytical equation to allow scaling to arbitrary size. Therefore steps to solution will differ from LULESH 1.0.

Default domain (mesh) size reduced from 45^3 to 30^3.

Command line options allow for numerous test cases without needing to recompile.

Performance optimizations and code cleanup uncovered during study of LULESH 1.0.

Added a "Figure of Merit" calculation (elements solved per microsecond) and output in support of using LULESH 2.0 for the 2017 CORAL procurement.

*** Notable changes in LULESH 2.1 ***

* Minor bug fixes.
* Code cleanup to add consistency to variable names, loop indexing, memory allocation/deallocation, etc.
* Destructor added to main class to clean up when code exits.

Possible future 2.0 minor updates (other changes possible as discovered):

* Different default parameters
* Minor code performance changes and cleanup

TODO in future versions:

* Add reader for (truly) unstructured meshes, probably serial only
What are the values I should aim for to validate my runs?
I cannot recreate any of the 'Final Origin Energy' values of 4.3 in:
https://asc.llnl.gov/sites/asc/files/2021-01/lulesh2.0_changes1.pdf
Tried running with ./lulesh2.0 -s 5 -i 72
My output:
Running problem size 5^3 per domain until completion
Num processors: 8
Total number of elements: 1000
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
Run completed:
Problem size = 5
MPI tasks = 8
Iteration count = 72
Final Origin Energy = 6.232687e+04
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 4.547474e-12
TotalAbsDiff = 1.360249e-11
MaxRelDiff = 4.855244e-15
Elapsed time = 0.01 (s)
Grind time (us/z/c) = 1.1185698 (per dom) (0.010067128 overall)
FOM = 7151.9901 (z/s)
All tests were done using the version on the main branch.
I know DOE has mastered the governing equations of the universe, but mortals like me struggle with the fact that version 2.0.3 appeared 10 months before version 2.0.2.
What is the most recent version of LULESH in the sense that a developer with a modest understanding of Newtonian physics would understand it?
I am currently working on a port of LULESH and noticed a strange usage of the combination of std::vector::clear and std::vector::resize, which is used for managing temporary buffers for strains and gradients. As an example, the comment on this line https://github.com/LLNL/LULESH/blob/master/lulesh.cc#L2018 states "Free up memory", but the following call to std::vector::clear does not actually free the memory. The standard mandates that the number of elements in the vector is set to 0 and the capacity is left unchanged [1]. The next call to std::vector::resize will find the capacity to be sufficient (as numElem does not change) and initialize all elements in the vector to 0.0 (essentially acting as memset), something that does not seem to be needed as the values are later overwritten anyway. In my benchmarks, this leads to noticeable overhead (e.g., ~15% on 8 ranks / 24 threads on our XC40, 256**3 elements per rank) that seems easily avoidable.
Hence my question: is the reset of the gradient and strain temporaries an essential part of the benchmark, or is it a valid optimization to either make these buffers static or use Allocate/Release on them instead?
[1] https://en.cppreference.com/w/cpp/container/vector/clear
We are scaling LULESH to large numbers of nodes (around 2000 nodes) with 8 MPI ranks per node and a problem size of 90. The result is that FOMs given go negative.
Elapsed time = 55.50 (s)
Grind time (us/z/c) = 0.38064213 (per dom) (-0.00015125329 overall)
FOM = -6611426.6 (z/s)
This probably shouldn't happen.
I'm not familiar with LULESH, I've just seen it used as one of many recent benchmarks for the new AMD 5000 series processors.
A 6-core / 12-thread AMD 5600X scores 993 z/sec, while the larger models (with the next being 8-core / 16-thread) fall down to 11 z/sec.
This appears to be the command the benchmarking software is using:
if [ -z ${NUM_CPU_PHYSICAL_CORES_CUBE+x} ]; then NUM_CPU_PHYSICAL_CORES_CUBE=$NUM_CPU_PHYSICAL_CORES; fi
mpirun --allow-run-as-root -np $NUM_CPU_PHYSICAL_CORES_CUBE ./lulesh2.0 -s 36 -i 1 > $LOG_FILE 2>&1
The linked page shows that many CPUs with higher core counts have a dramatic drop in performance. Is it linked to hitting a bottleneck when scaling cores? CPU cache or system memory, perhaps?
I am trying to run Lulesh on larger scales on a Cray XC40 (2x12C Haswell, one process per node, 24 OpenMP threads) using the Intel 18.0.1 compiler, but run into the following error at s=400
on >=512 processes:
mpirun -n 512 -N 1 -bind-to none /zhome/academic/HLRS/hlrs/hpcjschu/src/LULESH-2.0.3/lulesh_mpi -s 400 -i 100 -p -b 0
Num threads: 24
Total number of elements: 32768000000
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
cycle = 1, time = 3.298540e-11, dt=3.298540e-11
cycle = 2, time = 7.256788e-11, dt=3.958248e-11
cycle = 3, time = 8.613524e-11, dt=1.356736e-11
cycle = 4, time = 9.746505e-11, dt=1.132980e-11
cycle = 5, time = 1.075651e-10, dt=1.010008e-11
cycle = 6, time = 1.169435e-10, dt=9.378386e-12
cycle = 7, time = 1.258599e-10, dt=8.916427e-12
ERROR: domain.q(1) = 1026443320729.883911
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -2.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
The ERROR line was added by me to track down where the abort happens and why. I tried different combinations of Cray MPICH and Open MPI (3.1.2), and compiling with -O2 or -O3, all showing the same behavior. Interestingly, smaller runs succeed: s=400 on 256 processes works, as does s=300 on 512 processes. The error occurs both with the latest git commit and the 2.0.3 release. Any idea what might be causing this problem? I don't think that I am running into an integer overflow (see #7), but I might be wrong (8*(400**3) = 512000000 is still well within the bounds of 32-bit integers).
I will try to run with 64-bit integers nevertheless, just to make sure (it just takes some time on that machine). In general, are such large runs supported by LULESH?
Hi there,
I am having a very strange issue running the CUDA variant of LULESH (release of 2.0.2).
I'm compiling using CUDA compilation tools release 9.0, V9.0.176, and setting either the flag -arch=sm_35 or, to avoid compilation warnings, the flag -arch=sm_70.
When running the code on a Tesla V100-SXM2-32GB, the program crashes as follows:
$ ./lulesh -s 10
Host compute1-exec-206.ris.wustl.edu using GPU 0: Tesla V100-SXM2-32GB
terminate called after throwing an instance of 'thrust::system::system_error'
what(): parallel_for failed: invalid argument
[compute1-exec-206:00204] *** Process received signal ***
[compute1-exec-206:00204] Signal: Aborted (6)
[compute1-exec-206:00204] Signal code: (-6)
[compute1-exec-206:00204] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f7e38bda390]
[compute1-exec-206:00204] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f7e37d8f428]
[compute1-exec-206:00204] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f7e37d9102a]
[compute1-exec-206:00204] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x16d)[0x7f7e386d284d]
[compute1-exec-206:00204] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d6b6)[0x7f7e386d06b6]
[compute1-exec-206:00204] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d701)[0x7f7e386d0701]
[compute1-exec-206:00204] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d919)[0x7f7e386d0919]
[compute1-exec-206:00204] [ 7] ./lulesh[0x41f252]
[compute1-exec-206:00204] [ 8] ./lulesh[0x417330]
[compute1-exec-206:00204] [ 9] ./lulesh[0x41ade5]
[compute1-exec-206:00204] [10] ./lulesh[0x405cff]
[compute1-exec-206:00204] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f7e37d7a830]
[compute1-exec-206:00204] [12] ./lulesh[0x409bf9]
[compute1-exec-206:00204] *** End of error message ***
Aborted (core dumped)
Has anyone else observed or reported something similar?
What version of CUDA do you usually use to compile LULESH?
Thank you in advance,
Umberto
I am running LULESH on a single node with 160 CPUs and 4 GPUs (Tesla V100-SXM2).
I am using OpenMPI 3.0.0 with CUDA 9.1. I execute the following command:
mpirun -n 27 ./lulesh -s 60
and I get the following error:
Rank 22: Volume Error in cell 211619 at iteration 14
The error appears at a different iteration on each execution.
Any idea what is causing this error?
Is the OpenMP 4.5 implementation described in Section 4.1 of this paper available? Is the version labeled omp_4.0 in this repository the version used in the paper?