hpcg-benchmark / hpcg

Official HPCG benchmark source code

Home Page: http://www.hpcg-benchmark.org/

License: BSD 3-Clause "New" or "Revised" License

Makefile 4.12% C++ 69.33% PHP 8.05% HTML 3.14% CSS 5.93% Python 2.82% MATLAB 0.32% Shell 0.35% CMake 0.89% M4 5.05%

hpcg's Introduction

########################################################

High Performance Conjugate Gradient Benchmark (HPCG)

########################################################

Jack Dongarra and Michael Heroux and Piotr Luszczek

Revision: 3.1

Date: March 28, 2019

Introduction

HPCG is a software package that performs a fixed number of multigrid preconditioned (using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double precision (64 bit) floating point values.

The HPCG rating is a weighted GFLOP/s (billion floating-point operations per second) value: the number of operations performed in the PCG iteration phase divided by the time taken. The overhead time of problem construction and of any modifications made to improve performance is divided by 500 iterations (the amortization weight) and added to the runtime.
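As a rough illustration of the arithmetic described above, the sketch below computes a GFLOP/s figure from an operation count, the time spent in the PCG iterations, and the overhead time amortized over 500 iterations. The function and parameter names are hypothetical, and the reference code's reporting may include further terms; this only mirrors the description in this README.

```cpp
#include <cassert>

// Hypothetical helper, not part of the HPCG source: it mirrors the README's
// description of the rating and nothing more.
//   flops        - floating-point operations counted in the PCG iteration phase
//   iterSeconds  - wall-clock time spent in the PCG iteration phase
//   setupSeconds - overhead of problem construction and optimization
double hpcgRatingGflops(double flops, double iterSeconds, double setupSeconds) {
  const double amortizationWeight = 500.0;  // "divided by 500 iterations"
  assert(iterSeconds > 0.0 && setupSeconds >= 0.0);
  const double effectiveSeconds = iterSeconds + setupSeconds / amortizationWeight;
  return flops / effectiveSeconds / 1.0e9;  // convert FLOP/s to GFLOP/s
}
```

With these assumptions, 2.0e12 operations in 1 second of iteration time plus 500 seconds of overhead gives an effective time of 2 seconds, so the overhead halves the rating compared with ignoring it.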

Integer arrays have global and local scope (global indices are unique across the entire distributed memory system, local indices are unique within a memory image). Integer data for global/local indices has three modes:

  • 32/32 - global and local integers are 32-bit
  • 64/32 - global integers are 64-bit, local are 32-bit
  • 64/64 - global and local are 64-bit.

These modes make it possible to address large problems whose index range exceeds 2^31 (roughly 2.1 billion), and to save storage when the index range stays below 2^31.
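The choice between the three modes can be sketched as a simple range check. The helper below is hypothetical (it is not taken from the HPCG source) and considers only row counts; it returns the narrowest mode whose signed 32-bit or 64-bit integers can hold the given sizes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, not from the HPCG source: picks the narrowest of the
// three index modes that can represent the given row counts.
// A signed 32-bit integer holds values up to 2^31 - 1.
std::string chooseIndexMode(std::int64_t globalRows, std::int64_t localRows) {
  assert(localRows >= 0 && globalRows >= localRows);
  const std::int64_t max32 = INT32_MAX;  // 2^31 - 1, roughly 2.1 billion
  if (globalRows <= max32) return "32/32";  // everything fits in 32 bits
  if (localRows <= max32) return "64/32";   // only global indices need 64 bits
  return "64/64";                           // local indices overflow 32 bits too
}
```

For example, three billion global rows spread over processes that each hold one million rows would need the 64/32 mode under this check.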

The HPCG software package requires an implementation of the Message Passing Interface (MPI) on your system if you enable the MPI build of HPCG, and a compiler that supports OpenMP syntax. An implementation compliant with MPI version 1.1 is sufficient.

Installation

See the file INSTALL in this directory.

Valid Runs

HPCG can be run in just a few minutes from start to finish. However, official runs must be at least 1800 seconds (30 minutes) as reported in the output file. The Quick Path option is an exception for machines that are in production mode prior to broad availability of an optimized version of HPCG 3.0 for a given platform. In this situation (which should be confirmed by sending a note to the HPCG Benchmark owners) the Quick Path option can be invoked by setting the run time parameter equal to 0 (zero).

A valid run must also execute a problem size that is large enough so that data arrays accessed in the CG iteration loop do not fit in the cache of the device in a way that would be unrealistic in a real application setting. Presently this restriction means that the problem size should be large enough to occupy a significant fraction of main memory, at least 1/4 of the total.

Future memory system architectures may require restatement of the specific memory size requirements. But the guiding principle will always be that the problem size should reflect what would be reasonable for a real sparse iterative solver.
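The two numeric rules in this section (at least 1800 seconds of reported run time unless the Quick Path exception applies, and a problem occupying at least 1/4 of main memory) can be sketched as a pre-submission check. The function and its parameters are hypothetical, and it assumes the memory rule still applies to Quick Path runs, which this README does not state either way.

```cpp
#include <cassert>

// Hypothetical validity check based only on the rules stated in this README.
//   reportedSeconds   - run time as reported in the output file
//   quickPathApproved - true if the Quick Path exception has been confirmed
//   problemBytes      - memory occupied by the problem data
//   totalMemoryBytes  - main memory available to the run
bool looksLikeValidRun(double reportedSeconds, bool quickPathApproved,
                       double problemBytes, double totalMemoryBytes) {
  assert(totalMemoryBytes > 0.0);
  const double minSeconds = 1800.0;       // 30 minutes
  const double minMemoryFraction = 0.25;  // at least 1/4 of main memory
  const bool timeOk = quickPathApproved || reportedSeconds >= minSeconds;
  // Assumption: the size rule is checked regardless of Quick Path.
  const bool sizeOk = problemBytes / totalMemoryBytes >= minMemoryFraction;
  return timeOk && sizeOk;
}
```

A check like this cannot replace the guiding principle above; it only catches runs that are obviously too short or too small.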

Documentation

The source code documentation can be generated with Doxygen (version 1.8 or newer). In this directory type:

doxygen tools/hpcg.dox

Doxygen will then generate various output formats in the out directory.

Tuning

See the file TUNING in this directory.

Bugs

Known problems and bugs with this release are documented in the file BUGS.

Further information

Check out the website http://www.hpcg-benchmark.org/ for the latest information and performance results.

hpcg's Issues

Consider a replacement for YAML output

Capturing a comment from Piotr Luszczek:

PS. My real headache comes from YAML. All versions so far generated
YAML with first line not being YAML. On top of that, YAML is nice
to look at with a text editor that submitters edit it by hand and
add new sections. All of it is not valid YAML and things break.
I fixed it in the reference code so now YAML parsers work with HPCG
output. But we have old versions generating problematic first line.
We have Intel/NVIDIA/IBM codes that generating additional things
that are not up to YAML standard.

I was thinking of switching to something less easy to "break" by
manual editing. JSON could get complex but it's popular. For HPCC
we used key-value pairs and we kept adding new pairs as needed.

Currently, I have to be permissive when taking in the YAML-looking
input. And it might be OK.

In the end, it's a minor issue.
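To make the key-value idea from the comment concrete, here is a minimal sketch of a flat "key=value" report writer and a permissive reader. The format and both function names are assumptions for illustration; this is not the format that HPCG or HPCC emit.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical flat "key=value" report format, one pair per line.
std::string writeKeyValues(const std::map<std::string, std::string>& pairs) {
  std::ostringstream out;
  for (const auto& kv : pairs) {
    // Keys must not contain the separator, or the line cannot be parsed back.
    assert(kv.first.find('=') == std::string::npos);
    out << kv.first << '=' << kv.second << '\n';
  }
  return out.str();
}

// Reads the same format. Lines without a key or without '=' are skipped, so a
// stray hand-edited line does not abort the whole parse.
std::map<std::string, std::string> readKeyValues(const std::string& text) {
  std::map<std::string, std::string> pairs;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    const std::string::size_type eq = line.find('=');
    if (eq == std::string::npos || eq == 0) continue;  // tolerate junk lines
    pairs[line.substr(0, eq)] = line.substr(eq + 1);
  }
  return pairs;
}
```

New pairs can be added without changing the reader, which matches the "kept adding new pairs as needed" experience described for HPCC. The cost is that nesting has to be encoded in the key names.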

Change the array allocation strategy in GenerateProblem.cpp

To improve baseline HPCG performance, consider the following:

Replace this code:

// Now allocate the arrays pointed to
for (local_int_t i=0; i< localNumberOfRows; ++i) {
  mtxIndL[i] = new local_int_t[numberOfNonzerosPerRow];
  matrixValues[i] = new double[numberOfNonzerosPerRow];
  mtxIndG[i] = new global_int_t[numberOfNonzerosPerRow];
}

With this code:

// Now allocate the arrays pointed to
mtxIndL[0] = new local_int_t[localNumberOfRows * numberOfNonzerosPerRow];
matrixValues[0] = new double[localNumberOfRows * numberOfNonzerosPerRow];
mtxIndG[0] = new global_int_t[localNumberOfRows * numberOfNonzerosPerRow];
for (local_int_t i=1; i< localNumberOfRows; ++i) {
  mtxIndL[i] = mtxIndL[0] + i * numberOfNonzerosPerRow;
  matrixValues[i] = matrixValues[0] + i * numberOfNonzerosPerRow;
  mtxIndG[i] = mtxIndG[0] + i * numberOfNonzerosPerRow;
}
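One consequence of the proposed change is that only the first row pointer is returned by new[], so any cleanup code that frees each row individually would need to change as well. The self-contained sketch below shows the contiguous pattern for a single array of doubles together with a matching deallocation; the function names and the local_int_t stand-in are assumptions for illustration, not code from GenerateProblem.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for HPCG's local_int_t; the real typedef lives in the HPCG headers.
typedef int local_int_t;

// Allocates one contiguous block and points each row into it.
// rows[0] owns the storage; rows[i] for i > 0 are offsets into that block.
double** allocateRowsContiguous(local_int_t numRows, local_int_t nnzPerRow) {
  assert(numRows > 0 && nnzPerRow > 0);
  double** rows = new double*[numRows];
  // Widen before multiplying so the element count cannot overflow an int.
  rows[0] = new double[static_cast<std::size_t>(numRows) * nnzPerRow];
  for (local_int_t i = 1; i < numRows; ++i)
    rows[i] = rows[0] + static_cast<std::size_t>(i) * nnzPerRow;
  return rows;
}

// Matching cleanup: only rows[0] came from new[], so only it is freed.
// Calling delete[] on every rows[i], as per-row allocation requires, would be
// undefined behavior with this layout.
void freeRowsContiguous(double** rows) {
  delete[] rows[0];
  delete[] rows;
}
```

With this layout, consecutive rows are adjacent in memory, which is the property the proposal relies on for better baseline performance.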

Add discussion of why HPCG is not "just" Streams to the main HPCG page

Some people dismiss HPCG as redundant because it is "just like Streams." While it is true that HPCG performance is highly dependent on memory bandwidth, realizing a significant fraction of that performance depends on several other factors: the performance of global collective operations, the ability to re-use data in the SpMV and SymGS kernels, and the fine-grain threading performance of SymGS. These factors can vary from system to system and directly impact performance.

std::map type unmatched in SetupHalo_ref.cpp

In SetupHalo_ref.cpp, the definition of externalToLocalMap is std::map<local_int_t, local_int_t>. However, lines 147 and 152 use a global_int_t as the key of the map. externalToLocalMap should be std::map<global_int_t, local_int_t>.
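The practical risk appears in the 64/32 mode, where a 64-bit global index is narrowed to a 32-bit key and two different global indices can land on the same map entry. The sketch below demonstrates this with stand-in typedefs and a hypothetical helper; it is not code from SetupHalo_ref.cpp. Before C++20 the result of the out-of-range narrowing is implementation-defined, although mainstream compilers wrap modulo 2^32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Stand-ins for the 64/32 index mode; the real typedefs live in the HPCG headers.
typedef std::int32_t local_int_t;
typedef std::int64_t global_int_t;

// Hypothetical helper: inserts the given global ids into a map keyed by Key
// and reports how many distinct entries survive. With a 32-bit Key, 64-bit
// ids are narrowed on insertion and may collide.
template <typename Key>
std::size_t distinctKeys(const global_int_t* ids, std::size_t count) {
  std::map<Key, local_int_t> externalToLocal;
  for (std::size_t i = 0; i < count; ++i)
    externalToLocal[static_cast<Key>(ids[i])] = static_cast<local_int_t>(i);
  return externalToLocal.size();
}
```

Given the ids 5 and 2^32 + 5, distinctKeys<global_int_t> keeps two entries, while distinctKeys<local_int_t> typically keeps only one because both ids narrow to 5.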

New output feature leaks memory

The approach to generating output has a memory leak. Please fix ASAP. Valgrind reports:

==49256== 168 bytes in 1 blocks are definitely lost in loss record 282 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100007389: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:212)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 216 (168 direct, 48 indirect) bytes in 1 blocks are definitely lost in loss record 291 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100007ADF: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:233)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 216 (168 direct, 48 indirect) bytes in 1 blocks are definitely lost in loss record 292 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x1000082D7: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:255)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 216 (168 direct, 48 indirect) bytes in 1 blocks are definitely lost in loss record 293 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x1000087EC: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:270)
==49256== by 0x7364697246: ???
==49256== by 0xB0482B09F: ???
==49256== by 0x2079726F6D654D2B: ???
==49256== by 0x6F666E4920657354: ???
==49256== by 0x6E6F6974616D71: ???
==49256== by 0x654C206469724713: ???
==49256== by 0x1006C6575: _objc_opt_data (in /usr/lib/libobjc.A.dylib)
==49256== by 0xA: ???
==49256==
==49256== 216 (168 direct, 48 indirect) bytes in 1 blocks are definitely lost in loss record 294 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x1000090C1: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:292)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 216 (168 direct, 48 indirect) bytes in 1 blocks are definitely lost in loss record 295 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x1000094D7: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:303)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 232 (168 direct, 64 indirect) bytes in 1 blocks are definitely lost in loss record 296 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x1000097EE: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:312)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 360 (168 direct, 192 indirect) bytes in 1 blocks are definitely lost in loss record 298 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100007B45: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:235)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 552 (168 direct, 384 indirect) bytes in 1 blocks are definitely lost in loss record 304 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x1000073EF: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:214)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 648 (168 direct, 480 indirect) bytes in 1 blocks are definitely lost in loss record 308 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100007C2C: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:238)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 712 (168 direct, 544 indirect) bytes in 1 blocks are definitely lost in loss record 309 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x10000A87A: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:349)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 744 (168 direct, 576 indirect) bytes in 1 blocks are definitely lost in loss record 311 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100007731: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:223)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 792 (168 direct, 624 indirect) bytes in 1 blocks are definitely lost in loss record 312 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x10000754E: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:218)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 792 (168 direct, 624 indirect) bytes in 1 blocks are definitely lost in loss record 313 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100007908: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:228)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 824 (168 direct, 656 indirect) bytes in 1 blocks are definitely lost in loss record 314 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100008E11: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:284)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 840 (168 direct, 672 indirect) bytes in 1 blocks are definitely lost in loss record 315 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x10000953D: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:304)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 1,000 (168 direct, 832 indirect) bytes in 1 blocks are definitely lost in loss record 317 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100009FD3: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:330)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 1,320 (168 direct, 1,152 indirect) bytes in 1 blocks are definitely lost in loss record 324 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100009854: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:314)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 1,368 (168 direct, 1,200 indirect) bytes in 1 blocks are definitely lost in loss record 326 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100009127: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:293)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 1,416 (168 direct, 1,248 indirect) bytes in 1 blocks are definitely lost in loss record 328 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100009BBF: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:322)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 1,624 (168 direct, 1,456 indirect) bytes in 1 blocks are definitely lost in loss record 329 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x10000A3D5: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:337)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 1,752 (168 direct, 1,584 indirect) bytes in 1 blocks are definitely lost in loss record 330 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100008852: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:271)
==49256== by 0x7364697246: ???
==49256== by 0xB0482B09F: ???
==49256== by 0x2079726F6D654D2B: ???
==49256== by 0x6F666E4920657354: ???
==49256== by 0x6E6F6974616D71: ???
==49256== by 0x654C206469724713: ???
==49256== by 0x1006C6575: _objc_opt_data (in /usr/lib/libobjc.A.dylib)
==49256== by 0xA: ???
==49256==
==49256== 2,488 (168 direct, 2,320 indirect) bytes in 1 blocks are definitely lost in loss record 335 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x10000833D: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:257)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 2,712 (168 direct, 2,544 indirect) bytes in 1 blocks are definitely lost in loss record 336 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x10000A9DD: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:362)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== 3,912 (168 direct, 3,744 indirect) bytes in 1 blocks are definitely lost in loss record 340 of 343
==49256== at 0x100030EA1: malloc (vg_replace_malloc.c:303)
==49256== by 0x10007843D: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==49256== by 0x100014B0A: OutputFile::add(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (OutputFile.cpp:126)
==49256== by 0x100007D8A: ReportResults(SparseMatrix_STRUCT const&, int, int, int, int, double*, TestCGData_STRUCT const&, TestSymmetryData_STRUCT const&, TestNormsData_STRUCT const&, int, bool) (ReportResults.cpp:242)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x7FFF5FC41E3F: ???
==49256== by 0x7FFF5FC38597: ??? (in /usr/lib/dyld)
==49256== by 0x104829CCF: ???
==49256== by 0x7FFF5FC17DA1: ImageLoaderMachOCompressed::findExportedSymbol(char const*, ImageLoader const**) const (in /usr/lib/dyld)
==49256== by 0x7FFF5FC37E4F: ??? (in /usr/lib/dyld)
==49256== by 0x15: ???
==49256==
==49256== LEAK SUMMARY:
==49256== definitely lost: 4,200 bytes in 25 blocks
==49256== indirectly lost: 21,136 bytes in 248 blocks
==49256== possibly lost: 0 bytes in 0 blocks
==49256== still reachable: 16,190 bytes in 30 blocks
==49256== suppressed: 34,813 bytes in 424 blocks
==49256== Reachable blocks (those to which a pointer was found) are not shown.
==49256== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==49256==
==49256== For counts of detected and suppressed errors, rerun with: -v
==49256== ERROR SUMMARY: 25 errors from 25 contexts (suppressed: 22 from 22)

Make output filename correspond to runtime parameters of the run for easier analysis.

From user:

I notice that the naming convention of the output files (*.txt and *.yaml) has changed in v3.0. Benchmarkers like me love the old convention, which stated the number of processes (p) and number of threads (t) in the file name. This is useful when we experiment with various combinations of p and t. If possible, could you bring back the old naming convention in v3.1?

Add explicit check for proper local grid dimensions

Presently HPCG has asserts that check to make sure each fine grid can be coarsened by a power of two. There are comments in the source that explain why the assert fails. But this is not as user friendly as having a simple check at the beginning of the run. We could even compute the nearest larger and smaller grid dimensions that would work.
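A minimal sketch of such an up-front check, under stated assumptions: the helper name `CheckLocalGridDim` and the `GridDimCheck` struct are hypothetical and not part of HPCG, and the number of multigrid levels is passed in by the caller (the example below assumes four levels, so each dimension must be a multiple of 8).

```cpp
#include <cassert>

// Hypothetical helper (not part of HPCG): result of validating one
// local grid dimension before the run starts.
struct GridDimCheck {
  bool valid;  // true if n can be halved on every coarsening step
  int lower;   // largest usable dimension <= n (0 if n is below the factor)
  int upper;   // smallest usable dimension >= n
};

// Assumes numberOfMgLevels grids in total, so the fine grid is halved
// (numberOfMgLevels - 1) times and n must be a multiple of 2^(levels - 1).
inline GridDimCheck CheckLocalGridDim(int n, int numberOfMgLevels) {
  assert(n > 0 && numberOfMgLevels >= 1);
  const int factor = 1 << (numberOfMgLevels - 1);
  GridDimCheck r;
  r.valid = (n % factor == 0);
  r.lower = (n / factor) * factor;
  r.upper = r.valid ? n : r.lower + factor;
  return r;
}
```

With four levels, a dimension of 430 would be rejected with 424 and 432 suggested as alternatives, which could be printed before any problem setup work is done.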

Change the way that the residual variance is computed to avoid false variation values

Comment from vendor:

In a run valid for submission, the code needs to run for at least one hour and
will complete several hundred identical iterations. The output contains the mean
and variance of the residuals to understand if the non-associativity of floating
point addition played a role in the final results. Versions up to 2.1 will likely print
a non-zero variance even if the output is always the same once the number of
CG sets grows. For example, on a small run with 4 nodes running for 129 CG
iterations, each one with the same residual value of 4.250796E-08, the original
code will report a scaled residual variance of 1.931164e-44.
Residual = 4.25079640861079939557e-08
Residual = 4.25079640861079939557e-08
.....
Residual = 4.25079640861079939557e-08
Reproducibility Information:
Result: PASSED
Scaled residual mean: 4.250796e-08
Scaled residual variance: 1.931164e-44
While the result is still valid, since the threshold on the scaled residual
variance is 1E-6, it would be nice to have the variance equal to zero when the
residuals are all the same, in order to detect possible ECC errors or network
corruptions. The original code computes this using a standard summation,
and the non-zero value is an artifact of the floating point division. Using
extended precision will solve this problem, but there is a better way of computing
this since the results are expected to be equal or very close. Instead of summing
the values, we can sum the differences among all the residuals with respect to
the initial iteration. This approach will compute a correct zero variance for the
previous output and it is now used in the latest versions of the HPCG code.
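The difference-based idea described in the comment can be sketched as follows. This is an illustration under assumptions, not the code used in HPCG: the names `ResidualStats` and `ShiftedMeanVariance` are hypothetical, and the population variance (dividing by n) is an arbitrary choice made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
struct ResidualStats {
  double mean;
  double variance;
};

// Computes mean and (population) variance from the differences between each
// residual and the first one. Identical residuals give differences of exactly
// zero, so the reported variance is exactly zero.
inline ResidualStats ShiftedMeanVariance(const std::vector<double>& r) {
  assert(!r.empty());
  const double ref = r[0];
  const double n = static_cast<double>(r.size());

  double sumDiff = 0.0;
  for (std::size_t i = 0; i < r.size(); ++i) sumDiff += r[i] - ref;
  const double meanDiff = sumDiff / n;

  double sumSq = 0.0;
  for (std::size_t i = 0; i < r.size(); ++i) {
    const double d = (r[i] - ref) - meanDiff;
    sumSq += d * d;
  }

  ResidualStats s;
  s.mean = ref + meanDiff;
  s.variance = sumSq / n;
  return s;
}
```

For 129 copies of the same residual, every difference is 0.0, so both the shifted mean and the variance are computed without any rounding.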

Another approach:

The routine reporting the reproducibility information will print a non zero variance even if the output is always the same once the number of CG sets grows.
This is an artifact of the floating point division.

Scaled residual mean: 0.00470554
Scaled residual variance: 3.68635e-35

While the number is small, I would prefer to see a zero when the results are all identical.
Using a double-double will fix the issue (but will make the code depend on the availability of long double in the compiler).
The other option is to have a first loop that checks whether all the results are the same, and then compute the mean and variance only if that check fails.
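The second option could look roughly like this. It is a sketch only: `AllResidualsIdentical` and `VarianceWithIdentityCheck` are hypothetical names, the comparison is an exact `==` on doubles (so a NaN residual never counts as identical), and the fallback uses a plain two-pass population variance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when every residual compares equal to the first one.
inline bool AllResidualsIdentical(const std::vector<double>& r) {
  assert(!r.empty());
  for (std::size_t i = 1; i < r.size(); ++i)
    if (r[i] != r[0]) return false;
  return true;
}

// Reports exactly zero when all residuals are identical; otherwise falls
// back to a two-pass population variance.
inline double VarianceWithIdentityCheck(const std::vector<double>& r) {
  if (AllResidualsIdentical(r)) return 0.0;
  const double n = static_cast<double>(r.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < r.size(); ++i) sum += r[i];
  const double mean = sum / n;
  double sumSq = 0.0;
  for (std::size_t i = 0; i < r.size(); ++i) sumSq += (r[i] - mean) * (r[i] - mean);
  return sumSq / n;
}
```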

Minimize official execution time of benchmark for established production systems

HPCG can be run to completion in a few minutes, if we keep the number of sets low. We should have a rapid execution mode that permits a valid result to be produced in a short amount of time.

Originally HPCG was intended to be run for a long time. This is not feasible for installed production systems, where there is little time when the whole machine is available. We should permit short execution times for established production systems.

There exists a code bug in graph multicoloring in OptimizeProblem.cpp

local_int_t old, old0;
  for (int i=1; i < totalColors; ++i) {
    old0 = counters[i];
    counters[i] = counters[i-1] + old;
    old = old0;
  }
  counters[0] = 0;

With the GCC compiler on Linux, old is not initialized when i = 1, yet it is used in counters[i] = counters[i-1] + old, which leaves garbage in counters[1]. What is more, what is the snippet supposed to do?
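The loop looks like an attempt at an in-place exclusive prefix scan, turning per-color counts into starting offsets. Assuming that is the intent, one well-defined way to write it is sketched below; the function name `ExclusiveScanInPlace` is hypothetical and `int` stands in for `local_int_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place exclusive prefix scan. On entry counters[i] holds the number of
// rows with color i; on exit counters[i] holds the offset of the first row
// of color i, i.e. the sum of the counts of all lower-numbered colors.
inline void ExclusiveScanInPlace(std::vector<int>& counters) {
  int running = 0;  // initialized, unlike 'old' in the snippet above
  for (std::size_t i = 0; i < counters.size(); ++i) {
    const int count = counters[i];
    assert(count >= 0);
    counters[i] = running;
    running += count;
  }
}
```

With counts {3, 2, 4} this produces offsets {0, 3, 5}, and no variable is read before it is written.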

Volta-enabled HPCG compilation

Hello,
I downloaded the Volta-enabled tar/binary file (hpcg-3.1_cuda9_ompi1.10.2_gcc485_sm_35_sm_50_sm_60_sm_70_ver_10_8_17.tgz) from the official website. The thing is that some elements (compiler, CUDA & ompi) are a bit outdated. Would it be possible to add a Github branch with the CUDA version so that users can build the code based on their system characteristics? Thanks.

Test validity of permutation used for SymGS

Presently we trust that the optimized implementation of SymGS accurately keeps all entries of the sparse triangular system when performing the permuted operation. Lost entries, which could increase parallelism and affect convergence, but have a net positive impact on performance, are not specifically tested for. By requesting the permutation vector used for the optimized SymGS we can test for dropped entries using the reference kernels with the permuted vectors.

HPCG 3.0- Assertion nxf%2==0 failed.

Dear All,
I have compiled the HPCG benchmark with Intel Parallel Studio XE 2016 for a Xeon(R) CPU E5-2680 (RHEL 6.6). I have 64 GB of memory installed on my machine. My hpcg.dat file is:

HPCG Linpack benchmark input file
Sandia National Laboratories, University of Tennessee
430 430 430
1200

With this configuration, the xhpcg binary occupies 83% of memory, but after ~13 minutes the run crashes with the error:
xhpcg: src/GenerateCoarseProblem.cpp:50: void GenerateCoarseProblem(const SparseMatrix_STRUCT &): Assertion `nxf%2==0' failed.

Could you please let me know where I am going wrong with the configuration?

Changes to improve compilation on MS Windows

These changes will help compilation on Windows:

In TestSymmetry.cpp put mpi.h as the first include

In init.cpp:

#ifdef _WIN32
char* NULLDEVICE="nul";
#else
char* NULLDEVICE="/dev/null";
#endif

and in line 145:

HPCG_fout.open(NULLDEVICE);  

If the --rt parameter is read from file, it does not get used

In init.cpp line 99 ReadHpcgDat(iparams, rt, iparams+7); reads the --rt parameter into the rt variable, which does not get used anywhere else. This means that specifying it in the hpcg.dat file has no effect.

Furthermore, ReadHpcgDat only reads this parameter if rt != 0, but rt is 0 if iparams[3]==0, which is the default value. This means that if the --rt parameter is not set via the command line arguments, it is not read from hpcg.dat either.

Bugs in SparseMatrix C++11 checks for map selection

SparseMatrix.hpp contains two bugs in the following C++ version check as of commit 299047b:

#if __cplusplus <= 201103L
// for C++03
#include <map>
typedef std::map< global_int_t, local_int_t > GlobalToLocalMap;
#else
// for C++11 or greater
#include <unordered_map>
using GlobalToLocalMap = std::unorderd_map< global_int_t, local_int_t >;
#endif

The check of __cplusplus performs a less than or equal check which should be a less than check. As it is written passing -std=c++11 to g++ causes the "// for C++03" code to be used.

In the "// for C++11 or greater" code, std::unordered_map is incorrectly spelled as std::unorderd_map. It is missing the e between the r and d.
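With both fixes applied, the check could read as in the sketch below. The two integer typedefs and the `LookupLocalRow` helper are stand-ins added only to make the fragment compile on its own; they are assumptions, not HPCG definitions. The value 201103L is what a conforming C++11 compiler reports in `__cplusplus`.

```cpp
#include <cassert>

// Assumption: stand-ins for the HPCG integer types, for illustration only.
typedef long long global_int_t;
typedef int local_int_t;

#if __cplusplus < 201103L
// for C++03
#include <map>
typedef std::map< global_int_t, local_int_t > GlobalToLocalMap;
#else
// for C++11 or greater
#include <unordered_map>
using GlobalToLocalMap = std::unordered_map< global_int_t, local_int_t >;
#endif

// Hypothetical helper: const-safe lookup that works with either map type.
inline local_int_t LookupLocalRow(const GlobalToLocalMap& m, global_int_t g) {
  GlobalToLocalMap::const_iterator it = m.find(g);
  assert(it != m.end());
  return it->second;
}
```

Compiled with -std=c++11 this selects the unordered_map branch, and with -std=c++03 it falls back to std::map.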

Use of [] operator in CheckProblem with HPCG_DEBUG_DETAILED

CheckProblem.cpp declares the A SparseMatrix argument as const:

void CheckProblem(const SparseMatrix & A, Vector * b, Vector * x, Vector * xexact) {

Later, std::cout is used to display the local row using A.globalToLocalMap[currentGlobalRow].

Since A is declared as const the [] operator cannot be used because it may insert a new entry if one does not exist.

The following:

#ifdef HPCG_DETAILED_DEBUG
        HPCG_fout << " rank, globalRow, localRow = " << A.geom->rank << " " << currentGlobalRow << " " << A.globalToLocalMap[currentGlobalRow] << endl;                                     
#endif

can be replaced in C++11 with:

#ifdef HPCG_DETAILED_DEBUG
    HPCG_fout << " rank, globalRow, localRow = " << A.geom->rank << " " << currentGlobalRow << " " << A.globalToLocalMap.at(currentGlobalRow) << endl;
#endif

or prior to C++11 with:

#ifdef HPCG_DETAILED_DEBUG
        HPCG_fout << " rank, globalRow, localRow = " << A.geom->rank << " " << currentGlobalRow << " " << A.globalToLocalMap.find(currentGlobalRow)->second << endl;
#endif

Keep display of performance result from version 2.4 for historical continuity

The calculation of reported performance changed between versions 2.4 and 3.0. To preserve continuity between new and old results, we'd like to keep the version 2.4 result. There is already some code to that effect, but it should be reviewed and modified so it is clear which value is the reported result and which is only for reference based on the old code base.

I get a problem in the build step

I am a beginner and my English is not very good, but I hope you can take some time to answer my question. I will be very grateful.
I cloned the repository from GitHub, then chose hpcg/setup/Make.linux_MPI and copied it into setup.
I changed the following settings:
MPdir = $(HOME)mpich
MPinc = -I$(MPdir)/include
MPlib = $(MPdir)/lib
HPCG_OPTS = -DHPCG_NO_OPENMP
CXX = $(HOME)mpich/bin/mpicc
LINKER = $(HOME)mpich/bin/mpicc
LINKFLAGS = $(HOME)mpich/bin/mpicc
Then I wanted to start the next step: in the top-level directory, type: make path/to/setup/file
It says: make: *** No rule to make target 'path/to/setup/file'
I really want to know how to solve this problem.
If you can answer, I will greatly appreciate it.

HPCG Memory Output

The memory output from HPCG is a base 10 value. This is for both the GB and GB/s. The end result is confusing since systems are specced in GiB and standard benchmarks, such as STREAM, measure bandwidth in GiB/s. Having to do this conversion is error-prone, and it is not always obvious to users that they need to convert their data. This causes potential issues when using the code as a benchmark in some contexts.

I suggest converting to the standard base 2 GiB.
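For reference, the conversion itself is small: 1 GB is 10^9 bytes and 1 GiB is 2^30 bytes. The helper names below are hypothetical and only illustrate the arithmetic; they do not exist in HPCG.

```cpp
#include <cassert>

// 1 GiB = 2^30 bytes = 1073741824 bytes; 1 GB = 10^9 bytes.
inline double BytesToGiB(double bytes) {
  assert(bytes >= 0.0);
  return bytes / 1073741824.0;
}

// Converts a base-10 gigabyte figure into base-2 gibibytes. The same factor
// applies to rates, so GB/s converts to GiB/s in the same way.
inline double GBToGiB(double gb) {
  return BytesToGiB(gb * 1.0e9);
}
```

A reported 64 GB therefore corresponds to roughly 59.6 GiB, a difference of about 7%.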

Change order of array allocation to improve reference kernel performance

The following email thread captures the options available to improve performance of the reference HPCG kernels. The first suggestion (allocating mtxIndL, matrixValues and mtxIndG in separate loops) certainly makes sense. The second suggestion (allocating a single big array and setting pointers to point inside) is certainly superior for current mainline architectures, but reduces code readability. It also eliminates the broader value of an array-of-pointers data structure to be able to re-allocate data on a row-by-row basis.

I won't implement the second suggestion.

Mike H.,
If the modification is acceptable, it could be improved even further, to better ensure compaction/contiguity in the arrays, with something like:

// Now allocate the arrays pointed to
mtxIndL[0] = new local_int_t[localNumberOfRows * numberOfNonzerosPerRow];
matrixValues[0] = new double[localNumberOfRows * numberOfNonzerosPerRow];
mtxIndG[0] = new global_int_t[localNumberOfRows * numberOfNonzerosPerRow];
for (local_int_t i=1; i< localNumberOfRows; ++i) {
  // Row i starts i * numberOfNonzerosPerRow entries into the single block
  mtxIndL[i] = mtxIndL[0] + i * numberOfNonzerosPerRow;
  matrixValues[i] = matrixValues[0] + i * numberOfNonzerosPerRow;
  mtxIndG[i] = mtxIndG[0] + i * numberOfNonzerosPerRow;
}

Mike Davis
Cielo Applications Analyst
Cray Inc. / Sandia National Laboratories

From: Heroux, Michael A
Sent: Tuesday, August 18, 2015 10:41 AM
To: Davis, Mike E
Cc: Rajan, Mahesh; Bookey, Zachary A; Demeshko, Irina Petrovna (-EXP); Rajamanickam, Sivasankaran (-EXP)
Subject: Re: hpcg question

Hi Mike,

This is an interesting observation. I can certainly add your version of the loops to the reference code, under the assumption that the change would be beneficial in general, which is probably a reasonable guess for any cache-based micro processor.

Thanks for sending this to me. Although the reference version of HPCG is not intended to be performant, there is no reason to avoid general performance improvements.

I am copying a student and colleagues, Zach Bookey, Irina Demeshko and Siva Rajamanickam, resp, who are working on a Kokkos version of the code, in case the same optimization could be helpful for them.

Thanks again.

Mike

From: "Davis, Mike E" [email protected]
Date: Tuesday, August 18, 2015 at 11:05 AM
To: Michael A Heroux [email protected]
Cc: Mahesh Rajan [email protected], "Davis, Mike E" [email protected]
Subject: hpcg question

Mike H.,
I’ve been doing some runs of HPCG and have found that I get a significant speedup out of ComputeSYMGS when I rearrange the order of allocations of arrays in GenerateProblem. My change to GenerateProblem is shown below (a separate loop for each array). My question is, is this a legitimate change to make in the code? Or might you consider making this change if it turns out to benefit everyone? Or is the current method (with vectors broken up) more representative of “the real world”? Thanks for any feedback you can provide.

// Now allocate the arrays pointed to
for (local_int_t i=0; i< localNumberOfRows; ++i) {
mtxIndL[i] = new local_int_t[numberOfNonzerosPerRow];
matrixValues[i] = new double[numberOfNonzerosPerRow];
mtxIndG[i] = new global_int_t[numberOfNonzerosPerRow];
}

// Now allocate the arrays pointed to
for (local_int_t i=0; i< localNumberOfRows; ++i) {
mtxIndL[i] = new local_int_t[numberOfNonzerosPerRow];
}
for (local_int_t i=0; i< localNumberOfRows; ++i) {
matrixValues[i] = new double[numberOfNonzerosPerRow];
}
for (local_int_t i=0; i< localNumberOfRows; ++i) {
mtxIndG[i] = new global_int_t[numberOfNonzerosPerRow];
}

Mike Davis
Cielo Applications Analyst
Cray Inc. / Sandia National Laboratories

Fix B/W reporting for raw vs weighted I/O rates.

The current B/W report does not correctly report the raw B/W values. Here is the correct code:

doc.add("GB/s Summary","");
doc.get("GB/s Summary")->add("Raw Read B/W",fnreads/times[0]/1.0E9);
doc.get("GB/s Summary")->add("Raw Write B/W",fnwrites/times[0]/1.0E9);
doc.get("GB/s Summary")->add("Raw Total B/W",(fnreads+fnwrites)/(times[0])/1.0E9);
doc.get("GB/s Summary")->add("Total with convergence and optimization phase overhead",(frefnreads+frefnwrites)/(times[0]+fNumberOfCgSets*(times[7]/10.0+times[9]/10.0))/1.0E9);

Runtime limit issue with 3.0

I noticed that the last line of hpcg.dat doesn’t seem to be used as a runtime limit by HPCG 3.0.
Before 3.0, when the value was set to 1200, I was able to run HPCG for at least 20 minutes.
But with 3.0 (HPCG 3.0 Reference Code), the same input only runs for 166 seconds.
It looks like using --rt=1200 fixes the issue, but I believe there is a bug with this last parameter inside hpcg.dat.

Add --rt option to the command line

Right now the user can specify nx, ny and nz on the command line. It would be useful to specify --rt for runtime. It would also fix the bug of param[3] not being initialized.

//@Header
// ***************************************************
//
// HPCG: High Performance Conjugate Gradient Benchmark
//
// Contact:
// Michael A. Heroux ( [email protected])
// Jack Dongarra ([email protected])
// Piotr Luszczek ([email protected])
//
// ***************************************************
//@Header

#ifndef HPCG_NOMPI
#include <mpi.h>
#endif

#ifndef HPCG_NOOPENMP
#include <omp.h>
#endif

#include <cstdio>
#include <cstring>
#include <ctime>
#include <fstream>

#include "hpcg.hpp"
#include "ReadHpcgDat.hpp"

#ifdef _WIN32 //added by as
char* NULLDEVICE="nul";
#else
char* NULLDEVICE="/dev/null";
#endif

std::ofstream HPCG_fout; //!< output file stream for logging activities during HPCG run

static int
startswith(const char * s, const char * prefix) {
size_t n = strlen( prefix );
if (strncmp( s, prefix, n ))
return 0;
return 1;
}

/*!
Initializes an HPCG run by obtaining problem parameters (from a file or
command line) and then broadcasts them to all nodes. It also initializes
logging I/O streams that are used throughout the HPCG run. Only MPI rank 0
performs I/O operations.

The function assumes that MPI has already been initialized for MPI runs.

@param[in] argc_p the pointer to the "argc" parameter passed to the main() function
@param[in] argv_p the pointer to the "argv" parameter passed to the main() function
@param[out] params the reference to the data structure that is filled with the basic parameters of the run

@return returns 0 upon success and non-zero otherwise

@see HPCG_Finalize
*/
int
HPCG_Init(int * argc_p, char ** *argv_p, HPCG_Params & params) {
int argc = *argc_p;
char ** argv = *argv_p;
char fname[80];
int i, j, iparams[4];
char cparams[4][6] = {"--nx=", "--ny=", "--nz=", "--rt="}; //--rt added by as
time_t rawtime;
tm * ptm;

iparams[3]=0; // added by as
/* for sequential and some MPI implementations it's OK to read first three args */
for (i = 0; i < 4; ++i) // changed to 4 by as
if (argc <= i+1 || sscanf(argv[i+1], "%d", iparams+i) != 1 || iparams[i] < 10) iparams[i] = 0;

/* for some MPI environments, command line arguments may get complicated so we need a prefix */
for (i = 1; i <= argc && argv[i]; ++i)
for (j = 0; j < 4; ++j) // changed to 4 by as
if (startswith(argv[i], cparams[j]))
if (sscanf(argv[i]+strlen(cparams[j]), "%d", iparams+j) != 1 || iparams[j] < 10) iparams[j] = 0;

if (! iparams[0] && ! iparams[1] && ! iparams[2]) { /* no geometry arguments on the command line */
ReadHpcgDat(iparams, iparams+3);
}

if (0== iparams[3]) iparams[3]=30*60; // default 30 min by as

for (i = 0; i < 3; ++i) {
if (iparams[i] < 16)
for (j = 1; j <= 2; ++j)
if (iparams[(i+j)%3] > iparams[i])
iparams[i] = iparams[(i+j)%3];
if (iparams[i] < 16)
iparams[i] = 16;
}

#ifndef HPCG_NOMPI
MPI_Bcast( iparams, 4, MPI_INT, 0, MPI_COMM_WORLD );
#endif

params.nx = iparams[0];
params.ny = iparams[1];
params.nz = iparams[2];

params.runningTime = iparams[3];

#ifdef HPCG_NOMPI
params.comm_rank = 0;
params.comm_size = 1;
#else
MPI_Comm_rank( MPI_COMM_WORLD, &params.comm_rank );
MPI_Comm_size( MPI_COMM_WORLD, &params.comm_size );
#endif

#ifdef HPCG_NOOPENMP
params.numThreads = 1;
#else
#pragma omp parallel
params.numThreads = omp_get_num_threads();
#endif

time ( &rawtime );
ptm = localtime(&rawtime);
sprintf( fname, "hpcg_log_%04d.%02d.%02d.%02d.%02d.%02d.txt",
1900 + ptm->tm_year, ptm->tm_mon+1, ptm->tm_mday, ptm->tm_hour, ptm->tm_min, ptm->tm_sec );

if (0 == params.comm_rank)
HPCG_fout.open(fname);
else {

#if defined(HPCG_DEBUG) || defined(HPCG_DETAILED_DEBUG)
char local[15];
sprintf( local, "%d_", params.comm_rank );
sprintf( fname, "hpcg_log_%s%04.d%02d.%02d.%02d.%02d.%02d.txt", local,
    1900 + ptm->tm_year, ptm->tm_mon+1, ptm->tm_mday, ptm->tm_hour, ptm->tm_min, ptm->tm_sec );
HPCG_fout.open(fname);
#else
HPCG_fout.open(NULLDEVICE); // changed by as
#endif

}

return 0;
}

Matlab example

The current Matlab/Octave example code (here) does not use a multigrid preconditioner and requires generation of the matrix by the C code. While this is useful for validating the C code, it can be useful to also generate the matrix directly in Matlab/Octave. Can contribute such a code if it would be helpful.

Set minimum set iteration count for optimized run to be the same as the reference run

Presently the iteration count for the optimized solver phase is permitted to be less than the reference version count of 50 iterations. This was permitted to allow for the case where a permutation of the symmetric Gauss-Seidel would actually improve the effectiveness of the smoother. However, this feature has been a chronic source of false high scores due to variations in residual values having nothing to do with better permutations.

Therefore, I am changing the policy so that the optimized run will do at least as many iterations as the reference run. I am still leaving in place the policy that forces execution of more iterations if the reordering weakens the convergence of the solver. This is necessary to penalize optimization strategies that dramatically increase parallelism in the Gauss-Seidel iteration at the expense of severely weakening the convergence rate.

Unit tests in `unittesting` directory fail to compile

Looks like the unit tests use an older(?) version of HPCG, and the current function signatures don't match. For example:

int ReadHpcgDat(int *localDimensions, int *secondsPerRun, int *localProcDimensions);

...is called with too few arguments:

ReadHpcgDat(localDims, &seconds);

and similarly the function:

void GenerateGeometry(int size, int rank, int numThreads, int pz, local_int_t zl, local_int_t zu, local_int_t nx, local_int_t ny, local_int_t nz, int npx, int npy, int npz, Geometry * geom);

...is also called with too few arguments:

GenerateGeometry(size, rank, numThreads, nx, ny, nz, geom)

Maybe the unittests are considered deprecated and should be removed?

Linux_Serial fails to compile with gcc version 7.1.0

Building with GCC version 7.1.0 fails with the following output:

[]$ make arch=Linux_Serial
/storage/software/gcc-7.1.0/bin/gcc -DHPCG_NO_MPI -DHPCG_NO_OPENMP -I./src -I./src/Linux_Serial  -fomit-frame-pointer -O3 -funroll-loops -W -Wall -pedantic   -c -o src/main.o src/main.cpp
In file included from src/SparseMatrix.hpp:27:0,
                 from src/GenerateProblem.hpp:17,
                 from src/main.cpp:42:
src/Vector.hpp:84:3: internal compiler error: Illegal instruction
   for (int i=0; i<localLength; ++i) vv[i] = rand() / (double)(RAND_MAX) + 1.0;
   ^~~
0xad289f crash_signal
	../.././gcc/toplev.c:337
0x7f83145bbd9b parsed_string_to_mpfr
	/install/mpfr-3.1.4/src/strtofr.c:518
0x7f83145bd038 mpfr_strtofr
	/install/mpfr-3.1.4/src/strtofr.c:833
0xa44674 real_from_string(real_value*, char const*)
	../.././gcc/real.c:2107
0xa44fab real_from_string3(real_value*, char const*, format_helper)
	../.././gcc/real.c:2173
0x715943 interpret_float
	../.././gcc/c-family/c-lex.c:932
0x716827 c_lex_with_flags(tree_node**, unsigned int*, unsigned char*, int)
	../.././gcc/c-family/c-lex.c:432
0x619d3e cp_lexer_get_preprocessor_token
	../.././gcc/cp/parser.c:793
0x64c9c4 cp_lexer_new_main
	../.././gcc/cp/parser.c:657
0x64c9c4 cp_parser_new
	../.././gcc/cp/parser.c:3725
0x64c9c4 c_parse_file()
	../.././gcc/cp/parser.c:38428
0x71c1c3 c_common_parse_file()
	../.././gcc/c-family/c-opts.c:1107
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
make: *** [src/main.o] Error 1

Move GenerateProblem function into the timed overhead of problem setup

In order to increase the mix of computational and data access patterns, we will add the GenerateProblem function to the timed setup cost, along with the OptimizeProblem.

Because of this, we will also add a new function that checks the values of the matrix and RHS to make sure that they are defined as expected, since the benchmarker is now allowed to modify GenerateProblem.

Improve documentation of SymGS to note that a zero initial vector is not used when used in MG

Mike.
Sure it is, that's exactly what I see while investigating the code. The comments in the code cover GS as a preconditioner but not GS as a smoother, so they are probably out of date. Moreover, this functionality could be optimized differently for zero and non-zero input arrays, so does it make sense to provide it in different kernels?
Thanks,
Alex

From: Heroux, Michael A [mailto:[email protected]]
Sent: Tuesday, June 10, 2014 10:10 AM
To: Kalinkin, Alexander A
Cc: [email protected]; Story, Shane; Park, Jongsoo; Scott, David S; Heinecke, Alexander; Pirogov, Vadim O
Subject: Re: possible HPCG issue?

Alex,

I will look at this more closely, but I think it is an issue of out-of-date documentation. When using SymGS as a stand-alone preconditioner for CG, the initial vector should be all zeros. But when using it as a smoother for multigrid, you have a non-trivial residual obtained from the grid above or below, and you are applying SymGS to reduce the error associated with the high frequency modes on the given grid.

Does this make sense?

Mike

On Jun 9, 2014, at 5:44 PM, "Kalinkin, Alexander A" [email protected] wrote:

Jack and Mike,

Is it correct not to nullify the array x before the GS computation in the MG algorithm?
In the reference implementation of HPCG the initial array x is not nullified before symmetric Gauss-Seidel in the MG routines, and that seems to be an issue.

The first loop in ComputeSYMGS_ref.cpp corresponds to the following equation:
D x^{new} = f - L x^{new} - U x^{old}
That means x after the first loop is equal to
x^{new} = (D+L)^{-1} (f - U x^{old})
After the second loop it becomes:
x^{new} = (D+U)^{-1} D (D+L)^{-1} (f - U x^{old})

This indeed corresponds to typical symmetric GS preconditioner if x^{old} is equal to zero that is mentioned in comments of the file:

Symmetric Gauss-Seidel notes:

  • We use the input vector x as the RHS and start with an initial guess for y of all zeros.
    • We perform one forward sweep. x should be initially zero on the first GS sweep, but we do not attempt to exploit this fact

But in the multigrid code we don't set it to zero, which is the main reason I ask: am I doing something wrong in the calculation, or do we have an issue with setting the initial vector to zero in the HPCG code?
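To make the two use cases concrete, here is a small dense illustration of one symmetric Gauss-Seidel step (forward sweep, then backward sweep). It is a sketch for experimenting with the initial vector, not the HPCG kernel: `SymGSDense` and `DenseMatrix` are hypothetical names, and a dense matrix replaces the sparse storage HPCG uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > DenseMatrix;

// One symmetric Gauss-Seidel step on A x = f, updating x in place.
// Each sweep uses whatever values x currently holds, so the result
// depends on the initial contents of x.
inline void SymGSDense(const DenseMatrix& A, const std::vector<double>& f,
                       std::vector<double>& x) {
  const std::size_t n = A.size();
  assert(f.size() == n && x.size() == n);
  for (std::size_t i = 0; i < n; ++i) {  // forward sweep
    double sum = f[i];
    for (std::size_t j = 0; j < n; ++j)
      if (j != i) sum -= A[i][j] * x[j];
    x[i] = sum / A[i][i];
  }
  for (std::size_t i = n; i-- > 0;) {    // backward sweep
    double sum = f[i];
    for (std::size_t j = 0; j < n; ++j)
      if (j != i) sum -= A[i][j] * x[j];
    x[i] = sum / A[i][i];
  }
}
```

Starting from x = 0, the step returns (D+U)^{-1} D (D+L)^{-1} f, which is the preconditioner form. Starting from a non-zero x, the same two sweeps act as a smoother that refines the current approximation, which is the multigrid use case discussed in the thread.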

Make sure params[3] is assigned a default value

From a vendor:

An interesting issue came up when I was testing HPCG with various compilers and options. In HPCG_Init, I believe that iparams[3] deftly avoids ever being assigned a default value. If no command-line args are given, and hpcg.dat is not present, then iparams[3] holds junk.

In one test I ran using icc, params.runningTime ended up with 32767; with the Cray compiler, it was 0. Very different behavior from HPCG in those cases!
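One way to rule this out is to give every entry a known value before any parsing happens and to apply the default afterwards. The stand-alone sketch below illustrates the pattern; `ParseRunParams` and `kDefaultRunningTimeSeconds` are hypothetical names, and the 30-minute default is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Assumption for this sketch: default running time of 30 minutes.
static const int kDefaultRunningTimeSeconds = 30 * 60;

// Hypothetical stand-alone parser for --nx=, --ny=, --nz= and --rt=.
// Every entry of iparams is zeroed first, so none can hold junk.
inline void ParseRunParams(int argc, char** argv, int iparams[4]) {
  static const char* const prefixes[4] = {"--nx=", "--ny=", "--nz=", "--rt="};
  for (int j = 0; j < 4; ++j) iparams[j] = 0;
  for (int i = 1; i < argc && argv[i]; ++i) {
    for (int j = 0; j < 4; ++j) {
      const std::size_t n = std::strlen(prefixes[j]);
      int value = 0;
      if (std::strncmp(argv[i], prefixes[j], n) == 0 &&
          std::sscanf(argv[i] + n, "%d", &value) == 1 && value > 0)
        iparams[j] = value;
    }
  }
  // Fall back to the default only when no running time was supplied.
  if (iparams[3] == 0) iparams[3] = kDefaultRunningTimeSeconds;
}
```

With this ordering the running time is the same under every compiler when neither command-line arguments nor hpcg.dat supply a value.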

Add tuning build of HPCG

Execution of HPCG for tuning can be made much faster if repeated executions are eliminated. This should be made possible by adding something like -DHPCG_QUICK_RUN

Add memory size requirements policy

HPCG has not had an explicit policy about how big a problem should be in order to be valid for an official result. We still prefer a non-specific value. Instead the policy is:

  • HPCG should be run with a problem size large enough that data accessed during the main CG iteration loop is obtained from the memory resource that matches how a real application would expect an iterative solver to behave.
  • This is a subjective criterion, but permits specific decisions to be made in the future, especially as additional memory hierarchies are added to computing systems.
  • The policy will be explicitly stated as needed for any particular processor line.

compile error

error

I used setup/Make.Linux_MPI for the settings and ran ../configure in build/,
but when I type "make" in build/
there is an error:

/mpich/include: file not recognized: Is a directory
collect2: error: ld returned 1 exit status

in the INSTALL file
there is

* ``MPinc`` specifies the path to include directories with MPI header files. A
  common setting here would be ``MPinc = $(MPdir)/include``, provided that the
  ``MPdir`` variable was set properly.

solution

Add "-I" before $(MPdir)/include

Typo in SparseMatrix.hpp - unorderd_map

Attempting to build with a modern GCC I get

/usr/bin/g++ -c -DHPCG_NO_MPI -I./src -I./src/GCC_OMP  -O3 -ffast-math -ftree-vectorize -ftree-vectorizer-verbose=0 -fopenmp -I../src ../src/main.cpp -o src/main.o
In file included from ../src/GenerateProblem.hpp:17:0,
                 from ../src/main.cpp:42:
../src/SparseMatrix.hpp:36:31: error: ‘unorderd_map’ in namespace ‘std’ does not name a template type
 using GlobalToLocalMap = std::unorderd_map< global_int_t, local_int_t >;
                               ^~~~~~~~~~~~

Spelling it unordered_map allows the build to complete.
