edf-hpc / verrou

floating-point errors checker

Home Page: http://edf-hpc.github.io/verrou/vr-manual.html

License: GNU General Public License v2.0

verrou's Introduction

Verrou

Verrou helps you look for floating-point round-off errors in programs. It implements various forms of arithmetic, including:

  • all IEEE-754 standard rounding modes;

  • three variants of stochastic floating-point arithmetic based on random rounding: all floating-point operations are perturbed by randomly switching rounding modes. These can be seen as an asynchronous variant of the CESTAC method, or a subset of Monte Carlo Arithmetic, performing only output randomization through random rounding;

  • an emulation of single-precision rounding, in order to test the effect of reduced precision without any need to change the source code.
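As a rough illustration of random rounding, here is a minimal Python sketch (not Verrou's implementation, which instruments compiled binaries through Valgrind). The helper names `random_round_add` and `noisy_sum` are hypothetical. The sketch assumes finite inputs and results, Python 3.9+ for `math.nextafter`, and an equal probability of rounding up or down; Verrou's own variants may weight the choice differently.

```python
import math
import random
from fractions import Fraction


def random_round_add(a: float, b: float, rng: random.Random) -> float:
    """Return a + b rounded up or down at random (finite values only)."""
    exact = Fraction(a) + Fraction(b)    # exact rational sum
    nearest = a + b                      # usual round-to-nearest result
    if Fraction(nearest) == exact:
        return nearest                   # representable: nothing to perturb
    if Fraction(nearest) < exact:
        down, up = nearest, math.nextafter(nearest, math.inf)
    else:
        down, up = math.nextafter(nearest, -math.inf), nearest
    # Assumption: both neighbours are equally likely.
    return up if rng.random() < 0.5 else down


def noisy_sum(values, seed):
    """Accumulate values with a randomly rounded addition at each step."""
    rng = random.Random(seed)
    total = 0.0
    for value in values:
        total = random_round_add(total, value, rng)
    return total


if __name__ == "__main__":
    # The spread between runs hints at how sensitive the sum is to round-off.
    print([noisy_sum([0.1] * 10_000, seed) for seed in range(5)])
```

Running the same computation with several seeds and comparing the results is the basic idea behind this family of methods: digits that change from run to run are not trustworthy.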

Verrou also comes with a verrou_dd utility, which simplifies the Verrou-based debugging process by implementing several variants of the Delta-Debugging algorithm. This makes it easy to locate the parts of the analyzed source code that are likely to be responsible for floating-point-related instabilities.

The documentation for Verrou is available as a dedicated chapter in the Valgrind manual.

Installation

Get the sources

The preferred way to get Verrou sources is to download the latest stable version: v2.5.0. Older versions are available in the releases page. After downloading one of the released versions, skip to the "Configure and build" section below.

 

In order to build the development version of Verrou, it is necessary to first download a specific Valgrind version, and patch it. Fetch valgrind's sources:

git clone --branch=VALGRIND_3_23_0 --single-branch https://sourceware.org/git/valgrind.git valgrind-3.23.0+verrou-dev

Add verrou's sources to it:

cd valgrind-3.23.0+verrou-dev
git clone https://github.com/edf-hpc/verrou.git verrou

cat verrou/valgrind.*diff | patch -p1

Configure and build

First, install all required dependencies (the names of relevant Debian packages are put in parentheses as examples):

  • C & C++ compilers (build-essential),
  • autoconf & automake (automake),
  • Python 3 (python3)
  • C standard library with debugging symbols (libc6-dbg).

 

Configure valgrind:

./autogen.sh
./configure --enable-only64bit --prefix=PREFIX

Depending on your system, it may be required to set CFLAGS in order to enable the use of FMA in your compiler:

./configure --enable-only64bit --prefix=PREFIX CFLAGS="-mfma"

On systems that don't support FMA instructions, the --enable-verrou-fma=no configure switch needs to be used, but be aware that this causes some tests to fail:

./configure --enable-only64bit --enable-verrou-fma=no --prefix=PREFIX

Advanced users can use the following configure flags:

  • --enable-verrou-check-naninf=yes|no (default yes). If NaN never appears in the code under analysis, setting this option to 'no' can slightly speed up verrou.
  • --with-det-hash=hash_name, with hash_name in [dietzfelbinger,multiply_shift,double_tabulation,xxhash,mersenne_twister], selects the hash function used for the [random|average]_[det|comdet] rounding modes. The default is xxhash. double_tabulation was the previous default (before the introduction of xxhash). mersenne_twister is the reference but is slow. dietzfelbinger and multiply_shift are faster but cannot reproduce the reference results.
  • --with-verrou-denorm-hack=[none|float|double|all] (default float). With denormal numbers, the EFTs are no longer necessarily exact. With the average* rounding modes this problem is always ignored, but for the random* rounding modes the following options are available: with none, the problem is ignored; with float, a hack based on computation in double is applied to float operations; with double, an experimental hack is applied to double operations; with all, both the float and double hacks are applied.
  • --enable-verrou-xoshiro=[no|yes] (default yes). If set to yes, the tiny mersenne twister prng is replaced (for random, prandom and average) by the xo[ro]shiro prng.
  • --enable-verrou-quad=[yes|no] (default yes). If set to no, the mcaquad backend is disabled. This option is only useful to reduce the dependencies.

 

Build and install:

make
make install

Load the environment

In order to actually use Verrou, you must load the correct environment. This can be done using:

source PREFIX/env.sh

Test (optional)

General tests

You can test the whole platform:

make check
perl tests/vg_regtest --all

or only verrou:

make -C tests check
make -C verrou check
perl tests/vg_regtest verrou

Specific tests

These tests are more closely related to the arithmetic part in Verrou:

make -C verrou/unitTest

Documentation

The documentation for verrou is available as a chapter in valgrind's manual.

 

You can also re-build it:

make -C docs html-docs man-pages

and browse it locally:

firefox docs/html/vr-manual.html

Beware, this requires lots of tools which are not necessarily tested for in configure, including (but not necessarily limited to):

  • xsltproc
  • docbook-xsl

Bibliography & References

The following papers explain the internals of Verrou in more detail, as well as some of its applications. If you use Verrou for research work, please consider citing one of these references:

  1. François Févotte and Bruno Lathuilière. Debugging and optimization of HPC programs with the Verrou tool. In International Workshop on Software Correctness for HPC Applications (Correctness), Denver, CO, USA, Nov. 2019. DOI: 10.1109/Correctness49594.2019.00006
  2. Hadrien Grasland, François Févotte, Bruno Lathuilière, and David Chamont. Floating-point profiling of ACTS using Verrou. EPJ Web Conf., 214, 2019. DOI: 10.1051/epjconf/201921405025
  3. François Févotte and Bruno Lathuilière. Studying the numerical quality of an industrial computing code: A case study on code_aster. In 10th International Workshop on Numerical Software Verification (NSV), pages 61--80, Heidelberg, Germany, July 2017. DOI: 10.1007/978-3-319-63501-9_5
  4. François Févotte and Bruno Lathuilière. VERROU: a CESTAC evaluation without recompilation. In International Symposium on Scientific Computing, Computer Arithmetics and Verified Numerics (SCAN), Uppsala, Sweden, September 2016.

(These references are also available in bibtex format)

verrou's People

Contributors

ffevotte, hadrieng2, lathuili, lathuili-home, nestordemeure, rb214678

verrou's Issues

Idea: Alternate operating mode which alters FP precision

I am not sure if this could fall into the scope of Verrou, but if not maybe it could be food for thought for the associated Interflop project.

Most hardware architectures these days either favor (from a performance point of view) or outright mandate use of single-precision floating-point arithmetic. But most numerical algorithms are written in double precision as a safe default. Porting such algorithms to single precision usually involves a tedious process of locating and working on the code regions which are sensitive to use of reduced-precision arithmetic.

In principle, Verrou could help with this by offering an alternate operating mode where, instead of twiddling the IEEE-754 rounding mode, it performs operations in reduced precision and then casts back to the original floating-point format.

By integrating this functionality with delta-debugging, one could then easily locate which functions or code regions are most sensitive to the use of reduced precision. The double rounding would make this approach somewhat pessimistic with respect to an actual double -> float substitution in the code, but at least it could act as a good starting point for locating the precision-sensitive regions of the code.

What do you think about this idea?
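To make the proposal concrete, here is a minimal Python sketch of "compute in double, then round to single and cast back". It is only an illustration of the idea, not how Verrou would implement it; the helper names `to_float32` and `add_reduced` are hypothetical, and `struct.pack("f", ...)` raises `OverflowError` for values outside the binary32 range.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_reduced(a: float, b: float) -> float:
    """Add in double precision, then round the result to single precision."""
    return to_float32(a + b)


if __name__ == "__main__":
    print(1.0 + 1e-8 == 1.0)             # False: double keeps the small term
    print(add_reduced(1.0, 1e-8) == 1.0) # True: single precision absorbs it
```

Note that the operands here remain doubles, which is one way in which such an emulation differs from a real double -> float substitution in the source code.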

Adding call graph information to verrou_dd

When using libraries such as Eigen, verrou_dd finds sources of numerical instabilities deep inside of their implementation. This means that, for example, I can learn that numerical instabilities can occur while performing matrix additions, but not much more, and this information is not terribly useful per se.

It would be nice if you could provide some call graph context around why/when a given line of code is called. Maybe this could be done by somehow combining the information provided by verrou_dd and callgrind in an intelligent way?

Does not compile with openmpi-4.0.3 (and it seems with any openmpi >= 3.0)

Hi,

Verrou fails to compile on my ubuntu20.04 where openmpi is 4.0.3. It seems that it fails with any openmpi >= 3.0, as the error message looks like:

>> 1628    /usr/lib/x86_64-linux-gnu/openmpi/include/mpi.h:322:57: error: expected expression before '_Static_assert'
    1629      322 | #define THIS_SYMBOL_WAS_REMOVED_IN_MPI30(func, newfunc) _Static_assert(0, #func " was removed in MPI-3.0.  Use " #newfunc " instead.")
    1630          |                                                         ^~~~~~~~~~~~~~
    1631    /usr/lib/x86_64-linux-gnu/openmpi/include/mpi.h:745:45: note: in expansion of macro 'THIS_SYMBOL_WAS_REMOVED_IN_MPI30'
    1632      745 | #        define MPI_COMBINER_STRUCT_INTEGER THIS_SYMBOL_WAS_REMOVED_IN_MPI30(MPI_COMBINER_STRUCT_INTEGER, MPI_COMBINER_STRUCT);
    1633          |                                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1634    libmpiwrap.c:366:12: note: in expansion of macro 'MPI_COMBINER_STRUCT_INTEGER'
    1635      366 |       case MPI_COMBINER_STRUCT_INTEGER: fprintf(f, "STRUCT_INTEGER"); break;

It works with openmpi-2.1.6.

The spack build also fails for the same reason (the dependency with mpi seems to be missing in the spack package).

Thanks!
Olivier

Improving reproducibility in verrou_dd

Debugging rare failures with verrou_dd can be very difficult. Either you are lucky and the failure can be reproduced in upward/downward rounding mode, or you are in for a long time tuning VERROU_DD_NRUNS, potentially to absurdly high values, without managing to reproduce the failure on every run.

Farthest does not always help because its result is not stable under delta-debugging (as it depends on previous rounding decisions), and I suspect that this also holds for toward_zero. Similarly, vr-seed does not help reproducibility in delta-debugging mode, because it only reproduces the sequence of roundings that will be applied, and not the places where they will be applied.

I think this could be improved, admittedly at the cost of large overhead (which may prove intractable in practice), by recording the sequence of rounding modes that was applied on each source file location, and reproducing that instead of just the global random number sequence.
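A minimal Python sketch of that idea, under the assumption that each rounding decision can be keyed by a source location string. The `RoundingLog` class and the location format are hypothetical and are not part of Verrou or verrou_dd.

```python
import random
from collections import defaultdict, deque


class RoundingLog:
    """Record rounding decisions per source location, or replay a recording."""

    def __init__(self, rng, recorded=None):
        self.rng = rng
        self.recorded = defaultdict(list)   # location -> decisions made so far
        # Replay queues, one per location; empty when nothing was recorded.
        self.replay = {loc: deque(seq) for loc, seq in (recorded or {}).items()}

    def decide(self, location: str) -> bool:
        """Return True to round up, False to round down, at this location."""
        queue = self.replay.get(location)
        if queue:
            choice = queue.popleft()         # reuse the recorded decision
        else:
            choice = self.rng.random() < 0.5 # no recording left: draw afresh
        self.recorded[location].append(choice)
        return choice
```

Because decisions are stored per location, disabling instrumentation of one location during delta-debugging does not shift the decisions replayed at the others, which is what a single global random sequence cannot guarantee. The memory cost grows with the number of instrumented operations, in line with the overhead concern above.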

RDDMin's dd.sym folder name can go over the filesystem limit

Modern programming languages use name mangling for the purpose of namespace disambiguation. In the presence of generic types, name mangling can generate very long symbol names because a type's methods are preceded by all the type parameters. When this is combined with RDDMin's willingness to feature all faulty symbol names in a set as a comma-separated list, the filesystem name length limit can be reached:

dd: done
ddmin23 (
  _ZN4Acts12AtlasStepperINS_14ConstantBFieldEE5State6updateINS_26SingleBoundTrackParametersINS_13ChargedPolicyEEEEEvRKT_        /root/acts-core/spack-build/Tests/Integration/PropagationTests
  _ZNK4Acts15IntegrationTest29covariance_validation_fixtureINS_10PropagatorINS_12AtlasStepperINS_14ConstantBFieldEEENS_13VoidNavigatorEEEE19calculateCovarianceINS_26SingleBoundTrackParametersINS_13ChargedPolicyEEESC_NS7_7OptionsINS_10ActionListIJEEENS_9AbortListIJEEEEEEEN5Eigen6MatrixIdLi5ELi5ELi0ELi5ELi5EEERKT_RKSL_RKT0_RKT1_     /root/acts-core/spack-build/Tests/Integration/PropagationTests
  _ZSt18generate_canonicalIdLm53ESt26linear_congruential_engineImLm16807ELm0ELm2147483647EEET_RT1_      /root/acts-core/spack-build/Tests/Integration/PropagationTests
):
/root/acts-core/spack-build/Tests/Integration/dd.sym/84a02f517d8a210580d728fbb41b2286 --( run )-> PASS
/root/acts-core/spack-build/Tests/Integration/dd.sym/fb2eb4131a126169ec3705da3b4f7f1f --(cache)-> FAIL
/root/acts-core/spack-build/Tests/Integration/dd.sym/577d59aae63c7a37b3ec8411a8577ed7 --(cache)-> FAIL
/root/acts-core/spack-build/Tests/Integration/dd.sym/ebcc1ee1dcb242877e864e535f6ba54b --(cache)-> FAIL
/root/acts-core/spack-build/Tests/Integration/dd.sym/bdbbcdaf58738172d79424cb10a9ab5c --(cache)-> FAIL
/root/acts-core/spack-build/Tests/Integration/dd.sym/52c72e8cce653bc26e5ee8deb1ee256a --(cache)-> FAIL
/root/acts-core/spack-build/Tests/Integration/dd.sym/9746df00562dffd2feb25f3d4d5c0056 --(cache)-> FAIL
/root/acts-core/spack-build/Tests/Integration/dd.sym/6d933a0326f9294f0b020a172d958430 --(cache)-> FAIL
Traceback (most recent call last):
  File "/opt/spack/opt/spack/linux-opensuse_tumbleweed20180919-x86_64/gcc-8.2.1/verrou-valgrind-update-g3q4crfabmhq4qddliomqlfjpfrhxqw3/bin/verrou_dd", line 638, in <module>
    main(runScript, cmpScript, algoSearch=ddAlgo)
  File "/opt/spack/opt/spack/linux-opensuse_tumbleweed20180919-x86_64/gcc-8.2.1/verrou-valgrind-update-g3q4crfabmhq4qddliomqlfjpfrhxqw3/bin/verrou_dd", line 613, in main
    (refSym, confSymsTab) = ddSymRDDMin(run, compare)
  File "/opt/spack/opt/spack/linux-opensuse_tumbleweed20180919-x86_64/gcc-8.2.1/verrou-valgrind-update-g3q4crfabmhq4qddliomqlfjpfrhxqw3/bin/verrou_dd", line 457, in ddSymRDDMin
    dd.testSym(conf)
  File "/opt/spack/opt/spack/linux-opensuse_tumbleweed20180919-x86_64/gcc-8.2.1/verrou-valgrind-update-g3q4crfabmhq4qddliomqlfjpfrhxqw3/bin/verrou_dd", line 428, in testSym
    symlink(dirname, linkname)
  File "/opt/spack/opt/spack/linux-opensuse_tumbleweed20180919-x86_64/gcc-8.2.1/verrou-valgrind-update-g3q4crfabmhq4qddliomqlfjpfrhxqw3/bin/verrou_dd", line 288, in symlink
    os.symlink(src, dst)
OSError: [Errno 36] File name too long: '/root/acts-core/spack-build/Tests/Integration/dd.sym/6d933a0326f9294f0b020a172d958430' -> '/root/acts-core/spack-build/Tests/Integration/dd.sym._ZNK4Acts15IntegrationTest29covariance_validation_fixtureINS_17PropagatorWrapperISt10shared_ptrINS_16RungeKuttaEngineINS_14ConstantBFieldEEEEEEE19calculateCovarianceINS_32SingleCurvilinearTrackParametersINS_13ChargedPolicyEEESD_NS8_7OptionsINS_10ActionListIJEEENS_9AbortListIJEEEEEEEN5Eigen6MatrixIdLi5ELi5ELi0ELi5ELi5EEERKT_RKSM_RKT0_RKT1_,_ZNK4Acts15RungeKuttaUtils22transformLocalToGlobalEbPKNS_7SurfaceEPKdPd'

Since we can't tune up the filesystem name length limit easily (the Linux VFS name length limit is set at kernel build time, and on-disk filesystems have hard length limits in their specification), we need to get around it somehow. I can think of several possibilities:

  • Do not put symbol names in file names, instead only use hashes and provide the list of associated symbol names in a different way (e.g. by a text file index)
  • Carefully truncate symbol names so that we stay within our file name budget

These possibilities are not mutually exclusive. It could seem unfair to penalize the ergonomics of C and Fortran users just because languages with namespaces and generics generate ridiculously long symbol names. At the same time, if we truncate symbol names, we do need to provide the full list somewhere in case the truncated version is ambiguous (name mangling is used for a reason).

If we do end up truncating, I think it is best to truncate in the middle. That is because the beginning of the mangled name tells us which part of the program (namespace) we are talking about, whereas the end of the mangled name tells us which end-user method we are dealing with. The middle part of the mangled name is only important in cases where a generic type's parameters strongly affect its behaviour, which can happen but is not a very common issue.
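Middle truncation could look like the following Python sketch. The function name `shorten_symbol`, the `~` separator, the 64-character budget and the use of a short SHA-1 digest to keep truncated names distinct are all assumptions made for illustration; none of this is existing verrou_dd behaviour.

```python
import hashlib


def shorten_symbol(name: str, budget: int = 64) -> str:
    """Keep the head and tail of a long symbol, replacing the middle by a digest."""
    if budget < 16:
        raise ValueError("budget too small to keep head, digest and tail")
    if len(name) <= budget:
        return name
    # A short digest of the full name keeps truncated names distinguishable.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:8]
    keep = budget - len(digest) - 2       # room left after two '~' separators
    head = (keep + 1) // 2                # namespace side of the mangled name
    tail = keep // 2                      # method side of the mangled name
    return f"{name[:head]}~{digest}~{name[-tail:]}"
```

The full symbol list would still need to be written somewhere (for example a text index next to the directory), since the digest only disambiguates names and cannot be reversed.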

Branch off valgrind instead of patching it

Recently, I tried packaging verrou using Spack, and discovered that the current use of patchfiles on top of valgrind brings in some... complications in that process.

One Spack maintainer suggested that now that Valgrind has switched to Git, we could make everyone's life easier by making Verrou a branch on top of valgrind's git repository, rather than a patch which exists on its own:

  • Your life would be easier, because you could easily keep your patch up to date w.r.t. valgrind using standard git tools
  • The packagers' life would be easier, because there is only one source tree to be downloaded, unlike the current situation where one must keep two source trees in sync in the package file.

How would you feel about that?

"Exclude below" symbol selection

Some functions are called from many different program contexts. This is for example the case with linear algebra libraries or transcendental functions, but can be generalized to any utility library.

In that situation, being able to enable/disable a given symbol, or even a source line within the function, is often too coarse to be useful. What would be needed is a way to say "exclude this function and any other code that it calls". A different way to phrase this would be "disable verrou instrumentation until the active function call returns".
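The intended semantics can be sketched in Python on a simplified event trace. The trace format (`("call", name)`, `("ret",)`, `("op", label)`) and the function `instrumented_ops` are hypothetical and only model the behaviour; they do not reflect how Verrou tracks calls.

```python
def instrumented_ops(events, excluded):
    """Yield the labels of operations that remain instrumented.

    An excluded function switches instrumentation off for itself and for
    everything it calls, until its own call returns.
    """
    stack = []  # one flag per active call: is instrumentation off in this frame?
    for event in events:
        kind = event[0]
        if kind == "call":
            parent_off = stack[-1] if stack else False
            stack.append(parent_off or event[1] in excluded)
        elif kind == "ret":
            stack.pop()
        elif kind == "op":
            if not (stack and stack[-1]):
                yield event[1]


if __name__ == "__main__":
    trace = [
        ("call", "main"), ("op", "main:add"),
        ("call", "solve"), ("op", "solve:mul"),
        ("call", "dot"), ("op", "dot:fma"), ("ret",),
        ("ret",),
        ("op", "main:sub"), ("ret",),
    ]
    print(list(instrumented_ops(trace, {"solve"})))
```

With `solve` excluded, the operation inside `dot` is skipped as well because `dot` was called from `solve`, whereas a plain per-symbol exclusion would still instrument it.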

This functionality would require verrou to have some form of call graph sensitivity, and can therefore be considered related to, or a prerequisite of, #15.

GDB's "stop-at" commands do not work

I recently ended up in a case where valgrind's GDB integration would be useful. Unfortunately, its facilities for controlled execution do not seem to work properly when using Verrou. Here is a non-exhaustive list of things that do not work:

  • Setting breakpoints does not seem to have any effect (the program does not stop there)
  • When interrupting via Ctrl+C, backtraces are often strange (corrupted?):
(gdb) bt 
#0  0x000000000055cf9b in Acts::CuboidVolumeBuilder::buildLayer (this=0x426a880, cfg=...) at /home/hadrien/Bureau/Programmation/acts-core/Core/include/Acts/Tools/CuboidVolumeBuilder.hpp:212
#1  0x000000000055e16a in Acts::CuboidVolumeBuilder::buildVolume (this=0x426a880, cfg=...) at /home/hadrien/Bureau/Programmation/acts-core/Core/include/Acts/Tools/CuboidVolumeBuilder.hpp:278
#2  0x000000000055ee9b in Acts::CuboidVolumeBuilder::trackingVolume (this=0x426a880) at /home/hadrien/Bureau/Programmation/acts-core/Core/include/Acts/Tools/CuboidVolumeBuilder.hpp:321
#3  0x0000000005d0381c in ?? ()
#4  0x0000001ffefe9500 in ?? ()
#5  0x0000001ffefe97b8 in ?? ()
#6  0x0000001ffefe9508 in ?? ()
#7  0x000000000424c0c0 in ?? ()
#8  0x0000000000000000 in ?? ()
  • "next"/"step"/"finish" seem broken, sometimes they go to an even weirder place where the backtrace is all ??, other times they just run the program to completion.

It should be noted that only execution control is broken. So when I'm really desperate for a debugger, I can manually sprinkle asm("int3"); in the code, rebuild, and remove them when I'm done. When setting breakpoints manually using that trick, backtraces and facilities for evaluating expressions, setting memory cells, etc seem to work as intended. But that's obviously not an enjoyable way to use a debugger 😉

The same functionalities work just fine when performing the exact same steps, but using memcheck instead of verrou, which is why I think verrou is the culprit.

Improving verrou_dd cache invalidation

verrou_dd only invalidates its cache when the run-script or the cmp-script change. But these scripts usually recursively call other binaries or touch configuration files which verrou_dd has no knowledge of. Therefore, the dd.sym cache is often reused when it shouldn't be.

I propose two possible strategies to deal with this: a sane but unsatisfying one, and a less sane but more powerful one:

  • Sane-but-meh option: Invalidate the cache by default when a top-level verrou_dd command is run. If the command calls itself recursively, make sure that the recursive invocations do not invalidate the cache. This can be done by adding an optional "--reuse-cache" script option, which will also allow users who know what they are doing to reuse a cache from a previous run.
  • Insane option: In principle, one could strace run-script or cmp-script calls in order to tell which files the run_script and cmp_script recursively depend on, and then compute a hash of them to detect dependency changes... ^^
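Whichever way the dependency list is obtained, the hashing step itself is simple. Here is a minimal Python sketch; the function name `cache_key` is hypothetical, and the list of paths is assumed to be supplied by the caller (discovering it, for example via strace, is the hard part and is not shown).

```python
import hashlib
from pathlib import Path


def cache_key(paths):
    """Return a hex digest that changes when any listed file's name or content changes."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(str(path).encode("utf-8"))
        digest.update(b"\0")               # separator between name and content
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

A cache directory could then be considered valid only if the key stored alongside it matches the key recomputed at startup.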

Provide an option to exclude exact zeros from cancellation reports

I've gotten the chance to experiment with Verrou's new --check-cancellation=yes mode, and I find it very nice and helpful... but a bit prone to false positives on "true" zeroes (e.g. ln(1) or x/x - 1), which typically emerge when the computation is configured to study a certain special case.

Would it be possible to have an option to disable reporting cancellations which return exactly zero? That would eliminate most false positives for me, and if I ever end up on a computation which is actually bad enough to lose all significant bits, there's still the option of flipping the option the other way around in order to study that...
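For illustration, here is a Python sketch of a cancellation check with such a switch. The criterion used (drop in binary exponent between the operands and the result of a subtraction) and the names `should_report`, `threshold` and `ignore_exact_zero` are assumptions; Verrou's actual detection logic may differ. Inputs are assumed to be finite.

```python
import math


def should_report(a: float, b: float, threshold: int = 10,
                  ignore_exact_zero: bool = True) -> bool:
    """Decide whether a - b counts as a reportable cancellation."""
    result = a - b
    if result == 0.0:
        # Every significant bit cancelled: report only if the caller asks for it.
        return (not ignore_exact_zero) and a != 0.0
    # frexp returns (mantissa, exponent) with 0.5 <= |mantissa| < 1.
    # Zero operands carry no exponent information, so they are skipped.
    operand_exp = max(math.frexp(x)[1] for x in (a, b) if x != 0.0)
    result_exp = math.frexp(result)[1]
    return operand_exp - result_exp >= threshold
```

With the default setting, `x - x` is silent while a near-cancellation such as `1.0000001 - 1.0` is still reported; flipping `ignore_exact_zero` restores reports for exact zeros.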

Parallel verrou_dd crashes with ValueError

When I run verrou_dd in parallel on a test workload of mine, it systematically crashes on the second iteration with this kind of backtrace. Sequential runs work fine on the same workload.

$ VERROU_DD_NUM_THREADS=4 VERROU_DD_NRUNS=4 verrou_dd `pwd`/run.sh `pwd`/cmp.sh
[...]
dd (run #1): trying 6275 + 6275
/root/acts-core/build/IntegrationTests/dd.sym/ca2681d399ee504572a37d53b1416f6f  --( run )-> 
Traceback (most recent call last):
  File "/usr/local/bin/verrou_dd", line 633, in <module>
    main(runScript, cmpScript, algoSearch=ddAlgo)
  File "/usr/local/bin/verrou_dd", line 605, in main
    (refSym, confSymsTab) = ddSym(run, compare)
  File "/usr/local/bin/verrou_dd", line 438, in ddSym
    conf = dd.ddmax(deltas)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 733, in ddmax
    return self.ddgen(c, 0, 1)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 607, in ddgen
    outcome = self._dd(c, n)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 670, in _dd
    (t, cs[i]) = self.test_mix(cs[i], c, self.REMOVE)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 580, in test_mix
    directionbar)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 384, in test_and_resolve
    t = self.test(csubr)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 313, in test
    outcome = self._test(c)
  File "/usr/local/bin/verrou_dd", line 409, in _test
    return vT.run()
  File "/usr/local/bin/verrou_dd", line 127, in run
    return self.runParMax(maxNbPROC)
  File "/usr/local/bin/verrou_dd", line 202, in runParMax
    run=self.pidRunTab.index(pid)                
ValueError: 50 is not in list

My test workload is a bit complicated, but I have it inside of a docker container if that can be useful. Or maybe we can find a simpler reproducer.

This is somewhat related to #8, in the sense that if the end decision is to change the verrou_dd parallelization algorithm, it may not be worth expending too much energy on fixing the existing one.

fmaintrin.h

Hi,
When configuring with fma, the configuration fails with:

configure:15976: error: A compiler with fmaintrin.h is required for --enable-verrou-fma

Despite the fact that the header is there.

But looking closer, I've found that the actual failure is in the detection test:

configure:15941: checking for fmaintrin.h
configure:15959: icpc -c -fma -I/trinity/shared/OCA/softs/gnurt-8.3/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include/ -mfma  conftest.cpp >&5
In file included from conftest.cpp(101):
/trinity/shared/OCA/softs/gnurt-8.3/lib/gcc/x86_64-pc-linux-gnu/8.3.0/include/fmaintrin.h(25): error: #error directive: "Never use <fmaintrin.h> directly; include <immintrin.h> instead."
  # error "Never use <fmaintrin.h> directly; include <immintrin.h> instead."
    ^

conftest.cpp(107): error: identifier "EXIT_SUCCESS" is undefined
        return EXIT_SUCCESS;
               ^

I have the same problem with the platform distribution of gcc:

/usr/lib/gcc/x86_64-redhat-linux/4.8.5/include/fmaintrin.h(25): error: #error directive: "Never use <fmaintrin.h> directly; include <immintrin.h> instead."
  # error "Never use <fmaintrin.h> directly; include <immintrin.h> instead."
    ^

Regards

Dealing with a verrou-unstable compiler intrinsic

I just discovered that the way the Rust compiler performs u64 -> f64 conversions is not stable under verrou's cool new --rounding-mode=float if the sub operation is being instrumented. Fair enough, I wouldn't really expect a cast to survive unexpected rounding occurring in the middle. The problem is that since casts are compiler intrinsics, they do not have associated symbols, so it does not seem to be possible to write an exclude rule for them. What would you do in this kind of case?

README broken link to manual

Hi,

I just wanted to let you know that there is a broken link on README.md at line 125:

[chapter in valgrind's manual](//edf-hpc.github.com/verrou/vr-manual.html).

This is due to github.com subdomains being deprecated since April 15; the link needs to be changed to use the new github.io subdomain:

[chapter in valgrind's manual](///edf-hpc.github.io/verrou/vr-manual.html).

Have a nice day,
Alexis

Detection of cancellations

It would be useful to (re)introduce in Verrou a feature to detect and locate large cancellations.

verrou_dd usability issues

Besides #6, another issue I have with verrou_dd is that it is a bit difficult to use correctly. Here are some suggested quality-of-life improvements:

  • verrou_dd should accept relative paths to the run_script and cmp_script
  • verrou_dd should run the scripts from its initial working directory in order to remove the need for cumbersome absolute paths inside of said scripts => Probably not a good idea, as applications may leave state around in the working directory and we aim for independent runs.
  • verrou_dd could be better at detecting when the dd.sym cache must be invalidated. => Moved to #9
  • verrou_dd's interface could be redesigned so that it is the one responsible for running verrou (or puts the proper valgrind command in an environment variable). This way, one would not need to repeat the valgrind command line in every run_script => Probably does not save enough characters to be useful, considering verrou_dd's requirements
  • As verrou_dd can take a lot of executions to converge, a way to state "my scripts are thread-safe, please fill up my idle CPU cores by testing multiple configurations in parallel" would be very nice. => In progress, parallel configuration testing is not there yet but parallel runs are in, see #8
  • Error reporting could be better. For example, "internal error" is not a very explicit way to say that an inconsistency was detected in the dataset.

Does it build with Intel compilers?

Hi,

My build is failing with:

backend_mcaquad/verrou_amd64_linux-interflop_mcaquad.o: In function `_mca_dbin':
/beegfs/home/alainm/install/valgrind-3.14.0/verrou/backend_mcaquad/mcalib.c:369: undefined reference to `__dtoq'
/beegfs/home/alainm/install/valgrind-3.14.0/verrou/backend_mcaquad/mcalib.c:370: undefined reference to `__dtoq'
backend_mcaquad/verrou_amd64_linux-interflop_mcaquad.o: In function `_mca_inexactq':
/beegfs/home/alainm/install/valgrind-3.14.0/verrou/backend_mcaquad/mcalib.c:280: undefined reference to `__addq'
/beegfs/home/alainm/install/valgrind-3.14.0/verrou/backend_mcaquad/mcalib.c:280: undefined reference to `__addq'
backend_mcaquad/verrou_amd64_linux-interflop_mcaquad.o: In function `_mca_dbin':
/beegfs/home/alainm/install/valgrind-3.14.0/verrou/backend_mcaquad/mcalib.c:378: undefined reference to `__addq'
/beegfs/home/alainm/install/valgrind-3.14.0/verrou/backend_mcaquad/mcalib.c:378: undefined reference to `__addq'
/beegfs/home/alainm/install/valgrind-3.14.0/verrou/backend_mcaquad/mcalib.c:378: undefined reference to `__addq'
backend_mcaquad/verrou_amd64_linux-interflop_mcaquad.o:/beegfs/home/alainm/install/valgrind-3.14.0/verrou/backend_mcaquad/mcalib.c:280: more undefined references to `__addq' follow
[....]

Is there something I forgot to disable?

Update: I do not have that problem when building with gnu compilers (as opposed to Intel compilers)

Thanks

Consider capping verrou_dd's delta list length

A debug build of a C++ program can use thousands of symbols. In this scenario, verrou_dd's helpful printout of the list of deltas can become a nuisance, as it heavily clobbers the TTY and makes it hard to locate previous messages.

Since said messages become especially important in the new rddmin operating mode, I would propose the following:

  • Cap the length of the "remaining deltas" printout to a reasonable length (say, 100 symbols), adding a "... and N more ..." indicator at the end of the list if this limit is reached.
  • Allow the user to gain back access to the list of deltas when needed by either 1/allowing the length cap to be tuned via CLI options or 2/logging this information to a file, perhaps in dd.sym.
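The capped printout could be as simple as the following Python sketch. The function name `format_deltas` and its `cap` parameter are hypothetical and do not correspond to existing verrou_dd code.

```python
def format_deltas(deltas, cap=100):
    """Format at most `cap` deltas, one per line, plus a summary of the rest."""
    shown = list(deltas[:cap])
    hidden = len(deltas) - len(shown)
    if hidden > 0:
        shown.append(f"... and {hidden} more ...")
    return "\n".join(shown)


if __name__ == "__main__":
    symbols = [f"symbol_{i}" for i in range(250)]
    print(format_deltas(symbols, cap=3))
```

The uncapped list can still be written to a file in full, so no information is lost by shortening what goes to the terminal.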

Problem when excluding symbols resulting from C++ template instantiations

When testing verrou on C++ code using STL functions, many symbols resulting from template instantiation have very complex names (as in the following anonymized example, as output by --gen-exclude):

std::_List_iterator<MyType>::_List_iterator(std::_List_node_base*)     my_binary
std::iterator_traits<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > > >::iterator_category std::__iterator_category<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > > >(__gnu_cxx::__normal_i     my_binary

Such symbols are not currently excluded.

verrou 2.1.0 + glibc 2.27's "pow" = segfault

I am aware that running the libm under verrou is bound to produce all sorts of weird results, but I suspect segfaulting is not an expected outcome and might hint at a deeper problem.

#include <cmath>
#include <iostream>

int main() {
  volatile double argument = 0.43312570424167907;
  std::cout << (double) std::pow(argument, 0.25) << std::endl;
}
hadrien@linux-2ak3:~/Bureau/test> valgrind --tool=verrou --rounding-mode=farthest ./a.out
==2326== Verrou, Check floating-point rounding errors
==2326== Copyright (C) 2014-2016, F. Fevotte & B. Lathuiliere.
==2326== Using Valgrind-3.14.0+verrou-2.1.0 and LibVEX; rerun with -h for copyright info
==2326== Command: ./a.out
==2326== 
==2326== Backend verrou : 1.x-dev
==2326== Simulating FARTHEST rounding mode
==2326== Instrumented operations :
==2326==        add : yes
==2326==        sub : yes
==2326==        mul : yes
==2326==        div : yes
==2326==        mAdd : yes
==2326==        mSub : yes
==2326==        cmp : no
==2326==        conv : yes
==2326==        max : no
==2326==        min : no
==2326== Instrumented scalar operations : no
==2326== 
==2326== Process terminating with default action of signal 11 (SIGSEGV)
==2326==  Access not within mapped region at address 0x5737978
==2326==    at 0x4AA6D8B: __ieee754_pow_fma (e_pow.c:415)
==2326==    by 0x4A36773: pow (w_pow_compat.c:30)
==2326==    by 0x401198: main (in /home/hadrien/Bureau/test/a.out)
==2326==  If you believe this happened as a result of a stack
==2326==  overflow in your program's main thread (unlikely but
==2326==  possible), you can try to increase the size of the
==2326==  main thread stack using the --main-stacksize= flag.
==2326==  The main thread stack size used in this run was 8388608.
==2326== 
==2326==  ---------------------------------------------------------------------
==2326==  Operation                            Instruction count
==2326==   `- Precision
==2326==       `- Vectorization          Total             Instrumented
==2326==  ---------------------------------------------------------------------
==2326==  add                       30                       30          (100%)
==2326==   `- dbl                       30                       30      (100%)
==2326==       `- llo                       30                       30  (100%)
==2326==  ---------------------------------------------------------------------
==2326==  sub                       23                       23          (100%)
==2326==   `- dbl                       23                       23      (100%)
==2326==       `- llo                       23                       23  (100%)
==2326==  ---------------------------------------------------------------------
==2326==  mul                       23                       23          (100%)
==2326==   `- dbl                       23                       23      (100%)
==2326==       `- llo                       23                       23  (100%)
==2326==  ---------------------------------------------------------------------
==2326==  mAdd                      10                       10          (100%)
==2326==   `- dbl                       10                       10      (100%)
==2326==       `- llo                       10                       10  (100%)
==2326==  ---------------------------------------------------------------------
==2326==  cmp                        7                        0          (  0%)
==2326==   `- dbl                        7                        0      (  0%)
==2326==       `- scal                       7                        0  (  0%)
==2326==  ---------------------------------------------------------------------
==2326== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Erreur de segmentation (core dumped)

Excluding the __ieee754_pow_fma symbol from the libm will make the segfault go away.
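
For reference, here is a minimal sketch of that workaround. It assumes the exclusion file uses one "symbol, whitespace, object" entry per line, as in the libm.ex file quoted elsewhere on this page. The libm path is a placeholder that has to be adapted to the system at hand, and both helper names are hypothetical. The command is only built here, not executed.

```python
def exclusion_entry(symbol, object_path):
    """Format one exclusion-file line (assumed layout: symbol <TAB> object)."""
    return "%s\t%s\n" % (symbol, object_path)


def verrou_command(program, exclude_file, rounding_mode="farthest"):
    """Build the valgrind command line using the switches seen in this report."""
    return [
        "valgrind",
        "--tool=verrou",
        "--rounding-mode=%s" % rounding_mode,
        "--exclude=%s" % exclude_file,
        program,
    ]


if __name__ == "__main__":
    # Placeholder path: the actual libm location is not given in this report.
    entry = exclusion_entry("__ieee754_pow_fma", "/lib64/libm-2.27.so")
    with open("libm.ex", "w") as handle:
        handle.write(entry)
    print(" ".join(verrou_command("./a.out", "libm.ex")))
```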

Some typos

I know that this is probably not high priority, but here are some minor typos in the following error message:

FAILURE: the comparison between the reference (code instrumented with nearest mode) **andthe** code without instrumentation failed
Suggestions:
	1) check if **reproducibilty** discrepancies are larger than the failure criteria of the script 
	2) check the libm library is not instrumented

See "andthe" (should be "and the") and "reproducibilty" (should be "reproducibility").

This is obtained after running verrou_dd_line.

Simultaneous use of the --gen-exclude, --exclude and --rounding-mode command-line switches

When simultaneously providing the --gen-exclude=GEN_LIST and --exclude=EXC_LIST command-line switches, it looks like verrou excludes EXC_LIST symbols from the list generated in GEN_LIST. However, EXC_LIST symbols are still instrumented (i.e. rounding modes are perturbed if --rounding-mode=random is provided as well).

Is such a behavior intended? If so, the documentation should mention it.

verrou_dd fails with parse errors in the valgrind output

I am trying to play with verrou's delta-debugging feature, but did not manage to get it to work so far.

Command and associated output (the rm is only there to invalidate the cache):

$ rm -rf dd.sym/ && verrou_dd `pwd`/run.sh `pwd`/cmp.sh
/root/acts-core/build/IntegrationTests/dd.sym/d41d8cd98f00b204e9800998ecf8427e  --( run )->  FAIL(1)
Traceback (most recent call last):
  File "/usr/local/bin/verrou_dd", line 244, in <module>
    main(sys.argv[1], sys.argv[2])
  File "/usr/local/bin/verrou_dd", line 240, in main
    (refSym, confSyms) = ddSym(run, compare)
  File "/usr/local/bin/verrou_dd", line 144, in ddSym
    conf = dd.ddmax(deltas)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 724, in ddmax
    return self.ddgen(c, 0, 1)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 607, in ddgen
    outcome = self._dd(c, n)
  File "/usr/local/lib/python2.7/site-packages/valgrind/DD.py", line 617, in _dd
    assert self.test([]) == self.PASS
AssertionError

run_script argument:

#!/bin/bash
DIR="$1"
WORKDIR="/root/acts-core/build/IntegrationTests"
valgrind --tool=verrou --rounding-mode=random --demangle=no --exclude="$WORKDIR/libm.ex" $WORKDIR/PropagationTests > ${DIR}/results.dat

cmp_script argument:

#!/bin/bash 
REF="$1"
RUN="$2"
diff ${REF}/results.dat ${RUN}/results.dat

My initial exclude rules (the "libm.ex" file):

__sin_fma       /lib64/libm-2.27.so
__cos_fma       /lib64/libm-2.27.so
__tan_fma       /lib64/libm-2.27.so
sincos  /lib64/libm-2.27.so

Contents of dd.sym/d41d8cd98f00b204e9800998ecf8427e/dd.run1/dd.run.err:

==2296== Loading exclusions list from `/root/acts-core/build/IntegrationTests/libm.ex'... OK.
==2296== Verrou, Check floating-point rounding errors
==2296== Copyright (C) 2014-2016, F. Fevotte & B. Lathuiliere.
==2296== Using Valgrind-3.13.0+verrou-1.1.0 and LibVEX; rerun with -h for copyright info
==2296== Command: /root/acts-core/build/IntegrationTests/PropagationTests
==2296== 
==2296== Loading exclusions list from `/root/acts-core/build/IntegrationTests/dd.sym/d41d8cd98f00b204e9800998ecf8427e/dd.exclude'... ERROR (parse)
==2296== First seed : 123030
==2296== Simulating RANDOM rounding mode
==2296== Instrumented operations :
==2296==        add : yes
==2296==        sub : yes
==2296==        mul : yes
==2296==        div : yes
==2296==        mAdd : yes
==2296==        mSub : yes
==2296==        cmp : no
==2296==        conv : no
==2296==        max : no
==2296==        min : no
==2296== Instrumented scalar operations : no
==2296== FATAL: in suppressions file "/usr/local/lib/valgrind/default.supp" near line 1:
==2296==    expected '{' or end-of-file
==2296== exiting now.

The contents of /root/acts-core/build/IntegrationTests/dd.sym/d41d8cd98f00b204e9800998ecf8427e/dd.exclude and /usr/local/lib/valgrind/default.supp are available at https://gist.github.com/HadrienG2/286e46a5f474ddcd73017e7815d19cf0 .

My best guess so far is that either Valgrind or Verrou is overwhelmed by the remarkably verbose output of g++'s name mangler. But that is only a guess.

Do you have any suggestions of where to start in order to debug this further?
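
As a first debugging step, the generated dd.exclude file can be scanned for lines that do not look like the entries of libm.ex. The sketch below is not part of verrou. It assumes that a valid entry consists of exactly two whitespace-separated fields (symbol, then object), which is inferred from libm.ex and not confirmed by the verrou documentation. The function name is hypothetical.

```python
def find_malformed_lines(lines):
    """Return (line_number, text) for each non-blank line that does not
    consist of exactly two whitespace-separated fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # blank lines are skipped, not reported
        if len(text.split()) != 2:
            problems.append((number, text))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_exclude.py path/to/dd.exclude
    with open(sys.argv[1]) as handle:
        for number, text in find_malformed_lines(handle):
            print("line %d: %s" % (number, text))
```

A symbol containing spaces, which is common for demangled C++ template names, would be reported by this check. With --demangle=no that should not happen, so any hit would narrow down where the parser gives up.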

Incompatibility between Valgrind 3.13 and recent binutils

I tried verrou-fying a Rust program of mine, but got all kinds of weird symptoms. Making verrou generate an exclude list made me discover a problem which I encountered before:

__log10_finite  /lib64/libm-2.27.so
__ieee754_exp_fma       /lib64/libm-2.27.so
__ieee754_log_fma       /lib64/libm-2.27.so

Clearly, not every function in the program is present. I suspect that is because Rust does not follow the usual structure of a C/C++ program (my "main" function is actually called _ZN13trois_photons4main28_$u7b$$u7b$closure$u7d$$u7d$17ha2ce800db5acffacE ). In the past, I could hack around this with a suitable gen-above parameter, but since commit 4991694 gen-above is gone, as it was assumed that recent changes made it obsolete.

Can you help me figure out what's wrong or, if all else fails, bring gen-above back?
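
To make the mangled name above easier to read, here is a minimal decoding sketch. It assumes the legacy Rust mangling scheme, i.e. an Itanium-style `_ZN...E` wrapper with length-prefixed path components and `$uXX$` escapes for characters given by their hexadecimal code point. It only handles what is needed for this symbol (other escapes such as `$LT$` are left as-is) and is no substitute for a real demangler.

```python
import re

_LENGTH = re.compile(r"\d+")
_ESCAPE = re.compile(r"\$u([0-9a-f]+)\$")


def decode_component(component):
    """Decode the $uXX$ escapes of one path component."""
    if component.startswith("_$"):
        component = component[1:]  # leading "_" only protects the "$"
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), component)


def demangle_legacy_rust(symbol):
    """Turn `_ZN<len><name>...E` into a `::`-separated path."""
    if not (symbol.startswith("_ZN") and symbol.endswith("E")):
        raise ValueError("not a _ZN...E symbol: %r" % symbol)
    body = symbol[3:-1]
    parts = []
    pos = 0
    while pos < len(body):
        match = _LENGTH.match(body, pos)
        if match is None:
            raise ValueError("missing length prefix at offset %d" % pos)
        start = match.end()
        end = start + int(match.group())
        parts.append(decode_component(body[start:end]))
        pos = end
    return "::".join(parts)


if __name__ == "__main__":
    print(demangle_legacy_rust(
        "_ZN13trois_photons4main28_$u7b$$u7b$closure$u7d$$u7d$17ha2ce800db5acffacE"
    ))
```

The trailing `h...` component is a hash appended by the compiler, so the function the user wrote corresponds to the closure inside `trois_photons::main`.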

A proposal for a more efficient parallel verrou_dd

Currently, verrou_dd processes configurations in order, and only allows itself to perform multiple runs of a configuration in parallel. This has two problems:

  • If you need fewer runs than your CPU has cores, you can't fully leverage your CPU time.
  • As each run determines whether it is useful to do the next run or not, parallel run processing may not be very efficient because it hampers early exit.

Here is an alternate algorithm proposal:

Parameters:

NUM_PROCS = Maximal number of processes in flight
NRUNS = Maximal number of runs per configuration

Main thread:

Set up a pool of NUM_PROCS processes which listen to a FIFO queue of cancellable tasks (IIRC python's stdlib has something for this built-in)
For each schedule (set of configurations)
    For each run (1 to NRUNS)
        For each configuration:
            Schedule a run of the active configuration to be executed
    Wait for any of the runs in flight to complete or be canceled (out-of-order, select-style):
        If the run was a success and was the last run associated with this configuration...
            Mark the configuration as successful
            Print a log message indicating that this configuration is successful.
        If the run was a failure
            Cancel all other runs associated with this configuration.
            Mark the configuration as failing
            Print a log message indicating that this configuration is failing
    Once all runs have executed or have been canceled, do some error checking and determine next schedule

I think this algorithm is pretty good: since the executor is FIFO and we schedule the first run of each configuration, then the second run of each configuration, and so on, multiple runs of the same configuration will only actually execute in parallel when the CPU has nothing better to do. Therefore, we reach a good compromise between keeping the CPU busy and avoiding potentially wasted work.

One thing which is lost with respect to the current algorithm is that no message will be printed when a configuration starts to be tested, only when verrou_dd is done testing a configuration. Also, configurations will be printed out of order, but I don't think this is a big deal (they were hardly identifiable anyhow).
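
For concreteness, here is a rough sketch of how one schedule could be processed with Python's standard concurrent.futures module. It is not verrou_dd code: the function name, the run_once(config, run_index) callback (assumed to return True when a run passes) and the log messages are all placeholders. It also relies on ThreadPoolExecutor starting tasks in submission order, and on Future.cancel(), which can only cancel runs that have not started yet.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate_schedule(configs, run_once, nruns, num_procs):
    """Test every configuration of one schedule; return {config: passed}."""
    verdict = {}                    # config -> True (passes) / False (fails)
    passed = defaultdict(int)       # config -> number of successful runs
    by_config = defaultdict(list)   # config -> all futures for that config
    owner = {}                      # future -> config

    with ThreadPoolExecutor(max_workers=num_procs) as pool:
        # Run-major submission: first run of each config, then second, ...
        for run_index in range(nruns):
            for config in configs:
                future = pool.submit(run_once, config, run_index)
                by_config[config].append(future)
                owner[future] = config

        # Handle results out of order, as soon as they are available.
        for future in as_completed(owner):
            config = owner[future]
            if future.cancelled() or config in verdict:
                continue
            if future.result():
                passed[config] += 1
                if passed[config] == nruns:
                    verdict[config] = True
                    print("%s: PASS" % (config,))
            else:
                verdict[config] = False
                for other in by_config[config]:
                    other.cancel()  # no effect on runs already started
                print("%s: FAIL" % (config,))
    return verdict
```

A run that is already executing when its configuration fails still runs to completion here. Truly interrupting it would require keeping a handle on the underlying process and terminating it, which this sketch does not attempt.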
