ecp-veloc / veloc

Very-Low Overhead Checkpointing System

Home Page: http://veloc.rtfd.io

License: MIT License

CMake 4.99% Shell 1.00% C 12.59% C++ 76.47% Python 4.94%

checkpointing checkpoint-restart async-storage

veloc's Issues

warning: strncpy() specified bound 50 equals destination size

I'm seeing this warning on Fedora 28 with GCC 8.2.1:

[  5%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o
[ 10%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o
[ 15%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o
[ 20%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/client_aggregator.cpp.o
[ 25%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/ec_module.cpp.o
[ 30%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/__/common/config.cpp.o
In file included from /home/hutter/veloc/src/common/config.hpp:4,
                 from /home/hutter/veloc/src/common/config.cpp:1:
In function ‘char* strncpy0(char*, const char*, size_t)’,
    inlined from ‘int ini_parse_stream(ini_reader, void*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:234:25,
    inlined from ‘int ini_parse_file(FILE*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:283:28,
    inlined from ‘int ini_parse(const char*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:295:27,
    inlined from ‘INIReader::INIReader(std::__cxx11::string)’ at /home/hutter/veloc/src/common/INIReader.h:370:23,
    inlined from ‘config_t::config_t(const string&)’ at /home/hutter/veloc/src/common/config.cpp:20:72:
/home/hutter/veloc/src/common/INIReader.h:163:12: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 50 equals destination size [-Wstringop-truncation]
     strncpy(dest, src, size);
     ~~~~~~~^~~~~~~~~~~~~~~~~
In function ‘char* strncpy0(char*, const char*, size_t)’,
    inlined from ‘int ini_parse_stream(ini_reader, void*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:257:25,
    inlined from ‘int ini_parse_file(FILE*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:283:28,
    inlined from ‘int ini_parse(const char*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:295:27,
    inlined from ‘INIReader::INIReader(std::__cxx11::string)’ at /home/hutter/veloc/src/common/INIReader.h:370:23,
    inlined from ‘config_t::config_t(const string&)’ at /home/hutter/veloc/src/common/config.cpp:20:72:
/home/hutter/veloc/src/common/INIReader.h:163:12: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 50 equals destination size [-Wstringop-truncation]
     strncpy(dest, src, size);
     ~~~~~~~^~~~~~~~~~~~~~~~~
[ 35%] Linking CXX shared library libveloc-modules.so
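GCC emits -Wstringop-truncation here because a strncpy call whose bound equals the destination size can leave the buffer without a terminating NUL. Below is a minimal sketch of one way to copy with truncation while always terminating, assuming the caller passes the full size of the destination buffer. The name copy_truncated is hypothetical and is not taken from INIReader.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from INIReader.h): copy src into dest, which has
 * room for `size` bytes, truncating if needed and always NUL-terminating. */
char *copy_truncated(char *dest, const char *src, size_t size)
{
    assert(dest != NULL && src != NULL);
    if (size == 0)
        return dest;            /* no room even for the terminator */

    size_t len = strlen(src);
    if (len > size - 1)
        len = size - 1;         /* leave one byte for '\0' */

    memcpy(dest, src, len);     /* copy only the bytes that fit */
    dest[len] = '\0';
    return dest;
}
```

Because the length is clamped before the copy and the terminator is written explicitly, the result is a valid C string for any source length.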

Alternative to OpenSSL for md5

Currently you're using the libcrypto from OpenSSL for MD5 hashing of files when computing checksums. Since the only use of OpenSSL is for MD5, it's a pretty heavyweight dependency for just that. Would you be open to a PR that brought in a different MD5 implementation in order to reduce the dependency footprint?

Build VELOC as a static library

Some users requested a static build (or at least fixed rpaths). We shall keep this in mind for the next refactoring of the build process.

can't build with AXL 4.0.0

This commit broke veloc

-- Build files have been written to: /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5_build
>>> Source configured.
>>> Compiling source in /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 ...
 * Source directory (CMAKE_USE_DIR): "/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5"
 * Build directory  (BUILD_DIR):     "/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5_build"
ninja -v -j12 -l12
[1/25] /usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src  -Os -pipe -march=native -frecord-gcc-switches -fPIC   -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/module_manager.cpp
[2/25] /usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src  -Os -pipe -march=native -frecord-gcc-switches -fPIC   -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/client_watchdog.cpp
[3/25] /usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src  -Os -pipe -march=native -frecord-gcc-switches -fPIC   -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp
FAILED: src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o
/usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src  -Os -pipe -march=native -frecord-gcc-switches -fPIC   -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp: In constructor ‘transfer_module_t::transfer_module_t(const config_t&)’:
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:54:23: error: too many arguments to function ‘int AXL_Init()’
   54 |     int ret = AXL_Init(NULL);
      |               ~~~~~~~~^~~~~~
In file included from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.hpp:12,
                 from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:1:
/usr/include/axl.h:58:5: note: declared here
   58 | int AXL_Init (void);
      |     ^~~~~~~~
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp: In function ‘int axl_transfer_file(axl_xfer_t, const string&, const string&)’:
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:68:24: error: too few arguments to function ‘int AXL_Create(axl_xfer_t, const char*, const char*)’
   68 |     int id = AXL_Create(type, source.c_str());
      |              ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
In file included from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.hpp:12,
                 from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:1:
/usr/include/axl.h:73:5: note: declared here
   73 | int AXL_Create (axl_xfer_t xtype, const char* name, const char* state_file);
      |     ^~~~~~~~~~
ninja: build stopped: subcommand failed.
 * ERROR: sys-cluster/veloc-1.5-r1::guru failed (compile phase):
 *   ninja -v -j12 -l12 failed

Change VELOC_MAX_NAME from size_t to a macro?

With VELOC_MAX_NAME in veloc.h defined as a size_t variable rather than a #define, I'm getting the following error when trying to declare an array.

[100%] Building C object src/CMakeFiles/scr_o.dir/scr2veloc.c.o
/usr/workspace/wsb/moody20/projects/scr2veloc/src/scr2veloc.c:24:13: error: variably modified 'current_name' at file scope
 static char current_name[VELOC_MAX_NAME] = "";
             ^~~~~~~~~~~~
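The error arises because, in C, a const-qualified variable is not an integer constant expression, so it cannot size an array at file scope. The sketch below shows two alternatives that do work in C: a macro and an enum constant. The names and the value 128 are assumptions for illustration only and are not taken from veloc.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names and value; veloc.h may define its limit differently. */
#define EXAMPLE_MAX_NAME 128
enum { EXAMPLE_MAX_NAME_ENUM = 128 };

/* Both forms are integer constant expressions, so file-scope arrays are legal. */
static char name_from_macro[EXAMPLE_MAX_NAME] = "";
static char name_from_enum[EXAMPLE_MAX_NAME_ENUM] = "";

/* Compile-time check (C11) that the macro really sizes the array. */
static_assert(sizeof name_from_macro == EXAMPLE_MAX_NAME, "macro sizes the array");

size_t macro_buffer_size(void) { return sizeof name_from_macro; }
size_t enum_buffer_size(void)  { return sizeof name_from_enum; }
```

The enum form keeps the constant visible to a debugger and scoped like any other identifier, while the macro form is the more common convention in C headers.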

Build fails at linking with undefined reference to `kvtree_xxx` on Cori (NERSC)

Hi all,
I wanted to try VELOC 1.4 on Cori at NERSC, but it fails at the linking phase of veloc-backend, it seems:

$ mkdir ~/veloc-1.4/
$ ./auto-install.py ~/veloc-1.4/
Installing VeloC in /global/homes/c/chiusole/veloc-1.4...
Downloading Boost...
100% [......................................................................] 121849575 / 121849575
Installing KVTree...
Cloning into '/tmp/veloc/KVTree'...

...

Scanning dependencies of target veloc-backend
[ 47%] Building CXX object src/backend/CMakeFiles/veloc-backend.dir/main.cpp.o
[ 52%] Linking CXX executable veloc-backend
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/liber.a(er.c.o): in function `ER_Init':
er.c:(.text+0x2): undefined reference to `kvtree_new'
/usr/bin/ld: er.c:(.text+0xe): undefined reference to `kvtree_new'
/usr/bin/ld: er.c:(.text+0x1c): undefined reference to `redset_init'
/usr/bin/ld: er.c:(.text+0x25): undefined reference to `shuffile_init'
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/liber.a(er.c.o): in function `ER_Finalize':
er.c:(.text+0x57): undefined reference to `kvtree_get'
/usr/bin/ld: er.c:(.text+0x5f): undefined reference to `kvtree_size'
/usr/bin/ld: er.c:(.text+0x89): undefined reference to `kvtree_get'
/usr/bin/ld: er.c:(.text+0x91): undefined reference to `kvtree_size'
/usr/bin/ld: er.c:(.text+0xb9): undefined reference to `kvtree_delete'
/usr/bin/ld: er.c:(.text+0xc5): undefined reference to `kvtree_delete'
/usr/bin/ld: er.c:(.text+0xcc): undefined reference to `shuffile_finalize'
/usr/bin/ld: er.c:(.text+0xd8): undefined reference to `redset_finalize'
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/liber.a(er.c.o): in function `ER_Create_Scheme':
er.c:(.text+0x14b): undefined reference to `redset_create'
/usr/bin/ld: er.c:(.text+0x171): undefined reference to `kvtree_set_kv_int'
/usr/bin/ld: er.c:(.text+0x185): undefined reference to `kvtree_util_set_ptr'
/usr/bin/ld: er.c:(.text+0x1d5): undefined reference to `kvtree_unset_kv_int'
/usr/bin/ld: er.c:(.text+0x1df): undefined reference to `redset_delete'
...

/usr/bin/ld: axl_sync.c:(.text+0x14c): undefined reference to `kvtree_util_set_int'
/usr/bin/ld: axl_sync.c:(.text+0x17a): undefined reference to `kvtree_util_set_int'
/usr/bin/ld: axl_sync.c:(.text+0x1aa): undefined reference to `kvtree_util_set_int'
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/libaxl.a(axl_sync.c.o): in function `axl_sync_wait':
axl_sync.c:(.text+0x1e8): undefined reference to `kvtree_get_kv_int'
/usr/bin/ld: axl_sync.c:(.text+0x1fc): undefined reference to `kvtree_util_get_int'
collect2: error: ld returned 1 exit status
gmake[2]: *** [src/backend/CMakeFiles/veloc-backend.dir/build.make:100: src/backend/veloc-backend] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:204: src/backend/CMakeFiles/veloc-backend.dir/all] Error 2
gmake: *** [Makefile:141: all] Error 2
Installation failed!

I've tried with the default PrgEnv-intel and also swapping with PrgEnv-gnu, and both get stuck at the same point.
Any idea what the problem may be?
Happy to provide more details

MPI_Comm_split with uninitialized key value?

I'm working with @kosinovsky to debug an XOR rebuild problem. While looking through the code, this line caught my eye:

VELOC/src/lib/client.cpp

Line 62 in a5a9b8a

MPI_Comm_split(comm, provided == 0 ? 0 : MPI_UNDEFINED, rank, &backends);

I don't think rank has been initialized at this point, which means it could have an arbitrary value. That could then lead to a potential random reordering of rank values in the backends communicator as compared to the parent communicator.

A potential fix would be to replace rank with 0 or otherwise move the MPI_Comm_rank(comm, &rank) higher up in the function.

restart-in-place: test veloc scripts while powering off nodes

Power off CORAL nodes, and check that detection / relaunch logic works.

Script logic has already been tested in SCR versions, but we should repeat these tests with the veloc scripts

error using test/heatdis example

Trying to restart the test/heatdis_file example from a broken state, I got the following error:

[ERROR 3830768914477] [/u/dbertini/mpiio/veloc/src/lib/client.cpp:145:route_file] must call checkpoint_begin() first

Then the program hangs.
Should one add a call to checkpoint_begin()? If so, where exactly?

Include Fedora build instructions

The quick start guide (https://veloc.readthedocs.io/en/latest/quick.html) should be updated with more detail on how to build VeloC, including listing all its dependencies. Here's what worked for me on Fedora 28:

sudo yum install -y python3-pip cmake boost boost-devel openmpi-devel
pip3 install wget --user
pip3 install bs4 --user 
module load mpi

git clone -b 'veloc-1.1' --single-branch --depth 1 https://github.com/ECP-VeloC/veloc.git
cd veloc
mkdir build install
./auto-install.py --no-boost install

command.hpp: error: array used as initializer command_t() { }

I'm unable to build VeloC master (using GCC 4.9.3):

$ cmake -DCMAKE_BUILD_TYPE=Debug -DWITH_AXL_PREFIX=`pwd`/install -DWITH_ER_PREFIX=`pwd`/install -DBoost_NO_BOOST_CMAKE=TRUE -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_ROOT=/g/g0/hutter2/boost_1_69_0 .
-- The C compiler identification is GNU 4.9.3
-- The CXX compiler identification is GNU 4.9.3
-- Check for working C compiler: /usr/tcetmp/bin/cc
-- Check for working C compiler: /usr/tcetmp/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/tcetmp/bin/c++
-- Check for working CXX compiler: /usr/tcetmp/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Boost version: 1.69.0
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found MPI_C: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so;/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so  
-- Found MPI_CXX: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so;/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so  
-- Found AXL: /g/g0/hutter2/VELOC/install/lib64/libaxl.a  
-- Found ER: /g/g0/hutter2/VELOC/install/lib64/liber.a;/g/g0/hutter2/VELOC/install/lib64/libkvtree.a;/g/g0/hutter2/VELOC/install/lib64/libredset.a;/g/g0/hutter2/VELOC/install/lib64/libshuffile.a;/g/g0/hutter2/VELOC/install/lib64/librankstr.a;z  
-- Configuring done
-- Generating done
-- Build files have been written to: /g/g0/hutter2/VELOC

$ make
[  5%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o
In file included from /g/g0/hutter2/VELOC/src/modules/module_manager.hpp:4:0,
                from /g/g0/hutter2/VELOC/src/modules/module_manager.cpp:1:
/g/g0/hutter2/VELOC/src/common/command.hpp: In constructor 'command_t::command_t()':
/g/g0/hutter2/VELOC/src/common/command.hpp:16:17: error: array used as initializer
    command_t() { }
                ^
/g/g0/hutter2/VELOC/src/common/command.hpp:16:17: error: array used as initializer
/g/g0/hutter2/VELOC/src/common/command.hpp: In constructor 'command_t::command_t(int, int, int, const string&)':
/g/g0/hutter2/VELOC/src/common/command.hpp:17:95: error: array used as initializer
    command_t(int r, int c, int v, const std::string &s) : unique_id(r), command(c), version(v) {
                                                                                              ^
/g/g0/hutter2/VELOC/src/common/command.hpp:17:95: error: array used as initializer
make[2]: *** [src/modules/CMakeFiles/veloc-modules.dir/build.make:63: src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:110: src/modules/CMakeFiles/veloc-modules.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

Down node detection on LSF has wrong node count

Issue

LSB_DJOB_HOSTFILE also contains the launch node.

VELOC/scripts/LSF/veloc_env.in

Line 81 in 7b2eeed

my $hostfile = $ENV{LSB_DJOB_HOSTFILE};

When it is used to generate a nodelist without filtering out the launch node:

VELOC/scripts/LSF/veloc_jsrun.in

Lines 50 to 52 in 7b2eeed

nodelist=`$bindir/veloc_env --nodes`
if [ $? -eq 0 ] ; then
  VELOC_NODELIST=$nodelist

VeloC then counts one node more than the allocation actually contains, both when computing how many nodes are needed and when computing how many remain available for a restart.

How to Replicate

Configure JSM to allow jsrun to launch new jobs after a node failure in an allocation by setting FAULT_TOLERANCE=1 in a private ~/.jsm.conf file:

# Create a custom jsm.conf file to be used in your jobs
cp /opt/ibm/spectrum_mpi/jsm_pmix/etc/jsm.conf ~/.jsm.conf
 
# Then modify your ~/.jsm.conf to uncomment
FAULT_TOLERANCE = 1

Allocate two nodes and run veloc_jsrun to identify them:

user@butte5 bin $ ./veloc_jsrun -r 1 ./mpihostname
veloc_jsrun: Started: Thu Apr  4 14:28:23 PDT 2019
veloc_jsrun: RUN 1: Thu Apr  4 14:28:23 PDT 2019
0 of 2 on butte17
1 of 2 on butte18
veloc_jsrun: $VELOC_RUNS exhausted, ending run.
veloc_jsrun: Ended: Thu Apr  4 14:28:23 PDT 2019

Then run again on a single node to identify which is the default:

user@butte5 bin $ ./veloc_jsrun -r 1 -n1 ./mpihostname
veloc_jsrun: Started: Thu Apr  4 14:49:41 PDT 2019
veloc_jsrun: RUN 1: Thu Apr  4 14:49:41 PDT 2019
0 of 1 on butte17
veloc_jsrun: $VELOC_RUNS exhausted, ending run.
veloc_jsrun: Ended: Thu Apr  4 14:49:41 PDT 2019

Have someone with superpowers kill that node for you and wait at least 1 minute, if not more, before running again. There is a jsrun bug that will cause a hang if you try too soon. While waiting, set the debug variable to see what is happening.

export VELOC_DEBUG=1

After at least 1 minute, run again:

user@butte5 bin $ ./veloc_jsrun -r 1 -n1 ./mpihostname
+ ⋮
veloc_jsrun: Started: Thu Apr  4 15:03:58 PDT 2019
+ ⋮
++ /<bindir>/veloc_env --nodes
+ nodelist=butte17,butte18,butte5
+ '[' 0 -eq 0 ']'
+ VELOC_NODELIST=butte17,butte18,butte5
+ '[' -z butte17,butte18,butte5 ']'
+ export VELOC_NODELIST
+ ⋮
++ /<bindir>/veloc_list_down_nodes --free
butte17: ssh: connect to host butte17 port 22: Connection timed out
pdsh@butte5: butte17: ssh exited with exit code 255
+ down_nodes=butte17
+ '[' butte17 '!=' '' ']'
+ /<bindir>/veloc_list_down_nodes --free --reason
butte17: ssh: connect to host butte17 port 22: Connection timed out
pdsh@butte5: butte17: ssh exited with exit code 255
butte17: Failed to pdsh echo UP
+ ⋮
NNODES=-1 RUNTIME=0 FAILED=butte17'
+ ⋮
++ /<bindir>/veloc_glob_hosts --count --hosts butte17,butte18,butte5
+ num_needed=3
+ '[' -n 1 ']'
+ num_needed=1
++ /<bindir>/veloc_glob_hosts --count --minus butte17,butte18,butte5:butte17
+ num_left=2
+ '[' 2 -lt 1 ']'
+ exclude_hosts=--exclude_hosts=butte17
+ ⋮
veloc_jsrun: RUN 1: Thu Apr  4 15:04:19 PDT 2019
+ jsrun --exclude_hosts=butte17 -n 1 -r 1 ./mpihostname
0 of 1 on butte18
+ ⋮
veloc_jsrun: $VELOC_RUNS exhausted, ending run.
+ ⋮
veloc_jsrun: Ended: Thu Apr  4 15:04:19 PDT 2019

Notice the launch node is included in the VELOC_NODELIST and that num_needed=3 originally and num_left=2 after the failed node was accounted for. But we only allocated 2 nodes to begin with.

Had VELOC_MIN_NODES been set to 2, this would still think it has enough nodes to try again.
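The counting problem can be illustrated with a small sketch that counts the hosts in a comma-separated node list while skipping the launch node. This is only an illustration of the filtering idea, using the host names from the log above; the function name count_compute_nodes is hypothetical, and the actual scripts are written in Perl and shell rather than C.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count the hosts in a comma-separated list, skipping every entry equal to
 * `launch_node`. Returns -1 on allocation failure.
 * Hypothetical helper for illustration; not part of the VeloC scripts. */
int count_compute_nodes(const char *nodelist, const char *launch_node)
{
    assert(nodelist != NULL && launch_node != NULL);

    size_t len = strlen(nodelist);
    char *copy = malloc(len + 1);   /* strtok modifies its input */
    if (copy == NULL)
        return -1;
    memcpy(copy, nodelist, len + 1);

    int count = 0;
    for (char *host = strtok(copy, ","); host != NULL; host = strtok(NULL, ",")) {
        if (strcmp(host, launch_node) != 0)
            ++count;
    }
    free(copy);
    return count;
}
```

With the list butte17,butte18,butte5 and butte5 as the launch node, the count is 2, which matches the two nodes that were actually allocated.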

Unable to run the example program

Hi, I am planning to use VELOC in my project, but when I run the bundled example program after installing VELOC, it appears to fail. I run the example with mpirun -np 3 heatdis_mem 2 heatdis.cfg. Here is the output log:

[INFO 0] [/home/Huimin97/soft/VELOC/src/common/config.cpp:68:config_t] using POSIX to interact with persistent storage in single file mode, path: /tmp/persistent
[INFO 0] [/home/Huimin97/soft/VELOC/src/common/config.cpp:68:config_t] using POSIX to interact with persistent storage in single file mode, path: /tmp/persistent
[INFO 0] [/home/Huimin97/soft/VELOC/src/common/config.cpp:68:config_t] using POSIX to interact with persistent storage in single file mode, path: /tmp/persistent
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/ec_module.cpp:15:ec_module_t] EC interval not specified, every checkpoint will be protected using EC
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/ec_module.cpp:21:ec_module_t] Running on a single host, EC deactivated
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/transfer_module.cpp:18:transfer_module_t] Persistence interval not specified, every checkpoint will be persisted
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/chksum_module.cpp:20:chksum_module_t] checksumming active: 1
mpirun: Forwarding signal 24 to job

The configuration file is the one that comes with the test folder; I only changed the mode to sync. I don't know what's wrong. Can you give me some advice?

Use MPI_Exscan to compute offsets?

Just browsing through code, I noticed this pattern:

VELOC/src/lib/client.cpp

Lines 194 to 199 in c857688

long offset = 0, next_offset = file_size(current_ckpt.filename(cfg.get("scratch")));
if (rank > 0)
    MPI_Recv(&offset, 1, MPI_LONG, rank - 1, 0, comm, MPI_STATUS_IGNORE);
next_offset += offset;
if (rank + 1 < no_ranks)
    MPI_Send(&next_offset, 1, MPI_LONG, rank + 1, 0, comm);

I suspect you might be able to replace that code segment with an MPI_Scan or MPI_Exscan, which takes O(log P) time instead of O(P).
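To make the suggestion concrete, here is a serial sketch of the value each rank would obtain from an exclusive scan with a sum reduction: the offset of rank i is the sum of the file sizes of ranks 0 to i-1. This is plain C with no MPI calls, and the function name exclusive_prefix_sum is hypothetical. Note that with MPI_Exscan the receive buffer on rank 0 is undefined, so real code would have to set the offset on rank 0 to 0 explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Serial illustration of an exclusive prefix sum: offsets[i] is the sum of
 * sizes[0..i-1]. Index 0 gets 0 here by construction. */
void exclusive_prefix_sum(const long *sizes, long *offsets, size_t n)
{
    assert(n == 0 || (sizes != NULL && offsets != NULL));

    long running = 0;               /* sum of all sizes seen so far */
    for (size_t i = 0; i < n; ++i) {
        offsets[i] = running;       /* excludes the element's own size */
        running += sizes[i];
    }
}
```

For per-rank file sizes of 100, 250, 50 and 10 bytes, the resulting offsets are 0, 100, 350 and 400, the same values the send/receive chain above produces.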

Node down, VeloC XOR restart on the new allocated node

Hi,

I'm testing VeloC restart capability after a single node's failure (node down). For this reason, I created a test job that checkpoints its data periodically on the node's local storage (/tmp) using the VeloC library. I also configured VeloC to protect the data by erasure coding (ec_interval = 0).

To test the restart capability, after the job has computed for a number of steps (iterations, checkpoints), I inject a failure on one of the job's nodes and restart the job with the same number of nodes as before. The job then runs on the same set of nodes as the first run, except that the failed node is replaced by a newly allocated one.

Here I would expect the checkpoint of the failed node to be rebuilt from the EC data and loaded onto the newly allocated node. However, as far as I can see, this functionality is not available in the VeloC library, and the job restarts from the beginning. This leaves the EC data useless if a node goes down.

However, I managed to successfully restart the job from the local checkpoints if all nodes are alive and the job is restarted on the exact same set of the nodes as the first run.

To investigate further, I compared the VeloC source code with the SCR library (with which I successfully restarted the job using XOR). If I'm correct, the difference is that in SCR, SCR_Init calls scr_cache_rebuild, which writes the data onto the newly allocated node at the start of the second run (after the failure), before SCR_Have_restart and SCR_Start_restart are called. As far as I can see, this is not implemented in the VeloC library.

So please let me know if I'm missing something. Otherwise, is this going to be available in the next release?

Component releases for Veloc v1.7

Tracking issue for component releases.

VeloC and MPI IO

Hi,
Just a beginner question: is it possible to adapt a code that uses MPI collective I/O for its checkpoint files to VeloC?
Thanks
Denis

restart-in-place: detect halt file from library to know when to stop restarting

Without knowing otherwise, the scripts will assume the job must always be restarted, including the case that the job actually ran to completion. To avoid having the scripts auto-restart the job, they need to know that the job ended on purpose.

Note that it's not sufficient to use the exit code of the launch command because some jobs return a non-zero exit code to indicate various info -- e.g., maybe the calculation went bad.

With SCR, we ended up writing a "halt" file in SCR_Finalize, and then we look for that "halt" file in the scripts. If we see it, we assume the job completed and we won't try to restart it. If there is no file, the scripts will try to restart the job.
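The halt-file idea can be sketched as a pair of helpers: one that creates an empty marker file when the job finishes on purpose, and one that checks for it before deciding to restart. This is a minimal sketch under stated assumptions; the function names are hypothetical, the path is left to the caller, and nothing here reflects the actual file name or format SCR uses.

```c
#include <assert.h>
#include <stdio.h>

/* Create an empty marker file; returns 0 on success, -1 on failure.
 * The path is chosen by the caller; no particular file name is implied. */
int write_halt_file(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    return fclose(f) == 0 ? 0 : -1;
}

/* Returns 1 if the marker file can be opened for reading, 0 otherwise. */
int halt_file_exists(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}
```

A restart script would then skip the relaunch whenever the marker is present, regardless of the exit code of the launch command.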

example: function call within assert

I checked your library and looked into your test cases to get a feeling for how the library should be used.
The tests use the pattern assert(VELOC_*()). This is dangerous: when the code is compiled with -DNDEBUG, assert() expands to nothing, so the function call inside it is never executed.

example:

VELOC/test/heatdis_fault.cpp

Line 123 in 4a8ec34

assert(VELOC_Restart("heatdis", v) == VELOC_SUCCESS);

VELOC/test/heatdis_fault.cpp

Line 138 in 4a8ec34

assert(VELOC_Checkpoint("heatdis", i) == VELOC_SUCCESS);

I know that the code I pointed to is the test case of VELOC but users will most likely start with copy-pasting from existing code and therefore propagate the issue into their own codebase.
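A safer pattern is to make the call unconditionally, store the result, and check it in a separate statement. The sketch below uses a stand-in function, example_checkpoint, in place of any VELOC call; that function, the call counter and EXAMPLE_SUCCESS are all hypothetical and exist only to keep the example self-contained.

```c
#include <assert.h>
#include <stdio.h>

#define EXAMPLE_SUCCESS 0

static int example_calls = 0;

/* Stand-in for a library call with side effects (hypothetical, not VELOC). */
int example_checkpoint(const char *name, int version)
{
    (void)name;
    (void)version;
    ++example_calls;
    return EXAMPLE_SUCCESS;
}

int example_call_count(void) { return example_calls; }

/* The call always runs; the error is reported instead of asserted. */
int checked_checkpoint(const char *name, int version)
{
    int rc = example_checkpoint(name, version);
    if (rc != EXAMPLE_SUCCESS)
        fprintf(stderr, "checkpoint %s (version %d) failed: %d\n", name, version, rc);
    return rc;
}

/* If assert is still wanted, keep only the comparison inside it. */
void checkpoint_or_abort(const char *name, int version)
{
    int rc = example_checkpoint(name, version);
    assert(rc == EXAMPLE_SUCCESS);
    (void)rc;   /* avoids an unused-variable warning under NDEBUG */
}
```

In both variants, defining NDEBUG can remove at most the check, never the call itself.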

VELOC install 64 bit libraries in /usr/lib instead of /usr/lib64

Too many install paths are hardcoded. Please let the user override them with CMAKE_INSTALL_LIBDIR:
https://github.com/ECP-VeloC/VELOC/blob/master/src/modules/CMakeLists.txt#L12
https://github.com/ECP-VeloC/VELOC/blob/master/src/lib/CMakeLists.txt#L13
https://github.com/ECP-VeloC/VELOC/blob/master/src/backend/CMakeLists.txt#L12

SLURM restart-in-place script double counts down node

When testing veloc_srun on SLURM, on back-to-back runs after a node was already taken down, the second run ended up double counting the same downed node in down_nodes.

Unfortunately I don't have the output from this test as it was done on a different machine.

SCR-to-VELOC differences

The VELOC API is missing some semantics needed for SCR. Most of these can be worked around, but I'll build a list to record where we stand:

No support for non-checkpoint output sets, e.g., SCR_Start_output. VELOC assumes each output set is a checkpoint.
No ability for app to ask when to checkpoint, i.e., SCR_Need_checkpoint
No ability for app to ask whether it should exit, i.e., SCR_Should_exit
Route_file also renames file whereas SCR keeps the same file name and only changes the path
Because veloc does not return checkpoint name to application, app must track a name-to-id map in an external file, so this map may become out of sync with checkpoints that are actually available

Program not finishing in async mode

Hi,

I'm testing VeloC with a heatdis example using the single-mode (using VELOC_Init_single) option.
My cfg file contains:

scratch = /tmp/scratch
persistent = /tmp/persistent
mode = async

I'm using MPICH 3.3.2 and the VeloC 1.4 release on a single machine, and I'm not launching veloc-backend before running my program.

The issue is that when I let the VeloC library start the backend by itself, my program doesn't finish (I think it gets stuck in VELOC_Finalize). The backend log looks normal.

If I start the backend before running the program everything goes fine.

Any idea of what is going on?

VELOC_Route_file returns empty string during restart

With the file-based method, during a restart, the app calls VELOC_Route_file to get the path to each of its checkpoint files. However, veloc currently returns "" in this case, due to a check on whether the library is in an active checkpoint state.

[ERROR 804727076878] [.../src/lib/client.cpp:145:route_file] must call checkpoint_begin() first

We need to also enable this to work during restart, i.e. for the following sequence,

VELOC_Init()
VELOC_Restart_test()
VELOC_Restart_begin()
VELOC_Route_file()
VELOC_Restart_end()

Interop with GPU compute kernels

Is there a plan for veloc to support a direct HIP/CUDA interface?

Fortran 90 bindings to VeloC?

Are Fortran bindings to VeloC already available?

SLURM restart-in-place script hangs when forcing prolog on down node

If a node is in the allocation but is down (i.e., in down_nodes), this causes a hang when attempting to run on the down node.

VELOC/scripts/SLURM/veloc_srun.in

Lines 57 to 59 in 4144d92

# NOP srun to force every node to run prolog to delete files from cache
# TODO: remove this if admins find a better place to clear cache
srun /bin/hostname > /dev/null

restart-in-place: copy cray aprun variant from scr

restart-in-place: test paths to node-local SSDs

When we had problems with node-local SSDs failing, we added scripting to run tests against the SSDs on each node. For that, the scripts have to know the path to the SSDs to be tested. The library often knows this, so again this kind of info could be stored to a file. In SCR, we had the scripts read the same config files as the library to identify the set of storage paths that were used. For now, I've commented these tests out of the scripts.

restart-in-place: record number of nodes used in first run, so restart logic knows whether enough healthy nodes exist

To know whether there are enough nodes left, it's useful to have the first job that runs record the number of nodes it used in a file. Then the scripts can process that file to get the number of nodes needed to know whether there are enough nodes for a restart. We can work around that by having the user set a variable or config param stating the number of nodes they need, like VELOC_MIN_NODES. However, it's nice to automate this, since it's one less setting for the user.

VeloC should target Boost 1.53 instead of 1.60

It would be much easier for users if VeloC targeted Boost 1.53 instead of 1.60. 1.53 is included with RHEL 7 & CentOS 7, which is what a lot of sites are going to be running. Is there something absolutely essential to 1.60 that we can't get with 1.53?

$   cmake -DCMAKE_BUILD_TYPE=Debug -DWITH_KVTREE_PREFIX=`pwd`/install -DWITH_AXL_PREFIX=`pwd`/install .
CMake Error at /usr/tce/packages/cmake/cmake-3.9.2/share/cmake-3.9/Modules/FindBoost.cmake:1878 (message):
  Unable to find the requested Boost libraries.

  Boost version: 1.53.0

  Boost include path: /usr/include

  Detected version of Boost is too old.  Requested version was 1.60 (or
  newer).
