ecp-veloc / veloc

Very-Low Overhead Checkpointing System

Home Page: http://veloc.rtfd.io

License: MIT License

CMake 4.99% Shell 1.00% C 12.59% C++ 76.47% Python 4.94%
checkpointing checkpoint-restart async-storage

veloc's People

Contributors

adammoody, bnicolae, camstan, gonsie, kosinovsky, matthew-whitlock, nmm0, philmiller, rinkug, tonyhutter

veloc's Issues

warning: strncpy() specified bound 50 equals destination size

I'm seeing this warning on Fedora 28 with GCC 8.2.1:

[  5%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o
[ 10%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o
[ 15%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o
[ 20%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/client_aggregator.cpp.o
[ 25%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/ec_module.cpp.o
[ 30%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/__/common/config.cpp.o
In file included from /home/hutter/veloc/src/common/config.hpp:4,
                 from /home/hutter/veloc/src/common/config.cpp:1:
In function ‘char* strncpy0(char*, const char*, size_t)’,
    inlined from ‘int ini_parse_stream(ini_reader, void*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:234:25,
    inlined from ‘int ini_parse_file(FILE*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:283:28,
    inlined from ‘int ini_parse(const char*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:295:27,
    inlined from ‘INIReader::INIReader(std::__cxx11::string)’ at /home/hutter/veloc/src/common/INIReader.h:370:23,
    inlined from ‘config_t::config_t(const string&)’ at /home/hutter/veloc/src/common/config.cpp:20:72:
/home/hutter/veloc/src/common/INIReader.h:163:12: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 50 equals destination size [-Wstringop-truncation]
     strncpy(dest, src, size);
     ~~~~~~~^~~~~~~~~~~~~~~~~
In function ‘char* strncpy0(char*, const char*, size_t)’,
    inlined from ‘int ini_parse_stream(ini_reader, void*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:257:25,
    inlined from ‘int ini_parse_file(FILE*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:283:28,
    inlined from ‘int ini_parse(const char*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:295:27,
    inlined from ‘INIReader::INIReader(std::__cxx11::string)’ at /home/hutter/veloc/src/common/INIReader.h:370:23,
    inlined from ‘config_t::config_t(const string&)’ at /home/hutter/veloc/src/common/config.cpp:20:72:
/home/hutter/veloc/src/common/INIReader.h:163:12: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 50 equals destination size [-Wstringop-truncation]
     strncpy(dest, src, size);
     ~~~~~~~^~~~~~~~~~~~~~~~~
[ 35%] Linking CXX shared library libveloc-modules.so
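
GCC emits -Wstringop-truncation when the strncpy() bound equals the destination size, because the result may be left without a terminating NUL. One possible way to silence it is to avoid strncpy() altogether. The following is a minimal sketch, not the project's actual fix; the name strncpy0_sketch is hypothetical and only mirrors the strncpy0 helper named in the log.

```c
#include <stddef.h>

/* Hypothetical stand-in for strncpy0(): copies at most size - 1 bytes
 * and always NUL-terminates, so no strncpy() bound is involved. */
char *strncpy0_sketch(char *dest, const char *src, size_t size)
{
    size_t i = 0;

    if (size == 0)
        return dest;    /* no room to write anything, not even '\0' */

    /* Stop one byte early so the terminator always fits. */
    for (; i + 1 < size && src[i] != '\0'; i++)
        dest[i] = src[i];
    dest[i] = '\0';
    return dest;
}
```

Unlike strncpy(), this variant does not pad the remainder of the buffer with zeros, which should not matter for a parser that treats the buffer as a C string.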

Alternative to OpenSSL for md5

Currently you're using the libcrypto from OpenSSL for MD5 hashing of files when computing checksums. Since the only use of OpenSSL is for MD5, it's a pretty heavyweight dependency for just that. Would you be open to a PR that brought in a different MD5 implementation in order to reduce the dependency footprint?

Build VELOC as a static library

Some users requested a static build (or at least fixed rpaths). We shall keep this in mind for the next refactoring of the build process.

can't build with AXL 4.0.0

This commit broke veloc

-- Build files have been written to: /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5_build                                                                                                                                         
>>> Source configured.                                                                                                                                                                                                                      
>>> Compiling source in /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 ...                                                                                                                                                        
 * Source directory (CMAKE_USE_DIR): "/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5"                                                                                                                                             
 * Build directory  (BUILD_DIR):     "/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5_build"                                                                                                                                       
ninja -v -j12 -l12                                                                                                                                                                                                                          
[1/25] /usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src  -Os -pipe -march=native -frecord-gcc-switches -fPIC   -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/module_manager.cpp
[2/25] /usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src  -Os -pipe -march=native -frecord-gcc-switches -fPIC   -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/client_watchdog.cpp
[3/25] /usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src  -Os -pipe -march=native -frecord-gcc-switches -fPIC   -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp
FAILED: src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o
/usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src  -Os -pipe -march=native -frecord-gcc-switches -fPIC   -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp: In constructor ‘transfer_module_t::transfer_module_t(const config_t&)’:                                                                           
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:54:23: error: too many arguments to function ‘int AXL_Init()’                                                                                      
   54 |     int ret = AXL_Init(NULL);                                                                                 
      |               ~~~~~~~~^~~~~~                       
In file included from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.hpp:12,    
                 from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:1:     
/usr/include/axl.h:58:5: note: declared here                                                                          
   58 | int AXL_Init (void);                               
      |     ^~~~~~~~                                                                                                  
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp: In function ‘int axl_transfer_file(axl_xfer_t, const string&, const string&)’:                                                                    
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:68:24: error: too few arguments to function ‘int AXL_Create(axl_xfer_t, const char*, const char*)’                                                 
   68 |     int id = AXL_Create(type, source.c_str());                                                                
      |              ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~                                                                 
In file included from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.hpp:12,    
                 from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:1:     
/usr/include/axl.h:73:5: note: declared here               
   73 | int AXL_Create (axl_xfer_t xtype, const char* name, const char* state_file);                                  
      |     ^~~~~~~~~~                                     
ninja: build stopped: subcommand failed.                                                                              
 * ERROR: sys-cluster/veloc-1.5-r1::guru failed (compile phase):                                                      
 *   ninja -v -j12 -l12 failed

Change VELOC_MAX_NAME from size_t to a macro?

With VELOC_MAX_NAME in veloc.h defined as a size_t variable rather than a #define, I'm getting the following error when trying to declare an array.

[100%] Building C object src/CMakeFiles/scr_o.dir/scr2veloc.c.o
/usr/workspace/wsb/moody20/projects/scr2veloc/src/scr2veloc.c:24:13: error: variably modified 'current_name' at file scope
 static char current_name[VELOC_MAX_NAME] = "";
             ^~~~~~~~~~~~
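
In C, the bound of a file-scope array must be an integer constant expression, and a const-qualified size_t object does not count as one (in C++ it would). A minimal sketch of two alternatives that do compile in C follows; the identifiers and the value 1024 are placeholders chosen for illustration, not taken from veloc.h.

```c
#include <stddef.h>

/* Placeholder value; the real limit is whatever veloc.h specifies. */
#define MAX_NAME_SKETCH 1024

/* An enumeration constant is also an integer constant expression in C. */
enum { MAX_NAME_ENUM_SKETCH = 1024 };

/* Both declarations are valid at file scope, unlike an array sized by
 * a `static const size_t` variable. */
static char current_name[MAX_NAME_SKETCH] = "";
static char other_name[MAX_NAME_ENUM_SKETCH] = "";

size_t current_name_capacity(void) { return sizeof(current_name); }
size_t other_name_capacity(void)   { return sizeof(other_name); }
```

The enum form keeps the constant visible to a debugger and scoped like a normal identifier, while the macro form is the more common convention in C headers.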

Build fails at linking with undefined reference to `kvtree_xxx` on Cori (NERSC)

Hi all,
I wanted to try VELOC 1.4 on Cori at NERSC, but it fails at the linking phase of veloc-backend, it seems:

$ mkdir ~/veloc-1.4/
$ ./auto-install.py ~/veloc-1.4/
Installing VeloC in /global/homes/c/chiusole/veloc-1.4...
Downloading Boost...
100% [......................................................................] 121849575 / 121849575
Installing KVTree...
Cloning into '/tmp/veloc/KVTree'...

...

Scanning dependencies of target veloc-backend
[ 47%] Building CXX object src/backend/CMakeFiles/veloc-backend.dir/main.cpp.o
[ 52%] Linking CXX executable veloc-backend
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/liber.a(er.c.o): in function `ER_Init':
er.c:(.text+0x2): undefined reference to `kvtree_new'
/usr/bin/ld: er.c:(.text+0xe): undefined reference to `kvtree_new'
/usr/bin/ld: er.c:(.text+0x1c): undefined reference to `redset_init'
/usr/bin/ld: er.c:(.text+0x25): undefined reference to `shuffile_init'
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/liber.a(er.c.o): in function `ER_Finalize':
er.c:(.text+0x57): undefined reference to `kvtree_get'
/usr/bin/ld: er.c:(.text+0x5f): undefined reference to `kvtree_size'
/usr/bin/ld: er.c:(.text+0x89): undefined reference to `kvtree_get'
/usr/bin/ld: er.c:(.text+0x91): undefined reference to `kvtree_size'
/usr/bin/ld: er.c:(.text+0xb9): undefined reference to `kvtree_delete'
/usr/bin/ld: er.c:(.text+0xc5): undefined reference to `kvtree_delete'
/usr/bin/ld: er.c:(.text+0xcc): undefined reference to `shuffile_finalize'
/usr/bin/ld: er.c:(.text+0xd8): undefined reference to `redset_finalize'
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/liber.a(er.c.o): in function `ER_Create_Scheme':
er.c:(.text+0x14b): undefined reference to `redset_create'
/usr/bin/ld: er.c:(.text+0x171): undefined reference to `kvtree_set_kv_int'
/usr/bin/ld: er.c:(.text+0x185): undefined reference to `kvtree_util_set_ptr'
/usr/bin/ld: er.c:(.text+0x1d5): undefined reference to `kvtree_unset_kv_int'
/usr/bin/ld: er.c:(.text+0x1df): undefined reference to `redset_delete'
...

/usr/bin/ld: axl_sync.c:(.text+0x14c): undefined reference to `kvtree_util_set_int'
/usr/bin/ld: axl_sync.c:(.text+0x17a): undefined reference to `kvtree_util_set_int'
/usr/bin/ld: axl_sync.c:(.text+0x1aa): undefined reference to `kvtree_util_set_int'
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/libaxl.a(axl_sync.c.o): in function `axl_sync_wait':
axl_sync.c:(.text+0x1e8): undefined reference to `kvtree_get_kv_int'
/usr/bin/ld: axl_sync.c:(.text+0x1fc): undefined reference to `kvtree_util_get_int'
collect2: error: ld returned 1 exit status
gmake[2]: *** [src/backend/CMakeFiles/veloc-backend.dir/build.make:100: src/backend/veloc-backend] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:204: src/backend/CMakeFiles/veloc-backend.dir/all] Error 2
gmake: *** [Makefile:141: all] Error 2
Installation failed!

I've tried with the default PrgEnv-intel and also after swapping to PrgEnv-gnu, and both get stuck at the same point.
Any idea what the problem may be?
Happy to provide more details.

MPI_Comm_split with uninitialized key value?

I'm working with @kosinovsky to debug an XOR rebuild problem. While looking through the code, this line caught my eye:

MPI_Comm_split(comm, provided == 0 ? 0 : MPI_UNDEFINED, rank, &backends);

I don't think rank has been initialized at this point, which means it could have an arbitrary value. That could then lead to a potential random reordering of rank values in the backends communicator as compared to the parent communicator.

A potential fix would be to replace rank with 0 or otherwise move the MPI_Comm_rank(comm, &rank) higher up in the function.
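
To illustrate why the key matters: MPI_Comm_split orders the ranks of the new communicator by key and breaks ties by rank in the parent communicator. The sketch below models only that ordering rule in plain C, without MPI, so the effect of an arbitrary key can be seen in isolation. The struct and function names are hypothetical.

```c
#include <stdlib.h>

/* One entry per process that joins the new communicator. */
struct member {
    int key;       /* the "key" argument passed to MPI_Comm_split */
    int old_rank;  /* rank in the parent communicator */
};

/* Ordering rule: sort by key, break ties by rank in the parent. */
static int compare_members(const void *a, const void *b)
{
    const struct member *x = a;
    const struct member *y = b;

    if (x->key != y->key)
        return (x->key > y->key) - (x->key < y->key);
    return (x->old_rank > y->old_rank) - (x->old_rank < y->old_rank);
}

/* After the call, members[i].old_rank is the parent rank that would
 * become rank i in the new communicator. */
void split_order(struct member *members, size_t n)
{
    qsort(members, n, sizeof(*members), compare_members);
}
```

With every key set to 0 the parent order is preserved; with arbitrary keys such as 42, -7 and 13 for parent ranks 0, 1 and 2, the new order becomes 1, 2, 0.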

error using test/heatdis example

Trying to restart the test/heatdis_file example from a broken state, I got the following error:

[ERROR 3830768914477] [/u/dbertini/mpiio/veloc/src/lib/client.cpp:145:route_file] must call checkpoint_begin() first

Then the program hangs.
Should one add a call to checkpoint_begin()? If yes, where exactly?

Include Fedora build instructions

The quick start guide (https://veloc.readthedocs.io/en/latest/quick.html) should be updated with more detail on how to build VeloC, including listing all its dependencies. Here's what worked for me on Fedora 28:

sudo yum install -y python3-pip cmake boost boost-devel openmpi-devel
pip3 install wget --user
pip3 install bs4 --user 
module load mpi

git clone -b 'veloc-1.1' --single-branch --depth 1 https://github.com/ECP-VeloC/veloc.git
cd veloc
mkdir build install
./auto-install.py --no-boost install

command.hpp: error: array used as initializer command_t() { }

I'm unable to build VeloC master (using GCC 4.9.3):

$ cmake -DCMAKE_BUILD_TYPE=Debug -DWITH_AXL_PREFIX=`pwd`/install -DWITH_ER_PREFIX=`pwd`/install -DBoost_NO_BOOST_CMAKE=TRUE -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_ROOT=/g/g0/hutter2/boost_1_69_0 .
-- The C compiler identification is GNU 4.9.3
-- The CXX compiler identification is GNU 4.9.3
-- Check for working C compiler: /usr/tcetmp/bin/cc
-- Check for working C compiler: /usr/tcetmp/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/tcetmp/bin/c++
-- Check for working CXX compiler: /usr/tcetmp/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Boost version: 1.69.0
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found MPI_C: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so;/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so  
-- Found MPI_CXX: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so;/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so  
-- Found AXL: /g/g0/hutter2/VELOC/install/lib64/libaxl.a  
-- Found ER: /g/g0/hutter2/VELOC/install/lib64/liber.a;/g/g0/hutter2/VELOC/install/lib64/libkvtree.a;/g/g0/hutter2/VELOC/install/lib64/libredset.a;/g/g0/hutter2/VELOC/install/lib64/libshuffile.a;/g/g0/hutter2/VELOC/install/lib64/librankstr.a;z  
-- Configuring done
-- Generating done
-- Build files have been written to: /g/g0/hutter2/VELOC

$ make
[  5%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o
In file included from /g/g0/hutter2/VELOC/src/modules/module_manager.hpp:4:0,
                from /g/g0/hutter2/VELOC/src/modules/module_manager.cpp:1:
/g/g0/hutter2/VELOC/src/common/command.hpp: In constructor 'command_t::command_t()':
/g/g0/hutter2/VELOC/src/common/command.hpp:16:17: error: array used as initializer
    command_t() { }
                ^
/g/g0/hutter2/VELOC/src/common/command.hpp:16:17: error: array used as initializer
/g/g0/hutter2/VELOC/src/common/command.hpp: In constructor 'command_t::command_t(int, int, int, const string&)':
/g/g0/hutter2/VELOC/src/common/command.hpp:17:95: error: array used as initializer
    command_t(int r, int c, int v, const std::string &s) : unique_id(r), command(c), version(v) {
                                                                                              ^
/g/g0/hutter2/VELOC/src/common/command.hpp:17:95: error: array used as initializer
make[2]: *** [src/modules/CMakeFiles/veloc-modules.dir/build.make:63: src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:110: src/modules/CMakeFiles/veloc-modules.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

Down node detection on LSF has wrong node count

Issue

LSB_DJOB_HOSTFILE also contains the launch node.

my $hostfile = $ENV{LSB_DJOB_HOSTFILE};

When it is used to generate a nodelist without filtering out the launch node:

nodelist=`$bindir/veloc_env --nodes`
if [ $? -eq 0 ] ; then
VELOC_NODELIST=$nodelist

VeloC counts one node more than was actually allocated, both for the number of nodes needed and for the number of nodes available in the event of a restart.

How to Replicate

Configure JSM to allow jsrun to launch new jobs after a node failure in an allocation by setting FAULT_TOLERANCE=1 in a private ~/.jsm.conf file:

# Create a custom jsm.conf file to be used in your jobs
cp /opt/ibm/spectrum_mpi/jsm_pmix/etc/jsm.conf ~/.jsm.conf
 
# Then modify your ~/.jsm.conf to uncomment
FAULT_TOLERANCE = 1

Allocate two nodes and run veloc_jsrun to identify them:

user@butte5 bin $ ./veloc_jsrun -r 1 ./mpihostname
veloc_jsrun: Started: Thu Apr  4 14:28:23 PDT 2019
veloc_jsrun: RUN 1: Thu Apr  4 14:28:23 PDT 2019
0 of 2 on butte17
1 of 2 on butte18
veloc_jsrun: $VELOC_RUNS exhausted, ending run.
veloc_jsrun: Ended: Thu Apr  4 14:28:23 PDT 2019

Then run again on a single node to identify which is the default:

user@butte5 bin $ ./veloc_jsrun -r 1 -n1 ./mpihostname
veloc_jsrun: Started: Thu Apr  4 14:49:41 PDT 2019
veloc_jsrun: RUN 1: Thu Apr  4 14:49:41 PDT 2019
0 of 1 on butte17
veloc_jsrun: $VELOC_RUNS exhausted, ending run.
veloc_jsrun: Ended: Thu Apr  4 14:49:41 PDT 2019

Have someone with superpowers kill that node for you and wait at least 1 minute, if not more, before running again. There is a jsrun bug that will cause a hang if you try too soon. While waiting, set the debug variable to see what is happening.

export VELOC_DEBUG=1

After at least 1 minute, run again:

user@butte5 bin $ ./veloc_jsrun -r 1 -n1 ./mpihostname
+ ⋮
veloc_jsrun: Started: Thu Apr  4 15:03:58 PDT 2019
+ ⋮
++ /<bindir>/veloc_env --nodes
+ nodelist=butte17,butte18,butte5
+ '[' 0 -eq 0 ']'
+ VELOC_NODELIST=butte17,butte18,butte5
+ '[' -z butte17,butte18,butte5 ']'
+ export VELOC_NODELIST
+ ⋮
++ /<bindir>/veloc_list_down_nodes --free
butte17: ssh: connect to host butte17 port 22: Connection timed out
pdsh@butte5: butte17: ssh exited with exit code 255
+ down_nodes=butte17
+ '[' butte17 '!=' '' ']'
+ /<bindir>/veloc_list_down_nodes --free --reason
butte17: ssh: connect to host butte17 port 22: Connection timed out
pdsh@butte5: butte17: ssh exited with exit code 255
butte17: Failed to pdsh echo UP
+ ⋮
NNODES=-1 RUNTIME=0 FAILED=butte17'
+ ⋮
++ /<bindir>/veloc_glob_hosts --count --hosts butte17,butte18,butte5
+ num_needed=3
+ '[' -n 1 ']'
+ num_needed=1
++ /<bindir>/veloc_glob_hosts --count --minus butte17,butte18,butte5:butte17
+ num_left=2
+ '[' 2 -lt 1 ']'
+ exclude_hosts=--exclude_hosts=butte17
+ ⋮
veloc_jsrun: RUN 1: Thu Apr  4 15:04:19 PDT 2019
+ jsrun --exclude_hosts=butte17 -n 1 -r 1 ./mpihostname
0 of 1 on butte18
+ ⋮
veloc_jsrun: $VELOC_RUNS exhausted, ending run.
+ ⋮
veloc_jsrun: Ended: Thu Apr  4 15:04:19 PDT 2019

Notice the launch node is included in the VELOC_NODELIST and that num_needed=3 originally and num_left=2 after the failed node was accounted for. But we only allocated 2 nodes to begin with.

Had VELOC_MIN_NODES been set to 2, this would still think it has enough nodes to try again.
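
The counting problem can be sketched as follows. The actual scripts are Perl and shell; this C version only illustrates the idea of skipping the launch node while counting hosts. The function name and the assumption that the launch node's host name is known to the caller are both hypothetical.

```c
#include <string.h>

/* Counts the hosts in a comma-separated list, skipping every entry
 * equal to launch_node. Hypothetical helper for illustration. */
int count_compute_nodes(const char *nodelist, const char *launch_node)
{
    int count = 0;
    const char *p = nodelist;
    size_t launch_len = strlen(launch_node);

    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the current host name */

        /* Count non-empty entries that are not the launch node. */
        if (len > 0 &&
            !(len == launch_len && strncmp(p, launch_node, len) == 0))
            count++;

        p += len;
        if (*p == ',')
            p++;                        /* step over the separator */
    }
    return count;
}
```

For the node list in the log above, count_compute_nodes("butte17,butte18,butte5", "butte5") yields 2, which matches the two nodes that were actually allocated.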

Unable to run the example program

Hi, I am planning to use VELOC in my project. After installing it, I tried to run the bundled example program with mpirun -np 3 heatdis_mem 2 heatdis.cfg, but it appears to fail. Here is the output log:

[INFO 0] [/home/Huimin97/soft/VELOC/src/common/config.cpp:68:config_t] using POSIX to interact with persistent storage in single file mode, path: /tmp/persistent
[INFO 0] [/home/Huimin97/soft/VELOC/src/common/config.cpp:68:config_t] using POSIX to interact with persistent storage in single file mode, path: /tmp/persistent
[INFO 0] [/home/Huimin97/soft/VELOC/src/common/config.cpp:68:config_t] using POSIX to interact with persistent storage in single file mode, path: /tmp/persistent
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/ec_module.cpp:15:ec_module_t] EC interval not specified, every checkpoint will be protected using EC
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/ec_module.cpp:21:ec_module_t] Running on a single host, EC deactivated
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/transfer_module.cpp:18:transfer_module_t] Persistence interval not specified, every checkpoint will be persisted
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/chksum_module.cpp:20:chksum_module_t] checksumming active: 1
mpirun: Forwarding signal 24 to job

The configuration file is the one that ships in the test folder; I only changed the mode to sync. I don't know what is wrong. Can you give me some advice?

Use MPI_Exscan to compute offsets?

Just browsing through code, I noticed this pattern:

VELOC/src/lib/client.cpp

Lines 194 to 199 in c857688

long offset = 0, next_offset = file_size(current_ckpt.filename(cfg.get("scratch")));
if (rank > 0)
MPI_Recv(&offset, 1, MPI_LONG, rank - 1, 0, comm, MPI_STATUS_IGNORE);
next_offset += offset;
if (rank + 1 < no_ranks)
MPI_Send(&next_offset, 1, MPI_LONG, rank + 1, 0, comm);

I suspect you might be able to replace that code segment with an MPI_Scan or MPI_Exscan, which takes O(log P) time instead of O(P).
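
As a sanity check of the equivalence, the sketch below models both approaches serially in plain C, without MPI: the send/recv chain and an exclusive prefix sum over the per-rank file sizes produce the same offsets. The function names are hypothetical. In real MPI code each rank would call MPI_Exscan with MPI_SUM on its own file size; note that MPI leaves the receive buffer on rank 0 undefined, so rank 0 would need to set its offset to 0 explicitly.

```c
#include <stddef.h>

/* Offsets as computed by the send/recv chain: rank r waits for the
 * running total from rank r - 1, then forwards its own total. */
void chain_offsets(const long *sizes, size_t nranks, long *offsets)
{
    long next_offset = 0;

    for (size_t r = 0; r < nranks; r++) {
        offsets[r] = next_offset;             /* "received" from rank r - 1 */
        next_offset = offsets[r] + sizes[r];  /* "sent" to rank r + 1 */
    }
}

/* Offsets as an exclusive prefix sum:
 * offsets[r] = sizes[0] + ... + sizes[r - 1], and offsets[0] = 0. */
void exscan_offsets(const long *sizes, size_t nranks, long *offsets)
{
    for (size_t r = 0; r < nranks; r++) {
        long sum = 0;

        for (size_t i = 0; i < r; i++)
            sum += sizes[i];
        offsets[r] = sum;
    }
}
```

For file sizes 10, 20, 30 and 5, both functions produce the offsets 0, 10, 30 and 60.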

Node down, VeloC XOR restart on the new allocated node

Hi,

I'm testing VeloC restart capability after a single node's failure (node down). For this reason, I created a test job that checkpoints its data periodically on the node's local storage (/tmp) using the VeloC library. I also configured VeloC to protect the data by erasure coding (ec_interval = 0).

To test the restart capability, after the job has computed a number of steps (iterations, checkpoints), I inject a failure on one of the job's nodes and restart the job with the same number of nodes as before. The job then runs on the same set of nodes as the first run, except that a newly allocated node replaces the failed one.

Here I would expect the checkpoint of the failed node to be rebuilt from the EC data and loaded onto the newly allocated node. However, as far as I can see, this functionality is not available in the VeloC library, and the job restarts from the beginning. This makes the EC data useless if a node goes down.

However, I managed to successfully restart the job from the local checkpoints if all nodes are alive and the job is restarted on the exact same set of nodes as in the first run.

To investigate further, I compared the VeloC source code with the SCR library (with which I successfully restarted the job using XOR). If I'm correct, the difference is that in SCR's SCR_Init function, the scr_cache_rebuild call writes the data to the newly allocated node at the start of the second run (after the failure), before SCR_Have_restart and SCR_Start_restart are called. As far as I can see, this is not implemented in the VeloC library.

Please let me know if I'm missing something. Otherwise, will this be available in the next release?

VeloC and MPI IO

Hi,
Just a beginner question: is it possible to adapt a code that uses MPI collective I/O for its checkpoint files to VeloC?
Thanks,
Denis

restart-in-place: detect halt file from library to know when to stop restarting

Without knowing otherwise, the scripts will assume the job must always be restarted, including the case that the job actually ran to completion. To avoid having the scripts auto-restart the job, they need to know that the job ended on purpose.

Note that it's not sufficient to use the exit code of the launch command because some jobs return a non-zero exit code to indicate various info -- e.g., maybe the calculation went bad.

With SCR, we ended up writing a "halt" file in SCR_Finalize, and then we look for that "halt" file in the scripts. If we see it, we assume the job completed and we won't try to restart it. If there is no file, the scripts will try to restart the job.
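
The halt-file idea can be sketched as a pair of helpers: one that the library would call at finalize time to create a marker, and one that mirrors the existence check the scripts would perform. This is only an illustration; the function names are hypothetical, and the real scripts would do the check in shell rather than C.

```c
#include <stdio.h>

/* Creates an empty marker file. Returns 0 on success, -1 on failure. */
int write_halt_file(const char *path)
{
    FILE *f = fopen(path, "w");

    if (f == NULL)
        return -1;
    return fclose(f) == 0 ? 0 : -1;
}

/* Returns 1 if the marker exists and can be opened, 0 otherwise. */
int halt_file_exists(const char *path)
{
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}
```

A restart loop would then stop as soon as halt_file_exists() returns 1, independent of the exit code of the launch command.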

example: function call within assert

I checked your library and looked into your test case to get a feeling of how your library should be used.
The tests use the pattern assert(VELOC_*()). This is dangerous: when the code is compiled with -DNDEBUG, assert() expands to a no-op, so the function call inside it is never executed.

example:

assert(VELOC_Restart("heatdis", v) == VELOC_SUCCESS);

assert(VELOC_Checkpoint("heatdis", i) == VELOC_SUCCESS);

I know that the code I pointed to is the test case of VELOC but users will most likely start with copy-pasting from existing code and therefore propagate the issue into their own codebase.
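
A safer pattern is to make the call unconditionally and check its result separately. The sketch below uses stand-in functions so that it compiles without VELOC; fake_checkpoint and SKETCH_SUCCESS are hypothetical placeholders for VELOC_Checkpoint and VELOC_SUCCESS.

```c
#include <stdio.h>

#define SKETCH_SUCCESS 0   /* placeholder for VELOC_SUCCESS */

static int calls = 0;

/* Stand-in for a call with side effects, such as VELOC_Checkpoint(). */
int fake_checkpoint(const char *name, int version)
{
    (void)name;
    (void)version;
    calls++;
    return SKETCH_SUCCESS;
}

int fake_checkpoint_calls(void) { return calls; }

/* The call always runs, whether or not NDEBUG is defined; only the
 * error handling that follows inspects the result. */
int checkpoint_checked(const char *name, int version)
{
    int rc = fake_checkpoint(name, version);

    if (rc != SKETCH_SUCCESS) {
        fprintf(stderr, "checkpoint %s (version %d) failed\n", name, version);
        return rc;
    }
    return SKETCH_SUCCESS;
}
```

If an assertion is still wanted for debug builds, it can be applied to the stored result, for example assert(rc == SKETCH_SUCCESS), since that expression has no side effects.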

SLURM restart-in-place script double counts down node

When testing veloc_srun on SLURM, on back-to-back runs after a node was already taken down, the second run ended up double counting the same downed node in down_nodes.

Unfortunately I don't have the output from this test as it was done on a different machine.

SCR-to-VELOC differences

The VELOC API is missing some semantics needed for SCR. Most of these can be worked around, but I'll build a list to record where we stand:

  1. No support for non-checkpoint output sets, e.g., SCR_Start_output. VELOC assumes each output set is a checkpoint.
  2. No ability for app to ask when to checkpoint, i.e., SCR_Need_checkpoint
  3. No ability for app to ask whether it should exit, i.e., SCR_Should_exit
  4. Route_file also renames file whereas SCR keeps the same file name and only changes the path
  5. Because veloc does not return checkpoint name to application, app must track a name-to-id map in an external file, so this map may become out of sync with checkpoints that are actually available

Program not finishing in async mode

Hi,

I'm testing VeloC with a heatdis example using the single-mode (using VELOC_Init_single) option.
My cfg file contains:

scratch = /tmp/scratch
persistent = /tmp/persistent
mode = async

I'm using MPICH version 3.3.2 and the VeloC 1.4 release on a single machine. I'm not launching veloc-backend before running my program.

The issue is that when I run my program and let the VeloC library start the backend by itself, my program doesn't finish (I think it gets stuck in the VELOC_Finalize function). The backend log seems to be normal.

If I start the backend before running the program everything goes fine.

Any idea of what is going on?

VELOC_Route_file returns empty string during restart

With the file-based method, during a restart, the app calls VELOC_Route_file to get the path to each of its checkpoint files. However, veloc currently returns "" in this case, due to a check as to whether the library is in an active checkpoint state.

[ERROR 804727076878] [.../src/lib/client.cpp:145:route_file] must call checkpoint_begin() first

We need to also enable this to work during restart, i.e. for the following sequence,

VELOC_Init()
VELOC_Restart_test()
VELOC_Restart_begin()
VELOC_Route_file()
VELOC_Restart_end()

restart-in-place: test paths to node-local SSDs

When we had problems with node-local SSDs failing, we added scripting to run tests against the SSDs on each node. For that, the scripts have to know the path to the SSDs to be tested. The library often knows this, so again this kind of info could be stored to a file. In SCR, we had the scripts read the same config files as the library to identify the set of storage paths that were used. For now, I've commented these tests out of the scripts.

restart-in-place: record number of nodes used in first run, so restart logic knows whether enough healthy nodes exist

To know whether there are enough nodes left, it's useful to have the first job that runs record the number of nodes it used in a file. Then the scripts can process that file to get the number of nodes needed to know whether there are enough nodes for a restart. We can work around that by having the user set a variable or config param stating the number of nodes they need, like VELOC_MIN_NODES. However, it's nice to automate this, since it's one less setting for the user.

VeloC should target Boost 1.53 instead of 1.60

It would be much easier for users if VeloC targeted Boost 1.53 instead of 1.60. 1.53 is included with RHEL 7 & CentOS 7, which is what a lot of sites are going to be running. Is there something absolutely essential to 1.60 that we can't get with 1.53?

$   cmake -DCMAKE_BUILD_TYPE=Debug -DWITH_KVTREE_PREFIX=`pwd`/install -DWITH_AXL_PREFIX=`pwd`/install .
CMake Error at /usr/tce/packages/cmake/cmake-3.9.2/share/cmake-3.9/Modules/FindBoost.cmake:1878 (message):
  Unable to find the requested Boost libraries.

  Boost version: 1.53.0

  Boost include path: /usr/include

  Detected version of Boost is too old.  Requested version was 1.60 (or
  newer).
