ecp-veloc / veloc
Very-Low Overhead Checkpointing System
Home Page: http://veloc.rtfd.io
License: MIT License
I'm seeing this warning on Fedora 28 with GCC 8.2.1:
[ 5%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o
[ 10%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o
[ 15%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o
[ 20%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/client_aggregator.cpp.o
[ 25%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/ec_module.cpp.o
[ 30%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/__/common/config.cpp.o
In file included from /home/hutter/veloc/src/common/config.hpp:4,
from /home/hutter/veloc/src/common/config.cpp:1:
In function ‘char* strncpy0(char*, const char*, size_t)’,
inlined from ‘int ini_parse_stream(ini_reader, void*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:234:25,
inlined from ‘int ini_parse_file(FILE*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:283:28,
inlined from ‘int ini_parse(const char*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:295:27,
inlined from ‘INIReader::INIReader(std::__cxx11::string)’ at /home/hutter/veloc/src/common/INIReader.h:370:23,
inlined from ‘config_t::config_t(const string&)’ at /home/hutter/veloc/src/common/config.cpp:20:72:
/home/hutter/veloc/src/common/INIReader.h:163:12: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 50 equals destination size [-Wstringop-truncation]
strncpy(dest, src, size);
~~~~~~~^~~~~~~~~~~~~~~~~
In function ‘char* strncpy0(char*, const char*, size_t)’,
inlined from ‘int ini_parse_stream(ini_reader, void*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:257:25,
inlined from ‘int ini_parse_file(FILE*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:283:28,
inlined from ‘int ini_parse(const char*, ini_handler, void*)’ at /home/hutter/veloc/src/common/INIReader.h:295:27,
inlined from ‘INIReader::INIReader(std::__cxx11::string)’ at /home/hutter/veloc/src/common/INIReader.h:370:23,
inlined from ‘config_t::config_t(const string&)’ at /home/hutter/veloc/src/common/config.cpp:20:72:
/home/hutter/veloc/src/common/INIReader.h:163:12: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 50 equals destination size [-Wstringop-truncation]
strncpy(dest, src, size);
~~~~~~~^~~~~~~~~~~~~~~~~
[ 35%] Linking CXX shared library libveloc-modules.so
Currently you're using the libcrypto from OpenSSL for MD5 hashing of files when computing checksums. Since the only use of OpenSSL is for MD5, it's a pretty heavyweight dependency for just that. Would you be open to a PR that brought in a different MD5 implementation in order to reduce the dependency footprint?
Some users requested a static build (or at least fixed rpaths). We shall keep this in mind for the next refactoring of the build process.
This commit broke veloc
-- Build files have been written to: /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5_build
>>> Source configured.
>>> Compiling source in /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 ...
* Source directory (CMAKE_USE_DIR): "/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5"
* Build directory (BUILD_DIR): "/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5_build"
ninja -v -j12 -l12
[1/25] /usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src -Os -pipe -march=native -frecord-gcc-switches -fPIC -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/module_manager.cpp
[2/25] /usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src -Os -pipe -march=native -frecord-gcc-switches -fPIC -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/client_watchdog.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/client_watchdog.cpp
[3/25] /usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src -Os -pipe -march=native -frecord-gcc-switches -fPIC -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp
FAILED: src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o
/usr/bin/x86_64-pc-linux-gnu-g++ -D__ASSERT -D__BENCHMARK -D__INFO -Dveloc_modules_EXPORTS -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5 -I/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src -Os -pipe -march=native -frecord-gcc-switches -fPIC -Wall -std=gnu++14 -MD -MT src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -MF src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o.d -o src/modules/CMakeFiles/veloc-modules.dir/transfer_module.cpp.o -c /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp: In constructor ‘transfer_module_t::transfer_module_t(const config_t&)’:
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:54:23: error: too many arguments to function ‘int AXL_Init()’
54 | int ret = AXL_Init(NULL);
| ~~~~~~~~^~~~~~
In file included from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.hpp:12,
from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:1:
/usr/include/axl.h:58:5: note: declared here
58 | int AXL_Init (void);
| ^~~~~~~~
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp: In function ‘int axl_transfer_file(axl_xfer_t, const string&, const string&)’:
/var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:68:24: error: too few arguments to function ‘int AXL_Create(axl_xfer_t, const char*, const char*)’
68 | int id = AXL_Create(type, source.c_str());
| ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
In file included from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.hpp:12,
from /var/tmp/portage/sys-cluster/veloc-1.5-r1/work/VELOC-1.5/src/modules/transfer_module.cpp:1:
/usr/include/axl.h:73:5: note: declared here
73 | int AXL_Create (axl_xfer_t xtype, const char* name, const char* state_file);
| ^~~~~~~~~~
ninja: build stopped: subcommand failed.
* ERROR: sys-cluster/veloc-1.5-r1::guru failed (compile phase):
* ninja -v -j12 -l12 failed
With VELOC_MAX_NAME in veloc.h defined as a size_t variable rather than a #define, I'm getting the following error when trying to declare an array.
[100%] Building C object src/CMakeFiles/scr_o.dir/scr2veloc.c.o
/usr/workspace/wsb/moody20/projects/scr2veloc/src/scr2veloc.c:24:13: error: variably modified 'current_name' at file scope
static char current_name[VELOC_MAX_NAME] = "";
^~~~~~~~~~~~
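For context, C only allows file-scope arrays to be sized by integer constant expressions; a #define or an enum constant works, a size_t variable does not. A minimal sketch of the difference (names are illustrative, not the actual VeloC header):

```c
#include <stddef.h>

/* OK: a macro expands to an integer constant expression */
#define MAX_NAME_MACRO 1024
static char name_from_macro[MAX_NAME_MACRO] = "";

/* An enum constant is also a constant expression, so this works too */
enum { MAX_NAME_ENUM = 1024 };
static char name_from_enum[MAX_NAME_ENUM] = "";

/* A variable, even a const one, is NOT a constant expression in C:
 *
 *   static const size_t VELOC_MAX_NAME = 1024;
 *   static char current_name[VELOC_MAX_NAME];
 *       // error: variably modified 'current_name' at file scope
 */

size_t max_name_len(void) { return sizeof(name_from_macro); }
```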
Hi all,
I wanted to try VELOC 1.4 on Cori at NERSC, but it fails at the linking phase of veloc-backend, it seems:
$ mkdir ~/veloc-1.4/
$ ./auto-install.py ~/veloc-1.4/
Installing VeloC in /global/homes/c/chiusole/veloc-1.4...
Downloading Boost...
100% [......................................................................] 121849575 / 121849575
Installing KVTree...
Cloning into '/tmp/veloc/KVTree'...
...
Scanning dependencies of target veloc-backend
[ 47%] Building CXX object src/backend/CMakeFiles/veloc-backend.dir/main.cpp.o
[ 52%] Linking CXX executable veloc-backend
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/liber.a(er.c.o): in function `ER_Init':
er.c:(.text+0x2): undefined reference to `kvtree_new'
/usr/bin/ld: er.c:(.text+0xe): undefined reference to `kvtree_new'
/usr/bin/ld: er.c:(.text+0x1c): undefined reference to `redset_init'
/usr/bin/ld: er.c:(.text+0x25): undefined reference to `shuffile_init'
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/liber.a(er.c.o): in function `ER_Finalize':
er.c:(.text+0x57): undefined reference to `kvtree_get'
/usr/bin/ld: er.c:(.text+0x5f): undefined reference to `kvtree_size'
/usr/bin/ld: er.c:(.text+0x89): undefined reference to `kvtree_get'
/usr/bin/ld: er.c:(.text+0x91): undefined reference to `kvtree_size'
/usr/bin/ld: er.c:(.text+0xb9): undefined reference to `kvtree_delete'
/usr/bin/ld: er.c:(.text+0xc5): undefined reference to `kvtree_delete'
/usr/bin/ld: er.c:(.text+0xcc): undefined reference to `shuffile_finalize'
/usr/bin/ld: er.c:(.text+0xd8): undefined reference to `redset_finalize'
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/liber.a(er.c.o): in function `ER_Create_Scheme':
er.c:(.text+0x14b): undefined reference to `redset_create'
/usr/bin/ld: er.c:(.text+0x171): undefined reference to `kvtree_set_kv_int'
/usr/bin/ld: er.c:(.text+0x185): undefined reference to `kvtree_util_set_ptr'
/usr/bin/ld: er.c:(.text+0x1d5): undefined reference to `kvtree_unset_kv_int'
/usr/bin/ld: er.c:(.text+0x1df): undefined reference to `redset_delete'
...
/usr/bin/ld: axl_sync.c:(.text+0x14c): undefined reference to `kvtree_util_set_int'
/usr/bin/ld: axl_sync.c:(.text+0x17a): undefined reference to `kvtree_util_set_int'
/usr/bin/ld: axl_sync.c:(.text+0x1aa): undefined reference to `kvtree_util_set_int'
/usr/bin/ld: /global/homes/c/chiusole/veloc-1.4/lib64/libaxl.a(axl_sync.c.o): in function `axl_sync_wait':
axl_sync.c:(.text+0x1e8): undefined reference to `kvtree_get_kv_int'
/usr/bin/ld: axl_sync.c:(.text+0x1fc): undefined reference to `kvtree_util_get_int'
collect2: error: ld returned 1 exit status
gmake[2]: *** [src/backend/CMakeFiles/veloc-backend.dir/build.make:100: src/backend/veloc-backend] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:204: src/backend/CMakeFiles/veloc-backend.dir/all] Error 2
gmake: *** [Makefile:141: all] Error 2
Installation failed!
I've tried with the default PrgEnv-intel and also swapping with PrgEnv-gnu, and both get stuck at the same point.
Any idea what the problem may be?
Happy to provide more details
I'm working with @kosinovsky to debug an XOR rebuild problem. While looking through the code, this line caught my eye:
Line 62 in a5a9b8a
I don't think rank has been initialized at this point, which means it could have an arbitrary value. That could then lead to a potential random reordering of rank values in the backend's communicator as compared to the parent communicator.
A potential fix would be to replace rank with 0, or otherwise move the MPI_Comm_rank(comm, &rank) call higher up in the function.
Power off CORAL nodes, and check that detection / relaunch logic works.
Script logic has already been tested in SCR versions, but we should repeat these tests with the veloc scripts
Trying to restart the test/heatdis_file example from a broken state, I got the following error:
[ERROR 3830768914477] [/u/dbertini/mpiio/veloc/src/lib/client.cpp:145:route_file] must call checkpoint_begin() first
then the program hangs ...
Should one add a call to checkpoint_begin()? If yes, where exactly?
The quick start guide (https://veloc.readthedocs.io/en/latest/quick.html) should be updated with more detail on how to build VeloC, including listing all its dependencies. Here's what worked for me on Fedora 28:
sudo yum install -y python3-pip cmake boost boost-devel openmpi-devel
pip3 install wget --user
pip3 install bs4 --user
module load mpi
git clone -b 'veloc-1.1' --single-branch --depth 1 https://github.com/ECP-VeloC/veloc.git
cd veloc
mkdir build install
./auto-install.py --no-boost install
I'm unable to build VeloC master (using GCC 4.9.3):
$ cmake -DCMAKE_BUILD_TYPE=Debug -DWITH_AXL_PREFIX=`pwd`/install -DWITH_ER_PREFIX=`pwd`/install -DBoost_NO_BOOST_CMAKE=TRUE -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_ROOT=/g/g0/hutter2/boost_1_69_0 .
-- The C compiler identification is GNU 4.9.3
-- The CXX compiler identification is GNU 4.9.3
-- Check for working C compiler: /usr/tcetmp/bin/cc
-- Check for working C compiler: /usr/tcetmp/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/tcetmp/bin/c++
-- Check for working CXX compiler: /usr/tcetmp/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Boost version: 1.69.0
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found MPI_C: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so;/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so
-- Found MPI_CXX: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so;/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so
-- Found AXL: /g/g0/hutter2/VELOC/install/lib64/libaxl.a
-- Found ER: /g/g0/hutter2/VELOC/install/lib64/liber.a;/g/g0/hutter2/VELOC/install/lib64/libkvtree.a;/g/g0/hutter2/VELOC/install/lib64/libredset.a;/g/g0/hutter2/VELOC/install/lib64/libshuffile.a;/g/g0/hutter2/VELOC/install/lib64/librankstr.a;z
-- Configuring done
-- Generating done
-- Build files have been written to: /g/g0/hutter2/VELOC
$ make
[ 5%] Building CXX object src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o
In file included from /g/g0/hutter2/VELOC/src/modules/module_manager.hpp:4:0,
from /g/g0/hutter2/VELOC/src/modules/module_manager.cpp:1:
/g/g0/hutter2/VELOC/src/common/command.hpp: In constructor 'command_t::command_t()':
/g/g0/hutter2/VELOC/src/common/command.hpp:16:17: error: array used as initializer
command_t() { }
^
/g/g0/hutter2/VELOC/src/common/command.hpp:16:17: error: array used as initializer
/g/g0/hutter2/VELOC/src/common/command.hpp: In constructor 'command_t::command_t(int, int, int, const string&)':
/g/g0/hutter2/VELOC/src/common/command.hpp:17:95: error: array used as initializer
command_t(int r, int c, int v, const std::string &s) : unique_id(r), command(c), version(v) {
^
/g/g0/hutter2/VELOC/src/common/command.hpp:17:95: error: array used as initializer
make[2]: *** [src/modules/CMakeFiles/veloc-modules.dir/build.make:63: src/modules/CMakeFiles/veloc-modules.dir/module_manager.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:110: src/modules/CMakeFiles/veloc-modules.dir/all] Error 2
make: *** [Makefile:130: all] Error 2
LSB_DJOB_HOSTFILE also contains the launch node.
VELOC/scripts/LSF/veloc_env.in
Line 81 in 7b2eeed
When it is used to generate a nodelist without filtering out the launch node:
VELOC/scripts/LSF/veloc_jsrun.in
Lines 50 to 52 in 7b2eeed
VeloC thinks one more node is needed than actual, and that one more node is available in the event of a restart.
Configure JSM to allow jsrun to launch new jobs after a node failure in an allocation by setting FAULT_TOLERANCE=1 in a private ~/.jsm.conf file:
# Create a custom jsm.conf file to be used in your jobs
cp /opt/ibm/spectrum_mpi/jsm_pmix/etc/jsm.conf ~/.jsm.conf
# Then modify your ~/.jsm.conf to uncomment
FAULT_TOLERANCE = 1
Allocate two nodes and run veloc_jsrun to identify them:
user@butte5 bin $ ./veloc_jsrun -r 1 ./mpihostname
veloc_jsrun: Started: Thu Apr 4 14:28:23 PDT 2019
veloc_jsrun: RUN 1: Thu Apr 4 14:28:23 PDT 2019
0 of 2 on butte17
1 of 2 on butte18
veloc_jsrun: $VELOC_RUNS exhausted, ending run.
veloc_jsrun: Ended: Thu Apr 4 14:28:23 PDT 2019
Then run again on a single node to identify which is the default:
user@butte5 bin $ ./veloc_jsrun -r 1 -n1 ./mpihostname
veloc_jsrun: Started: Thu Apr 4 14:49:41 PDT 2019
veloc_jsrun: RUN 1: Thu Apr 4 14:49:41 PDT 2019
0 of 1 on butte17
veloc_jsrun: $VELOC_RUNS exhausted, ending run.
veloc_jsrun: Ended: Thu Apr 4 14:49:41 PDT 2019
Have someone with superpowers kill that node for you and wait at least 1 minute, if not more, before running again. There is a jsrun bug that will cause a hang if you try too soon. While waiting, set the debug variable to see what is happening.
export VELOC_DEBUG=1
After at least 1 minute, run again:
user@butte5 bin $ ./veloc_jsrun -r 1 -n1 ./mpihostname
+ ⋮
veloc_jsrun: Started: Thu Apr 4 15:03:58 PDT 2019
+ ⋮
++ /<bindir>/veloc_env --nodes
+ nodelist=butte17,butte18,butte5
+ '[' 0 -eq 0 ']'
+ VELOC_NODELIST=butte17,butte18,butte5
+ '[' -z butte17,butte18,butte5 ']'
+ export VELOC_NODELIST
+ ⋮
++ /<bindir>/veloc_list_down_nodes --free
butte17: ssh: connect to host butte17 port 22: Connection timed out
pdsh@butte5: butte17: ssh exited with exit code 255
+ down_nodes=butte17
+ '[' butte17 '!=' '' ']'
+ /<bindir>/veloc_list_down_nodes --free --reason
butte17: ssh: connect to host butte17 port 22: Connection timed out
pdsh@butte5: butte17: ssh exited with exit code 255
butte17: Failed to pdsh echo UP
+ ⋮
NNODES=-1 RUNTIME=0 FAILED=butte17'
+ ⋮
++ /<bindir>/veloc_glob_hosts --count --hosts butte17,butte18,butte5
+ num_needed=3
+ '[' -n 1 ']'
+ num_needed=1
++ /<bindir>/veloc_glob_hosts --count --minus butte17,butte18,butte5:butte17
+ num_left=2
+ '[' 2 -lt 1 ']'
+ exclude_hosts=--exclude_hosts=butte17
+ ⋮
veloc_jsrun: RUN 1: Thu Apr 4 15:04:19 PDT 2019
+ jsrun --exclude_hosts=butte17 -n 1 -r 1 ./mpihostname
0 of 1 on butte18
+ ⋮
veloc_jsrun: $VELOC_RUNS exhausted, ending run.
+ ⋮
veloc_jsrun: Ended: Thu Apr 4 15:04:19 PDT 2019
Notice the launch node is included in the VELOC_NODELIST, and that num_needed=3 originally and num_left=2 after the failed node was accounted for. But we only allocated 2 nodes to begin with.
Had VELOC_MIN_NODES been set to 2, this would still think it has enough nodes to try again.
Hi, I am going to use VELOC in my project, but when I run the example program that comes with it after installing VELOC, there seems to be an error. I use mpirun -np 3 heatdis_mem 2 heatdis.cfg to execute the example program; here is the output log:
[INFO 0] [/home/Huimin97/soft/VELOC/src/common/config.cpp:68:config_t] using POSIX to interact with persistent storage in single file mode, path: /tmp/persistent
[INFO 0] [/home/Huimin97/soft/VELOC/src/common/config.cpp:68:config_t] using POSIX to interact with persistent storage in single file mode, path: /tmp/persistent
[INFO 0] [/home/Huimin97/soft/VELOC/src/common/config.cpp:68:config_t] using POSIX to interact with persistent storage in single file mode, path: /tmp/persistent
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/ec_module.cpp:15:ec_module_t] EC interval not specified, every checkpoint will be protected using EC
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/ec_module.cpp:21:ec_module_t] Running on a single host, EC deactivated
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/transfer_module.cpp:18:transfer_module_t] Persistence interval not specified, every checkpoint will be persisted
[INFO 0] [/home/Huimin97/soft/VELOC/src/modules/chksum_module.cpp:20:chksum_module_t] checksumming active: 1
mpirun: Forwarding signal 24 to job
The configuration file comes with the test folder; I just changed the mode to sync. I don't know what's wrong, can you give me some advice?
Just browsing through code, I noticed this pattern:
Lines 194 to 199 in c857688
I suspect you might be able to replace that code segment with an MPI_Scan or MPI_Exscan, which takes O(log P) time instead of O(P).
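For reference, MPI_Exscan with MPI_SUM delivers to each rank the sum of the values held by all lower ranks. A serial sketch of that exclusive-sum semantics (plain C, not the VeloC code; the MPI call computes the same values but across ranks in O(log P) communication steps):

```c
#include <stddef.h>

/* Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
 * This is what rank i receives from MPI_Exscan(..., MPI_SUM, ...),
 * except that MPI computes it across P ranks in O(log P) steps, while
 * a rank-by-rank send/receive chain needs O(P) steps. */
void exclusive_prefix_sum(const long *in, long *out, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = acc;   /* value before adding this element */
        acc += in[i];
    }
}
```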
Hi,
I'm testing VeloC restart capability after a single node's failure (node down). For this reason, I created a test job that checkpoints its data periodically on the node's local storage (/tmp) using the VeloC library. I also configured VeloC to protect the data by erasure coding (ec_interval = 0).
To test the restart capability, after the job does the computation for a number of steps (iterations, checkpoints), I inject a failure on one of the job's nodes and restart the job with the same number of nodes as before. The job will be executed on the same set of nodes as the first run, except that the failed node is replaced by a newly allocated node.
Here I would expect that the checkpoint of the failed node is reconstructed using the EC data and loaded onto the newly allocated node. However, as far as I can see, this functionality is not available in the VeloC library and the job is restarted from the beginning. This leaves the EC data useless if a node goes down.
However, I managed to successfully restart the job from the local checkpoints if all nodes are alive and the job is restarted on the exact same set of the nodes as the first run.
To further investigate the case, I compared the source code of VeloC with the SCR library (for which I successfully restarted the job using XOR). If I'm correct, the difference is that in SCR, in the SCR_Init function, the scr_cache_rebuild call writes the data on the newly allocated node at the start of the second run (after the failure), before the SCR_Have_restart and SCR_Start_restart functions are called. However, this is not implemented in the VeloC library as far as I can see.
So please let me know if I'm missing something. Otherwise, is this going to be available in the next release?
Tracking issue for component releases.
Hi
Just a beginner question: is it possible to adapt a code that uses MPI collective I/O for its checkpoint files to VeloC?
Thanks
Denis
Without knowing otherwise, the scripts will assume the job must always be restarted, including the case that the job actually ran to completion. To avoid having the scripts auto-restart the job, they need to know that the job ended on purpose.
Note that it's not sufficient to use the exit code of the launch command because some jobs return a non-zero exit code to indicate various info -- e.g., maybe the calculation went bad.
With SCR, we ended up writing a "halt" file in SCR_Finalize, and then we look for that "halt" file in the scripts. If we see it, we assume the job completed and we won't try to restart it. If there is no file, the scripts will try to restart the job.
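That handshake can be sketched in a few lines of C (the file name and contents here are illustrative; SCR's actual halt file path and format differ):

```c
#include <stdio.h>

/* Written by the application at the end of a clean finalize. */
int write_halt_file(const char *path) {
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fputs("completed\n", f);
    return fclose(f);
}

/* Checked by the restart scripts before relaunching: a missing halt
 * file means the job may have died and is a candidate for restart. */
int job_completed(const char *path) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;   /* no halt file: try to restart */
    fclose(f);
    return 1;       /* halt file present: the job ended on purpose */
}
```

The point of using a file rather than the launcher's exit code is exactly the one above: the exit code is not a reliable "done" signal, but a file written only from inside a successful finalize is.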
I checked your library and looked into your test case to get a feeling of how your library should be used.
The test is using the pattern assert(VELOC_*()). This is dangerous because, when the code is compiled with -DNDEBUG, the function call inside the assert will not be executed: assert() expands to a no-op.
example:
Line 123 in 4a8ec34
Line 138 in 4a8ec34
I know that the code I pointed to is the test case of VELOC but users will most likely start with copy-pasting from existing code and therefore propagate the issue into their own codebase.
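A sketch of the safer pattern (the call here is a hypothetical stand-in, not the real VELOC API):

```c
#include <assert.h>

static int call_count = 0;

/* Stand-in for a VELOC_* call that must always run; returns 0 on success. */
static int fake_veloc_call(void) {
    call_count++;
    return 0;
}

int checkpoint_step(void) {
    /* Risky:   assert(fake_veloc_call() == 0);
     * With -DNDEBUG the whole expression, call included, vanishes. */
    int ret = fake_veloc_call();  /* the call always executes */
    assert(ret == 0);             /* only the check is compiled out */
    return ret;
}
```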
Too much here is hardcoded; please let the user override the install paths with CMAKE_INSTALL_LIBDIR:
https://github.com/ECP-VeloC/VELOC/blob/master/src/modules/CMakeLists.txt#L12
https://github.com/ECP-VeloC/VELOC/blob/master/src/lib/CMakeLists.txt#L13
https://github.com/ECP-VeloC/VELOC/blob/master/src/backend/CMakeLists.txt#L12
When testing veloc_srun on SLURM, on back-to-back runs after a node was already taken down, the second run ended up double counting the same downed node in down_nodes.
Unfortunately I don't have the output from this test as it was done on a different machine.
The VELOC API is missing some semantics needed for SCR. Most of these can be worked around, but I'll build a list to record where we stand:
Hi,
I'm testing VeloC with a heatdis example using the single-mode (VELOC_Init_single) option.
My cfg file contains:
scratch = /tmp/scratch
persistent = /tmp/persistent
mode = async
I'm using MPICH version 3.3.2, and VeloC 1.4 release, I'm not launching veloc-backend before running my program, and I'm using a single machine.
The issue is that when I run my program letting the VeloC library start the backend by itself, my program doesn't finish (I think it gets stuck in the VELOC_Finalize function). The backend log seems to be normal.
If I start the backend before running the program everything goes fine.
Any idea of what is going on?
With the file-based method, during a restart, the app calls VELOC_Route_file to get the path to each of its checkpoint files. However, veloc currently returns "" in this case, due to a check as to whether the library is in an active checkpoint state.
[ERROR 804727076878] [.../src/lib/client.cpp:145:route_file] must call checkpoint_begin() first
We need to also enable this to work during restart, i.e. for the following sequence:
VELOC_Init()
VELOC_Restart_test()
VELOC_Restart_begin()
VELOC_Route_file()
VELOC_Restart_end()
Is there a plan for veloc to support a direct HIP/CUDA interface?
Are Fortran bindings for VeloC already available?
If a node is in the allocation but is down (i.e., in down_nodes), this causes a hang when attempting to run on the down node.
VELOC/scripts/SLURM/veloc_srun.in
Lines 57 to 59 in 4144d92
When we had problems with node-local SSDs failing, we added scripting to run tests against the SSDs on each node. For that, the scripts have to know the path to the SSDs to be tested. The library often knows this, so again this kind of info could be stored to a file. In SCR, we had the scripts read the same config files as the library to identify the set of storage paths that were used. For now, I've commented these tests out of the scripts.
To know whether there are enough nodes left, it's useful to have the first job that runs record the number of nodes it used in a file. Then the scripts can process that file to get the number of nodes needed to know whether there are enough nodes for a restart. We can work around that by having the user set a variable or config param stating the number of nodes they need, like VELOC_MIN_NODES. However, it's nice to automate this, since it's one less setting for the user.
It would be much easier for users if VeloC targeted Boost 1.53 instead of 1.60. 1.53 is included with RHEL 7 & CentOS 7, which is what a lot of sites are going to be running. Is there something absolutely essential to 1.60 that we can't get with 1.53?
$ cmake -DCMAKE_BUILD_TYPE=Debug -DWITH_KVTREE_PREFIX=`pwd`/install -DWITH_AXL_PREFIX=`pwd`/install .
CMake Error at /usr/tce/packages/cmake/cmake-3.9.2/share/cmake-3.9/Modules/FindBoost.cmake:1878 (message):
Unable to find the requested Boost libraries.
Boost version: 1.53.0
Boost include path: /usr/include
Detected version of Boost is too old. Requested version was 1.60 (or
newer).