Comments (5)
Hi @gongotar, can you please provide more information about what is happening? How many nodes did you use? What error messages do the active backends show on restart (if any at all)? Do all active backends show the same error message or are there any differences?
from veloc.
Hi Bogdan, sorry for the late response! The VeloC library contains a test folder with a file named heatdis_fault.cpp. At startup, the code in that file checks whether any restarts are available. If not, it starts computing from the beginning. However, in the middle of the computation, a failure is injected which removes the local checkpoints of the last rank (lines 134 - 138) and calls MPI_Abort(MPI_COMM_WORLD, 1) and exit(1).
Now, I have written a Slurm script named test_sync.sh to run the test code. It runs heatdis_fault twice (in sync mode). The first run has no restarts available, so the computation starts from the beginning. The second run is called after the first run has hit the injected failure and lost one of its checkpoints (out of three), and it is expected to restart from the most recent checkpoint of the first run. Instead, no restarts are found in the second run and the job starts computing from the beginning again.
I tested the code with 3 and with higher numbers of nodes allocated to the job. In the VeloC config file, erasure coding is enabled (ec_interval = 0). So I would expect VeloC to restore the lost checkpoint of a single rank out of three or more ranks (one rank per node) and to restart the computation from the saved checkpoint in the second run. This does not happen, and the second run starts from the beginning again. Here is the script I wrote:
#!/bin/bash
#SBATCH -N3
#SBATCH -o out
CFG=heatdis.cfg
srun -N3 heatdis_fault 256 $CFG
srun -N3 heatdis_fault 256 $CFG
EXIT_CODE=$?
exit $EXIT_CODE
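The expectation behind this test, that one lost checkpoint out of three can be rebuilt from the survivors, can be illustrated with plain XOR parity. The sketch below is only a toy model: the function name xor_blocks and the integer "checkpoints" are made up for illustration, and XOR parity is an assumption here, not necessarily the scheme the VeloC EC module uses.

```shell
#!/bin/bash
# Toy model of single-erasure recovery with XOR parity.
# Not VeloC code: the "checkpoints" are small integer lists made up for illustration.

# xor_blocks "a1 a2 ..." "b1 b2 ..." prints the element-wise XOR of two equal-length lists.
xor_blocks() {
    local -a a=($1) b=($2) out=()
    local i
    for i in "${!a[@]}"; do
        out+=( $(( a[i] ^ b[i] )) )
    done
    echo "${out[*]}"
}

# Hypothetical checkpoint contents of three ranks.
rank0="10 20 30"
rank1="7 8 9"
rank2="100 200 300"

# Parity computed while all three checkpoints are still present.
parity=$(xor_blocks "$(xor_blocks "$rank0" "$rank1")" "$rank2")

# Rank 2 loses its local checkpoint: rebuild it from the survivors and the parity.
recovered=$(xor_blocks "$(xor_blocks "$rank0" "$rank1")" "$parity")

echo "recovered rank2 checkpoint: $recovered"
```

Because XOR is its own inverse, combining the two surviving blocks with the parity yields exactly the missing block. This is the kind of single-failure recovery the second run is expected to get from EC.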
I dug into the problem and found the cause, which I mentioned in my previous post. However, let me know if you can reproduce the problem.
Hi @gongotar, can you list the content of the VeloC config file? Did you also flush the checkpoints to the parallel file system? If so, the new node should fetch the missing local checkpoint from the PFS on restart. You can check this by deactivating the EC module.
Hi Bogdan,
Here is the config file to test the functionality of erasure coding:
scratch = /tmp/user/veloc_test
persistent = /shared/user/veloc_test_cp
max_versions = 2
mode = sync
ec_interval = 0
persistent_interval = -1
axl_type = native
However, I also changed the config file to flush the checkpoints to the shared PFS, as you suggested, by setting persistent_interval=0 and ec_interval=-1. With the shared PFS, the second run restarts successfully from the most recent checkpoint in the shared PFS. But if I deactivate the shared PFS and instead activate EC (as in the provided config file: persistent_interval = -1; ec_interval = 0), the restart fails to restore the lost checkpoint using EC.
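To make the comparison easy to repeat, the two variants can be written out as separate config files. The sketch below uses only the keys and values already shown in this thread. The function name write_cfg and the output file names are hypothetical.

```shell
#!/bin/bash
# Sketch: write the two config variants compared in this thread.
# Keys and values come from the config above; write_cfg and the
# output file names are made up for illustration.

# write_cfg FILE EC_INTERVAL PERSISTENT_INTERVAL
write_cfg() {
    cat > "$1" <<EOF
scratch = /tmp/user/veloc_test
persistent = /shared/user/veloc_test_cp
max_versions = 2
mode = sync
ec_interval = $2
persistent_interval = $3
axl_type = native
EOF
}

write_cfg heatdis_ec.cfg   0 -1   # EC only: the variant where the restart fails
write_cfg heatdis_pfs.cfg -1  0   # PFS flush only: the variant where the restart works
```

The Slurm script could then point CFG at heatdis_ec.cfg or heatdis_pfs.cfg to switch between the failing and the working case without editing the file by hand.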
Here you see the VeloC messages for the job running on 6 nodes with EC activated:
# VeloC messages of the first run:
[INFO 1910392244657] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910401162927] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910400580068] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910401412467] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910400857710] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910401879747] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[DEBUG 1910392244763[DEBUG 1910401412598] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command]
[DEBUG 1910400857816] [164:process_command] obtain latest version for veloc_t[DEBUG 1910401879860] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:obtain latest version for veloc_t
/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
164:process_command] obtain latest version for veloc_t
[DEBUG 1910401163027] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:[DEBUG 1910400580180] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:
164:process_command] obtain latest version for veloc_t
164:process_command] obtain latest version for veloc_t
<Outputs of the job during the computation of the first run>
<Injected failure, job exits>
# VeloC messages of the second run:
[INFO 1910471167561] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910479502975] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910480085845] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910480802648] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910479780616] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910480335366] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[DEBUG 1910471167721] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] [DEBUG 1910479503092] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164]
obtain latest version for veloc_t
[DEBUG 1910479780770] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command:process_command] obtain latest version for veloc_t
[DEBUG 1910480085986] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for [DEBUG 1910480802784] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command]
obtain latest version for veloc_t
[DEBUG 1910480335509] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] veloc_t
obtain latest version for veloc_t
obtain latest version for veloc_t
<Outputs of the job starting the computation from the beginning on the second run>
Hi @gongotar, we have a new VELOC release. If it is working for you, I will close this issue.