Comments (5)

bnicolae avatar bnicolae commented on July 18, 2024

Hi @gongotar, can you please provide more information about what is happening? How many nodes did you use? What error messages do the active backends show on restart (if any at all)? Do all active backends show the same error message or are there any differences?

from veloc.

gongotar avatar gongotar commented on July 18, 2024

Hi Bogdan, sorry for the late response! The VeloC library contains a test folder with a file named heatdis_fault.cpp. At startup, the code in that file checks whether any restart is available; if not, it starts computing from the beginning. In the middle of the computation, however, a failure is injected: it removes the local checkpoints of the last rank (lines 134 - 138) and then calls MPI_Abort(MPI_COMM_WORLD, 1) and exit(1).
I have written a Slurm script named test_sync.sh to run the test code. It runs heatdis_fault twice (in sync mode). The first run has no restart available, so the computation starts from the beginning. The second run is launched after the first run has hit the injected failure and lost one of its checkpoints (out of three), and it is expected to restart from the most recent checkpoint of the first run. Instead, the second run finds no restart and the job starts computing from the beginning again.
I tested the code with 3 nodes and also with higher numbers of nodes allocated to the job. In the VeloC config file, erasure coding is enabled (ec_interval = 0), so I would expect VeloC to restore the lost checkpoint of a single rank out of three or more ranks (one rank per node) and restart the second run from the saved checkpoint. This does not happen: the second run starts from the beginning again. Here is the script I wrote:

#!/bin/bash
#SBATCH -N3
#SBATCH -o out
CFG=heatdis.cfg

srun -N3 heatdis_fault 256 $CFG
srun -N3 heatdis_fault 256 $CFG

EXIT_CODE=$?
exit $EXIT_CODE
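
The script above reports only the exit code of the second run, so the expected failure of the first run is not visible in the job's result. A minimal sketch of one way to record both exit codes is shown below; `run_twice`, `FIRST_RC` and `SECOND_RC` are hypothetical names introduced for illustration and are not part of VeloC or the original script.

```shell
#!/bin/bash
# Hypothetical helper: run the same command twice and keep both exit codes.
# The first run is expected to fail (injected failure), the second to restart.
run_twice() {
    "$@"
    FIRST_RC=$?
    echo "first run exited with $FIRST_RC"

    "$@"
    SECOND_RC=$?
    echo "second run exited with $SECOND_RC"

    # Propagate the status of the second (restart) run, as the original script does.
    return "$SECOND_RC"
}

# Assumed usage inside the Slurm script:
#   run_twice srun -N3 heatdis_fault 256 "$CFG"
```

With this in place, the job output would show whether the first run really ended with a non-zero status before the second run was launched.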

I dug into the problem and found the cause, which I mentioned in my previous post. However, let me know if you can reproduce the problem.

bnicolae avatar bnicolae commented on July 18, 2024

Hi @gongotar, can you list the content of the VeloC config file? Did you also flush the checkpoints to the parallel file system? If so, the new node should fetch the missing local checkpoint from the PFS on restart. You can check this by deactivating the EC module.

gongotar avatar gongotar commented on July 18, 2024

Hi Bogdan,

Here is the config file to test the functionality of erasure coding:

scratch = /tmp/user/veloc_test
persistent = /shared/user/veloc_test_cp
max_versions = 2
mode = sync
ec_interval = 0
persistent_interval = -1
axl_type = native

However, I also changed the config file to flush the checkpoints to the shared PFS as you suggested, by setting persistent_interval=0 and ec_interval=-1. With the shared PFS, the second run restarts successfully from the most recent checkpoint on the PFS. But if I deactivate the shared PFS and activate EC instead (as in the provided config file: persistent_interval = -1; ec_interval = 0), the restart fails to restore the lost checkpoint using EC.
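
For comparison, the PFS-only variant described above would look roughly like the following. Only the two interval settings are taken from the description; the remaining lines are assumed to be unchanged from the EC config.

```
scratch = /tmp/user/veloc_test
persistent = /shared/user/veloc_test_cp
max_versions = 2
mode = sync
ec_interval = -1
persistent_interval = 0
axl_type = native
```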
Here are the VeloC messages for the job running on 6 nodes with EC activated:

# VeloC messages of the first run:
[INFO 1910392244657] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized       
[INFO 1910401162927] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910400580068] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910401412467] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910400857710] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910401879747] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[DEBUG 1910392244763[DEBUG 1910401412598] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command]
[DEBUG 1910400857816] [164:process_command] obtain latest version for veloc_t[DEBUG 1910401879860] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:obtain latest version for veloc_t
/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for veloc_t
164:process_command] obtain latest version for veloc_t                           
[DEBUG 1910401163027] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:[DEBUG 1910400580180] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:
164:process_command] obtain latest version for veloc_t                           
164:process_command] obtain latest version for veloc_t                           

<Outputs of the job during the computation of the first run>

<Injected failure, job exits>

# VeloC messages of the second run:
[INFO 1910471167561] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910479502975] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910480085845] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910480802648] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910479780616] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[INFO 1910480335366] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:99:transfer_module_t] AXL successfully initialized
[DEBUG 1910471167721] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] [DEBUG 1910479503092] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164]
obtain latest version for veloc_t                                                                          
[DEBUG 1910479780770] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command:process_command] obtain latest version for veloc_t
[DEBUG 1910480085986] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] obtain latest version for [DEBUG 1910480802784] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] 
obtain latest version for veloc_t
[DEBUG 1910480335509] [/home/user/VeloC_prefix/VELOC-veloc-1.2/src/modules/transfer_module.cpp:164:process_command] veloc_t
obtain latest version for veloc_t                                                                          
obtain latest version for veloc_t                                                


<Outputs of the job starting the computation from the beginning on the second run>
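
Because all backends write to the same output stream, their DEBUG lines interleave and are hard to read by eye. A small sketch like the one below counts the distinct DEBUG timestamps in a saved log, which is one way to check how many backends reached the same step. `count_debug_stamps` is a hypothetical helper name, and the pattern assumes the `[DEBUG <timestamp>]` prefix seen in the messages above.

```shell
#!/bin/bash
# Hypothetical helper: count distinct "[DEBUG <timestamp>" prefixes in a log file.
# grep -o emits every match on its own line, even when several messages
# were interleaved onto the same physical line.
count_debug_stamps() {
    grep -o '\[DEBUG [0-9][0-9]*' "$1" | sort -u | wc -l
}

# Assumed usage on the Slurm output file named by "#SBATCH -o out":
#   count_debug_stamps out
```

Note that running it on the full `out` file would count the messages of both runs together, so splitting the file per run first may be more useful.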

bnicolae avatar bnicolae commented on July 18, 2024

Hi @gongotar, we have a new VELOC release. If it is working for you, I will close this issue.
