
llnl / scr

97 stars · 22 watchers · 35 forks · 299.8 MB

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.

Home Page: http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

License: Other

Languages: Shell 4.15%, C 69.43%, Perl 0.54%, Python 22.76%, CMake 3.12%, HTML 0.02%
Topics: scalable checkpoint mpi radiuss data-management

scr's Introduction

Scalable Checkpoint / Restart (SCR) Library

The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing, restarting, and output in large-scale jobs. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on critical shared resources such as the parallel file system.
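
To make this concrete, here is a minimal sketch of how an MPI application typically calls the core SCR API (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint, SCR_Finalize). The loop bound and checkpoint contents are placeholders, and most error checking is elided:

#include <stdio.h>
#include "mpi.h"
#include "scr.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    SCR_Init(); /* call after MPI_Init */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int t = 0; t < 10; t++) {
        /* ... application computes a timestep ... */

        int need = 0;
        SCR_Need_checkpoint(&need); /* ask SCR whether it is time to checkpoint */
        if (need) {
            SCR_Start_checkpoint();

            /* ask SCR where to write this rank's file (e.g., node-local storage) */
            char name[SCR_MAX_FILENAME];
            char path[SCR_MAX_FILENAME];
            snprintf(name, sizeof(name), "rank_%d.ckpt", rank);
            SCR_Route_file(name, path);

            /* write this rank's checkpoint data to the path SCR returned */
            int valid = 0;
            FILE* f = fopen(path, "wb");
            if (f != NULL) {
                valid = (fwrite(&t, sizeof(t), 1, f) == 1);
                fclose(f);
            }

            /* tell SCR whether this rank's checkpoint completed successfully */
            SCR_Complete_checkpoint(valid);
        }
    }

    SCR_Finalize(); /* call before MPI_Finalize */
    MPI_Finalize();
    return 0;
}

The application writes its files the same way it would write them to the parallel file system; SCR_Route_file is what redirects the write to fast storage, and SCR applies the configured redundancy scheme behind the scenes.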

Users

Instructions to build and use SCR are hosted at scr.readthedocs.io.

For new users, the Quick Start guide shows how to build and run an example using SCR.

For more detailed build instructions, refer to Build SCR.


Contribute

As an open source project, we welcome contributions via pull requests, as well as questions, feature requests, or bug reports via issues. Please refer to both our code of conduct and our contributing guidelines.

Developers

Developer documentation is provided at SCR-dev.ReadTheDocs.io.


SCR uses components from ECP-VeloC, which have their own user and developer docs.

A development build is useful for those who wish to modify how SCR works. It checks out and builds SCR and many of its dependencies separately. The process is more complicated than the user build described above, but the development build is helpful when one intends to commit changes back to the project.

For a development build of SCR and its dependencies on SLURM systems, one can use the bootstrap.sh script:

git clone https://github.com/LLNL/scr.git
cd scr

./bootstrap.sh

cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ..
make install

When using a debugger with SCR, one can build with the following flags to disable compiler optimizations:

./bootstrap.sh --debug

cd build
cmake -DCMAKE_INSTALL_PREFIX=../install -DCMAKE_BUILD_TYPE=Debug ..
make install

One can then run a test program:

cd examples
srun -n4 -N4 ./test_api

For developers who may be installing SCR outside of an HPC cluster, who are using Fedora, and who have sudo access, the following steps install and activate most of the necessary base dependencies:

sudo dnf groupinstall "Development Tools"
sudo dnf install cmake gcc-c++ mpi mpi-devel environment-modules zlib-devel pdsh
[restart shell]
module load mpi

Authors

Numerous people have contributed to the SCR project.

To reference SCR in a publication, please cite the following paper:

Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), 2010.

Additional information and research publications can be found here:

https://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

scr's People

Contributors

adammoody, becker33, c-a-h, camstan, chaseleif, gonsie, hbchen1984, ianlee1521, jameshcorbett, kathrynmohror, mcfadden8, ofaaland, planeta, pozulp, rhaas80, robertkb, shurickdaryin, tonyhutter, white238


scr's Issues

SCR_CNTL_BASE variable

Hi,
I have installed SCR 1.2.0 and want to add an XOR descriptor to scr.conf as discussed in the user manual. For now I want a single XOR descriptor in /tmp/username/cache, so I added the following lines to scr.conf as described in the manual:

SCR_COPY_TYPE=FILE
STORE=/tmp/username/cache          GROUP=NODE   COUNT=1
CKPT=0 INTERVAL=1 GROUP=NODE   STORE=/tmp/username/cache TYPE=XOR     SET_SIZE=8

After running my SCR test program, I get the following error:

SCR v1.2.0 ABORT: rank 0 on <NodeName>: Failed to create store descriptor for control directory [/tmp] @ <SCR_DIRECTORY>/src/scr_storedesc.c:355

I believe this indicates that there is a default store descriptor (/tmp) that cannot be found among the available descriptors (in this case the only available descriptor is /tmp/username/cache, according to the scr.conf file).
I traced the code and found that the default store descriptor is held in a global variable scr_cntl_base. Its value is set at scr.c:841 from scr.conf or the environment if found; otherwise it is set at scr.c:843 from a constant named SCR_CNTL_BASE (default /tmp).
This means that to change the value of scr_cntl_base, I should set the correct descriptor in scr.conf or the environment as follows:

in scr.conf:
SCR_CNTL_BASE = /tmp/username/cache

or in the environment:
export SCR_CNTL_BASE=/tmp/username/cache

But after setting SCR_CNTL_BASE via scr.conf or the environment and running the test program, I get the following error:

SCR v1.2.0 ERROR: rank 0 on <NodeName>: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting

This prevents me from updating the value of scr_cntl_base. I'm confused about how to add my descriptor to the scr.conf file. Am I missing something?

SCR 1.2.0

Spinning up a new SCR release (for @mmpozulp). Progress tracking:

  • add #define SCR_VERSION_MAJOR/MINOR/PATCH in scr.h
  • ensure spack package is up-to-date (see #22)
  • test that SCR builds on all the machines
  • tag a new release on GitHub (with tarball)

Failed to record jobid

Hi,

I'm trying to run the example "test_api" using SCR 1.1.8. After successfully installing the dependencies and SCR, I cannot run the example because no job id is created, and I get the following error:
SCR v1.1.8 ABORT: rank 0 on ubuntu: Failed to record jobid @ scr.c:329 application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0 [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=0 : system msg for write_line failure : Bad file descriptor

I looked at the code of "test_api" and found that this error is thrown when "scr_init" is called. I also get this error when running all the other examples. I'm using MPICH v3.2 on Ubuntu. Any idea what the problem could be?

SCR commands in Slurm script

Hi,

I'm currently trying to get SCR to work in a Slurm script. I used the example script from the tutorial, but I cannot call SCR commands in the script. I get the following errors when running sbatch slurm_script:

/local/tmp/slurmd/job08249/slurm_script: line 8: /usr/local/tools/dotkit/init.sh: File or directory not found
/local/tmp/slurmd/job08249/slurm_script: line 9: use: Command not found
/local/tmp/slurmd/job08249/slurm_script: line 23: scr_srun: Command not found

In fact, there is no init.sh or dotkit installed on my system, and I couldn't find anything about this in the source or the documentation.

clean up of testing

Review output from the examples. I think the cleanup step is missing a few files.

tests: compilers

Should test all the possible compilers: intel, gcc, clang, pgi, etc.

tests: Add cases for async flush

Want to verify that things work in at least the following cases:

  1. job ends normally (no failure)
  2. job exits early while an async flush was ongoing and it gets scavenged
  3. job exits early while an async flush was ongoing and it is restarted in the allocation (be careful that scr_distribute doesn't pull files out from under the ongoing flush)

Support machines that have no cache space (data can only be written to file system)

For systems that have no SSDs available for cache and insufficient free memory for a ramdisk, let's write checkpoints directly to the file system. Let's assume we can still write filemap info to a ramdisk, though (it's very small).

Route_file will need to do mkdir() calls here, probably for each component below the prefix directory.
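
As a sketch of what those mkdir() calls could look like, here is a hypothetical helper (not SCR code) that creates every missing component below the prefix, like mkdir -p:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* hypothetical helper: create each directory component of path, like mkdir -p */
static int mkdir_recursive(const char* path, mode_t mode)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s", path);
    for (char* p = tmp + 1; *p != '\0'; p++) {
        if (*p == '/') {
            *p = '\0';
            if (mkdir(tmp, mode) != 0 && errno != EEXIST) {
                return -1;
            }
            *p = '/';
        }
    }
    if (mkdir(tmp, mode) != 0 && errno != EEXIST) {
        return -1;
    }
    return 0;
}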

We'll need to track file names and write SCR metadata that we'd normally get after a flush.

We need fetch to be aware of this, too.

The scavenge script will probably have to be modified, too.

tests: dependencies

We should test dependencies, either with known working versions or latest masters.

Can we add an API call to SCR that sets user parameters?

Section 6 "Configure a job" of the SCR User's Manual says

SCR searches the following locations in the following order for a parameter value, taking the first value it finds.

  1. Environment variables,
  2. User configuration file,
  3. System configuration file,
  4. Compile-time constants.

Some parameters, such as the location of the control directory, cannot be specified by the user. Such parameters must be either set in the system configuration file or hard-coded into SCR as compile-time constants.

Can we add an API call to SCR that sets user parameters, like those specified in 1 and 2? Doing this would allow users of my code to put all SCR configuration parameters inside the input deck for my code.

Note that if SCR needs user-settable parameters at SCR_Init time, then they could be passed there instead of a new API call.
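
For illustration, such a call might look like the sketch below. Later SCR releases did add an SCR_Config() function along these lines, but treat the exact signature and parameter values here as assumptions:

#include "mpi.h"
#include "scr.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    /* assumed API: set parameters taken from the application's input deck,
       before SCR_Init() reads its configuration */
    SCR_Config("SCR_FLUSH=10");
    SCR_Config("SCR_CACHE_SIZE=2");

    SCR_Init();
    /* ... checkpoint and restart as usual ... */
    SCR_Finalize();

    MPI_Finalize();
    return 0;
}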

Check into providing Fortran interface as source file, rather than libscrf

Because C provides a standard ABI, libscr can be compiled once and then linked to an app built with any compiler. However, we must compile libscrf for each Fortran compiler. If we can provide the Fortran interface via a raw source file or include file that apps compile in directly, then we may reduce the number of SCR builds one has to maintain.

Job exiting due to SCR_FINALIZE_CALLED

My code worked well with a previous version of SCR (I can't remember which version).
Last week I upgraded the SCR source code to the newest version in the GitHub repo and recompiled it. However, when I try to run my code, it always exits before SCR_Init() finishes. The debugging info says the job exits because of "SCR_FINALIZE_CALLED". Why is SCR_Finalize called during SCR_Init()? I traced the problem to the function "scr_bool_check_halt_and_decrement()", but I don't understand why it happens. I was wondering if someone could help me with this. Thanks!

configuring perl scripts

The way we build our Perl scripts is currently broken. The scripts need to be configured with the path to the other Perl scripts, but this path at build time (and thus at make test time) differs from the path needed at install time. We need to re-configure the scripts when a user runs make install.

cannot run configure

Hi,

When I run configure I get the following:
config.status: creating Makefile
config.status: error: cannot find input file: `src/Makefile.in'

Before this, running the autogen.sh script gives the following error:
configure.ac:57: error: possibly undefined macro: AC_PROG_LIBTOOL. If this token and others are legitimate, please use m4_pattern_allow. See the Autoconf documentation.

As a result, I cannot run make or make install.

LSF: Can we execute a scavenge?

This requires running a copy command from each remaining healthy node to copy data to the parallel file system, perhaps in an allocation with failed nodes.

Simplify use in cases where some levels support async flush and some do not

Currently, the user must be careful to configure SCR_FLUSH to line up with their checkpoint descriptors if they want to use async flush on a multi-cache system, where only some levels support async functionality. This is cumbersome for users to keep track of. Let's make this easier.

We could either fall back to a synchronous flush when the output lands in a cache location that only supports sync, or always select a cache location that supports async flush, even if the checkpoint id would normally write it to some other level.

Flush Async

SCR's asynchronous flush capability currently doesn't work. When async flush is turned on (SCR_FLUSH=10, SCR_FLUSH_ASYNC=1), the program hangs during SCR_Init.

The problem is the following:

  1. When the transfer daemon starts, it sets the STATE in the transfer file to STOPPED.
  2. Within SCR_Init, any existing transfer file is deleted and recreated. SCR then issues a "flush_async_stop" command. This command triggers a "wait" on STATE STOPPED in the transfer file.
    Because the file has been deleted since the start of the transfer daemon, there is no STATE entry. This triggers an infinite wait loop.

Possible solutions:

  • Trigger the start of the transfer daemon from within SCR_Init. This is most likely a no-go because we are unable to fork from within our program.
  • Do not delete any existing transfer file. This may impact synchronous flush and/or restart operations.
  • Do not issue a flush_async_stop command at startup. This may impact asynchronous flush activities during a restart operation.

All in all, we need to clarify the correct initialization process for the transfer file and, in the case of flush async, the transfer daemon.
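
A fourth option in the same spirit as the last two: make the wait tolerate a missing STATE entry so it cannot spin forever. A sketch, where read_transfer_state() is a hypothetical stand-in for SCR's transfer-file parsing and the file path is illustrative:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum xfer_state { XFER_UNKNOWN, XFER_RUNNING, XFER_STOPPED };

/* hypothetical stand-in for SCR's transfer-file parsing;
   returns XFER_UNKNOWN if the file or its STATE entry is absent */
static enum xfer_state read_transfer_state(const char* file)
{
    FILE* f = fopen(file, "r");
    if (f == NULL) {
        return XFER_UNKNOWN;
    }
    enum xfer_state s = XFER_UNKNOWN;
    char line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "STATE=STOPPED", 13) == 0) s = XFER_STOPPED;
        if (strncmp(line, "STATE=RUNNING", 13) == 0) s = XFER_RUNNING;
    }
    fclose(f);
    return s;
}

static void wait_for_stopped(const char* file)
{
    /* a missing file or STATE entry counts as stopped, since the transfer
       file may have been deleted and recreated after the daemon wrote
       STATE=STOPPED */
    while (read_transfer_state(file) == XFER_RUNNING) {
        usleep(100 * 1000); /* poll every 100 ms */
    }
}

int main(void)
{
    wait_for_stopped("/tmp/scr/transfer.info"); /* illustrative path */
    return 0;
}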

Support parsing environment variables in config files

Many sites define job-specific directories that users may write to, which are defined via strings like /scratch/$USER/$SLURM_JOBID. To enable different sites to configure SCR, we need to provide a way for users to build dynamic directory paths like this for their cache and control directories.
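
A sketch of one way to implement this, substituting $VAR references with getenv(); a real version would also need to handle ${VAR} syntax, escaping, and undefined variables more carefully:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* expand $VAR references in `in` into `out`, dropping undefined variables */
static void expand_env(const char* in, char* out, size_t outlen)
{
    size_t o = 0;
    while (*in != '\0' && o + 1 < outlen) {
        if (*in == '$') {
            in++;
            char var[128];
            size_t v = 0;
            while ((isalnum((unsigned char)*in) || *in == '_') && v + 1 < sizeof(var)) {
                var[v++] = *in++;
            }
            var[v] = '\0';
            const char* val = getenv(var);
            while (val != NULL && *val != '\0' && o + 1 < outlen) {
                out[o++] = *val++;
            }
        } else {
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
}

int main(void)
{
    char path[1024];
    expand_env("/scratch/$USER/$SLURM_JOBID", path, sizeof(path));
    printf("%s\n", path);
    return 0;
}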

io-watchdog link

The io-watchdog link in the README points to a Google Code site. Now that that service has reached end-of-life, is there a better link to use?

enabling PMIx leads to a build error

The error is as follows (emphasis added on the error lines):
make[1]: Circular libscr.la <- libscr.la dependency dropped.
/bin/sh ../libtool --tag=CC --mode=link gcc -DHAVE_CONFIG_H -I. -I../config -I../src -I../config -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/include -DHAVE_CONFIG_H -I. -I../config -I../src -I../config -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -avoid-version -o libscr.la -rpath /home/caholgu1/test_install/lib libscr.la -L/scratch/jenkins-2/artifacts/ompi-update-scratch/496/hwloc/lib -L/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/lib -lmpi -Wl,-rpath -Wl,/scratch/jenkins-2/artifacts/ompi-update-scratch/496/hwloc/lib -Wl,-rpath -Wl,/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/lib -Wl,--enable-new-dtags -lz -lpmix
libtool: link: cannot find the library `libscr.la' or unhandled argument `libscr.la'
make[1]: *** [libscr.la] Error 1

Logging

We should add better logging, either to syslog or to a user-defined log file. If we used syslog, the information would automatically get pulled into Splunk on LC systems.
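
For the syslog route, the standard POSIX calls would be enough; a minimal sketch with made-up message text:

#include <syslog.h>

int main(void)
{
    openlog("scr", LOG_PID, LOG_USER);
    syslog(LOG_INFO, "checkpoint %d flushed to parallel file system", 42);
    syslog(LOG_ERR, "rebuild failed for dataset %d", 17);
    closelog();
    return 0;
}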

Date::Manip in install doc

Hi,

In README,

Date::Manip -- Perl module for date/time interpretation
    http://search.cpan.org/~sbeck/Date-Manip-5.54/lib/Date/Manip.pod

that link is dead. However,

yum install perl-Date-Manip

provides (at least in CentOS 7) the desired module.

Multiple applications picking up same SCR data

I am doing some testing, and it seems like multiple applications will pick up the same SCR data.

Notably, I'm trying to run the test_api example as well as some of the heatdis VeloC examples. Is this known/expected behavior? Does each application calling SCR need its own install of the library?

datawarp.h missing

Checking out a clean version of master, the build fails because it cannot find datawarp.h. This appears to be the case regardless of configure-time options; I tried enabling and disabling CPPR, and it made no difference.

My configure line was as follows:
./configure --with-file-lock=fcntl --prefix=/home/caholgu1/scr --with-scr-config-file=/home/caholgu1/scr_demo/scr.conf

Develop MPI-based scavenge

Some systems don't support pdsh, but they all generally support running MPI jobs. We can improve portability by having an MPI-based scavenge in addition to our pdsh-based version.

Also with MPI, we can:

  1. identify the list of datasets we can flush
  2. rebuild missing files
  3. initiate transfers when using vendor-specific APIs

The tricky part is to be sure we have a way to avoid bad nodes when launching the MPI-based scavenge app.
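
A minimal sketch of the idea, with illustrative paths rather than SCR's actual cache and prefix layout: each rank copies its node-local cache file to the parallel file system, then all ranks agree on the outcome so a partial scavenge can be detected and retried:

#include <stdio.h>
#include "mpi.h"

/* copy src to dst; returns 0 on success, -1 on failure */
static int copy_file(const char* src, const char* dst)
{
    FILE* in = fopen(src, "rb");
    if (in == NULL) return -1;
    FILE* out = fopen(dst, "wb");
    if (out == NULL) { fclose(in); return -1; }

    char buf[65536];
    size_t n;
    int rc = 0;
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) { rc = -1; break; }
    }

    fclose(in);
    if (fclose(out) != 0) rc = -1;
    return rc;
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char src[256], dst[256];
    snprintf(src, sizeof(src), "/tmp/scr/rank_%d.ckpt", rank);       /* node-local cache */
    snprintf(dst, sizeof(dst), "/p/lustre/ckpt/rank_%d.ckpt", rank); /* parallel file system */

    int rc = copy_file(src, dst);

    /* agree globally on success (MPI_MIN picks up any rank's failure) */
    int all_rc;
    MPI_Allreduce(&rc, &all_rc, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);

    MPI_Finalize();
    return (all_rc == 0) ? 0 : 1;
}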

Define default value for SCR_JOB_ID

For cases where we can't read the jobid from the resource manager, let's make one up.

For now, let's hardcode some value here. Check that we use the same value in the library and the scripts, if needed.

Eventually, we could perhaps have rank 0 generate a value and bcast that to the other ranks. The tricky part is to verify that we have a value if needed in the scripts.
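
A sketch of that eventual approach; the time-based id is just one way to make up a value:

#include <stdio.h>
#include <time.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank 0 invents a job id when the resource manager provides none */
    unsigned long jobid = 0;
    if (rank == 0) {
        jobid = (unsigned long) time(NULL);
    }

    /* broadcast so every rank agrees on the same id */
    MPI_Bcast(&jobid, 1, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("using default jobid %lu\n", jobid);
    }

    MPI_Finalize();
    return 0;
}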
