
llnl / scr

97 stars · 22 watchers · 35 forks · 299.8 MB

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.

Home Page: http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

License: Other

Languages: Shell 4.15%, C 69.43%, Perl 0.54%, Python 22.76%, CMake 3.12%, HTML 0.02%
Topics: scalable checkpoint mpi radiuss data-management

scr's Introduction

Scalable Checkpoint / Restart (SCR) Library

The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing, restarting, and output in large-scale jobs. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on critical shared resources such as the parallel file system.
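
To make this concrete, here is a minimal sketch of how an MPI application typically calls the core SCR API (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint, SCR_Finalize). The loop bound and checkpoint contents are placeholders, and most error checking is elided:

#include <stdio.h>
#include "mpi.h"
#include "scr.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    SCR_Init(); /* call after MPI_Init */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int t = 0; t < 10; t++) {
        /* ... application computes a timestep ... */

        int need = 0;
        SCR_Need_checkpoint(&need); /* ask SCR whether it is time to checkpoint */
        if (need) {
            SCR_Start_checkpoint();

            /* ask SCR where to write this rank's file (e.g., node-local storage) */
            char name[SCR_MAX_FILENAME];
            char path[SCR_MAX_FILENAME];
            snprintf(name, sizeof(name), "rank_%d.ckpt", rank);
            SCR_Route_file(name, path);

            /* write this rank's checkpoint data to the path SCR returned */
            int valid = 0;
            FILE* f = fopen(path, "wb");
            if (f != NULL) {
                valid = (fwrite(&t, sizeof(t), 1, f) == 1);
                fclose(f);
            }

            /* tell SCR whether this rank's checkpoint completed successfully */
            SCR_Complete_checkpoint(valid);
        }
    }

    SCR_Finalize(); /* call before MPI_Finalize */
    MPI_Finalize();
    return 0;
}

The application writes its files the same way it would write them to the parallel file system; SCR_Route_file is what redirects the write to fast storage, and SCR applies the configured redundancy scheme behind the scenes.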

Users

Instructions to build and use SCR are hosted at scr.readthedocs.io.

For new users, the Quick Start guide shows how to build and run an example using SCR.

For more detailed build instructions, refer to Build SCR.


Contribute

As an open source project, we welcome contributions via pull requests, as well as questions, feature requests, or bug reports via issues. Please refer to both our code of conduct and our contributing guidelines.

Developers

Developer documentation is provided at SCR-dev.ReadTheDocs.io.


SCR uses components from ECP-VeloC, which have their own user and developer docs.

A development build is useful for those who wish to modify how SCR works. It checks out and builds SCR and many of its dependencies separately. The process is more complicated than the user build described above, but the development build is helpful when one intends to commit changes back to the project.

For a development build of SCR and its dependencies on SLURM systems, one can use the bootstrap.sh script:

git clone https://github.com/LLNL/scr.git
cd scr

./bootstrap.sh

cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ..
make install

When using a debugger with SCR, one can build with the following flags to disable compiler optimizations:

./bootstrap.sh --debug

cd build
cmake -DCMAKE_INSTALL_PREFIX=../install -DCMAKE_BUILD_TYPE=Debug ..
make install

One can then run a test program:

cd examples
srun -n4 -N4 ./test_api

For developers who may be installing SCR outside of an HPC cluster, who are using Fedora, and who have sudo access, the following steps install and activate most of the necessary base dependencies:

sudo dnf groupinstall "Development Tools"
sudo dnf install cmake gcc-c++ mpi mpi-devel environment-modules zlib-devel pdsh
[restart shell]
module load mpi

Authors

Numerous people have contributed to the SCR project.

To reference SCR in a publication, please cite the following paper:

Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), 2010.

Additional information and research publications can be found here:

https://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

scr's People

Contributors

adammoody, becker33, c-a-h, camstan, chaseleif, gonsie, hbchen1984, ianlee1521, jameshcorbett, kathrynmohror, mcfadden8, ofaaland, planeta, pozulp, rhaas80, robertkb, shurickdaryin, tonyhutter, white238


scr's Issues

SCR_CNTL_BASE variable

Hi,
I have installed SCR 1.2.0 and want to add an XOR descriptor to scr.conf as discussed in the user manual. For now I want a single XOR descriptor in /tmp/username/cache, so I added the following lines to scr.conf as described in the manual:

SCR_COPY_TYPE=FILE
STORE=/tmp/username/cache          GROUP=NODE   COUNT=1
CKPT=0 INTERVAL=1 GROUP=NODE   STORE=/tmp/username/cache TYPE=XOR     SET_SIZE=8

After running my SCR test program, I get the following error:

SCR v1.2.0 ABORT: rank 0 on <NodeName>: Failed to create store descriptor for control directory [/tmp] @ <SCR_DIRECTORY>/src/scr_storedesc.c:355

I believe this indicates that there is a default store descriptor (/tmp) that cannot be found among the available descriptors (in this case the only available descriptor is /tmp/username/cache, according to the scr.conf file).
I traced the code and found that the default store descriptor is held in a global variable scr_cntl_base. Its value is set at scr.c:841 from scr.conf or the environment if found; otherwise it is set at scr.c:843 from a constant named SCR_CNTL_BASE (default /tmp).
This means that to change the value of scr_cntl_base, I should set the correct descriptor in scr.conf or the environment as follows:

in scr.conf:
SCR_CNTL_BASE = /tmp/username/cache

or in the environment:
export SCR_CNTL_BASE=/tmp/username/cache

But after setting SCR_CNTL_BASE via scr.conf or the environment and running the test program, I get the following error:

SCR v1.2.0 ERROR: rank 0 on <NodeName>: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting

This prevents me from updating the value of scr_cntl_base. I'm confused about how to add my descriptor to the scr.conf file. Am I missing something?

SCR 1.2.0

Spinning up a new SCR release (for @mmpozulp). Progress tracking:

  • add #define SCR_VERSION_MAJOR/MINOR/PATCH in scr.h
  • ensure spack package is up-to-date (see #22)
  • test that SCR builds on all the machines
  • tag a new release on GitHub (with tarball)

Failed to record jobid

Hi,

I'm trying to run the example "test_api" using SCR 1.1.8. After successfully installing the dependencies and SCR, I cannot run the example because no job id is created, and I get the following error:
SCR v1.1.8 ABORT: rank 0 on ubuntu: Failed to record jobid @ scr.c:329 application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0 [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=0 : system msg for write_line failure : Bad file descriptor

I looked at the code of "test_api" and found that this error is thrown when "scr_init" is called. I also get this error when running all the other examples. I'm using MPICH v3.2 on Ubuntu. Any idea what the problem could be?

SCR commands in Slurm script

Hi,

I'm currently trying to get SCR to work in a Slurm script. I used the example script from the tutorial, but I cannot call SCR commands in the script. I get the following errors when running sbatch slurm_script:

/local/tmp/slurmd/job08249/slurm_script: line 8: /usr/local/tools/dotkit/init.sh: File or directory not found
/local/tmp/slurmd/job08249/slurm_script: line 9: use: Command not found
/local/tmp/slurmd/job08249/slurm_script: line 23: scr_srun: Command not found

In fact, there is no init.sh or dotkit installed on my system, and I couldn't find anything about this in the source or the documentation.

clean up of testing

Review output from the examples. I think the cleanup step is missing a few files.

tests: compilers

Should test all the possible compilers: intel, gcc, clang, pgi, etc.

tests: Add cases for async flush

Want to verify that things work in at least the following cases:

  1. job ends normally (no failure)
  2. job exits early while an async flush was ongoing and it gets scavenged
  3. job exits early while an async flush was ongoing and it is restarted in the allocation (be careful that scr_distribute doesn't pull files out from under the ongoing flush)

Support machines that have no cache space (data can only be written to file system)

For systems that have no SSDs available for cache and insufficient free memory for a ramdisk, let's write checkpoints directly to the file system. Let's assume we can still write filemap info to a ramdisk, though (it's very small).

Route_file will need to do mkdir() calls here, probably for each component below the prefix directory.
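
As a sketch of what those mkdir() calls could look like, here is a hypothetical helper (not SCR code) that creates every missing component below the prefix, like mkdir -p:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* hypothetical helper: create each directory component of path, like mkdir -p */
static int mkdir_recursive(const char* path, mode_t mode)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s", path);
    for (char* p = tmp + 1; *p != '\0'; p++) {
        if (*p == '/') {
            *p = '\0';
            if (mkdir(tmp, mode) != 0 && errno != EEXIST) {
                return -1;
            }
            *p = '/';
        }
    }
    if (mkdir(tmp, mode) != 0 && errno != EEXIST) {
        return -1;
    }
    return 0;
}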

We'll need to track file names and write SCR metadata that we'd normally get after a flush.

We need fetch to be aware of this, too.

The scavenge script will probably have to be modified, too.

tests: dependencies

We should test dependencies, either with known working versions or latest masters.

Can we add an API call to SCR that sets user parameters?

Section 6 "Configure a job" of the SCR User's Manual says

SCR searches the following locations in the following order for a parameter value, taking the first value it finds.

  1. Environment variables,
  2. User configuration file,
  3. System configuration file,
  4. Compile-time constants.

Some parameters, such as the location of the control directory, cannot be specified by the user. Such parameters must be either set in the system configuration file or hard-coded into SCR as compile-time constants.

Can we add an API call to SCR that sets user parameters, like those specified in 1 and 2? Doing this would allow users of my code to put all SCR configuration parameters inside the input deck for my code.

Note that if SCR needs user-settable parameters at SCR_Init time, then they could be passed there instead of a new API call.
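
For illustration, such a call might look like the sketch below. Later SCR releases did add an SCR_Config() function along these lines, but treat the exact signature and parameter values here as assumptions:

#include "mpi.h"
#include "scr.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    /* assumed API: set parameters taken from the application's input deck,
       before SCR_Init() reads its configuration */
    SCR_Config("SCR_FLUSH=10");
    SCR_Config("SCR_CACHE_SIZE=2");

    SCR_Init();
    /* ... checkpoint and restart as usual ... */
    SCR_Finalize();

    MPI_Finalize();
    return 0;
}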

Check into providing Fortran interface as source file, rather than libscrf

Because C provides a standard ABI, libscr can be compiled once and then linked to an app built with any compiler. However, we must compile libscrf for each Fortran compiler. If we can provide the Fortran interface via a raw source file or include file that apps compile in directly, then we may reduce the number of SCR builds one has to maintain.

Job exiting due to SCR_FINALIZE_CALLED

My code worked well with a previous version of SCR (I can't remember which version).
Last week I upgraded the SCR source code to the newest version in the GitHub repo and recompiled it. However, when I try to run my code, it always exits before SCR_Init() finishes. The debugging info says the job exits because of "SCR_FINALIZE_CALLED". Why is SCR_Finalize called during SCR_Init()? I traced the problem to the function "scr_bool_check_halt_and_decrement()", but I don't understand why it happens. I was wondering if someone could help me with this. Thanks!

configuring perl scripts

The way we build our Perl scripts is currently broken. The scripts need to be configured with the path to the other Perl scripts, but this path at build time (and thus at make test time) differs from the path needed at install time. We need to re-configure the scripts when a user runs make install.

cannot run configure

Hi,

When I run configure I get the following:
config.status: creating Makefile
config.status: error: cannot find input file: `src/Makefile.in'

Before this, running the autogen.sh script gives the following error:
configure.ac:57: error: possibly undefined macro: AC_PROG_LIBTOOL. If this token and others are legitimate, please use m4_pattern_allow. See the Autoconf documentation.

As a result, I cannot run make or make install.

LSF: Can we execute a scavenge?

This requires running a copy command from each remaining healthy node to copy data to the parallel file system, perhaps in an allocation with failed nodes.

Simplify use in cases where some levels support async flush and some do not

Currently, the user must be careful to configure SCR_FLUSH to line up with their checkpoint descriptors if they want to use async flush on a multi-cache system, where only some levels support async functionality. This is cumbersome for users to keep track of. Let's make this easier.

We could either fall back to a synchronous flush when the output lands in a cache location that only supports sync, or always select a cache location that supports async flush, even if the checkpoint id would normally write it to some other level.

Flush Async

SCR's asynchronous flush capability currently doesn't work. When async flush is turned on (SCR_FLUSH=10, SCR_FLUSH_ASYNC=1), the program hangs during SCR_Init.

The problem is the following:

  1. When the transfer daemon starts, it sets the STATE in the transfer file to STOPPED.
  2. Within SCR_Init, any existing transfer file is deleted and recreated. SCR then issues a "flush_async_stop" command. This command triggers a "wait" on STATE STOPPED in the transfer file.
    Because the file has been deleted since the start of the transfer daemon, there is no STATE entry. This triggers an infinite wait loop.

Possible solutions:

  • Trigger the start of the transfer daemon from within SCR_Init. This is most likely a no-go because we are unable to fork from within our program.
  • Do not delete any existing transfer file. This may impact synchronous flush and/or restart operations.
  • Do not issue a flush_async_stop command at startup. This may impact asynchronous flush activities during a restart operation.

All in all, we need to clarify the correct initialization process for the transfer file and, in the case of flush async, the transfer daemon.
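
A fourth option in the same spirit as the last two: make the wait tolerate a missing STATE entry so it cannot spin forever. A sketch, where read_transfer_state() is a hypothetical stand-in for SCR's transfer-file parsing and the file path is illustrative:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum xfer_state { XFER_UNKNOWN, XFER_RUNNING, XFER_STOPPED };

/* hypothetical stand-in for SCR's transfer-file parsing;
   returns XFER_UNKNOWN if the file or its STATE entry is absent */
static enum xfer_state read_transfer_state(const char* file)
{
    FILE* f = fopen(file, "r");
    if (f == NULL) {
        return XFER_UNKNOWN;
    }
    enum xfer_state s = XFER_UNKNOWN;
    char line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "STATE=STOPPED", 13) == 0) s = XFER_STOPPED;
        if (strncmp(line, "STATE=RUNNING", 13) == 0) s = XFER_RUNNING;
    }
    fclose(f);
    return s;
}

static void wait_for_stopped(const char* file)
{
    /* a missing file or STATE entry counts as stopped, since the transfer
       file may have been deleted and recreated after the daemon wrote
       STATE=STOPPED */
    while (read_transfer_state(file) == XFER_RUNNING) {
        usleep(100 * 1000); /* poll every 100 ms */
    }
}

int main(void)
{
    wait_for_stopped("/tmp/scr/transfer.info"); /* illustrative path */
    return 0;
}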

Support parsing environment variables in config files

Many sites define job-specific directories that users may write to, which are defined via strings like /scratch/$USER/$SLURM_JOBID. To enable different sites to configure SCR, we need to provide a way for users to build dynamic directory paths like this for their cache and control directories.
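
A sketch of one way to implement this, substituting $VAR references with getenv(); a real version would also need to handle ${VAR} syntax, escaping, and undefined variables more carefully:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* expand $VAR references in `in` into `out`, dropping undefined variables */
static void expand_env(const char* in, char* out, size_t outlen)
{
    size_t o = 0;
    while (*in != '\0' && o + 1 < outlen) {
        if (*in == '$') {
            in++;
            char var[128];
            size_t v = 0;
            while ((isalnum((unsigned char)*in) || *in == '_') && v + 1 < sizeof(var)) {
                var[v++] = *in++;
            }
            var[v] = '\0';
            const char* val = getenv(var);
            while (val != NULL && *val != '\0' && o + 1 < outlen) {
                out[o++] = *val++;
            }
        } else {
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
}

int main(void)
{
    char path[1024];
    expand_env("/scratch/$USER/$SLURM_JOBID", path, sizeof(path));
    printf("%s\n", path);
    return 0;
}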

io-watchdog link

The io-watchdog link in the README points to a Google Code site. Now that that service has reached end-of-life, is there a better link to use?

enabling PMIx leads to a build error

The error is as follows (emphasis added on the error lines):
make[1]: Circular libscr.la <- libscr.la dependency dropped.
/bin/sh ../libtool --tag=CC --mode=link gcc -DHAVE_CONFIG_H -I. -I../config -I../src -I../config -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/include -DHAVE_CONFIG_H -I. -I../config -I../src -I../config -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -I/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/../pmix/include -avoid-version -o libscr.la -rpath /home/caholgu1/test_install/lib libscr.la -L/scratch/jenkins-2/artifacts/ompi-update-scratch/496/hwloc/lib -L/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/lib -lmpi -Wl,-rpath -Wl,/scratch/jenkins-2/artifacts/ompi-update-scratch/496/hwloc/lib -Wl,-rpath -Wl,/scratch/jenkins-2/artifacts/ompi-update-scratch/496/ompi/lib -Wl,--enable-new-dtags -lz -lpmix
libtool: link: cannot find the library `libscr.la' or unhandled argument `libscr.la'
make[1]: *** [libscr.la] Error 1

Logging

We should add better logging, either to syslog or to a user-defined log file. If we used syslog, the information would automatically get pulled into Splunk on LC systems.
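
For the syslog route, the standard POSIX calls would be enough; a minimal sketch with made-up message text:

#include <syslog.h>

int main(void)
{
    openlog("scr", LOG_PID, LOG_USER);
    syslog(LOG_INFO, "checkpoint %d flushed to parallel file system", 42);
    syslog(LOG_ERR, "rebuild failed for dataset %d", 17);
    closelog();
    return 0;
}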

Date::Manip in install doc

Hi,

In README,

Date::Manip -- Perl module for date/time interpretation
    http://search.cpan.org/~sbeck/Date-Manip-5.54/lib/Date/Manip.pod

that link is dead. However,

yum install perl-Date-Manip

provides (at least in CentOS 7) the desired module.

Multiple applications picking up same SCR data

I am doing some testing, and it seems like multiple applications will pick up the same SCR data.

Notably, I'm trying to run the test_api example as well as some of the heatdis VeloC examples. Is this known/expected behavior? Does each application calling SCR need its own install of the library?

datawarp.h missing

Checking out a clean version of master, the build fails because it cannot find datawarp.h. This appears to be the case regardless of configure-time options; I tried enabling and disabling CPPR, and it made no difference.

My configure line was as follows:
./configure --with-file-lock=fcntl --prefix=/home/caholgu1/scr --with-scr-config-file=/home/caholgu1/scr_demo/scr.conf

Develop MPI-based scavenge

Some systems don't support pdsh, but they all generally support running MPI jobs. We can improve portability by having an MPI-based scavenge in addition to our pdsh-based version.

Also with MPI, we can:

  1. identify the list of datasets we can flush
  2. rebuild missing files
  3. initiate transfers when using vendor-specific APIs

The tricky part is to be sure we have a way to avoid bad nodes when launching the MPI-based scavenge app.
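
A minimal sketch of the idea, with illustrative paths rather than SCR's actual cache and prefix layout: each rank copies its node-local cache file to the parallel file system, then all ranks agree on the outcome so a partial scavenge can be detected and retried:

#include <stdio.h>
#include "mpi.h"

/* copy src to dst; returns 0 on success, -1 on failure */
static int copy_file(const char* src, const char* dst)
{
    FILE* in = fopen(src, "rb");
    if (in == NULL) return -1;
    FILE* out = fopen(dst, "wb");
    if (out == NULL) { fclose(in); return -1; }

    char buf[65536];
    size_t n;
    int rc = 0;
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) { rc = -1; break; }
    }

    fclose(in);
    if (fclose(out) != 0) rc = -1;
    return rc;
}

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char src[256], dst[256];
    snprintf(src, sizeof(src), "/tmp/scr/rank_%d.ckpt", rank);       /* node-local cache */
    snprintf(dst, sizeof(dst), "/p/lustre/ckpt/rank_%d.ckpt", rank); /* parallel file system */

    int rc = copy_file(src, dst);

    /* agree globally on success (MPI_MIN picks up any rank's failure) */
    int all_rc;
    MPI_Allreduce(&rc, &all_rc, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);

    MPI_Finalize();
    return (all_rc == 0) ? 0 : 1;
}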

Define default value for SCR_JOB_ID

For cases where we can't read the jobid from the resource manager, let's make one up.

For now, let's hardcode some value here. Check that we use the same value in the library and the scripts, if needed.

Eventually, we could perhaps have rank 0 generate a value and bcast that to the other ranks. The tricky part is to verify that we have a value if needed in the scripts.
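
A sketch of that eventual approach; the time-based id is just one way to make up a value:

#include <stdio.h>
#include <time.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank 0 invents a job id when the resource manager provides none */
    unsigned long jobid = 0;
    if (rank == 0) {
        jobid = (unsigned long) time(NULL);
    }

    /* broadcast so every rank agrees on the same id */
    MPI_Bcast(&jobid, 1, MPI_UNSIGNED_LONG, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("using default jobid %lu\n", jobid);
    }

    MPI_Finalize();
    return 0;
}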
