pvandyken / kslurm
Helper scripts and wrappers for running commands on SLURM compute clusters.
Home Page: https://kslurm.readthedocs.io/en/latest/
License: MIT License
Singularity will eventually be one of the core features of the library. The Compute Canada clusters present unique challenges for pulling and organizing images, and the recommended procedures should be baked into easy-to-use scripts. Ideally, the solution will still be generalizable to other compute clusters.
The basic scheme will be as follows:
`kpy save` first writes the venv to a tmp file, then moves it to the final destination. This helps prevent data loss when overwriting an old venv: should the process get interrupted, the overwritten file won't be deleted until the last possible moment.
This last possible moment could be delayed even further: currently, the tmp tar file is written on whatever tmp filesystem is available. This could be a different filesystem than the storage dir, which would require a cp operation upon moving. This cp could take a minute or two if the tar is very large. A better approach would be to first mv the tar file into the dest directory under a tmp name, then delete the overwritten file, then rename the tmp file to the final name.
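The improved scheme could be sketched as follows (the function and path names are illustrative, not kslurm's actual API):

```python
import shutil
from pathlib import Path

def atomic_save(tmp_tar: Path, dest: Path) -> None:
    """Move tmp_tar to dest, deleting any old copy at the last possible moment.

    First move the tar into the destination directory under a temporary
    name (the potentially slow cross-filesystem copy happens here), then
    delete the old file, then rename. The old venv archive survives until
    the new one is fully inside the destination filesystem.
    """
    staging = dest.with_name(dest.name + ".incoming")
    shutil.move(str(tmp_tar), staging)  # may be a slow cross-fs copy
    if dest.exists():
        dest.unlink()                   # old archive removed only now
    staging.rename(dest)                # same-filesystem rename: cheap and atomic
```

The key property is that the only destructive step (`unlink`) happens after the new archive already resides on the destination filesystem, so an interruption during the copy leaves the old archive intact.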
You can `kpy create` a venv on the login node, then install to it and save it, but if you want to make changes to it in the future there doesn't seem to be an easy way to do so, since the commands to load and activate can only be run on a compute node. Am I missing something?
Also, it would be useful to be able to `kpy list` the venvs on the login node too, but this is also disabled for non-compute nodes.
In kjupyter, we call an internet service to get the public IP for ssh. This is supposed to fall back to `<hostname>` if the call fails, but instead of checking the exit status, we watch for a `RuntimeError` that never gets thrown. This should be corrected.
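A sketch of the intended fallback logic, checking the exit status rather than catching an exception that `subprocess` never raises (the lookup service URL and function name here are illustrative, not kjupyter's actual code):

```python
import socket
import subprocess

def get_host_for_ssh() -> str:
    """Return the public IP if the lookup succeeds, else fall back to hostname.

    subprocess.run does not raise RuntimeError on a failed command; the
    failure is reported through returncode, so that is what we inspect.
    """
    try:
        result = subprocess.run(
            ["curl", "-fsS", "https://ifconfig.me"],  # illustrative IP service
            capture_output=True,
            text=True,
            timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return socket.gethostname()  # curl missing or hung: fall back
    if result.returncode == 0 and result.stdout.strip():
        return result.stdout.strip()
    return socket.gethostname()      # call failed: fall back to <hostname>
```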
Currently, there is no way to configure values on a user-by-user basis, greatly limiting the generalizability of the library. Thus, an immediate priority is to integrate config functionality. The most obvious variable to set is the slurm account, which is currently hard-coded to the one I use. But once the config is in place, we could also allow the user to set the default resources. Future features will add further editable variables.
Many parts of the app lack good tests.
Documentation is almost non-existent at this point. The first priority is to write docstrings for all of the functions.
Rather than spawning a new compute node or using the existing one, we get some SLURM error output:

```
slurmstepd: error: execve(): source: No such file or directory
srun: error: gra395: task 0: Exited with exit code 2
srun: launch/slurm: _step_signal: Terminating StepId=63208972.1
^Cscancel: error: Invalid job id 0
```

If running `kjupyter` on a compute node is not tenable, an error should be thrown immediately.
Currently, neuroglia-helpers offers a BidsBatch script, which parallelizes a BidsApp over subjects on SLURM. Once we have #2 working, this script can easily be recreated.
As an improvement on the current limitation, our BidsBatch script should be able to handle BidsApps more dynamically. Currently, BidsApps, including their versions, are hard-coded into a TSV file. The new implementation should be able to determine what versions are available automatically. We could also pull data from bids-apps: they keep a directory of BidsApps, and the data is available in the _config.yml file on their GitHub.
Finally, users should be able to easily specify new BidsApps. Ideally, I would like to trust that the user knows what they're doing and allow them to specify any arbitrary Docker image. If it's not in the list of "official" BidsApps, we can just display a warning that the app is not recognized and errors may occur.
Currently, we have an install script (`install-kslurm.py`) in the root directory with almost identical code to the installer module. These could be united using the Python zipfile API. As part of the build process, zipfile would be used to package the necessary modules together into a self-contained install script that could then be pulled from the server and run. Thus, the same easy user experience, with less code redundancy.
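The standard library's `zipapp` module (built on zipfile) handles exactly this packaging step. A minimal sketch of what the build step might look like, assuming the installer modules live in a directory with a `__main__.py` entry point (paths and names are illustrative):

```python
import zipapp
from pathlib import Path

def build_install_script(src_dir: str, out: str = "install-kslurm.pyz") -> Path:
    """Bundle the installer package into a single self-contained script.

    src_dir must contain the installer modules plus a __main__.py entry
    point; the resulting archive can be downloaded from the server and
    run directly, e.g. `python install-kslurm.pyz`.
    """
    target = Path(out)
    zipapp.create_archive(
        src_dir,
        target=target,
        interpreter="/usr/bin/env python3",  # shebang for direct execution
    )
    return target
```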
To ensure the install script stays as small as possible, the installer code needs to be as isolated as possible from the rest of the library. Currently, it uses the custom argparsing library; once that library is repackaged on its own, the installer code should lose any dependency on kslurm.
Long term, the installer module could itself be repackaged as its own library. The code is already quite functional, so it shouldn't be too much of a challenge to generalize it so that any Python library could be installed into its own isolated venv.
No file locks are currently used by kpy to protect the integrity of a venv cache when opening it. This would present a problem if another process tried to overwrite the save. It should be a simple enough fix to impose a read/write lock.
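One simple way to get such a read/write lock on POSIX systems is `fcntl.flock` with shared and exclusive modes. A minimal sketch (the lock-file location and helper name are illustrative, not kpy's actual layout):

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def venv_cache_lock(cache_dir: Path, exclusive: bool = False):
    """Hold a shared lock while reading the cache, exclusive while writing.

    Multiple readers may proceed concurrently; a writer waits until all
    readers release, and blocks new readers until its overwrite completes.
    """
    lock_file = cache_dir / ".cache.lock"
    with open(lock_file, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

Usage would be `with venv_cache_lock(cache_dir): ...` around reads and `with venv_cache_lock(cache_dir, exclusive=True): ...` around saves.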
Not necessarily `kslurm`-specific, and all minor:
The first two points came up while just looking through GitHub.
I noticed this when I was working with neuroglia-helpers as well, but there may be redundant job templates, such as `32core128gb3h` vs `ShortFat` (see slurm_job_templates.json). I wonder if it might be better to move away from the original naming scheme (e.g. `ShortFat`) to naming by resource (e.g. `32core128gb24h`).
`32core128gb12h` is given 2 CPUs rather than 32 in the current slurm_job_templates.json.
This one might be on my end, but the `data` folder is missing on installation. When running `krun --list`, a `FileNotFoundError` is returned because it is unable to locate `slurm_job_templates.json`.
Currently, the developer must manually specify an argument or method for the user to get help on a command. The argument parser should automatically add a Help Argument to every model sent for parsing. It should be a `FlagArg` accepting either `-h` or `--help`.
Note that given the current typed system, we can't actually return the Help Argument unless the user explicitly makes an attribute for it. We would have to know which attribute should receive the help results, likely necessitating a new arg type: `HelpArg`. Alternatively, we could return a tuple, `(ArgList, HelpArg)`, although this is not particularly elegant. Eventually, the parser should be able to print help messages of its own accord, making this less of an issue.
One caveat that should be worked out at the same time is what happens if the user supplies their own arg matching `-h` or `--help`. One approach is to add our automatic Help Argument to the back of the list, so that it has the lowest matching priority. This could lead to unexpected behaviour, though. For instance, if the developer overrides `-h`, but not `--help`, and then implements their own entirely separate help system, the end user would get an unintended message when running the command with `--help`. We could also require the user to opt in or out of automatic help via a boolean argument in `parse-args()`. Or, we could check `FlagArgs` and `KeywordArgs` to see if any of them match `-h` or `--help` and print a warning when they are detected.
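The warn-and-append option could be sketched like this, assuming each flag exposes the strings it matches (the `FlagSpec` class here is a stand-in for kslurm's `FlagArg`, not the real class):

```python
import warnings
from dataclasses import dataclass

@dataclass
class FlagSpec:
    """Minimal stand-in for kslurm's FlagArg (illustrative only)."""
    name: str
    match: list

def add_auto_help(flags: list) -> list:
    """Append an automatic help flag with the lowest matching priority,
    warning if any user-declared flag already claims -h or --help."""
    reserved = {"-h", "--help"}
    for f in flags:
        taken = reserved & set(f.match)
        if taken:
            warnings.warn(
                f"flag {f.name!r} overrides {sorted(taken)}; "
                "automatic help may behave unexpectedly"
            )
    return [*flags, FlagSpec(name="help", match=sorted(reserved))]
```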
Otherwise, relative paths get interpreted relative to the directory one is currently in, which is not the desired behaviour.
Currently, a current working directory can be supplied to any of the commands using the Directory `ShapeArg`. To find a match, it simply takes the argument and checks whether it's a valid directory. This will cause a conflict if the user is trying to run a program with the same name as a subdirectory. For instance, if the user is in `~/scratch` and has a sub-directory called `~/scratch/python`, they would not be able to run `python` as a command: the parser would interpret `python` as referring to the directory.
As a fix, directory specifications should be required to have a forward slash (`/`) character somewhere in the string. So in the example above, if the user wanted `python` to be the cwd, they would need to specify `./python`. This will eliminate the overlap issue.
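The matching rule could be as small as this (function name illustrative):

```python
from pathlib import Path

def match_directory(arg: str) -> bool:
    """Match an argument as a cwd only if it is an existing directory AND
    is written with a path separator, leaving bare names like `python`
    free to match commands even when a sibling directory shares the name.
    """
    return "/" in arg and Path(arg).is_dir()
```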
The current implementation of the argument parser uses a number to distinguish between a finite, fixed number of values (positive number), a "lazy" infinite mode that stops when it reaches a non-positional arg (0), and a "greedy" infinite mode that never stops (negative number). This does not allow for a "lazy" finite mode, which stops when it reaches a non-positional arg or when it reaches its limit, and there's no obvious way to add one. Plus, the current parser can't distinguish between a lazy infinite keyword and a completed fixed-number keyword. So if you have a fixed number, and a positional arg follows, that positional arg will get eaten by the keyword arg:

```
command --1-arg-keyword value positional Shape
```

Here, that positional would be counted as a value for `--1-arg-keyword`, an obvious bug.
As a solution, generalize greedy and lazy to a new parameter: `greedy=[True|False]`. Negative values for `num` will no longer be defined (although still used internally in the counter to represent any infinite keyword arg). Internally, a counter of 0 will always represent a completed keyword arg. A negative counter will always be infinite: lazy if `greedy=False` and greedy otherwise. And a positive counter will behave as before, decrementing each time a value is swallowed.
If `greedy=False`, i.e. we're in lazy mode, then whenever a non-positional arg is coming up, the keyword will end, whether the counter is negative (infinite) or positive (finite). The counter will be set immediately to 0.
Currently, output is written to the default location: a file in the cwd named after the job title. We can allow more flexible naming using an `--output` flag. The SLURM naming syntax should be included in the help, and we could also extend the SLURM syntax with things like timestamps. Finally, if the user specifies a sub-directory, we should ensure that directory exists first (SLURM makes no such check, and will simply not create an output file if the directory doesn't exist).
Looks like the folder `<pip_dir>/venv_archives` needs to be manually created before running `kpy save ...`:
```
(presurfer) [akhanf@gra-login3 presurfer-smk]$ kpy save presurfer
pipdir not set. Please set pipdir using `kslurm config pipdir <directory>`, typically to a project-space or permanent storage directory
(presurfer) [akhanf@gra-login3 presurfer-smk]$ kslurm config pipdir $WHEELHOUSE
(presurfer) [akhanf@gra-login3 presurfer-smk]$ kpy save presurfer
Traceback (most recent call last):
  File "/home/akhanf/.local/bin/kpy", line 8, in <module>
    sys.exit(kpy())
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/args/command.py", line 60, in wrapper
    func(parsed)  # type: ignore
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 447, in kpy
    command([name, *tail.values])
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/args/command.py", line 60, in wrapper
    func(parsed)  # type: ignore
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 251, in _save
    venv_cache = VenvCache()
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 98, in __init__
    venvs_re = [
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 98, in <listcomp>
    venvs_re = [
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.9.6/lib/python3.9/pathlib.py", line 1150, in iterdir
    for name in self._accessor.listdir(self):
FileNotFoundError: [Errno 2] No such file or directory: '/project/6050199/akhanf/opt/wheelhouse/venv_archives'
```
To meet the needs of our current implementation, the argument parser stops looking for Shapes, Flags, and Keywords as soon as it runs out of Positional args and meets a new unidentified argument. This commences the TailArg, and everything thereafter falls into it. It would be good, even for kslurm's needs, to allow global parsing of non-positionals, collecting only unidentified args into the TailArg.
This will require a two-pass approach: first, one pass ignoring all positionals and aggregating keywords; then separate out all the non-positionals; then pass the remaining positionals and unidentified args back into the parser for Positional parsing and collection into the TailArg.
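The first pass of the two-pass approach could be sketched with a toy model (this is illustrative, not kslurm's real parser; `keywords` maps a flag name to the number of values it consumes):

```python
def two_pass_parse(argv: list, keywords: dict) -> tuple:
    """Pass 1: walk the whole argv, pulling out every keyword and its
    values regardless of position. Everything left over is handed back
    for positional parsing, with true unknowns landing in the tail.
    """
    matched = {}
    remaining = []
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok in keywords:
            n = keywords[tok]
            matched[tok] = argv[i + 1 : i + 1 + n]  # swallow n values
            i += 1 + n
        else:
            remaining.append(tok)  # candidate positional or tail arg
            i += 1
    return matched, remaining
```

With this split, a keyword anywhere in the command line is recognized, and only genuinely unidentified tokens remain for the second (positional) pass.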
Slurm commands, in particular `sinfo`, are not necessarily present, even in a compute environment. For example, when `scp`ing a file to the server, the slurm commands are never loaded, resulting in the error `sinfo: command not found`. This can be fixed with a simple presence check.
Any idea what could be the problem here?
```
(presurfer) [akhanf@gra-login3 wheelhouse]$ kpy bash >> ~/.bashrc
Traceback (most recent call last):
  File "/home/akhanf/.local/bin/kpy", line 8, in <module>
    sys.exit(kpy())
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/args/command.py", line 60, in wrapper
    func(parsed)  # type: ignore
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 447, in kpy
    command([name, *tail.values])
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/args/command.py", line 60, in wrapper
    func(parsed)  # type: ignore
TypeError: _bash() takes 0 positional arguments but 1 was given
```