pvandyken / kslurm
Helper scripts and wrappers for running commands on SLURM compute clusters.
Home Page: https://kslurm.readthedocs.io/en/latest/
License: MIT License
Singularity will eventually be one of the core features of the library. The Compute Canada clusters present unique challenges for pulling and organizing images, and the recommended procedures should be baked into easy-to-use scripts. Ideally, the solution will still be generalizable to other compute clusters.
The basic scheme will be as follows:
`kpy save` first writes the venv to a tmp file, then moves it to the final destination. This helps prevent data loss when overwriting an old venv: should the process get interrupted, the overwritten file won't be deleted until the last possible moment.
This last possible moment could be delayed even further: currently, the tmp tar file is written on whatever tmp filesystem is available. This could be a different filesystem than the storage dir, which would require a cp operation upon moving. This cp could take a minute or two if the tar is very large. A better approach would be to first mv the tar file into the dest directory under a tmp name, then delete the overwritten file, then rename the tmp file to the final name.
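The improved scheme could be sketched as follows (the function and path names are illustrative, not kslurm's actual API):

```python
import shutil
from pathlib import Path

def atomic_save(tmp_tar: Path, dest: Path) -> None:
    """Move tmp_tar to dest, deleting any old copy at the last possible moment.

    First move the tar into the destination directory under a temporary
    name (the potentially slow cross-filesystem copy happens here), then
    delete the old file, then rename. The old venv archive survives until
    the new one is fully inside the destination filesystem.
    """
    staging = dest.with_name(dest.name + ".incoming")
    shutil.move(str(tmp_tar), staging)  # may be a slow cross-fs copy
    if dest.exists():
        dest.unlink()                   # old archive removed only now
    staging.rename(dest)                # same-filesystem rename: cheap and atomic
```

The key property is that the only destructive step (`unlink`) happens after the new archive already resides on the destination filesystem, so an interruption during the copy leaves the old archive intact.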
You can `kpy create` a venv on the login node, then install to it and save it, but if you want to make changes to it in the future there doesn't seem to be an easy way to do so, since the commands to load and activate can only be run on a compute node. Am I missing something?
Also, it would be useful to be able to `kpy list` the venvs on the login node too, but this is also disabled for non-compute nodes.
In kjupyter, we call an internet service to get the public IP for ssh. This is supposed to fall back to `<hostname>` if the call fails, but instead of checking the exit status, we watch for a `RuntimeError` that never gets thrown. This should be corrected.
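A sketch of the intended fallback logic, checking the exit status rather than catching an exception that `subprocess` never raises (the lookup service URL and function name here are illustrative, not kjupyter's actual code):

```python
import socket
import subprocess

def get_host_for_ssh() -> str:
    """Return the public IP if the lookup succeeds, else fall back to hostname.

    subprocess.run does not raise RuntimeError on a failed command; the
    failure is reported through returncode, so that is what we inspect.
    """
    try:
        result = subprocess.run(
            ["curl", "-fsS", "https://ifconfig.me"],  # illustrative IP service
            capture_output=True,
            text=True,
            timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return socket.gethostname()  # curl missing or hung: fall back
    if result.returncode == 0 and result.stdout.strip():
        return result.stdout.strip()
    return socket.gethostname()      # call failed: fall back to <hostname>
```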
Currently, there is no way to configure values on a user-by-user basis, greatly limiting the generalizability of the library. Thus, an immediate priority is to integrate config functionality. The most obvious variable to set is the slurm account, which is currently hard-coded to the one I use. But once the config is in place, we could also allow the user to set the default resources. Future features will add further editable variables.
Many parts of the app lack good tests.
Documentation is almost non-existent at this point. The first priority is to write docstrings for all of the functions.
Rather than spawning a new compute node or using the existing one, we get some SLURM error output:

```
slurmstepd: error: execve(): source: No such file or directory
srun: error: gra395: task 0: Exited with exit code 2
srun: launch/slurm: _step_signal: Terminating StepId=63208972.1
^Cscancel: error: Invalid job id 0
```

If running `kjupyter` on a compute node is not tenable, an error should be thrown immediately.
Currently, neuroglia-helpers offers a BidsBatch script, which parallelizes a BidsApp over subjects on SLURM. Once we have #2 working, this script can easily be recreated.
As an improvement on the current limitation, our BidsBatch script should be able to handle BidsApps more dynamically. Currently, BidsApps, including their versions, are hard-coded into a TSV file. The new implementation should be able to determine what versions are available automatically. We could also pull data from bids-apps: they keep a directory of BidsApps, and the data is available in the _config.yml file on their GitHub.
Finally, users should be able to easily specify new BidsApps. Ideally, I would like to trust that the user knows what they're doing and allow them to specify any arbitrary Docker image. If it's not in the list of "official" BidsApps, we can just display a warning that the app is not recognized and errors may occur.
Currently, we have an install script (`install-kslurm.py`) in the root directory with almost identical code to the installer module. These could be united using the Python zipfile API. As part of the build process, zipfile would be used to package the necessary modules together into a self-contained install script that could then be pulled from the server and run. Thus, the same easy user experience, with less code redundancy.
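The standard library's `zipapp` module (built on zipfile) handles exactly this packaging step. A minimal sketch of what the build step might look like, assuming the installer modules live in a directory with a `__main__.py` entry point (paths and names are illustrative):

```python
import zipapp
from pathlib import Path

def build_install_script(src_dir: str, out: str = "install-kslurm.pyz") -> Path:
    """Bundle the installer package into a single self-contained script.

    src_dir must contain the installer modules plus a __main__.py entry
    point; the resulting archive can be downloaded from the server and
    run directly, e.g. `python install-kslurm.pyz`.
    """
    target = Path(out)
    zipapp.create_archive(
        src_dir,
        target=target,
        interpreter="/usr/bin/env python3",  # shebang for direct execution
    )
    return target
```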
To ensure the install script stays as small as possible, the installer code needs to be as isolated as possible from the rest of the library. Currently, it uses the custom argparsing library; once that library is repackaged on its own, the installer code should lose any dependency on kslurm.
Long term, the installer module could itself be repackaged as its own library. The code is already quite functional, so it shouldn't be too much of a challenge to generalize it so that any Python library could be installed into its own isolated venv.
No file locks are currently used by kpy to protect the integrity of a venv cache when opening it. This would present a problem if another process tried to overwrite the save. It should be a simple enough fix to impose a read/write lock.
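One simple way to get such a read/write lock on POSIX systems is `fcntl.flock` with shared and exclusive modes. A minimal sketch (the lock-file location and helper name are illustrative, not kpy's actual layout):

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def venv_cache_lock(cache_dir: Path, exclusive: bool = False):
    """Hold a shared lock while reading the cache, exclusive while writing.

    Multiple readers may proceed concurrently; a writer waits until all
    readers release, and blocks new readers until its overwrite completes.
    """
    lock_file = cache_dir / ".cache.lock"
    with open(lock_file, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

Usage would be `with venv_cache_lock(cache_dir): ...` around reads and `with venv_cache_lock(cache_dir, exclusive=True): ...` around saves.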
Not necessarily `kslurm`-specific, and all minor:
The first two points came up while just looking through GitHub.
I noticed this when I was working with neuroglia-helpers as well, but there may be redundant job templates, such as `32core128gb3h` vs `ShortFat` (see slurm_job_templates.json). I wonder if it might be better to move away from the original naming scheme (e.g. `ShortFat`) to naming by resource (e.g. `32core128gb24h`).
`32core128gb12h` is given 2 CPUs rather than 32 in the current slurm_job_templates.json.
This one might be on my end, but the `data` folder is missing on installation. When running `krun --list`, a `FileNotFoundError` is returned because it is unable to locate `slurm_job_templates.json`.
Currently, the developer must manually specify an argument or method for the user to get help on a command. The argument parser should automatically add a Help Argument to every model sent for parsing. It should be a `FlagArg` accepting either `-h` or `--help`.
Note that given the current typed system, we can't actually return the Help Argument unless the user explicitly makes an attribute for it. We would have to know which attribute should receive the help results, likely necessitating a new arg type: `HelpArg`. Alternatively, we could return a tuple, `(ArgList, HelpArg)`, although this is not particularly elegant. Eventually, the parser should be able to print help messages of its own accord, making this less of an issue.
One caveat that should be worked out at the same time is what happens if the user supplies their own arg matching `-h` or `--help`. One approach is to add our automatic Help Argument to the back of the list, so that it has the lowest matching priority. This could lead to unexpected behaviour, though. For instance, if the developer overrides `-h`, but not `--help`, and then implements their own entirely separate help system, the end user would get an unintended message when running the command with `--help`. We could also require the user to opt in or out of automatic help via a boolean argument in `parse-args()`. Or, we could check `FlagArgs` and `KeywordArgs` to see if any of them match `-h` or `--help` and print a warning when they are detected.
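The warn-and-append option could be sketched like this, assuming each flag exposes the strings it matches (the `FlagSpec` class here is a stand-in for kslurm's `FlagArg`, not the real class):

```python
import warnings
from dataclasses import dataclass

@dataclass
class FlagSpec:
    """Minimal stand-in for kslurm's FlagArg (illustrative only)."""
    name: str
    match: list

def add_auto_help(flags: list) -> list:
    """Append an automatic help flag with the lowest matching priority,
    warning if any user-declared flag already claims -h or --help."""
    reserved = {"-h", "--help"}
    for f in flags:
        taken = reserved & set(f.match)
        if taken:
            warnings.warn(
                f"flag {f.name!r} overrides {sorted(taken)}; "
                "automatic help may behave unexpectedly"
            )
    return [*flags, FlagSpec(name="help", match=sorted(reserved))]
```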
Otherwise, relative paths get interpreted relative to the directory one is currently in, which is not the desired behaviour.
Currently, a current working directory can be supplied to any of the commands using the Directory `ShapeArg`. To find a match, it simply takes the argument and checks whether it's a valid directory. This will cause a conflict if the user is trying to run a program with the same name as a subdirectory. For instance, if the user is in `~/scratch` and has a sub-directory called `~/scratch/python`, they would not be able to run `python` as a command: the parser would interpret `python` as referring to the directory.
As a fix, directory specifications should be required to have a forward slash (`/`) character somewhere in the string. So in the example above, if the user wanted `python` to be the cwd, they would need to specify `./python`. This will eliminate the overlap issue.
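The matching rule could be as small as this (function name illustrative):

```python
from pathlib import Path

def match_directory(arg: str) -> bool:
    """Match an argument as a cwd only if it is an existing directory AND
    is written with a path separator, leaving bare names like `python`
    free to match commands even when a sibling directory shares the name.
    """
    return "/" in arg and Path(arg).is_dir()
```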
The current implementation of the argument parser uses a number to distinguish between a finite, fixed number of values (positive number), a "lazy" infinite mode that stops when it reaches a non-positional arg (0), and a "greedy" infinite mode that never stops (negative number). This does not allow for a "lazy" finite mode, which stops when it reaches a non-positional arg or when it reaches its limit, and there's no obvious way to add one. Plus, the current parser can't distinguish between a lazy infinite keyword and a completed fixed-number keyword. So if you have a fixed number, and a positional arg follows, that positional arg will get eaten by the keyword arg:

```
command --1-arg-keyword value positional Shape
```

Here, that positional would be counted as a value for `--1-arg-keyword`, an obvious bug.
As a solution, generalize greedy and lazy to a new parameter: `greedy=[True|False]`. Negative values for `num` will no longer be defined (although still used internally in the counter to represent any infinite keyword arg). Internally, a counter of 0 will always represent a completed keyword arg. A negative counter will always be infinite: lazy if `greedy=False` and greedy otherwise. And a positive counter will behave as before, decrementing each time a value is swallowed.
If `greedy=False`, i.e. we're in lazy mode, then whenever a non-positional arg is coming up, the keyword will end, whether the counter is negative (infinite) or positive (finite). The counter will be set immediately to 0.
Currently, output is written to the default location: a file in the cwd named after the job title. We can allow more flexible naming using an `--output` flag. The SLURM naming syntax should be included in the help, and we could also extend the SLURM syntax with things like timestamps. Finally, if the user specifies a sub-directory, we should ensure that directory exists first (SLURM makes no such check, and will simply not create an output file if the directory doesn't exist).
Looks like the folder `<pip_dir>/venv_archives` needs to be manually created before running `kpy save ...`:
```
(presurfer) [akhanf@gra-login3 presurfer-smk]$ kpy save presurfer
pipdir not set. Please set pipdir using `kslurm config pipdir <directory>`, typically to a project-space or permanent storage directory
(presurfer) [akhanf@gra-login3 presurfer-smk]$ kslurm config pipdir $WHEELHOUSE
(presurfer) [akhanf@gra-login3 presurfer-smk]$ kpy save presurfer
Traceback (most recent call last):
  File "/home/akhanf/.local/bin/kpy", line 8, in <module>
    sys.exit(kpy())
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/args/command.py", line 60, in wrapper
    func(parsed)  # type: ignore
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 447, in kpy
    command([name, *tail.values])
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/args/command.py", line 60, in wrapper
    func(parsed)  # type: ignore
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 251, in _save
    venv_cache = VenvCache()
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 98, in __init__
    venvs_re = [
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 98, in <listcomp>
    venvs_re = [
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.9.6/lib/python3.9/pathlib.py", line 1150, in iterdir
    for name in self._accessor.listdir(self):
FileNotFoundError: [Errno 2] No such file or directory: '/project/6050199/akhanf/opt/wheelhouse/venv_archives'
```
To meet the needs of our current implementation, the argument parser stops looking for Shapes, Flags, and Keywords as soon as it runs out of Positional args and meets a new unidentified argument. This commences the TailArg, and everything thereafter falls into it. It would be good, even for kslurm's needs, to allow global parsing of non-positionals, collecting only unidentified args into the TailArg.
This will require a two-pass approach: first, one pass ignoring all positionals and aggregating keywords; then separate out all the non-positionals; then pass the remaining positionals and unidentified args back into the parser for Positional parsing and collection into the TailArg.
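The first pass of the two-pass approach could be sketched with a toy model (this is illustrative, not kslurm's real parser; `keywords` maps a flag name to the number of values it consumes):

```python
def two_pass_parse(argv: list, keywords: dict) -> tuple:
    """Pass 1: walk the whole argv, pulling out every keyword and its
    values regardless of position. Everything left over is handed back
    for positional parsing, with true unknowns landing in the tail.
    """
    matched = {}
    remaining = []
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok in keywords:
            n = keywords[tok]
            matched[tok] = argv[i + 1 : i + 1 + n]  # swallow n values
            i += 1 + n
        else:
            remaining.append(tok)  # candidate positional or tail arg
            i += 1
    return matched, remaining
```

With this split, a keyword anywhere in the command line is recognized, and only genuinely unidentified tokens remain for the second (positional) pass.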
Slurm commands, in particular `sinfo`, are not necessarily present, even in a compute environment. For example, when `scp`ing a file to the server, the slurm commands are never loaded, resulting in the error `sinfo: command not found`. This can be fixed with a simple presence check.
Any idea what could be the problem here?
```
(presurfer) [akhanf@gra-login3 wheelhouse]$ kpy bash >> ~/.bashrc
Traceback (most recent call last):
  File "/home/akhanf/.local/bin/kpy", line 8, in <module>
    sys.exit(kpy())
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/args/command.py", line 60, in wrapper
    func(parsed)  # type: ignore
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/cli/kpy.py", line 447, in kpy
    command([name, *tail.values])
  File "/home/akhanf/.local/pipx/venvs/kslurm/lib/python3.9/site-packages/kslurm/args/command.py", line 60, in wrapper
    func(parsed)  # type: ignore
TypeError: _bash() takes 0 positional arguments but 1 was given
```