mikedacre / fyrd
Submit functions and shell scripts to torque and slurm clusters or local machines using Python.
Home Page: https://fyrd.science
License: MIT License
Possibly based on http://frougon.net/projects/CondConfigParser/doc/syntax.html or https://benjamin.smedbergs.us/pymake/ or snakemake
We want a directory at ~/.fyrd, with a config file at ~/.fyrd/conf.txt and profiles defined in ~/.fyrd/profiles.txt. If SQLAlchemy support is added, the database will go in the same place: ~/.fyrd/fyrd.sql.
Profile management is currently only useful for relatively advanced Python users; we should write a management script that can easily call all of the config_file functions.
Testing with the local queue reveals that it isn't actually using all available cores
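One thing worth checking when debugging this: on Linux, the number of cores a machine has and the number of cores the process is actually allowed to use can differ (cgroups, affinity masks). A minimal sketch for comparing the two, assuming nothing about fyrd's internals:

```python
import multiprocessing
import os

def usable_cores():
    """Total cores vs. cores the process may actually use. Cgroup or
    affinity limits can make the second number smaller, which would
    explain a local queue that never saturates the machine."""
    total = multiprocessing.cpu_count()
    try:
        usable = len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:
        usable = total
    return total, usable
```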
Hi,
In our queue system there are some array jobs with job IDs like 136374[].psk. When I run, for example:

```python
job = fyrd.Job('ls .', profile='short')
```

this gives me:

```
~/anaconda3/lib/python3.5/site-packages/fyrd/queue.py in torque_queue_parser(user, partition)
--> 680 job_id = int(xmljob.find('Job_Id').text.split('.')[0])
    681 job_owner = xmljob.find('Job_Owner').text.split('@')[0]
    682 if user and job_owner != user:
ValueError: invalid literal for int() with base 10: '136374[]'
```
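The crash comes from assuming the part of Job_Id before the first dot is purely numeric. A minimal sketch of a more tolerant parser (parse_torque_job_id is a hypothetical helper, not fyrd's actual fix) that strips the [] suffix torque puts on array jobs:

```python
import re

def parse_torque_job_id(raw_id):
    """Tolerant Job_Id parser sketch: accept both plain ids like
    '12345.host' and array-job ids like '136374[].psk', returning
    just the numeric part."""
    base = raw_id.split('.')[0]                 # '136374[].psk' -> '136374[]'
    match = re.match(r'(\d+)(\[\d*\])?$', base)  # optional [] / [N] suffix
    if not match:
        raise ValueError('unrecognised job id: {!r}'.format(raw_id))
    return int(match.group(1))
```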
On my computer, submitting a job creates four files. For instance, job = submit('ls -l') creates:

```
ls.0.cluster.err
ls.0.cluster.out
ls.0.cluster.sbatch
ls.0.cluster.script
```

However, running job.clean() only cleans up the .sbatch and .script files; the .err and .out files still exist after running job.clean().
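One possible shape for the fix, sketched here as a hypothetical module-level clean(prefix, outputs=...) rather than fyrd's real Job.clean signature: keep the default behaviour for the generated script files, and add an opt-in flag for the output files:

```python
import glob
import os

def clean(prefix, outputs=False):
    """Remove generated job files matching prefix. The .sbatch/.script
    files are always removed; .out/.err output files only when
    outputs=True, so results are not deleted by accident."""
    suffixes = ['.sbatch', '.script']
    if outputs:
        suffixes += ['.out', '.err']
    removed = []
    for suffix in suffixes:
        for path in glob.glob(prefix + '*' + suffix):
            os.remove(path)
            removed.append(path)
    return removed
```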
The python-pipeline project is an effort to make it easy to create complex pipelines with python. It isn't that useful outside of a multithreading environment, so it makes sense to merge it in here and implement native multithreading in that project through the cluster module.
Rather than keeping it as a separate project, the pipeline package should be added as a separate package alongside cluster, to be used if the user wishes. However, it is important that its usage is not required in order to use the cluster package; i.e. pipeline should depend on cluster, but cluster should not depend on pipeline.
Tests failing with the latest version of slurm
It would be nice if it could be installed via pip.
Need >80% coverage for the main functions; coverage is currently lower than that.
Needs:
Importantly:
Base
to make information persistent. I want folks to be able to add a simple decorator to a function to make it submit to the cluster when called.
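A rough sketch of what such a decorator could look like. Everything here is hypothetical: jobify and the submit hook are invented names standing in for fyrd's real submission machinery, and with the default submit=None the function simply runs locally:

```python
import functools

def jobify(profile=None, submit=None):
    """Decorator sketch: calling the wrapped function submits it as a
    cluster job and blocks for the result. `submit` is a placeholder
    hook; None means run locally instead of submitting."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if submit is None:
                return func(*args, **kwargs)   # no cluster: run locally
            job = submit(func, args, kwargs, profile=profile)
            job.wait()
            return job.get()
        return wrapper
    return decorator
```

With this in place, putting @jobify(profile='short') above a def would turn every call into a cluster job.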
Right now the flow of the Job submission process comes to a natural end at job completion; there is no real need for this restriction.
Instead:
This should work naturally with the fix to Issue #4 so that the user can update attributes and then resubmit the job.
Documentation should be structured more like GRASP
It might be a nice idea to make it possible to spawn a pool object, like the multiprocessing module, and communicate with it similarly. That would require a database and daemon mode to work properly.
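A sketch of the front-end such a pool might present, borrowing the multiprocessing.Pool interface. The Pool class below is hypothetical and delegates to a local thread pool; the real version would dispatch each task as a cluster job through the daemon and database mentioned above:

```python
from concurrent.futures import ThreadPoolExecutor

class Pool:
    """multiprocessing-style pool front-end (sketch only). Locally it
    wraps a ThreadPoolExecutor; a cluster-backed implementation would
    replace the executor with job submission plus result collection."""
    def __init__(self, processes=4):
        self._executor = ThreadPoolExecutor(max_workers=processes)

    def map(self, func, iterable):
        # preserves input order, like multiprocessing.Pool.map
        return list(self._executor.map(func, iterable))

    def close(self):
        self._executor.shutdown()
```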
Including `import pandas as pd` does not result in pandas being properly imported.
A number of my docstrings are formatted to work well with Python's help display, but they are not parsed correctly by Sphinx; all of the docstrings need to be updated so that the documentation is clear.
Use Google-style docstrings and napoleon.
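For reference, a Google-style docstring that napoleon parses cleanly (the submit function and its signature here are purely illustrative, not fyrd's actual API):

```python
def submit(command, profile=None):
    """Submit a command to the cluster (illustrative docstring only).

    Args:
        command (str): The shell command to run.
        profile (str, optional): Name of a submission profile. Defaults
            to None, meaning the default profile is used.

    Returns:
        None: A real implementation would return a Job object.

    Raises:
        ValueError: If ``command`` is empty.
    """
    if not command:
        raise ValueError('command must be non-empty')
```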
pycluster, python-cluster, and similar are taken, could use a better name for pip
There isn't a good reason to use this anymore; rather than recreating all of its functionality, I think it would be better to enforce a single task per fyrd job.
For slurm, calling q = Queue() or q = Queue(user=None) raises an exception, even though None is the default value of user:

```
>>> q = cluster.Queue(user=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cluster/queue.py", line 141, in __init__
    self._update()
  File "cluster/queue.py", line 311, in _update
    self.user):
  File "cluster/queue.py", line 641, in slurm_queue_parser
    outqueue.append((sid, sname, suser, spartition, sstate, snodelist,
UnboundLocalError: local variable 'snodelist' referenced before assignment
```
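The usual fix for this class of bug is to bind every output variable before any conditional branch, so no code path can reach the final append with a name unset. A minimal sketch of the pattern (slurm_queue_row is a hypothetical stand-in, not fyrd's actual parser):

```python
def slurm_queue_row(fields):
    """Parse one squeue-style row. snodelist is initialized up front,
    so rows without a node list can never raise UnboundLocalError."""
    snodelist = []  # bound before any branch
    sid, sname, suser, spartition, sstate = fields[:5]
    if len(fields) > 5 and fields[5]:
        snodelist = fields[5].split(',')
    return (sid, sname, suser, spartition, sstate, snodelist)
```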
The ezqsub project is old and out of date, but should be fairly easy to merge into here to allow batch job submission from a file.
Interesting project. In doing a quick test-drive, I've hit a PBS directive error. It might be a config error or even a torque version mismatch, but fyrd.get_cluster_environment() does return 'torque'. mem gets the complaint, but I believe walltime also needs to be prefixed by -l in the job script.
```
>>> j = fyrd.Job('ls ', ['.'])
>>> j.submit()
20161129 17:28:48.344 | WARNING --> Command qsub /home/icooke/ls.0.1493c48f.cluster.qsub failed with code 1, retrying.
20161129 17:28:49.348 | WARNING --> Command qsub /home/icooke/ls.0.1493c48f.cluster.qsub failed with code 1, retrying.
20161129 17:28:50.352 | WARNING --> Command qsub /home/icooke/ls.0.1493c48f.cluster.qsub failed with code 1, retrying.
20161129 17:28:51.359 | WARNING --> Command qsub /home/icooke/ls.0.1493c48f.cluster.qsub failed with code 1, retrying.
20161129 17:28:52.364 | CRITICAL --> qsub failed with code 1
-----------------------------------> stdout:
-----------------------------------> stderr: qsub: directive error: mem=4000MB
```

And the script file:

```shell
$ cat /home/icooke/ls.0.1493c48f.cluster.qsub
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS mem=4000MB
#PBS -q myqueue
#PBS -e /home/icooke/ls.0.1493c48f.cluster.err
#PBS -o /home/icooke/ls.0.1493c48f.cluster.out
#PBS walltime=04:00:00
mkdir -p $LOCAL_SCRATCH > /dev/null 2>/dev/null
cd /home/icooke
date +'%y-%m-%d-%H:%M:%S'
echo "Running ls.0.1493c48f"
ls .
exitcode=$?
echo Done
date +'%y-%m-%d-%H:%M:%S'
if [[ $exitcode != 0 ]]; then
echo Exited with code: $exitcode >&2
fi
```
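For reference, torque passes resource requests through the -l flag, so under that assumption the two failing directives would need to read:

```shell
#PBS -l mem=4000MB
#PBS -l walltime=04:00:00
```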
After fixing in the pull request I just submitted, I now see that somehow my slurm jobs are not entering the queue. This worked for me last night, so I'm not sure why it has stopped working now:

```
>>> job = cluster.Job('ls')
>>> job.write()
>>> job.submit()
Job:ls.0<slurm:41472211(command:ls;args:None)SUBMITTED>
>>> job.wait()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cluster/job.py", line 513, in wait
    self.queue.wait(self)
  File "cluster/queue.py", line 204, in wait
    '{} not in queue'.format(job))
cluster.queue.QueueError: 41472211 not in queue
```

This problem arises despite the fact that the submitted job actually runs just fine and produces the expected output in the ls.0.cluster.out file.
Very minor issue, but it might be good to define a __version__ variable in the main __init__ file, as this is the Python standard and the way most code tries to extract the version from a package.
http://stackoverflow.com/questions/458550/standard-way-to-embed-version-into-python-package
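A sketch of that convention, plus a small helper (read_version is invented here, and the version number is a placeholder, not fyrd's real version) that setup.py could use to pull the string out without importing the package:

```python
import pathlib
import re

# fyrd/__init__.py would carry a single authoritative version string:
#     __version__ = '0.6.1'

def read_version(init_path):
    """Extract __version__ from an __init__.py without importing it,
    so setup.py can read it even when dependencies are missing."""
    text = pathlib.Path(init_path).read_text()
    match = re.search(r"__version__\s*=\s*['\"]([^'\"]+)['\"]", text)
    if match is None:
        raise RuntimeError('no __version__ found in ' + str(init_path))
    return match.group(1)
```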
It is still based on pyslurm; it needs to be updated to work with the new cluster library.
Issue #2 needs to be fixed first
Add a simple script to allow non-python users to submit jobs with profiles.
Right now job scripts are created at init time. They should be stored in a format that allows them to be updated at any time prior to submission, i.e. changing the 'cores' keyword will update the script.
The best way to do this will be to move the string formatting into a write_script method of the Job.Script class that can be called at any time to overwrite the current scripts.
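A minimal sketch of that design (the Script class below is hypothetical, mirroring the proposed Job.Script behaviour): the template and keywords are stored, and the text is re-rendered on every write_script() call, so keyword edits made after construction take effect:

```python
class Script:
    """Lazily-rendered job script sketch: keywords are stored, and the
    script text is regenerated from the *current* keywords each time
    write_script() is called."""
    def __init__(self, template, **keywords):
        self.template = template
        self.keywords = keywords

    def write_script(self, path):
        # re-render so edits like script.keywords['cores'] = 8 apply
        content = self.template.format(**self.keywords)
        with open(path, 'w') as handle:
            handle.write(content)
        return content
```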
Create a test function that writes out a pandas .py file and then submits a function from it. I am worried that classes and complex functions will fail to pickle with the current methods.
To simulate a real-life use case, we need to write out a .py file with a pandas-utilizing function, submit it to the cluster, and then get the dataframe back again, all in a test.
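A skeleton of what that test could look like, with a plain dict standing in for the DataFrame so the sketch has no pandas dependency (the payload module name and make_table function are invented):

```python
import importlib.util
import os
import pickle
import tempfile

MODULE_SRC = '''
def make_table():
    # stand-in for a pandas-utilizing function; the real test would
    # build and return a DataFrame here
    return {'a': [1, 2, 3], 'b': [4, 5, 6]}
'''

def round_trip():
    """Write a module to disk, import it, call its function, and pickle
    the result, the way a cluster submission would have to."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'payload.py')
        with open(path, 'w') as handle:
            handle.write(MODULE_SRC)
        spec = importlib.util.spec_from_file_location('payload', path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        result = module.make_table()
        return pickle.loads(pickle.dumps(result))
```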