mikedacre / fyrd
Submit functions and shell scripts to torque and slurm clusters or local machines using Python.
Home Page: https://fyrd.science
License: MIT License
Possibly based on http://frougon.net/projects/CondConfigParser/doc/syntax.html or https://benjamin.smedbergs.us/pymake/ or snakemake
We want a directory at ~/.fyrd, with a config file at ~/.fyrd/conf.txt and profiles defined in ~/.fyrd/profiles.txt. If SQLAlchemy support is added, the database will go in the same place: ~/.fyrd/fyrd.sql.
Profile management is currently only useful for relatively advanced Python users; we should write a management script that can easily call all of the config_file functions.
Testing with the local queue reveals that it isn't actually using all available cores
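One thing worth checking when debugging this: on Linux, the number of cores a machine has and the number of cores the process is actually allowed to use can differ (cgroups, affinity masks). A minimal sketch for comparing the two, assuming nothing about fyrd's internals:

```python
import multiprocessing
import os

def usable_cores():
    """Total cores vs. cores the process may actually use. Cgroup or
    affinity limits can make the second number smaller, which would
    explain a local queue that never saturates the machine."""
    total = multiprocessing.cpu_count()
    try:
        usable = len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:
        usable = total
    return total, usable
```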
Hi,
In our queue system there are some array jobs with job IDs like 136374[].psk. When I run, for example:

```python
job = fyrd.Job('ls .', profile='short')
```

this gives me:

```
~/anaconda3/lib/python3.5/site-packages/fyrd/queue.py in torque_queue_parser(user, partition)
--> 680 job_id = int(xmljob.find('Job_Id').text.split('.')[0])
    681 job_owner = xmljob.find('Job_Owner').text.split('@')[0]
    682 if user and job_owner != user:
ValueError: invalid literal for int() with base 10: '136374[]'
```
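The crash comes from assuming the part of Job_Id before the first dot is purely numeric. A minimal sketch of a more tolerant parser (parse_torque_job_id is a hypothetical helper, not fyrd's actual fix) that strips the [] suffix torque puts on array jobs:

```python
import re

def parse_torque_job_id(raw_id):
    """Tolerant Job_Id parser sketch: accept both plain ids like
    '12345.host' and array-job ids like '136374[].psk', returning
    just the numeric part."""
    base = raw_id.split('.')[0]                 # '136374[].psk' -> '136374[]'
    match = re.match(r'(\d+)(\[\d*\])?$', base)  # optional [] / [N] suffix
    if not match:
        raise ValueError('unrecognised job id: {!r}'.format(raw_id))
    return int(match.group(1))
```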
On my computer, submitting a job creates four files. For instance, job = submit('ls -l') creates:

```
ls.0.cluster.err
ls.0.cluster.out
ls.0.cluster.sbatch
ls.0.cluster.script
```

However, running job.clean() only cleans up the .sbatch and .script files; the .err and .out files still exist after running job.clean().
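One possible shape for the fix, sketched here as a hypothetical module-level clean(prefix, outputs=...) rather than fyrd's real Job.clean signature: keep the default behaviour for the generated script files, and add an opt-in flag for the output files:

```python
import glob
import os

def clean(prefix, outputs=False):
    """Remove generated job files matching prefix. The .sbatch/.script
    files are always removed; .out/.err output files only when
    outputs=True, so results are not deleted by accident."""
    suffixes = ['.sbatch', '.script']
    if outputs:
        suffixes += ['.out', '.err']
    removed = []
    for suffix in suffixes:
        for path in glob.glob(prefix + '*' + suffix):
            os.remove(path)
            removed.append(path)
    return removed
```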
The python-pipeline project is an effort to make it easy to create complex pipelines with python. It isn't that useful outside of a multithreading environment, so it makes sense to merge it in here and implement native multithreading in that project through the cluster module.
Rather than keeping it as a separate project, the pipeline package should be added as a separate package alongside cluster, to be used if the user wishes. However, it is important that its usage is not required in order to use the cluster package; i.e. pipeline should depend on cluster, but cluster should not depend on pipeline.
Tests failing with the latest version of slurm
It would be nice if it could be installed via pip.
Need >80% coverage for the main functions; coverage is currently lower than that.
Needs:
Importantly:
Base
to make information persistent. I want folks to be able to add a simple decorator to a function to make it submit to the cluster when called.
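A rough sketch of what such a decorator could look like. Everything here is hypothetical: jobify and the submit hook are invented names standing in for fyrd's real submission machinery, and with the default submit=None the function simply runs locally:

```python
import functools

def jobify(profile=None, submit=None):
    """Decorator sketch: calling the wrapped function submits it as a
    cluster job and blocks for the result. `submit` is a placeholder
    hook; None means run locally instead of submitting."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if submit is None:
                return func(*args, **kwargs)   # no cluster: run locally
            job = submit(func, args, kwargs, profile=profile)
            job.wait()
            return job.get()
        return wrapper
    return decorator
```

With this in place, putting @jobify(profile='short') above a def would turn every call into a cluster job.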
Right now the flow of the Job submission process comes to a natural end at job completion; there is no real need for this restriction.
Instead:
This should work naturally with the fix to Issue #4 so that the user can update attributes and then resubmit the job.
Documentation should be structured more like GRASP
It might be a nice idea to make it possible to spawn a pool object, like the multiprocessing module, and communicate with it similarly. That would require a database and daemon mode to work properly.
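A sketch of the front-end such a pool might present, borrowing the multiprocessing.Pool interface. The Pool class below is hypothetical and delegates to a local thread pool; the real version would dispatch each task as a cluster job through the daemon and database mentioned above:

```python
from concurrent.futures import ThreadPoolExecutor

class Pool:
    """multiprocessing-style pool front-end (sketch only). Locally it
    wraps a ThreadPoolExecutor; a cluster-backed implementation would
    replace the executor with job submission plus result collection."""
    def __init__(self, processes=4):
        self._executor = ThreadPoolExecutor(max_workers=processes)

    def map(self, func, iterable):
        # preserves input order, like multiprocessing.Pool.map
        return list(self._executor.map(func, iterable))

    def close(self):
        self._executor.shutdown()
```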
Including `import pandas as pd` does not result in pandas being properly imported.
A number of my docstrings are formatted to work well with Python's help display, but they are not parsed correctly by Sphinx; all of the docstrings need to be updated so that the documentation is clear.
Use Google-style docstrings and napoleon.
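For reference, a Google-style docstring that napoleon parses cleanly (the submit function and its signature here are purely illustrative, not fyrd's actual API):

```python
def submit(command, profile=None):
    """Submit a command to the cluster (illustrative docstring only).

    Args:
        command (str): The shell command to run.
        profile (str, optional): Name of a submission profile. Defaults
            to None, meaning the default profile is used.

    Returns:
        None: A real implementation would return a Job object.

    Raises:
        ValueError: If ``command`` is empty.
    """
    if not command:
        raise ValueError('command must be non-empty')
```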
pycluster, python-cluster, and similar are taken, could use a better name for pip
There isn't a good reason to use this anymore; rather than recreating all of its functionality, I think it would be better to enforce a single task per fyrd job.
For slurm, calling q = Queue() or q = Queue(user=None) raises an exception, even though None is the default value of user:

```
>>> q = cluster.Queue(user=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cluster/queue.py", line 141, in __init__
    self._update()
  File "cluster/queue.py", line 311, in _update
    self.user):
  File "cluster/queue.py", line 641, in slurm_queue_parser
    outqueue.append((sid, sname, suser, spartition, sstate, snodelist,
UnboundLocalError: local variable 'snodelist' referenced before assignment
```
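The usual fix for this class of bug is to bind every output variable before any conditional branch, so no code path can reach the final append with a name unset. A minimal sketch of the pattern (slurm_queue_row is a hypothetical stand-in, not fyrd's actual parser):

```python
def slurm_queue_row(fields):
    """Parse one squeue-style row. snodelist is initialized up front,
    so rows without a node list can never raise UnboundLocalError."""
    snodelist = []  # bound before any branch
    sid, sname, suser, spartition, sstate = fields[:5]
    if len(fields) > 5 and fields[5]:
        snodelist = fields[5].split(',')
    return (sid, sname, suser, spartition, sstate, snodelist)
```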
The ezqsub project is old and out of date, but should be fairly easy to merge into here to allow batch job submission from a file.
Interesting project. In doing a quick test-drive, I've hit a PBS directive error. It might be a config error or even a torque version mismatch, but fyrd.get_cluster_environment() does return 'torque'. mem gets the complaint, but I believe walltime also needs to be prefixed by -l in the job script.
```
>>> j = fyrd.Job('ls ', ['.'])
>>> j.submit()
20161129 17:28:48.344 | WARNING --> Command qsub /home/icooke/ls.0.1493c48f.cluster.qsub failed with code 1, retrying.
20161129 17:28:49.348 | WARNING --> Command qsub /home/icooke/ls.0.1493c48f.cluster.qsub failed with code 1, retrying.
20161129 17:28:50.352 | WARNING --> Command qsub /home/icooke/ls.0.1493c48f.cluster.qsub failed with code 1, retrying.
20161129 17:28:51.359 | WARNING --> Command qsub /home/icooke/ls.0.1493c48f.cluster.qsub failed with code 1, retrying.
20161129 17:28:52.364 | CRITICAL --> qsub failed with code 1
-----------------------------------> stdout:
-----------------------------------> stderr: qsub: directive error: mem=4000MB
```

And the script file:

```shell
$ cat /home/icooke/ls.0.1493c48f.cluster.qsub
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS mem=4000MB
#PBS -q myqueue
#PBS -e /home/icooke/ls.0.1493c48f.cluster.err
#PBS -o /home/icooke/ls.0.1493c48f.cluster.out
#PBS walltime=04:00:00
mkdir -p $LOCAL_SCRATCH > /dev/null 2>/dev/null
cd /home/icooke
date +'%y-%m-%d-%H:%M:%S'
echo "Running ls.0.1493c48f"
ls .
exitcode=$?
echo Done
date +'%y-%m-%d-%H:%M:%S'
if [[ $exitcode != 0 ]]; then
echo Exited with code: $exitcode >&2
fi
```
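For reference, torque passes resource requests through the -l flag, so under that assumption the two failing directives would need to read:

```shell
#PBS -l mem=4000MB
#PBS -l walltime=04:00:00
```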
After fixing in the pull request I just submitted, I now see that somehow my slurm jobs are not entering the queue. This worked for me last night, so I'm not sure why it has stopped working now:

```
>>> job = cluster.Job('ls')
>>> job.write()
>>> job.submit()
Job:ls.0<slurm:41472211(command:ls;args:None)SUBMITTED>
>>> job.wait()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cluster/job.py", line 513, in wait
    self.queue.wait(self)
  File "cluster/queue.py", line 204, in wait
    '{} not in queue'.format(job))
cluster.queue.QueueError: 41472211 not in queue
```

This problem arises despite the fact that the submitted job actually runs just fine and produces the expected output in the ls.0.cluster.out file.
Very minor issue, but it might be good to define a __version__ variable in the main __init__ file, as this is the Python standard and the way most code tries to extract the version from a package.
http://stackoverflow.com/questions/458550/standard-way-to-embed-version-into-python-package
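A sketch of that convention, plus a small helper (read_version is invented here, and the version number is a placeholder, not fyrd's real version) that setup.py could use to pull the string out without importing the package:

```python
import pathlib
import re

# fyrd/__init__.py would carry a single authoritative version string:
#     __version__ = '0.6.1'

def read_version(init_path):
    """Extract __version__ from an __init__.py without importing it,
    so setup.py can read it even when dependencies are missing."""
    text = pathlib.Path(init_path).read_text()
    match = re.search(r"__version__\s*=\s*['\"]([^'\"]+)['\"]", text)
    if match is None:
        raise RuntimeError('no __version__ found in ' + str(init_path))
    return match.group(1)
```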
It is still based on pyslurm; it needs to be updated to work with the new cluster library.
Issue #2 needs to be fixed first
Add a simple script to allow non-python users to submit jobs with profiles.
Right now job scripts are created at init time. They should be stored in a format that allows them to be updated at any time prior to submission, i.e. changing the 'cores' keyword will update the script.
The best way to do this will be to move the string formatting into a write_script method of the Job.Script class that can be called at any time to overwrite the current scripts.
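A minimal sketch of that design (the Script class below is hypothetical, mirroring the proposed Job.Script behaviour): the template and keywords are stored, and the text is re-rendered on every write_script() call, so keyword edits made after construction take effect:

```python
class Script:
    """Lazily-rendered job script sketch: keywords are stored, and the
    script text is regenerated from the *current* keywords each time
    write_script() is called."""
    def __init__(self, template, **keywords):
        self.template = template
        self.keywords = keywords

    def write_script(self, path):
        # re-render so edits like script.keywords['cores'] = 8 apply
        content = self.template.format(**self.keywords)
        with open(path, 'w') as handle:
            handle.write(content)
        return content
```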
Create a test function that writes out a pandas .py file and then submits a function from it. I am worried that classes and complex functions will fail to pickle with the current methods.
To simulate a real-life use case, we need to write out a .py file with a pandas-utilizing function, submit it to the cluster, and then get the dataframe back again, all in a test.
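A skeleton of what that test could look like, with a plain dict standing in for the DataFrame so the sketch has no pandas dependency (the payload module name and make_table function are invented):

```python
import importlib.util
import os
import pickle
import tempfile

MODULE_SRC = '''
def make_table():
    # stand-in for a pandas-utilizing function; the real test would
    # build and return a DataFrame here
    return {'a': [1, 2, 3], 'b': [4, 5, 6]}
'''

def round_trip():
    """Write a module to disk, import it, call its function, and pickle
    the result, the way a cluster submission would have to."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'payload.py')
        with open(path, 'w') as handle:
            handle.write(MODULE_SRC)
        spec = importlib.util.spec_from_file_location('payload', path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        result = module.make_table()
        return pickle.loads(pickle.dumps(result))
```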