pyiron / pysqa


Simple HPC queuing system adapter for Python, based on jinja templates to automate the creation of submission scripts.

Home Page: https://pysqa.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 91.91% Shell 2.16% Jupyter Notebook 5.93%
python queue-manager torque moab sge slurm lsf hpc

pysqa's Introduction

pysqa


High-performance computing (HPC) does not have to be hard. The aim of pysqa is to make submitting a calculation to an HPC cluster as easy as starting another subprocess locally. This is based on the assumption that, even though modern HPC queuing systems offer a wide range of configuration options, most users submit the majority of their jobs with very similar parameters.

Therefore, in pysqa users define submission script templates once and reuse them to submit many different calculations or workflows. These templates are written in the jinja2 template language, so existing submission scripts can easily be turned into templates. In addition to submitting new jobs to the queuing system, pysqa also allows users to track the progress of their jobs, delete them or enable reservations using the built-in functionality of the queuing system.
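For illustration, the following sketch shows how a plain SLURM submission script becomes a jinja2 template and how it is rendered; the placeholder names (job_name, cores, run_time, command) are chosen for this example and are not necessarily the variables pysqa itself passes into its templates:

from jinja2 import Template

# Illustrative only: the placeholder names are assumptions made for this sketch.
slurm_template = Template("""#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --ntasks={{ cores }}
#SBATCH --time={{ run_time }}

{{ command }}
""")

print(slurm_template.render(
    job_name="demo", cores=4, run_time="01:00:00", command="srun ./calculation"
))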

All of this functionality is available through both a Python interface and a command line interface.
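A rough sketch of the Python interface is shown below; the constructor and method arguments follow the pysqa documentation, but the configuration directory and queue name are placeholders, so check https://pysqa.readthedocs.io for the exact options:

from pysqa import QueueAdapter

# The directory is assumed to contain the queue configuration (queue.yaml)
# together with the corresponding jinja2 templates.
qa = QueueAdapter(directory="~/pysqa_config")

# Submit a shell command to one of the configured queues.
queue_id = qa.submit_job(
    queue="slurm",
    job_name="demo",
    working_directory=".",
    cores=1,
    command="echo hello",
)

# List jobs currently waiting or running, then delete the submitted one.
print(qa.get_queue_status())
qa.delete_job(process_id=queue_id)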

Features

The core feature of pysqa is the communication with HPC queuing systems (Flux, LSF, MOAB, SGE, SLURM and TORQUE). This includes:

  • Submission of new calculations to the queuing system.
  • Listing of calculations currently waiting or running on the queuing system.
  • Deletion of calculations which are currently waiting or running on the queuing system.
  • Listing of available queue templates created by the user.
  • Restriction of templates to a specific number of cores, run time or other computing resources, with integrated checks whether a given calculation complies with these restrictions.

In addition to these core features, pysqa is continuously extended to support more use cases for a larger group of users. These new features include support for remote queuing systems:

  • Remote connection via the secure shell protocol (SSH) to access remote HPC clusters.
  • Transfer of files to and from remote HPC clusters, based on a predefined mapping of the remote file system onto the local file system.
  • Support for both individual connections and continuous connections, depending on network availability.

Finally, work is in progress to support a combination of multiple local and remote queuing systems from within pysqa, presented to the user as a single resource.

Documentation

The documentation is available at https://pysqa.readthedocs.io .

License

pysqa is released under the BSD license (https://github.com/pyiron/pysqa/blob/main/LICENSE). It is a spin-off of the pyiron project (https://github.com/pyiron/pyiron), so if you use pysqa for calculations which result in a scientific publication, please cite:

@article{pyiron-paper,
  title = {pyiron: An integrated development environment for computational materials science},
  journal = {Computational Materials Science},
  volume = {163},
  pages = {24 - 36},
  year = {2019},
  issn = {0927-0256},
  doi = {https://doi.org/10.1016/j.commatsci.2018.07.043},
  url = {http://www.sciencedirect.com/science/article/pii/S0927025618304786},
  author = {Jan Janssen and Sudarsan Surendralal and Yury Lysogorskiy and Mira Todorova and Tilmann Hickel and Ralf Drautz and Jörg Neugebauer},
  keywords = {Modelling workflow, Integrated development environment, Complex simulation protocols},
}

pysqa's People

Contributors

codacy-badger, dependabot-preview[bot], dependabot[bot], hujay2019, jan-janssen, leimeroth, liamhuber, ligerzero-ai, max-hassani, niklassiemer, pmrv, pyiron-runner


pysqa's Issues

Add build cache

@niklassiemer Can you add the build cache to pysqa as well? While we do not have as many builds as pyiron_base and pyiron_atomistics, waiting for the Windows build is still annoying.

Release 0.0.17

Over the last year we merged a couple of important updates plus the updates of the dependencies, so I guess it is time to release a new version.

[bug] Ubuntu issue

Ubuntu has the following lines in its default .bashrc:

# If not running interactively, don't do anything
case $- in
     *i*) ;;
     *) return;;
esac

With this, the interactive shell is rather different from the shell that pysqa has access to. Typical errors include linking to the wrong python version and so on. So it is definitely worth checking the user's .bashrc to see whether pysqa is blocked by it.
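A minimal sketch to see the effect on the cluster is to compare the PATH of a non-interactive shell with the PATH of an interactive one (any conda or module setup placed after the guard above only shows up in the second call):

import subprocess

# Non-interactive shell: roughly the environment pysqa ends up working with.
print(subprocess.check_output(["bash", "-c", "echo $PATH"], universal_newlines=True))

# Interactive shell: the full .bashrc is sourced, including conda/module setup.
print(subprocess.check_output(["bash", "-i", "-c", "echo $PATH"], universal_newlines=True))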

[idea] native support for multiple clusters

Currently, switching between clusters is a manual task:

qa = QueueAdapter()
qa.list_clusters()
>>> 'cluster_1', 'cluster_2'
qa.switch_cluster(cluster_name='cluster_1')
qa.submit_job(...)

It would be great if pysqa could switch between the clusters seamlessly. Basically, by selecting the queue, pysqa could automatically identify which cluster is used, so that the different clusters behave like a single one to the user.

There are a couple of challenges:

  • Two clusters could potentially return the same queue ID. The current approach is to add an integer at the end of the queue ID, which limits the number of clusters to 10 (see the sketch after this list).
  • The other issue is that the time to get the status of all clusters grows linearly with the number of clusters. This could be executed in parallel, but that would add complexity.
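The following is a purely illustrative sketch of the queue ID scheme mentioned above (appending a single-digit cluster index), not necessarily the exact encoding pysqa uses:

# Illustrative sketch: append a single-digit cluster index to the queue ID,
# which is why the scheme caps out at 10 clusters.
def combine(queue_id, cluster_index):
    return queue_id * 10 + cluster_index

def split(combined_id):
    return combined_id // 10, combined_id % 10

combined = combine(queue_id=4223, cluster_index=1)
print(combined)         # 42231
print(split(combined))  # (4223, 1)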

[idea] python concurrent.futures

It would be great to represent submitted jobs as Python concurrent.futures objects, so it becomes more intuitive for users who are already familiar with developing asynchronous simulation protocols in Python.

[idea] flux.job.FluxExecutor

Currently Flux implements the following method:

from_command(
    cls,
    command,
    num_tasks=1,
    cores_per_task=1,
    gpus_per_task=None,
    num_nodes=None,
    exclusive=False,
)

Shell scripts can be submitted to the queuing system and a concurrent.futures.Future object is returned. It would be great to introduce the same abstraction layer for other queuing systems. As the storage interface for such a system is already available in #206, it should not be too complicated to implement.
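A usage sketch based on the Flux documentation (the jobspec is built with flux.job.JobspecV1.from_command, which provides the signature quoted above; the script name is a placeholder and the exact value returned by the future depends on the Flux version):

import flux.job

# Build a jobspec for a shell script and submit it through the FluxExecutor,
# which returns a concurrent.futures-style future for the job.
jobspec = flux.job.JobspecV1.from_command(
    command=["/bin/sh", "run_calculation.sh"], num_tasks=1, cores_per_task=1
)
with flux.job.FluxExecutor() as executor:
    future = executor.submit(jobspec)
    print(future.jobid())  # job ID, available once the submission went through
    future.result()        # blocks until the job has finished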

[idea] Split remote SSH configuration from queuing system configuration

In the current configuration each user has to define their own queues when connecting to a remote computing cluster. While the SSH key is personal, this is typically not the case for the queue configuration, so the configuration has to be split into a user-specific part and a user-independent part.

[feature] Copy files on remote server

For example, copy the WAVECAR file of one VASP calculation to another without transferring it to the localhost and back, as is currently the case.

sbatch FileNotFoundError

While trying to install and set up pyiron on a new machine, we got the following error when trying to submit a simple LAMMPS test job:

FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'

When submitting the job directly while logged into the cluster using sbatch run_queue.sh it works fine, and when manually starting python and submitting via subprocess.check_output(["sbatch", "run_queue.sh"]) no problems occur either. I would highly appreciate any ideas on what could cause this.
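One minimal check (a sketch) is whether the Python process pysqa runs in can see sbatch on its PATH at all:

import os
import shutil

# FileNotFoundError from subprocess usually means the executable is not on the
# PATH of the calling process, even if it is available in an interactive shell.
print(shutil.which("sbatch"))  # None means sbatch is not visible to this process
print(os.environ.get("PATH"))  # compare with echo $PATH in the login shell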

Maybe @jan-janssen has a hint for me?

Installing from conda broken, because of version spec

I've installed pysqa==0.1.1 (as a dependency of pyiron_atomistics) in a clean environment from conda-forge. Importing then fails with this:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[2], line 1
----> 1 import pysqa

File /srv/conda/envs/notebook/lib/python3.11/site-packages/pysqa/__init__.py:5
      2 __all__ = []
      4 from pysqa.queueadapter import QueueAdapter
----> 5 from pysqa.executor.executor import Executor
      7 from ._version import get_versions
      9 __version__ = get_versions()["version"]

File /srv/conda/envs/notebook/lib/python3.11/site-packages/pysqa/executor/executor.py:5
      2 import queue
      3 from concurrent.futures import Future, Executor as FutureExecutor
----> 5 from pympipool import cancel_items_in_queue, RaisingThread
      6 from pysqa.executor.helper import (
      7     reload_previous_futures,
      8     find_executed_tasks,
      9     serialize_funct,
     10     write_to_file,
     11 )
     14 class Executor(FutureExecutor):

ImportError: cannot import name 'cancel_items_in_queue' from 'pympipool' (/srv/conda/envs/notebook/lib/python3.11/site-packages/pympipool/__init__.py)

because conda installs pympipool==0.7.0 by default, which is incompatible.

I know how to fix this, but I'm posting anyway, because (I assume) this only happens because of our weird practice of leaving the upper version bound of our deps open and bumping the lower one. I've maintained for a while that we should bump the upper bound instead. This gives a nice example of why that is better: in that scheme this error would never happen. I will look into how we can do this with dependabot if I have time.

In practical terms, I guess #218 will fix the issue.

[idea] Support dependencies

Users could benefit a lot from having task dependencies resolved by the queuing system, as that is what it is optimized for.
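For SLURM, for example, the queuing system already exposes this via the --dependency flag; the following sketch uses hypothetical script names:

import subprocess

# --parsable makes sbatch print only the job ID of the submitted job.
first_id = subprocess.check_output(
    ["sbatch", "--parsable", "run_step_1.sh"], universal_newlines=True
).strip()

# The second job is held by SLURM until the first one finished successfully.
subprocess.check_call(["sbatch", "--dependency=afterok:" + first_id, "run_step_2.sh"])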

[idea] make queue type available to the outside

It would be great if other tools could see which queuing system is selected, or, even better, if commands like srun were available to the outside, so that tools which construct their own commands can access pysqa to identify the corresponding run command.

Enable reservation

To couple a reservation code to a job in pyiron, I am using the following function:

def enable_reservation_command(self, process_id, reservation_id):
    return ["scontrol", "update", "job", str(process_id), "reservation={}".format(reservation_id)]

However, due to a bug(?) in slurm, the reservation_id can be assigned to the job, but its priority is not bumped. As such, it is required to set the $SBATCH_RESERVATION environment variable right before calling sbatch to have an increased priority. To do this, there are two options:

  • assign it before the sbatch command is called, either in bash or in Python

    export SBATCH_RESERVATION=name-of-reservation              # bash
    os.environ['SBATCH_RESERVATION'] = 'name-of-reservation'   # python

  • use the reservation flag when using sbatch

    sbatch --reservation=name-of-reservation ...

Would there be an easy way of doing this?
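A sketch of the first option from Python, scoping the environment variable to a single sbatch call so the global environment stays untouched:

import os
import subprocess

# Set SBATCH_RESERVATION only for this one sbatch invocation.
env = dict(os.environ, SBATCH_RESERVATION="name-of-reservation")
subprocess.check_output(["sbatch", "run_queue.sh"], env=env)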

Compatible with PBS Professional scheduler?

Hi there,

This is not an issue really.

I was wondering if it is possible to extend pysqa to submit jobs using the PBS job scheduler.

I'm really happy with how pysqa nicely submits jobs in SLURM.

I have access to another cluster which uses PBS Professional to submit jobs.

It'd be really nice to run pyiron jobs over there as well.

Is it possible to just add a jinja2 template for PBS and that's all there is to extend? Or is there something else that needs to be done?

A sample submission script looks like this:

#!/bin/bash
 
#PBS -l walltime=10:00:00,select=1:ncpus=1:mem=2gb
#PBS -N job_name
#PBS -A alloc-code
#PBS -m abe
#PBS -M [email protected]
#PBS -o output.txt
#PBS -e error.txt
 
################################################################################
 
module load software_package_1
module load software_package_2
 
cd $PBS_O_WORKDIR
 
<executable commands here>
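In the jinja2 template language the script above could become something like the sketch below; the variable names (job_name, run_time, cores, memory, command) are illustrative and not necessarily the ones pysqa passes into its templates:

from jinja2 import Template

# Illustrative only: the placeholder names are assumptions made for this sketch.
pbs_template = Template("""#!/bin/bash
#PBS -l walltime={{ run_time }},select=1:ncpus={{ cores }}:mem={{ memory }}
#PBS -N {{ job_name }}
#PBS -o output.txt
#PBS -e error.txt

cd $PBS_O_WORKDIR
{{ command }}
""")

print(pbs_template.render(
    job_name="pyiron_job",
    run_time="10:00:00",
    cores=1,
    memory="2gb",
    command="lmp_serial -in lammps.in",
))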

[feature] Executor: use HDF5 rather than pickle

Implementation

import re
import cloudpickle
import hashlib
import h5io
import h5py
import numpy as np
import queue
import os
import time
from threading import Thread
from concurrent.futures import Future, Executor
from pympipool.shared import cancel_items_in_queue


def get_hash(binary):
    # Remove specification of jupyter kernel from hash to be deterministic
    binary_no_ipykernel = re.sub(b"(?<=/ipykernel_)(.*)(?=/)", b"", binary)
    return str(hashlib.md5(binary_no_ipykernel).hexdigest())


def serialize_funct_h5(fn, *args, **kwargs):
    binary_funct = cloudpickle.dumps(fn)
    binary_all = cloudpickle.dumps({"fn": fn, "args": args, "kwargs": kwargs})
    task_key = fn.__name__ + get_hash(binary=binary_all)
    data = {"fn": binary_funct, "args": args, "kwargs": kwargs}
    return task_key, data


def apply_funct(apply_dict):
    return apply_dict["fn"].__call__(*apply_dict["args"], **apply_dict["kwargs"])


def get_result_h5(task_key):
    file_name = task_key + ".h5"
    with h5py.File(file_name, "r") as hdf:
        if "output" in hdf:
            return h5io.read_hdf5(fname=hdf, title="output", slash="ignore")
        elif "input_args" in hdf and "input_kwargs" in hdf:
            result = apply_funct(
                apply_dict={
                    "fn": cloudpickle.loads(
                        h5io.read_hdf5(fname=hdf, title="function", slash="ignore")
                    ),
                    "args": h5io.read_hdf5(
                        fname=hdf, title="input_args", slash="ignore"
                    ),
                    "kwargs": h5io.read_hdf5(
                        fname=hdf, title="input_kwargs", slash="ignore"
                    ),
                }
            )
        elif "input_args" in hdf:
            result = apply_funct(
                apply_dict={
                    "fn": cloudpickle.loads(
                        h5io.read_hdf5(fname=hdf, title="function", slash="ignore")
                    ),
                    "args": h5io.read_hdf5(
                        fname=hdf, title="input_args", slash="ignore"
                    ),
                    "kwargs": {},
                }
            )
        elif "input_kwargs" in hdf:
            result = apply_funct(
                apply_dict={
                    "fn": cloudpickle.loads(
                        h5io.read_hdf5(fname=hdf, title="function", slash="ignore")
                    ),
                    "args": [],
                    "kwargs": h5io.read_hdf5(
                        fname=hdf, title="input_kwargs", slash="ignore"
                    ),
                }
            )
        else:
            raise TypeError
    write_to_h5_file(task_key=task_key, data_dict={"output": result})
    return result


def write_to_h5_file(task_key, data_dict):
    file_name = task_key + ".h5"
    with h5py.File(file_name, "a") as fname:
        for data_key, data_value in data_dict.items():
            if data_key == "fn":
                h5io.write_hdf5(
                    fname=fname,
                    data=np.void(data_value),
                    overwrite="update",
                    title="function",
                )
            elif data_key == "args":
                h5io.write_hdf5(
                    fname=fname,
                    data=data_value,
                    overwrite="update",
                    title="input_args",
                    slash="ignore",
                )
            elif data_key == "kwargs":
                for k, v in data_value.items():
                    h5io.write_hdf5(
                        fname=fname,
                        data=v,
                        overwrite="update",
                        title="input_kwargs/" + k,
                        slash="ignore",
                    )
            elif data_key == "output":
                h5io.write_hdf5(
                    fname=fname,
                    data=data_value,
                    overwrite="update",
                    title="output",
                    slash="ignore",
                )


def execute_tasks_h5(future_queue):
    while True:
        task_dict = None
        try:
            task_dict = future_queue.get_nowait()
        except queue.Empty:
            time.sleep(0.1)  # avoid busy-waiting while the queue is empty
        if (
            task_dict is not None
            and "shutdown" in task_dict.keys()
            and task_dict["shutdown"]
        ):
            break
        elif task_dict is not None:
            key = list(task_dict.keys())[0]
            future = task_dict[key]
            if not future.done() and future.set_running_or_notify_cancel():
                future.set_result(get_result_h5(task_key=key))


def reload_previous_futures(future_dict, task_queue):
    for f in os.listdir():
        if f.endswith(".h5"):
            key = f.split(".h5")[0]
            # Recreate a Future for results that already exist on disk; without
            # this the lookup would raise a KeyError on a freshly created executor.
            if key not in future_dict:
                future_dict[key] = Future()
            task_queue.put({key: future_dict[key]})


class FileExecutor(Executor):
    def __init__(self):
        self._task_queue = queue.Queue()
        self._memory_dict = {}
        reload_previous_futures(
            future_dict=self._memory_dict, task_queue=self._task_queue
        )
        self._process = Thread(target=execute_tasks_h5, args=(self._task_queue,))
        self._process.start()

    def submit(self, fn, *args, **kwargs):
        task_key, data_dict = serialize_funct_h5(fn, *args, **kwargs)
        if task_key not in self._memory_dict.keys():
            write_to_h5_file(task_key=task_key, data_dict=data_dict)
            self._memory_dict[task_key] = Future()
            self._task_queue.put({task_key: self._memory_dict[task_key]})
        return self._memory_dict[task_key]

    def shutdown(self, wait=True, *, cancel_futures=False):
        if cancel_futures:
            cancel_items_in_queue(que=self._task_queue)
        self._task_queue.put({"shutdown": True, "wait": wait})
        self._process.join()

Example

def my_funct(a, b):
    time.sleep(0.1)
    return a+b

exe = FileExecutor()
fs1 = exe.submit(my_funct, 1, 2)
fs1.done(), fs1.result(), fs1.done()
