pyiron / pysqa


Simple HPC queuing system adapter for Python, based on jinja templates to automate the creation of submission scripts.

Home Page: https://pysqa.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 91.91% Shell 2.16% Jupyter Notebook 5.93%
python queue-manager torque moab sge slurm lsf hpc

pysqa's Introduction

pysqa


High-performance computing (HPC) does not have to be hard. The aim of pysqa is to make submitting a calculation to an HPC cluster as easy as starting another subprocess locally. This is based on the assumption that, even though modern HPC queuing systems offer a wide range of configuration options, most users submit the majority of their jobs with very similar parameters.

Therefore, in pysqa users define submission script templates once and reuse them to submit many different calculations or workflows. These templates are written in the jinja2 template language, so existing submission scripts can easily be turned into templates. In addition to submitting new jobs to the queuing system, pysqa also allows users to track the progress of their jobs, delete them or enable reservations using the built-in functionality of the queuing system.
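For illustration, the following sketch shows how a plain SLURM submission script becomes a jinja2 template and how it is rendered; the placeholder names (job_name, cores, run_time, command) are chosen for this example and are not necessarily the variables pysqa itself passes into its templates:

from jinja2 import Template

# Illustrative only: the placeholder names are assumptions made for this sketch.
slurm_template = Template("""#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --ntasks={{ cores }}
#SBATCH --time={{ run_time }}

{{ command }}
""")

print(slurm_template.render(
    job_name="demo", cores=4, run_time="01:00:00", command="srun ./calculation"
))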

All of this functionality is available through both a Python interface and a command line interface.
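A rough sketch of the Python interface is shown below; the constructor and method arguments follow the pysqa documentation, but the configuration directory and queue name are placeholders, so check https://pysqa.readthedocs.io for the exact options:

from pysqa import QueueAdapter

# The directory is assumed to contain the queue configuration (queue.yaml)
# together with the corresponding jinja2 templates.
qa = QueueAdapter(directory="~/pysqa_config")

# Submit a shell command to one of the configured queues.
queue_id = qa.submit_job(
    queue="slurm",
    job_name="demo",
    working_directory=".",
    cores=1,
    command="echo hello",
)

# List jobs currently waiting or running, then delete the submitted one.
print(qa.get_queue_status())
qa.delete_job(process_id=queue_id)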

Features

The core feature of pysqa is the communication with HPC queuing systems (Flux, LSF, MOAB, SGE, SLURM and TORQUE). This includes:

  • Submission of new calculations to the queuing system.
  • Listing of calculations currently waiting or running on the queuing system.
  • Deletion of calculations which are currently waiting or running on the queuing system.
  • Listing of available queue templates created by the user.
  • Restriction of templates to a specific number of cores, run time or other computing resources, with integrated checks whether a given calculation complies with these restrictions.

In addition to these core features, pysqa is continuously extended to support more use cases for a larger group of users. These new features include support for remote queuing systems:

  • Remote connection via the secure shell protocol (SSH) to access remote HPC clusters.
  • Transfer of files to and from remote HPC clusters, based on a predefined mapping of the remote file system onto the local file system.
  • Support for both individual connections and continuous connections, depending on network availability.

Finally, work is in progress to support a combination of multiple local and remote queuing systems from within pysqa, presented to the user as a single resource.

Documentation

The documentation is available at https://pysqa.readthedocs.io .

License

pysqa is released under the BSD license (https://github.com/pyiron/pysqa/blob/main/LICENSE). It is a spin-off of the pyiron project (https://github.com/pyiron/pyiron), so if you use pysqa for calculations which result in a scientific publication, please cite:

@article{pyiron-paper,
  title = {pyiron: An integrated development environment for computational materials science},
  journal = {Computational Materials Science},
  volume = {163},
  pages = {24 - 36},
  year = {2019},
  issn = {0927-0256},
  doi = {https://doi.org/10.1016/j.commatsci.2018.07.043},
  url = {http://www.sciencedirect.com/science/article/pii/S0927025618304786},
  author = {Jan Janssen and Sudarsan Surendralal and Yury Lysogorskiy and Mira Todorova and Tilmann Hickel and Ralf Drautz and Jörg Neugebauer},
  keywords = {Modelling workflow, Integrated development environment, Complex simulation protocols},
}

pysqa's People

Contributors

codacy-badger, dependabot-preview[bot], dependabot[bot], hujay2019, jan-janssen, leimeroth, liamhuber, ligerzero-ai, max-hassani, niklassiemer, pmrv, pyiron-runner


pysqa's Issues

Add build cache

@niklassiemer Can you add the build cache to pysqa as well? While we do not have as many builds as pyiron_base and pyiron_atomistics, waiting for the Windows build is still annoying.

Release 0.0.17

Over the last year we merged a couple of important updates plus the updates of the dependencies, so I guess it is time to release a new version.

[bug] Ubuntu issue

Ubuntu has the following lines in its default .bashrc:

# If not running interactively, don't do anything
case $- in
     *i*) ;;
     *) return;;
esac

With this, the interactive shell is rather different from the shell that pysqa has access to. Typical errors include linking to the wrong python version and so on. So it is definitely worth checking the user's .bashrc to see whether pysqa is blocked by it.
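A minimal sketch to see the effect on the cluster is to compare the PATH of a non-interactive shell with the PATH of an interactive one (any conda or module setup placed after the guard above only shows up in the second call):

import subprocess

# Non-interactive shell: roughly the environment pysqa ends up working with.
print(subprocess.check_output(["bash", "-c", "echo $PATH"], universal_newlines=True))

# Interactive shell: the full .bashrc is sourced, including conda/module setup.
print(subprocess.check_output(["bash", "-i", "-c", "echo $PATH"], universal_newlines=True))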

[idea] native support for multiple clusters

Currently, switching between clusters is a manual task:

qa = QueueAdapter()
qa.list_clusters()
>>> 'cluster_1', 'cluster_2'
qa.switch_cluster(cluster_name='cluster_1')
qa.submit_job(...)

It would be great if pysqa could switch between the clusters seamlessly. Basically, by selecting the queue, pysqa could automatically identify which cluster is used, so that the different clusters behave like a single one to the user.

There are a couple of challenges:

  • Two clusters could potentially return the same queue ID. The current approach is to add an integer at the end of the queue ID, which limits the number of clusters to 10 (see the sketch after this list).
  • The other issue is that the time to get the status of all clusters grows linearly with the number of clusters. This could be executed in parallel, but that would add complexity.
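The following is a purely illustrative sketch of the queue ID scheme mentioned above (appending a single-digit cluster index), not necessarily the exact encoding pysqa uses:

# Illustrative sketch: append a single-digit cluster index to the queue ID,
# which is why the scheme caps out at 10 clusters.
def combine(queue_id, cluster_index):
    return queue_id * 10 + cluster_index

def split(combined_id):
    return combined_id // 10, combined_id % 10

combined = combine(queue_id=4223, cluster_index=1)
print(combined)         # 42231
print(split(combined))  # (4223, 1)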

[idea] python concurrent.futures

It would be great to represent submitted jobs as Python concurrent.futures objects, so it becomes more intuitive for users who are already familiar with developing asynchronous simulation protocols in Python.

[idea] flux.job.FluxExecutor

Currently Flux implements the following method:

from_command(
    cls,
    command,
    num_tasks=1,
    cores_per_task=1,
    gpus_per_task=None,
    num_nodes=None,
    exclusive=False,
)

Shell scripts can be submitted to the queuing system and a concurrent.futures.Future object is returned. It would be great to introduce the same abstraction layer for other queuing systems. As the storage interface for such a system is already available in #206, it should not be too complicated to implement.
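A usage sketch based on the Flux documentation (the jobspec is built with flux.job.JobspecV1.from_command, which provides the signature quoted above; the script name is a placeholder and the exact value returned by the future depends on the Flux version):

import flux.job

# Build a jobspec for a shell script and submit it through the FluxExecutor,
# which returns a concurrent.futures-style future for the job.
jobspec = flux.job.JobspecV1.from_command(
    command=["/bin/sh", "run_calculation.sh"], num_tasks=1, cores_per_task=1
)
with flux.job.FluxExecutor() as executor:
    future = executor.submit(jobspec)
    print(future.jobid())  # job ID, available once the submission went through
    future.result()        # blocks until the job has finished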

[idea] Split remote SSH configuration from queuing system configuration

In the current configuration each user has to define their own queues when connecting to a remote computing cluster. While the SSH key is personal, this is typically not the case for the queue configuration, so the configuration has to be split into a user-specific part and a user-independent part.

[feature] Copy files on remote server

For example, copy the WAVECAR file of one VASP calculation to another without transferring it to the localhost and back, as is currently the case.

sbatch FileNotFoundError

While trying to install and set up pyiron on a new machine, we got the following error when trying to submit a simple LAMMPS test job:

FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'

When submitting the job directly while logged into the cluster using sbatch run_queue.sh it works fine, and when manually starting python and submitting via subprocess.check_output(["sbatch", "run_queue.sh"]) no problems occur either. I would highly appreciate any ideas on what could cause this.
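One minimal check (a sketch) is whether the Python process pysqa runs in can see sbatch on its PATH at all:

import os
import shutil

# FileNotFoundError from subprocess usually means the executable is not on the
# PATH of the calling process, even if it is available in an interactive shell.
print(shutil.which("sbatch"))  # None means sbatch is not visible to this process
print(os.environ.get("PATH"))  # compare with echo $PATH in the login shell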

Maybe @jan-janssen has a hint for me?

Installing from conda broken, because of version spec

I've installed pysqa==0.1.1 (as a dependency of pyiron_atomistics) in a clean environment from conda-forge. Importing then fails with this:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[2], line 1
----> 1 import pysqa

File /srv/conda/envs/notebook/lib/python3.11/site-packages/pysqa/__init__.py:5
      2 __all__ = []
      4 from pysqa.queueadapter import QueueAdapter
----> 5 from pysqa.executor.executor import Executor
      7 from ._version import get_versions
      9 __version__ = get_versions()["version"]

File /srv/conda/envs/notebook/lib/python3.11/site-packages/pysqa/executor/executor.py:5
      2 import queue
      3 from concurrent.futures import Future, Executor as FutureExecutor
----> 5 from pympipool import cancel_items_in_queue, RaisingThread
      6 from pysqa.executor.helper import (
      7     reload_previous_futures,
      8     find_executed_tasks,
      9     serialize_funct,
     10     write_to_file,
     11 )
     14 class Executor(FutureExecutor):

ImportError: cannot import name 'cancel_items_in_queue' from 'pympipool' (/srv/conda/envs/notebook/lib/python3.11/site-packages/pympipool/__init__.py)

because conda installs pympipool==0.7.0 by default, which is incompatible.

I know how to fix this, but I'm posting anyway, because (I assume) this only happens because of our weird practice of leaving the upper version bound of our deps open and bumping the lower one. I've maintained for a while that we should bump the upper bound instead. This gives a nice example of why that is better: in that scheme this error would never happen. I will look into how we can do this with dependabot if I have time.

In practical terms, I guess #218 will fix the issue.

[idea] Support dependencies

Users could benefit a lot from having task dependencies resolved by the queuing system, as that is what it is optimized for.
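For SLURM, for example, the queuing system already exposes this via the --dependency flag; the following sketch uses hypothetical script names:

import subprocess

# --parsable makes sbatch print only the job ID of the submitted job.
first_id = subprocess.check_output(
    ["sbatch", "--parsable", "run_step_1.sh"], universal_newlines=True
).strip()

# The second job is held by SLURM until the first one finished successfully.
subprocess.check_call(["sbatch", "--dependency=afterok:" + first_id, "run_step_2.sh"])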

[idea] make queue type available to the outside

It would be great if other tools could see which queuing system is selected, or, even better, if commands like srun were available to the outside, so that tools which construct their own commands can access pysqa to identify the corresponding run command.

Enable reservation

To couple a reservation code to a job in pyiron, I am using the following function:

def enable_reservation_command(self, process_id, reservation_id):
    return ["scontrol", "update", "job", str(process_id), "reservation={}".format(reservation_id)]

However, due to a bug(?) in slurm, the reservation_id can be assigned to the job, but its priority is not bumped. As such, it is required to set the $SBATCH_RESERVATION environment variable right before calling sbatch to have an increased priority. To do this, there are two options:

  • assign it before the sbatch command is called, either in bash or in Python

    export SBATCH_RESERVATION=name-of-reservation              # bash
    os.environ['SBATCH_RESERVATION'] = 'name-of-reservation'   # python

  • use the reservation flag when using sbatch

    sbatch --reservation=name-of-reservation ...

Would there be an easy way of doing this?
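A sketch of the first option from Python, scoping the environment variable to a single sbatch call so the global environment stays untouched:

import os
import subprocess

# Set SBATCH_RESERVATION only for this one sbatch invocation.
env = dict(os.environ, SBATCH_RESERVATION="name-of-reservation")
subprocess.check_output(["sbatch", "run_queue.sh"], env=env)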

Compatible with PBS Professional scheduler?

Hi there,

This is not an issue really.

I was wondering if it is possible to extend pysqa to submit jobs using the PBS job scheduler.

I'm really happy with how pysqa nicely submits jobs in SLURM.

I have access to another cluster which uses PBS Professional to submit jobs.

It'd be really nice to run pyiron jobs over there as well.

Is it possible to just add a jinja2 template for PBS and that's all there is to extend? Or is there something else that needs to be done?

A sample submission script looks like this:

#!/bin/bash
 
#PBS -l walltime=10:00:00,select=1:ncpus=1:mem=2gb
#PBS -N job_name
#PBS -A alloc-code
#PBS -m abe
#PBS -M [email protected]
#PBS -o output.txt
#PBS -e error.txt
 
################################################################################
 
module load software_package_1
module load software_package_2
 
cd $PBS_O_WORKDIR
 
<executable commands here>
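In the jinja2 template language the script above could become something like the sketch below; the variable names (job_name, run_time, cores, memory, command) are illustrative and not necessarily the ones pysqa passes into its templates:

from jinja2 import Template

# Illustrative only: the placeholder names are assumptions made for this sketch.
pbs_template = Template("""#!/bin/bash
#PBS -l walltime={{ run_time }},select=1:ncpus={{ cores }}:mem={{ memory }}
#PBS -N {{ job_name }}
#PBS -o output.txt
#PBS -e error.txt

cd $PBS_O_WORKDIR
{{ command }}
""")

print(pbs_template.render(
    job_name="pyiron_job",
    run_time="10:00:00",
    cores=1,
    memory="2gb",
    command="lmp_serial -in lammps.in",
))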

[feature] Executor: use HDF5 rather than pickle

Implementation

import re
import cloudpickle
import hashlib
import h5io
import h5py
import numpy as np
import queue
import os
import time
from threading import Thread
from concurrent.futures import Future, Executor
from pympipool.shared import cancel_items_in_queue


def get_hash(binary):
    # Remove specification of jupyter kernel from hash to be deterministic
    binary_no_ipykernel = re.sub(b"(?<=/ipykernel_)(.*)(?=/)", b"", binary)
    return str(hashlib.md5(binary_no_ipykernel).hexdigest())


def serialize_funct_h5(fn, *args, **kwargs):
    binary_funct = cloudpickle.dumps(fn)
    binary_all = cloudpickle.dumps({"fn": fn, "args": args, "kwargs": kwargs})
    task_key = fn.__name__ + get_hash(binary=binary_all)
    data = {"fn": binary_funct, "args": args, "kwargs": kwargs}
    return task_key, data


def apply_funct(apply_dict):
    return apply_dict["fn"].__call__(*apply_dict["args"], **apply_dict["kwargs"])


def get_result_h5(task_key):
    file_name = task_key + ".h5"
    with h5py.File(file_name, "r") as hdf:
        if "output" in hdf:
            return h5io.read_hdf5(fname=hdf, title="output", slash="ignore")
        elif "input_args" in hdf and "input_kwargs" in hdf:
            result = apply_funct(
                apply_dict={
                    "fn": cloudpickle.loads(
                        h5io.read_hdf5(fname=hdf, title="function", slash="ignore")
                    ),
                    "args": h5io.read_hdf5(
                        fname=hdf, title="input_args", slash="ignore"
                    ),
                    "kwargs": h5io.read_hdf5(
                        fname=hdf, title="input_kwargs", slash="ignore"
                    ),
                }
            )
        elif "input_args" in hdf:
            result = apply_funct(
                apply_dict={
                    "fn": cloudpickle.loads(
                        h5io.read_hdf5(fname=hdf, title="function", slash="ignore")
                    ),
                    "args": h5io.read_hdf5(
                        fname=hdf, title="input_args", slash="ignore"
                    ),
                    "kwargs": {},
                }
            )
        elif "input_kwargs" in hdf:
            result = apply_funct(
                apply_dict={
                    "fn": cloudpickle.loads(
                        h5io.read_hdf5(fname=hdf, title="function", slash="ignore")
                    ),
                    "args": [],
                    "kwargs": h5io.read_hdf5(
                        fname=hdf, title="input_kwargs", slash="ignore"
                    ),
                }
            )
        else:
            raise TypeError
    write_to_h5_file(task_key=task_key, data_dict={"output": result})
    return result


def write_to_h5_file(task_key, data_dict):
    file_name = task_key + ".h5"
    with h5py.File(file_name, "a") as fname:
        for data_key, data_value in data_dict.items():
            if data_key == "fn":
                h5io.write_hdf5(
                    fname=fname,
                    data=np.void(data_value),
                    overwrite="update",
                    title="function",
                )
            elif data_key == "args":
                h5io.write_hdf5(
                    fname=fname,
                    data=data_value,
                    overwrite="update",
                    title="input_args",
                    slash="ignore",
                )
            elif data_key == "kwargs":
                for k, v in data_value.items():
                    h5io.write_hdf5(
                        fname=fname,
                        data=v,
                        overwrite="update",
                        title="input_kwargs/" + k,
                        slash="ignore",
                    )
            elif data_key == "output":
                h5io.write_hdf5(
                    fname=fname,
                    data=data_value,
                    overwrite="update",
                    title="output",
                    slash="ignore",
                )


def execute_tasks_h5(future_queue):
    while True:
        task_dict = None
        try:
            task_dict = future_queue.get_nowait()
        except queue.Empty:
            time.sleep(0.1)  # avoid busy-waiting while the queue is empty
        if (
            task_dict is not None
            and "shutdown" in task_dict.keys()
            and task_dict["shutdown"]
        ):
            break
        elif task_dict is not None:
            key = list(task_dict.keys())[0]
            future = task_dict[key]
            if not future.done() and future.set_running_or_notify_cancel():
                future.set_result(get_result_h5(task_key=key))


def reload_previous_futures(future_dict, task_queue):
    for f in os.listdir():
        if f.endswith(".h5"):
            key = f.split(".h5")[0]
            # Recreate a Future for results that already exist on disk; without
            # this the lookup would raise a KeyError on a freshly created executor.
            if key not in future_dict:
                future_dict[key] = Future()
            task_queue.put({key: future_dict[key]})


class FileExecutor(Executor):
    def __init__(self):
        self._task_queue = queue.Queue()
        self._memory_dict = {}
        reload_previous_futures(
            future_dict=self._memory_dict, task_queue=self._task_queue
        )
        self._process = Thread(target=execute_tasks_h5, args=(self._task_queue,))
        self._process.start()

    def submit(self, fn, *args, **kwargs):
        task_key, data_dict = serialize_funct_h5(fn, *args, **kwargs)
        if task_key not in self._memory_dict.keys():
            write_to_h5_file(task_key=task_key, data_dict=data_dict)
            self._memory_dict[task_key] = Future()
            self._task_queue.put({task_key: self._memory_dict[task_key]})
        return self._memory_dict[task_key]

    def shutdown(self, wait=True, *, cancel_futures=False):
        if cancel_futures:
            cancel_items_in_queue(que=self._task_queue)
        self._task_queue.put({"shutdown": True, "wait": wait})
        self._process.join()

Example

def my_funct(a, b):
    time.sleep(0.1)
    return a+b

exe = FileExecutor()
fs1 = exe.submit(my_funct, 1, 2)
fs1.done(), fs1.result(), fs1.done()
