
DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm

License: GNU General Public License v3.0


slurm-drmaa's Introduction

DRMAA for Slurm

Please note: DRMAA for Slurm is a continuation of PSNC DRMAA for SLURM by an unrelated developer.

Introduction

DRMAA for Slurm Workload Manager (Slurm) is an implementation of the Open Grid Forum Distributed Resource Management Application API (DRMAA) version 1 for the submission and control of jobs to Slurm. Using DRMAA, grid application builders, portal developers, and ISVs can use the same high-level API to link their software with different cluster/resource management systems.

History

DRMAA for Slurm was originally developed at the Poznań Supercomputing and Networking Center as PSNC DRMAA for SLURM. Following the unexpected death in 2013 of its primary maintainer, Mariusz Mamoński, there has been little additional development from PSNC, and no new releases.

This fork is maintained by Nate Coraor and was originally created in 2014 to add support for Slurm's --clusters (-M) option. Since that time, others have found it useful, and additional features and bug fixes have been added. However, the majority of the credit for this work belongs to the original authors (listed below). In 2017, current PSNC maintainer Piotr Kopta created psnc-apps/slurm-drmaa on GitHub, a snapshot of the unreleased 1.2.0 version (upon which this fork is also based) that has seen occasional work.

The title of the software in this fork has been changed from PSNC DRMAA for SLURM to simply DRMAA for Slurm. This change was made to alleviate confusion and to differentiate this fork from the canonical version. Additionally, this fork is not affiliated with PSNC; as such, releasing this software under PSNC's name would not be appropriate. However, this fork's maintainer is incredibly grateful for the work of the original authors of this software, and the name change is in no way intended to minimize their efforts.

Download

DRMAA for Slurm is distributed as a source package, which can be downloaded from the Releases section.

Installation

To compile and install the library, go to the main source directory and type:

$ ./configure [options] && make
$ sudo make install

The library uses the standard GNU Autotools build system, so the standard configure arguments are available; see ./configure --help for a full list. The library was tested with Slurm versions 16.05 and 18.08. If you encounter any problems using the library on other systems, please see the Contact section.

Notable ./configure script options:

--with-slurm-inc SLURM_INCLUDE_PATH

Path to Slurm header files (i.e., the directory containing slurm/slurm.h). By default the library tries to guess SLURM_INCLUDE_PATH and SLURM_LIBRARY_PATH based on the location of the srun executable.

--with-slurm-lib SLURM_LIBRARY_PATH

Path to Slurm libraries (i.e., the directory containing libslurm.a).

--prefix INSTALLATION_DIRECTORY

Root directory where DRMAA for Slurm will be installed. If not given, the library is installed under /usr/local.

--enable-debug

Compiles the library with debugging enabled (debugging symbols not stripped, no optimizations, and many log messages enabled). Useful when debugging a DRMAA-enabled application or investigating problems in the DRMAA library itself.
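
For example, to build against a Slurm installation under a non-default prefix (the /opt/slurm paths below are illustrative):

$ ./configure --with-slurm-inc=/opt/slurm/include --with-slurm-lib=/opt/slurm/lib
$ make && sudo make install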

There are no unusual requirements for basic usage of the library: a C99 compiler and a standard make program should suffice. If you have taken the sources directly from this repository or wish to run the test suite, you will need additional developer tools (see Developer tools below).

Configuration

During DRMAA session initialization (drmaa_init), the library tries to read its configuration parameters from the following locations: /etc/slurm_drmaa.conf, ~/.slurm_drmaa.conf, and the file named in the $SLURM_DRMAA_CONF environment variable (if set to a non-empty string). If multiple configuration sources are present, all configurations are merged, with values from user-defined files taking precedence (in the following order: $SLURM_DRMAA_CONF, ~/.slurm_drmaa.conf, /etc/slurm_drmaa.conf).
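
For example, to point the library at a specific configuration file for a single run (my-drmaa-app stands in for any hypothetical DRMAA-enabled program):

$ SLURM_DRMAA_CONF=/path/to/slurm_drmaa.conf my-drmaa-app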

Currently recognized configuration parameters are:

cache_job_state

According to the DRMAA specification, every drmaa_job_ps() call should query the DRM system for the job state. With this option one may optimize communication with the DRM: if set to a positive integer, drmaa_job_ps() returns the cached job state without communicating with the DRM for cache_job_state seconds since the last update. By default the library conforms to the specification (no caching is performed).

Type: integer, default: 0

job_categories

Dictionary of job categories. Its keys are job category names, mapped to native specification strings. Attributes set by a job category can be overridden by the corresponding DRMAA attributes or by the native specification. The special category name default is used when the drmaa_job_category job attribute is not set.

Type: dictionary with string values, default: empty dictionary

Configuration file syntax

The configuration file takes the form of a dictionary. A dictionary is a set of zero or more key-value pairs. A key is a string, while a value can be a string, an integer, or another dictionary.

  configuration: dictionary | dictionary_body
  dictionary: '{' dictionary_body '}'
  dictionary_body: (string ':' value ',')*
  value: integer | string | dictionary
  string: unquoted-string | single-quoted-string | double-quoted-string
  unquoted-string: [^ \t\n\r:,0-9][^ \t\n\r:,]*
  single-quoted-string: '[^']*'
  double-quoted-string: "[^"]*"
  integer: [0-9]+
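
As an illustration, here is a minimal slurm_drmaa.conf sketch using the parameters described above (the partition names and values are hypothetical):

cache_job_state: 30,
job_categories: {
  default: '-p normal',
  long: '-p long -t 12:00',
},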

Native specification

The DRMAA interface allows passing DRM-dependent job submission options. Those options may be specified directly by setting the drmaa_native_specification job template attribute or indirectly by the drmaa_job_category job template attribute. The legal format of the native options looks like:

  -A My_job_name -s -N 1-10
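
As an illustration, the sketch below (minimal and untested) passes such a string through the DRMAA 1.0 C API; the remote command and its argument are arbitrary placeholders:

#include <stdio.h>
#include <drmaa.h>

int main(void)
{
    char err[DRMAA_ERROR_STRING_BUFFER];
    char job_id[DRMAA_JOBNAME_BUFFER];
    drmaa_job_template_t *jt = NULL;
    const char *args[] = { "60", NULL };

    if (drmaa_init(NULL, err, sizeof(err)) != DRMAA_ERRNO_SUCCESS) {
        fprintf(stderr, "drmaa_init: %s\n", err);
        return 1;
    }
    drmaa_allocate_job_template(&jt, err, sizeof(err));
    drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/sleep", err, sizeof(err));
    drmaa_set_vector_attribute(jt, DRMAA_V_ARGV, args, err, sizeof(err));
    /* Native options are passed as a single string, exactly as above. */
    drmaa_set_attribute(jt, DRMAA_NATIVE_SPECIFICATION,
                        "-A My_job_name -s -N 1-10", err, sizeof(err));
    if (drmaa_run_job(job_id, sizeof(job_id), jt, err, sizeof(err)) == DRMAA_ERRNO_SUCCESS)
        printf("submitted job %s\n", job_id);
    else
        fprintf(stderr, "drmaa_run_job: %s\n", err);
    drmaa_delete_job_template(jt, err, sizeof(err));
    drmaa_exit(err, sizeof(err));
    return 0;
}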

List of parameters that can be passed in the drmaa_native_specification attribute:

Native specification Description
-A, --account=name Charge job to specified accounts
--acctg-freq=list Define the job accounting sampling interval
--comment=string An arbitrary comment
-C, --constraint=list Specify a list of constraints
-c, --cpus-per-task=n Number of processors per task
--contiguous If set, then the allocated nodes must form a contiguous set
-d, --dependency=list Defer the start of this job until the specified dependencies have been satisfied
--exclusive Allocate nodes in exclusive mode; the job allocation cannot share nodes with other running jobs
--gres=list Specifies a comma delimited list of generic consumable resources
-k, --no-kill Do not automatically terminate a job if one of the nodes it has been allocated fails
-L, --licenses=license Specification of licenses
-M, --clusters=list Comma delimited list of clusters to issue commands to
--mail-type=type Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change)
--mem=MB Minimum amount of real memory
--mem-per-cpu=MB Maximum amount of real memory per allocated cpu required by a job
--mincpus=n Minimum number of logical processors (threads) per node
-N, --nodes=minnodes[-maxnodes] Number of nodes on which to run
-n, --ntasks=n Number of tasks
--no-requeue Specifies that the batch job should not be requeued after node failure
--ntasks-per-node=n Number of tasks to invoke on each node
-p, --partition=partition Partition requested
--qos=qos Quality of Service
--requeue If set, permit the job to be requeued
--reservation=name Allocate resources from named reservation
-s, --share Job allocation can share nodes with other running jobs
--tmp=size[units] Specify a minimum amount of temporary disk space
-w, --nodelist=hosts Request a specific list of hosts
-x, --exclude=nodelist Explicitly exclude certain nodes from the resources granted to the job

Additionally, the following parameters to drmaa_native_specification are supported, but their use is discouraged in favor of the corresponding DRMAA job attributes:

Native specification DRMAA job attribute Description
-e, --error=pattern drmaa_error_path Connect the batch script's standard error directly to the file name specified in the pattern
-J, --job-name=name drmaa_job_name Specify a name for the job allocation
-o, --output=pattern drmaa_output_path Connect the batch script's standard output directly to the file name specified in the pattern
-t, --time=hours:minutes drmaa_wct_hlimit Set a maximum job wallclock time

Descriptions of each parameter can be found in man sbatch.
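
For instance, assuming a job template jt and an error buffer err as in the earlier sketch, the preferred DRMAA attributes can be set as follows (the job name, paths, and time limit are illustrative):

/* Prefer the DRMAA attributes over the equivalent native options. */
drmaa_set_attribute(jt, DRMAA_JOB_NAME, "myjob", err, sizeof(err));
/* DRMAA paths take the form [hostname]:file_path; an empty hostname means the local host. */
drmaa_set_attribute(jt, DRMAA_OUTPUT_PATH, ":/tmp/myjob.out", err, sizeof(err));
drmaa_set_attribute(jt, DRMAA_ERROR_PATH, ":/tmp/myjob.err", err, sizeof(err));
drmaa_set_attribute(jt, DRMAA_WCT_HLIMIT, "1:00:00", err, sizeof(err));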

Changelog

See CHANGELOG.md

Known bugs and limitations

The library covers all of the DRMAA 1.0 specification, with the exceptions listed below. It was successfully tested with Slurm 16.05 and 18.08. Known limitations:

  • drmaa_control options DRMAA_CONTROL_HOLD and DRMAA_CONTROL_RELEASE are only available to users who are Slurm administrators (in Slurm versions prior to 2.2)
  • drmaa_control options DRMAA_CONTROL_SUSPEND and DRMAA_CONTROL_RESUME are only available to users who are Slurm administrators
  • drmaa_wct_slimit not implemented
  • optional attributes drmaa_deadline_time, drmaa_duration_hlimit, drmaa_duration_slimit, drmaa_transfer_files not implemented
  • The SPANK client-side (i.e., not remote) plugin chain is not invoked by the DRMAA run job call. For this reason we advise you to use TASK BLOCKS in the UseEnv SPANK plugin.

Development and Pre-releases

note: This repository depends on FedStage DRMAA Utils, which is configured as a submodule. When cloning this repository, you should clone recursively, e.g.:

$ git clone --recursive https://github.com/natefoo/slurm-drmaa.git

The source repository does not contain Autotools-generated artifacts such as configure and Makefile. The ./autogen.sh and ./autoclean.sh scripts call the Autotools command chain in the appropriate order to generate (and clean up) these artifacts.
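
After a recursive clone, regenerating the build system and building typically looks like the following sketch, using the scripts named above:

$ ./autogen.sh
$ ./configure [options] && make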

note: You need some developer tools to compile the source from git.

Developer tools

Although not needed to use the library or to compile it from source distribution tarballs, the following tools may be required if you intend to develop DRMAA for Slurm from git:

  • GNU autotools
    • autoconf (tested with version 2.67)
    • automake (tested with version 1.11)
    • libtool (tested with version 2.2.8)
    • m4 (tested with version 1.4.14)
  • Bison parser generator
  • Ragel State Machine Compiler
  • gperf - a perfect hash function generator

Authors

The library was developed by:

  • Michal Matloka - first implementation
  • Mariusz Mamonski - maintainer since version 1.0.3
  • Piotr Kopta - maintainer since version 1.0.7

This library relies heavily on the Fedstage DRMAA utils code developed by:

  • Lukasz Ciesnik

The maintainer of this fork is:

  • Nate Coraor

with additional contributors.

Contact

You can submit issues and pull requests for DRMAA for Slurm in GitHub.

Links

Software using DRMAA for Slurm

  • QCG-Computing: remote multi-user job submission and control over Web Services
  • Galaxy: an open, web-based platform for accessible, reproducible, and transparent computational biomedical research

License

Copyright (C) 2011-2015 Poznan Supercomputing and Networking Center
Copyright (C) 2014-2019 The Pennsylvania State University

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Some portions of this program are copied or derived from Slurm, which is licensed under the GNU General Public License Version 2. For details, including the list of Slurm copyright holders, see <https://slurm.schedmd.com/>.

slurm-drmaa's People

Contributors

atombaby benmwebb duffrohde ericr86 holtgrewe kcgthb mmatloka natefoo nsoranzo pkopta


slurm-drmaa's Issues

Issues with Slurm 20.11.0

make[2]: Entering directory `/root/slurm-drmaa-1.1.1/slurm_drmaa'
/bin/sh ../libtool  --tag=CC   --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I..  -I/root/rpmbuild/BUILD/slurm-20.11.0 -I../drmaa_utils/ -Wno-long-long  -D_REENTRANT -D_THREAD_SAFE -DNDEBUG  -D_GNU_SOURCE -DCONFDIR=/etc  -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O2 -pthread -MT libdrmaa_la-drmaa.lo -MD -MP -MF .deps/libdrmaa_la-drmaa.Tpo -c -o libdrmaa_la-drmaa.lo `test -f 'drmaa.c' || echo './'`drmaa.c
libtool: compile:  gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I/root/rpmbuild/BUILD/slurm-20.11.0 -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O2 -pthread -MT libdrmaa_la-drmaa.lo -MD -MP -MF .deps/libdrmaa_la-drmaa.Tpo -c drmaa.c  -fPIC -DPIC -o .libs/libdrmaa_la-drmaa.o
drmaa.c: In function ‘slurmdrmaa_get_DRM_system’:
drmaa.c:65:3: error: unknown type name ‘slurm_ctl_conf_t’
   slurm_ctl_conf_t * conf_info_msg_ptr = NULL; 
   ^
drmaa.c:66:3: warning: passing argument 2 of ‘slurm_load_ctl_conf’ from incompatible pointer type [enabled by default]
   if ( slurm_load_ctl_conf ((time_t) NULL, &conf_info_msg_ptr ) == -1 ) 
   ^
In file included from drmaa.c:31:0:
/root/rpmbuild/BUILD/slurm-20.11.0/slurm/slurm.h:3714:12: note: expected ‘struct slurm_conf_t **’ but argument is of type ‘int **’
 extern int slurm_load_ctl_conf(time_t update_time,
            ^
drmaa.c:73:101: error: request for member ‘version’ in something not a structure or union
    fsd_snprintf(NULL, slurmdrmaa_version, sizeof(slurmdrmaa_version)-1,"SLURM %s", conf_info_msg_ptr->version);
                                                                                                     ^
drmaa.c:74:4: warning: passing argument 1 of ‘slurm_free_ctl_conf’ from incompatible pointer type [enabled by default]
    slurm_free_ctl_conf (conf_info_msg_ptr);
    ^
In file included from drmaa.c:31:0:
/root/rpmbuild/BUILD/slurm-20.11.0/slurm/slurm.h:3722:13: note: expected ‘struct slurm_conf_t *’ but argument is of type ‘int *’
 extern void slurm_free_ctl_conf(slurm_conf_t *slurm_ctl_conf_ptr);
             ^
make[2]: *** [libdrmaa_la-drmaa.lo] Error 1
make[2]: Leaving directory `/root/slurm-drmaa-1.1.1/slurm_drmaa'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/slurm-drmaa-1.1.1'
make: *** [all] Error 2

TRES slurm config settings cause errors

drmaa-run bash test.drmaa
drmaa-run: error: _parse_next_key: Parsing error at unrecognized key: PriorityWeightTRES
drmaa-run: error: Parse error in file /etc/slurm/slurm.conf line 131: "PriorityWeightTRES=CPU=1000,Mem=2000 "
drmaa-run: error: _parse_next_key: Parsing error at unrecognized key: AccountingStorageTRES
drmaa-run: error: Parse error in file /etc/slurm/slurm.conf line 186: "AccountingStorageTRES=gres/gpu,gres/mic,gres/nvme"
drmaa-run: error: Parsing error at unrecognized key: TresBillingWeights
drmaa-run: error: Parse error in file /etc/slurm/slurm.conf line 196: " TresBillingWeights="CPU=1.0,Mem=0.127G""
drmaa-run: error: Unable to establish controller machine
Segmentation fault

--mem is treated as --mem-per-cpu

When using --mem, the value passed to Slurm is multiplied by the number of CPUs, i.e., it is treated as --mem-per-cpu. I think this was just a merge error on my part. This commit seems to have been applied to the wrong lines:

47e7ba0

How to build RPMs

Is it possible to build RPMs from the source build?
Instructions?

Slurm DRMAA - Unable to generate targets

I followed the instructions given on the GitHub page. I am using Slurm version 20.11.7 on AWS ParallelCluster.

  1. export SLURM_INCLUDE_PATH=/opt/slurm/include/slurm
     export SLURM_LIBRARY_PATH=/opt/slurm/lib

  2. export LD_LIBRARY_PATH=/opt/slurm/lib

  3. ./configure && make

The error I get is as follows:

[ec2-user@ip-XX-XX-XX-XX slurm-drmaa-1.1.2]$ ./configure && make
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for style of include used by make... GNU
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking dependency style of gcc... gcc3
checking for ar... ar
checking the archiver (ar) interface... ar
checking for gcc... (cached) gcc
checking whether we are using the GNU C compiler... (cached) yes
checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ISO C89... (cached) none needed
checking dependency style of gcc... (cached) gcc3
checking for gcc option to accept ISO C99... none needed
checking how to run the C preprocessor... gcc -E
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /bin/sed
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for fgrep... /bin/grep -F
checking for ld used by gcc... /bin/ld
checking if the linker (/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /bin/nm -B
checking the name lister (/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert x86_64-unknown-linux-gnu file names to x86_64-unknown-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-unknown-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for archiver @file support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for mt... no
checking if : is a manifest tool... no
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking whether make sets $(MAKE)... (cached) yes
checking whether ln -s works... yes
checking whether gcc accepts -Wno-missing-field-initializers... yes
checking whether gcc accepts -Wno-format-zero-length... yes
checking whether gcc is Clang... no
checking whether pthreads work with -pthread... yes
checking for joinable pthread attribute... PTHREAD_CREATE_JOINABLE
checking whether more special flags are required for pthreads... no
checking for PTHREAD_PRIO_INHERIT... yes
configure: checking for SLURM
checking for SLURM compile flags... -I/opt/slurm/include
checking for SLURM library dir... /opt/slurm/lib
checking for slurmdb_users_get in -lslurm... yes
Using slurm libraries -lslurm
checking for usable SLURM libraries/headers... yes
checking for ANSI C header files... (cached) yes
checking whether time.h and sys/time.h may both be included... yes
checking for stdbool.h that conforms to C99... yes
checking for _Bool... yes
checking stddef.h usability... yes
checking stddef.h presence... yes
checking for stddef.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for strings.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking for size_t... yes
checking whether struct tm is in sys/time.h or time.h... time.h
checking for an ANSI C-conforming const... yes
checking for inline... inline
checking for working volatile... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
checking for strftime... yes
checking for vprintf... yes
checking for _doprnt... no
checking for asprintf... yes
checking for fstat... yes
checking for getcwd... yes
checking for gettimeofday... yes
checking for localtime_r... yes
checking for memset... yes
checking for mkstemp... yes
checking for setenv... yes
checking for strcasecmp... yes
checking for strchr... yes
checking for strdup... yes
checking for strerror... yes
checking for strlcpy... no
checking for strndup... yes
checking for strstr... yes
checking for strtol... yes
checking for vasprintf... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating test/Makefile
config.status: creating slurm_drmaa/Makefile
config.status: creating config.h
config.status: config.h is unchanged
config.status: executing depfiles commands
config.status: executing libtool commands
=== configuring in drmaa_utils (/home/ec2-user/slurm-drmaa-1.1.2/drmaa_utils)
configure: WARNING: no configuration information is in drmaa_utils

Run 'make' now.
make all-recursive
make[1]: Entering directory `/home/ec2-user/slurm-drmaa-1.1.2'
Making all in drmaa_utils
make[2]: Entering directory `/home/ec2-user/slurm-drmaa-1.1.2/drmaa_utils'
make[2]: *** No rule to make target `all'. Stop.
make[2]: Leaving directory `/home/ec2-user/slurm-drmaa-1.1.2/drmaa_utils'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ec2-user/slurm-drmaa-1.1.2'
make: *** [all] Error 2

Submitting multi-threaded jobs using slurm-drmaa

I realize it's strange to ask a question like this on a repository, but I've spent the past hour trying to figure it out on my own to no avail. I thought that you might be able to answer it in 30 seconds. I would greatly appreciate any help!

In essence, how do you submit multi-threaded jobs using slurm-drmaa? To be clear, I want the job to run on one node (i.e. --ntasks=1). I use the --cpus-per-task option with srun or sbatch, but this option isn't available in the native specification for slurm-drmaa.

I've tried different combinations of --mincpus, --nodes, --ntasks-per-node and --ntasks, but they either allow jobs to be split across multiple nodes or they fail. I've looked through the code for galaxyproject/galaxy and galaxyproject/pulsar, but I couldn't find any hints.

RPM installs libdrmaa.so in /usr/lib64, but sets DRMAA_LIBRARY_PATH to /usr/lib/

Hi!

Using the provided SPEC file (CentOS 7.x, x86_64), it looks like the generated RPM installs libdrmaa.so in /usr/lib64:

$ rpm -ql slurm-drmaa
/usr/bin/drmaa-run
/usr/bin/drmaa-run-bulk
/usr/bin/hpc-bash
/usr/include/drmaa.h
/usr/lib64/libdrmaa.a
/usr/lib64/libdrmaa.la
/usr/lib64/libdrmaa.so
/usr/lib64/libdrmaa.so.1
/usr/lib64/libdrmaa.so.1.0.8
/usr/share/doc/slurm-drmaa-1.1.2
/usr/share/doc/slurm-drmaa-1.1.2/COPYING
/usr/share/doc/slurm-drmaa-1.1.2/NEWS
/usr/share/doc/slurm-drmaa-1.1.2/README.md
/usr/share/doc/slurm-drmaa-1.1.2/slurm_drmaa.conf.example

But DRMAA_LIBRARY_PATH defaults to /usr/lib/libdrmaa.so:

$ drmaa-run
F #1d7bd [     0.00]  * Could not load DRMAA library (DRMAA_LIBRARY_PATH=/usr/lib/libdrmaa.so)
F #1d7bd [     0.00]  * Error

and:

$ strace -e open drmaa-run
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libdrmaa.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 3
F #1d7c5 [     0.00]  * Could not load DRMAA library (DRMAA_LIBRARY_PATH=/usr/lib/libdrmaa.so)
F #1d7c5 [     0.00]  * Error
+++ exited with 255 +++

Would there be a way to make the default DRMAA_LIBRARY_PATH consistent with the RPM installation?

JNI implementation

Hi

We have several web applications submitting jobs to SGE using the Java-based Opal Toolkit (https://sourceforge.net/projects/opaltoolkit/) via drmaa.jar

In the process of migrating from SGE to Slurm I realized that the JNI methods do not exist in the slurm-drmaa library. Do you know of any out-of-the-box solution for submitting jobs via drmaa.jar to Slurm?

Best regards,

Guilhem

Date format for maximum wall clock time

Hi,
I have a problem in passing the right format for setting the hard wall clock time. I've been successful in passing the hours and minutes, but I'm not able to set the seconds. Is it possible?
For now I'm passing the string formatted as follows: "HH:MM", but if I add the seconds ("HH:MM:SS") it will just pass 0:00 to SLURM.

Thanks,
Alessio

segfault when waiting for bulk jobs > 20 mins

Hi,

When I run bulk jobs with one of the jobs lasting more than 20 minutes, the s.synchronize() function waits for a few minutes after the last job is finished and then triggers a segfault:

import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

Same happens if I loop through the job ids with the s.wait() function:

for jobid in joblist:
   s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)

However it works perfectly fine if jobs finish in the same order as their SLURM_ARRAY_TASK_ID:

import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((${SLURM_ARRAY_TASK_ID}*300))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

No problem if jobs last only 10 minutes:

import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*10))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

I came up with this little piece of code to bypass the bug:

import time
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
for jobid in joblist:
   while (s.jobStatus(jobid)=="running"):
      time.sleep(10)
   print "job %s done" % jobid

Yields:

job 135892105_1 done
job 135892105_2 done
job 135892105_3 done
job 135892105_4 done

LD_LIBRARY_PATH not exported.

Hi,

I am experiencing an issue similar to #19. It appears the slurm shared libraries specified by --with-slurm-lib cannot be found when loading conftest at runtime during the configure script.

I believe the issue is that while LD_LIBRARY_PATH is set in ax_slurm.m4, it is never exported. You can see how this was done in cURL: curl/curl@302d537.

I've tried this and it does fix the issue. Alternatively, while looking into this I've found it suggested that using rpath is better practice as it's more constrained. I also was able to run ./configure successfully by setting the rpath as shown here: reid-wagner@67c7f6e.

If you want to go that path I'd be glad to open a PR. I haven't been able to test compilation yet for a few reasons, one being that I'm encountering an unrelated compilation issue on master.

The above issue happens with slurm-drmaa 1.1.1 and gcc 4.8.5 on CentOS 7.8.2003.

Additionally, it's worth mentioning that out of the box 1.1.1 configured and compiled on my Ubuntu machine with gcc 9.3.0. I actually grabbed the conftest.c source from config.log and compiled it on both machines. On the Ubuntu machine it appears that the dependency on libslurm was stripped from the ELF, I guess because it's optimized out. On the CentOS machine the dependency is there. So on the Ubuntu machine it wasn't actually testing that the libraries could be found at runtime.

Thanks for taking a look.

Below is the error from config.log. I modified the paths:

configure:14098: checking for usable SLURM libraries/headers
configure:14119: gcc -std=gnu99 -o conftest -pedantic -std=c99 -g -O2 -pthread -D_REENTRANT -D_THREAD_SAFE -DNDEBUG  -D_GNU_SOURCE -I
/path/to/include/  -L/path/to/lib/ conftest.c -lslurm   -lslurm  >&5
configure:14119: $? = 0
configure:14119: ./conftest
./conftest: error while loading shared libraries: libslurm.so.35: cannot open shared object file: No such file or directory
configure:14119: $? = 127
configure: program exited with status 127
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "DRMAA for Slurm"
| #define PACKAGE_TARNAME "slurm-drmaa"
| #define PACKAGE_VERSION "1.1.1"
| #define PACKAGE_STRING "DRMAA for Slurm 1.1.1"
| #define PACKAGE_BUGREPORT "[email protected]"
| #define PACKAGE_URL ""
| #define PACKAGE "slurm-drmaa"
| #define VERSION "1.1.1"
| #define STDC_HEADERS 1
| #define HAVE_SYS_TYPES_H 1
| #define HAVE_SYS_STAT_H 1
| #define HAVE_STDLIB_H 1
| #define HAVE_STRING_H 1
| #define HAVE_MEMORY_H 1
| #define HAVE_STRINGS_H 1
| #define HAVE_INTTYPES_H 1
| #define HAVE_STDINT_H 1
| #define HAVE_UNISTD_H 1
| #define HAVE_DLFCN_H 1
| #define LT_OBJDIR ".libs/"
| #define HAVE_PTHREAD_PRIO_INHERIT 1
| #define HAVE_LIBSLURM 1
| /* end confdefs.h.  */
|  #include "slurm/slurm.h"
| int
| main ()
| {
|  job_desc_msg_t job_req; /*at least check for declared structs */
|                  return 0;
| 
|   ;
|   return 0;
| }
configure:14134: result: no
configure:14140: error: 
Slurm libraries/headers not found;
add --with-slurm-inc and --with-slurm-lib with appropriate locations.

Segmentation fault when providing --time

Got one more segmentation fault when specifying --time 1:00:00 for 1 hour.
Seems to be just an alias problem since -t 1:00:00 works.

...
d #e494 [     0.03]  * # Native specification: --cpus-per-task=2 --nodes=1 --mem-per-cpu=50 --partition=htc --time 1:00:00 --tmp=100
t #e494 [     0.03] -> slurmdrmaa_parse_native
t #e494 [     0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [     0.03]  * # cpus_per_task = 2
t #e494 [     0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [     0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [     0.03]  * nodes: 1 ->
d #e494 [     0.03]  * # min_nodes = 1
t #e494 [     0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [     0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [     0.03]  * # pn_min_memory (MEM_PER_CPU) = 50
t #e494 [     0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [     0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [     0.03]  * # partition = htc
t #e494 [     0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [     0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [     0.03]  * # time_limit = (null)
t #e494 [     0.03] -> slurmdrmaa_datetime_parse((null))

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6b20faf in __strlen_sse42 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6b20faf in __strlen_sse42 () from /lib64/libc.so.6
#1  0x00007fffed955284 in slurmdrmaa_datetime_parse (string=0x0) at util.c:53
#2  0x00007fffed956295 in slurmdrmaa_add_attribute (job_desc=0x7fffffff9e10, attr=19, value=0x0) at util.c:292
#3  0x00007fffed956c19 in slurmdrmaa_parse_additional_attr (job_desc=0x7fffffff9e10, add_attr=0x7ac3bf "time", clusters_opt=0x7fffffff8af0) at util.c:427
#4  0x00007fffed9570f8 in slurmdrmaa_parse_native (job_desc=0x7fffffff9e10, value=0x79f8b0 "--cpus-per-task=2 --nodes=1 --mem-per-cpu=50 --partition=htc --time 1:00:00 --tmp=100") at util.c:502
#5  0x00007fffed95462e in slurmdrmaa_job_create (session=0x641ad0, jt=0x7e3570, envp=0x7fffffffa0f8, expand=0x771280, job_desc=0x7fffffff9e10) at job.c:701
#6  0x00007fffed952d3b in slurmdrmaa_job_create_req (session=0x641ad0, jt=0x7e3570, envp=0x7fffffffa0f8, job_desc=0x7fffffff9e10) at job.c:302
#7  0x00007fffed954af4 in slurmdrmaa_session_run_bulk (self=0x641ad0, jt=0x7e3570, start=1, end=2, incr=1) at session.c:126
#8  0x00007fffed96facb in drmaa_run_bulk_jobs (job_ids=0x7fffeea84a28, jt=0x7e3570, start=1, end=2, incr=1, error_diagnosis=0x732960 "", error_diag_len=1024) at drmaa_base.c:427
#9  0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#10 0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed96f8e3 <drmaa_run_bulk_jobs>, rvalue=<optimized out>, avalue=0x7fffffffa330) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#11 0x00007fffeffe483c in _call_function_pointer (argcount=7, resmem=0x7fffffffa380, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffa330, pProc=0x7fffed96f8e3 <drmaa_run_bulk_jobs>, flags=4353)
    at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#12 _ctypes_callproc (pProc=0x7fffed96f8e3 <drmaa_run_bulk_jobs>, argtuple=0x7fffffffa4f0, flags=4353, argtypes=<optimized out>, restype=0x7ffff0236158, checker=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#13 0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#14 0x00007ffff793fe96 in PyObject_Call (func=0x7fffeea66e58, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#15 0x00007ffff7a20236 in do_call_core (kwdict=0x0, callargs=<optimized out>, func=0x7fffeea66e58) at Python/ceval.c:5067
#16 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3366
#17 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff0220390, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=6, kwnames=0x0, kwargs=0x7e61c8, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0,
    name=0x7ffff7f66308, qualname=0x7ffff7f66308) at Python/ceval.c:4128
#18 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=6, stack=<optimized out>, func=0x7fffeea7f840) at Python/ceval.c:4939
#19 call_function (pp_stack=0x7fffffffaa08, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#20 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#21 0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd92200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#22 0x00007ffff7978f16 in listextend (self=0x7fffeea8ee88, b=<optimized out>) at Objects/listobject.c:857
#23 0x00007ffff7979398 in list_init (self=0x7fffeea8ee88, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#24 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8e908, kwds=0x0) at Objects/typeobject.c:915
#25 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#26 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffad48, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#27 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#28 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01ff420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9cb58, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0,
    name=0x7ffff7ea3d70, qualname=0x7fffefd8f300) at Python/ceval.c:4128
#29 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea84400) at Python/ceval.c:4939
#30 call_function (pp_stack=0x7fffffffafe8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#31 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#32 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1c930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0)
    at Python/ceval.c:4128
#33 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#34 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#35 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffb340, locals=0x7ffff7f5df30, globals=0x7ffff7f5df30, filename=0x7ffff7ea3970, mod=0x6857d8) at Python/pythonrun.c:980
#36 PyRun_FileExFlags (fp=0x6438d0, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5df30, locals=0x7ffff7f5df30, closeit=<optimized out>, flags=0x7fffffffb340) at Python/pythonrun.c:933
#37 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x6438d0, filename=<optimized out>, closeit=1, flags=0x7fffffffb340) at Python/pythonrun.c:396
#38 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffb340, filename=0x603310 L"test_drmaa.py", fp=0x6438d0) at Modules/main.c:338
#39 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#40 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69

Handle new Slurm job states

As of Slurm 17.11 there are new job states that slurm-drmaa is not handling. These include JOB_OOM (OUT_OF_MEMORY), among others. As a result, the DRMAA state will be UNDETERMINED until the job exceeds Slurm's MinJobAge.

Need to determine if JOB_OOM is only returned when the job is terminal, or at any time that the OOM killer has activated.

"unknown signal?!" reported from JobInfo terminatedSignal

Hello,

I'm not sure if this the origin of this particular bug, but I have not successfully reproduced this error on other DRMAA implementations.

I've submitted jobs to my SLURM 18.08 system where, occasionally, I get a reported "unknown signal?!". The exact same job, when resubmitted, may or may not have this issue. I cannot track down exactly what happens when this occurs or what causes this.

I have run strace on equivalent submitted jobs, one which reports the "unknown signal" vs. a regular exiting job, and I cannot find any discernible difference, notably when tracing specifically for signals.

sacct reports nothing unusual, and actually seems to indicate that the job exited without issue. The sysadmin for our cluster system seems to agree and cannot find any issue.

This could be a cluster-specific issue, DRMAA issue, or not. If I'm looking in the wrong place please kindly redirect me. I'm not sure where or how I could start tracking down this issue.

Thanks for your time.

Issue with Slurm 20.02.4 and slurm-drmaa-1.1.1

I'm running into an issue on Centos 7.9.2009 running slurm 20.02.4 and slurm-drmaa-1.1.1. I ran configure with the '--with-slurm-inc=/opt/slurm/include' and '--with-slurm-lib=/opt/slurm/lib' options, but in the galaxy log it can't find libslurm.so.35 despite the library being there:

Traceback (most recent call last):
  File "lib/galaxy/main.py", line 298, in <module>
    main()
  File "lib/galaxy/main.py", line 294, in main
    app_loop(args, log)
  File "lib/galaxy/main.py", line 141, in app_loop
    attach_to_pools=args.attach_to_pool,
  File "lib/galaxy/main.py", line 108, in load_galaxy_app
    **kwds
  File "lib/galaxy/app.py", line 221, in __init__
    self.job_manager = manager.JobManager(self)
  File "lib/galaxy/jobs/manager.py", line 26, in __init__
    self.job_handler = handler.JobHandler(app)
  File "lib/galaxy/jobs/handler.py", line 51, in __init__
    self.dispatcher = DefaultJobDispatcher(app)
  File "lib/galaxy/jobs/handler.py", line 972, in __init__
    self.job_runners = self.app.job_config.get_job_runner_plugins(self.app.config.server_name)
  File "lib/galaxy/jobs/__init__.py", line 801, in get_job_runner_plugins
    rval[id] = runner_class(self.app, runner.get('workers', JobConfiguration.DEFAULT_NWORKERS), **runner.get('kwds', {}))
  File "lib/galaxy/jobs/runners/drmaa.py", line 65, in __init__
    drmaa = __import__("drmaa")
  File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/__init__.py", line 65, in <module>
    from .session import JobInfo, JobTemplate, Session
  File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/session.py", line 39, in <module>
    from drmaa.helpers import (adapt_rusage, Attribute, attribute_names_iterator,
  File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/helpers.py", line 36, in <module>
    from drmaa.wrappers import (drmaa_attr_names_t, drmaa_attr_values_t,
  File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/wrappers.py", line 56, in <module>
    _lib = CDLL(libpath, mode=RTLD_GLOBAL)
  File "/usr/lib64/python3.6/ctypes/__init__.py", line 343, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libslurm.so.35: cannot open shared object file: No such file or directory
[galaxy@ip-xxxxxxxxx slurm-drmaa-1.1.1]$ ls -la /opt/slurm/lib/libslurm.*
-rw-r--r--. 1 root root 57281210 Nov 17 16:38 /opt/slurm/lib/libslurm.a
-rwxr-xr-x. 1 root root      976 Nov 17 16:38 /opt/slurm/lib/libslurm.la
lrwxrwxrwx. 1 root root       18 Nov 17 16:38 /opt/slurm/lib/libslurm.so -> libslurm.so.35.0.0
lrwxrwxrwx. 1 root root       18 Nov 17 16:38 /opt/slurm/lib/libslurm.so.35 -> libslurm.so.35.0.0
-rwxr-xr-x. 1 root root  8200504 Nov 17 16:38 /opt/slurm/lib/libslurm.so.35.0.0

After some digging I found the following in the config.log:

configure:4925: checking for gcc option to accept ISO C99
configure:5074: gcc  -c -g -O2  conftest.c >&5
conftest.c:61:29: error: expected ';', ',' or ')' before 'text'
 test_restrict (ccp restrict text)
                             ^
conftest.c: In function 'main':
conftest.c:115:18: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'newvar'
   char *restrict newvar = "Another string";
                  ^
conftest.c:115:18: error: 'newvar' undeclared (first use in this function)
conftest.c:115:18: note: each undeclared identifier is reported only once for each function it appears in
conftest.c:125:3: error: 'for' loop initial declarations are only allowed in C99 mode
   for (int i = 0; i < ia->datasize; ++i)
   ^
conftest.c:125:3: note: use option -std=c99 or -std=gnu99 to compile your code
configure:5074: $? = 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "DRMAA for Slurm"
| #define PACKAGE_TARNAME "slurm-drmaa"
| #define PACKAGE_VERSION "1.1.1"
| #define PACKAGE_STRING "DRMAA for Slurm 1.1.1"
| #define PACKAGE_BUGREPORT "[email protected]"
| #define PACKAGE_URL ""
| #define PACKAGE "slurm-drmaa"
| #define VERSION "1.1.1"
| /* end confdefs.h.  */

Any ideas?

New native specification in User Slurm

Hi, I have a question. If we define a new native specification option, like -P for Project_ID (the ID of a particular project, similar to an account), can I update the DRMAA for Slurm library to support this attribute myself? Thanks.

segfault when jobs' requirements cannot be met

Using the same approach described in #5 but running the job with:

jt.nativeSpecification = "--cpus-per-task=2000 --nodes=1 --mem-per-cpu=5000 --partition=htc --tmp=100"

Another segfault is triggered:

Program received signal SIGSEGV, Segmentation fault.
drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
297     iter_function(job_id, drmaa_job_ids_t)
(gdb) bt
#0  drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
#1  0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#2  0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed771770 <drmaa_release_job_ids>, rvalue=<optimized out>, avalue=0x7fffffffc710)
    at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#3  0x00007fffeffe483c in _call_function_pointer (argcount=1, resmem=0x7fffffffc730, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffc710, pProc=0x7fffed771770 <drmaa_release_job_ids>, flags=4353)
    at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#4  _ctypes_callproc (pProc=0x7fffed771770 <drmaa_release_job_ids>, argtuple=0x7fffffffc7e0, flags=4353, argtypes=<optimized out>, restype=0x7ffff7d61cd0 <_Py_NoneStruct>, checker=0x0)
    at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#5  0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#6  0x00007ffff793fade in _PyObject_FastCallDict (func=0x7fffeea66818, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#7  0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffcb18, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#8  0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#9  0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd90200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#10 0x00007ffff7978f16 in listextend (self=0x7fffeea7ad48, b=<optimized out>) at Objects/listobject.c:857
#11 0x00007ffff7979398 in list_init (self=0x7fffeea7ad48, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#12 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8d470, kwds=0x0) at Objects/typeobject.c:915
#13 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#14 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffce58, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#15 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#16 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01fc420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9dba0, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1, 
    defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x7ffff7ea3c30, qualname=0x7fffefd8d2b8) at Python/ceval.c:4128
#17 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea832f0) at Python/ceval.c:4939
#18 call_function (pp_stack=0x7fffffffd0f8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#19 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#20 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1b930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, 
    kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4128
#21 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, 
    kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#22 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#23 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffd450, locals=0x7ffff7f5cf30, globals=0x7ffff7f5cf30, filename=0x7ffff7ea3830, mod=0x683f58) at Python/pythonrun.c:980
#24 PyRun_FileExFlags (fp=0x64cc30, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5cf30, locals=0x7ffff7f5cf30, closeit=<optimized out>, flags=0x7fffffffd450) at Python/pythonrun.c:933
#25 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x64cc30, filename=<optimized out>, closeit=1, flags=0x7fffffffd450) at Python/pythonrun.c:396
#26 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffd450, filename=0x603310 L<error reading variable>, fp=0x64cc30) at Modules/main.c:338
#27 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#28 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69

In this case, the error python: error: CPU count per node can not be satisfied is still shown but it segfaults afterwards anyway.

slurm 21.08.2 compatibility

Hi, just wanted to note this here. Following error during compilation:

libtool: compile: gcc -DHAVE_CONFIG_H -I. -I.. -I//include -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O0 -pthread -MT libdrmaa_la-job.lo -MD -MP -MF .deps/libdrmaa_la-job.Tpo -c job.c -fPIC -DPIC -o .libs/libdrmaa_la-job.o
job.c: In function ‘slurmdrmaa_job_control’:
job.c:104:8: error: too few arguments to function ‘slurm_kill_job2’
if(slurm_kill_job2(self->job_id, SIGKILL, 0) == -1) {
^~~~~~~~~~~~~~~
In file included from ../slurm_drmaa/job.h:29,
from job.c:31:
/usr/include/slurm/slurm.h:3531:12: note: declared here
extern int slurm_kill_job2(const char *job_id, uint16_t signal, uint16_t flags,
^~~~~~~~~~~~~~~
job.c: In function ‘slurmdrmaa_job_update_status’:
job.c:364:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
self->exit_status = -1;
~~~~~~~~~~~~~~~~~~^~~~
job.c:365:5: note: here
case JOB_FAILED:
^~~~

I added a NULL argument to the slurm_kill_job2 call and it compiled.

SLURM in Java

Hi,
I've compiled the latest slurm-drmaa and tried to use the drmaa.jar from SGE so I can access the methods in Java, pointing to the compiled version of libdrmaa.so, but I'm getting an Exception in thread "main" java.lang.UnsatisfiedLinkError: com.sun.grid.drmaa.SessionImpl.nativeInit(Ljava/lang/String;)V
So I'm using:
export DRMAA_LIBRARY_PATH=/home/vargasfr/slurm_drmaa/slurm-drmaa-1.1.2/install/lib/libdrmaa.so
export CLASSPATH=/home/vargasfr/slurm_drmaa/lib/drmaa.jar:/home/vargasfr/slurm_drmaa/slurm-drmaa-1.1.2/install/lib/libdrmaa.so:/home/vargasfr/slurm_drmaa
export LD_LIBRARY_PATH=/home/vargasfr/slurm_drmaa/slurm-drmaa-1.1.2/install/lib

From my Java file I'm importing org.ggf.drmaa.* (which comes from drmaa.jar) and calling:
SessionFactory factory = SessionFactory.getFactory();

Any idea how I can make this work with Slurm?

all errors reported as FSD_ERRNO_INTERNAL_ERROR

Hi,

Thanks for merging my previous fix. This one is in a similar vein.

On line 134 of slurm_drmaa/job.c, any problem when updating the job status is reported back as FSD_ERRNO_INTERNAL_ERROR. The specific issue here is that the caller would like to know whether the error is intermittent (e.g., a network time-out), and thus the job status can possibly be queried successfully in a few minutes, or whether the problem is terminal and the job is dead. I've prepared a complementary patch to Snakemake to handle FSD_ERRNO_DRM_COMMUNICATION_FAILURE as an intermittent fault and to keep polling the job.

Really, the DRMAA library should make a better attempt to convert SLURM errors to meaningful DRMAA error codes, but this is a start.

Let me know if you'd prefer me to submit this stuff elsewhere. It's hard to see who is maintaining the definitive slurm-drmaa libs just now.

*** tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c.orig	2016-11-04 15:09:49.000000000 +0000
--- tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c	2017-06-09 15:05:38.000000000 +0100
***************
*** 131,138 ****
  
  			if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
  				self->on_missing(self);
! 			} else {
! 				fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(slurm_get_errno()), self->job_id);
  			}
  		}
  		if (job_info) {
--- 131,150 ----
  
  			if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
  				self->on_missing(self);
! 			} else
!                 // We should detect the error corresponding to "Socket timed out" and report
!                 // it explicitly as FSD_ERRNO_TIMEOUT or maybe FSD_ERRNO_DRM_COMMUNICATION_FAILURE
!                 // ( I'm not sure if FSD_ERRNO_TIMEOUT is the same as DRMAA_ERRNO_EXIT_TIMEOUT,
!                 //   which simply indicates the job is still running?? Maybe we should try it and see. )
!                 // To see what _slurm_errno corresponds to which message let's look at
!                 // 'slurm_strerror' in the slurm source code...
!                 //   https://github.com/SchedMD/slurm/blob/master/src/common/slurm_errno.c
!             if ( _slurm_errno == SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT ||
!                  _slurm_errno == SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR
!                ) {
!                 fsd_exc_raise_fmt(FSD_ERRNO_DRM_COMMUNICATION_FAILURE,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
!             } else {
! 				fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
  			}
  		}
  		if (job_info) {

Cheers,

TIM

Problems during compilation

Hi,
I'm trying to compile slurm-drmaa-1.2.0-dev.deca826 for SLURM 17.11.5 and I'm getting errors during the "configure" step.
I'm executing "./configure --prefix=/tmp/test-slurm-drmaa --with-slurm-inc=/soft/slurm-17.11.5/include/slurm/ --with-slurm-lib=/soft/slurm-17.11.5/lib/" (my SLURM installation is in an NFS folder called "/soft") and the error is "SLURM libraries/headers not found; add --with-slurm-inc and --with-slurm-lib with appropriate locations."

Could you help me?

Thanks.

segfault with native specification of limits using human qualifiers

When specifying --tmp, --mem-per-cpu, and other numerical parameters, if a size qualifier is used (1M, 1G, ...), the code will segfault.

While this is supported on the command line via srun/squeue, the drmaa code assumes a plain number is passed and will segfault on non-numeric input.

Slurm-drmaa 1.1.2 binary (RPM) installation issue with slurm-19.05.8-1

I'm trying to install slurm-drmaa version 1.1.2 along with Slurm version 19.05.8-1 on a CentOS 7.9 machine. It throws a package dependency issue and requires the libslurm.so.31()(64bit) and libslurmdb.so.31()(64bit) libraries, whereas slurm-19.05.8-1 provides libslurm.so.34.

[root@test-vm111 ~]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

Installed slurm packages..
[root@test-vm111 yum.repos.d]# rpm -qa | grep ^slurm | sort
slurm-19.05.8-1.el7.x86_64
slurm-contribs-19.05.8-1.el7.x86_64
slurm-devel-19.05.8-1.el7.x86_64
slurm-example-configs-19.05.8-1.el7.x86_64
slurm-libpmi-19.05.8-1.el7.x86_64
slurm-openlava-19.05.8-1.el7.x86_64
slurm-pam_slurm-19.05.8-1.el7.x86_64
slurm-perlapi-19.05.8-1.el7.x86_64
slurm-slurmctld-19.05.8-1.el7.x86_64
slurm-slurmd-19.05.8-1.el7.x86_64
slurm-slurmdbd-19.05.8-1.el7.x86_64
slurm-torque-19.05.8-1.el7.x86_64

Issue with slurm-drmaa 1.1.2 installation
[root@test-vm111 yum.repos.d]# yum install slurm-drmaa
Loaded plugins: langpacks, nvidia
Resolving Dependencies
--> Running transaction check
---> Package slurm-drmaa.x86_64 0:1.1.2-1.el7 will be installed
--> Processing Dependency: libslurmdb.so.31()(64bit) for package: slurm-drmaa-1.1.2-1.el7.x86_64
--> Processing Dependency: libslurm.so.31()(64bit) for package: slurm-drmaa-1.1.2-1.el7.x86_64
--> Finished Dependency Resolution
Error: Package: slurm-drmaa-1.1.2-1.el7.x86_64 (slurm-19.05)
Requires: libslurmdb.so.31()(64bit)
Error: Package: slurm-drmaa-1.1.2-1.el7.x86_64 (slurm-19.05)
Requires: libslurm.so.31()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

libslurm.so libraries available on the machine:
[root@test-vm111 ~]# locate libslurm.so
/usr/lib64/libslurm.so
/usr/lib64/libslurm.so.34
/usr/lib64/libslurm.so.34.0.0

[root@test-vm111 yum.repos.d]# rpm -qf /usr/lib64/libslurm.so.34
slurm-19.05.8-1.el7.x86_64

Can anyone look into this issue?

Implement DRMAA v2.2

Currently only Univa implements 2.2, but it brings a lot of nice improvements, and it'd help adoption if it were implemented for Slurm as well. Specification docs here.

Implementing 2.2 would be a big undertaking and as slurm-drmaa is only a side-project for me (and I'm not a C programmer by trade) I'd say it's fairly unlikely that anything will get done on this, but it's a good goal.

Slurm 20.11.0 "Slurm libraries/headers not found"

./configure fails

# ./configure
[...]
checking for SLURM library dir... /usr/lib
checking for slurmdb_users_get in -lslurm... yes
Using slurm libraries -lslurm 
checking for usable SLURM libraries/headers... *** The SLURM test program failed to link or run. See the file config.log
*** for the exact error that occured.
no
configure: error: 
Slurm libraries/headers not found;
add --with-slurm-inc and --with-slurm-lib with appropriate locations.

The relevant section in config.log

configure:13429: ./conftest
conftest: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
conftest: error: fetch_config: DNS SRV lookup failed
conftest: error: _establish_config_source: failed to fetch config
conftest: fatal: Could not establish a configuration source
configure:13429: $? = 1
configure: program exited with status 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "DRMAA for Slurm"
| #define PACKAGE_TARNAME "slurm-drmaa"
| #define PACKAGE_VERSION "1.2.0-dev.71cd5be"
| #define PACKAGE_STRING "DRMAA for Slurm 1.2.0-dev.71cd5be"
| #define PACKAGE_BUGREPORT "[email protected]"
| #define PACKAGE_URL ""
| #define PACKAGE "slurm-drmaa"
| #define VERSION "1.2.0-dev.71cd5be"
| #define STDC_HEADERS 1
| #define HAVE_SYS_TYPES_H 1
| #define HAVE_SYS_STAT_H 1
| #define HAVE_STDLIB_H 1
| #define HAVE_STRING_H 1
| #define HAVE_MEMORY_H 1
| #define HAVE_STRINGS_H 1
| #define HAVE_INTTYPES_H 1
| #define HAVE_STDINT_H 1
| #define HAVE_UNISTD_H 1
| #define HAVE_DLFCN_H 1
| #define LT_OBJDIR ".libs/"
| #define HAVE_PTHREAD_PRIO_INHERIT 1
| #define HAVE_LIBSLURM 1
| /* end confdefs.h.  */
|  #include "slurm/slurm.h"
| int
| main ()
| {
|  job_desc_msg_t job_req; /*at least check for declared structs */
|                  return 0;
| 
|   ;
|   return 0;
| }
configure:13444: result: no
configure:13450: error: 

It looks like this is the culprit:

conftest: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
conftest: error: fetch_config: DNS SRV lookup failed
conftest: error: _establish_config_source: failed to fetch config
conftest: fatal: Could not establish a configuration source

It worked with the previous version that I used (20.02).

It looks like there is now a library initialization step (slurm_init), which in turn requires a working Slurm configuration and installation in order to run.

DRMAA on Galaxy Repo does not work

https://depot.galaxyproject.org/yum/package/slurm/18.08/7/x86_64/

The DRMAA rpm fails to install because it was not built against this Slurm version:
the slurm rpm in the repo provides libslurm.so.33, while the drmaa rpm in the repo requires libslurm.so.31.

Error: Package: slurm-drmaa-1.1.0-1.el7.x86_64 (slurm-18.08)
Requires: libslurm.so.31()(64bit)
Error: Package: slurm-drmaa-1.1.0-1.el7.x86_64 (slurm-18.08)
Requires: libslurmdb.so.31()(64bit)
You could try using --skip-broken to work around the problem

Do you have access to this?

segfault when submitting bulk jobs

After:

export DRMAA_LIBRARY_PATH=~/test_drmaa/slurm-drmaa-1.2.0-dev.83fc288/slurm_drmaa/.libs/libdrmaa.so

When using libdrmaa via Python:

#!/usr/bin/env python
from __future__ import print_function
import os
import drmaa

LOGS = "logs/"
if not os.path.isdir(LOGS):
    os.mkdir(LOGS)

s = drmaa.Session()
s.initialize()
print("Supported contact strings:", s.contact)
print("Supported DRM systems:", s.drmsInfo)
print("Supported DRMAA implementations:", s.drmaaImplementation)
print("Version", s.version)

jt = s.createJobTemplate()
jt.remoteCommand = "/usr/bin/echo"
jt.args = ["Hello", "world"]
jt.jobName = "testdrmaa"
jt.jobEnvironment = os.environ.copy()
jt.workingDirectory = os.getcwd()

jt.outputPath = ":" + os.path.join(LOGS, "job-%A_%a.out")
jt.errorPath = ":" + os.path.join(LOGS, "job-%A_%a.err")
jt.nativeSpecification = "--cpus-per-task=2 --nodes=1 --mem-per-cpu=50 --partition=htc --tmp=100"

print("Submitting", jt.remoteCommand, "with", jt.args, "and logs to", jt.outputPath)
ids = s.runBulkJobs(jt, beginIndex=1, endIndex=2, step=1)
print("Job submitted with ids", ids)

s.deleteJobTemplate(jt)

The above code fails when calling runBulkJobs

Stack trace of the above script:

Program received signal SIGSEGV, Segmentation fault.
strlcpy (dest=dest@entry=0x7a9640 "9829091", src=0x0, size=size@entry=1024) at compat.c:50
50              while( *src  &&  --size > 0 )
(gdb) bt
#0  strlcpy (dest=dest@entry=0x7a9640 "9829091", src=0x0, size=size@entry=1024) at compat.c:50
#1  0x00007fffed772fac in drmaa_get_next_job_id (values=0x7ac5c0, value=0x7a9640 "9829091", value_len=1024) at drmaa_base.c:297
#2  0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#3  0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed772e90 <drmaa_get_next_job_id>, rvalue=<optimized out>, avalue=0x7fffffffc6c0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#4  0x00007fffeffe483c in _call_function_pointer (argcount=3, resmem=0x7fffffffc6f0, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffc6c0, pProc=0x7fffed772e90 <drmaa_get_next_job_id>, flags=4353) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#5  _ctypes_callproc (pProc=0x7fffed772e90 <drmaa_get_next_job_id>, argtuple=0x7fffffffc7e0, flags=4353, argtypes=<optimized out>, restype=0x7ffff0212f28, checker=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#6  0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#7  0x00007ffff793fade in _PyObject_FastCallDict (func=0x7fffeea655c0, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#8  0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffcb18, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#9  0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#10 0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd90200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#11 0x00007ffff7978f3e in listextend (self=0x7fffeea79d48, b=<optimized out>) at Objects/listobject.c:857
#12 0x00007ffff7979398 in list_init (self=0x7fffeea79d48, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#13 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8d470, kwds=0x0) at Objects/typeobject.c:915
#14 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#15 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffce58, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#16 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#17 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01fc420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9dba0, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x7ffff7ea3c30, qualname=0x7fffefd8d2b8) at Python/ceval.c:4128
#18 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea8c2f0) at Python/ceval.c:4939
#19 call_function (pp_stack=0x7fffffffd0f8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#20 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#21 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1b930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4128
#22 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#23 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#24 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffd450, locals=0x7ffff7f5cf30, globals=0x7ffff7f5cf30, filename=0x7ffff7ea3830, mod=0x683f58) at Python/pythonrun.c:980
#25 PyRun_FileExFlags (fp=0x64cc30, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5cf30, locals=0x7ffff7f5cf30, closeit=<optimized out>, flags=0x7fffffffd450) at Python/pythonrun.c:933
#26 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x64cc30, filename=<optimized out>, closeit=1, flags=0x7fffffffd450) at Python/pythonrun.c:396
#27 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffd450, filename=0x603310 L<error reading variable>, fp=0x64cc30) at Modules/main.c:338
#28 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#29 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69

The above code runs fine with a libdrmaa built from https://github.com/ljyanesm/slurm-drmaa

Unable to compile slurm-drmaa on CentOS 7.5 with SLURM 18.08

Dear Developers,
We are unable to compile slurm-drmaa 1.0.7 against this version of Slurm.
Is the code compatible with this version of Slurm?
Please find the errors below:

util.c:175:19: error: 'job_desc_msg_t' has no member named 'gres'
  fsd_free(job_desc->gres);

util.c:322:12: error: 'job_desc_msg_t' has no member named 'gres'
    job_desc->gres = fsd_strdup(value);

Please advise.
Thank you,
Iten 
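
For anyone hitting this: Slurm 18.08 removed the gres member of job_desc_msg_t (replaced, I believe, by tres_per_node), so a version guard along these lines could let util.c build against both old and new headers. This is only a sketch; set_gres() is an illustrative name (the real code assigns the field directly with fsd_strdup()), and it assumes SLURM_VERSION_NUMBER / SLURM_VERSION_NUM come from slurm.h:

#include <slurm/slurm.h>
#include <string.h>

/* Sketch of a compatibility shim for slurm_drmaa/util.c, assuming
 * Slurm 18.08 replaced job_desc_msg_t.gres with tres_per_node. */
static void set_gres(job_desc_msg_t *job_desc, const char *value)
{
#if SLURM_VERSION_NUMBER >= SLURM_VERSION_NUM(18, 8, 0)
	job_desc->tres_per_node = strdup(value);
#else
	job_desc->gres = strdup(value);
#endif
}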

Single quotes in the expanded args should be escaped

Hello,
thank you for this library.
I noticed a small bug when using single quotes in arguments.
I believe that single quotes in the expanded args should be escaped. The relevant line is:

temp_script = fsd_asprintf("%s '%s'", temp_script_old, arg_expanded);

In my case I'm running

bash, "-c", "source /tmp/init.sh ; /bin/sort '-s' '-k' '3' '-t' '\t'"
(note that I use , to separate the arguments; in total, two arguments are passed to bash)

which will then be translated into

d #15300 [     0.06]  * # Script:
d #15300 [     0.06]  | #!/bin/bash
d #15300 [     0.06]  | bash '-c' 'source /tmp/init.sh ; /bin/sort '-s' '-k' '3' '-t' '     ''

However this fails with:

/bin/sort: option requires an argument -- 't'
Try '/bin/sort --help' for more information.

I believe what happens internally is that bash -c treats source /tmp/init.sh ; /bin/sort as the command string and, as per the documentation, everything after it becomes a positional argument:

If the -c option is present, then commands are read from string. If there are arguments after the string, they are assigned to the positional parameters, starting with $0. 

To solve this, I believe the single quotes within the expanded args should be escaped (e.g. via https://creativeandcritical.net/str-replace-c).
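
A minimal sketch of such an escape helper (the standard shell trick is to close the quote, emit an escaped quote, and reopen, i.e. rewrite ' as '\''; the function name is mine):

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: make a string safe for embedding inside a
 * single-quoted shell word by rewriting each ' as '\''.
 * Caller frees the result. */
static char *escape_single_quotes(const char *s)
{
	size_t extra = 0;
	const char *p;

	for (p = s; *p; p++)
		if (*p == '\'')
			extra += 3;               /* each ' grows into 4 chars */

	char *out = malloc(strlen(s) + extra + 1);
	if (out == NULL)
		return NULL;

	char *q = out;
	for (p = s; *p; p++) {
		if (*p == '\'') {
			memcpy(q, "'\\''", 4);
			q += 4;
		} else {
			*q++ = *p;
		}
	}
	*q = '\0';
	return out;
}

The expansion site would then become something like temp_script = fsd_asprintf("%s '%s'", temp_script_old, escaped), with the escaped copy freed after use.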

Consider this minimal example:

#!/usr/bin/env python
import drmaa
import os

def main():
	error_log = os.path.join(os.getcwd(), 'error.log')
	with drmaa.Session() as s:
		print('Creating job template')
		jt = s.createJobTemplate()
		jt.remoteCommand = "/bin/bash"
		jt.args = [ '-c', "/bin/sort '-s' '-k' '3' '-t' '\t'" ]
		
		jt.errorPath = ":" + error_log
		
		jobid = s.runJob(jt)
		print('Your job has been submitted with ID {0}'.format(jobid))
		
		retval = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
		print('Job: {0} finished with status {1}'.format(retval.jobId, retval.exitStatus))
		
		with open(error_log, 'r') as fin:
			print(fin.read())
		
		print('Cleaning up')
		s.deleteJobTemplate(jt)
		os.remove(error_log)

if __name__=='__main__':
	main()

results in

$ python foo.py
Creating job template
Your job has been submitted with ID 7990
Job: 7990 finished with status 2
/bin/sort: option requires an argument -- 't'
Try '/bin/sort --help' for more information.

Cleaning up

slurm-20.11.1 compatibility.

Hi,

We are using slurm-drmaa for our Galaxy instance in our HPC environment. We recently updated our Slurm to slurm-20.11.1; however, even the latest version of slurm-drmaa is no longer compatible with the new Slurm:

libtool: compile: gcc -DHAVE_CONFIG_H -I. -I.. -I/opt/software/slurm/include/ -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/opt/software/slurm-drmaa/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O2 -pthread -MT libdrmaa_la-drmaa.lo -MD -MP -MF .deps/libdrmaa_la-drmaa.Tpo -c drmaa.c -fPIC -DPIC -o .libs/libdrmaa_la-drmaa.o
drmaa.c: In function ‘slurmdrmaa_get_DRM_system’:
drmaa.c:65:3: error: unknown type name ‘slurm_ctl_conf_t’; did you mean ‘slurm_conf_t’?
65 | slurm_ctl_conf_t * conf_info_msg_ptr = NULL;
| ^~~~~~~~~~~~~~~~
| slurm_conf_t
drmaa.c:66:44: warning: passing argument 2 of ‘slurm_load_ctl_conf’ from incompatible pointer type [-Wincompatible-pointer-types]
66 | if ( slurm_load_ctl_conf ((time_t) NULL, &conf_info_msg_ptr ) == -1 )
| ^~~~~~~~~~~~~~~~~~
| |
| int **
In file included from drmaa.c:31:
/opt/software/slurm/include/slurm/slurm.h:3715:47: note: expected ‘slurm_conf_t **’ {aka ‘struct **’} but argument is of type ‘int **’
3715 | slurm_conf_t **slurm_ctl_conf_ptr);
| ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
drmaa.c:73:101: error: request for member ‘version’ in something not a structure or union
73 | fsd_snprintf(NULL, slurmdrmaa_version, sizeof(slurmdrmaa_version)-1,"SLURM %s", conf_info_msg_ptr->version);
| ^~
drmaa.c:74:25: warning: passing argument 1 of ‘slurm_free_ctl_conf’ from incompatible pointer type [-Wincompatible-pointer-types]
74 | slurm_free_ctl_conf (conf_info_msg_ptr);
| ^~~~~~~~~~~~~~~~~
| |
| int *
In file included from drmaa.c:31:
/opt/software/slurm/include/slurm/slurm.h:3722:47: note: expected ‘slurm_conf_t *’ {aka ‘struct *’} but argument is of type ‘int *’
3722 | extern void slurm_free_ctl_conf(slurm_conf_t *slurm_ctl_conf_ptr);
| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~

Is there any plan to update slurm-drmaa?

Thanks,
Cheers,
Ata
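
Until slurm-drmaa itself is updated, a small shim might bridge the rename. This is only a sketch, assuming 20.11 simply renamed slurm_ctl_conf_t to slurm_conf_t and that SLURM_VERSION_NUMBER / SLURM_VERSION_NUM are available from slurm.h:

#include <slurm/slurm.h>

/* Paper over the 20.11 rename so the rest of drmaa.c can use one
 * spelling throughout. */
#if SLURM_VERSION_NUMBER < SLURM_VERSION_NUM(20, 11, 0)
typedef slurm_ctl_conf_t slurm_conf_t;
#endif

slurmdrmaa_get_DRM_system() could then declare slurm_conf_t *conf_info_msg_ptr = NULL; and the slurm_load_ctl_conf()/slurm_free_ctl_conf() calls would type-check against both old and new headers.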

Array jobs produce IDs considered invalid

Using 44cc67e:

t #6279 [     5.70] <- slurmdrmaa_parse_native
d #6279 [     5.70]  * job 16484204 submitted
t #6279 [     5.71] -> fsd_job_new(16484204_1)
t #6279 [     5.71] <- fsd_job_new=0xc163b0: ref_cnt=1 [lock 16484204_1]
t #6279 [     5.71] -> fsd_job_set_add(job=0xc163b0, job_id=16484204_1)
t #6279 [     5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [     5.71] -> fsd_job_release(0xc163b0={job_id=16484204_1, ref_cnt=2}) [unlock 16484204_1]
t #6279 [     5.71] <- fsd_job_release
t #6279 [     5.71] -> fsd_job_new(16484204_2)
t #6279 [     5.71] <- fsd_job_new=0xd6acf0: ref_cnt=1 [lock 16484204_2]
t #6279 [     5.71] -> fsd_job_set_add(job=0xd6acf0, job_id=16484204_2)
t #6279 [     5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [     5.71] -> fsd_job_release(0xd6acf0={job_id=16484204_2, ref_cnt=2}) [unlock 16484204_2]
t #6279 [     5.71] <- fsd_job_release
t #6279 [     5.71] -> fsd_job_new(16484204_3)
t #6279 [     5.71] <- fsd_job_new=0xdffed0: ref_cnt=1 [lock 16484204_3]
t #6279 [     5.71] -> fsd_job_set_add(job=0xdffed0, job_id=16484204_3)
t #6279 [     5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [     5.71] -> fsd_job_release(0xdffed0={job_id=16484204_3, ref_cnt=2}) [unlock 16484204_3]
t #6279 [     5.71] <- fsd_job_release
t #6279 [     5.71] -> fsd_job_new(16484204_4)
t #6279 [     5.71] <- fsd_job_new=0xe55a70: ref_cnt=1 [lock 16484204_4]
t #6279 [     5.71] -> fsd_job_set_add(job=0xe55a70, job_id=16484204_4)
t #6279 [     5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [     5.71] -> fsd_job_release(0xe55a70={job_id=16484204_4, ref_cnt=2}) [unlock 16484204_4]
t #6279 [     5.71] <- fsd_job_release
t #6279 [     5.71] -> slurmdrmaa_free_job_desc
t #6279 [     5.71] <- slurmdrmaa_free_job_desc
t #6279 [     5.71] <- drmaa_run_bulk_jobs =0
d #6279 [     5.71]  * fsd_exc_new(1006,Vector have no more elements.,0)
t #6279 [     5.71] <- drmaa_get_next_job_id=25: Vector have no more elements.
t #6279 [     5.71] -> drmaa_delete_job_template(0xe24e30)
t #6279 [     5.71] <- drmaa_delete_job_template =0
t #6279 [    70.34] -> drmaa_job_ps(job_id=16484204_2)
t #6279 [    70.34] -> fsd_job_set_get(job_id=16484204_2)
t #6279 [    70.34] <- fsd_job_set_get(job_id=16484204_2) =0xd6acf0: ref_cnt=2 [lock 16484204_2]
d #6279 [    70.34]  *  job->last_update_time = 0
d #6279 [    70.34]  * updating status of job: 16484204_2
t #6279 [    70.34] -> slurmdrmaa_job_update_status({job_id=16484204_2})
t #6279 [    70.34] -> slurmdrmaa_set_job_id({job_id=16484204_2})
t #6279 [    70.34] <- slurmdrmaa_set_job_id; job_id=16484204_2
E #6279 [    70.34]  * fsd_exc_new(1003,not an number: 16484204_2,1)
t #6279 [    70.34] -> slurmdrmaa_unset_job_id({job_id=(null)})
t #6279 [    70.34] <- slurmdrmaa_unset_job_id; job_id=16484204_2
t #6279 [    70.34] -> fsd_job_release(0xd6acf0={job_id=16484204_2, ref_cnt=2}) [unlock 16484204_2]
t #6279 [    70.34] <- fsd_job_release
t #6279 [    70.34] <- drmaa_job_ps=4: not an number: 16484204_2

Which causes DRMAA to drop these jobs.

The code seems to assume that job IDs are purely numeric, but SLURM array jobs use IDs of the form <jobid>_<arrayid>, which in this case aren't handled properly.
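
A minimal sketch of ID parsing that tolerates the array form (the function name and out-parameters are my own illustration; the real fix would presumably live near slurmdrmaa_set_job_id):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: accept both "16484204" and "16484204_2",
 * instead of rejecting anything that isn't a bare number. */
static bool parse_slurm_job_id(const char *s, uint32_t *job_id,
                               uint32_t *task_id, bool *is_array)
{
	char *end = NULL;
	unsigned long v = strtoul(s, &end, 10);

	if (end == s)
		return false;                 /* no leading number */
	*job_id = (uint32_t)v;

	if (*end == '\0') {                   /* plain job id */
		*is_array = false;
		return true;
	}
	if (*end != '_')
		return false;

	const char *task = end + 1;
	v = strtoul(task, &end, 10);
	if (end == task || *end != '\0')
		return false;                 /* malformed array part */
	*task_id = (uint32_t)v;
	*is_array = true;
	return true;
}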

I also noticed the line t #6279 [     5.71] <- drmaa_run_bulk_jobs =0. Shouldn't it be =1 in this case?

Is native Ubuntu 20.04 package useful?

Hi,

I tried the native slurm-drmaa package, but this led to Galaxy handlers failing to restart; the handler logs complained about slurm-drmaa.

sudo apt install slurm-drmaa1
Is yours (I'll use the Ubuntu Launchpad repo, thanks) more likely to be compatible with Ubuntu 20.04?
Galaxy version is 20.05.

Thanks

'Invalid Trackable RESource (TRES) specification' error

Hi,

I'm using slurm-drmaa to submit a job and I get the error below:

d #89f27 [     0.00]  * # Native specification:  --time=1:00:00 --ntasks=1 --gres=gpu:1 --cpus-per-task=2 --nodes=1 --account=xxx@yyy --partition=mypartition

t #89f27 [     0.00] -> slurmdrmaa_parse_native

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr

d #89f27 [     0.00]  * # time_limit = 1:00:00

t #89f27 [     0.00] -> slurmdrmaa_datetime_parse(1:00:00)

d #89f27 [     0.00]  * parsed: 0000-00-00 01:00:00 +00:00:00 [---hms-]

t #89f27 [     0.00] <- slurmdrmaa_datetime_parse(1:00:00) =60 minutes
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # ntasks = 1
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # gres = gpu:1
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # cpus_per_task = 2
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * nodes: 1 ->
d #89f27 [     0.00]  * # min_nodes = 1
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # account = xxx@yyy
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

t #89f27 [     0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [     0.00]  * # partition = allgpus
t #89f27 [     0.00] <- slurmdrmaa_parse_additional_attr

d #89f27 [     0.00]  * finalizing job constraints
d #89f27 [     0.00]  * set min_cpus to ntasks*cpus_per_task: 2
t #89f27 [     0.00] <- slurmdrmaa_parse_native
E #89f27 [     4.24]  * fsd_exc_new(1016,slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification,1)

t #89f27 [     4.24] -> slurmdrmaa_free_job_desc
t #89f27 [     4.24] <- slurmdrmaa_free_job_desc

t #89f27 [     4.24] <- drmaa_run_job=17: slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification

Traceback (most recent call last):
  ..
  File "/.../python3.6/site-packages/drmaa/session.py", line 314, in runJob
    c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
  File "/.../python3.6/site-packages/drmaa/helpers.py", line 302, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/.../site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
drmaa.errors.DeniedByDrmException: code 17: slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification

The same job without --gres=gpu:1 works fine.
slurm-drmaa version is 1.1.3 and Slurm version is 21.08.6. The OS is RHEL 8.4.

Any hint would be greatly appreciated,
Kimchi

Running CLI commands without options segfaults.

I'm testing slurm-drmaa in a container, but even when running outside of a container, whether building from source or installing via the Galaxy rpm, the binary segfaults every time I run it.

Am I missing something?

Error is:
[root@f8ddc11bc51e /]# DRMAA_LIBRARY_PATH=/usr/lib64/libdrmaa.so /usr/bin/drmaa-run
Segmentation fault (core dumped)

Backtrace shows:

[root@f8ddc11bc51e /]# export DRMAA_LIBRARY_PATH=/usr/lib64/libdrmaa.so
[root@f8ddc11bc51e /]# gdb /usr/bin/drmaa-run
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/bin/drmaa-run...Reading symbols from /usr/lib/debug/usr/bin/drmaa-run.debug...done.
done.
(gdb) run
Starting program: /usr/bin/drmaa-run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00000000004129b6 in parse_args (argc=0, argv=0x7fffffffe7a0) at drmaa_run.c:254
254 while (argc >= 0 && argv[0][0] == '-')
(gdb) backtrace
#0 0x00000000004129b6 in parse_args (argc=0, argv=0x7fffffffe7a0) at drmaa_run.c:254
#1 0x00000000004120df in main (argc=1, argv=0x7fffffffe798) at drmaa_run.c:122
(gdb)
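
The crash is consistent with the quoted condition: when argc reaches 0, argc >= 0 still holds and argv[0][0] is dereferenced anyway. A minimal sketch of the guard it presumably needs (the loop body here is illustrative, not the actual drmaa_run.c logic):

#include <stdio.h>

/* Sketch: stop scanning option flags once the arguments run out.
 * The reported code tests "argc >= 0", which admits argc == 0 and
 * then dereferences argv[0]; "argc > 0" avoids that. */
static void parse_args(int argc, char **argv)
{
	while (argc > 0 && argv[0][0] == '-') {
		printf("option: %s\n", argv[0]);
		argc--;
		argv++;
	}
}

int main(int argc, char **argv)
{
	parse_args(argc - 1, argv + 1);   /* skip program name; safe with no args */
	return 0;
}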

My test setup is as follows:

Dockerfile:
$ cat Dockerfile
FROM centos:7

RUN (cd /lib/systemd/system/sysinit.target.wants/; for i in *; do [ $i == systemd-tmpfiles-setup.service ] || rm -f $i; done); \
    rm -f /lib/systemd/system/multi-user.target.wants/*; \
    rm -f /etc/systemd/system/*.wants/*; \
    rm -f /lib/systemd/system/local-fs.target.wants/*; \
    rm -f /lib/systemd/system/sockets.target.wants/*udev*; \
    rm -f /lib/systemd/system/sockets.target.wants/*initctl*; \
    rm -f /lib/systemd/system/basic.target.wants/*; \
    rm -f /lib/systemd/system/anaconda.target.wants/*;

VOLUME [ "/sys/fs/cgroup" ]

RUN yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
RUN yum-config-manager --add-repo https://depot.galaxyproject.org/yum/galaxy.repo

RUN yum -y install which strace gdb
RUN debuginfo-install -y libgcc-4.8.5-44.el7.x86_64
RUN debuginfo-install -y glibc-2.17-324.el7_9.x86_64
RUN yum -y install slurm-slurmd-20.11.8 slurm-devel-20.11.8

RUN yum clean all && yum -y update

RUN yum -y install slurm-drmaa slurm-drmaa-debuginfo

RUN yum clean all && \
    rm -rf /var/cache/yum

ENTRYPOINT ["/usr/sbin/init"]

Which results in a working container, and when I login to the container I'm running:

[root@f8ddc11bc51e /]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

[root@f8ddc11bc51e7 /]# rpm -qa slurm*
slurm-slurmd-20.11.8-1.el7.x86_64
slurm-drmaa-debuginfo-1.1.2-1.el7.x86_64
slurm-20.11.8-1.el7.x86_64
slurm-devel-20.11.8-1.el7.x86_64
slurm-drmaa-1.1.2-1.el7.x86_64

[root@f8ddc11bc51e /]# yum info slurm-drmaa-1.1.2-1.el7.x86_64
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile

 * base: mirrors.vinters.com
 * extras: mirrors.coreix.net
 * updates: mirrors.coreix.net
Installed Packages
Name        : slurm-drmaa
Arch        : x86_64
Version     : 1.1.2
Release     : 1.el7
Size        : 863 k
Repo        : installed
From repo   : galaxy
Summary     : DRMAA for Slurm
URL         : https://github.com/natefoo/slurm-drmaa
License     : GPLv3+
Description : DRMAA for Slurm is an implementation of Open Grid Forum DRMAA 1.0 (Distributed
            : Resource Management Application API) specification for submission and control of
            : jobs to SLURM. Using DRMAA, grid applications builders, portal developers and
            : ISVs can use the same high-level API to link their software with different
            : cluster/resource management systems.

Support `maxnodes` in `--nodes` option

As per the sbatch documentation, it is possible to request both a minimum and a maximum number of nodes with --nodes:

-N, --nodes=<minnodes[-maxnodes]>
Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes.

However, slurm-drmaa doesn't support this:

Traceback (most recent call last):
  File "/home/ndc/drmaa-venv/bin/sbatch-drmaa", line 20, in <module>
    jobid = s.runJob(jt)
  File "/home/ndc/drmaa-venv/lib/python2.7/site-packages/drmaa/session.py", line 314, in runJob
    c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
  File "/home/ndc/drmaa-venv/lib/python2.7/site-packages/drmaa/helpers.py", line 303, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/home/ndc/drmaa-venv/lib/python2.7/site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
drmaa.errors.InvalidArgumentException: code 4: not an number: 1-1
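
A minimal sketch of range-aware parsing, assuming the two values would feed the min_nodes/max_nodes fields of job_desc_msg_t (the function name is mine):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: split a "--nodes=min[-max]" value into min and
 * max node counts, rather than assuming the whole value is a single
 * number. */
static bool parse_nodes_spec(const char *value, uint32_t *min_nodes,
                             uint32_t *max_nodes)
{
	char *end = NULL;
	unsigned long lo = strtoul(value, &end, 10);

	if (end == value)
		return false;                 /* no leading number at all */
	*min_nodes = (uint32_t)lo;

	if (*end == '\0') {                   /* plain "--nodes=N" */
		*max_nodes = (uint32_t)lo;
		return true;
	}
	if (*end != '-')
		return false;

	const char *hi_str = end + 1;
	unsigned long hi = strtoul(hi_str, &end, 10);
	if (end == hi_str || *end != '\0' || hi < lo)
		return false;
	*max_nodes = (uint32_t)hi;
	return true;
}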

segfault with unresponsive/timeout socket

If libdrmaa fails to contact SLURM due to system overload, a temporary network interruption, or a timeout, a "socket error" is sometimes seen and is immediately followed by a segfault.
This is likely due to improper handling of the error.

Doesn't handle cross-platform exit statuses gracefully

Looking into this, it seems that the exit status returned from a child process is not handled gracefully in some instances.

Ideally the exit status should be decoded with the macros defined in sys/wait.h, but they are used only sparingly across drmaa.c. Notably, WIFEXITED is used, but WIFSIGNALED and WTERMSIG are not; instead of WIFSIGNALED there is a hand-rolled operation that may or may not match that macro on a particular architecture. Is there a reason why these macros were not used?
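
For reference, a minimal sketch of decoding a wait(2)-style status with the sys/wait.h macros:

#include <stdio.h>
#include <sys/wait.h>

/* Sketch: decode a child's raw status portably instead of with
 * hand-rolled bit masks. */
static void report_status(int status)
{
	if (WIFEXITED(status))
		printf("exited with code %d\n", WEXITSTATUS(status));
	else if (WIFSIGNALED(status))
		printf("terminated by signal %d\n", WTERMSIG(status));
	else
		printf("neither exited nor signaled: 0x%x\n", status);
}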

This is related to Issue #26. After replacing the hardcoded exit-status manipulation with the macros, I suddenly went from jobs reporting "unknown signal?!" to "wasTerminated", which was far more informative in terms of tracking down our issues (and ultimately solved our problem).

recipe for installing SLURM and friends on Debian 11

Hello, and apologies if this question is in the wrong place. We are upgrading from Debian 8 to Debian 11. I am a developer with no particular background in system administration or configuration. Several weeks into a cycle of install/google-error-message/install-something-else, I have installed munge, slurm, slurm-drmaa, and bats(!). slurmctld and slurmd are now running, but calls to drmaa_run_job() result in segfaults. (The surrounding C++ code is copied from our Debian 8 host, where drmaa_run_job() runs successfully.) I'll paste some debug output below, but what I'm really looking for is start-to-finish, step-by-step instructions for configuring, installing, and running whatever it takes to make SLURM usable on Debian 11. Thanks in advance.

Last few steps of debug output from drmaa_run_job:

d #597f9 [ 40.42] * finalizing job constraints
d #597f9 [ 40.42] * set min_cpus to ntasks: 1
t #597f9 [ 40.42] <- slurmdrmaa_parse_native
ORA-24550: signal received: [si_signo=11] [si_errno=0] [si_code=1] [si_int=0] [si_ptr=(nil)] [si_addr=0x1656]
kpedbg_dmp_stack()+394<-kpeDbgCrash()+204<-kpeDbgSignalHandler()+113<-skgesig_sigactionHandler()+258<-__sighandler()<-0x00007F06CFEC9B71<-slurm_pack_selected_step()+1286<-slurm_send_node_msg()+505<-slurm_send_recv_msg()+66<-slurm_send_recv_controller_msg()+315<-slurm_submit_batch_job()+119<-slurmdrmaa_session_run_bulk()+518<-slurmdrmaa_session_run_job()+179<-drmaa_run_job()+374<-_ZN19custom_code::submit_jobERKN5boost10filesystem4pathES4_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESC_bb()+4407<-0x0000000000000009<-0x7453705F6D00626F

runscript.sh: line 62: 366577 Segmentation fault

Stack trace from gdb:

           Stack trace of thread 366585:
            #0  0x00007f06d1914fe1 raise (libpthread.so.0 + 0x13fe1)
            #1  0x00007f06c254893f skgesigOSCrash (libclntsh.so + 0x267293f)
            #2  0x00007f06c2c63cdd kpeDbgSignalHandler (libclntsh.so + 0x2d8dcdd)
            #3  0x00007f06c2548c12 skgesig_sigactionHandler (libclntsh.so + 0x2672c12)
            #4  0x00007f06d1915140 __restore_rt (libpthread.so.0 + 0x14140)
            #5  0x00007f06cfec9b71 __strlen_avx2 (libc.so.6 + 0x15fb71)
            #6  0x00007f06d0467cb3 n/a (libslurm.so.36 + 0xf8cb3)
            #7  0x00007f06d047c646 n/a (libslurm.so.36 + 0x10d646)
            #8  0x00007f06d0456cf9 slurm_send_node_msg (libslurm.so.36 + 0xe7cf9)
            #9  0x00007f06d0457f72 slurm_send_recv_msg (libslurm.so.36 + 0xe8f72)
            #10 0x00007f06d04580db slurm_send_recv_controller_msg (libslurm.so.36 + 0xe90db)
            #11 0x00007f06d03b76e7 slurm_submit_batch_job (libslurm.so.36 + 0x486e7)
            #12 0x00007f06d05414f1 slurmdrmaa_session_run_bulk (libdrmaa.so.1 + 0xb4f1)
            #13 0x00007f06d054123b slurmdrmaa_session_run_job (libdrmaa.so.1 + 0xb23b)
            #14 0x00007f06d055c133 drmaa_run_job (libdrmaa.so.1 + 0x26133)
            #15 0x000056442ad0bf37 n/a (XXX + 0xd1f37)
            #16 0x0000000000000009 n/a (n/a + 0x0)

Any advice would be greatly appreciated.

slurm-drmaa.spec missing configure options

The slurm-drmaa.spec file supplied in the release bundle is missing support for the configure options outlined in:

https://github.com/natefoo/slurm-drmaa/blob/main/README.md

namely --with-slurm-inc, --with-slurm-lib, and --enable-debug.

The following patch enables this functionality for rpmbuild using rpm macros.


--- slurm-drmaa-1.1.3/slurm-drmaa.spec	2022-01-04 16:48:41.991693930 +0000
+++ slurm-drmaa.spec	2022-01-04 16:37:09.449491395 +0000
@@ -31,7 +31,10 @@
 RPM_OPT_FLAGS=`echo "$RPM_OPT_FLAGS" | sed -e 's/-O2 /-O0 /'`
 CFLAGS="$RPM_OPT_FLAGS"
 export CFLAGS
-%configure
+%configure \
+    %{?_with_slurm_lib:--with-slurm-lib=%{_with_slurm_lib}} \
+    %{?_with_slurm_inc:--with-slurm-inc=%{_with_slurm_inc}} \
+    %{?_enable_debug:--enable-debug}
 
 %install
 rm -rf "$RPM_BUILD_ROOT"

Compatibility with latest SLURM

Hi,

A CVE was recently published affecting all previous SLURM versions, meaning there are now only two supported SLURM versions (20.11.9 and 21.08.8). Is this package still viable for the latest release of each?

Thanks

Fails with Slurm 18.08.8

Testing with the drmaa-run utility, I find that slurm-drmaa fails with the 18.08.8 release of Slurm, but the exact same procedure works fine with 18.08.7. With 18.08.8 it fails at the job run step:

E #2af1 [     0.77]  * fsd_exc_new(1001,slurm_submit_batch_job error (-1): Unspecified error,1)
t #2af1 [     0.77] -> slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- drmaa_run_job=1: slurm_submit_batch_job error (-1): Unspecified error
F #2af1 [     0.77]  * Failed to submit a job: slurm_submit_batch_job error (-1): Unspecified error

Corresponding to this part of the drmaa-run code:

        /* run */
        if (api.run_job(jobid, sizeof(jobid) - 1, jt, errbuf, sizeof(errbuf) - 1) != DRMAA_ERRNO_SUCCESS) {
                fsd_log_fatal(("Failed to submit a job: %s ", errbuf));
                exit(2); /* TODO exception */
        }

Slurm 18.08.8 addresses a security vulnerability that exists in prior versions of Slurm.
