DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
License: GNU General Public License v3.0
Hello,
thank you for this library.
I noticed a small bug when using single quotes in arguments.
I believe that single quotes in the expanded args are not escaped (see line 615 in 665d5b5).
In my case I'm running
bash, "-c", "source /tmp/init.sh ; /bin/sort '-s' '-k' '3' '-t' '\t'"
(note that I use "," to separate the arguments; in total two arguments are passed to bash)
which will then be translated into
d #15300 [ 0.06] * # Script:
d #15300 [ 0.06] | #!/bin/bash
d #15300 [ 0.06] | bash '-c' 'source /tmp/init.sh ; /bin/sort '-s' '-k' '3' '-t' ' ''
However this fails with:
/bin/sort: option requires an argument -- 't'
Try '/bin/sort --help' for more information.
I believe what happens internally is that bash -c treats
source /tmp/init.sh ; /bin/sort
as the first string, and as per the documentation everything after it becomes an argument:
If the -c option is present, then commands are read from string. If there are arguments after the string, they are assigned to the positional parameters, starting with $0.
To solve this, I believe that the single quotes within the expanded args should be escaped (e.g. via https://creativeandcritical.net/str-replace-c).
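As a sketch of the proposed escaping, the usual POSIX-shell trick is to close the single-quoted string, emit an escaped quote, and reopen it. `shell_quote_single` below is a hypothetical helper written in Python for illustration only; the actual fix would live in the library's C job-script writer (e.g. using the str-replace routine linked above):

```python
def shell_quote_single(arg):
    """Quote arg for a POSIX shell: wrap it in single quotes and
    replace each embedded single quote with the sequence '\\''
    (close the string, add a literal escaped quote, reopen)."""
    return "'" + arg.replace("'", "'\\''") + "'"

# An argument like the one from the report keeps its inner quotes
# intact after quoting, instead of colliding with the wrapping quotes.
quoted = shell_quote_single("source /tmp/init.sh ; /bin/sort '-s' '-k' '3'")
```

With this, the generated job script would pass the whole second argument to bash -c as one string, so /bin/sort sees all of its options.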
consider this minimal example:
#!/usr/bin/env python
import drmaa
import os

def main():
    error_log = os.path.join(os.getcwd(), 'error.log')
    with drmaa.Session() as s:
        print('Creating job template')
        jt = s.createJobTemplate()
        jt.remoteCommand = "/bin/bash"
        jt.args = ['-c', "/bin/sort '-s' '-k' '3' '-t' '\t'"]
        jt.errorPath = ":" + error_log
        jobid = s.runJob(jt)
        print('Your job has been submitted with ID {0}'.format(jobid))
        retval = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        print('Job: {0} finished with status {1}'.format(retval.jobId, retval.exitStatus))
        with open(error_log, 'r') as fin:
            print(fin.read())
        print('Cleaning up')
        s.deleteJobTemplate(jt)
        os.remove(error_log)

if __name__ == '__main__':
    main()
results in
$ python foo.py
Creating job template
Your job has been submitted with ID 7990
Job: 7990 finished with status 2
/bin/sort: option requires an argument -- 't'
Try '/bin/sort --help' for more information.
Cleaning up
Hello,
I'm not sure if this is the origin of this particular bug, but I have not successfully reproduced this error on other DRMAA implementations.
I've submitted jobs to my SLURM 18.08 system where, occasionally, I get a reported "unknown signal?!". The exact same job, when resubmitted, may or may not have this issue. I cannot track down exactly what happens when this occurs or what causes it.
I have run strace on equivalent submitted jobs, one which reports the "unknown signal" versus one which exits normally, and I cannot find any discernible difference, notably even when tracing specifically for signals.
sacct reports nothing unusual, and actually seems to indicate that the job exited without issue. The sysadmin for our cluster agrees and cannot find any issue.
This could be a cluster-specific issue, a DRMAA issue, or something else entirely. If I'm looking in the wrong place, please kindly redirect me. I'm not sure where or how I could start tracking down this issue.
Thanks for your time.
Hi,
A CVE was recently disclosed affecting all previous SLURM versions, meaning that there are now only two supported SLURM versions (20.11.9 and 21.08.8). Is this package still viable with the latest version of each?
Thanks
Using 44cc67e:
t #6279 [ 5.70] <- slurmdrmaa_parse_native
d #6279 [ 5.70] * job 16484204 submitted
t #6279 [ 5.71] -> fsd_job_new(16484204_1)
t #6279 [ 5.71] <- fsd_job_new=0xc163b0: ref_cnt=1 [lock 16484204_1]
t #6279 [ 5.71] -> fsd_job_set_add(job=0xc163b0, job_id=16484204_1)
t #6279 [ 5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [ 5.71] -> fsd_job_release(0xc163b0={job_id=16484204_1, ref_cnt=2}) [unlock 16484204_1]
t #6279 [ 5.71] <- fsd_job_release
t #6279 [ 5.71] -> fsd_job_new(16484204_2)
t #6279 [ 5.71] <- fsd_job_new=0xd6acf0: ref_cnt=1 [lock 16484204_2]
t #6279 [ 5.71] -> fsd_job_set_add(job=0xd6acf0, job_id=16484204_2)
t #6279 [ 5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [ 5.71] -> fsd_job_release(0xd6acf0={job_id=16484204_2, ref_cnt=2}) [unlock 16484204_2]
t #6279 [ 5.71] <- fsd_job_release
t #6279 [ 5.71] -> fsd_job_new(16484204_3)
t #6279 [ 5.71] <- fsd_job_new=0xdffed0: ref_cnt=1 [lock 16484204_3]
t #6279 [ 5.71] -> fsd_job_set_add(job=0xdffed0, job_id=16484204_3)
t #6279 [ 5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [ 5.71] -> fsd_job_release(0xdffed0={job_id=16484204_3, ref_cnt=2}) [unlock 16484204_3]
t #6279 [ 5.71] <- fsd_job_release
t #6279 [ 5.71] -> fsd_job_new(16484204_4)
t #6279 [ 5.71] <- fsd_job_new=0xe55a70: ref_cnt=1 [lock 16484204_4]
t #6279 [ 5.71] -> fsd_job_set_add(job=0xe55a70, job_id=16484204_4)
t #6279 [ 5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [ 5.71] -> fsd_job_release(0xe55a70={job_id=16484204_4, ref_cnt=2}) [unlock 16484204_4]
t #6279 [ 5.71] <- fsd_job_release
t #6279 [ 5.71] -> slurmdrmaa_free_job_desc
t #6279 [ 5.71] <- slurmdrmaa_free_job_desc
t #6279 [ 5.71] <- drmaa_run_bulk_jobs =0
d #6279 [ 5.71] * fsd_exc_new(1006,Vector have no more elements.,0)
t #6279 [ 5.71] <- drmaa_get_next_job_id=25: Vector have no more elements.
t #6279 [ 5.71] -> drmaa_delete_job_template(0xe24e30)
t #6279 [ 5.71] <- drmaa_delete_job_template =0
t #6279 [ 70.34] -> drmaa_job_ps(job_id=16484204_2)
t #6279 [ 70.34] -> fsd_job_set_get(job_id=16484204_2)
t #6279 [ 70.34] <- fsd_job_set_get(job_id=16484204_2) =0xd6acf0: ref_cnt=2 [lock 16484204_2]
d #6279 [ 70.34] * job->last_update_time = 0
d #6279 [ 70.34] * updating status of job: 16484204_2
t #6279 [ 70.34] -> slurmdrmaa_job_update_status({job_id=16484204_2})
t #6279 [ 70.34] -> slurmdrmaa_set_job_id({job_id=16484204_2})
t #6279 [ 70.34] <- slurmdrmaa_set_job_id; job_id=16484204_2
E #6279 [ 70.34] * fsd_exc_new(1003,not an number: 16484204_2,1)
t #6279 [ 70.34] -> slurmdrmaa_unset_job_id({job_id=(null)})
t #6279 [ 70.34] <- slurmdrmaa_unset_job_id; job_id=16484204_2
t #6279 [ 70.34] -> fsd_job_release(0xd6acf0={job_id=16484204_2, ref_cnt=2}) [unlock 16484204_2]
t #6279 [ 70.34] <- fsd_job_release
t #6279 [ 70.34] <- drmaa_job_ps=4: not an number: 16484204_2
This causes DRMAA to drop these jobs.
The code seems to assume that job IDs have to be numeric, but Slurm uses array job IDs of the form <jobid>_<arrayid>, which in this case isn't being handled properly.
I also noticed the line "t #6279 [ 5.71] <- drmaa_run_bulk_jobs =0". Shouldn't it be =1 in this case?
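A minimal sketch of the ID parsing the library would need, written in Python for illustration (`parse_slurm_job_id` is a hypothetical helper; the real fix belongs in the C code around slurmdrmaa_set_job_id, which currently rejects anything non-numeric):

```python
def parse_slurm_job_id(job_id):
    """Split a Slurm job ID string into (base_id, array_task_id).

    Plain IDs like "16484204" yield (16484204, None); array task
    IDs like "16484204_2" yield (16484204, 2)."""
    base, sep, task = job_id.partition("_")
    return int(base), (int(task) if sep else None)
```

Accepting the underscore form would let drmaa_job_ps query array tasks instead of raising "not an number".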
Using the same approach described in #5 but running the job with:
jt.nativeSpecification = "--cpus-per-task=2000 --nodes=1 --mem-per-cpu=5000 --partition=htc --tmp=100"
Another segfault is triggered:
Program received signal SIGSEGV, Segmentation fault.
drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
297 iter_function(job_id, drmaa_job_ids_t)
(gdb) bt
#0 drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
#1 0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#2 0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed771770 <drmaa_release_job_ids>, rvalue=<optimized out>, avalue=0x7fffffffc710)
at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#3 0x00007fffeffe483c in _call_function_pointer (argcount=1, resmem=0x7fffffffc730, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffc710, pProc=0x7fffed771770 <drmaa_release_job_ids>, flags=4353)
at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#4 _ctypes_callproc (pProc=0x7fffed771770 <drmaa_release_job_ids>, argtuple=0x7fffffffc7e0, flags=4353, argtypes=<optimized out>, restype=0x7ffff7d61cd0 <_Py_NoneStruct>, checker=0x0)
at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#5 0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#6 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7fffeea66818, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#7 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffcb18, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#8 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#9 0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd90200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#10 0x00007ffff7978f16 in listextend (self=0x7fffeea7ad48, b=<optimized out>) at Objects/listobject.c:857
#11 0x00007ffff7979398 in list_init (self=0x7fffeea7ad48, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#12 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8d470, kwds=0x0) at Objects/typeobject.c:915
#13 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#14 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffce58, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#15 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#16 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01fc420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9dba0, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1,
defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x7ffff7ea3c30, qualname=0x7fffefd8d2b8) at Python/ceval.c:4128
#17 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea832f0) at Python/ceval.c:4939
#18 call_function (pp_stack=0x7fffffffd0f8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#19 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#20 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1b930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0,
kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4128
#21 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0,
kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#22 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#23 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffd450, locals=0x7ffff7f5cf30, globals=0x7ffff7f5cf30, filename=0x7ffff7ea3830, mod=0x683f58) at Python/pythonrun.c:980
#24 PyRun_FileExFlags (fp=0x64cc30, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5cf30, locals=0x7ffff7f5cf30, closeit=<optimized out>, flags=0x7fffffffd450) at Python/pythonrun.c:933
#25 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x64cc30, filename=<optimized out>, closeit=1, flags=0x7fffffffd450) at Python/pythonrun.c:396
#26 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffd450, filename=0x603310 L<error reading variable>, fp=0x64cc30) at Modules/main.c:338
#27 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#28 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69
In this case, the error "python: error: CPU count per node can not be satisfied" is still shown, but it segfaults afterwards anyway.
make[2]: Entering directory `/root/slurm-drmaa-1.1.1/slurm_drmaa'
/bin/sh ../libtool --tag=CC --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I/root/rpmbuild/BUILD/slurm-20.11.0 -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O2 -pthread -MT libdrmaa_la-drmaa.lo -MD -MP -MF .deps/libdrmaa_la-drmaa.Tpo -c -o libdrmaa_la-drmaa.lo `test -f 'drmaa.c' || echo './'`drmaa.c
libtool: compile: gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I/root/rpmbuild/BUILD/slurm-20.11.0 -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O2 -pthread -MT libdrmaa_la-drmaa.lo -MD -MP -MF .deps/libdrmaa_la-drmaa.Tpo -c drmaa.c -fPIC -DPIC -o .libs/libdrmaa_la-drmaa.o
drmaa.c: In function ‘slurmdrmaa_get_DRM_system’:
drmaa.c:65:3: error: unknown type name ‘slurm_ctl_conf_t’
slurm_ctl_conf_t * conf_info_msg_ptr = NULL;
^
drmaa.c:66:3: warning: passing argument 2 of ‘slurm_load_ctl_conf’ from incompatible pointer type [enabled by default]
if ( slurm_load_ctl_conf ((time_t) NULL, &conf_info_msg_ptr ) == -1 )
^
In file included from drmaa.c:31:0:
/root/rpmbuild/BUILD/slurm-20.11.0/slurm/slurm.h:3714:12: note: expected ‘struct slurm_conf_t **’ but argument is of type ‘int **’
extern int slurm_load_ctl_conf(time_t update_time,
^
drmaa.c:73:101: error: request for member ‘version’ in something not a structure or union
fsd_snprintf(NULL, slurmdrmaa_version, sizeof(slurmdrmaa_version)-1,"SLURM %s", conf_info_msg_ptr->version);
^
drmaa.c:74:4: warning: passing argument 1 of ‘slurm_free_ctl_conf’ from incompatible pointer type [enabled by default]
slurm_free_ctl_conf (conf_info_msg_ptr);
^
In file included from drmaa.c:31:0:
/root/rpmbuild/BUILD/slurm-20.11.0/slurm/slurm.h:3722:13: note: expected ‘struct slurm_conf_t *’ but argument is of type ‘int *’
extern void slurm_free_ctl_conf(slurm_conf_t *slurm_ctl_conf_ptr);
^
make[2]: *** [libdrmaa_la-drmaa.lo] Error 1
make[2]: Leaving directory `/root/slurm-drmaa-1.1.1/slurm_drmaa'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/slurm-drmaa-1.1.1'
make: *** [all] Error 2
Dear Developers,
We are unable to compile slurm-drmaa 1.0.7 on this version of Slurm. Is the code compatible with this version?
Please find the errors below:
util.c:175:19: error: 'job_desc_msg_t' has no member named 'gres'
fsd_free(job_desc->gres);
util.c:322:12: error: 'job_desc_msg_t' has no member named 'gres'
job_desc->gres = fsd_strdup(value);
Please advise.
Thank you,
Iten
./configure fails:
# ./configure
[...]
checking for SLURM library dir... /usr/lib
checking for slurmdb_users_get in -lslurm... yes
Using slurm libraries -lslurm
checking for usable SLURM libraries/headers... *** The SLURM test program failed to link or run. See the file config.log
*** for the exact error that occured.
no
configure: error:
Slurm libraries/headers not found;
add --with-slurm-inc and --with-slurm-lib with appropriate locations.
The relevant section in config.log:
configure:13429: ./conftest
conftest: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
conftest: error: fetch_config: DNS SRV lookup failed
conftest: error: _establish_config_source: failed to fetch config
conftest: fatal: Could not establish a configuration source
configure:13429: $? = 1
configure: program exited with status 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "DRMAA for Slurm"
| #define PACKAGE_TARNAME "slurm-drmaa"
| #define PACKAGE_VERSION "1.2.0-dev.71cd5be"
| #define PACKAGE_STRING "DRMAA for Slurm 1.2.0-dev.71cd5be"
| #define PACKAGE_BUGREPORT "[email protected]"
| #define PACKAGE_URL ""
| #define PACKAGE "slurm-drmaa"
| #define VERSION "1.2.0-dev.71cd5be"
| #define STDC_HEADERS 1
| #define HAVE_SYS_TYPES_H 1
| #define HAVE_SYS_STAT_H 1
| #define HAVE_STDLIB_H 1
| #define HAVE_STRING_H 1
| #define HAVE_MEMORY_H 1
| #define HAVE_STRINGS_H 1
| #define HAVE_INTTYPES_H 1
| #define HAVE_STDINT_H 1
| #define HAVE_UNISTD_H 1
| #define HAVE_DLFCN_H 1
| #define LT_OBJDIR ".libs/"
| #define HAVE_PTHREAD_PRIO_INHERIT 1
| #define HAVE_LIBSLURM 1
| /* end confdefs.h. */
| #include "slurm/slurm.h"
| int
| main ()
| {
| job_desc_msg_t job_req; /*at least check for declared structs */
| return 0;
|
| ;
| return 0;
| }
configure:13444: result: no
configure:13450: error:
It looks like this is the culprit:
conftest: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
conftest: error: fetch_config: DNS SRV lookup failed
conftest: error: _establish_config_source: failed to fetch config
conftest: fatal: Could not establish a configuration source
It worked with the previous version that I used (20.02).
It looks like there now is a slurm_library_init, and this subsequently requires a working Slurm configuration and installation to run.
The slurm-drmaa.spec file supplied in the release bundle is missing functionality to support the configure options outlined in:
https://github.com/natefoo/slurm-drmaa/blob/main/README.md
Namely:
Notable ./configure script options:
--with-slurm-inc SLURM_INCLUDE_PATH
Path to Slurm header files (i.e. directory containing slurm/slurm.h ). By default the library tries to guess the SLURM_INCLUDE_PATH and SLURM_LIBRARY_PATH based on location of the srun executable.
--with-slurm-lib SLURM_LIBRARY_PATH
Path to Slurm libraries (i.e. directory containing libslurm.a ).
--prefix INSTALLATION_DIRECTORY
Root directory where PSNC DRMAA for Slurm shall be installed. When not given library is installed in /usr/local.
--enable-debug
Compiles library with debugging enabled (with debugging symbols not stripped, without optimizations, and with many log messages enabled). Useful when you are to debug DRMAA enabled application or investigate problems with DRMAA library itself.
The following patch enables this functionality for rpmbuild using rpmmacros:
--- slurm-drmaa-1.1.3/slurm-drmaa.spec 2022-01-04 16:48:41.991693930 +0000
+++ slurm-drmaa.spec 2022-01-04 16:37:09.449491395 +0000
@@ -31,7 +31,10 @@
RPM_OPT_FLAGS=`echo "$RPM_OPT_FLAGS" | sed -e 's/-O2 /-O0 /'`
CFLAGS="$RPM_OPT_FLAGS"
export CFLAGS
-%configure
+%configure \
+ %{?_with_slurm_lib:--with-slurm-lib=%{_with_slurm_lib}} \
+ %{?_with_slurm_inc:--with-slurm-inc=%{_with_slurm_inc}} \
+ %{?_enable_debug:--enable-debug}
%install
rm -rf "$RPM_BUILD_ROOT"
I followed the instructions given on the GitHub page. I am using Slurm version 20.11.7 on an AWS parallel cluster.
export SLURM_INCLUDE_PATH=/opt/slurm/include/slurm
export SLURM_LIBRARY_PATH=/opt/slurm/lib
export LD_LIBRARY_PATH=/opt/slurm/lib
./configure && make
The error I get is as follows:
[ec2-user@ip-XX-XX-XX-XX slurm-drmaa-1.1.2]$ ./configure && make
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for style of include used by make... GNU
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking dependency style of gcc... gcc3
checking for ar... ar
checking the archiver (ar) interface... ar
checking for gcc... (cached) gcc
checking whether we are using the GNU C compiler... (cached) yes
checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ISO C89... (cached) none needed
checking dependency style of gcc... (cached) gcc3
checking for gcc option to accept ISO C99... none needed
checking how to run the C preprocessor... gcc -E
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /bin/sed
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for fgrep... /bin/grep -F
checking for ld used by gcc... /bin/ld
checking if the linker (/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /bin/nm -B
checking the name lister (/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert x86_64-unknown-linux-gnu file names to x86_64-unknown-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-unknown-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for archiver @file support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for mt... no
checking if : is a manifest tool... no
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking whether make sets $(MAKE)... (cached) yes
checking whether ln -s works... yes
checking whether gcc accepts -Wno-missing-field-initializers... yes
checking whether gcc accepts -Wno-format-zero-length... yes
checking whether gcc is Clang... no
checking whether pthreads work with -pthread... yes
checking for joinable pthread attribute... PTHREAD_CREATE_JOINABLE
checking whether more special flags are required for pthreads... no
checking for PTHREAD_PRIO_INHERIT... yes
configure: checking for SLURM
checking for SLURM compile flags... -I/opt/slurm/include
checking for SLURM library dir... /opt/slurm/lib
checking for slurmdb_users_get in -lslurm... yes
Using slurm libraries -lslurm
checking for usable SLURM libraries/headers... yes
checking for ANSI C header files... (cached) yes
checking whether time.h and sys/time.h may both be included... yes
checking for stdbool.h that conforms to C99... yes
checking for _Bool... yes
checking stddef.h usability... yes
checking stddef.h presence... yes
checking for stddef.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for strings.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking for size_t... yes
checking whether struct tm is in sys/time.h or time.h... time.h
checking for an ANSI C-conforming const... yes
checking for inline... inline
checking for working volatile... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
checking for strftime... yes
checking for vprintf... yes
checking for _doprnt... no
checking for asprintf... yes
checking for fstat... yes
checking for getcwd... yes
checking for gettimeofday... yes
checking for localtime_r... yes
checking for memset... yes
checking for mkstemp... yes
checking for setenv... yes
checking for strcasecmp... yes
checking for strchr... yes
checking for strdup... yes
checking for strerror... yes
checking for strlcpy... no
checking for strndup... yes
checking for strstr... yes
checking for strtol... yes
checking for vasprintf... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating test/Makefile
config.status: creating slurm_drmaa/Makefile
config.status: creating config.h
config.status: config.h is unchanged
config.status: executing depfiles commands
config.status: executing libtool commands
=== configuring in drmaa_utils (/home/ec2-user/slurm-drmaa-1.1.2/drmaa_utils)
configure: WARNING: no configuration information is in drmaa_utils
Run 'make' now.
make all-recursive
make[1]: Entering directory `/home/ec2-user/slurm-drmaa-1.1.2'
Making all in drmaa_utils
make[2]: Entering directory `/home/ec2-user/slurm-drmaa-1.1.2/drmaa_utils'
make[2]: *** No rule to make target `all'. Stop.
make[2]: Leaving directory `/home/ec2-user/slurm-drmaa-1.1.2/drmaa_utils'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ec2-user/slurm-drmaa-1.1.2'
make: *** [all] Error 2
Hi
We have several web applications submitting jobs to SGE using the Java-based Opal Toolkit (https://sourceforge.net/projects/opaltoolkit/) via drmaa.jar.
In the process of migrating from SGE to Slurm, I realized that the JNI methods do not exist in the slurm-drmaa library. Do you know of any out-of-the-box solution for submitting jobs via drmaa.jar to Slurm?
Best regards,
Guilhem
Is it possible to build RPMs from the source build?
Instructions?
The Galaxy Project Depot certificate has expired today:
pub rsa4096 2014-09-09 [SC] [expired: 2021-07-29]
4AAF 9274 AC74 7FFB 1627 6F99 5262 6447 751B 835F
uid [ expired] Nathan Coraor [email protected]
uid [ expired] Nathan Coraor [email protected]
As of Slurm 17.11 there are new job states that slurm-drmaa is not handling. These include JOB_OOM (OUT_OF_MEMORY), among others. As a result, the DRMAA state will be UNDETERMINED until the job exceeds Slurm's MinJobAge.
We need to determine whether JOB_OOM is only returned when the job is terminal, or at any time that the OOM killer has activated.
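One possible shape of the fix, sketched in Python for illustration: map each Slurm base state to a DRMAA program status, with unmapped states falling through to UNDETERMINED. The state names follow slurm.h, but mapping the newer terminal states to FAILED is an assumption of this sketch, not the library's current behavior:

```python
# Slurm base job states (names as in slurm.h) mapped to DRMAA job
# states. JOB_BOOT_FAIL, JOB_DEADLINE, and JOB_OOM are newer terminal
# states; anything unmapped falls through to UNDETERMINED, which is
# effectively what slurm-drmaa reports for them today.
SLURM_TO_DRMAA = {
    "JOB_PENDING":   "QUEUED_ACTIVE",
    "JOB_RUNNING":   "RUNNING",
    "JOB_SUSPENDED": "USER_SUSPENDED",
    "JOB_COMPLETE":  "DONE",
    "JOB_CANCELLED": "FAILED",
    "JOB_FAILED":    "FAILED",
    "JOB_TIMEOUT":   "FAILED",
    "JOB_NODE_FAIL": "FAILED",
    "JOB_BOOT_FAIL": "FAILED",
    "JOB_DEADLINE":  "FAILED",
    "JOB_OOM":       "FAILED",
}

def drmaa_state(slurm_state):
    """Return the DRMAA status for a Slurm base state name."""
    return SLURM_TO_DRMAA.get(slurm_state, "UNDETERMINED")
```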
Hi,
I tried the native slurm-drmaa package, but this led to Galaxy handlers failing to restart. The handler logs complained about slurm-drmaa.
sudo apt install slurm-drmaa1
Is yours (I'll use the Ubuntu Launchpad repo, thanks) more likely to be compatible with Ubuntu 20.04? Galaxy version is 20.05.
Thanks
I'm trying to install slurm-drmaa version 1.1.2 along with Slurm version 19.05.8-1 on a CentOS 7.9 machine. It fails with package dependency issues, requiring the libslurm.so.31()(64bit) and libslurmdb.so.31()(64bit) libraries, whereas slurm-19.05.8-1 provides libslurm.so.34.
[root@test-vm111 ~]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
Installed slurm packages:
[root@test-vm111 yum.repos.d]# rpm -qa | grep ^slurm | sort
slurm-19.05.8-1.el7.x86_64
slurm-contribs-19.05.8-1.el7.x86_64
slurm-devel-19.05.8-1.el7.x86_64
slurm-example-configs-19.05.8-1.el7.x86_64
slurm-libpmi-19.05.8-1.el7.x86_64
slurm-openlava-19.05.8-1.el7.x86_64
slurm-pam_slurm-19.05.8-1.el7.x86_64
slurm-perlapi-19.05.8-1.el7.x86_64
slurm-slurmctld-19.05.8-1.el7.x86_64
slurm-slurmd-19.05.8-1.el7.x86_64
slurm-slurmdbd-19.05.8-1.el7.x86_64
slurm-torque-19.05.8-1.el7.x86_64
Issue with slurm-drmaa 1.1.2 installation
[root@test-vm111 yum.repos.d]# yum install slurm-drmaa
Loaded plugins: langpacks, nvidia
Resolving Dependencies
--> Running transaction check
---> Package slurm-drmaa.x86_64 0:1.1.2-1.el7 will be installed
--> Processing Dependency: libslurmdb.so.31()(64bit) for package: slurm-drmaa-1.1.2-1.el7.x86_64
--> Processing Dependency: libslurm.so.31()(64bit) for package: slurm-drmaa-1.1.2-1.el7.x86_64
--> Finished Dependency Resolution
Error: Package: slurm-drmaa-1.1.2-1.el7.x86_64 (slurm-19.05)
Requires: libslurmdb.so.31()(64bit)
Error: Package: slurm-drmaa-1.1.2-1.el7.x86_64 (slurm-19.05)
Requires: libslurm.so.31()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
libslurm.so library available on the machine:
[root@test-vm111 ~]# locate libslurm.so
/usr/lib64/libslurm.so
/usr/lib64/libslurm.so.34
/usr/lib64/libslurm.so.34.0.0
[root@test-vm111 yum.repos.d]# rpm -qf /usr/lib64/libslurm.so.34
slurm-19.05.8-1.el7.x86_64
Can anyone look into this issue?
Hi,
Thanks for merging my previous fix. This one is in a similar vein.
On line 134 of slurm_drmaa/job.c, any problem when updating the job status is reported back as FSD_ERRNO_INTERNAL_ERROR. The specific issue here is that the caller would like to know whether the error is intermittent (e.g. a network time-out), so that the job status can possibly be queried successfully in a few minutes, or whether the problem is terminal and the job is dead. I've prepared a complementary patch to Snakemake to handle FSD_ERRNO_DRM_COMMUNICATION_FAILURE as an intermittent fault and to keep polling the job.
Really, the DRMAA library should make a better attempt to convert Slurm errors to meaningful DRMAA error codes, but this is a start.
Let me know if you'd prefer me to submit this stuff elsewhere. It's hard to see who is maintaining the definitive slurm-drmaa libs just now.
*** tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c.orig 2016-11-04 15:09:49.000000000 +0000
--- tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c 2017-06-09 15:05:38.000000000 +0100
***************
*** 131,138 ****
if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
self->on_missing(self);
! } else {
! fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(slurm_get_errno()), self->job_id);
}
}
if (job_info) {
--- 131,150 ----
if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
self->on_missing(self);
! } else
! // We should detect the error corresponding to "Socket timed out" and report
! // it explicitly as FSD_ERRNO_TIMEOUT or maybe FSD_ERRNO_DRM_COMMUNICATION_FAILURE
! // ( I'm not sure if FSD_ERRNO_TIMEOUT is the same as DRMAA_ERRNO_EXIT_TIMEOUT,
! // which simply indicates the job is still running?? Maybe we should try it and see. )
! // To see what _slurm_errno corresponds to which message let's look at
! // 'slurm_strerror' in the slurm source code...
! // https://github.com/SchedMD/slurm/blob/master/src/common/slurm_errno.c
! if ( _slurm_errno == SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT ||
! _slurm_errno == SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR
! ) {
! fsd_exc_raise_fmt(FSD_ERRNO_DRM_COMMUNICATION_FAILURE,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
! } else {
! fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
}
}
if (job_info) {
Cheers,
TIM
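On the caller side, the point of distinguishing the two error codes can be sketched as a retry loop like the one below. This is a generic illustration, not the Snakemake patch itself; `DrmCommunicationError` is a stand-in for whatever exception a binding raises for FSD_ERRNO_DRM_COMMUNICATION_FAILURE (e.g. drmaa-python's DrmCommunicationException):

```python
import time

class DrmCommunicationError(Exception):
    """Stand-in for a transient DRM communication failure."""

def poll_with_retry(check, retries=5, delay=0.0):
    """Call check() and return its result, retrying only on transient
    communication errors; any other exception propagates immediately,
    since it likely means the job is dead rather than unreachable."""
    for attempt in range(retries):
        try:
            return check()
        except DrmCommunicationError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

Without the patch above, a socket time-out and a genuinely dead job both surface as the same internal error, so the caller cannot make this distinction.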
I'm testing slurm-drmaa in a container, but even when running outside of a container, whether building from source or installing via the Galaxy RPM, the binary segfaults every time I run it.
Am I missing something?
Error is:
[root@f8ddc11bc51e /]# DRMAA_LIBRARY_PATH=/usr/lib64/libdrmaa.so /usr/bin/drmaa-run
Segmentation fault (core dumped)
Backtrace shows:
[root@f8ddc11bc51e /]# export DRMAA_LIBRARY_PATH=/usr/lib64/libdrmaa.so
[root@f8ddc11bc51e /]# gdb /usr/bin/drmaa-run
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/bin/drmaa-run...Reading symbols from /usr/lib/debug/usr/bin/drmaa-run.debug...done.
done.
(gdb) run
Starting program: /usr/bin/drmaa-run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00000000004129b6 in parse_args (argc=0, argv=0x7fffffffe7a0) at drmaa_run.c:254
254 while (argc >= 0 && argv[0][0] == '-')
(gdb) backtrace
#0 0x00000000004129b6 in parse_args (argc=0, argv=0x7fffffffe7a0) at drmaa_run.c:254
#1 0x00000000004120df in main (argc=1, argv=0x7fffffffe798) at drmaa_run.c:122
(gdb)
My test setup is as follows:
Dockerfile:
$ cat Dockerfile
FROM centos:7
RUN (cd /lib/systemd/system/sysinit.target.wants/; for i in *; do [ $i == systemd-tmpfiles-setup.service ] || rm -f $i; done); \
    rm -f /lib/systemd/system/multi-user.target.wants/*; \
    rm -f /etc/systemd/system/*.wants/*; \
    rm -f /lib/systemd/system/local-fs.target.wants/*; \
    rm -f /lib/systemd/system/sockets.target.wants/*udev*; \
    rm -f /lib/systemd/system/sockets.target.wants/*initctl*; \
    rm -f /lib/systemd/system/basic.target.wants/*; \
    rm -f /lib/systemd/system/anaconda.target.wants/*
VOLUME [ "/sys/fs/cgroup" ]
RUN yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
RUN yum-config-manager --add-repo https://depot.galaxyproject.org/yum/galaxy.repo
RUN yum -y install which strace gdb
RUN debuginfo-install -y libgcc-4.8.5-44.el7.x86_64
RUN debuginfo-install -y glibc-2.17-324.el7_9.x86_64
RUN yum -y install slurm-slurmd-20.11.8 slurm-devel-20.11.8
RUN yum clean all && yum -y update
RUN yum -y install slurm-drmaa slurm-drmaa-debuginfo
RUN yum clean all && \
    rm -rf /var/cache/yum
ENTRYPOINT ["/usr/sbin/init"]
This results in a working container. When I log in to the container, I'm running:
[root@f8ddc11bc51e /]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
[root@f8ddc11bc51e7 /]# rpm -qa slurm*
slurm-slurmd-20.11.8-1.el7.x86_64
slurm-drmaa-debuginfo-1.1.2-1.el7.x86_64
slurm-20.11.8-1.el7.x86_64
slurm-devel-20.11.8-1.el7.x86_64
slurm-drmaa-1.1.2-1.el7.x86_64
[root@f8ddc11bc51e /]# yum info slurm-drmaa-1.1.2-1.el7.x86_64
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
Hi,
I've compiled the latest slurm-drmaa and tried to use the drmaa.jar from SGE so I can access the methods in Java, pointing to the compiled version of libdrmaa.so, but I'm getting: Exception in thread "main" java.lang.UnsatisfiedLinkError: com.sun.grid.drmaa.SessionImpl.nativeInit(Ljava/lang/String;)V
So I'm using:
export DRMAA_LIBRARY_PATH=/home/vargasfr/slurm_drmaa/slurm-drmaa-1.1.2/install/lib/libdrmaa.so
export CLASSPATH=/home/vargasfr/slurm_drmaa/lib/drmaa.jar:/home/vargasfr/slurm_drmaa/slurm-drmaa-1.1.2/install/lib/libdrmaa.so:/home/vargasfr/slurm_drmaa
export LD_LIBRARY_PATH=/home/vargasfr/slurm_drmaa/slurm-drmaa-1.1.2/install/lib
From my Java file I'm importing import org.ggf.drmaa.* (where this is coming from the drmaa.jar) and calling:
SessionFactory factory = SessionFactory.getFactory();
Any idea how I can make this work with Slurm?
Would it be possible to document the -D, --chdir=<directory> argument as well?
When using --mem, the value passed to Slurm is multiplied by the number of CPUs, i.e., it's treated as --mem-per-cpu. I think this was just a merge error on my part. This commit seems to have been applied to the wrong lines:
Hi,
I have a problem passing the right format for setting the hard wall clock time. I've been able to set hours and minutes, but not seconds. Is it possible?
For now I'm passing the string formatted as "HH:MM", but if I add the seconds ("HH:MM:SS") it just passes 0:00 to Slurm.
Thanks,
Alessio
I'm running into an issue on CentOS 7.9.2009 running Slurm 20.02.4 and slurm-drmaa 1.1.1. I ran configure with the '--with-slurm-inc=/opt/slurm/include' and '--with-slurm-lib=/opt/slurm/lib' options, but in the Galaxy log it can't find libslurm.so.35 despite the library being there:
Traceback (most recent call last):
File "lib/galaxy/main.py", line 298, in <module>
main()
File "lib/galaxy/main.py", line 294, in main
app_loop(args, log)
File "lib/galaxy/main.py", line 141, in app_loop
attach_to_pools=args.attach_to_pool,
File "lib/galaxy/main.py", line 108, in load_galaxy_app
**kwds
File "lib/galaxy/app.py", line 221, in __init__
self.job_manager = manager.JobManager(self)
File "lib/galaxy/jobs/manager.py", line 26, in __init__
self.job_handler = handler.JobHandler(app)
File "lib/galaxy/jobs/handler.py", line 51, in __init__
self.dispatcher = DefaultJobDispatcher(app)
File "lib/galaxy/jobs/handler.py", line 972, in __init__
self.job_runners = self.app.job_config.get_job_runner_plugins(self.app.config.server_name)
File "lib/galaxy/jobs/__init__.py", line 801, in get_job_runner_plugins
rval[id] = runner_class(self.app, runner.get('workers', JobConfiguration.DEFAULT_NWORKERS), **runner.get('kwds', {}))
File "lib/galaxy/jobs/runners/drmaa.py", line 65, in __init__
drmaa = __import__("drmaa")
File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/__init__.py", line 65, in <module>
from .session import JobInfo, JobTemplate, Session
File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/session.py", line 39, in <module>
from drmaa.helpers import (adapt_rusage, Attribute, attribute_names_iterator,
File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/helpers.py", line 36, in <module>
from drmaa.wrappers import (drmaa_attr_names_t, drmaa_attr_values_t,
File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/wrappers.py", line 56, in <module>
_lib = CDLL(libpath, mode=RTLD_GLOBAL)
File "/usr/lib64/python3.6/ctypes/__init__.py", line 343, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libslurm.so.35: cannot open shared object file: No such file or directory
[galaxy@ip-xxxxxxxxx slurm-drmaa-1.1.1]$ ls -la /opt/slurm/lib/libslurm.*
-rw-r--r--. 1 root root 57281210 Nov 17 16:38 /opt/slurm/lib/libslurm.a
-rwxr-xr-x. 1 root root 976 Nov 17 16:38 /opt/slurm/lib/libslurm.la
lrwxrwxrwx. 1 root root 18 Nov 17 16:38 /opt/slurm/lib/libslurm.so -> libslurm.so.35.0.0
lrwxrwxrwx. 1 root root 18 Nov 17 16:38 /opt/slurm/lib/libslurm.so.35 -> libslurm.so.35.0.0
-rwxr-xr-x. 1 root root 8200504 Nov 17 16:38 /opt/slurm/lib/libslurm.so.35.0.0
After some digging I found the following in the config.log:
configure:4925: checking for gcc option to accept ISO C99
configure:5074: gcc -c -g -O2 conftest.c >&5
conftest.c:61:29: error: expected ';', ',' or ')' before 'text'
test_restrict (ccp restrict text)
^
conftest.c: In function 'main':
conftest.c:115:18: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'newvar'
char *restrict newvar = "Another string";
^
conftest.c:115:18: error: 'newvar' undeclared (first use in this function)
conftest.c:115:18: note: each undeclared identifier is reported only once for each function it appears in
conftest.c:125:3: error: 'for' loop initial declarations are only allowed in C99 mode
for (int i = 0; i < ia->datasize; ++i)
^
conftest.c:125:3: note: use option -std=c99 or -std=gnu99 to compile your code
configure:5074: $? = 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "DRMAA for Slurm"
| #define PACKAGE_TARNAME "slurm-drmaa"
| #define PACKAGE_VERSION "1.1.1"
| #define PACKAGE_STRING "DRMAA for Slurm 1.1.1"
| #define PACKAGE_BUGREPORT "[email protected]"
| #define PACKAGE_URL ""
| #define PACKAGE "slurm-drmaa"
| #define VERSION "1.1.1"
| /* end confdefs.h. */
Any ideas?
Hi, just wanted to note this here. I get the following error during compilation:
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I.. -I//include -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O0 -pthread -MT libdrmaa_la-job.lo -MD -MP -MF .deps/libdrmaa_la-job.Tpo -c job.c -fPIC -DPIC -o .libs/libdrmaa_la-job.o
job.c: In function ‘slurmdrmaa_job_control’:
job.c:104:8: error: too few arguments to function ‘slurm_kill_job2’
if(slurm_kill_job2(self->job_id, SIGKILL, 0) == -1) {
^~~~~~~~~~~~~~~
In file included from ../slurm_drmaa/job.h:29,
from job.c:31:
/usr/include/slurm/slurm.h:3531:12: note: declared here
extern int slurm_kill_job2(const char *job_id, uint16_t signal, uint16_t flags,
^~~~~~~~~~~~~~~
job.c: In function ‘slurmdrmaa_job_update_status’:
job.c:364:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
self->exit_status = -1;
~~~~~~~~~~~~~~~~~~^~~~
job.c:365:5: note: here
case JOB_FAILED:
^~~~
I added a NULL argument to the slurm_kill_job2 call and it compiled.
Hi,
When I run bulk jobs with one of the jobs lasting more than 20 minutes, the s.synchronize() function waits for a few minutes after the last job is finished and then triggers a segfault:
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
Same happens if I loop through the job ids with the s.wait() function:
for jobid in joblist:
    s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
However it works perfectly fine if jobs finish in the same order as their SLURM_ARRAY_TASK_ID:
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((${SLURM_ARRAY_TASK_ID}*300))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
No problem if jobs last only 10 minutes:
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*10))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
I came up with this little piece of code to bypass the bug:
import time
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
for jobid in joblist:
    while (s.jobStatus(jobid)=="running"):
        time.sleep(10)
    print "job %s done" % jobid
Yields:
job 135892105_1 done
job 135892105_2 done
job 135892105_3 done
job 135892105_4 done
Hi, I have a question. If we create a new specification like -P for Project_ID which is defined as the id of the particular project like account. Can I update the DRMAA-Slurm library for this attribute by myself? Thanks.
If libdrmaa fails to contact SLURM due to system overload, a temporary network interruption or a timeout a "socket error" is sometimes seen and is immediately followed by a segfault.
This is likely to be due to improper handling of the error.
Hi!
Using the provided SPEC file (CentOS 7.x, x86_64), it looks like the generated RPM installs libdrmaa.so in /usr/lib64:
$ rpm -ql slurm-drmaa
/usr/bin/drmaa-run
/usr/bin/drmaa-run-bulk
/usr/bin/hpc-bash
/usr/include/drmaa.h
/usr/lib64/libdrmaa.a
/usr/lib64/libdrmaa.la
/usr/lib64/libdrmaa.so
/usr/lib64/libdrmaa.so.1
/usr/lib64/libdrmaa.so.1.0.8
/usr/share/doc/slurm-drmaa-1.1.2
/usr/share/doc/slurm-drmaa-1.1.2/COPYING
/usr/share/doc/slurm-drmaa-1.1.2/NEWS
/usr/share/doc/slurm-drmaa-1.1.2/README.md
/usr/share/doc/slurm-drmaa-1.1.2/slurm_drmaa.conf.example
But DRMAA_LIBRARY_PATH defaults to /usr/lib/libdrmaa.so:
$ drmaa-run
F #1d7bd [ 0.00] * Could not load DRMAA library (DRMAA_LIBRARY_PATH=/usr/lib/libdrmaa.so)
F #1d7bd [ 0.00] * Error
and:
$ strace -e open drmaa-run
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libdrmaa.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 3
F #1d7c5 [ 0.00] * Could not load DRMAA library (DRMAA_LIBRARY_PATH=/usr/lib/libdrmaa.so)
F #1d7c5 [ 0.00] * Error
+++ exited with 255 +++
Would there be a way to make the default DRMAA_LIBRARY_PATH consistent with the RPM installation?
After:
export DRMAA_LIBRARY_PATH=~/test_drmaa/slurm-drmaa-1.2.0-dev.83fc288/slurm_drmaa/.libs/libdrmaa.so
When using libdrmaa via python
#!/usr/bin/env python
from __future__ import print_function
import os
import drmaa
LOGS = "logs/"
if not os.path.isdir(LOGS):
os.mkdir(LOGS)
s = drmaa.Session()
s.initialize()
print("Supported contact strings:", s.contact)
print("Supported DRM systems:", s.drmsInfo)
print("Supported DRMAA implementations:", s.drmaaImplementation)
print("Version", s.version)
jt = s.createJobTemplate()
jt.remoteCommand = "/usr/bin/echo"
jt.args = ["Hello", "world"]
jt.jobName = "testdrmaa"
jt.jobEnvironment = os.environ.copy()
jt.workingDirectory = os.getcwd()
jt.outputPath = ":" + os.path.join(LOGS, "job-%A_%a.out")
jt.errorPath = ":" + os.path.join(LOGS, "job-%A_%a.err")
jt.nativeSpecification = "--cpus-per-task=2 --nodes=1 --mem-per-cpu=50 --partition=htc --tmp=100"
print("Submitting", jt.remoteCommand, "with", jt.args, "and logs to", jt.outputPath)
ids = s.runBulkJobs(jt, beginIndex=1, endIndex=2, step=1)
print("Job submitted with ids", ids)
s.deleteJobTemplate(jt)
The above code fails when calling runBulkJobs.
Stack trace of the above script:
Program received signal SIGSEGV, Segmentation fault.
strlcpy (dest=dest@entry=0x7a9640 "9829091", src=0x0, size=size@entry=1024) at compat.c:50
50 while( *src && --size > 0 )
(gdb) bt
#0 strlcpy (dest=dest@entry=0x7a9640 "9829091", src=0x0, size=size@entry=1024) at compat.c:50
#1 0x00007fffed772fac in drmaa_get_next_job_id (values=0x7ac5c0, value=0x7a9640 "9829091", value_len=1024) at drmaa_base.c:297
#2 0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#3 0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed772e90 <drmaa_get_next_job_id>, rvalue=<optimized out>, avalue=0x7fffffffc6c0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#4 0x00007fffeffe483c in _call_function_pointer (argcount=3, resmem=0x7fffffffc6f0, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffc6c0, pProc=0x7fffed772e90 <drmaa_get_next_job_id>, flags=4353) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#5 _ctypes_callproc (pProc=0x7fffed772e90 <drmaa_get_next_job_id>, argtuple=0x7fffffffc7e0, flags=4353, argtypes=<optimized out>, restype=0x7ffff0212f28, checker=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#6 0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#7 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7fffeea655c0, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#8 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffcb18, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#9 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#10 0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd90200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#11 0x00007ffff7978f3e in listextend (self=0x7fffeea79d48, b=<optimized out>) at Objects/listobject.c:857
#12 0x00007ffff7979398 in list_init (self=0x7fffeea79d48, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#13 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8d470, kwds=0x0) at Objects/typeobject.c:915
#14 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#15 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffce58, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#16 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#17 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01fc420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9dba0, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x7ffff7ea3c30, qualname=0x7fffefd8d2b8) at Python/ceval.c:4128
#18 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea8c2f0) at Python/ceval.c:4939
#19 call_function (pp_stack=0x7fffffffd0f8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#20 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#21 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1b930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4128
#22 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#23 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#24 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffd450, locals=0x7ffff7f5cf30, globals=0x7ffff7f5cf30, filename=0x7ffff7ea3830, mod=0x683f58) at Python/pythonrun.c:980
#25 PyRun_FileExFlags (fp=0x64cc30, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5cf30, locals=0x7ffff7f5cf30, closeit=<optimized out>, flags=0x7fffffffd450) at Python/pythonrun.c:933
#26 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x64cc30, filename=<optimized out>, closeit=1, flags=0x7fffffffd450) at Python/pythonrun.c:396
#27 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffd450, filename=0x603310 L<error reading variable>, fp=0x64cc30) at Modules/main.c:338
#28 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#29 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69
The above code runs fine with a libdrmaa built from https://github.com/ljyanesm/slurm-drmaa
When specifying --tmp, --mem-per-cpu, and other numerical parameters, the code will segfault if a size qualifier (1M, 1G, ...) is used.
While this is supported on the command line via srun/sbatch, the drmaa code assumes a bare number is passed and segfaults on non-numeric input.
Looking into this, it seems that the exit status returned from a child process is not handled gracefully in some instances.
Ideally the exit status should be interpreted using the macros defined in sys/wait.h, but they are used only sparingly across drmaa.c. Notably, WIFEXITED is used, but not WIFSIGNALED or WTERMSIG. For example, instead of WIFSIGNALED there is a bit operation that may or may not match that macro on a particular architecture. Is there a reason these macros were not used?
This is related to Issue #26. After removing the hardcoded exit status manipulation with macros, I suddenly went from jobs reporting "unknown signal?!" to "wasTerminated" which was far more informative in terms of tracking down our issues (and ultimately solved our problem).
I realize it's strange to ask a question like this on a repository, but I've spent the past hour trying to figure it out on my own to no avail. I thought that you might be able to answer it in 30 seconds. I would greatly appreciate any help!
In essence, how do you submit multi-threaded jobs using slurm-drmaa? To be clear, I want the job to run on one node (i.e., --ntasks=1). I would use the --cpus-per-task option with srun or sbatch, but this option isn't available in the native specification for slurm-drmaa.
I've tried different combinations of --mincpus, --nodes, --ntasks-per-node, and --ntasks, but they either allow jobs to be split across multiple nodes or they fail. I've looked through the code for galaxyproject/galaxy and galaxyproject/pulsar, but I couldn't find any hints.
drmaa-run bash test.drmaa
drmaa-run: error: _parse_next_key: Parsing error at unrecognized key: PriorityWeightTRES
drmaa-run: error: Parse error in file /etc/slurm/slurm.conf line 131: "PriorityWeightTRES=CPU=1000,Mem=2000 "
drmaa-run: error: _parse_next_key: Parsing error at unrecognized key: AccountingStorageTRES
drmaa-run: error: Parse error in file /etc/slurm/slurm.conf line 186: "AccountingStorageTRES=gres/gpu,gres/mic,gres/nvme"
drmaa-run: error: Parsing error at unrecognized key: TresBillingWeights
drmaa-run: error: Parse error in file /etc/slurm/slurm.conf line 196: " TresBillingWeights="CPU=1.0,Mem=0.127G""
drmaa-run: error: Unable to establish controller machine
Segmentation fault
For Slurm we can set the maximum number of simultaneous running tasks for an array job as follows:
--array={start}-{end}:{step-size}%{maximum number of simultaneously running tasks}
But it seems slurm-drmaa doesn't support this:
slurm-drmaa/slurm_drmaa/session.c
Line 123 in 665d5b5
It looks to me like the DRMAA native specification does not support memory units, and not all time formats are supported. sbatch (at least in Slurm 20.02) can handle memory specifications such as 10G and time specifications such as days-hours.
It would be great to have support for --nice as well.
At least when I built it just now, Slurm 19.05.1 doesn't ship a libslurmdb.so, so the linking test run by configure fails. Removing the link against it here solves the issue (https://github.com/natefoo/slurm-drmaa/blob/master/m4/ax_slurm.m4#L75), but it would be good if someone else confirmed this before it's implemented.
Hi,
We are using slurm-drmaa for our Galaxy instance in our HPC environment. We recently updated to Slurm 20.11.1; however, even the latest version of slurm-drmaa is no longer compatible with it:
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I.. -I/opt/software/slurm/include/ -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/opt/software/slurm-drmaa/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O2 -pthread -MT libdrmaa_la-drmaa.lo -MD -MP -MF .deps/libdrmaa_la-drmaa.Tpo -c drmaa.c -fPIC -DPIC -o .libs/libdrmaa_la-drmaa.o
drmaa.c: In function ‘slurmdrmaa_get_DRM_system’:
drmaa.c:65:3: error: unknown type name ‘slurm_ctl_conf_t’; did you mean ‘slurm_conf_t’?
65 | slurm_ctl_conf_t * conf_info_msg_ptr = NULL;
| ^~~~~~~~~~~~~~~~
| slurm_conf_t
drmaa.c:66:44: warning: passing argument 2 of ‘slurm_load_ctl_conf’ from incompatible pointer type [-Wincompatible-pointer-types]
66 | if ( slurm_load_ctl_conf ((time_t) NULL, &conf_info_msg_ptr ) == -1 )
| ^~~~~~~~~~~~~~~~~~
| |
| int **
In file included from drmaa.c:31:
/opt/software/slurm/include/slurm/slurm.h:3715:47: note: expected ‘slurm_conf_t **’ {aka ‘struct **’} but argument is of type ‘int **’
3715 | slurm_conf_t **slurm_ctl_conf_ptr);
| ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
drmaa.c:73:101: error: request for member ‘version’ in something not a structure or union
73 | fsd_snprintf(NULL, slurmdrmaa_version, sizeof(slurmdrmaa_version)-1,"SLURM %s", conf_info_msg_ptr->version);
| ^~
drmaa.c:74:25: warning: passing argument 1 of ‘slurm_free_ctl_conf’ from incompatible pointer type [-Wincompatible-pointer-types]
74 | slurm_free_ctl_conf (conf_info_msg_ptr);
| ^~~~~~~~~~~~~~~~~
| |
| int *
In file included from drmaa.c:31:
/opt/software/slurm/include/slurm/slurm.h:3722:47: note: expected ‘slurm_conf_t *’ {aka ‘struct *’} but argument is of type ‘int *’
3722 | extern void slurm_free_ctl_conf(slurm_conf_t *slurm_ctl_conf_ptr);
| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
Is there any plan to update slurm-drmaa?
Thanks,
Cheers,
Ata
https://depot.galaxyproject.org/yum/package/slurm/18.08/7/x86_64/
The DRMAA RPM fails to install because it was not built for this Slurm version. The Slurm RPM in the repo provides libslurm.so.33, while the drmaa package in this repo requires libslurm.so.31:
Error: Package: slurm-drmaa-1.1.0-1.el7.x86_64 (slurm-18.08)
Requires: libslurm.so.31()(64bit)
Error: Package: slurm-drmaa-1.1.0-1.el7.x86_64 (slurm-18.08)
Requires: libslurmdb.so.31()(64bit)
You could try using --skip-broken to work around the problem
Do you have access to this?
Testing with the drmaa-run utility, I find that slurm-drmaa fails with the 18.08.8 release of Slurm, but the exact same procedure works fine with 18.08.7. With 18.08.8 it fails at the job run step:
E #2af1 [ 0.77] * fsd_exc_new(1001,slurm_submit_batch_job error (-1): Unspecified error,1)
t #2af1 [ 0.77] -> slurmdrmaa_free_job_desc
t #2af1 [ 0.77] <- slurmdrmaa_free_job_desc
t #2af1 [ 0.77] <- drmaa_run_job=1: slurm_submit_batch_job error (-1): Unspecified error
F #2af1 [ 0.77] * Failed to submit a job: slurm_submit_batch_job error (-1): Unspecified error
Corresponding to this part of the drmaa-run code:
/* run */
if (api.run_job(jobid, sizeof(jobid) - 1, jt, errbuf, sizeof(errbuf) - 1) != DRMAA_ERRNO_SUCCESS) {
fsd_log_fatal(("Failed to submit a job: %s ", errbuf));
exit(2); /* TODO exception */
Slurm 18.08.8 addresses a security vulnerability that exists in prior versions of Slurm.
In both source archives for the 1.2.0-dev.deca826 release, the drmaa_utils directory is empty.
Hello, and apologies if this question is in the wrong place. We are upgrading from Debian 8 to Debian 11. I am a developer with no particular background in system administration or configuration. Several weeks into a cycle of install/google-error-message/install-something-else, I have installed munge, slurm, slurm-drmaa, and bats(!). slurmctld and slurmd are now running, but calls to drmaa_run_job() result in segfaults. (The surrounding C++ code is copied from our Debian 8 host, where drmaa_run_job() runs successfully.) I'll print some debug output below, but what I'm really looking for is start-to-finish, step-by-step instructions for configuring, installing, and running whatever it takes to make Slurm usable on Debian 11. Thanks in advance.
Last few steps of debug output from drmaa_run_job:
d #597f9 [ 40.42] * finalizing job constraints
d #597f9 [ 40.42] * set min_cpus to ntasks: 1
t #597f9 [ 40.42] <- slurmdrmaa_parse_native
ORA-24550: signal received: [si_signo=11] [si_errno=0] [si_code=1] [si_int=0] [si_ptr=(nil)] [si_addr=0x1656]
kpedbg_dmp_stack()+394<-kpeDbgCrash()+204<-kpeDbgSignalHandler()+113<-skgesig_sigactionHandler()+258<-__sighandler()<-0x00007F06CFEC9B71<-slurm_pack_selected_step()+1286<-slurm_send_node_msg()+505<-slurm_send_recv_msg()+66<-slurm_send_recv_controller_msg()+315<-slurm_submit_batch_job()+119<-slurmdrmaa_session_run_bulk()+518<-slurmdrmaa_session_run_job()+179<-drmaa_run_job()+374<-_ZN19custom_code::submit_jobERKN5boost10filesystem4pathES4_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESC_bb()+4407<-0x0000000000000009<-0x7453705F6D00626F
runscript.sh: line 62: 366577 Segmentation fault
Stack trace from gdb:
Stack trace of thread 366585:
#0 0x00007f06d1914fe1 raise (libpthread.so.0 + 0x13fe1)
#1 0x00007f06c254893f skgesigOSCrash (libclntsh.so + 0x267293f)
#2 0x00007f06c2c63cdd kpeDbgSignalHandler (libclntsh.so + 0x2d8dcdd)
#3 0x00007f06c2548c12 skgesig_sigactionHandler (libclntsh.so + 0x2672c12)
#4 0x00007f06d1915140 __restore_rt (libpthread.so.0 + 0x14140)
#5 0x00007f06cfec9b71 __strlen_avx2 (libc.so.6 + 0x15fb71)
#6 0x00007f06d0467cb3 n/a (libslurm.so.36 + 0xf8cb3)
#7 0x00007f06d047c646 n/a (libslurm.so.36 + 0x10d646)
#8 0x00007f06d0456cf9 slurm_send_node_msg (libslurm.so.36 + 0xe7cf9)
#9 0x00007f06d0457f72 slurm_send_recv_msg (libslurm.so.36 + 0xe8f72)
#10 0x00007f06d04580db slurm_send_recv_controller_msg (libslurm.so.36 + 0xe90db)
#11 0x00007f06d03b76e7 slurm_submit_batch_job (libslurm.so.36 + 0x486e7)
#12 0x00007f06d05414f1 slurmdrmaa_session_run_bulk (libdrmaa.so.1 + 0xb4f1)
#13 0x00007f06d054123b slurmdrmaa_session_run_job (libdrmaa.so.1 + 0xb23b)
#14 0x00007f06d055c133 drmaa_run_job (libdrmaa.so.1 + 0x26133)
#15 0x000056442ad0bf37 n/a (XXX + 0xd1f37)
#16 0x0000000000000009 n/a (n/a + 0x0)
Any advice would be greatly appreciated.
If the Slurm config setting PropagatePrioProcess is set, jobs submitted via slurm-drmaa emit the error:
slurmstepd: error: Couldn't find SLURM_PRIO_PROCESS in environment
Slurm checks for this environment variable, and sbatch sets it, so slurm-drmaa needs to set it as well.
This still requires confirmation and a proper backtrace, but when submitting over 10,000 jobs on a cluster with a 10,000-job limit, a segfault was seen.
Got one more segmentation fault when specifying --time 1:00:00 for one hour. It seems to be just an alias problem, since -t 1:00:00 works.
...
d #e494 [ 0.03] * # Native specification: --cpus-per-task=2 --nodes=1 --mem-per-cpu=50 --partition=htc --time 1:00:00 --tmp=100
t #e494 [ 0.03] -> slurmdrmaa_parse_native
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * # cpus_per_task = 2
t #e494 [ 0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * nodes: 1 ->
d #e494 [ 0.03] * # min_nodes = 1
t #e494 [ 0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * # pn_min_memory (MEM_PER_CPU) = 50
t #e494 [ 0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * # partition = htc
t #e494 [ 0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * # time_limit = (null)
t #e494 [ 0.03] -> slurmdrmaa_datetime_parse((null))
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6b20faf in __strlen_sse42 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff6b20faf in __strlen_sse42 () from /lib64/libc.so.6
#1 0x00007fffed955284 in slurmdrmaa_datetime_parse (string=0x0) at util.c:53
#2 0x00007fffed956295 in slurmdrmaa_add_attribute (job_desc=0x7fffffff9e10, attr=19, value=0x0) at util.c:292
#3 0x00007fffed956c19 in slurmdrmaa_parse_additional_attr (job_desc=0x7fffffff9e10, add_attr=0x7ac3bf "time", clusters_opt=0x7fffffff8af0) at util.c:427
#4 0x00007fffed9570f8 in slurmdrmaa_parse_native (job_desc=0x7fffffff9e10, value=0x79f8b0 "--cpus-per-task=2 --nodes=1 --mem-per-cpu=50 --partition=htc --time 1:00:00 --tmp=100") at util.c:502
#5 0x00007fffed95462e in slurmdrmaa_job_create (session=0x641ad0, jt=0x7e3570, envp=0x7fffffffa0f8, expand=0x771280, job_desc=0x7fffffff9e10) at job.c:701
#6 0x00007fffed952d3b in slurmdrmaa_job_create_req (session=0x641ad0, jt=0x7e3570, envp=0x7fffffffa0f8, job_desc=0x7fffffff9e10) at job.c:302
#7 0x00007fffed954af4 in slurmdrmaa_session_run_bulk (self=0x641ad0, jt=0x7e3570, start=1, end=2, incr=1) at session.c:126
#8 0x00007fffed96facb in drmaa_run_bulk_jobs (job_ids=0x7fffeea84a28, jt=0x7e3570, start=1, end=2, incr=1, error_diagnosis=0x732960 "", error_diag_len=1024) at drmaa_base.c:427
#9 0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#10 0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed96f8e3 <drmaa_run_bulk_jobs>, rvalue=<optimized out>, avalue=0x7fffffffa330) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#11 0x00007fffeffe483c in _call_function_pointer (argcount=7, resmem=0x7fffffffa380, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffa330, pProc=0x7fffed96f8e3 <drmaa_run_bulk_jobs>, flags=4353)
at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#12 _ctypes_callproc (pProc=0x7fffed96f8e3 <drmaa_run_bulk_jobs>, argtuple=0x7fffffffa4f0, flags=4353, argtypes=<optimized out>, restype=0x7ffff0236158, checker=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#13 0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#14 0x00007ffff793fe96 in PyObject_Call (func=0x7fffeea66e58, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#15 0x00007ffff7a20236 in do_call_core (kwdict=0x0, callargs=<optimized out>, func=0x7fffeea66e58) at Python/ceval.c:5067
#16 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3366
#17 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff0220390, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=6, kwnames=0x0, kwargs=0x7e61c8, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0,
name=0x7ffff7f66308, qualname=0x7ffff7f66308) at Python/ceval.c:4128
#18 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=6, stack=<optimized out>, func=0x7fffeea7f840) at Python/ceval.c:4939
#19 call_function (pp_stack=0x7fffffffaa08, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#20 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#21 0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd92200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#22 0x00007ffff7978f16 in listextend (self=0x7fffeea8ee88, b=<optimized out>) at Objects/listobject.c:857
#23 0x00007ffff7979398 in list_init (self=0x7fffeea8ee88, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#24 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8e908, kwds=0x0) at Objects/typeobject.c:915
#25 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#26 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffad48, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#27 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#28 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01ff420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9cb58, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0,
name=0x7ffff7ea3d70, qualname=0x7fffefd8f300) at Python/ceval.c:4128
#29 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea84400) at Python/ceval.c:4939
#30 call_function (pp_stack=0x7fffffffafe8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#31 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#32 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1c930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0)
at Python/ceval.c:4128
#33 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#34 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#35 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffb340, locals=0x7ffff7f5df30, globals=0x7ffff7f5df30, filename=0x7ffff7ea3970, mod=0x6857d8) at Python/pythonrun.c:980
#36 PyRun_FileExFlags (fp=0x6438d0, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5df30, locals=0x7ffff7f5df30, closeit=<optimized out>, flags=0x7fffffffb340) at Python/pythonrun.c:933
#37 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x6438d0, filename=<optimized out>, closeit=1, flags=0x7fffffffb340) at Python/pythonrun.c:396
#38 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffb340, filename=0x603310 L"test_drmaa.py", fp=0x6438d0) at Modules/main.c:338
#39 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#40 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69
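Note the native specification in frame #4: `--time 1:00:00` uses a space rather than `=`. The trace shows `time_limit = (null)` followed by `slurmdrmaa_datetime_parse((null))`, which calls `strlen(NULL)` and crashes. The C code isn't shown here, but the behavior is consistent with the parser splitting the native spec on whitespace and taking a long option's value only from after `=`. A rough Python model of that suspected tokenization (`parse_native` here is a hypothetical sketch, not the actual slurm-drmaa code):

```python
def parse_native(spec):
    """Hypothetical model of the native-spec tokenizer: split on
    whitespace and take a long option's value from after '='.  With
    '--time 1:00:00' there is no '=', so the value comes back as None --
    the Python analogue of the NULL that slurmdrmaa_datetime_parse
    then strlen()s, producing the SIGSEGV above."""
    attrs = {}
    for token in spec.split():
        if token.startswith("--"):
            name, _, value = token[2:].partition("=")
            attrs[name] = value or None
    return attrs

print(parse_native("--time=1:00:00"))  # {'time': '1:00:00'}
print(parse_native("--time 1:00:00"))  # {'time': None}
```

If this model is right, writing `--time=1:00:00` in the native specification should avoid the crash, and a NULL check before `slurmdrmaa_datetime_parse` would turn the segfault into a proper error message.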
Hi,
I'm using slurm-drmaa to submit a job and I get the error below:
d #89f27 [ 0.00] * # Native specification: --time=1:00:00 --ntasks=1 --gres=gpu:1 --cpus-per-task=2 --nodes=1 --account=xxx@yyy --partition=mypartition
t #89f27 [ 0.00] -> slurmdrmaa_parse_native
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # time_limit = 1:00:00
t #89f27 [ 0.00] -> slurmdrmaa_datetime_parse(1:00:00)
d #89f27 [ 0.00] * parsed: 0000-00-00 01:00:00 +00:00:00 [---hms-]
t #89f27 [ 0.00] <- slurmdrmaa_datetime_parse(1:00:00) =60 minutes
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # ntasks = 1
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # gres = gpu:1
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # cpus_per_task = 2
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * nodes: 1 ->
d #89f27 [ 0.00] * # min_nodes = 1
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # account = xxx@yyy
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # partition = mypartition
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * finalizing job constraints
d #89f27 [ 0.00] * set min_cpus to ntasks*cpus_per_task: 2
t #89f27 [ 0.00] <- slurmdrmaa_parse_native
E #89f27 [ 4.24] * fsd_exc_new(1016,slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification,1)
t #89f27 [ 4.24] -> slurmdrmaa_free_job_desc
t #89f27 [ 4.24] <- slurmdrmaa_free_job_desc
t #89f27 [ 4.24] <- drmaa_run_job=17: slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification
Traceback (most recent call last):
...
File "/.../python3.6/site-packages/drmaa/session.py", line 314, in runJob
c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
File "/.../python3.6/site-packages/drmaa/helpers.py", line 302, in c
return f(*(args + (error_buffer, sizeof(error_buffer))))
File "/.../site-packages/drmaa/errors.py", line 151, in error_check
raise _ERRORS[code - 1](error_string)
drmaa.errors.DeniedByDrmException: code 17: slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification
The same job without `--gres=gpu:1` works fine.
slurm-drmaa version is 1.1.3 and slurm version is 21.08.6. OS is RHEL 8.4.
Any hint would be greatly appreciated,
Kimchi
Hi,
I am experiencing an issue similar to #19. It appears the slurm shared libraries specified by `--with-slurm-lib` cannot be found when loading `conftest` at runtime during the `configure` script.
I believe the issue is that while `LD_LIBRARY_PATH` is set in `ax_slurm.m4`, it is never exported. You can see how this was done in cURL: curl/curl@302d537.
I've tried this and it does fix the issue. Alternatively, while looking into this I've found it suggested that using rpath is better practice as it's more constrained. I was also able to run `./configure` successfully by setting the rpath as shown here: reid-wagner@67c7f6e.
If you want to go that path I'd be glad to open a PR. I haven't been able to test compilation yet for a few reasons, one being that I'm encountering an unrelated compilation issue on master.
The above issue happens with slurm-drmaa 1.1.1 and gcc 4.8.5 on CentOS 7.8.2003.
Additionally, it's worth mentioning that out of the box 1.1.1 configured and compiled on my Ubuntu machine with gcc 9.3.0. I actually grabbed the conftest.c source from config.log and compiled it on both machines. On the Ubuntu machine it appears that the dependency on libslurm was stripped from the ELF, I guess because it's optimized out. On the CentOS machine the dependency is there. So on the Ubuntu machine it wasn't actually testing that the libraries could be found at runtime.
Thanks for taking a look.
Below is the error from config.log. I modified the paths:
configure:14098: checking for usable SLURM libraries/headers
configure:14119: gcc -std=gnu99 -o conftest -pedantic -std=c99 -g -O2 -pthread -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -I/path/to/include/ -L/path/to/lib/ conftest.c -lslurm -lslurm >&5
configure:14119: $? = 0
configure:14119: ./conftest
./conftest: error while loading shared libraries: libslurm.so.35: cannot open shared object file: No such file or directory
configure:14119: $? = 127
configure: program exited with status 127
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "DRMAA for Slurm"
| #define PACKAGE_TARNAME "slurm-drmaa"
| #define PACKAGE_VERSION "1.1.1"
| #define PACKAGE_STRING "DRMAA for Slurm 1.1.1"
| #define PACKAGE_BUGREPORT "[email protected]"
| #define PACKAGE_URL ""
| #define PACKAGE "slurm-drmaa"
| #define VERSION "1.1.1"
| #define STDC_HEADERS 1
| #define HAVE_SYS_TYPES_H 1
| #define HAVE_SYS_STAT_H 1
| #define HAVE_STDLIB_H 1
| #define HAVE_STRING_H 1
| #define HAVE_MEMORY_H 1
| #define HAVE_STRINGS_H 1
| #define HAVE_INTTYPES_H 1
| #define HAVE_STDINT_H 1
| #define HAVE_UNISTD_H 1
| #define HAVE_DLFCN_H 1
| #define LT_OBJDIR ".libs/"
| #define HAVE_PTHREAD_PRIO_INHERIT 1
| #define HAVE_LIBSLURM 1
| /* end confdefs.h. */
| #include "slurm/slurm.h"
| int
| main ()
| {
| job_desc_msg_t job_req; /*at least check for declared structs */
| return 0;
|
| ;
| return 0;
| }
configure:14134: result: no
configure:14140: error:
Slurm libraries/headers not found;
add --with-slurm-inc and --with-slurm-lib with appropriate locations.
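Until the export lands in `ax_slurm.m4`, a workaround consistent with the diagnosis above is to export `LD_LIBRARY_PATH` yourself before running `configure` (the paths below are placeholders, not real locations):

```shell
# Workaround sketch: export LD_LIBRARY_PATH so the conftest binary can
# resolve libslurm.so.* at runtime (paths are placeholders).
export LD_LIBRARY_PATH=/path/to/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
./configure --with-slurm-inc=/path/to/include --with-slurm-lib=/path/to/lib
```

This only papers over the runtime-linking test; the rpath approach referenced above avoids leaking the path into the environment of everything else the shell runs.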
Currently only Univa implements 2.2, but it brings a lot of nice improvements, and it'd help adoption if it were implemented for Slurm as well. Specification docs here.
Implementing 2.2 would be a big undertaking and as slurm-drmaa is only a side-project for me (and I'm not a C programmer by trade) I'd say it's fairly unlikely that anything will get done on this, but it's a good goal.
Hi,
I'm trying to compile slurm-drmaa-1.2.0-dev.deca826 for SLURM 17.11.5 and I'm getting errors during the "configure" step.
I'm executing "./configure --prefix=/tmp/test-slurm-drmaa --with-slurm-inc=/soft/slurm-17.11.5/include/slurm/ --with-slurm-lib=/soft/slurm-17.11.5/lib/" (my SLURM installation is in an NFS folder called "/soft") and the error is "SLURM libraries/headers not found; add --with-slurm-inc and --with-slurm-lib with appropriate locations."
Could you help me?
Thanks.
As per the sbatch documentation, it is possible to request both a minimum and a maximum number of nodes with `--nodes`:
-N, --nodes=<minnodes[-maxnodes]>
Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes.
However, slurm-drmaa doesn't support this:
Traceback (most recent call last):
File "/home/ndc/drmaa-venv/bin/sbatch-drmaa", line 20, in <module>
jobid = s.runJob(jt)
File "/home/ndc/drmaa-venv/lib/python2.7/site-packages/drmaa/session.py", line 314, in runJob
c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
File "/home/ndc/drmaa-venv/lib/python2.7/site-packages/drmaa/helpers.py", line 303, in c
return f(*(args + (error_buffer, sizeof(error_buffer))))
File "/home/ndc/drmaa-venv/lib/python2.7/site-packages/drmaa/errors.py", line 151, in error_check
raise _ERRORS[code - 1](error_string)
drmaa.errors.InvalidArgumentException: code 4: not an number: 1-1
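The "not an number: 1-1" message suggests slurm-drmaa hands the whole `--nodes` value to a single integer parser. For comparison, here is a sketch of the `minnodes[-maxnodes]` split that sbatch performs; `parse_nodes` is hypothetical Python, not the actual slurm-drmaa C code:

```python
def parse_nodes(value):
    """Split an sbatch-style node count 'minnodes[-maxnodes]' into a
    (min, max) pair.  Supporting ranges means splitting on '-' first
    instead of parsing the whole string as one integer."""
    min_part, sep, max_part = value.partition("-")
    min_nodes = int(min_part)
    max_nodes = int(max_part) if sep else min_nodes
    return min_nodes, max_nodes

print(parse_nodes("1"))    # (1, 1)
print(parse_nodes("1-4"))  # (1, 4)
```

A fix along these lines would set Slurm's `min_nodes` and `max_nodes` fields from the pair instead of rejecting the range.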