DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
License: GNU General Public License v3.0
Hello,
thank you for this library.
I noticed a small bug when using single quotes in arguments.
I believe that single quotes in the expanded args are not escaped (see line 615 in 665d5b5).
In my case I'm running
bash, "-c", "source /tmp/init.sh ; /bin/sort '-s' '-k' '3' '-t' '\t'"
(note that I use "," to separate the arguments; in total two arguments are passed to bash)
which will then be translated into
d #15300 [ 0.06] * # Script:
d #15300 [ 0.06] | #!/bin/bash
d #15300 [ 0.06] | bash '-c' 'source /tmp/init.sh ; /bin/sort '-s' '-k' '3' '-t' ' ''
However this fails with:
/bin/sort: option requires an argument -- 't'
Try '/bin/sort --help' for more information.
I believe what happens internally is that bash -c treats
source /tmp/init.sh ; /bin/sort
as the first string, and as per the documentation everything after it becomes an argument:
If the -c option is present, then commands are read from string. If there are arguments after the string, they are assigned to the positional parameters, starting with $0.
To solve this, I believe that the single quotes within the expanded args should be escaped (e.g. via https://creativeandcritical.net/str-replace-c).
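As a sketch of the proposed escaping, the usual POSIX-shell trick is to close the single-quoted string, emit an escaped quote, and reopen it. `shell_quote_single` below is a hypothetical helper written in Python for illustration only; the actual fix would live in the library's C job-script writer (e.g. using the str-replace routine linked above):

```python
def shell_quote_single(arg):
    """Quote arg for a POSIX shell: wrap it in single quotes and
    replace each embedded single quote with the sequence '\\''
    (close the string, add a literal escaped quote, reopen)."""
    return "'" + arg.replace("'", "'\\''") + "'"

# An argument like the one from the report keeps its inner quotes
# intact after quoting, instead of colliding with the wrapping quotes.
quoted = shell_quote_single("source /tmp/init.sh ; /bin/sort '-s' '-k' '3'")
```

With this, the generated job script would pass the whole second argument to bash -c as one string, so /bin/sort sees all of its options.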
consider this minimal example:
#!/usr/bin/env python
import drmaa
import os

def main():
    error_log = os.path.join(os.getcwd(), 'error.log')
    with drmaa.Session() as s:
        print('Creating job template')
        jt = s.createJobTemplate()
        jt.remoteCommand = "/bin/bash"
        jt.args = ['-c', "/bin/sort '-s' '-k' '3' '-t' '\t'"]
        jt.errorPath = ":" + error_log
        jobid = s.runJob(jt)
        print('Your job has been submitted with ID {0}'.format(jobid))
        retval = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        print('Job: {0} finished with status {1}'.format(retval.jobId, retval.exitStatus))
        with open(error_log, 'r') as fin:
            print(fin.read())
        print('Cleaning up')
        s.deleteJobTemplate(jt)
        os.remove(error_log)

if __name__ == '__main__':
    main()
results in
$ python foo.py
Creating job template
Your job has been submitted with ID 7990
Job: 7990 finished with status 2
/bin/sort: option requires an argument -- 't'
Try '/bin/sort --help' for more information.
Cleaning up
Hello,
I'm not sure if this is the origin of this particular bug, but I have not successfully reproduced this error on other DRMAA implementations.
I've submitted jobs to my SLURM 18.08 system where, occasionally, I get a reported "unknown signal?!". The exact same job, when resubmitted, may or may not have this issue. I cannot track down exactly what happens when this occurs or what causes it.
I have run strace on equivalent submitted jobs, one which reports the "unknown signal" versus one which exits normally, and I cannot find any discernible difference, notably even when tracing specifically for signals.
sacct reports nothing unusual, and actually seems to indicate that the job exited without issue. The sysadmin for our cluster agrees and cannot find any issue.
This could be a cluster-specific issue, a DRMAA issue, or something else entirely. If I'm looking in the wrong place, please kindly redirect me. I'm not sure where or how I could start tracking down this issue.
Thanks for your time.
Hi,
A CVE was recently disclosed affecting all previous SLURM versions, meaning that there are now only two supported SLURM versions (20.11.9 and 21.08.8). Is this package still viable with the latest version of each?
Thanks
Using 44cc67e:
t #6279 [ 5.70] <- slurmdrmaa_parse_native
d #6279 [ 5.70] * job 16484204 submitted
t #6279 [ 5.71] -> fsd_job_new(16484204_1)
t #6279 [ 5.71] <- fsd_job_new=0xc163b0: ref_cnt=1 [lock 16484204_1]
t #6279 [ 5.71] -> fsd_job_set_add(job=0xc163b0, job_id=16484204_1)
t #6279 [ 5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [ 5.71] -> fsd_job_release(0xc163b0={job_id=16484204_1, ref_cnt=2}) [unlock 16484204_1]
t #6279 [ 5.71] <- fsd_job_release
t #6279 [ 5.71] -> fsd_job_new(16484204_2)
t #6279 [ 5.71] <- fsd_job_new=0xd6acf0: ref_cnt=1 [lock 16484204_2]
t #6279 [ 5.71] -> fsd_job_set_add(job=0xd6acf0, job_id=16484204_2)
t #6279 [ 5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [ 5.71] -> fsd_job_release(0xd6acf0={job_id=16484204_2, ref_cnt=2}) [unlock 16484204_2]
t #6279 [ 5.71] <- fsd_job_release
t #6279 [ 5.71] -> fsd_job_new(16484204_3)
t #6279 [ 5.71] <- fsd_job_new=0xdffed0: ref_cnt=1 [lock 16484204_3]
t #6279 [ 5.71] -> fsd_job_set_add(job=0xdffed0, job_id=16484204_3)
t #6279 [ 5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [ 5.71] -> fsd_job_release(0xdffed0={job_id=16484204_3, ref_cnt=2}) [unlock 16484204_3]
t #6279 [ 5.71] <- fsd_job_release
t #6279 [ 5.71] -> fsd_job_new(16484204_4)
t #6279 [ 5.71] <- fsd_job_new=0xe55a70: ref_cnt=1 [lock 16484204_4]
t #6279 [ 5.71] -> fsd_job_set_add(job=0xe55a70, job_id=16484204_4)
t #6279 [ 5.71] <- fsd_job_set_add: job->ref_cnt=2
t #6279 [ 5.71] -> fsd_job_release(0xe55a70={job_id=16484204_4, ref_cnt=2}) [unlock 16484204_4]
t #6279 [ 5.71] <- fsd_job_release
t #6279 [ 5.71] -> slurmdrmaa_free_job_desc
t #6279 [ 5.71] <- slurmdrmaa_free_job_desc
t #6279 [ 5.71] <- drmaa_run_bulk_jobs =0
d #6279 [ 5.71] * fsd_exc_new(1006,Vector have no more elements.,0)
t #6279 [ 5.71] <- drmaa_get_next_job_id=25: Vector have no more elements.
t #6279 [ 5.71] -> drmaa_delete_job_template(0xe24e30)
t #6279 [ 5.71] <- drmaa_delete_job_template =0
t #6279 [ 70.34] -> drmaa_job_ps(job_id=16484204_2)
t #6279 [ 70.34] -> fsd_job_set_get(job_id=16484204_2)
t #6279 [ 70.34] <- fsd_job_set_get(job_id=16484204_2) =0xd6acf0: ref_cnt=2 [lock 16484204_2]
d #6279 [ 70.34] * job->last_update_time = 0
d #6279 [ 70.34] * updating status of job: 16484204_2
t #6279 [ 70.34] -> slurmdrmaa_job_update_status({job_id=16484204_2})
t #6279 [ 70.34] -> slurmdrmaa_set_job_id({job_id=16484204_2})
t #6279 [ 70.34] <- slurmdrmaa_set_job_id; job_id=16484204_2
E #6279 [ 70.34] * fsd_exc_new(1003,not an number: 16484204_2,1)
t #6279 [ 70.34] -> slurmdrmaa_unset_job_id({job_id=(null)})
t #6279 [ 70.34] <- slurmdrmaa_unset_job_id; job_id=16484204_2
t #6279 [ 70.34] -> fsd_job_release(0xd6acf0={job_id=16484204_2, ref_cnt=2}) [unlock 16484204_2]
t #6279 [ 70.34] <- fsd_job_release
t #6279 [ 70.34] <- drmaa_job_ps=4: not an number: 16484204_2
This causes DRMAA to drop these jobs.
The code seems to assume that job IDs have to be numeric, but Slurm uses array job IDs of the form <jobid>_<arrayid>, which in this case isn't being handled properly.
I also noticed the line "t #6279 [ 5.71] <- drmaa_run_bulk_jobs =0". Shouldn't it be =1 in this case?
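A minimal sketch of the ID parsing the library would need, written in Python for illustration (`parse_slurm_job_id` is a hypothetical helper; the real fix belongs in the C code around slurmdrmaa_set_job_id, which currently rejects anything non-numeric):

```python
def parse_slurm_job_id(job_id):
    """Split a Slurm job ID string into (base_id, array_task_id).

    Plain IDs like "16484204" yield (16484204, None); array task
    IDs like "16484204_2" yield (16484204, 2)."""
    base, sep, task = job_id.partition("_")
    return int(base), (int(task) if sep else None)
```

Accepting the underscore form would let drmaa_job_ps query array tasks instead of raising "not an number".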
Using the same approach described in #5 but running the job with:
jt.nativeSpecification = "--cpus-per-task=2000 --nodes=1 --mem-per-cpu=5000 --partition=htc --tmp=100"
Another segfault is triggered:
Program received signal SIGSEGV, Segmentation fault.
drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
297 iter_function(job_id, drmaa_job_ids_t)
(gdb) bt
#0 drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
#1 0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#2 0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed771770 <drmaa_release_job_ids>, rvalue=<optimized out>, avalue=0x7fffffffc710)
at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#3 0x00007fffeffe483c in _call_function_pointer (argcount=1, resmem=0x7fffffffc730, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffc710, pProc=0x7fffed771770 <drmaa_release_job_ids>, flags=4353)
at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#4 _ctypes_callproc (pProc=0x7fffed771770 <drmaa_release_job_ids>, argtuple=0x7fffffffc7e0, flags=4353, argtypes=<optimized out>, restype=0x7ffff7d61cd0 <_Py_NoneStruct>, checker=0x0)
at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#5 0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#6 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7fffeea66818, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#7 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffcb18, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#8 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#9 0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd90200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#10 0x00007ffff7978f16 in listextend (self=0x7fffeea7ad48, b=<optimized out>) at Objects/listobject.c:857
#11 0x00007ffff7979398 in list_init (self=0x7fffeea7ad48, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#12 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8d470, kwds=0x0) at Objects/typeobject.c:915
#13 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#14 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffce58, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#15 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#16 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01fc420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9dba0, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1,
defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x7ffff7ea3c30, qualname=0x7fffefd8d2b8) at Python/ceval.c:4128
#17 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea832f0) at Python/ceval.c:4939
#18 call_function (pp_stack=0x7fffffffd0f8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#19 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#20 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1b930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0,
kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4128
#21 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0,
kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#22 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#23 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffd450, locals=0x7ffff7f5cf30, globals=0x7ffff7f5cf30, filename=0x7ffff7ea3830, mod=0x683f58) at Python/pythonrun.c:980
#24 PyRun_FileExFlags (fp=0x64cc30, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5cf30, locals=0x7ffff7f5cf30, closeit=<optimized out>, flags=0x7fffffffd450) at Python/pythonrun.c:933
#25 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x64cc30, filename=<optimized out>, closeit=1, flags=0x7fffffffd450) at Python/pythonrun.c:396
#26 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffd450, filename=0x603310 L<error reading variable>, fp=0x64cc30) at Modules/main.c:338
#27 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#28 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69
In this case, the error "python: error: CPU count per node can not be satisfied" is still shown, but it segfaults afterwards anyway.
make[2]: Entering directory `/root/slurm-drmaa-1.1.1/slurm_drmaa'
/bin/sh ../libtool --tag=CC --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I/root/rpmbuild/BUILD/slurm-20.11.0 -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O2 -pthread -MT libdrmaa_la-drmaa.lo -MD -MP -MF .deps/libdrmaa_la-drmaa.Tpo -c -o libdrmaa_la-drmaa.lo `test -f 'drmaa.c' || echo './'`drmaa.c
libtool: compile: gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I/root/rpmbuild/BUILD/slurm-20.11.0 -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O2 -pthread -MT libdrmaa_la-drmaa.lo -MD -MP -MF .deps/libdrmaa_la-drmaa.Tpo -c drmaa.c -fPIC -DPIC -o .libs/libdrmaa_la-drmaa.o
drmaa.c: In function ‘slurmdrmaa_get_DRM_system’:
drmaa.c:65:3: error: unknown type name ‘slurm_ctl_conf_t’
slurm_ctl_conf_t * conf_info_msg_ptr = NULL;
^
drmaa.c:66:3: warning: passing argument 2 of ‘slurm_load_ctl_conf’ from incompatible pointer type [enabled by default]
if ( slurm_load_ctl_conf ((time_t) NULL, &conf_info_msg_ptr ) == -1 )
^
In file included from drmaa.c:31:0:
/root/rpmbuild/BUILD/slurm-20.11.0/slurm/slurm.h:3714:12: note: expected ‘struct slurm_conf_t **’ but argument is of type ‘int **’
extern int slurm_load_ctl_conf(time_t update_time,
^
drmaa.c:73:101: error: request for member ‘version’ in something not a structure or union
fsd_snprintf(NULL, slurmdrmaa_version, sizeof(slurmdrmaa_version)-1,"SLURM %s", conf_info_msg_ptr->version);
^
drmaa.c:74:4: warning: passing argument 1 of ‘slurm_free_ctl_conf’ from incompatible pointer type [enabled by default]
slurm_free_ctl_conf (conf_info_msg_ptr);
^
In file included from drmaa.c:31:0:
/root/rpmbuild/BUILD/slurm-20.11.0/slurm/slurm.h:3722:13: note: expected ‘struct slurm_conf_t *’ but argument is of type ‘int *’
extern void slurm_free_ctl_conf(slurm_conf_t *slurm_ctl_conf_ptr);
^
make[2]: *** [libdrmaa_la-drmaa.lo] Error 1
make[2]: Leaving directory `/root/slurm-drmaa-1.1.1/slurm_drmaa'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/slurm-drmaa-1.1.1'
make: *** [all] Error 2
Dear Developers,
We are unable to compile slurm-drmaa 1.0.7 on this version of Slurm. Is the code compatible with this version?
Please find the errors below:
util.c:175:19: error: 'job_desc_msg_t' has no member named 'gres'
fsd_free(job_desc->gres);
util.c:322:12: error: 'job_desc_msg_t' has no member named 'gres'
job_desc->gres = fsd_strdup(value);
Please advise.
Thank you,
Iten
./configure fails:
# ./configure
[...]
checking for SLURM library dir... /usr/lib
checking for slurmdb_users_get in -lslurm... yes
Using slurm libraries -lslurm
checking for usable SLURM libraries/headers... *** The SLURM test program failed to link or run. See the file config.log
*** for the exact error that occured.
no
configure: error:
Slurm libraries/headers not found;
add --with-slurm-inc and --with-slurm-lib with appropriate locations.
The relevant section in config.log:
configure:13429: ./conftest
conftest: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
conftest: error: fetch_config: DNS SRV lookup failed
conftest: error: _establish_config_source: failed to fetch config
conftest: fatal: Could not establish a configuration source
configure:13429: $? = 1
configure: program exited with status 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "DRMAA for Slurm"
| #define PACKAGE_TARNAME "slurm-drmaa"
| #define PACKAGE_VERSION "1.2.0-dev.71cd5be"
| #define PACKAGE_STRING "DRMAA for Slurm 1.2.0-dev.71cd5be"
| #define PACKAGE_BUGREPORT "[email protected]"
| #define PACKAGE_URL ""
| #define PACKAGE "slurm-drmaa"
| #define VERSION "1.2.0-dev.71cd5be"
| #define STDC_HEADERS 1
| #define HAVE_SYS_TYPES_H 1
| #define HAVE_SYS_STAT_H 1
| #define HAVE_STDLIB_H 1
| #define HAVE_STRING_H 1
| #define HAVE_MEMORY_H 1
| #define HAVE_STRINGS_H 1
| #define HAVE_INTTYPES_H 1
| #define HAVE_STDINT_H 1
| #define HAVE_UNISTD_H 1
| #define HAVE_DLFCN_H 1
| #define LT_OBJDIR ".libs/"
| #define HAVE_PTHREAD_PRIO_INHERIT 1
| #define HAVE_LIBSLURM 1
| /* end confdefs.h. */
| #include "slurm/slurm.h"
| int
| main ()
| {
| job_desc_msg_t job_req; /*at least check for declared structs */
| return 0;
|
| ;
| return 0;
| }
configure:13444: result: no
configure:13450: error:
It looks like this is the culprit:
conftest: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
conftest: error: fetch_config: DNS SRV lookup failed
conftest: error: _establish_config_source: failed to fetch config
conftest: fatal: Could not establish a configuration source
It worked with the previous version that I used (20.02).
It looks like there now is a slurm_library_init, and this subsequently requires a working Slurm configuration and installation to run.
The slurm-drmaa.spec file supplied in the release bundle is missing functionality to support the configure options outlined in:
https://github.com/natefoo/slurm-drmaa/blob/main/README.md
Namely:
Notable ./configure script options:
--with-slurm-inc SLURM_INCLUDE_PATH
Path to Slurm header files (i.e. directory containing slurm/slurm.h ). By default the library tries to guess the SLURM_INCLUDE_PATH and SLURM_LIBRARY_PATH based on location of the srun executable.
--with-slurm-lib SLURM_LIBRARY_PATH
Path to Slurm libraries (i.e. directory containing libslurm.a ).
--prefix INSTALLATION_DIRECTORY
Root directory where PSNC DRMAA for Slurm shall be installed. When not given library is installed in /usr/local.
--enable-debug
Compiles library with debugging enabled (with debugging symbols not stripped, without optimizations, and with many log messages enabled). Useful when you are to debug DRMAA enabled application or investigate problems with DRMAA library itself.
The following patch enables this functionality for rpmbuild using rpmmacros:
--- slurm-drmaa-1.1.3/slurm-drmaa.spec 2022-01-04 16:48:41.991693930 +0000
+++ slurm-drmaa.spec 2022-01-04 16:37:09.449491395 +0000
@@ -31,7 +31,10 @@
RPM_OPT_FLAGS=`echo "$RPM_OPT_FLAGS" | sed -e 's/-O2 /-O0 /'`
CFLAGS="$RPM_OPT_FLAGS"
export CFLAGS
-%configure
+%configure \
+ %{?_with_slurm_lib:--with-slurm-lib=%{_with_slurm_lib}} \
+ %{?_with_slurm_inc:--with-slurm-inc=%{_with_slurm_inc}} \
+ %{?_enable_debug:--enable-debug}
%install
rm -rf "$RPM_BUILD_ROOT"
I followed the instructions given on the GitHub page. I am using Slurm version 20.11.7 on an AWS parallel cluster.
export SLURM_INCLUDE_PATH=/opt/slurm/include/slurm
export SLURM_LIBRARY_PATH=/opt/slurm/lib
export LD_LIBRARY_PATH=/opt/slurm/lib
./configure && make
The error I get is as follows:
[ec2-user@ip-XX-XX-XX-XX slurm-drmaa-1.1.2]$ ./configure && make
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for style of include used by make... GNU
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking dependency style of gcc... gcc3
checking for ar... ar
checking the archiver (ar) interface... ar
checking for gcc... (cached) gcc
checking whether we are using the GNU C compiler... (cached) yes
checking whether gcc accepts -g... (cached) yes
checking for gcc option to accept ISO C89... (cached) none needed
checking dependency style of gcc... (cached) gcc3
checking for gcc option to accept ISO C99... none needed
checking how to run the C preprocessor... gcc -E
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /bin/sed
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for fgrep... /bin/grep -F
checking for ld used by gcc... /bin/ld
checking if the linker (/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /bin/nm -B
checking the name lister (/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert x86_64-unknown-linux-gnu file names to x86_64-unknown-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-unknown-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for archiver @file support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for mt... no
checking if : is a manifest tool... no
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking whether make sets $(MAKE)... (cached) yes
checking whether ln -s works... yes
checking whether gcc accepts -Wno-missing-field-initializers... yes
checking whether gcc accepts -Wno-format-zero-length... yes
checking whether gcc is Clang... no
checking whether pthreads work with -pthread... yes
checking for joinable pthread attribute... PTHREAD_CREATE_JOINABLE
checking whether more special flags are required for pthreads... no
checking for PTHREAD_PRIO_INHERIT... yes
configure: checking for SLURM
checking for SLURM compile flags... -I/opt/slurm/include
checking for SLURM library dir... /opt/slurm/lib
checking for slurmdb_users_get in -lslurm... yes
Using slurm libraries -lslurm
checking for usable SLURM libraries/headers... yes
checking for ANSI C header files... (cached) yes
checking whether time.h and sys/time.h may both be included... yes
checking for stdbool.h that conforms to C99... yes
checking for _Bool... yes
checking stddef.h usability... yes
checking stddef.h presence... yes
checking for stddef.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for strings.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking for size_t... yes
checking whether struct tm is in sys/time.h or time.h... time.h
checking for an ANSI C-conforming const... yes
checking for inline... inline
checking for working volatile... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
checking for strftime... yes
checking for vprintf... yes
checking for _doprnt... no
checking for asprintf... yes
checking for fstat... yes
checking for getcwd... yes
checking for gettimeofday... yes
checking for localtime_r... yes
checking for memset... yes
checking for mkstemp... yes
checking for setenv... yes
checking for strcasecmp... yes
checking for strchr... yes
checking for strdup... yes
checking for strerror... yes
checking for strlcpy... no
checking for strndup... yes
checking for strstr... yes
checking for strtol... yes
checking for vasprintf... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating test/Makefile
config.status: creating slurm_drmaa/Makefile
config.status: creating config.h
config.status: config.h is unchanged
config.status: executing depfiles commands
config.status: executing libtool commands
=== configuring in drmaa_utils (/home/ec2-user/slurm-drmaa-1.1.2/drmaa_utils)
configure: WARNING: no configuration information is in drmaa_utils
Run 'make' now.
make all-recursive
make[1]: Entering directory `/home/ec2-user/slurm-drmaa-1.1.2'
Making all in drmaa_utils
make[2]: Entering directory `/home/ec2-user/slurm-drmaa-1.1.2/drmaa_utils'
make[2]: *** No rule to make target `all'. Stop.
make[2]: Leaving directory `/home/ec2-user/slurm-drmaa-1.1.2/drmaa_utils'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/ec2-user/slurm-drmaa-1.1.2'
make: *** [all] Error 2
Hi
We have several web applications submitting jobs to SGE using the Java-based Opal Toolkit (https://sourceforge.net/projects/opaltoolkit/) via drmaa.jar.
In the process of migrating from SGE to Slurm, I realized that the JNI methods do not exist in the slurm-drmaa library. Do you know of any out-of-the-box solution for submitting jobs via drmaa.jar to Slurm?
Best regards,
Guilhem
Is it possible to build RPMs from the source build?
Instructions?
The Galaxy Project Depot certificate has expired today:
pub rsa4096 2014-09-09 [SC] [expired: 2021-07-29]
4AAF 9274 AC74 7FFB 1627 6F99 5262 6447 751B 835F
uid [ expired] Nathan Coraor [email protected]
uid [ expired] Nathan Coraor [email protected]
As of Slurm 17.11 there are new job states that slurm-drmaa is not handling. These include JOB_OOM (OUT_OF_MEMORY), among others. As a result, the DRMAA state will be UNDETERMINED until the job exceeds Slurm's MinJobAge.
We need to determine whether JOB_OOM is only returned when the job is terminal, or at any time that the OOM killer has activated.
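One possible shape of the fix, sketched in Python for illustration: map each Slurm base state to a DRMAA program status, with unmapped states falling through to UNDETERMINED. The state names follow slurm.h, but mapping the newer terminal states to FAILED is an assumption of this sketch, not the library's current behavior:

```python
# Slurm base job states (names as in slurm.h) mapped to DRMAA job
# states. JOB_BOOT_FAIL, JOB_DEADLINE, and JOB_OOM are newer terminal
# states; anything unmapped falls through to UNDETERMINED, which is
# effectively what slurm-drmaa reports for them today.
SLURM_TO_DRMAA = {
    "JOB_PENDING":   "QUEUED_ACTIVE",
    "JOB_RUNNING":   "RUNNING",
    "JOB_SUSPENDED": "USER_SUSPENDED",
    "JOB_COMPLETE":  "DONE",
    "JOB_CANCELLED": "FAILED",
    "JOB_FAILED":    "FAILED",
    "JOB_TIMEOUT":   "FAILED",
    "JOB_NODE_FAIL": "FAILED",
    "JOB_BOOT_FAIL": "FAILED",
    "JOB_DEADLINE":  "FAILED",
    "JOB_OOM":       "FAILED",
}

def drmaa_state(slurm_state):
    """Return the DRMAA status for a Slurm base state name."""
    return SLURM_TO_DRMAA.get(slurm_state, "UNDETERMINED")
```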
Hi,
I tried the native slurm-drmaa package, but this led to Galaxy handlers failing to restart. The handler logs complained about slurm-drmaa.
sudo apt install slurm-drmaa1
Is yours (I'll use the Ubuntu Launchpad repo, thanks) more likely to be compatible with Ubuntu 20.04? Galaxy version is 20.05.
Thanks
I'm trying to install slurm-drmaa version 1.1.2 along with Slurm version 19.05.8-1 on a CentOS 7.9 machine. It fails with package dependency issues, requiring the libslurm.so.31()(64bit) and libslurmdb.so.31()(64bit) libraries, whereas slurm-19.05.8-1 provides libslurm.so.34.
[root@test-vm111 ~]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
Installed slurm packages:
[root@test-vm111 yum.repos.d]# rpm -qa | grep ^slurm | sort
slurm-19.05.8-1.el7.x86_64
slurm-contribs-19.05.8-1.el7.x86_64
slurm-devel-19.05.8-1.el7.x86_64
slurm-example-configs-19.05.8-1.el7.x86_64
slurm-libpmi-19.05.8-1.el7.x86_64
slurm-openlava-19.05.8-1.el7.x86_64
slurm-pam_slurm-19.05.8-1.el7.x86_64
slurm-perlapi-19.05.8-1.el7.x86_64
slurm-slurmctld-19.05.8-1.el7.x86_64
slurm-slurmd-19.05.8-1.el7.x86_64
slurm-slurmdbd-19.05.8-1.el7.x86_64
slurm-torque-19.05.8-1.el7.x86_64
Issue with slurm-drmaa 1.1.2 installation
[root@test-vm111 yum.repos.d]# yum install slurm-drmaa
Loaded plugins: langpacks, nvidia
Resolving Dependencies
--> Running transaction check
---> Package slurm-drmaa.x86_64 0:1.1.2-1.el7 will be installed
--> Processing Dependency: libslurmdb.so.31()(64bit) for package: slurm-drmaa-1.1.2-1.el7.x86_64
--> Processing Dependency: libslurm.so.31()(64bit) for package: slurm-drmaa-1.1.2-1.el7.x86_64
--> Finished Dependency Resolution
Error: Package: slurm-drmaa-1.1.2-1.el7.x86_64 (slurm-19.05)
Requires: libslurmdb.so.31()(64bit)
Error: Package: slurm-drmaa-1.1.2-1.el7.x86_64 (slurm-19.05)
Requires: libslurm.so.31()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
libslurm.so library available on the machine:
[root@test-vm111 ~]# locate libslurm.so
/usr/lib64/libslurm.so
/usr/lib64/libslurm.so.34
/usr/lib64/libslurm.so.34.0.0
[root@test-vm111 yum.repos.d]# rpm -qf /usr/lib64/libslurm.so.34
slurm-19.05.8-1.el7.x86_64
Can anyone look into this issue?
Hi,
Thanks for merging my previous fix. This one is in a similar vein.
On line 134 of slurm_drmaa/job.c, any problem when updating the job status is reported back as FSD_ERRNO_INTERNAL_ERROR. The specific issue here is that the caller would like to know whether the error is intermittent (e.g. a network time-out), so that the job status can possibly be queried successfully in a few minutes, or whether the problem is terminal and the job is dead. I've prepared a complementary patch to Snakemake to handle FSD_ERRNO_DRM_COMMUNICATION_FAILURE as an intermittent fault and to keep polling the job.
Really, the DRMAA library should make a better attempt to convert Slurm errors to meaningful DRMAA error codes, but this is a start.
Let me know if you'd prefer me to submit this stuff elsewhere. It's hard to see who is maintaining the definitive slurm-drmaa libs just now.
*** tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c.orig 2016-11-04 15:09:49.000000000 +0000
--- tim_testing_slurm//build/slurm-drmaa-1.2.0.2/slurm_drmaa/job.c 2017-06-09 15:05:38.000000000 +0100
***************
*** 131,138 ****
if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
self->on_missing(self);
! } else {
! fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(slurm_get_errno()), self->job_id);
}
}
if (job_info) {
--- 131,150 ----
if (_slurm_errno == ESLURM_INVALID_JOB_ID) {
self->on_missing(self);
! } else
! // We should detect the error corresponding to "Socket timed out" and report
! // it explicitly as FSD_ERRNO_TIMEOUT or maybe FSD_ERRNO_DRM_COMMUNICATION_FAILURE
! // ( I'm not sure if FSD_ERRNO_TIMEOUT is the same as DRMAA_ERRNO_EXIT_TIMEOUT,
! // which simply indicates the job is still running?? Maybe we should try it and see. )
! // To see what _slurm_errno corresponds to which message let's look at
! // 'slurm_strerror' in the slurm source code...
! // https://github.com/SchedMD/slurm/blob/master/src/common/slurm_errno.c
! if ( _slurm_errno == SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT ||
! _slurm_errno == SLURMCTLD_COMMUNICATIONS_CONNECTION_ERROR
! ) {
! fsd_exc_raise_fmt(FSD_ERRNO_DRM_COMMUNICATION_FAILURE,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
! } else {
! fsd_exc_raise_fmt(FSD_ERRNO_INTERNAL_ERROR,"slurm_load_jobs error: %s,job_id: %s", slurm_strerror(_slurm_errno), self->job_id);
}
}
if (job_info) {
Cheers,
TIM
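On the caller side, the point of distinguishing the two error codes can be sketched as a retry loop like the one below. This is a generic illustration, not the Snakemake patch itself; `DrmCommunicationError` is a stand-in for whatever exception a binding raises for FSD_ERRNO_DRM_COMMUNICATION_FAILURE (e.g. drmaa-python's DrmCommunicationException):

```python
import time

class DrmCommunicationError(Exception):
    """Stand-in for a transient DRM communication failure."""

def poll_with_retry(check, retries=5, delay=0.0):
    """Call check() and return its result, retrying only on transient
    communication errors; any other exception propagates immediately,
    since it likely means the job is dead rather than unreachable."""
    for attempt in range(retries):
        try:
            return check()
        except DrmCommunicationError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

Without the patch above, a socket time-out and a genuinely dead job both surface as the same internal error, so the caller cannot make this distinction.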
I'm testing slurm-drmaa in a container, but even when running outside of a container, whether building from source or installing via the Galaxy RPM, the binary segfaults every time I run it.
Am I missing something?
Error is:
[root@f8ddc11bc51e /]# DRMAA_LIBRARY_PATH=/usr/lib64/libdrmaa.so /usr/bin/drmaa-run
Segmentation fault (core dumped)
Backtrace shows:
[root@f8ddc11bc51e /]# export DRMAA_LIBRARY_PATH=/usr/lib64/libdrmaa.so
[root@f8ddc11bc51e /]# gdb /usr/bin/drmaa-run
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/bin/drmaa-run...Reading symbols from /usr/lib/debug/usr/bin/drmaa-run.debug...done.
done.
(gdb) run
Starting program: /usr/bin/drmaa-run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00000000004129b6 in parse_args (argc=0, argv=0x7fffffffe7a0) at drmaa_run.c:254
254 while (argc >= 0 && argv[0][0] == '-')
(gdb) backtrace
#0 0x00000000004129b6 in parse_args (argc=0, argv=0x7fffffffe7a0) at drmaa_run.c:254
#1 0x00000000004120df in main (argc=1, argv=0x7fffffffe798) at drmaa_run.c:122
(gdb)
My test setup is as follows:
Dockerfile:
$ cat Dockerfile
FROM centos:7
RUN (cd /lib/systemd/system/sysinit.target.wants/; for i in *; do [ $i == systemd-tmpfiles-setup.service ] || rm -f $i; done); \
    rm -f /lib/systemd/system/multi-user.target.wants/*; \
    rm -f /etc/systemd/system/*.wants/*; \
    rm -f /lib/systemd/system/local-fs.target.wants/*; \
    rm -f /lib/systemd/system/sockets.target.wants/*udev*; \
    rm -f /lib/systemd/system/sockets.target.wants/*initctl*; \
    rm -f /lib/systemd/system/basic.target.wants/*; \
    rm -f /lib/systemd/system/anaconda.target.wants/*
VOLUME [ "/sys/fs/cgroup" ]
RUN yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
RUN yum-config-manager --add-repo https://depot.galaxyproject.org/yum/galaxy.repo
RUN yum -y install which strace gdb
RUN debuginfo-install -y libgcc-4.8.5-44.el7.x86_64
RUN debuginfo-install -y glibc-2.17-324.el7_9.x86_64
RUN yum -y install slurm-slurmd-20.11.8 slurm-devel-20.11.8
RUN yum clean all && yum -y update
RUN yum -y install slurm-drmaa slurm-drmaa-debuginfo
RUN yum clean all && \
    rm -rf /var/cache/yum
ENTRYPOINT ["/usr/sbin/init"]
This results in a working container. When I log in to the container, I'm running:
[root@f8ddc11bc51e /]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
[root@f8ddc11bc51e7 /]# rpm -qa slurm*
slurm-slurmd-20.11.8-1.el7.x86_64
slurm-drmaa-debuginfo-1.1.2-1.el7.x86_64
slurm-20.11.8-1.el7.x86_64
slurm-devel-20.11.8-1.el7.x86_64
slurm-drmaa-1.1.2-1.el7.x86_64
[root@f8ddc11bc51e /]# yum info slurm-drmaa-1.1.2-1.el7.x86_64
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
Hi,
I've compiled the latest slurm-drmaa and tried to use the drmaa.jar from SGE so I can access the methods in Java, pointing to the compiled version of libdrmaa.so, but I'm getting: Exception in thread "main" java.lang.UnsatisfiedLinkError: com.sun.grid.drmaa.SessionImpl.nativeInit(Ljava/lang/String;)V
So I'm using:
export DRMAA_LIBRARY_PATH=/home/vargasfr/slurm_drmaa/slurm-drmaa-1.1.2/install/lib/libdrmaa.so
export CLASSPATH=/home/vargasfr/slurm_drmaa/lib/drmaa.jar:/home/vargasfr/slurm_drmaa/slurm-drmaa-1.1.2/install/lib/libdrmaa.so:/home/vargasfr/slurm_drmaa
export LD_LIBRARY_PATH=/home/vargasfr/slurm_drmaa/slurm-drmaa-1.1.2/install/lib
From my Java file I'm importing import org.ggf.drmaa.* (where this is coming from the drmaa.jar) and calling:
SessionFactory factory = SessionFactory.getFactory();
Any idea how I can make this work with Slurm?
Would it be possible to document the -D, --chdir=<directory> argument as well?
When using --mem, the value passed to Slurm is multiplied by the number of CPUs, i.e., it's treated as --mem-per-cpu. I think this was just a merge error on my part. This commit seems to have been applied to the wrong lines:
Hi,
I have a problem passing the right format for setting the hard wall clock time. I've been able to set hours and minutes, but not seconds. Is it possible?
For now I'm passing the string formatted as "HH:MM", but if I add the seconds ("HH:MM:SS") it just passes 0:00 to Slurm.
Thanks,
Alessio
I'm running into an issue on CentOS 7.9.2009 running Slurm 20.02.4 and slurm-drmaa 1.1.1. I ran configure with the '--with-slurm-inc=/opt/slurm/include' and '--with-slurm-lib=/opt/slurm/lib' options, but in the Galaxy log it can't find libslurm.so.35 despite the library being there:
Traceback (most recent call last):
File "lib/galaxy/main.py", line 298, in <module>
main()
File "lib/galaxy/main.py", line 294, in main
app_loop(args, log)
File "lib/galaxy/main.py", line 141, in app_loop
attach_to_pools=args.attach_to_pool,
File "lib/galaxy/main.py", line 108, in load_galaxy_app
**kwds
File "lib/galaxy/app.py", line 221, in __init__
self.job_manager = manager.JobManager(self)
File "lib/galaxy/jobs/manager.py", line 26, in __init__
self.job_handler = handler.JobHandler(app)
File "lib/galaxy/jobs/handler.py", line 51, in __init__
self.dispatcher = DefaultJobDispatcher(app)
File "lib/galaxy/jobs/handler.py", line 972, in __init__
self.job_runners = self.app.job_config.get_job_runner_plugins(self.app.config.server_name)
File "lib/galaxy/jobs/__init__.py", line 801, in get_job_runner_plugins
rval[id] = runner_class(self.app, runner.get('workers', JobConfiguration.DEFAULT_NWORKERS), **runner.get('kwds', {}))
File "lib/galaxy/jobs/runners/drmaa.py", line 65, in __init__
drmaa = __import__("drmaa")
File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/__init__.py", line 65, in <module>
from .session import JobInfo, JobTemplate, Session
File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/session.py", line 39, in <module>
from drmaa.helpers import (adapt_rusage, Attribute, attribute_names_iterator,
File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/helpers.py", line 36, in <module>
from drmaa.wrappers import (drmaa_attr_names_t, drmaa_attr_values_t,
File "/app/galaxy/.venv/lib/python3.6/site-packages/drmaa/wrappers.py", line 56, in <module>
_lib = CDLL(libpath, mode=RTLD_GLOBAL)
File "/usr/lib64/python3.6/ctypes/__init__.py", line 343, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libslurm.so.35: cannot open shared object file: No such file or directory
[galaxy@ip-xxxxxxxxx slurm-drmaa-1.1.1]$ ls -la /opt/slurm/lib/libslurm.*
-rw-r--r--. 1 root root 57281210 Nov 17 16:38 /opt/slurm/lib/libslurm.a
-rwxr-xr-x. 1 root root 976 Nov 17 16:38 /opt/slurm/lib/libslurm.la
lrwxrwxrwx. 1 root root 18 Nov 17 16:38 /opt/slurm/lib/libslurm.so -> libslurm.so.35.0.0
lrwxrwxrwx. 1 root root 18 Nov 17 16:38 /opt/slurm/lib/libslurm.so.35 -> libslurm.so.35.0.0
-rwxr-xr-x. 1 root root 8200504 Nov 17 16:38 /opt/slurm/lib/libslurm.so.35.0.0
After some digging I found the following in the config.log:
configure:4925: checking for gcc option to accept ISO C99
configure:5074: gcc -c -g -O2 conftest.c >&5
conftest.c:61:29: error: expected ';', ',' or ')' before 'text'
test_restrict (ccp restrict text)
^
conftest.c: In function 'main':
conftest.c:115:18: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'newvar'
char *restrict newvar = "Another string";
^
conftest.c:115:18: error: 'newvar' undeclared (first use in this function)
conftest.c:115:18: note: each undeclared identifier is reported only once for each function it appears in
conftest.c:125:3: error: 'for' loop initial declarations are only allowed in C99 mode
for (int i = 0; i < ia->datasize; ++i)
^
conftest.c:125:3: note: use option -std=c99 or -std=gnu99 to compile your code
configure:5074: $? = 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "DRMAA for Slurm"
| #define PACKAGE_TARNAME "slurm-drmaa"
| #define PACKAGE_VERSION "1.1.1"
| #define PACKAGE_STRING "DRMAA for Slurm 1.1.1"
| #define PACKAGE_BUGREPORT "[email protected]"
| #define PACKAGE_URL ""
| #define PACKAGE "slurm-drmaa"
| #define VERSION "1.1.1"
| /* end confdefs.h. */
Any ideas?
Hi, just wanted to note this here. I get the following error during compilation:
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I.. -I//include -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O0 -pthread -MT libdrmaa_la-job.lo -MD -MP -MF .deps/libdrmaa_la-job.Tpo -c job.c -fPIC -DPIC -o .libs/libdrmaa_la-job.o
job.c: In function ‘slurmdrmaa_job_control’:
job.c:104:8: error: too few arguments to function ‘slurm_kill_job2’
if(slurm_kill_job2(self->job_id, SIGKILL, 0) == -1) {
^~~~~~~~~~~~~~~
In file included from ../slurm_drmaa/job.h:29,
from job.c:31:
/usr/include/slurm/slurm.h:3531:12: note: declared here
extern int slurm_kill_job2(const char *job_id, uint16_t signal, uint16_t flags,
^~~~~~~~~~~~~~~
job.c: In function ‘slurmdrmaa_job_update_status’:
job.c:364:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
self->exit_status = -1;
~~~~~~~~~~~~~~~~~~^~~~
job.c:365:5: note: here
case JOB_FAILED:
^~~~
I added a NULL argument to the slurm_kill_job2 call and it compiled.
Hi,
When I run bulk jobs with one of the jobs lasting more than 20 minutes, the s.synchronize() function waits for a few minutes after the last job is finished and then triggers a segfault:
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
Same happens if I loop through the job ids with the s.wait() function:
for jobid in joblist:
    s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
However it works perfectly fine if jobs finish in the same order as their SLURM_ARRAY_TASK_ID:
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((${SLURM_ARRAY_TASK_ID}*300))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
No problem if jobs last only 10 minutes:
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*10))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)
I came up with this little piece of code to bypass the bug:
import time
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
for jobid in joblist:
    while (s.jobStatus(jobid)=="running"):
        time.sleep(10)
    print "job %s done" % jobid
Yields:
job 135892105_1 done
job 135892105_2 done
job 135892105_3 done
job 135892105_4 done
Hi, I have a question. If we create a new specification like -P for Project_ID which is defined as the id of the particular project like account. Can I update the DRMAA-Slurm library for this attribute by myself? Thanks.
If libdrmaa fails to contact SLURM due to system overload, a temporary network interruption or a timeout a "socket error" is sometimes seen and is immediately followed by a segfault.
This is likely to be due to improper handling of the error.
Hi!
Using the provided SPEC file (CentOS 7.x, x86_64), it looks like the generated RPM installs libdrmaa.so in /usr/lib64:
$ rpm -ql slurm-drmaa
/usr/bin/drmaa-run
/usr/bin/drmaa-run-bulk
/usr/bin/hpc-bash
/usr/include/drmaa.h
/usr/lib64/libdrmaa.a
/usr/lib64/libdrmaa.la
/usr/lib64/libdrmaa.so
/usr/lib64/libdrmaa.so.1
/usr/lib64/libdrmaa.so.1.0.8
/usr/share/doc/slurm-drmaa-1.1.2
/usr/share/doc/slurm-drmaa-1.1.2/COPYING
/usr/share/doc/slurm-drmaa-1.1.2/NEWS
/usr/share/doc/slurm-drmaa-1.1.2/README.md
/usr/share/doc/slurm-drmaa-1.1.2/slurm_drmaa.conf.example
But DRMAA_LIBRARY_PATH defaults to /usr/lib/libdrmaa.so:
$ drmaa-run
F #1d7bd [ 0.00] * Could not load DRMAA library (DRMAA_LIBRARY_PATH=/usr/lib/libdrmaa.so)
F #1d7bd [ 0.00] * Error
and:
$ strace -e open drmaa-run
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/libdrmaa.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/localtime", O_RDONLY|O_CLOEXEC) = 3
F #1d7c5 [ 0.00] * Could not load DRMAA library (DRMAA_LIBRARY_PATH=/usr/lib/libdrmaa.so)
F #1d7c5 [ 0.00] * Error
+++ exited with 255 +++
Would there be a way to make the default DRMAA_LIBRARY_PATH consistent with the RPM installation?
After:
export DRMAA_LIBRARY_PATH=~/test_drmaa/slurm-drmaa-1.2.0-dev.83fc288/slurm_drmaa/.libs/libdrmaa.so
When using libdrmaa via python
#!/usr/bin/env python
from __future__ import print_function
import os
import drmaa
LOGS = "logs/"
if not os.path.isdir(LOGS):
os.mkdir(LOGS)
s = drmaa.Session()
s.initialize()
print("Supported contact strings:", s.contact)
print("Supported DRM systems:", s.drmsInfo)
print("Supported DRMAA implementations:", s.drmaaImplementation)
print("Version", s.version)
jt = s.createJobTemplate()
jt.remoteCommand = "/usr/bin/echo"
jt.args = ["Hello", "world"]
jt.jobName = "testdrmaa"
jt.jobEnvironment = os.environ.copy()
jt.workingDirectory = os.getcwd()
jt.outputPath = ":" + os.path.join(LOGS, "job-%A_%a.out")
jt.errorPath = ":" + os.path.join(LOGS, "job-%A_%a.err")
jt.nativeSpecification = "--cpus-per-task=2 --nodes=1 --mem-per-cpu=50 --partition=htc --tmp=100"
print("Submitting", jt.remoteCommand, "with", jt.args, "and logs to", jt.outputPath)
ids = s.runBulkJobs(jt, beginIndex=1, endIndex=2, step=1)
print("Job submitted with ids", ids)
s.deleteJobTemplate(jt)
The above code fails when calling runBulkJobs.
Stack trace of the above script:
Program received signal SIGSEGV, Segmentation fault.
strlcpy (dest=dest@entry=0x7a9640 "9829091", src=0x0, size=size@entry=1024) at compat.c:50
50 while( *src && --size > 0 )
(gdb) bt
#0 strlcpy (dest=dest@entry=0x7a9640 "9829091", src=0x0, size=size@entry=1024) at compat.c:50
#1 0x00007fffed772fac in drmaa_get_next_job_id (values=0x7ac5c0, value=0x7a9640 "9829091", value_len=1024) at drmaa_base.c:297
#2 0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#3 0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed772e90 <drmaa_get_next_job_id>, rvalue=<optimized out>, avalue=0x7fffffffc6c0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#4 0x00007fffeffe483c in _call_function_pointer (argcount=3, resmem=0x7fffffffc6f0, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffc6c0, pProc=0x7fffed772e90 <drmaa_get_next_job_id>, flags=4353) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#5 _ctypes_callproc (pProc=0x7fffed772e90 <drmaa_get_next_job_id>, argtuple=0x7fffffffc7e0, flags=4353, argtypes=<optimized out>, restype=0x7ffff0212f28, checker=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#6 0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#7 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7fffeea655c0, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#8 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffcb18, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#9 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#10 0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd90200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#11 0x00007ffff7978f3e in listextend (self=0x7fffeea79d48, b=<optimized out>) at Objects/listobject.c:857
#12 0x00007ffff7979398 in list_init (self=0x7fffeea79d48, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#13 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8d470, kwds=0x0) at Objects/typeobject.c:915
#14 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#15 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffce58, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#16 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#17 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01fc420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9dba0, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x7ffff7ea3c30, qualname=0x7fffefd8d2b8) at Python/ceval.c:4128
#18 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea8c2f0) at Python/ceval.c:4939
#19 call_function (pp_stack=0x7fffffffd0f8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#20 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#21 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1b930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4128
#22 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#23 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#24 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffd450, locals=0x7ffff7f5cf30, globals=0x7ffff7f5cf30, filename=0x7ffff7ea3830, mod=0x683f58) at Python/pythonrun.c:980
#25 PyRun_FileExFlags (fp=0x64cc30, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5cf30, locals=0x7ffff7f5cf30, closeit=<optimized out>, flags=0x7fffffffd450) at Python/pythonrun.c:933
#26 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x64cc30, filename=<optimized out>, closeit=1, flags=0x7fffffffd450) at Python/pythonrun.c:396
#27 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffd450, filename=0x603310 L<error reading variable>, fp=0x64cc30) at Modules/main.c:338
#28 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#29 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69
The above code runs fine with a libdrmaa built from https://github.com/ljyanesm/slurm-drmaa
When specifying --tmp, --mem-per-cpu, and other numerical parameters, the code will segfault if a size qualifier (1M, 1G, ...) is used.
While this is supported on the command line via srun/sbatch, the drmaa code assumes a bare number is passed and segfaults on non-numeric input.
Looking into this, it seems that the exit status returned from a child process is not handled gracefully in some instances.
Ideally the exit status should be interpreted using the macros defined in sys/wait.h, but they are used only sparingly across drmaa.c. Notably, WIFEXITED is used, but not WIFSIGNALED or WTERMSIG. For example, instead of WIFSIGNALED there is a bit operation that may or may not match that macro on a particular architecture. Is there a reason these macros were not used?
This is related to Issue #26. After removing the hardcoded exit status manipulation with macros, I suddenly went from jobs reporting "unknown signal?!" to "wasTerminated" which was far more informative in terms of tracking down our issues (and ultimately solved our problem).
I realize it's strange to ask a question like this on a repository, but I've spent the past hour trying to figure it out on my own to no avail. I thought that you might be able to answer it in 30 seconds. I would greatly appreciate any help!
In essence, how do you submit multi-threaded jobs using slurm-drmaa? To be clear, I want the job to run on one node (i.e., --ntasks=1). I would use the --cpus-per-task option with srun or sbatch, but this option isn't available in the native specification for slurm-drmaa.
I've tried different combinations of --mincpus, --nodes, --ntasks-per-node, and --ntasks, but they either allow jobs to be split across multiple nodes or they fail. I've looked through the code for galaxyproject/galaxy and galaxyproject/pulsar, but I couldn't find any hints.
drmaa-run bash test.drmaa
drmaa-run: error: _parse_next_key: Parsing error at unrecognized key: PriorityWeightTRES
drmaa-run: error: Parse error in file /etc/slurm/slurm.conf line 131: "PriorityWeightTRES=CPU=1000,Mem=2000 "
drmaa-run: error: _parse_next_key: Parsing error at unrecognized key: AccountingStorageTRES
drmaa-run: error: Parse error in file /etc/slurm/slurm.conf line 186: "AccountingStorageTRES=gres/gpu,gres/mic,gres/nvme"
drmaa-run: error: Parsing error at unrecognized key: TresBillingWeights
drmaa-run: error: Parse error in file /etc/slurm/slurm.conf line 196: " TresBillingWeights="CPU=1.0,Mem=0.127G""
drmaa-run: error: Unable to establish controller machine
Segmentation fault
For Slurm we can set the maximum number of simultaneous running tasks for an array job as follows:
--array={start}-{end}:{step-size}%{maximum number of simultaneously running tasks}
But it seems slurm-drmaa doesn't support this:
slurm-drmaa/slurm_drmaa/session.c
Line 123 in 665d5b5
It looks to me like the DRMAA native specification does not support memory units, and not all time formats are supported. sbatch (at least in Slurm 20.02) can handle memory specifications such as 10G and time specifications such as days-hours.
It would be great to have support for --nice as well.
At least when I built it just now, Slurm 19.05.1 doesn't ship a libslurmdb.so, so the linking test run by configure fails. Removing the link against it here solves the issue (https://github.com/natefoo/slurm-drmaa/blob/master/m4/ax_slurm.m4#L75), but it would be good if someone else confirmed this before it's implemented.
Hi,
We are using slurm-drmaa for our Galaxy instance in our HPC environment. We recently updated to Slurm 20.11.1; however, even the latest version of slurm-drmaa is no longer compatible with it:
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I.. -I/opt/software/slurm/include/ -I../drmaa_utils/ -Wno-long-long -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -DCONFDIR=/opt/software/slurm-drmaa/etc -Wall -W -Wno-unused-parameter -Wno-format-zero-length -pedantic -std=c99 -g -O2 -pthread -MT libdrmaa_la-drmaa.lo -MD -MP -MF .deps/libdrmaa_la-drmaa.Tpo -c drmaa.c -fPIC -DPIC -o .libs/libdrmaa_la-drmaa.o
drmaa.c: In function ‘slurmdrmaa_get_DRM_system’:
drmaa.c:65:3: error: unknown type name ‘slurm_ctl_conf_t’; did you mean ‘slurm_conf_t’?
65 | slurm_ctl_conf_t * conf_info_msg_ptr = NULL;
| ^~~~~~~~~~~~~~~~
| slurm_conf_t
drmaa.c:66:44: warning: passing argument 2 of ‘slurm_load_ctl_conf’ from incompatible pointer type [-Wincompatible-pointer-types]
66 | if ( slurm_load_ctl_conf ((time_t) NULL, &conf_info_msg_ptr ) == -1 )
| ^~~~~~~~~~~~~~~~~~
| |
| int **
In file included from drmaa.c:31:
/opt/software/slurm/include/slurm/slurm.h:3715:47: note: expected ‘slurm_conf_t **’ {aka ‘struct **’} but argument is of type ‘int **’
3715 | slurm_conf_t **slurm_ctl_conf_ptr);
| ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
drmaa.c:73:101: error: request for member ‘version’ in something not a structure or union
73 | fsd_snprintf(NULL, slurmdrmaa_version, sizeof(slurmdrmaa_version)-1,"SLURM %s", conf_info_msg_ptr->version);
| ^~
drmaa.c:74:25: warning: passing argument 1 of ‘slurm_free_ctl_conf’ from incompatible pointer type [-Wincompatible-pointer-types]
74 | slurm_free_ctl_conf (conf_info_msg_ptr);
| ^~~~~~~~~~~~~~~~~
| |
| int *
In file included from drmaa.c:31:
/opt/software/slurm/include/slurm/slurm.h:3722:47: note: expected ‘slurm_conf_t *’ {aka ‘struct *’} but argument is of type ‘int *’
3722 | extern void slurm_free_ctl_conf(slurm_conf_t *slurm_ctl_conf_ptr);
| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
Is there any plan to update slurm-drmaa?
Thanks,
Cheers,
Ata
https://depot.galaxyproject.org/yum/package/slurm/18.08/7/x86_64/
The DRMAA RPM fails to install because it was not built for this Slurm version. The Slurm RPM in the repo provides libslurm.so.33, while the drmaa package in this repo requires libslurm.so.31:
Error: Package: slurm-drmaa-1.1.0-1.el7.x86_64 (slurm-18.08)
Requires: libslurm.so.31()(64bit)
Error: Package: slurm-drmaa-1.1.0-1.el7.x86_64 (slurm-18.08)
Requires: libslurmdb.so.31()(64bit)
You could try using --skip-broken to work around the problem
Do you have access to this?
Testing with the drmaa-run utility, I find that slurm-drmaa fails with the 18.08.8 release of Slurm, but the exact same procedure works fine with 18.08.7. With 18.08.8 it fails at the job run step:
E #2af1 [ 0.77] * fsd_exc_new(1001,slurm_submit_batch_job error (-1): Unspecified error,1)
t #2af1 [ 0.77] -> slurmdrmaa_free_job_desc
t #2af1 [ 0.77] <- slurmdrmaa_free_job_desc
t #2af1 [ 0.77] <- drmaa_run_job=1: slurm_submit_batch_job error (-1): Unspecified error
F #2af1 [ 0.77] * Failed to submit a job: slurm_submit_batch_job error (-1): Unspecified error
Corresponding to this part of the drmaa-run code:
/* run */
if (api.run_job(jobid, sizeof(jobid) - 1, jt, errbuf, sizeof(errbuf) - 1) != DRMAA_ERRNO_SUCCESS) {
fsd_log_fatal(("Failed to submit a job: %s ", errbuf));
exit(2); /* TODO exception */
Slurm 18.08.8 addresses a security vulnerability that exists in prior versions of Slurm.
In both source archives for the 1.2.0-dev.deca826 release, the drmaa_utils directory is empty.
Hello, and apologies if this question is in the wrong place. We are upgrading from Debian 8 to Debian 11. I am a developer with no particular background in system administration or configuration. Several weeks into a cycle of install/google-error-message/install-something-else, I have installed munge, slurm, slurm-drmaa, and bats(!). slurmctld and slurmd are now running, but calls to drmaa_run_job() result in segfaults. (The surrounding C++ code is copied from our Debian 8 host, where drmaa_run_job() runs successfully.) I'll print some debug output below, but what I'm really looking for is start-to-finish, step-by-step instructions for configuring, installing, and running whatever it takes to make Slurm usable on Debian 11. Thanks in advance.
Last few steps of debug output from drmaa_run_job:
d #597f9 [ 40.42] * finalizing job constraints
d #597f9 [ 40.42] * set min_cpus to ntasks: 1
t #597f9 [ 40.42] <- slurmdrmaa_parse_native
ORA-24550: signal received: [si_signo=11] [si_errno=0] [si_code=1] [si_int=0] [si_ptr=(nil)] [si_addr=0x1656]
kpedbg_dmp_stack()+394<-kpeDbgCrash()+204<-kpeDbgSignalHandler()+113<-skgesig_sigactionHandler()+258<-__sighandler()<-0x00007F06CFEC9B71<-slurm_pack_selected_step()+1286<-slurm_send_node_msg()+505<-slurm_send_recv_msg()+66<-slurm_send_recv_controller_msg()+315<-slurm_submit_batch_job()+119<-slurmdrmaa_session_run_bulk()+518<-slurmdrmaa_session_run_job()+179<-drmaa_run_job()+374<-_ZN19custom_code::submit_jobERKN5boost10filesystem4pathES4_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESC_bb()+4407<-0x0000000000000009<-0x7453705F6D00626F
runscript.sh: line 62: 366577 Segmentation fault
Stack trace from gdb:
Stack trace of thread 366585:
#0 0x00007f06d1914fe1 raise (libpthread.so.0 + 0x13fe1)
#1 0x00007f06c254893f skgesigOSCrash (libclntsh.so + 0x267293f)
#2 0x00007f06c2c63cdd kpeDbgSignalHandler (libclntsh.so + 0x2d8dcdd)
#3 0x00007f06c2548c12 skgesig_sigactionHandler (libclntsh.so + 0x2672c12)
#4 0x00007f06d1915140 __restore_rt (libpthread.so.0 + 0x14140)
#5 0x00007f06cfec9b71 __strlen_avx2 (libc.so.6 + 0x15fb71)
#6 0x00007f06d0467cb3 n/a (libslurm.so.36 + 0xf8cb3)
#7 0x00007f06d047c646 n/a (libslurm.so.36 + 0x10d646)
#8 0x00007f06d0456cf9 slurm_send_node_msg (libslurm.so.36 + 0xe7cf9)
#9 0x00007f06d0457f72 slurm_send_recv_msg (libslurm.so.36 + 0xe8f72)
#10 0x00007f06d04580db slurm_send_recv_controller_msg (libslurm.so.36 + 0xe90db)
#11 0x00007f06d03b76e7 slurm_submit_batch_job (libslurm.so.36 + 0x486e7)
#12 0x00007f06d05414f1 slurmdrmaa_session_run_bulk (libdrmaa.so.1 + 0xb4f1)
#13 0x00007f06d054123b slurmdrmaa_session_run_job (libdrmaa.so.1 + 0xb23b)
#14 0x00007f06d055c133 drmaa_run_job (libdrmaa.so.1 + 0x26133)
#15 0x000056442ad0bf37 n/a (XXX + 0xd1f37)
#16 0x0000000000000009 n/a (n/a + 0x0)
Any advice would be greatly appreciated.
If the Slurm config setting PropagatePrioProcess is set, jobs submitted via slurm-drmaa emit the error:
slurmstepd: error: Couldn't find SLURM_PRIO_PROCESS in environment
Slurm checks for this environment variable, and sbatch sets it, so slurm-drmaa needs to set it as well.
This still requires confirmation and a proper backtrace, but when submitting over 10,000 jobs on a cluster with a 10,000-job limit, a segfault was seen.
Got one more segmentation fault when specifying --time 1:00:00 for one hour. It seems to be just an alias problem, since -t 1:00:00 works.
...
d #e494 [ 0.03] * # Native specification: --cpus-per-task=2 --nodes=1 --mem-per-cpu=50 --partition=htc --time 1:00:00 --tmp=100
t #e494 [ 0.03] -> slurmdrmaa_parse_native
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * # cpus_per_task = 2
t #e494 [ 0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * nodes: 1 ->
d #e494 [ 0.03] * # min_nodes = 1
t #e494 [ 0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * # pn_min_memory (MEM_PER_CPU) = 50
t #e494 [ 0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * # partition = htc
t #e494 [ 0.03] <- slurmdrmaa_parse_additional_attr
t #e494 [ 0.03] -> slurmdrmaa_parse_additional_attr
d #e494 [ 0.03] * # time_limit = (null)
t #e494 [ 0.03] -> slurmdrmaa_datetime_parse((null))
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6b20faf in __strlen_sse42 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff6b20faf in __strlen_sse42 () from /lib64/libc.so.6
#1 0x00007fffed955284 in slurmdrmaa_datetime_parse (string=0x0) at util.c:53
#2 0x00007fffed956295 in slurmdrmaa_add_attribute (job_desc=0x7fffffff9e10, attr=19, value=0x0) at util.c:292
#3 0x00007fffed956c19 in slurmdrmaa_parse_additional_attr (job_desc=0x7fffffff9e10, add_attr=0x7ac3bf "time", clusters_opt=0x7fffffff8af0) at util.c:427
#4 0x00007fffed9570f8 in slurmdrmaa_parse_native (job_desc=0x7fffffff9e10, value=0x79f8b0 "--cpus-per-task=2 --nodes=1 --mem-per-cpu=50 --partition=htc --time 1:00:00 --tmp=100") at util.c:502
#5 0x00007fffed95462e in slurmdrmaa_job_create (session=0x641ad0, jt=0x7e3570, envp=0x7fffffffa0f8, expand=0x771280, job_desc=0x7fffffff9e10) at job.c:701
#6 0x00007fffed952d3b in slurmdrmaa_job_create_req (session=0x641ad0, jt=0x7e3570, envp=0x7fffffffa0f8, job_desc=0x7fffffff9e10) at job.c:302
#7 0x00007fffed954af4 in slurmdrmaa_session_run_bulk (self=0x641ad0, jt=0x7e3570, start=1, end=2, incr=1) at session.c:126
#8 0x00007fffed96facb in drmaa_run_bulk_jobs (job_ids=0x7fffeea84a28, jt=0x7e3570, start=1, end=2, incr=1, error_diagnosis=0x732960 "", error_diag_len=1024) at drmaa_base.c:427
#9 0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#10 0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed96f8e3 <drmaa_run_bulk_jobs>, rvalue=<optimized out>, avalue=0x7fffffffa330) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#11 0x00007fffeffe483c in _call_function_pointer (argcount=7, resmem=0x7fffffffa380, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffa330, pProc=0x7fffed96f8e3 <drmaa_run_bulk_jobs>, flags=4353)
at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#12 _ctypes_callproc (pProc=0x7fffed96f8e3 <drmaa_run_bulk_jobs>, argtuple=0x7fffffffa4f0, flags=4353, argtypes=<optimized out>, restype=0x7ffff0236158, checker=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#13 0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#14 0x00007ffff793fe96 in PyObject_Call (func=0x7fffeea66e58, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#15 0x00007ffff7a20236 in do_call_core (kwdict=0x0, callargs=<optimized out>, func=0x7fffeea66e58) at Python/ceval.c:5067
#16 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3366
#17 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff0220390, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=6, kwnames=0x0, kwargs=0x7e61c8, kwcount=0, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0,
name=0x7ffff7f66308, qualname=0x7ffff7f66308) at Python/ceval.c:4128
#18 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=6, stack=<optimized out>, func=0x7fffeea7f840) at Python/ceval.c:4939
#19 call_function (pp_stack=0x7fffffffaa08, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#20 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#21 0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd92200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#22 0x00007ffff7978f16 in listextend (self=0x7fffeea8ee88, b=<optimized out>) at Objects/listobject.c:857
#23 0x00007ffff7979398 in list_init (self=0x7fffeea8ee88, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#24 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8e908, kwds=0x0) at Objects/typeobject.c:915
#25 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#26 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffad48, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#27 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#28 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01ff420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9cb58, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0,
name=0x7ffff7ea3d70, qualname=0x7fffefd8f300) at Python/ceval.c:4128
#29 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea84400) at Python/ceval.c:4939
#30 call_function (pp_stack=0x7fffffffafe8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#31 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#32 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1c930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0)
at Python/ceval.c:4128
#33 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#34 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#35 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffb340, locals=0x7ffff7f5df30, globals=0x7ffff7f5df30, filename=0x7ffff7ea3970, mod=0x6857d8) at Python/pythonrun.c:980
#36 PyRun_FileExFlags (fp=0x6438d0, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5df30, locals=0x7ffff7f5df30, closeit=<optimized out>, flags=0x7fffffffb340) at Python/pythonrun.c:933
#37 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x6438d0, filename=<optimized out>, closeit=1, flags=0x7fffffffb340) at Python/pythonrun.c:396
#38 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffb340, filename=0x603310 L"test_drmaa.py", fp=0x6438d0) at Modules/main.c:338
#39 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#40 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69
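Note the native specification in frame #4: `--time 1:00:00` uses a space rather than `=`. The trace shows `time_limit = (null)` followed by `slurmdrmaa_datetime_parse((null))`, which calls `strlen(NULL)` and crashes. The C code isn't shown here, but the behavior is consistent with the parser splitting the native spec on whitespace and taking a long option's value only from after `=`. A rough Python model of that suspected tokenization (`parse_native` here is a hypothetical sketch, not the actual slurm-drmaa code):

```python
def parse_native(spec):
    """Hypothetical model of the native-spec tokenizer: split on
    whitespace and take a long option's value from after '='.  With
    '--time 1:00:00' there is no '=', so the value comes back as None --
    the Python analogue of the NULL that slurmdrmaa_datetime_parse
    then strlen()s, producing the SIGSEGV above."""
    attrs = {}
    for token in spec.split():
        if token.startswith("--"):
            name, _, value = token[2:].partition("=")
            attrs[name] = value or None
    return attrs

print(parse_native("--time=1:00:00"))  # {'time': '1:00:00'}
print(parse_native("--time 1:00:00"))  # {'time': None}
```

If this model is right, writing `--time=1:00:00` in the native specification should avoid the crash, and a NULL check before `slurmdrmaa_datetime_parse` would turn the segfault into a proper error message.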
Hi,
I'm using slurm-drmaa to submit a job and I get the error below:
d #89f27 [ 0.00] * # Native specification: --time=1:00:00 --ntasks=1 --gres=gpu:1 --cpus-per-task=2 --nodes=1 --account=xxx@yyy --partition=mypartition
t #89f27 [ 0.00] -> slurmdrmaa_parse_native
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # time_limit = 1:00:00
t #89f27 [ 0.00] -> slurmdrmaa_datetime_parse(1:00:00)
d #89f27 [ 0.00] * parsed: 0000-00-00 01:00:00 +00:00:00 [---hms-]
t #89f27 [ 0.00] <- slurmdrmaa_datetime_parse(1:00:00) =60 minutes
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # ntasks = 1
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # gres = gpu:1
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # cpus_per_task = 2
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * nodes: 1 ->
d #89f27 [ 0.00] * # min_nodes = 1
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # account = xxx@yyy
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
t #89f27 [ 0.00] -> slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * # partition = mypartition
t #89f27 [ 0.00] <- slurmdrmaa_parse_additional_attr
d #89f27 [ 0.00] * finalizing job constraints
d #89f27 [ 0.00] * set min_cpus to ntasks*cpus_per_task: 2
t #89f27 [ 0.00] <- slurmdrmaa_parse_native
E #89f27 [ 4.24] * fsd_exc_new(1016,slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification,1)
t #89f27 [ 4.24] -> slurmdrmaa_free_job_desc
t #89f27 [ 4.24] <- slurmdrmaa_free_job_desc
t #89f27 [ 4.24] <- drmaa_run_job=17: slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification
Traceback (most recent call last):
...
File "/.../python3.6/site-packages/drmaa/session.py", line 314, in runJob
c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
File "/.../python3.6/site-packages/drmaa/helpers.py", line 302, in c
return f(*(args + (error_buffer, sizeof(error_buffer))))
File "/.../site-packages/drmaa/errors.py", line 151, in error_check
raise _ERRORS[code - 1](error_string)
drmaa.errors.DeniedByDrmException: code 17: slurm_submit_batch_job error: Invalid Trackable RESource (TRES) specification
The same job without `--gres=gpu:1` works fine.
slurm-drmaa version is 1.1.3 and slurm version is 21.08.6. OS is RHEL 8.4.
Any hint would be greatly appreciated,
Kimchi
Hi,
I am experiencing an issue similar to #19. It appears the slurm shared libraries specified by `--with-slurm-lib` cannot be found when loading `conftest` at runtime during the `configure` script.
I believe the issue is that while `LD_LIBRARY_PATH` is set in `ax_slurm.m4`, it is never exported. You can see how this was done in cURL: curl/curl@302d537.
I've tried this and it does fix the issue. Alternatively, while looking into this I've found it suggested that using rpath is better practice as it's more constrained. I was also able to run `./configure` successfully by setting the rpath as shown here: reid-wagner@67c7f6e.
If you want to go that path I'd be glad to open a PR. I haven't been able to test compilation yet for a few reasons, one being that I'm encountering an unrelated compilation issue on master.
The above issue happens with slurm-drmaa 1.1.1 and gcc 4.8.5 on CentOS 7.8.2003.
Additionally, it's worth mentioning that out of the box 1.1.1 configured and compiled on my Ubuntu machine with gcc 9.3.0. I actually grabbed the conftest.c source from config.log and compiled it on both machines. On the Ubuntu machine it appears that the dependency on libslurm was stripped from the ELF, I guess because it's optimized out. On the CentOS machine the dependency is there. So on the Ubuntu machine it wasn't actually testing that the libraries could be found at runtime.
Thanks for taking a look.
Below is the error from config.log. I modified the paths:
configure:14098: checking for usable SLURM libraries/headers
configure:14119: gcc -std=gnu99 -o conftest -pedantic -std=c99 -g -O2 -pthread -D_REENTRANT -D_THREAD_SAFE -DNDEBUG -D_GNU_SOURCE -I/path/to/include/ -L/path/to/lib/ conftest.c -lslurm -lslurm >&5
configure:14119: $? = 0
configure:14119: ./conftest
./conftest: error while loading shared libraries: libslurm.so.35: cannot open shared object file: No such file or directory
configure:14119: $? = 127
configure: program exited with status 127
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "DRMAA for Slurm"
| #define PACKAGE_TARNAME "slurm-drmaa"
| #define PACKAGE_VERSION "1.1.1"
| #define PACKAGE_STRING "DRMAA for Slurm 1.1.1"
| #define PACKAGE_BUGREPORT "[email protected]"
| #define PACKAGE_URL ""
| #define PACKAGE "slurm-drmaa"
| #define VERSION "1.1.1"
| #define STDC_HEADERS 1
| #define HAVE_SYS_TYPES_H 1
| #define HAVE_SYS_STAT_H 1
| #define HAVE_STDLIB_H 1
| #define HAVE_STRING_H 1
| #define HAVE_MEMORY_H 1
| #define HAVE_STRINGS_H 1
| #define HAVE_INTTYPES_H 1
| #define HAVE_STDINT_H 1
| #define HAVE_UNISTD_H 1
| #define HAVE_DLFCN_H 1
| #define LT_OBJDIR ".libs/"
| #define HAVE_PTHREAD_PRIO_INHERIT 1
| #define HAVE_LIBSLURM 1
| /* end confdefs.h. */
| #include "slurm/slurm.h"
| int
| main ()
| {
| job_desc_msg_t job_req; /*at least check for declared structs */
| return 0;
|
| ;
| return 0;
| }
configure:14134: result: no
configure:14140: error:
Slurm libraries/headers not found;
add --with-slurm-inc and --with-slurm-lib with appropriate locations.
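Until the export lands in `ax_slurm.m4`, a workaround consistent with the diagnosis above is to export `LD_LIBRARY_PATH` yourself before running `configure` (the paths below are placeholders, not real locations):

```shell
# Workaround sketch: export LD_LIBRARY_PATH so the conftest binary can
# resolve libslurm.so.* at runtime (paths are placeholders).
export LD_LIBRARY_PATH=/path/to/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
./configure --with-slurm-inc=/path/to/include --with-slurm-lib=/path/to/lib
```

This only papers over the runtime-linking test; the rpath approach referenced above avoids leaking the path into the environment of everything else the shell runs.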
Currently only Univa implements 2.2, but it brings a lot of nice improvements, and it'd help adoption if it were implemented for Slurm as well. Specification docs here.
Implementing 2.2 would be a big undertaking and as slurm-drmaa is only a side-project for me (and I'm not a C programmer by trade) I'd say it's fairly unlikely that anything will get done on this, but it's a good goal.
Hi,
I'm trying to compile slurm-drmaa-1.2.0-dev.deca826 for SLURM 17.11.5 and I'm getting errors during the "configure" step.
I'm executing "./configure --prefix=/tmp/test-slurm-drmaa --with-slurm-inc=/soft/slurm-17.11.5/include/slurm/ --with-slurm-lib=/soft/slurm-17.11.5/lib/" (my SLURM installation is in an NFS folder called "/soft") and the error is "SLURM libraries/headers not found; add --with-slurm-inc and --with-slurm-lib with appropriate locations."
Could you help me?
Thanks.
As per the sbatch documentation, it is possible to request both a minimum and a maximum number of nodes with `--nodes`:
-N, --nodes=<minnodes[-maxnodes]>
Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes.
However, slurm-drmaa doesn't support this:
Traceback (most recent call last):
File "/home/ndc/drmaa-venv/bin/sbatch-drmaa", line 20, in <module>
jobid = s.runJob(jt)
File "/home/ndc/drmaa-venv/lib/python2.7/site-packages/drmaa/session.py", line 314, in runJob
c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
File "/home/ndc/drmaa-venv/lib/python2.7/site-packages/drmaa/helpers.py", line 303, in c
return f(*(args + (error_buffer, sizeof(error_buffer))))
File "/home/ndc/drmaa-venv/lib/python2.7/site-packages/drmaa/errors.py", line 151, in error_check
raise _ERRORS[code - 1](error_string)
drmaa.errors.InvalidArgumentException: code 4: not an number: 1-1
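The "not an number: 1-1" message suggests slurm-drmaa hands the whole `--nodes` value to a single integer parser. For comparison, here is a sketch of the `minnodes[-maxnodes]` split that sbatch performs; `parse_nodes` is hypothetical Python, not the actual slurm-drmaa C code:

```python
def parse_nodes(value):
    """Split an sbatch-style node count 'minnodes[-maxnodes]' into a
    (min, max) pair.  Supporting ranges means splitting on '-' first
    instead of parsing the whole string as one integer."""
    min_part, sep, max_part = value.partition("-")
    min_nodes = int(min_part)
    max_nodes = int(max_part) if sep else min_nodes
    return min_nodes, max_nodes

print(parse_nodes("1"))    # (1, 1)
print(parse_nodes("1-4"))  # (1, 4)
```

A fix along these lines would set Slurm's `min_nodes` and `max_nodes` fields from the pair instead of rejecting the range.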