pygridtools / gridmap
Easily map Python functions onto a cluster using a DRMAA-compatible grid engine like Sun Grid Engine (SGE).
License: GNU General Public License v3.0
Currently, if you want to change something like DEFAULT_QUEUE, you have to modify the constants in the code. That's dumb, so I should change it to automatically use environment variables instead when they're set.
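A minimal sketch of that behavior, assuming a hypothetical GRIDMAP_DEFAULT_QUEUE variable name and 'all.q' as the hard-coded fallback:

import os

# Fall back to the hard-coded constant only when the environment
# variable is unset (the variable name here is hypothetical).
DEFAULT_QUEUE = os.environ.get('GRIDMAP_DEFAULT_QUEUE', 'all.q')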
If a job is stalled and we ask SGE to kill it, but for some reason the process lives on, it will cause a crash because the REP socket will end up in a bad state.
Hello...
Edit: I just realized that there is a related issue already open: "Add support for limiting the number of concurrently executing jobs" (#27). I believe that is also what I am trying to address with my approach below.
I am trying to modify gridmap's behavior. I have a bunch of jobs (~10,000 or so) that I have to run to completion. Essentially, I don't care about combining the results from the individual jobs; I just need to run each self-contained job to completion, and I would like to build on top of your infrastructure to achieve this. Also, I don't want to submit all ~10,000 jobs to the grid engine at once. Instead, I want to batch the jobs up in chunks and submit them in groups, essentially building a pool of processes (a queue would probably be a better term). When one process finishes, whether by completing or raising an exception, remove it from the job queue and add one or more new processes to the pool, based on the number of free spots left.
At minimum, I think this would involve two changes to the JobMonitor class:
class JobMonitorExt(job.JobMonitor):

    def check_if_alive(self):
        # Swallow errors from individual jobs so that one failure does
        # not bring down the whole monitor.
        try:
            super(JobMonitorExt, self).check_if_alive()
        except Exception:
            pass

    def all_jobs_done(self):
        # Check whether some job is done (completed or raised an
        # exception). If so, remove that job from the queue and add a
        # new one. Return True if there are no more jobs to process.
        raise NotImplementedError
Do you think this will work? I'd appreciate your thoughts on this.
Add a feature to grid_map that sends a short success mail upon completion of all grid_map jobs.
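A minimal sketch of such a notification, assuming a local SMTP server and hypothetical addresses (gridmap's actual mail settings may differ):

import smtplib
from email.mime.text import MIMEText

def send_success_mail(n_jobs):
    # Short notification that every grid_map job finished.
    msg = MIMEText('All {0} grid_map jobs completed successfully.'.format(n_jobs))
    msg['Subject'] = 'grid_map: all jobs finished'
    msg['From'] = 'gridmap@localhost'
    msg['To'] = 'user@localhost'
    with smtplib.SMTP('localhost') as server:
        server.send_message(msg)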
The ETS-specific clean_path function in gridmap/gridmap/data.py should be removed. It appears to no longer be necessary.
Hi Dan, while I was trying to kill (qdel) the running jobs on the grid, they would seem to be killed but would later be resubmitted. I had to repeat this five times. However, jobs that were not currently executing (still on the wait list) were killed for good with just one qdel. I think the automatic resubmit feature is being applied both to jobs that die prematurely and to jobs that are terminated on purpose. Can you please check?
python3.6/runpy.py:125: RuntimeWarning: 'gridmap.runner' found in sys.modules after import of package 'gridmap', but prior to execution of 'gridmap.runner'; this may result in unpredictable behaviour warn(RuntimeWarning(msg))
I don't know how to remove this warning, but the program looks like it is running normally.
Howdy! While evaluating all the texts yesterday afternoon/evening, I received the following error:
Traceback (most recent call last):
File "SR5_Batch.py", line 77, in <module>
results = results + gridmap.process_jobs(jobs_queue, temp_dir=TEMP_DIR, white_list=WHITE_LIST, quiet=False)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 773, in process_jobs
monitor.check(sid, jobs)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 372, in check
self.check_if_alive()
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 429, in check_if_alive
handle_resubmit(self.session_id, job, temp_dir=self.temp_dir)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 584, in handle_resubmit
job.white_list.remove(node_name)
ValueError: list.remove(x): x not in list
Nothing actually died or anything, so I just kinda left it to see what would happen, and discovered hours later that nothing had actually happened. My gridmapping just kinda hung there.
The interesting thing is that this doesn't seem to happen consistently. I was able to evaluate ~60 texts just fine, but when I went for 1,180 in batches of 100, I hit this around the 9th batch. When I can, I'll test this again to see how reproducible it is. 😕
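For what it's worth, a guard like the following (a sketch of one possible fix, not necessarily how the project resolved it) would avoid the ValueError inside handle_resubmit when the node has already been removed:

# Only drop the node if it is actually still in the white list, so a
# repeated failure on the same node cannot crash the JobMonitor.
if node_name in job.white_list:
    job.white_list.remove(node_name)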
Our RTD site is pretty bare-bones at the moment, and we need a nice tutorial, like we have for SKLL.
Currently, if DRMAA Python is installed and you don't have DRMAA_LIBRARY_PATH set (or you don't actually have the library installed because you don't have a DRM), Python will crash when you import gridmap. We should catch that exception and just switch to local mode.
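A minimal sketch of that fallback (the flag name is made up; the OSError case matches the ctypes failure seen when libdrmaa.so cannot be loaded):

# Treat any failure to import drmaa, including a missing or unloadable
# libdrmaa.so, as "no DRM available" and run jobs locally instead.
try:
    import drmaa
    DRMAA_PRESENT = True
except (ImportError, OSError, RuntimeError):
    DRMAA_PRESENT = False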
Currently, we iterate through the completed jobs one at a time and query the grid engine about each one's status to help with debugging errors. The problem is that requesting the status for each job individually takes forever. Also, if multiple jobs crash, we have to wait out the 30-second retry timeout for each job in a row. A better solution would be to retrieve all the results at once in parallel (using multiprocessing).
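A rough sketch of the parallel-retrieval idea; fetch_result is a hypothetical stand-in for the existing per-job status query and result download:

from multiprocessing import Pool

def fetch_result(job_id):
    # Hypothetical: wraps the current single-job status query and
    # result retrieval, including its retry timeout.
    raise NotImplementedError

def collect_results(job_ids, processes=8):
    # Query all finished jobs concurrently instead of one at a time,
    # so crashed jobs' timeouts overlap instead of adding up.
    with Pool(processes) as pool:
        return pool.map(fetch_result, job_ids)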
Hello. Today, this happened. 😕 Let me know if I can help with trying to reproduce this or anything.
Traceback (most recent call last):
File "SR5_Batch.py", line 77, in <module>
results = results + gridmap.process_jobs(jobs_queue, temp_dir=TEMP_DIR, white_list=WHITE_LIST, quiet=False)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 524, in process_jobs
temp_dir=temp_dir, quiet=quiet)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 239, in _submit_jobs
temp_dir=temp_dir, quiet=quiet)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 299, in _append_job_to_session
jobid = session.runJob(jt)
File "/opt/python/2.7/lib/python2.7/site-packages/drmaa/__init__.py", line 331, in runJob
_h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
File "/opt/python/2.7/lib/python2.7/site-packages/drmaa/helpers.py", line 213, in c
return f(*(args + (error_buffer, sizeof(error_buffer))))
File "/opt/python/2.7/lib/python2.7/site-packages/drmaa/errors.py", line 90, in error_check
raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
drmaa.errors.DrmCommunicationException: code 2: failed receiving gdi request response for mid=130 (got syncron message receive timeout error).
SGE_CELL is not necessarily "default". Users usually have to look in SGE_ROOT to find out what SGE_CELL is. Thanks.
Hello,
First of all thank you for a fantastic package, it is extremely useful for our research where we have to process large amounts of data. We are about to release a Python package for one of our tools and gridmap is a dependency for parallelising the most computationally intensive calculation.
Currently, the latest available gridmap package on PyPI is version 0.13.0, but it is not up to date with the code in the GitHub repository. Hence we were wondering whether it would be possible to give gridmap a version bump to 0.14.0 and upload an updated package to PyPI, so our users can make use of the latest fixes and improvements (without having to install manually from GitHub)?
Thank you very much in advance!
Kai
Implement the -S option in the native job specification
Occasionally, we get errors like this:
Error while unpickling output for pythongrid job 0 from stored with key output_71c951d7-6ba0-41a0-8267-f31ef7130c64_0
This could be caused by a problem with the cluster environment, imports or environment variables.
Try running `pythongrid.py 71c951d7-6ba0-41a0-8267-f31ef7130c64 0 /home/nlp-text/dynamic/mheilman/sklearn_wrapper /scratch/ loki.research.ets.org` to see if your job crashed before writing its output.
Check log files for more information:
stdout: /scratch/Sidewalks_1_Rider_Item_4a_train.cv.o1718937
stderr: /scratch/Sidewalks_1_Rider_Item_4a_train.cv.e1718937
Exception: must be string or buffer, not None
The strange thing is that if you run the command, you can see the job did not crash before writing its output, and if you connect to the underlying Redis server, you can see that the data is there. It's just that sometimes, for whatever reason, reading the data from the server fails (even though we have multiple retries built in to deal with synchronization problems).
Hello!
I really like your project, but I'm having trouble running your example code in examples/manual.py.
When I run it I get the promising output:
=====================================
======== Submit and Wait ========
=====================================
sending function jobs to cluster.
2015-04-03 16:19:05,742 - gridmap.job - INFO - Setting up JobMonitor on tcp://10.194.168.53:52713
The output of qstat also looks fine:
$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
423383 0.56000 gridmap_jo <my_user_name> r 04/03/2015 16:19:10 [email protected] 1
423384 0.56000 gridmap_jo <my_user_name> r 04/03/2015 16:19:10 [email protected] 1
423385 0.56000 gridmap_jo <my_user_name> r 04/03/2015 16:19:10 [email protected] 1
423386 0.56000 gridmap_jo <my_user_name> r 04/03/2015 16:19:10 [email protected] 1
As you can see, the jobs are indicated as (r)unning.
The problem, however, is that the jobs never actually seem to finish. This is odd, since the calculation takes about 10 seconds when run locally, as expected given that the function sleep_walk(10) is being called. I then modified your example to skip the sleep function and write out a file called test.txt. But nothing ever happens.
Which brings me to my second question: how do I use the JobMonitor feature? I didn't gather much information from your documentation, I'm afraid.
Any help is much appreciated. Also if there is any way I can contribute please let me know.
Kai
For systems that can only access a limited range of ports, I think it would be nice to let the user specify the port for the JobMonitor. Also, exposing the native specification would allow users to add advanced options to their jobs. I can make a PR for this, as I have already implemented both on the gridmap fork in my repository.
I lack a bit of understanding of clusters. The problem I am currently trying to solve is this: I have a cluster of containers with A running on them. A always uses one of the configuration files B(1, 2, 3, ..., n), which changes based upon the request specification. If all the nodes in the cluster are using B1 and suddenly a new request comes in which needs B2, and B2 does not exist in the list of configuration files contained in A-Manager, then it should ask an external component for the particular configuration file B2. The next step is to apply the configuration to one of the nodes in the cluster.
Can I construct a cluster manager (A-Manager) using this Python module?
Can I kill one of the nodes (an A node) in the cluster and relaunch it with a new configuration?
Dear maintainers,
I hope to reach the correct persons here. I'm trying to integrate gridmap via the Anaconda package manager for Python 3.8. However, conda only lists packages for gridmap up to Python 3.6 (https://anaconda.org/anaconda/gridmap/files), which forces me to take the workaround of using pip. This is okay for now, but I want to clean this up in the future.
Many thanks to anyone who can fix this!
=====================================
======== Submit and Wait ========
=====================================
sending function jobs to cluster
2015-07-21 18:15:13,077 - gridmap.job - INFO - Setting up JobMonitor on tcp://X.X.X.X:37905
Do I need to make sure to open certain ports?
Cf. giampaolo/psutil#594 and https://github.com/giampaolo/psutil/blob/master/HISTORY.rst#200---2014-03-10 (API Changes).
Causes an AttributeError in, e.g., line 117 in c291881.
Currently, you can run into crazy problems if you have two Python modules with the same name (that aren't in packages) that are both in your sys.path. Obviously not something that people should experience often, but I had slightly different versions of the same quick temporary script in two directories and was baffled by the errors I was getting.
My default shell is zsh. If I create a .bashrc file with an incorrect, out-of-date DRMAA_LIBRARY_PATH variable, then jobs won't run. This error shows up in the .e* log file:
Traceback (most recent call last):
File "/opt/python/2.7/lib/python2.7/runpy.py", line 151, in _run_module_as_main
mod_name, loader, code, fname = _get_module_details(mod_name)
File "/opt/python/2.7/lib/python2.7/runpy.py", line 101, in _get_module_details
loader = get_loader(mod_name)
File "/opt/python/2.7/lib/python2.7/pkgutil.py", line 464, in get_loader
return find_loader(fullname)
File "/opt/python/2.7/lib/python2.7/pkgutil.py", line 474, in find_loader
for importer in iter_importers(fullname):
File "/opt/python/2.7/lib/python2.7/pkgutil.py", line 430, in iter_importers
__import__(pkg)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/__init__.py", line 69, in <module>
from gridmap.conf import (CHECK_FREQUENCY, CREATE_PLOTS, DEFAULT_QUEUE,
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/conf.py", line 76, in <module>
import drmaa
File "build/bdist.linux-x86_64/egg/drmaa/__init__.py", line 63, in <module>
File "build/bdist.linux-x86_64/egg/drmaa/session.py", line 39, in <module>
File "build/bdist.linux-x86_64/egg/drmaa/helpers.py", line 36, in <module>
File "build/bdist.linux-x86_64/egg/drmaa/wrappers.py", line 56, in <module>
File "/opt/python/2.7/lib/python2.7/ctypes/__init__.py", line 365, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /local/research/linux/sge6_2u6/lib/lx24-amd64/libdrmaa.so: cannot open shared object file: No such file or directory
So, it looks like the grid engine is sourcing the system default shell's rc file, even though it shouldn't be sourcing anything.
A temporary solution was to just delete the .bashrc file (or update it to use the correct path as well as the .zshrc file).
Currently, _send_zmq_msg creates a new connection to the JobMonitor every time, which is really wasteful. It doesn't affect performance much, since we only check things every ~15 seconds, but it would still be nice to fix.
Relates to #84
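A minimal sketch of the reuse idea, assuming the monitor's address is known up front: create the REQ socket once and keep it for all later messages, instead of reconnecting on every call.

import zmq

class MonitorClient(object):
    """Holds one long-lived REQ socket to the JobMonitor."""

    def __init__(self, address):
        self.context = zmq.Context.instance()
        self.socket = self.context.socket(zmq.REQ)
        self.socket.connect(address)

    def send(self, msg):
        # REQ sockets enforce strict send/recv alternation, which
        # matches the existing request/reply message pattern.
        self.socket.send_pyobj(msg)
        return self.socket.recv_pyobj()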
We should start making the Conda packages ourselves and putting them up on the ets
channel.
GridMap kills all jobs if one of them fails.
In some settings, like in my gridjug add-on, exceptions are OK as other jobs would take over.
Our cluster is configured to use consumable resources with different names for hard memory limits, GPUs, etc.
Requesting these resources requires writing to the native specification property of the Job class, but the current implementation is read-only.
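A minimal sketch of the requested change (the _native_specification attribute name is an assumption about the internals, not gridmap's actual code):

class Job(object):

    def __init__(self):
        self._native_specification = ''

    @property
    def native_specification(self):
        return self._native_specification

    @native_specification.setter
    def native_specification(self, value):
        # Allow callers to request consumable resources, e.g.
        # job.native_specification = '-l h_vmem=8G'
        self._native_specification = value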
I recently received an email from a user whose cluster GridMap wasn't working on, because the Runner processes were failing to connect to the JobMonitor. It turns out that if your /etc/hosts is configured just so (and I believe it is this way on Ubuntu by default), socket.gethostbyname(socket.getfqdn()) will return 127.0.0.1, which is obviously not what we want.
I need to replace

self.ip_address = gethostbyname(self.host_name)

with something like

for _, _, _, _, (ip, _) in getaddrinfo(getfqdn(), 0):
    if ip != '127.0.0.1':
        self.ip_address = ip
        break
else:
    # for/else: runs only when no non-loopback address was found
    self.logger.warning('IP address for JobMonitor server is 127.0.0.1. '
                        'Runners on other machines will be unable to connect.')
    self.ip_address = '127.0.0.1'
Hi @dan-blanchard, I just came across the wiki page for the first time, and I would like to suggest moving it into the main docs, as it is an excellent introductory note on what GridMap actually does.
The current approach for getting memory and CPU information is very brittle.
When the jobs running on cluster nodes generate large output objects, the JobMonitor is unable to keep up with the flow of incoming information. The consequence is that the JobMonitor cannot keep all nodes running at maximum capacity (since the nodes are waiting to send their data back to the node where the JobMonitor is running), leading to an aggregated load pattern on the cluster as shown in the figure below.
In a synthetic benchmark where each job returns a large NumPy array, I found that running zdumps (as implemented in data.py) on my result takes approximately 1.78 seconds. If I disable bz2 compression, this time is reduced to 140 ms. That would be a band-aid for my specific situation, but it does not solve the problem itself.
While looking at the GridMap code, I was wondering whether it would be a lot of work to make a parallel version of the JobMonitor, so that a slow result would not block retrieval of other incoming results.
Apparently it is possible to send large NumPy arrays directly over ZMQ without going through the effort of serializing them with pickle. This might help as well, although I do not immediately see how to modify job.py such that it will know when to expect a "special" NumPy result packet.
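The raw-buffer approach would look roughly like the recipe from the pyzmq documentation: send the dtype and shape as a small JSON header, then the array's buffer itself, with no pickling on either side.

import numpy
import zmq

def send_array(socket, arr, flags=0, copy=True, track=False):
    # Header first (multipart), then the raw bytes of the array.
    header = dict(dtype=str(arr.dtype), shape=arr.shape)
    socket.send_json(header, flags | zmq.SNDMORE)
    return socket.send(arr, flags, copy=copy, track=track)

def recv_array(socket, flags=0, copy=True, track=False):
    # Rebuild the array from the header plus the received buffer.
    header = socket.recv_json(flags)
    msg = socket.recv(flags, copy=copy, track=track)
    arr = numpy.frombuffer(memoryview(msg), dtype=header['dtype'])
    return arr.reshape(header['shape'])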
Currently, job names cannot have spaces in them because of how they're passed to the grid engine. We should replace them with underscores on the fly to work around this.
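A one-function sketch of that workaround (a hypothetical helper, not gridmap's actual code):

import re

def sanitize_job_name(name):
    # Replace runs of whitespace with underscores so the grid engine
    # accepts the name.
    return re.sub(r'\s+', '_', name)

print(sanitize_job_name('my analysis job'))  # -> my_analysis_job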
Currently, since SEND_ERROR_MAIL defaults to True, if there's a problem with a job, we try to email the user about it. However, if sending that email raises an exception, the whole JobMonitor dies, thereby killing the rest of the jobs. We should make it so that doesn't happen.
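A minimal sketch of the guard; send_mail stands in for whatever function currently emails the user, so the exact signature is an assumption:

import logging

def notify_user_safely(send_mail, job):
    # Log a failed notification instead of letting the exception
    # propagate and take down the JobMonitor (and the other jobs).
    try:
        send_mail(job)
    except Exception:
        logging.getLogger(__name__).warning(
            'Could not send error email for job %r; continuing.', job,
            exc_info=True)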
I am trying to get the Travis plan to pass, but I am not having much luck. Here's what I have tried so far in the fix-travis-ci branch: adding travis_wait to the nosetests command, since I am not sure whether the command takes a long time or whether things are just hanging. Any help would be greatly appreciated.
Python 3.7.6, running example:
sending function jobs to cluster
Traceback (most recent call last):
File "testb.py", line 113, in <module>
main()
File "testb.py", line 105, in main
job_outputs = process_jobs(functionJobs, max_processes=4)
File "/idiap/temp/rbraun/programs/anaconda3/lib/python3.7/site-packages/gridmap/job.py", line 887, in process_jobs
with JobMonitor(temp_dir=temp_dir) as monitor:
File "/idiap/temp/rbraun/programs/anaconda3/lib/python3.7/site-packages/gridmap/job.py", line 299, in __init__
for _, _, _, _, (ip, _) in getaddrinfo(getfqdn(), 0):
ValueError: too many values to unpack (expected 2)
I fixed it by changing the code to

for _, _, _, _, tpl in getaddrinfo(getfqdn(), 0):
    if len(tpl) > 2:
        # skip IPv6 entries, whose sockaddr is a 4-tuple
        continue
    ip = tpl[0]
When using grid_map in my script, I found that after all the jobs have finished, they are automatically submitted again, as if grid_map were called twice in a row. I looked at the debug log and nothing seemed relevant. Can you help me get an idea of what could be wrong?
I'm not sure if this is a bug or just a limitation at this point, but if you have an executable script that defines a class in it, and then you try to pass an object of that type to grid_map, you will run into a pickling error, because it can't find the __main__ module when it tries to unpickle it.
The package copies all environment variables to the SGE node when executing the job. The proposed feature allows the user to switch this off (in which case environment variables are set solely by the node's shell environment).
Currently there's no way to tell process_jobs (or, consequently, grid_map) that you only want a certain number of jobs to execute at a time. I think this should be fairly straightforward to add to the JobMonitor class by making it keep track of how many jobs are actively running, and placing jobs on hold or suspending them (I'm not sure what the difference is in DRMAA yet) when they request their inputs and function while the concurrent job limit is exceeded.
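A rough sketch of the bookkeeping using DRMAA's control calls (whether HOLD or SUSPEND is the right action here is exactly the open question above):

import drmaa

def enforce_job_limit(session, running_ids, held_ids, max_concurrent):
    # Put excess jobs on hold...
    while len(running_ids) > max_concurrent:
        job_id = running_ids.pop()
        session.control(job_id, drmaa.JobControlAction.HOLD)
        held_ids.append(job_id)
    # ...and release held jobs again as slots free up.
    while held_ids and len(running_ids) < max_concurrent:
        job_id = held_ids.pop()
        session.control(job_id, drmaa.JobControlAction.RELEASE)
        running_ids.append(job_id)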
import gridmap

def dummyfunc(whatever):
    return whatever + 1

results = gridmap.grid_map(dummyfunc, range(10))
print results
While it's obviously best practice to have code inside main() or some other function, it might be worth noting in the documentation that since job.py imports the parent module of dummyfunc, failure to do so will result in runaway job submission. I submitted 10k of these bad boys today!
See my post in #42
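For comparison with the example above, the guarded version avoids the runaway submission, because job.py's re-import of the module no longer re-runs the submission code:

import gridmap

def dummyfunc(whatever):
    return whatever + 1

if __name__ == '__main__':
    results = gridmap.grid_map(dummyfunc, range(10))
    print(results)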
_append_job_to_session in job.py checks that temp_dir exists, but it does not check whether the directory is writable.
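A minimal sketch of the missing check (a hypothetical helper, not the existing code):

import os

def check_temp_dir(temp_dir):
    # Fail fast before submission if jobs will not be able to write here.
    if not os.path.isdir(temp_dir):
        raise OSError('temp_dir {0} does not exist'.format(temp_dir))
    if not os.access(temp_dir, os.W_OK):
        raise OSError('temp_dir {0} is not writable'.format(temp_dir))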
Handle the port-already-taken issue more gracefully, instead of dying. Also, don't output so much CherryPy logging to stdout.
The JobMonitor gets heartbeats every couple of seconds with CPU and memory usage info from all of the jobs, so it would be nice if there were an easy way to get some of that info out of it.
It would be great if you could add an optional waiting time between job submissions. It is easy to do manually (which is what I do), but perhaps it could be an option in gridmap. This is important for I/O-intensive jobs, because if I submit 2,000 jobs and they all run simultaneously, they can kill the NFS server.
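A minimal sketch of the option, assuming a hypothetical submit_delay parameter (seconds to sleep between successive submissions):

import time

def submit_with_delay(session, job_templates, submit_delay=0.5):
    # Throttle submissions so thousands of jobs do not all hit the
    # shared filesystem at the same moment.
    job_ids = []
    for jt in job_templates:
        job_ids.append(session.runJob(jt))
        time.sleep(submit_delay)
    return job_ids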
Dill is a more powerful serialization library with the ability to serialize more types of functions.
If DRMAA is not available, GridMap silently falls back to local processing. I think it would make for a nice enhancement to have a way to ensure that GridMap is actually submitting the jobs to a cluster, raising an exception if not, rather than falling back to local processing.
Currently, the code is set up so that the job runner uses the same Python interpreter as process_jobs, but we don't force the PYTHONPATH to stay the same. The fix may be as simple as doing that, but I could be wrong.
We should add an argument to process_jobs and grid_map called debug that tells the Runner processes to use debug-level logging instead of info-level.
The cleanup argument to grid_map and the Job class does not actually work when it's set to True. This is because we are trying to delete the output files from the same process that is writing to them. We could fix this by adding an extra job that executes at the end and cleans things up. Or we could specify the output path to DRMAA as /dev/null.