pygridtools / gridmap

Easily map Python functions onto a cluster using a DRMAA-compatible grid engine like Sun Grid Engine (SGE).

License: GNU General Public License v3.0


gridmap's Introduction

GridMap


A package to allow you to easily create jobs on the cluster directly from Python. You can directly map Python functions onto the cluster without needing to write any wrapper code yourself.

This is the ETS fork of an older project called Python Grid. Unlike the older version, it is Python 2/3 compatible. Another major difference is that you can change the configuration via environment variables instead of having to modify a Python file in your site-packages directory. We've also fixed some bugs.

For some examples of how to use it, check out map_reduce.py (for a simple example of how you can map a function onto the cluster) and manual.py (for an example of how you can create a list of jobs yourself) in the examples folder.
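
For orientation, here is a minimal sketch of that mapping workflow. The function and its inputs are placeholders; the grid_map call itself follows the same usage that appears later on this page:

# Minimal sketch of mapping a placeholder function onto the cluster.
# grid_map handles submission, monitoring, and collecting results in order.
from gridmap import grid_map

def compute_square(x):
    return x * x

if __name__ == '__main__':
    results = grid_map(compute_square, range(10))
    print(results)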

For complete documentation read the docs.

NOTE: You cannot use GridMap on machines that are not allowed to submit jobs (e.g., worker or edge nodes).

Requirements

Acknowledgments

Thank you to Max-Planck-Society and Educational Testing Service for funding the development of GridMap.

Changelog

See GitHub releases.

gridmap's People

Contributors

andsor, bitdeli-chef, cjw85, cwidmer, dan-blanchard, desilinguist, jackkamm, mulhod, philippdre


gridmap's Issues

An issue, possibly relating to the use of a custom whitelist of machines

Howdy! While evaluating all the texts yesterday afternoon/evening, I received the following error:

Traceback (most recent call last):
  File "SR5_Batch.py", line 77, in <module>
    results = results + gridmap.process_jobs(jobs_queue, temp_dir=TEMP_DIR, white_list=WHITE_LIST, quiet=False)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 773, in process_jobs
    monitor.check(sid, jobs)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 372, in check
    self.check_if_alive()
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 429, in check_if_alive
    handle_resubmit(self.session_id, job, temp_dir=self.temp_dir)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 584, in handle_resubmit
    job.white_list.remove(node_name)
ValueError: list.remove(x): x not in list

Nothing actually died or anything, so I just kinda left it to see what would happen, and discovered hours later that nothing had actually happened. My gridmapping just kinda hung there.

The interesting thing is that this doesn't seem to happen consistently. I was able to evaluate ~60 texts just fine, but when I went for 1180, in batches of 100, I received this error around the 9th batch. When I can, I'll test this again to see how reproducible it is. 😕

Depending on /etc/hosts config, retrieving IP may not work

I recently received an email from a user about how with their cluster GridMap wasn't working because the Runner processes were failing to connect to the JobMonitor. It turns out that if your /etc/hosts is configured just so (and I believe it is this way on Ubuntu by default), socket.gethostbyname(socket.getfqdn()) will return 127.0.0.1, which is obviously not what we want.

I need to replace

self.ip_address = gethostbyname(self.host_name)

with something like

# (assuming getaddrinfo and getfqdn are imported from the socket module)
for _, _, _, _, (ip, _) in getaddrinfo(getfqdn(), 0):
    if ip != '127.0.0.1':
        self.ip_address = ip
        break
else:
    self.logger.warning('IP address for JobMonitor server is 127.0.0.1. '
                        'Runners on other machines will be unable to connect.')
    self.ip_address = '127.0.0.1'

Should prepend module path to sys.path instead of append

Currently you can run into crazy problems if you have two Python modules with the same name (that aren't in packages) and both are on your sys.path. Obviously not something that people should experience often, but I had slightly different versions of the same quick temporary script in two directories and was baffled by the errors I was getting.

Add support for limiting the number of concurrently executing jobs

Currently there's no way to tell process_jobs (or consequently grid_map) that you only want a certain number of jobs to execute at a time. I think this should be fairly straightforward to add to the JobMonitor class by making it keep track of how many jobs are actively running and place jobs on hold or suspend them—I'm not sure what the difference in DRMAA is yet—when they request their inputs and function if the concurrent job limit is exceeded.
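
A rough, hypothetical sketch of the bookkeeping this would need (independent of the actual JobMonitor internals; all names below are placeholders):

# Hypothetical sketch of a concurrency cap, not GridMap's actual API:
# only release jobs to the grid while fewer than max_concurrent are running.
from collections import deque

max_concurrent = 50
pending = deque(range(200))   # stand-ins for jobs waiting to be released
running = set()               # ids of jobs currently executing

def release_job(job_id):
    # placeholder for "submit the job" or "remove its hold"
    print('releasing job', job_id)

def maybe_release_jobs():
    while pending and len(running) < max_concurrent:
        job_id = pending.popleft()
        running.add(job_id)
        release_job(job_id)

def on_job_finished(job_id):
    running.discard(job_id)
    maybe_release_jobs()

maybe_release_jobs()          # initial fill, up to the cap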

Support virtualenvs

Currently the code is set up so that the job runner uses the same Python interpreter as process_jobs, but we don't force the PYTHONPATH to stay the same. Fixing this may be as simple as doing that, but I could be wrong.
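
One possible direction, sketched here with the drmaa-python job template rather than GridMap's own wrappers (the worker script name is hypothetical), is to forward PYTHONPATH explicitly when building the job:

# Sketch only, not GridMap's actual code: forward the submitting
# interpreter's PYTHONPATH to the job via the DRMAA job template so the
# remote runner resolves the same packages as process_jobs.
import os
import sys
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = sys.executable         # same interpreter as the submitter
    jt.args = ['my_worker_script.py']         # hypothetical worker entry point
    jt.jobEnvironment = {
        'PYTHONPATH': os.environ.get('PYTHONPATH', ''),
    }
    job_id = session.runJob(jt)
    session.deleteJobTemplate(jt)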

JobMonitor Port and Job Native Specifications.

For systems that can only access a limited range of ports, I think it would be nice to let the user specify the port for the JobMonitor.

Also, exposing the native specification should allow users to add advanced options to their jobs.

I can make a PR for it, as I currently have those implemented on the gridmap fork in my repository.

Make JobMonitor more resilient to SMTP connection problems

Currently, since SEND_ERROR_MAIL defaults to True, if there's a problem with a job, we try to email the user about it. However, if sending that email raises an exception, the whole JobMonitor dies, thereby killing the rest of the jobs. We should make it so that doesn't happen.
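
A minimal sketch of the kind of guard this would need, assuming the mail is sent with smtplib (the function and its parameters are placeholders):

# Sketch: never let a failed error-notification email take down the monitor.
import logging
import smtplib

logger = logging.getLogger('gridmap.job')

def send_error_mail_safely(smtp_host, from_addr, to_addr, message):
    try:
        with smtplib.SMTP(smtp_host) as server:
            server.sendmail(from_addr, to_addr, message)
    except Exception:
        # Log and carry on; the JobMonitor should keep the other jobs alive.
        logger.warning('Could not send error email', exc_info=True)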

updating documentation

SGE_CELL is not necessarily "default". Users usually have to check SGE_ROOT to find out what SGE_CELL is. Thanks

Replace "cleanup" argument with "quiet" that points output to /dev/null

The cleanup argument to grid_map and the Job class does not actually work when it's set to True. This is because we are trying to delete the output files from the same process that is writing to them.

We could fix this by adding an extra job that executes at the end that cleans things up. Or we could specify the output path to DRMAA as /dev/null.
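
The second option could look roughly like this with the drmaa-python job template; this is a sketch of the idea, not the current GridMap code:

# Sketch: send the job's stdout/stderr straight to /dev/null instead of
# trying to delete the output files afterwards.
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    # DRMAA paths have the form "[hostname]:path"; an empty hostname means
    # the path is interpreted on the execution host.
    jt.outputPath = ':/dev/null'
    jt.errorPath = ':/dev/null'
    jt.joinFiles = True
    # ... set remoteCommand/args and submit with session.runJob(jt) as usual ...
    session.deleteJobTemplate(jt)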

Request: Update PyPI package

Hello,

First of all thank you for a fantastic package, it is extremely useful for our research where we have to process large amounts of data. We are about to release a Python package for one of our tools and gridmap is a dependency for parallelising the most computationally intensive calculation.

Currently, the latest available gridmap package on PyPI is version 0.13.0, but the code is not up to date with the code in the GitHub repository. Hence, we were wondering if it would be possible to give gridmap a version bump to 0.14.0 and upload an updated package to PyPI, so our users can make use of the latest fixes and improvements (without having to install manually from GitHub)?

Thank you very much in advance!

Kai

grid_map submits jobs twice

When using grid_map in my script, I found that after all the jobs have finished, they are automatically submitted again, as if grid_map were called twice in succession. I looked at the debug log and nothing seemed relevant. Can you help me get an idea of what could be wrong?

Error: too many values to unpack (expected 2)

Python 3.7.6, running example:

sending function jobs to cluster

Traceback (most recent call last):
  File "testb.py", line 113, in <module>
    main()
  File "testb.py", line 105, in main
    job_outputs = process_jobs(functionJobs, max_processes=4)
  File "/idiap/temp/rbraun/programs/anaconda3/lib/python3.7/site-packages/gridmap/job.py", line 887, in process_jobs
    with JobMonitor(temp_dir=temp_dir) as monitor:
  File "/idiap/temp/rbraun/programs/anaconda3/lib/python3.7/site-packages/gridmap/job.py", line 299, in __init__
    for _, _, _, _, (ip, _) in getaddrinfo(getfqdn(), 0):
ValueError: too many values to unpack (expected 2)

I fixed it by changing the code to

for _, _, _, _, tpl in getaddrinfo(getfqdn(), 0):
    if len(tpl) > 2:
        continue
    ip = tpl[0]

sourcing .bashrc

My default shell is zsh. If I create a .bashrc file with an incorrect, out-of-date DRMAA_LIBRARY_PATH variable, then jobs won't run. This error shows up in the .e* log file:

Traceback (most recent call last):
  File "/opt/python/2.7/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/opt/python/2.7/lib/python2.7/runpy.py", line 101, in _get_module_details
    loader = get_loader(mod_name)
  File "/opt/python/2.7/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/opt/python/2.7/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/opt/python/2.7/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/__init__.py", line 69, in <module>
    from gridmap.conf import (CHECK_FREQUENCY, CREATE_PLOTS, DEFAULT_QUEUE,
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/conf.py", line 76, in <module>
    import drmaa
  File "build/bdist.linux-x86_64/egg/drmaa/__init__.py", line 63, in <module>
  File "build/bdist.linux-x86_64/egg/drmaa/session.py", line 39, in <module>
  File "build/bdist.linux-x86_64/egg/drmaa/helpers.py", line 36, in <module>
  File "build/bdist.linux-x86_64/egg/drmaa/wrappers.py", line 56, in <module>
  File "/opt/python/2.7/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /local/research/linux/sge6_2u6/lib/lx24-amd64/libdrmaa.so: cannot open shared object file: No such file or directory

So, it looks like the grid engine is sourcing the system default shell's rc file, even though it shouldn't be sourcing anything.

A temporary solution was to just delete the .bashrc file (or update it, as well as the .zshrc file, to use the correct path).

Support arbitrary DRMAA native specification

Our cluster is configured to use consumable resources with different names for hard memory limits, GPUs, etc.

Requesting these resources requires writing to the native specification property of the Job class, but the current implementation is read-only.
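
For reference, with the underlying drmaa-python job template this is a single writable attribute; the flags below are SGE-style examples only, since consumable resource names differ per cluster:

# Sketch: pass scheduler-specific resource requests straight through to the
# grid engine via the DRMAA native specification string.
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.nativeSpecification = '-l h_vmem=8G -l gpu=1 -q gpu.q'
    # ... set remoteCommand/args and submit with session.runJob(jt) as usual ...
    session.deleteJobTemplate(jt)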

Job names can't have spaces

Currently, job names cannot have spaces in them because of how they're passed to the grid engine. We should replace them with underscores on the fly to work around this.
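
The on-the-fly replacement could be as small as this (a sketch, not the exact code that would go into GridMap):

# Sketch: make a job name safe for the grid engine by collapsing whitespace
# runs into underscores before handing the name to DRMAA.
import re

def sanitize_job_name(name):
    return re.sub(r'\s+', '_', name.strip())

print(sanitize_job_name('my training job'))   # -> my_training_job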

Possible gridmap problem OR possible grid hiccup

Hello. Today, this happened. 😕 Let me know if I can help with trying to reproduce this or anything.

Traceback (most recent call last):
  File "SR5_Batch.py", line 77, in <module>
    results = results + gridmap.process_jobs(jobs_queue, temp_dir=TEMP_DIR, white_list=WHITE_LIST, quiet=False)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 524, in process_jobs
    temp_dir=temp_dir, quiet=quiet)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 239, in _submit_jobs
    temp_dir=temp_dir, quiet=quiet)
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 299, in _append_job_to_session
    jobid = session.runJob(jt)
  File "/opt/python/2.7/lib/python2.7/site-packages/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "/opt/python/2.7/lib/python2.7/site-packages/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/opt/python/2.7/lib/python2.7/site-packages/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
drmaa.errors.DrmCommunicationException: code 2: failed receiving gdi request response for mid=130 (got syncron message receive timeout error).

Intermittent errors (issue migrated from ETS Gitlab)

Occasionally, we get errors like this:

Error while unpickling output for pythongrid job 0 from stored with key output_71c951d7-6ba0-41a0-8267-f31ef7130c64_0
This could caused by a problem with the cluster environment, imports or environment variables.
Try running `pythongrid.py 71c951d7-6ba0-41a0-8267-f31ef7130c64 0 /home/nlp-text/dynamic/mheilman/sklearn_wrapper /scratch/ loki.research.ets.org` to see if your job crashed before writing its output.
Check log files for more information:
stdout: /scratch/Sidewalks_1_Rider_Item_4a_train.cv.o1718937
stderr: /scratch/Sidewalks_1_Rider_Item_4a_train.cv.e1718937
Exception: must be string or buffer, not None

The strange thing is that if you run the command, you can see the job did not crash before writing its output, and if you connect to the underlying Redis server, you can see that the data is there. It's just that sometimes, for whatever reason, reading the data from the server fails (even though we have multiple retries built in to deal with synchronization problems).

Classes in __main__ modules will not work as arguments.

I'm not sure if this is a bug or just a limitation at this point, but if you have an executable script that defines a class in it, and then you try to pass an object of that type to grid_map, you will run into a pickling error, because it can't find the __main__ module when it tries to unpickle it.
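
A common workaround is to move the class out of the executable script and into an importable module so that unpickling on the worker can find it; a sketch with hypothetical module and class names:

# mytypes.py  (importable module; name is hypothetical)
class WorkItem(object):
    def __init__(self, payload):
        self.payload = payload

# run_jobs.py  (executable script)
# Because WorkItem lives in an importable module instead of __main__, the
# worker process can locate it when unpickling the job's arguments.
from gridmap import grid_map
from mytypes import WorkItem

def process(item):
    return len(item.payload)

if __name__ == '__main__':
    items = [WorkItem('abc'), WorkItem('defgh')]
    print(grid_map(process, items))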

Crash if CherryPy port is taken

Handle port already taken issue more gracefully, instead of dying.

Also, don't output so much CherryPy logging to stdout.

possibility of having delay between jobs

It would be great if you could add an optional waiting time between job submissions. It is easy to do manually (which is what I do), but perhaps it could be an option in gridmap. This is important for I/O-intensive jobs, because if I submit 2000 jobs and they all get to run simultaneously, they can kill the NFS server.
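
Until such an option exists, the manual version is just a sleep between batches of submissions; a sketch of that workaround, assuming process_jobs is imported from gridmap as in the tracebacks above:

# Sketch of the manual workaround: submit jobs in smaller batches and sleep
# between batches so 2000 I/O-heavy jobs don't all hit the NFS server at once.
import time
from gridmap import process_jobs

def process_in_batches(jobs, batch_size=100, delay_seconds=30):
    results = []
    for start in range(0, len(jobs), batch_size):
        results.extend(process_jobs(jobs[start:start + batch_size]))
        time.sleep(delay_seconds)
    return results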

Refactor `_send_zmq_msg` to reuse sockets

Currently _send_zmq_msg creates a new connection to the JobMonitor every time, which is really wasteful. Doesn't affect performance much since we only check things every ~15 seconds, but it would still be nice to fix.
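
The fix would amount to creating the REQ socket once per monitor address and reusing it for every exchange; a rough sketch with pyzmq (not the current _send_zmq_msg code):

# Sketch: keep one REQ socket per JobMonitor address instead of opening a
# new connection for every heartbeat or result message.
import zmq

_context = zmq.Context.instance()
_sockets = {}

def send_msg(address, payload):
    sock = _sockets.get(address)
    if sock is None:
        sock = _context.socket(zmq.REQ)
        sock.connect(address)
        _sockets[address] = sock
    sock.send_pyobj(payload)
    return sock.recv_pyobj()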

Jobs indicated as running never actually start.

Hello!

I really like your project, but I'm having trouble running your example code in examples/manual.py.
When I run it I get the promising output:

=====================================
========   Submit and Wait   ========
=====================================

sending function jobs to cluster.
2015-04-03 16:19:05,742 - gridmap.job - INFO - Setting up JobMonitor on tcp://10.194.168.53:52713

The output of qstat also looks fine:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 423383 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 [email protected]     1
 423384 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 [email protected]     1
 423385 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 [email protected]     1
 423386 0.56000 gridmap_jo <my_user_name>     r     04/03/2015 16:19:10 [email protected]     1

As you can see, the jobs are indicated as (r)unning.

The problem, however, is that the jobs never actually seem to finish, which is odd since the calculation takes about 10 seconds when run locally (as expected, since the function sleep_walk(10) is being called).

I then modified your example to skip the sleep function and write out a file called test.txt. But nothing ever happens.

Which brings me to my second question: how do I use the JobMonitor feature? I didn't gather much information from your documentation, I'm afraid.

Any help is much appreciated. Also if there is any way I can contribute please let me know.

Kai

Get timing/performance info

The JobMonitor gets heartbeats every couple seconds with CPU and memory usage info from all of the jobs, so it would be nice if there was an easy way to get some of that info out of it.

Gridmap with large output objects

When large output objects are generated by the jobs that are run on a cluster node, JobMonitor is unable to keep up with the flow of incoming information. The consequence is that JobMonitor is unable to keep all nodes running at maximum capacity (since the nodes are waiting to get their data sent back to the node where JobMonitor is running), leading to an aggregated load pattern of the cluster as shown in the figure below:
[Figure: aggregated cluster load with large output data]

In a synthetic benchmark where each job returns a large NumPy array, I found that running zdumps (as implemented in data.py) on my result takes approximately 1.78 seconds. If I disable bz2 compression, this time is reduced to 140 ms. This would be a band-aid for my specific situation, but does not solve the problem as such.

While looking at the GridMap code I was wondering if it would be a lot of work to have a parallel version of JobMonitor. The reason for this is that a "slow result" would not immediately block retrieval of other incoming results.

Apparently it is possible to directly send large NumPy arrays over ZMQ without going through the effort of serializing it through pickle. This might help as well, although I do not directly see how to modify job.py such that it will know when to expect a "special" NumPy result packet.
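
The pyzmq documentation describes this pattern: send a small metadata header followed by the raw array buffer, and rebuild the array on the receiving side. A sketch of the send/receive pair (how job.py would tag such a "special" message is left open):

# Sketch: ship a NumPy array over ZMQ without pickling by sending its
# dtype/shape as JSON and the raw buffer as a second, zero-copy frame.
import numpy
import zmq

def send_array(socket, array, flags=0):
    header = {'dtype': str(array.dtype), 'shape': array.shape}
    socket.send_json(header, flags | zmq.SNDMORE)
    socket.send(array, flags, copy=False)

def recv_array(socket, flags=0):
    header = socket.recv_json(flags)
    buf = socket.recv(flags)
    array = numpy.frombuffer(buf, dtype=header['dtype'])
    return array.reshape(header['shape'])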

Add option to (not) copy the environment

The package copies all environment variables to the SGE node when executing the job. The proposed feature allows the user to switch this off (in that case, environment variables are solely set by the node shell environment).

Travis build not working

I am trying to get the Travis plan to pass, but I am not having much luck. Here's what I have fixed so far in the fix-travis-ci branch:

  • Updated Ubuntu package names, since some of the old ones had become obsolete.
  • Updated the DRMAA library path, since the new package installs it in a different place now.
  • Added travis_wait to the nosetests command, since I am not sure whether the command takes a long time or whether things are just hanging.

Any help would be greatly appreciated.

Setting up JobMonitor Hanging

=====================================
========   Submit and Wait   ========
=====================================

sending function jobs to cluster

2015-07-21 18:15:13,077 - gridmap.job - INFO - Setting up JobMonitor on tcp://X.X.X.X:37905

Do I need to make sure to open certain ports?

Potential extension to the library

Hello...

Edit: I just realized that there is a related issue already open: Add support for limiting the number of concurrently executing jobs #27. I believe this is also what I am trying to address with my approach below.

I am trying to modify gridmap's behavior. I have a bunch of jobs (~10,000 or so) that I have to run to completion. Essentially, I don't care about combining results from each individual job; I just need to run each self-contained job to completion. I would like to build on top of your infrastructure to achieve this. Also, I don't want to submit all ~10,000 jobs to the GridEngine at once. Instead, I want to batch these jobs up in chunks and submit them, essentially building a pool of processes. A queue would probably be a better term? When one process finishes, either due to completion or an exception, remove it from the job queue and add one or more new processes to the pool based on the number of free spots left in the pool.

As far as I can tell, this would minimally involve two changes to the JobMonitor script:

  1. Prevent JobMonitor from exiting when it encounters an error in check_if_alive(). This can be achieved simply by overriding the check_if_alive() method, calling the parent's check_if_alive() in the override, and catching the exception it may throw when a compute-node subprocess raises an exception...
    class JobMonitorExt(job.JobMonitor):
        def check_if_alive(self):
            try:
                super(JobMonitorExt, self).check_if_alive()
            except Exception:
                pass
  2. Batching jobs: This can be done by overriding all_jobs_done() and, instead of checking whether all jobs are done, just checking whether at least one process has finished (either with an exception or by completing). If there is such a process, remove it from the process queue and add a new function process to the current session by calling _append_job_to_session(). all_jobs_done() returns True only when all of the ~10,000 jobs have been processed.
    class JobMonitorExt(job.JobMonitor):
        def all_jobs_done(self):
            # Check whether some job is done (exception or completed). If so,
            # remove that job from the queue and add a new job.
            # Return True only when there are no more jobs to process.

Do you think this will work? I'd appreciate your thoughts on this.

Update anaconda python support

Dear maintainers,
I hope to reach the correct persons here. I'm trying to integrate gridmap via the Anaconda package manager for Python 3.8. However, conda only lists packages for gridmap up to Python 3.6 (https://anaconda.org/anaconda/gridmap/files), which forces me to take the workaround of using pip. This is okay for now, but I want to clean this up in the future.
Many thanks to anyone who can fix this!

Delay in retrieving results from large job lists

Currently, we iterate through the jobs that have completed one at a time and query the grid engine about their status to help with debugging errors. The problem is that requesting the status for each job individually takes forever.

Also, if multiple jobs crash, we have to wait for the 30-second retry timeout for each job in a row. It seems like a better solution would be to try to retrieve all the results at once in parallel (using multiprocessing).
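
A sketch of that direction, where fetch_result stands in for the existing per-job retrieval (including its retries):

# Sketch: retrieve all job results concurrently so that per-job retry
# timeouts overlap instead of adding up one after another.
from multiprocessing import Pool

def fetch_result(job_id):
    # placeholder for the existing per-job result retrieval
    return job_id

def fetch_all_results(job_ids, workers=8):
    with Pool(processes=workers) as pool:
        return pool.map(fetch_result, job_ids)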

jobs are resubmitted even when they are killed purposely using qdel

Hi Dan, while I was trying to kill (qdel) the running jobs on the grid, they would seem to be killed but were later resubmitted again. I had to repeat this 5 times. However, jobs that were not currently executing (on the wait list) would be killed for good with just one qdel. I think this may be because the automatic resubmit feature is applied both to jobs that die prematurely and to jobs that are terminated on purpose. Can you please check?

Python cluster manager

I lack a bit of understanding of clusters. The problem I am currently looking to solve is this: I have a cluster of containers with A running on them. A always uses one of the configuration files B(1,2,3,...n), which changes based upon the request specification. If all the nodes in the cluster are using B1 and suddenly a new request comes in which needs B2, and B2 does not exist in the list of configuration files contained in A-Manager, then it should ask an external component for the particular configuration file B2. The next step is to apply the configuration to one of the nodes in the cluster.
Can I construct a cluster manager (A-Manager) using a Python module?
Can I kill one of the nodes (an A node) in the cluster and relaunch it with a new configuration?

Documentation suggestion

import gridmap

def dummyfunc(whatever):
    return whatever + 1

results = gridmap.grid_map(dummyfunc, range(10))
print(results)

While it's obviously best practice to have code inside main() or some other function, it might be worth noting in the documentation that since job.py imports the parent module of dummyfunc, failure to do so will result in runaway job submission. I submitted 10k of these bad boys today!
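
The pattern the documentation could recommend is the standard import guard, so that when job.py re-imports this module on the worker side it does not resubmit the jobs:

# Sketch: keeping the grid_map call behind the __main__ guard means the
# worker-side import of this module cannot trigger another round of jobs.
import gridmap

def dummyfunc(whatever):
    return whatever + 1

def main():
    results = gridmap.grid_map(dummyfunc, range(10))
    print(results)

if __name__ == '__main__':
    main()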

runpy.py:125: RuntimeWarning: 'gridmap.runner' found in sys.modules

python3.6/runpy.py:125: RuntimeWarning: 'gridmap.runner' found in sys.modules after import of package 'gridmap', but prior to execution of 'gridmap.runner'; this may result in unpredictable behaviour warn(RuntimeWarning(msg))

I don't know how to remove this warning, but the program looks like it is running normally.
