pygridtools / gridmap
Easily map Python functions onto a cluster using a DRMAA-compatible grid engine like Sun Grid Engine (SGE).
License: GNU General Public License v3.0
Currently, if you want to change something like DEFAULT_QUEUE, you have to modify the constants in the code. That's dumb, so I should change it to automatically use environment variables instead when they're set.
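A minimal sketch of that behavior, assuming a hypothetical GRIDMAP_DEFAULT_QUEUE variable name and 'all.q' as the hard-coded fallback:

import os

# Fall back to the hard-coded constant only when the environment
# variable is unset (the variable name here is hypothetical).
DEFAULT_QUEUE = os.environ.get('GRIDMAP_DEFAULT_QUEUE', 'all.q')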
If a job is stalled and we ask SGE to kill it, but for some reason the process lives on, it will cause a crash because the REP socket will end up in a bad state.
Hello...
Edit: I just realized that there is a related issue already open: "Add support for limiting the number of concurrently executing jobs" (#27). I believe that is also what I am trying to address with my approach below.
I am trying to modify gridmap's behavior. I have a bunch of jobs (~10,000 or so) that I have to run to completion. Essentially, I don't care about combining the results from the individual jobs; I just need to run each self-contained job to completion, and I would like to build on top of your infrastructure to achieve this. Also, I don't want to submit all ~10,000 jobs to the grid engine at once. Instead, I want to batch the jobs up in chunks and submit them in groups, essentially building a pool of processes (a queue would probably be a better term). When one process finishes, whether by completing or raising an exception, remove it from the job queue and add one or more new processes to the pool, based on the number of free spots left.
At minimum, I think this would involve two changes to the JobMonitor class:
class JobMonitorExt(job.JobMonitor):

    def check_if_alive(self):
        # Swallow errors from individual jobs so that one failure does
        # not bring down the whole monitor.
        try:
            super(JobMonitorExt, self).check_if_alive()
        except Exception:
            pass

    def all_jobs_done(self):
        # Check whether some job is done (completed or raised an
        # exception). If so, remove that job from the queue and add a
        # new one. Return True if there are no more jobs to process.
        raise NotImplementedError
Do you think this will work? I'd appreciate your thoughts on this.
Add a feature to grid_map that sends a short success mail upon completion of all grid_map jobs.
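A minimal sketch of such a notification, assuming a local SMTP server and hypothetical addresses (gridmap's actual mail settings may differ):

import smtplib
from email.mime.text import MIMEText

def send_success_mail(n_jobs):
    # Short notification that every grid_map job finished.
    msg = MIMEText('All {0} grid_map jobs completed successfully.'.format(n_jobs))
    msg['Subject'] = 'grid_map: all jobs finished'
    msg['From'] = 'gridmap@localhost'
    msg['To'] = 'user@localhost'
    with smtplib.SMTP('localhost') as server:
        server.send_message(msg)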
The ETS-specific clean_path function in gridmap/gridmap/data.py should be removed. It appears to no longer be necessary.
Hi Dan, while I was trying to kill (qdel) the running jobs on the grid, they would seem to be killed but would later be resubmitted. I had to repeat this five times. However, jobs that were not currently executing (still on the wait list) were killed for good with just one qdel. I think the automatic resubmit feature is being applied both to jobs that die prematurely and to jobs that are terminated on purpose. Can you please check?
python3.6/runpy.py:125: RuntimeWarning: 'gridmap.runner' found in sys.modules after import of package 'gridmap', but prior to execution of 'gridmap.runner'; this may result in unpredictable behaviour warn(RuntimeWarning(msg))
I don't know how to remove this warning, but the program looks like it is running normally.
Howdy! While evaluating all the texts yesterday afternoon/evening, I received the following error:
Traceback (most recent call last):
File "SR5_Batch.py", line 77, in <module>
results = results + gridmap.process_jobs(jobs_queue, temp_dir=TEMP_DIR, white_list=WHITE_LIST, quiet=False)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 773, in process_jobs
monitor.check(sid, jobs)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 372, in check
self.check_if_alive()
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 429, in check_if_alive
handle_resubmit(self.session_id, job, temp_dir=self.temp_dir)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 584, in handle_resubmit
job.white_list.remove(node_name)
ValueError: list.remove(x): x not in list
Nothing actually died or anything, so I just kinda left it to see what would happen, and discovered hours later that nothing had actually happened. My gridmapping just kinda hung there.
The interesting thing is that this doesn't seem to happen consistently. I was able to evaluate ~60 texts just fine, but when I went for 1,180 in batches of 100, I hit this around the 9th batch. When I can, I'll test this again to see how reproducible it is. 😕
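For what it's worth, a guard like the following (a sketch of one possible fix, not necessarily how the project resolved it) would avoid the ValueError inside handle_resubmit when the node has already been removed:

# Only drop the node if it is actually still in the white list, so a
# repeated failure on the same node cannot crash the JobMonitor.
if node_name in job.white_list:
    job.white_list.remove(node_name)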
Our RTD site is pretty bare-bones at the moment, and we need a nice tutorial, like we have for SKLL.
Currently, if DRMAA Python is installed and you don't have DRMAA_LIBRARY_PATH set (or you don't actually have the library installed because you don't have a DRM), Python will crash when you import gridmap. We should catch that exception and just switch to local mode.
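A minimal sketch of that fallback (the flag name is made up; the OSError case matches the ctypes failure seen when libdrmaa.so cannot be loaded):

# Treat any failure to import drmaa, including a missing or unloadable
# libdrmaa.so, as "no DRM available" and run jobs locally instead.
try:
    import drmaa
    DRMAA_PRESENT = True
except (ImportError, OSError, RuntimeError):
    DRMAA_PRESENT = False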
Currently, we iterate through the completed jobs one at a time and query the grid engine about each one's status to help with debugging errors. The problem is that requesting the status for each job individually takes forever. Also, if multiple jobs crash, we have to wait out the 30-second retry timeout for each job in a row. A better solution would be to retrieve all the results at once in parallel (using multiprocessing).
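A rough sketch of the parallel-retrieval idea; fetch_result is a hypothetical stand-in for the existing per-job status query and result download:

from multiprocessing import Pool

def fetch_result(job_id):
    # Hypothetical: wraps the current single-job status query and
    # result retrieval, including its retry timeout.
    raise NotImplementedError

def collect_results(job_ids, processes=8):
    # Query all finished jobs concurrently instead of one at a time,
    # so crashed jobs' timeouts overlap instead of adding up.
    with Pool(processes) as pool:
        return pool.map(fetch_result, job_ids)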
Hello. Today, this happened. 😕 Let me know if I can help with trying to reproduce this or anything.
Traceback (most recent call last):
File "SR5_Batch.py", line 77, in <module>
results = results + gridmap.process_jobs(jobs_queue, temp_dir=TEMP_DIR, white_list=WHITE_LIST, quiet=False)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 524, in process_jobs
temp_dir=temp_dir, quiet=quiet)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 239, in _submit_jobs
temp_dir=temp_dir, quiet=quiet)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 299, in _append_job_to_session
jobid = session.runJob(jt)
File "/opt/python/2.7/lib/python2.7/site-packages/drmaa/__init__.py", line 331, in runJob
_h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
File "/opt/python/2.7/lib/python2.7/site-packages/drmaa/helpers.py", line 213, in c
return f(*(args + (error_buffer, sizeof(error_buffer))))
File "/opt/python/2.7/lib/python2.7/site-packages/drmaa/errors.py", line 90, in error_check
raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
drmaa.errors.DrmCommunicationException: code 2: failed receiving gdi request response for mid=130 (got syncron message receive timeout error).
SGE_CELL is not necessarily "default". Users usually have to look in SGE_ROOT to find out what SGE_CELL is. Thanks.
Hello,
First of all thank you for a fantastic package, it is extremely useful for our research where we have to process large amounts of data. We are about to release a Python package for one of our tools and gridmap is a dependency for parallelising the most computationally intensive calculation.
Currently, the latest available gridmap package on PyPI is version 0.13.0, but it is not up to date with the code in the GitHub repository. Hence we were wondering whether it would be possible to give gridmap a version bump to 0.14.0 and upload an updated package to PyPI, so our users can make use of the latest fixes and improvements (without having to install manually from GitHub)?
Thank you very much in advance!
Kai
Implement the -S option in the native job specification
Occasionally, we get errors like this:
Error while unpickling output for pythongrid job 0 from stored with key output_71c951d7-6ba0-41a0-8267-f31ef7130c64_0
This could be caused by a problem with the cluster environment, imports or environment variables.
Try running `pythongrid.py 71c951d7-6ba0-41a0-8267-f31ef7130c64 0 /home/nlp-text/dynamic/mheilman/sklearn_wrapper /scratch/ loki.research.ets.org` to see if your job crashed before writing its output.
Check log files for more information:
stdout: /scratch/Sidewalks_1_Rider_Item_4a_train.cv.o1718937
stderr: /scratch/Sidewalks_1_Rider_Item_4a_train.cv.e1718937
Exception: must be string or buffer, not None
The strange thing is that if you run the command, you can see the job did not crash before writing its output, and if you connect to the underlying Redis server, you can see that the data is there. It's just that sometimes, for whatever reason, reading the data from the server fails (even though we have multiple retries built in to deal with synchronization problems).
Hello!
I really like your project, but I'm having trouble running your example code in examples/manual.py.
When I run it I get the promising output:
=====================================
======== Submit and Wait ========
=====================================
sending function jobs to cluster.
2015-04-03 16:19:05,742 - gridmap.job - INFO - Setting up JobMonitor on tcp://10.194.168.53:52713
The output of qstat also looks fine:
$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
423383 0.56000 gridmap_jo <my_user_name> r 04/03/2015 16:19:10 [email protected] 1
423384 0.56000 gridmap_jo <my_user_name> r 04/03/2015 16:19:10 [email protected] 1
423385 0.56000 gridmap_jo <my_user_name> r 04/03/2015 16:19:10 [email protected] 1
423386 0.56000 gridmap_jo <my_user_name> r 04/03/2015 16:19:10 [email protected] 1
As you can see, the jobs are indicated as (r)unning.
The problem, however, is that the jobs never actually seem to finish. This is odd, since the calculation takes about 10 seconds when run locally, as expected given that the function sleep_walk(10) is being called. I then modified your example to skip the sleep function and write out a file called test.txt. But nothing ever happens.
Which brings me to my second question: how do I use the JobMonitor feature? I didn't gather much information from your documentation, I'm afraid.
Any help is much appreciated. Also if there is any way I can contribute please let me know.
Kai
For systems that can only access a limited range of ports, I think it would be nice to let the user specify the port for the JobMonitor. Also, exposing the native specification would allow users to add advanced options to their jobs. I can make a PR for this, as I have already implemented both on the gridmap fork in my repository.
I lack a bit of understanding of clusters. The problem I am currently trying to solve is this: I have a cluster of containers with A running on them. A always uses one of the configuration files B(1, 2, 3, ..., n), which changes based upon the request specification. If all the nodes in the cluster are using B1 and suddenly a new request comes in which needs B2, and B2 does not exist in the list of configuration files contained in A-Manager, then it should ask an external component for the particular configuration file B2. The next step is to apply the configuration to one of the nodes in the cluster.
Can I construct a cluster manager (A-Manager) using this Python module?
Can I kill one of the nodes (an A node) in the cluster and relaunch it with a new configuration?
Dear maintainers,
I hope to reach the correct persons here. I'm trying to integrate gridmap via the Anaconda package manager for Python 3.8. However, conda only lists packages for gridmap up to Python 3.6 (https://anaconda.org/anaconda/gridmap/files), which forces me to take the workaround of using pip. This is okay for now, but I want to clean this up in the future.
Many thanks to anyone who can fix this!
=====================================
======== Submit and Wait ========
=====================================
sending function jobs to cluster
2015-07-21 18:15:13,077 - gridmap.job - INFO - Setting up JobMonitor on tcp://X.X.X.X:37905
Do I need to make sure to open certain ports?
Cf. giampaolo/psutil#594 and https://github.com/giampaolo/psutil/blob/master/HISTORY.rst#200---2014-03-10 (API Changes).
Causes an AttributeError in, e.g., line 117 in c291881.
Currently, you can run into crazy problems if you have two Python modules with the same name (that aren't in packages) that are both in your sys.path. Obviously not something that people should experience often, but I had slightly different versions of the same quick temporary script in two directories and was baffled by the errors I was getting.
My default shell is zsh. If I create a .bashrc file with an incorrect, out-of-date DRMAA_LIBRARY_PATH variable, then jobs won't run. This error shows up in the .e* log file:
Traceback (most recent call last):
File "/opt/python/2.7/lib/python2.7/runpy.py", line 151, in _run_module_as_main
mod_name, loader, code, fname = _get_module_details(mod_name)
File "/opt/python/2.7/lib/python2.7/runpy.py", line 101, in _get_module_details
loader = get_loader(mod_name)
File "/opt/python/2.7/lib/python2.7/pkgutil.py", line 464, in get_loader
return find_loader(fullname)
File "/opt/python/2.7/lib/python2.7/pkgutil.py", line 474, in find_loader
for importer in iter_importers(fullname):
File "/opt/python/2.7/lib/python2.7/pkgutil.py", line 430, in iter_importers
__import__(pkg)
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/__init__.py", line 69, in <module>
from gridmap.conf import (CHECK_FREQUENCY, CREATE_PLOTS, DEFAULT_QUEUE,
File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/conf.py", line 76, in <module>
import drmaa
File "build/bdist.linux-x86_64/egg/drmaa/__init__.py", line 63, in <module>
File "build/bdist.linux-x86_64/egg/drmaa/session.py", line 39, in <module>
File "build/bdist.linux-x86_64/egg/drmaa/helpers.py", line 36, in <module>
File "build/bdist.linux-x86_64/egg/drmaa/wrappers.py", line 56, in <module>
File "/opt/python/2.7/lib/python2.7/ctypes/__init__.py", line 365, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /local/research/linux/sge6_2u6/lib/lx24-amd64/libdrmaa.so: cannot open shared object file: No such file or directory
So, it looks like the grid engine is sourcing the system default shell's rc file, even though it shouldn't be sourcing anything.
A temporary solution was to just delete the .bashrc file (or update it to use the correct path as well as the .zshrc file).
Currently, _send_zmq_msg creates a new connection to the JobMonitor every time, which is really wasteful. It doesn't affect performance much, since we only check things every ~15 seconds, but it would still be nice to fix.
Relates to #84
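A minimal sketch of the reuse idea, assuming the monitor's address is known up front: create the REQ socket once and keep it for all later messages, instead of reconnecting on every call.

import zmq

class MonitorClient(object):
    """Holds one long-lived REQ socket to the JobMonitor."""

    def __init__(self, address):
        self.context = zmq.Context.instance()
        self.socket = self.context.socket(zmq.REQ)
        self.socket.connect(address)

    def send(self, msg):
        # REQ sockets enforce strict send/recv alternation, which
        # matches the existing request/reply message pattern.
        self.socket.send_pyobj(msg)
        return self.socket.recv_pyobj()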
We should start making the Conda packages ourselves and putting them up on the ets
channel.
GridMap kills all jobs if one of them fails.
In some settings, like in my gridjug add-on, exceptions are OK as other jobs would take over.
Our cluster is configured to use consumable resources with different names for hard memory limits, GPUs, etc.
Requesting these resources requires writing to the native specification property of the Job class, but the current implementation is read-only.
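A minimal sketch of the requested change (the _native_specification attribute name is an assumption about the internals, not gridmap's actual code):

class Job(object):

    def __init__(self):
        self._native_specification = ''

    @property
    def native_specification(self):
        return self._native_specification

    @native_specification.setter
    def native_specification(self, value):
        # Allow callers to request consumable resources, e.g.
        # job.native_specification = '-l h_vmem=8G'
        self._native_specification = value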
I recently received an email from a user whose cluster GridMap wasn't working on, because the Runner processes were failing to connect to the JobMonitor. It turns out that if your /etc/hosts is configured just so (and I believe it is this way on Ubuntu by default), socket.gethostbyname(socket.getfqdn()) will return 127.0.0.1, which is obviously not what we want.
I need to replace

self.ip_address = gethostbyname(self.host_name)

with something like

for _, _, _, _, (ip, _) in getaddrinfo(getfqdn(), 0):
    if ip != '127.0.0.1':
        self.ip_address = ip
        break
else:
    # for/else: runs only when no non-loopback address was found
    self.logger.warning('IP address for JobMonitor server is 127.0.0.1. '
                        'Runners on other machines will be unable to connect.')
    self.ip_address = '127.0.0.1'
Hi @dan-blanchard, I just came across the wiki page for the first time, and I would like to suggest moving it into the main docs, as it is an excellent introductory note on what GridMap actually does.
The current approach for getting memory and CPU information is very brittle.
When the jobs running on cluster nodes generate large output objects, the JobMonitor is unable to keep up with the flow of incoming information. The consequence is that the JobMonitor cannot keep all nodes running at maximum capacity (since the nodes are waiting to send their data back to the node where the JobMonitor is running), leading to an aggregated load pattern on the cluster as shown in the figure below.
In a synthetic benchmark where each job returns a large NumPy array, I found that running zdumps (as implemented in data.py) on my result takes approximately 1.78 seconds. If I disable bz2 compression, this time is reduced to 140 ms. That would be a band-aid for my specific situation, but it does not solve the problem itself.
While looking at the GridMap code, I was wondering whether it would be a lot of work to make a parallel version of the JobMonitor, so that a slow result would not block retrieval of other incoming results.
Apparently it is possible to send large NumPy arrays directly over ZMQ without going through the effort of serializing them with pickle. This might help as well, although I do not immediately see how to modify job.py such that it will know when to expect a "special" NumPy result packet.
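The raw-buffer approach would look roughly like the recipe from the pyzmq documentation: send the dtype and shape as a small JSON header, then the array's buffer itself, with no pickling on either side.

import numpy
import zmq

def send_array(socket, arr, flags=0, copy=True, track=False):
    # Header first (multipart), then the raw bytes of the array.
    header = dict(dtype=str(arr.dtype), shape=arr.shape)
    socket.send_json(header, flags | zmq.SNDMORE)
    return socket.send(arr, flags, copy=copy, track=track)

def recv_array(socket, flags=0, copy=True, track=False):
    # Rebuild the array from the header plus the received buffer.
    header = socket.recv_json(flags)
    msg = socket.recv(flags, copy=copy, track=track)
    arr = numpy.frombuffer(memoryview(msg), dtype=header['dtype'])
    return arr.reshape(header['shape'])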
Currently, job names cannot have spaces in them because of how they're passed to the grid engine. We should replace them with underscores on the fly to work around this.
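A one-function sketch of that workaround (a hypothetical helper, not gridmap's actual code):

import re

def sanitize_job_name(name):
    # Replace runs of whitespace with underscores so the grid engine
    # accepts the name.
    return re.sub(r'\s+', '_', name)

print(sanitize_job_name('my analysis job'))  # -> my_analysis_job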
Currently, since SEND_ERROR_MAIL defaults to True, if there's a problem with a job, we try to email the user about it. However, if sending that email raises an exception, the whole JobMonitor dies, thereby killing the rest of the jobs. We should make it so that doesn't happen.
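A minimal sketch of the guard; send_mail stands in for whatever function currently emails the user, so the exact signature is an assumption:

import logging

def notify_user_safely(send_mail, job):
    # Log a failed notification instead of letting the exception
    # propagate and take down the JobMonitor (and the other jobs).
    try:
        send_mail(job)
    except Exception:
        logging.getLogger(__name__).warning(
            'Could not send error email for job %r; continuing.', job,
            exc_info=True)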
I am trying to get the Travis plan to pass, but I am not having much luck. Here's what I have tried so far in the fix-travis-ci branch: adding travis_wait to the nosetests command, since I am not sure whether the command takes a long time or whether things are just hanging. Any help would be greatly appreciated.
Python 3.7.6, running example:
sending function jobs to cluster
Traceback (most recent call last):
File "testb.py", line 113, in <module>
main()
File "testb.py", line 105, in main
job_outputs = process_jobs(functionJobs, max_processes=4)
File "/idiap/temp/rbraun/programs/anaconda3/lib/python3.7/site-packages/gridmap/job.py", line 887, in process_jobs
with JobMonitor(temp_dir=temp_dir) as monitor:
File "/idiap/temp/rbraun/programs/anaconda3/lib/python3.7/site-packages/gridmap/job.py", line 299, in __init__
for _, _, _, _, (ip, _) in getaddrinfo(getfqdn(), 0):
ValueError: too many values to unpack (expected 2)
I fixed it by changing the code to

for _, _, _, _, tpl in getaddrinfo(getfqdn(), 0):
    if len(tpl) > 2:
        # skip IPv6 entries, whose sockaddr is a 4-tuple
        continue
    ip = tpl[0]
When using grid_map in my script, I found that after all the jobs have finished, they are automatically submitted again, as if grid_map were called twice in a row. I looked at the debug log and nothing seemed relevant. Can you help me get an idea of what could be wrong?
I'm not sure if this is a bug or just a limitation at this point, but if you have an executable script that defines a class in it, and then you try to pass an object of that type to grid_map, you will run into a pickling error, because it can't find the __main__ module when it tries to unpickle it.
The package copies all environment variables to the SGE node when executing the job. The proposed feature allows the user to switch this off (in which case environment variables are set solely by the node's shell environment).
Currently there's no way to tell process_jobs (or, consequently, grid_map) that you only want a certain number of jobs to execute at a time. I think this should be fairly straightforward to add to the JobMonitor class by making it keep track of how many jobs are actively running, and placing jobs on hold or suspending them (I'm not sure what the difference is in DRMAA yet) when they request their inputs and function while the concurrent job limit is exceeded.
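A rough sketch of the bookkeeping using DRMAA's control calls (whether HOLD or SUSPEND is the right action here is exactly the open question above):

import drmaa

def enforce_job_limit(session, running_ids, held_ids, max_concurrent):
    # Put excess jobs on hold...
    while len(running_ids) > max_concurrent:
        job_id = running_ids.pop()
        session.control(job_id, drmaa.JobControlAction.HOLD)
        held_ids.append(job_id)
    # ...and release held jobs again as slots free up.
    while held_ids and len(running_ids) < max_concurrent:
        job_id = held_ids.pop()
        session.control(job_id, drmaa.JobControlAction.RELEASE)
        running_ids.append(job_id)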
import gridmap

def dummyfunc(whatever):
    return whatever + 1

results = gridmap.grid_map(dummyfunc, range(10))
print results
While it's obviously best practice to have code inside main() or some other function, it might be worth noting in the documentation that since job.py imports the parent module of dummyfunc, failure to do so will result in runaway job submission. I submitted 10k of these bad boys today!
See my post in #42
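For comparison with the example above, the guarded version avoids the runaway submission, because job.py's re-import of the module no longer re-runs the submission code:

import gridmap

def dummyfunc(whatever):
    return whatever + 1

if __name__ == '__main__':
    results = gridmap.grid_map(dummyfunc, range(10))
    print(results)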
_append_job_to_session in job.py checks that temp_dir exists, but it does not check whether the directory is writable.
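A minimal sketch of the missing check (a hypothetical helper, not the existing code):

import os

def check_temp_dir(temp_dir):
    # Fail fast before submission if jobs will not be able to write here.
    if not os.path.isdir(temp_dir):
        raise OSError('temp_dir {0} does not exist'.format(temp_dir))
    if not os.access(temp_dir, os.W_OK):
        raise OSError('temp_dir {0} is not writable'.format(temp_dir))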
Handle the port-already-taken issue more gracefully, instead of dying. Also, don't output so much CherryPy logging to stdout.
The JobMonitor gets heartbeats every couple of seconds with CPU and memory usage info from all of the jobs, so it would be nice if there were an easy way to get some of that info out of it.
It would be great if you could add an optional waiting time between job submissions. It is easy to do manually (which is what I do), but perhaps it could be an option in gridmap. This is important for I/O-intensive jobs, because if I submit 2,000 jobs and they all run simultaneously, they can kill the NFS server.
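A minimal sketch of the option, assuming a hypothetical submit_delay parameter (seconds to sleep between successive submissions):

import time

def submit_with_delay(session, job_templates, submit_delay=0.5):
    # Throttle submissions so thousands of jobs do not all hit the
    # shared filesystem at the same moment.
    job_ids = []
    for jt in job_templates:
        job_ids.append(session.runJob(jt))
        time.sleep(submit_delay)
    return job_ids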
Dill is a more powerful serialization library with the ability to serialize more types of functions.
If DRMAA is not available, GridMap silently falls back to local processing. I think it would make for a nice enhancement to have a way to ensure that GridMap is actually submitting the jobs to a cluster, raising an exception if not, rather than falling back to local processing.
Currently, the code is set up so that the job runner uses the same Python interpreter as process_jobs, but we don't force the PYTHONPATH to stay the same. The fix may be as simple as doing that, but I could be wrong.
We should add an argument to process_jobs and grid_map called debug that tells the Runner processes to use debug-level logging instead of info-level.
The cleanup argument to grid_map and the Job class does not actually work when it's set to True. This is because we are trying to delete the output files from the same process that is writing to them. We could fix this by adding an extra job that executes at the end and cleans things up. Or we could specify the output path to DRMAA as /dev/null.