earonesty / grun Goto Github PK
View Code? Open in Web Editor NEWGrid/cluster job management system
Grid/cluster job management system
* Description: A lightweight replacement for job queueing systems like LSF, Torque, condor, SGE, for private clusters. * Installation: You need perl and ZMQ::LibZMQ3 apt-get install libzmq3 cpanm ZMQ::LibZMQ3 Put it in /usr/bin. Type grun -h for help with the configuration/setup. Basically you stick it on all the machines, and you put a "master" in the main config. It is, far and away, the easiest grid job queuing system out there. Example /etc/grun.conf on the master node: master: foo.mydomain.local services: queue log_file: /var/log/grun.log Example /etc/grun.conf on a many compute nodes: master: foo.mydomain.local services: exec log_file: /var/log/grun.log Once you have those two daemons running (grun -d), then you can, on either node, do this: grun hostname And it will show you which node the command was run on, confirming that your installtion works. * What you do with it: Submit jobs, grun puts them in a queue, the queue is on disk so machines can lose connection any time, and jobs keep going. Specify resouces requirements, or not - and have it figure stuff out. grun will assign jobs randomly, and will tend to fill up one machine at a time, in case big jobs come along. It should figure NFS mounts, and the right thing to do with them autoamtically. It has a fast i/o subsystem suitable for terabyte files. If all the machines are busy, jobs will queue up. Resources are soft-locked by jobs until they are finished, and if a non-grun job is using a machine, it's accounted for. NOTE: Jobs that require changing sets of resources should call grun themselves - forcing later commands to requeue: IE: > grun -m 3000 myjob.sh --- myjob.sh: --- perl usesalotofmemory.pl grun -c 4 -m 500 perl uses4cpusbutlessram.pl The memory and cpu usage of the parent program will go to zero as the second script is launched. In this way very complex jobs can be scheduled, without a special workflow system. Just use bash, perl or ruby. * What it does now: It does the queueing, i/o, node-matching to requirements, suspend/resume, priorities, and has a great config system. The "auto conf" allows sysadmins to create policies based on the parameters or script contens of a job. Grun has run millions of jobs in a production cluster with about 52 nodes of 24-48 cores each. Grun times every command and records memory/cpu/io. Ulimits are set on jobs, and can be scaled to a multiple of requested (elbowing). Jobs are editable, killable & suspend/resumable. You can specify "make like" semantics on jobs and it will check inputs/outputs and run jobs only if needed. Grun comes with a perl module that allows you to execute perl functions on remote hosts by copying the memory and code over to the remote machine. For example: use Grun; my $results = grun(\&mycrazyfunction, "param"); It decompiles the function locally, and recompiles remotely. * What it doesn't do yet (TODO): - needs a "plugin" architecture. Should be easy with perl. - graceful restart... with socket handoff (it's always safe to restart... but it can cause waiters to pause) - harrass users when queue drive is close to full, or cpu/ram was too high to ever run, or other (configurable) issues that users have * Goals: Small Keep the code under a few k lines, should be readable. Lots of comments. Simple Easy configuration, guesses default values Smart Keeps track of stats. Learns what to do with things. Puts the right jobs on the right machines. Fast Hundreds of thousands of jobs per hour on tens of thousands of machines should be easy. Configurable Make up lots of new things you want it to keep track of and allocate to jobs (like disk i/o, user limits, etc.) * Features to avoid: No security Grun has no security. It has to be behind a firewall on dedicated machines. This limits it, and keeps it simple. It's not hard to put up an ssh tunnel, and make it work at EC2. But I'm not building in kerberos or stuff like that. One platform Grun only works on unix-like machines that have perl installed. Nothing fancy Grun doesn't support MPI and other fancy grid things (although you can layer map-reduce on it) * Advanced usage: Example of a config on an ssh tunneled EC2 node (tested), with autostaging via gwrap: services: exec master: 127.0.0.1:5184 bind: 127.0.0.2:5184 wrap: /usr/bin/gwrap fetch_path: /opt:/mnt/ufs gwrap (source only) is a useful tool. Used right, it will copy all the dependent files to the remote host, only as needed, and then, when the remote program is finished, it will copy all the output files back to the local host. Of course, this *could* be built in to grun, via the -M option, but it isn't yet. gwrap is not yet heavily used in production, and if you do decide to use it heavly, please let me know... I'll help add things like logging, failure recovery as needed.
Need more stats on jobs... how much CPU/RAM/wall-clock-time was used, etc. at least. All derived from the /proc filesystem ...thats fine.
the directory structure requires a folder for each job .. to store the opts/status/i-o/etc
switching to sqlite is probably a better idea... especially since concurrency isn't really an issue... only the queue manager needs to see that DB
I have an embedded system and have build grun plus perl modules required. Most of this seems to work except that jobs do not complete ie I run grun hostname and it never returns .. i have to kill the job
$ grun -q jobs 29 pas (I) /home/pas hostname 28 pas (I) /home/pas hostname
job 29 was started on the remote exec system, job 28 was started on the local queue system. Both systems have same user name, user id and group id, just in case that makes a difference
Fri Oct 21 16:12:50 2016 trace Received command (00b6b854d6) 'run' : ["192.169.0.118",null,"run",{"umask":2,"env":{"USER":"pas","PATH":"/home/pas/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"},"frompid":22264,"guid":"D517E09497A011E69D75A4AA68EA6766","syncdirs":["/home/pas"],"io":1,"hard_factor":null,"user":"pas","param":{},"group":"1000 4 20 24 46 104 111 119 122 124 137 1000","cwd":"/home/pas","wait":1,"memory":"1000000","cmd":["hostname"],"priority":"20","cpus":1}]
wireshark shows ping packets working ok.
Perhaps some security issue stops the job being run on the remote system?? Or something like that? I cant see any good logs from tracing on the exec system.. all it seems to show is start and stop of the daemon.. I find that odd .. perhaps also I have some problems with building all the necessary perl modules on my embedded system.. anyway even with no logs on the embedded (exec) side both sides seem to get the idea that the job is started as shown above..
This must be finger trouble but as I am new to this I cant work out why it is not working. Hope you can help
Built the libraries on a laptop and a desktop and it works! Still no luck with the embedded processor.. still having problems but at least I can now see how it should work!
Peter.
stdio and/or stderr are not reliably copied back from the execution host... this needs to be traced & solved
a better way to handle i/o would be to bind the output of a tail -f (or similar) to 2 separate sockets. the waitio call in the client should create 2 connecting sockets and use select on them to stream the results
also, if the connection is dropped, the client should be able to request data from a recovery position - or from the top of the stream if available
also "error/job status" should be it's own separate client call... not lumped in with stderr and stdout as "3 files"... that's silly
Working on changing the protocol so that the queue manager returns immediately after assigning a job id... not after finding a matching machine... otherwise control-c (abort) can't work for a job in a busy queue.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.