rofl0r / jobflow

runs stuff in parallel (like GNU parallel, but much faster and memory-efficient)

License: MIT License

jobflow's Introduction

jobflow by rofl0r

this program is inspired by the functionality of GNU parallel, but tries to keep low overhead and follow the UNIX philosophy of doing one thing well.

how it works

basically, it works by processing stdin and launching one process per line. the actual line can be passed to the started program as an argument. this allows for easy parallelization of standard unix tasks.

it is also possible to save the number of the currently processed line to a state file, so a killed task can be resumed later.

example usage

you have a list of things, and a tool that processes a single thing.

cat things.list | jobflow -threads=8 -exec ./mytask {}

seq 100 | jobflow -threads=100 -exec echo {}

cat urls.txt | jobflow -threads=32 -exec wget {}

find . -name '*.bmp' | jobflow -threads=8 -exec bmp2jpeg {.}.bmp {.}.jpg

run jobflow without arguments to see a list of possible command line options, and argument permutations.

starting from version 1.3.1, jobflow can also be used to extract a range of lines, e.g.:

seq 100 | jobflow -skip 10 -count 10  # print lines 11 to 20

Comparison with GNU parallel

GNU parallel is written in perl, which has the following disadvantages:

  • requires a perl installation; even though most people already have perl installed, installing it just for this purpose requires up to 50 MB of storage (and potentially several hours to compile it from source on slow devices)
  • requires a lot of time on startup (parsing sources, etc)
  • requires a lot of memory (typically between 5-60 MB)

jobflow OTOH is written in C, which has numerous advantages:

  • once compiled to a tiny static binary, can be used without 3rd party stuff
  • very little and constant memory usage (typically a few KB)
  • no startup overhead
  • much higher execution speed

apart from the chosen language and related performance differences, the following other differences exist between GNU parallel and jobflow:

  • supports rlimits passed to started processes
  • doesn't support ssh (usage of remote cpus)
  • doesn't support all kinds of argument permutations: while GNU parallel has a rich set of options to permute the input, this doesn't adhere to the UNIX philosophy. jobflow can achieve the same result by passing the unmodified input through a user-created script (or other standard tools) that does the required permutations, as sketched below.
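
as an illustrative sketch (not taken from the jobflow docs; it assumes filenames without whitespace), the {.} example from above could instead be written by building complete command lines with awk and letting each job execute one of them via sh -c:

find . -name '*.bmp' | awk '{ base=$0; sub(/\.bmp$/,"",base); print "bmp2jpeg " $0 " " base ".jpg" }' \
  | jobflow -threads=8 -exec sh -c {}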

available command line options

-skip N -threads N -resume -statefile=/tmp/state -delayedflush
-delayedspinup N -buffered -joinoutput -limits mem=16M,cpu=10
-eof=XXX
-exec ./mycommand {}

-skip N

N=number of entries to skip

-count N

N=process only N lines (after skipping)

-threads N (alternative: -j N)

N=number of parallel processes to spawn

-resume

resume from last jobnumber stored in statefile

-eof XXX

use XXX as the EOF marker on stdin
if the marker is encountered, behave as if stdin was closed
not compatible with pipe/bulk mode
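
a quick sketch of what that means (the marker string is arbitrary):

printf 'a\nb\nSTOP\nc\n' | jobflow -eof=STOP -exec echo {}
# prints a and b; the STOP marker makes jobflow treat stdin as closed, so c is never processed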

-statefile XXX

XXX=filename
saves last launched jobnumber into a file
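
for example, a resumable run could look like this (a sketch; things.list and ./mytask are the placeholders used earlier):

cat things.list | jobflow -threads=8 -statefile=/tmp/state -exec ./mytask {}
# after an interruption, continue from the last stored jobnumber:
cat things.list | jobflow -threads=8 -statefile=/tmp/state -resume -exec ./mytask {}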

-delayedflush

only write to statefile whenever all processes are busy,
and at program end

-delayedspinup N

N=maximum amount of milliseconds
...to wait when spinning up a fresh set of processes
a random value between 0 and the chosen amount is used to delay initial
spinup.
this can be handy to avoid an I/O overload caused by a burst of
activity on program startup
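
for example (a sketch; the path is a placeholder):

find /data -type f | jobflow -threads=16 -delayedspinup 2000 -exec md5sum {}
# each of the 16 workers is started after a random delay between 0 and 2000 ms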

-buffered

store the stdout and stderr of launched processes into a temporary file
which will be printed after a process has finished.
this prevents the output of different processes from getting mixed up.

-joinoutput

if -buffered is active, write both stdout and stderr into the same file.
this preserves the chronological order of the output, and the combined
output is printed to stdout only.
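
for example, to keep the output of parallel downloads from interleaving (a sketch reusing the wget example from above):

cat urls.txt | jobflow -threads=32 -buffered -joinoutput -exec wget {}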

-bulk N

do bulk copies with a buffer of N bytes. only usable in pipe mode.
this passes (almost) the entire buffer to the next scheduled job.
the passed buffer will be truncated to the last line break boundary,
so jobs always get entire lines to work with.
this option is useful when you have huge input files and relatively short
task runtimes. by using it, syscall overhead can be reduced to a minimum.
N must be a multiple of 4KB. the suffixes G/M/K are detected.
actual memory allocation will be twice the amount passed.
note that pipe buffer size is limited to 64K on linux, so anything higher
than that probably doesn't make sense.
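
a hedged sketch — this assumes pipe mode is what jobflow uses when the -exec command contains no {} substitution, i.e. each job reads its share of the input from stdin (an assumption, check jobflow's help output), and ./sum-bytes is a hypothetical filter that reads lines from stdin:

cat huge-input.txt | jobflow -threads=4 -bulk 64K -exec ./sum-bytes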

-limits [mem=N,cpu=N,stack=N,fsize=N,nofiles=N]

sets the rlimits of the newly created processes.
see "man setrlimit" for an explanation. the suffixes G/M/K are detected.

-exec command with args

everything past -exec is treated as the command to execute for each line
received on stdin. the line can be passed as an argument using {}.
{.} passes everything before the last dot in a line as an argument.
it is possible to use multiple substitutions inside a single argument,
but currently only of one type.
if -exec is omitted, input will merely be dumped to stdout (like cat).
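
two short sketches of the behaviour described above (file names and the backup/ directory are placeholders):

seq 3 | jobflow                                  # no -exec: input is dumped to stdout, like cat
ls *.txt | jobflow -exec sh -c 'cp {} backup/{}' # two substitutions of the same type in one argument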

BUILD

just run make.

you may override variables used in the Makefile and set optimization CFLAGS and similar things using a file called config.mak, e.g.:

echo "CFLAGS=-O2 -g" > config.mak
make -j2

jobflow's Issues

Does GNU Parallel leak memory?

Hi

In your description https://github.com/rofl0r/jobflow/blob/master/README.md you write:

this program is inspired by GNU parallel, but has the following differences [...] does not leak memory

I have been unable to find an example of GNU Parallel leaking memory. If you have one, could you be so kind as to let me know? If you do not have an example, consider removing that bullet point, as it can be misinterpreted as if GNU Parallel leaks memory.

typo in command leads to bogus execution

note the missing space between echo and {}:

$ seq 1000 | jobflow -threads=100 -exec echo{}
posix_spawn: No such file or directory
posix_spawn: No such file or directory
posix_spawn: No such file or directory
posix_spawn: No such file or directory
...

output can get lost if the > operator is used

when using more than 1 process without the -buffered option, the output of some processes gets lost.
when piping the output into cat > file instead, everything arrives.
linux bug?

test:
fail:
seq 100 | ./jobflow.out -threads=100 -exec echo {} > test.tmp ; wc -l test.tmp
should return 100, but does not always

success:
seq 100 | ./jobflow.out -threads=100 -exec echo {} | cat > test.tmp ; wc -l test.tmp
always returns 100

failing to compile on ubuntu 13.10 64bit

$ git clone https://github.com/rofl0r/jobflow.git
$ cd jobflow/
$ CFLAGS=" -Wall -Wextra -std=c99 -D_GNU_SOURCE -g -O0 " rcb --force jobflow.c ""
/bin/sh: 1: rcb: not found
make: *** [debug] Error 127

Fixed rcb issue.

$ git clone https://github.com/rofl0r/rcb.git
$ cd rcb
$ cp rcb2make.pl ~/bin/rcb2make
$ cp rcb.pl ~/bin/rcb

$ cd ~/jobflow
$ make
CFLAGS=" -Wall -Wextra -std=c99 -D_GNU_SOURCE -g -O0 " rcb --force jobflow.c ""
[RcB] $CC not set, defaulting to cc
[RcB] $NM not set, defaulting to nm
[RcB] scanning deps...failed to find dependency /home/iqbala/jobflow/../lib/include/optparser.h referenced from /home/iqbala/jobflow/jobflow.c!
failed to find dependency /home/iqbala/jobflow/../lib/include/stringptr.h referenced from /home/iqbala/jobflow/jobflow.c!
failed to find dependency /home/iqbala/jobflow/../lib/include/stringptrlist.h referenced from /home/iqbala/jobflow/jobflow.c!
failed to find dependency /home/iqbala/jobflow/../lib/include/sblist.h referenced from /home/iqbala/jobflow/jobflow.c!
failed to find dependency /home/iqbala/jobflow/../lib/include/strlib.h referenced from /home/iqbala/jobflow/jobflow.c!
failed to find dependency /home/iqbala/jobflow/../lib/include/timelib.h referenced from /home/iqbala/jobflow/jobflow.c!
failed to find dependency /home/iqbala/jobflow/../lib/include/filelib.h referenced from /home/iqbala/jobflow/jobflow.c!
done
[RcB] compiling main file...
[CC] cc -Wall -Wextra -std=c99 -D_GNU_SOURCE -g -O0 -c jobflow.c -o jobflow.o
jobflow.c:25:38: fatal error: ../lib/include/optparser.h: No such file or directory
#include "../lib/include/optparser.h"
^
compilation terminated.
make: *** [debug] Error 1

"some versions of perl's garbage collector are buggy and leak memory"

Thanks for the update.

Most of your claims about GNU Parallel are easy to verify. (E.g. run 'time parallel echo ::: 1' to see the startup time and 'seq 1000 | time -v parallel true' to see memory usage.)

But claiming that Perl's buggy garbage collector is a disadvantage to GNU Parallel is not easy to verify. So if you want to use correct claims, I will encourage you to link to an example that shows the current version of GNU Parallel being affected by Perl's buggy garbage collector.

Having run GNU Parallel on many different architectures and many different versions of Perl (version 5.8.4..5.26.1 on armv7l, sparc, armv6hl, alpha, mips) I have never seen the current version of GNU Parallel being affected by Perl's buggy garbage collector, so I am quite interested in seeing the exact situation in which that happens.

So may I suggest you include a link to a reproducible example showing this actually happening?
