
justanhduc / task-spooler


A scheduler for GPU/CPU tasks

Home Page: https://justanhduc.github.io/2021/02/03/Task-Spooler.html

License: GNU General Public License v2.0

Topics: slurm, slurm-job, slurm-job-scheduler, task-spooler, job-scheduler, cpp, linux, gpu-support, debian, c

task-spooler's Introduction

GPU Task Spooler


About

GPU Task Spooler, or ts for short, is a spooling system that makes it easy to manage CPU/GPU tasks. You can think of it as SLURM, but for small individual servers rather than high-performance clusters.

Features

ts offers the following features:

  • Queue and execute jobs, be it on CPUs or GPUs.
  • Automatically allocate free GPUs for your jobs: just forget CUDA_VISIBLE_DEVICES.
  • Control the number of jobs running in parallel.
  • View your job outputs in the terminal and/or in text files.
  • Very minimalistic: easy setup and almost no configuration.
  • Terminal agnostic: queue in one terminal and view in another.
  • Jobs can be set to run in foreground or background.
  • Simple CLI, but there is a GUI addon.

Setup

See the installation steps in INSTALL.md.

Changelog

See CHANGELOG.

Tutorial

A tutorial with colab is available here.

Tricks

See here for some cool tricks to extend ts.

A note for DL/ML researchers

If the code is modified after a job is queued, the modified version will be executed rather than the version at the time the job was queued. To ensure the right version of the code is executed, you need some versioning mechanism. Personally, I simply clone the whole code base, excluding binary files, to a temporary location with rsync and execute the job there. Another way is to use git to check out the right version before running, but that requires committing every small change.
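As a rough illustration of this approach (a sketch only; the project path, exclude patterns, and training command are placeholders to adapt):

SRC="$HOME/my_project"
SNAP=$(mktemp -d /tmp/my_project.XXXXXX)
# Snapshot the code base, skipping version-control data and large binaries.
rsync -a --exclude '.git' --exclude '*.ckpt' --exclude 'data/' "$SRC"/ "$SNAP"/
# Queue the job from the snapshot so later edits to $SRC do not affect it.
ts -G 1 sh -c "cd '$SNAP' && python train.py"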

Working with remote servers

As above, one can use rsync to copy the code base to a temporary location on the remote server and then use ssh to launch the job via ts. This can be done either with a script or with a small plug-in available here.
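A minimal sketch of that workflow (the host name and paths are hypothetical):

# Copy the code base to a scratch directory on the remote machine...
rsync -a --exclude '.git' ./ myserver:/tmp/my_run/
# ...and enqueue the job there through the remote ts server.
ssh myserver "cd /tmp/my_run && ts -G 1 python train.py"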

Manual

See the usage below, man ts, or ts -h for more details.

usage: ts [action] [-ngfmdE] [-L <lab>] [-D <id>] [cmd...]
Env vars:
  TS_VISIBLE_DEVICES     the GPU IDs that are visible to ts. Jobs will be run on these GPUs only.
  TS_SOCKET              the path to the unix socket used by the ts command.
  TS_MAILTO              where to mail the result (on -m). Local user by default.
  TS_MAXFINISHED         maximum finished jobs in the queue.
  TS_MAXCONN             maximum number of ts connections at once.
  TS_ONFINISH            binary called on job end (passes jobid, error, outfile, command).
  TS_ENV                 command called on enqueue. Its output determines the job information.
  TS_SAVELIST            filename which will store the list, if the server dies.
  TS_SLOTS               amount of jobs which can run at once, read on server start.
  TMPDIR                 directory where to place the output files and the default socket.
Long option actions:
  --getenv               [var]        get the value of the specified variable in server environment.
  --setenv               [var]        set the specified flag to server environment.
  --unsetenv             [var]        remove the specified flag from server environment.
  --set_gpu_free_perc    [num]        set the value of GPU memory threshold above which GPUs are considered available (90 by default).
  --get_gpu_free_perc                 get the value of GPU memory threshold above which GPUs are considered available.
  --get_label          || -a [id]     show the job label. Of the last added, if not specified.
  --full_cmd           || -F [id]     show full command. Of the last added, if not specified.
  --count_running      || -R          return the number of running jobs
  --last_queue_id      || -q          show the job ID of the last added.
  --get_logdir                        get the path containing log files.
  --set_logdir           [path]       set the path containing log files. 
  --serialize [format] || -M [format] serialize the job list to the specified format. Choices: {default, json, tab}.
Long option adding jobs:
  --gpus               || -G [num]    number of GPUs required by the job (1 default).
  --gpu_indices        || -g [id,...] the job will be on these GPU indices without checking whether they are free.
Actions (can be performed only one at a time):
  -K           kill the task spooler server
  -C           clear the list of finished jobs
  -l           show the job list (default action)
  -g           list all jobs running on GPUs and the corresponding GPU IDs
  -S [num]     get/set the number of max simultaneous jobs of the server.
  -t [id]      "tail -n 10 -f" the output of the job. Last run if not specified.
  -c [id]      like -t, but shows all the lines. Last run if not specified.
  -p [id]      show the pid of the job. Last run if not specified.
  -o [id]      show the output file. Of last job run, if not specified.
  -i [id]      show job information. Of last job run, if not specified.
  -s [id]      show the job state. Of the last added, if not specified.
  -r [id]      remove a job. The last added, if not specified.
  -w [id]      wait for a job. The last added, if not specified.
  -k [id]      send SIGTERM to the job process group. The last run, if not specified.
  -T           send SIGTERM to all running job groups.
  -u [id]      put that job first. The last added, if not specified.
  -U [id-id]   swap two jobs in the queue.
  -B           in case of full queue on the server, quit (2) instead of waiting.
  -h           show this help
  -V           show the program version
Options adding jobs:
  -n           don't store the output of the command.
  -E           Keep stderr apart, in a name like the output file, but adding '.e'.
  -O           Set name of the log file (without any path).
  -z           gzip the stored output (if not -n).
  -f           don't fork into background.
  -m           send the output by e-mail (uses sendmail).
  -d           the job will be run after the last job ends.
  -D [id,...]  the job will be run after the job of given IDs ends.
  -W [id,...]  the job will be run after the job of given IDs ends well (exit code 0).
  -L [lab]     name this task with a label, to be distinguished on listing.
  -N [num]     number of slots required by the job (1 default).
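For example, a typical session might look like the following (the training script is a placeholder; the flags are the ones documented above):

ts -G 1 python train.py   # queue a job that needs one free GPU
ts -l                     # list queued, running and finished jobs
ts -t                     # follow the output of the last run job, tail-style
ts -c 3                   # print the whole output of job 3
ts -S 2                   # allow two jobs to run simultaneously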

People

Acknowledgement

  • To Lluís Batlle i Rossell, the author of the original Task Spooler
  • To Raúl Salinas, for his inspiring ideas
  • To Alessandro Öhler, the first non-acquaintance user, who proposed and created the mailing list
  • To Андрею Пантюхину, who created the BSD port
  • To the useful, although sometimes uncomfortable, UNIX interface
  • To Alexander V. Inyukhin, for the debian packages
  • To Pascal Bleser, for the SuSE packages
  • To Sergio Ballestrero, who sent code and motivated the development of a multislot version of ts
  • To GNU, an ugly but working and helpful ol' UNIX implementation

Others

Many memory bugs are identified thanks to Valgrind.

Related projects

Messenger

task-spooler's People

Contributors

bstee615, jasha10, jerinphilip, jevandezande, justanhduc, mshockwave, mw75, technojoe, valllle


task-spooler's Issues

Way to remove task from queue that is in "allocating" state

Sorry if bringing this up as an issue here is incorrect, but I was wondering if there is a way to remove a job from the queue that has not yet started but is in the "allocating" state. I can't seem to -k it because it says the job is not finished or is not running, and I can't -r it either as it says the job cannot be removed. Thanks!

Structured output

Thank you for this awesome tool!

I made a web app to manage multiple task queues, since in my own workflow I usually use 2-3 queues and would like notifications when items are done: https://github.com/bstee615/task-spooler-gui. I managed to parse the output of the tsp command, but it is not very machine-readable. I need structured output in order to parse the current status easily and precisely. I think adding an option -J to output JSON would be helpful.

I will have some time soon and will try to implement it in the coming weeks. Will you be open to a PR?

By the way, I just got through a busy period of 2 back-to-back deep learning paper submissions and would not have finished the experiments as easily without this friendly tool. Thank you.

Getting error of `Wrong server version.`

Hi, I followed the installation instructions on Ubuntu 18.04 and simply ran ts. But I got the error message: Wrong server version. Received 1048576, expecting 730.
Could you help me resolve this issue?

"munmap_chunk(): invalid pointer" error using "-L <label>" option

Hi!
First of all, thank you for a great project: it helps us a lot to run jobs in a batch manner.
As we don't need a GPU, we currently use the CPU-only version 1.2.1 of Task Spooler on a 64-bit Linux platform (SLES 15 SP3).

However, we encountered an error that looks like a bug in memory allocation/freeing.
When we try to use "ts" with the "-L <label>" option and then look up this label in the job list output (using the "-a" option), we sometimes get a crash (of both the client and server parts of Task Spooler). The only error available in the ".error" file is the following:
munmap_chunk(): invalid pointer
Google suggests that similar messages appear when a program tries to free memory that has not been allocated (via "malloc()" or similar).
Looking at the code, I can see that the file "main.c" contains the following near the end of the file (line 610):
free(command_line.label);
However, the command_line.label variable holds either zero (set during program initialization, line 42) or the value of the "-L" parameter (line 153). Neither case involves a malloc() call. That probably explains the client crash, but I still could not understand why the server part crashes as well.

Enhancement request: bigger queue size

Currently task spooler uses a constant max queue size:

MAXCONN = 1000

(in server.c file)

Such a limit was perfectly adequate 20-30 years ago, but with recent kernels running on recent hardware, it greatly narrows the scope of possible task spooler usage.

I already looked at the code and I think that increasing this limit to 10000 would be safe.

For compatibility with older/special systems, I suggest:

  • increase this limit in the default compilation to 10000 (or maybe even more, at one's own risk)
  • add a default limit of 1000 in the code (to avoid changing the default ts behavior)
  • add a new command-line switch to enable a limit over 1000, e.g. -X 8000

Please edit the README

First of all, thank you so much for your work and commitment to this old, very useful and very underrated tool.

I ended up here accidentally, by googling "task-spooler". I already know about the original repo by Lluís and everything written there. I was looking for more up-to-date information on what people are doing with this tool, and plan to do with it, nowadays in 2022.

I then stumbled on your GitHub repo. It is very nice to know somebody is interested in and committed to working on this tool.

But it took me a lot of unnecessarily wasted time and a lot of trouble to understand what your repo is about.

There is an explanation for that.

You have a horrible README greeting everyone who lands on your page for the first time.

You seem to have copied the original README and started doing "little edits": crossing out some things, adding little notes here and there.

You just put the "homepage" of the project as a link to a blog post from 2021. That's not very professional.

Please don't take this personally, but the end result is a horrible Frankenstein.
It makes it very difficult to quickly understand:

  • What is this repo? Is it a fork? Is it a mirror of the original one?
  • Is it some guy taking over the old one and maintaining it?
  • Besides being a fork, what are the main points? Is it bug fixing, or are new features planned?
  • What are your future intentions? Is this for personal use? Do you plan to keep working on it in a year or so?

All of the above would be solved if you just had a smaller, simpler, to-the-point README, not copied or adapted from the original repo.

My personal opinion is that you should actually have chosen another name, for example "task-spooler-ng". The original repo is dead and is not going to pick up your work here. You have every right to fork, and you deserve every credit for your work here. Since you are independent and no longer pulling updates from the original repo, you are effectively the "new" task-spooler.

And what if Lluís's old repo eventually wakes up from the dead and starts updating again? A big mess. Again, my personal opinion: change the name of your repo. Otherwise it makes the work of people eventually packaging it for a Linux distribution like Debian or Arch Linux unnecessarily complex.

Start with a simple sentence: "This is a fork of XYZ, link". Enough.

Then add a couple more sentences about what you are doing here, not what the old "task-spooler" repo did. That is history. Freshmeat and Lluís are history. Put it in a HISTORY file.

Keep the README small. Put other stuff in the right place. Use the good old CHANGELOG, TODO, or NEWS files. Simple plain-text files.

Get rid of the old files written by Lluís ("OBJECTIVE", "PLANS", etc.). They don't make sense here. Put them in a HISTORY or ARCHIVE folder. You don't have to stay chained to the structure of the old repo.

Don't put the advanced or niche stuff (the CPU-only build, planned features) before the important stuff. The important stuff, for first-time users, is "what is the point of task-spooler" and "how to quick-start".

What this (justanhduc) repo does differently from Lluís's repo is advanced or historical material.

Again, I hope you take this as constructive criticism. The reason for my interest is that I actually use Arch Linux. We have a package for task-spooler based on the old repo, with some patches. We saw your repo and are considering a possible switch if the future looks good and stable, just like the Debian package.

Thanks in advance.

Advice on how to cancel (kill or remove) task

First off, thank you so much for forking/maintaining this project!

I want to know the best way to NOT run a task in the case that I do not know whether it is running or queued.

I see that -r throws an error if it is running and -k throws an error if it is not (e.g. queued).

Based on this, I came up with the command: ts -r ${taskid} || ts -k ${taskid}. Does that seem like the best approach?
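For reference, the same idea wrapped into a tiny helper (just a sketch of the approach described in the question, not an officially recommended method):

#!/bin/sh
# cancel_task.sh <id>: try to dequeue the job; if it is already running, kill it instead.
id="$1"
ts -r "$id" 2>/dev/null || ts -k "$id"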

Change log file directory

Hi, I really appreciate your fantastic program.

I think it would be really nice if this program supported a couple of features:

  1. Changing the log file directory
  2. Changing the log file name

I wish I could do the above with command-line arguments :)

non-sudo installation

Thanks again for this wonderful tool. I wonder if the tool can be installed without sudo.

ts -F stochastically crashes the server

Occasionally (once every 50-100 commands), the task spooler server crashes following a ts -F <id> command. I'm not fully certain that it is only that command, but since I use it quite often, it is usually after calling ts -F that I suddenly see the server has crashed. It happens silently, so I don't have much more to add. Looking at /tmp/socket.ts.error, I see the following warnings:

-------------------Warning
 Msg: Sending a message to 3, sent -1 bytes, should send 48.
 errno 32, "Broken pipe"
msgdump:
 ENDJOB
date Thu Feb  2 13:31:23 2023
pid 12532
type CLIENT
-------------------Warning
 Msg: Sending a message to 3, sent -1 bytes, should send 48.
 errno 32, "Broken pipe"
msgdump:
 ENDJOB
date Thu Feb  2 13:31:27 2023
pid 12528
type CLIENT
-------------------Warning
 Msg: Sending a message to 3, sent -1 bytes, should send 48.
 errno 32, "Broken pipe"
msgdump:
 ENDJOB
date Thu Feb  2 13:31:29 2023
pid 12534
type CLIENT
-------------------Warning
 Msg: Sending a message to 3, sent -1 bytes, should send 48.
 errno 32, "Broken pipe"
msgdump:
 ENDJOB
date Thu Feb  2 13:31:33 2023
pid 12530
type CLIENT

I was previously using a version installed from source based on a November commit, but the behavior is the same even on the latest commit on the main branch. The only thing that changed in my environment is a recent system upgrade (on an NVIDIA DGX machine), so perhaps this is due to newer NVIDIA drivers or some other updated package this program relies on?

BTW, this is a wonderful program that is integral to my daily work as a researcher. It's just the right level of task management that I needed.

Can I limit the distributed GPUs?

I'm using your script on a machine with 16 GPUs. For my tasks, I want specific GPUs not to be used, or rather, I want to select which GPUs are used.

For example, I want GPUs 0-8 to be available to ts but 9-15 to be left alone. Is this something that can be done?
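The TS_VISIBLE_DEVICES variable documented in the manual above looks like it covers this; a sketch (assuming the variable takes a comma-separated list, like CUDA_VISIBLE_DEVICES, and is exported in the shell the ts server is started from):

# Only GPUs 0-8 would be visible to ts; 9-15 would be left alone.
export TS_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8
ts -G 1 python train.py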

Enhancement requests

Hi!
First of all, I'd like to thank you for this great project - it is really useful!

At the same time, I'd like to share my troubles and make some enhancement requests.
At the moment we use the CPU-only version 1.2.1 of Task Spooler.

  1. I'd like to monitor the current size of the queue. It's possible to use the "-R" option to get the number of jobs that are currently running; however, there is no similarly simple option to obtain the number of jobs that are currently waiting in the queue (or, in total, the sum of running and queued jobs, i.e. the current queue size). A counting sketch is given after this list.

  2. I'd like the possibility to add a job to the queue only if it has not already been added.

Use case 1
Some jobs are scheduled to be placed into the queue via cron (for example, every 10 minutes). However, sometimes such a job runs longer than usual (for example, 15-20 minutes). In this situation the cron daemon will place the job into the queue twice or three times, which is not needed.
In this case I need to place a job into the queue only if it is absent from the queue in both the "running" and "queued" states.

Use case 2
A job is placed into the queue by some external signal. The job should process some new data, and the signal means: "some new data has arrived and is ready for processing". However, sometimes new data arrives while the job is running. In that case the job should be restarted when it completes, so it should be placed into the queue again. At the same time, if the job is already queued, there is no need to place it into the queue more than once.
In this case I need to place a job into the queue only if it is absent from the queue in the "queued" state (even if it is present in the "running" state).

  3. The usage of labels could probably be extended. At least, it would be useful to check the state of specific jobs identified by label instead of job ID.

  4. All these tasks could be solved with scripts that obtain the list of jobs with all their attributes (state, label, command line, etc.). However, the current version shortens some long fields (at least labels and command lines) according to the screen size, even when "ts -l" is used in a script and STDOUT is not a terminal but redirected to a file or pipe. This causes a problem: a script must first obtain the list of jobs and then loop over it, using each job's ID with options like "-i", "-s", "-p" or "-a" to obtain the details of that specific job. Besides the additional work, it is impossible to obtain consistent information, since states change during the loop. Would it be possible to obtain the full queue list (independent of label and command sizes) for processing in scripts? It would be acceptable if this list used some well-known format (for example, CSV or TAB-separated fields).
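Regarding point 1, a crude counting sketch, under the assumption that the "State" column is the second field of ts -l and is never truncated (the exact column layout may differ between versions):

running=$(ts -R)
queued=$(ts -l | awk 'NR > 1 && $2 == "queued"' | wc -l)
echo "running=$running queued=$queued total=$((running + queued))"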

make cpu giving error: implicitly declaring library function 'snprintf' with type 'int

I just git clone'd the current master bbfd33e and ran make cpu. It outputs an error in list.c:

list.c:22:9: error: implicitly declaring library function 'snprintf' with type 'int (char *, unsigned long, const char *, ...)' [-Werror,-Wimplicit-function-declaration]
        snprintf(newline, len - 4, "%s", line);
        ^

Can you fix this error message? It blocks the install for me.


I am using GNU Make 3.81 on Mac M1 with macOS Monterey version 12.6.

Enhancement request: ability to postpone jobs

Use case is simple:

  • add the new job to the queue
  • but postpone execution by eg. 7 minutes

How is that different from using sleep() within jobs?

If a particular job sleeps, it also delays all waiting jobs. The postpone time should be counted by task spooler, separately for each job.

How should this be implemented so as not to break the task spooler architecture?

  1. ts should accept a new command-line parameter, e.g. the number of seconds by which to postpone the job being added.
  2. Each ts instance should internally compute the earliest Unix epoch time at which the job may start.
  3. The job server should only be contacted when this time comes - until then, the ts instance should just sleep and wait.
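Until something like that exists, a crude client-side workaround is to delay the enqueue itself rather than the execution (a sketch; my_long_job.sh is a placeholder, and it assumes that enqueueing 7 minutes later is acceptable and that the shell session stays alive):

# Enqueue the job after a 7-minute delay, without occupying a ts slot in the meantime.
( sleep 420 && ts my_long_job.sh ) &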

Make cpu_only a make/cmake toggle and drop the cpu_only branch ?

Hello!

To simplify packaging (I want to write an ebuild for Gentoo's GURU) so that the code in the release tarballs can be used directly, it would be awesome if you could merge cpu_only into master and make it a CMake toggle, for example, and/or maybe make the GPU queuing a toggle. Would that be very difficult for you? Another option is to add the cpu_only source code to the releases.

Thanks!

Adel

Enhancement request: support for priorities

Task spooler would be a great replacement for "real" queue servers, but it is missing a few capabilities. One of them is support for priorities.

Minimal form: normal and high priority (or normal and low).

Ideally, 3 priorities (low, normal, high), or even more.

Unclear queue configuration

Hey,

Awesome repo, hugely helpful.

We are unsure if this is a usage problem, or a bug in task spooler.

When configuring the queue to work on multiple GPUs we need to run,
ts -S 4
And to set the log directory we need to run,
ts --set_logdir logs/
but when running them together we notice that -S is ignored
ts -S 4 --set_logdir logs/

Can task spooler options only be set one at a time? If so, we might benefit from some clarity in the readme/guides.

Compilation errors in macos bigsur

I get error messages when compiling on a recent Mac with the system compiler, which is clang-based:

Apple clang version 12.0.5 (clang-1205.0.22.11)

list.c:22:5: error: implicitly declaring library function 'snprintf' with type 'int (char *, unsigned long, const char *, ...)'

I solved this with a change: removing -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ from the Makefile, which seems to block the use of snprintf.
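One way to apply that change from the command line (a sketch; the -i '' form is the BSD/macOS sed syntax, and it assumes the flags appear verbatim in the Makefile):

sed -i '' 's/-D_XOPEN_SOURCE=500 -D__STRICT_ANSI__//g' Makefile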

json format for listing jobs

Hey,

I would like to automate ts a bit, and therefore I was thinking of adding a JSON option for the output of listing jobs. Is that feature wanted? If so, I could code it up.

Bug: cannot add a very long command to queue

When trying to add a command with many parameters, task-spooler crashes at:

error("Reading the size of the name");

I'm currently working around this with a script that stores the command in a variable and then passes task-spooler another script that reads the command from the variable and executes it. But that's just a hack, of course.

Here are my workaround scripts:

ts_helper with usage ts_helper OPTIONS @ LONG_COMMAND:

#!/bin/bash
IFS='@' read -ra ARGS <<< "$@";
if [ ${#ARGS[@]} -eq 2 ];
then
  SCRIPTRAW="${ARGS[1]}"
  SCRIPT="$SCRIPTRAW" ts "${ARGS[0]}" ts_helper_runner;
else
  echo "Wrong number of arguments";
fi;

which calls the script ts_helper_runner:

#!/bin/bash
exec $SCRIPT

Using `-n` `-f` flags: pass through SIGINT (and other signals?)

Hi,

I want to use ts -n -f my_app to run apps in the foreground.
The use case is:

  • schedule a sequence of long-running jobs (each in its own tty)
  • be able to interact with the jobs while they are running.

For example, I might run ts -n -f python my_app.py, which runs for a long time, and I might sometimes want to interact with the app (e.g. if a Python debugger breakpoint gets hit).

There's a problem with the -nf flags: if I press Ctrl-C to send a SIGINT signal, it causes ts to malfunction.
For example:

$ ts -nf python -q
>>> 123
123
>>>
KeyboardInterrupt
>>> $  # Sending input to the python process no longer works as expected

After sending SIGINT (pressing Ctrl-C), it seems that standard input to the python process gets messed up.

Do you have any tips on how to work around this issue? Would it be easy to modify ts so that SIGINT (and maybe other signals) gets passed through to the process running in the foreground?

Unable to redirect output from command line

Hello.
I would like to redirect the output of my command launched by task-spooler.
I've seen comments about using stdout redirection for the job, but I haven't been able to make it work.
I use ts "./test.sh > test.log" but the output is still stored in a temporary file in /tmp.
I've tried with an absolute path and with the tee command, and still nothing.
The -o file option still outputs to a file with a random suffix in /tmp.
The only way I've found is to create a temporary script and redirect the output from inside that script.
Is there something I am missing?
Cheers
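One workaround (a sketch, assuming the goal is simply to get the job's output into test.log; ts will still create its own, now mostly empty, output file in /tmp): perform the redirection inside a shell that ts runs, so it is not interpreted by the invoking shell.

ts sh -c './test.sh > test.log 2>&1'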

Prompt to uninstall the apt installation of tsp before running ts in README

An error message 'Error calling recv_msg in c_check_version' occurs immediately when running ts, ts [action], etc., while the tsp server installed by apt is running.
In the README, it would be nice to prompt beginners to uninstall tsp, or to run tsp -K, before using ts.
And thank you for this tool, it is very useful.

does not work with 3 GPUs

It works well with 2 GPUs. However, when I use 3 GPUs, it seems the 3rd job in the queue is never executed. It shows the 3rd one as running, but the text under Output is (...)

BTW, should I use the -S option before any other command?

[cpu-only] Fail to build: multiple definition of `logdir'

Hello,

I tried to build it, but apparently I am hitting an issue:

➜  task-spooler git:(cpu-only) ./install_make 
GIT_VERSION=$(echo $(git describe --dirty --always --tags) | tr - +); \
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -DTS_VERSION=${GIT_VERSION} -c main.c -o main.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c server.c -o server.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c server_start.c -o server_start.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c client.c -o client.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c msgdump.c -o msgdump.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c jobs.c -o jobs.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c execute.c -o execute.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c msg.c -o msg.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c mail.c -o mail.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c error.c -o error.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c signals.c -o signals.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c list.c -o list.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c print.c -o print.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c info.c -o info.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c env.c -o env.o
cc -pedantic -ansi -Wall -g -O0 -std=c11 -D_XOPEN_SOURCE=500 -D__STRICT_ANSI__ -c tail.c -o tail.o
cc  -o ts main.o server.o server_start.o client.o msgdump.o jobs.o execute.o msg.o mail.o error.o signals.o list.o print.o info.o env.o tail.o
/usr/lib/gcc/x86_64-pc-linux-gnu/11.2.1/../../../../x86_64-pc-linux-gnu/bin/ld: jobs.o:/home/adel/Documents/Gethings/task-spooler/jobs.c:43: multiple definition of `logdir'; server.o:/home/adel/Documents/Gethings/task-spooler/server.c:40: first defined here
collect2: error: ld returned 1 exit status
make: *** [Makefile:29: ts] Error 1

The same goes for install_cmake.

Thanks!

Timeout

I suggest adding the ability to set a timeout for running tasks, killing them on expiration.

Getting strange errors when trying to submit tasks

Hello,

I am trying to use ts as a CPU task queue. I am currently using ts 1.0 (since that is what is distributed with Ubuntu). I have some PHP code that periodically submits a shell script to be executed as a task in ts. The task does get submitted, but the process finishes immediately and errors are written to the socket error file.

The PHP code does something like this:

shell_exec("TS_SOCKET=/run/task-spooler/tsp.sock /usr/bin/tsp '/path/to/my/script.sh' 2>&1");

The task does get added and is listed, but it always shows result -1, and the socket error log contains entries like this:

-------------------Warning
 Msg: JobID 4 quit while running.
 errno 32, "Broken pipe"
date Thu Jan  7 13:52:04 2021
pid 721
type SERVER
New_jobs
  new_job
    jobid 4
    command "/tmp/stacks/specialclient/stack124/apply-all.sh"
    state running
    result.errorlevel 0
    output_filename "/tmp/ts-out.h0dQen"
    store_output 1
    pid 5186
    should_keep_finished 1
  new_job
    jobid 5
    command "/tmp/stacks/specialclient/stack124/output-all.sh"
    state queued
    result.errorlevel 0
    output_filename "NULL"
    store_output 1
    pid 0
    should_keep_finished 1

Do you have any hint or suggestion about what I am doing wrong here?

asynchronous launch

How can I enqueue without blocking?

For example
ts "/usr/bin/cat /dev/random > /dev/null"

Maybe it is because my queue is "full"?

My use case is that I need to enqueue 3k tasks, and I want the enqueueing process to be asynchronous. I assume that just using & to launch them in the background is a hack and that there should be a better way.

If this is not possible, then my feature request is to add a flag to ensure that enqueueing returns immediately.
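If the blocking really does come from a full server queue, the -B flag documented in the manual above might already help, since it makes ts exit with code 2 instead of waiting (a sketch, not a confirmed diagnosis of the report above):

ts -B sh -c "/usr/bin/cat /dev/random > /dev/null" || echo "queue full, job not enqueued"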

can I query the GPU ids allocated for a job?

Thanks for developing the tool! Right now ts -i only returns the number of GPUs allocated, and ts -p returns the pid of the main process. To know which GPUs the process and its child processes actually use, I am using pgrep and cross-referencing with nvidia-smi. Is there an easier way to do this?
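There does not seem to be a built-in query for this; a rough script-level sketch of the cross-referencing approach described above (pstree and nvidia-smi output formats may vary across systems):

#!/bin/sh
# Usage: ./job_gpus.sh <ts job id>  -- print the GPU UUIDs used by the job's process tree.
job_pid=$(ts -p "$1")
tree_pids=$(pstree -p "$job_pid" | grep -o '([0-9]*)' | tr -d '()')
nvidia-smi --query-compute-apps=pid,gpu_uuid --format=csv,noheader | tr -d ' ' |
while IFS=',' read -r pid uuid; do
    echo "$tree_pids" | grep -qw "$pid" && echo "$uuid"
done
# The UUIDs can be mapped to indices with: nvidia-smi --query-gpu=index,uuid --format=csv,noheader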

Evaluate $(...) in commands at run not at enqueue

ts is such a great tool. Thanks for maintaining it.

I'm passing a port number to many jobs, and the port has to be free for the job to run correctly.
I have a script that finds a free port, and I use it to run the command with ts, i.e., ts -G 1 run --port=$(find_free_port).
The problem is that doing it like this evaluates the script when the command is added to the queue!
Is there a way to evaluate it when ts actually runs the command?
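One possible workaround (a sketch; run and find_free_port are the hypothetical names from the question above): wrap the command in a shell invocation with single quotes, so the command substitution is deferred until the job actually starts.

ts -G 1 sh -c 'run --port=$(find_free_port)'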

GPU selection scheme

Is there a way to suggest beforehand which GPUs to allocate for a job?

My use case is that I'm sharing a server with others, but unfortunately they don't want to use task-spooler, and they hardcode their GPU IDs.
We have set up a soft division of the GPUs where it's okay to tap into their GPUs if I'm already using all of mine.
For example, on a server with 8 GPUs, I got 0-3 and they have 4-7. I want to run five jobs with one GPU each, and they want to run two. They'll always use their GPUs in the order 4,5,6,7, so I want to set my preferred order to 0,1,2,3,7,6,5,4. That way I don't get in their way.

Real-time log with Task-spooler

Thank you for such a great tool. I have been looking for a tool like this for a while.

In your tutorial, it says:

To see the output, use the -c or -t flag. You should see the training in real-time. You can use ctrl+c to stop getting stdout anytime without actually canceling the experiment.

However, when I ran ts -c <id>, it hung for a while and only returned the entire log at once when the task finished.

When I checked the manual via man ts, I only see:

-c [id ... It will block until all the output can be sent to standard output, and will exit with the job errorlevel as in -c.

Is there a way to work around this and view the stdout in real time?

Separate logging and queueing?

Thanks again for your work. Your tool helps me push out a lot of great work. Feel free to check out my website.

Recently, our workstation has had an unstable connection to the GPUs (it might be an issue with the driver). Basically, nvidia-smi would return

Unable to determine the device handle for GPU 0000:4C:00.0: Unknown Error

When this issue occurs, the ts session breaks down and restarts. While access to the previous session is lost, the jobs launched by the previous ts session keep running, and we can no longer track their logging output with ts -t.

I guess it might be better to separate logging and queueing, so that the logging module does not depend on the GPU status and can still work when a GPU error occurs.

Thanks!

Less intelligent mode for GPU allocation

Let's say I have 2 GPUs that are shared with others; I would like to allocate a single job to a single GPU.

Using the --gpus option requires that a GPU be considered free, but setting the right free percentage can be tricky. The -g flag ignores the free requirement, but consecutive jobs assigned to the same GPU will start as long as there are available slots. The high-level view is that there would be a single slot for each GPU, and jobs would run on a GPU as long as the current user does not have a process running on it.

Essentially, I want to be able to just specify the number of GPUs needed by a job, and have task-spooler allocate the GPUs based on whether there are any jobs already running on them, regardless of memory usage. It is a hybrid mode between automatic and manual allocation.

What I am currently doing is creating two different ts servers that use different TMPDIRs and using the -g flag to force a single GPU for jobs submitted to a given server, which isn't ideal and kind of defeats the purpose of ts.
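For reference, that per-GPU-server workaround can also be expressed with the TS_SOCKET variable documented above instead of TMPDIR (a sketch; the socket paths and training command are placeholders):

# One dedicated server (and queue) per GPU, each job pinned with -g.
TS_SOCKET=/tmp/ts.gpu0 ts -g 0 python train.py
TS_SOCKET=/tmp/ts.gpu1 ts -g 1 python train.py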

BTW, could there be a configuration file that permanently sets the env vars? It would be great if things like the GPU wait could be set permanently as well.

Copyright

Hello,

I'm the copyright owner of Task Spooler. I'm surprised you republished my program, added yourself as author and copyright owner, and published a new version under the same name. Please make sure distributions do not take this as "the next version of Task Spooler". I came here because I saw that termux has the package "task-spooler/2.0.0".

It's fine if you fork it, but please rename it. You can create derived work, and you may hold copyright only over nontrivial additions.

Some links that may be of interest to you:
