Jug doesn't like jobs when they're ended by SIGTERM

luispedro commented on May 30, 2024

Jug doesn't like jobs when they're ended by SIGTERM

Comments (12)

danpf commented on May 30, 2024

I messed around with trying to delete files based on timestamps, but it got too complicated. I did find a different workaround that kind of works.

It appears that when a job gets hit by SIGTERM, jug leaves its file in the folder but with zero size.
So if you're running on a cluster with backfill/low priority:

**Do not use, doesn't work :/**
#PBS blah blah

#PBS -go to correct wd-
find jobscript.jugdata/* -size 0 -print0 |xargs -0 rm
parallel jug execute jobscript.py ::: {1..16}

~~this should only delete things that were killed since the lock files have size 28 (at least for me), thus resetting only the SIGTERM'd ones to 'ready'.~~

probably nobody is using low priority clusters (except me haha) but this could be useful until a better fix is found.

luispedro commented on May 30, 2024

There is a simple, effective, solution, and I will code it up when I have some time: catch the SIGTERM and clean up the lock files.

luispedro commented on May 30, 2024

OK, I actually had a look at the code and it already catches SIGTERM and cleans up. I tried it on the command line and it cleans up correctly if I do a kill -TERM <JOBID>.

Are you sure that your system really is sending SIGTERMs? Also, I wonder about the zero-sized files. There shouldn't be any zero-sized locks.
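
For readers unfamiliar with the pattern, "catch SIGTERM and clean up" can be sketched roughly as follows. This is a generic illustration, not jug's actual code: the helper name `run_with_lock` and the lock-file layout are assumptions made up for the example.

```python
import os
import signal
import sys

def _exit_on_term(signum, frame):
    # Turn the signal into SystemExit so that finally blocks get to run.
    sys.exit(1)

def run_with_lock(lock_path, work):
    """Hypothetical helper: hold lock_path while work() runs, then remove it."""
    previous = signal.signal(signal.SIGTERM, _exit_on_term)
    with open(lock_path, "w") as lock:
        lock.write(str(os.getpid()))
    try:
        return work()
    finally:
        # Reached on normal return, on exceptions and on SIGTERM,
        # but never on SIGKILL, which cannot be caught.
        os.remove(lock_path)
        signal.signal(signal.SIGTERM, previous or signal.SIG_DFL)
```

The point of the design is that the handler itself does no cleanup; it only converts the signal into an ordinary Python exit, so the same `finally` path covers every way of leaving `work()`.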

Andlon commented on May 30, 2024

I'm in a somewhat similar situation as @danpf. In my case, I couldn't use parallel, so I did something to the effect of:

for i in {1..16}
do
    PYTHONPATH=. jug execute jugfile.py &
done
wait # Wait for all child processes to finish

Now, I don't know how long the jobs take, so sometimes the cluster will kill the jobs before they are complete. In this case, wait will not forward the SIGTERM to the background processes. The default setting of the cluster is to send a SIGTERM, wait a bit, and then SIGKILL the process, if I remember correctly. Hence, in my case, using wait, the SIGTERM never reaches jug and the processes eventually get forcefully killed instead.

I tried to find out if parallel forwards SIGTERM, but I couldn't find anything definite. Could it be that parallel simply ignores the SIGTERM in the same way as wait?

On a related note, I'd be interested to hear of a better approach that correctly forwards SIGTERM to executing jug child processes!
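
One possible approach, sketched here under assumptions rather than taken from jug or its documentation, is to replace the shell loop with a small Python launcher that starts the workers and passes any SIGTERM or SIGINT it receives on to them. The function name `run_and_forward` is hypothetical; the `jug execute jugfile.py` command line is the one from the loop above.

```python
import signal
import subprocess

def run_and_forward(commands, signals=(signal.SIGTERM, signal.SIGINT)):
    """Start every command, forward the listed signals, wait for all of them."""
    children = [subprocess.Popen(cmd) for cmd in commands]

    def forward(signum, frame):
        # Pass the signal on to every child that is still running.
        for child in children:
            if child.poll() is None:
                child.send_signal(signum)

    previous = {s: signal.signal(s, forward) for s in signals}
    try:
        # wait() resumes after the handler has run, so we still collect
        # the exit status of every child.
        return [child.wait() for child in children]
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler or signal.SIG_DFL)

# Assumed usage, mirroring the shell loop:
#   run_and_forward([["jug", "execute", "jugfile.py"]] * 16)
```

This only helps if the scheduler delivers SIGTERM to the launcher and leaves enough time before the SIGKILL; a signal that arrives while the children are still being started is not forwarded in this sketch.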

luispedro commented on May 30, 2024

Ah, that explains it. If jug does not even see the SIGTERM, then it has no chance of handling it (SIGKILLs just kill the process forcibly).

Here is an alternative, which I'm not 100% sure should be part of jug, so I'm not (atm) including it: you catch SIGHUP. Include the following code at the top of your jugfile and it should clean up better.

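import signal
import sys

def __exit_on_hup(_signum, _frame):
    # Exit through SystemExit so that normal cleanup can run
    sys.exit(1)

# Treat SIGHUP as a request for a clean exit
signal.signal(signal.SIGHUP, __exit_on_hup)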

luispedro commented on May 30, 2024

Perhaps here is the right approach: catch SIGHUP and clean up by default, but add an option to disable it for advanced users.

Not sure yet, but in the meantime, including the code above in your jugfiles is probably a good workaround.

Andlon commented on May 30, 2024

Interesting. Could you explain where SIGHUP comes into play? I didn't realize SIGHUP would also be sent in the case of clusters killing jobs...?

luispedro commented on May 30, 2024

Sorry, you're right. There is probably no SIGHUP here, as the pseudo-terminal is probably not closed when bash exits. This sort of weird Unix interaction is hard to predict, though, and it depends on whether the shell is interactive or not (I started reading up on it, but it's complicated).

danpf commented on May 30, 2024

I edited my previous post, that doesn't work.

I think @Andlon is right, the problem here probably isn't jug, but how SIGTERM is being sent to the processes. I have sent an email to the cluster admins asking for more information about how the SIGTERM is sent.

If the SIGTERM is just being sent to every process under my username, then it would explain why my executable would get killed, and jug would see that and think that it has finished, but jug would get killed at that same moment... kind of a weird timeline and all probably happens in less than a second haha.

I don't really know much about these cluster systems, and information on the backfill system especially seems pretty sparse.

luispedro commented on May 30, 2024

You can also manually add trap statements to your bash script to propagate the signals.

In any case, I will make this into a documentation item and close this bug. If jug processes are getting killed with SIGKILL, there is nothing jug can do.
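
The difference between the two signals can be checked directly from Python. The snippet below is only an illustration (the helper name `can_install_handler` is made up for it): the operating system lets a process install a handler for SIGTERM but refuses one for SIGKILL, which is why no cleanup code can run in that case.

```python
import signal

def can_install_handler(signum):
    """Return True if this process is allowed to catch signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # The OS rejects handlers for signals such as SIGKILL.
        return False
    # Put the original handler back; we were only probing.
    signal.signal(signum, previous or signal.SIG_DFL)
    return True

print(can_install_handler(signal.SIGTERM))  # True
print(can_install_handler(signal.SIGKILL))  # False
```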

Andlon commented on May 30, 2024

Just wanted to say thank you for adding the trap example to the documentation! I will try it out next time I run batch jobs.

danpf commented on May 30, 2024

No one cares, but I figured I'd follow up.
My problems were fixed by getting the cluster admins to implement a queue that would allow for cleanup time after being preempted.

The following code works great, and is modeled after what Luis had suggested:

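#COMMANDS FOR SCHEDULER HERE

killparalleltwice()
{
    # parallel only passes SIGTERM on to its children the second time
    kill -15 "$ppid"
    kill -15 "$ppid"
    echo 'received -15, terminated parallel twice'
    sleep 20
}
runcommand()
{
    # exec so that the background PID is parallel itself, not a subshell
    exec "$PATHTOGNUP/parallel" "$PATHTOJUG/jug" execute jugfile.py ::: {1..16}
}
trap 'killparalleltwice' SIGTERM SIGINT
# run in the background so that $! is set and the trap can fire during wait
runcommand &
ppid=$!
wait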

In case you've made it this far and are still interested: GNU parallel doesn't have a flag (that I'm aware of) that will actually pass SIGTERM on to its children on the first signal, so you have to send it SIGTERM twice.

The sleep 20 is there because we only get ~10 seconds after preemption; this way the script is still running when the grace period ends, so the scheduler knows that my job hasn't finished yet and resubmits it.

So, in short, jug is perfect, and clusters are hard.
