Comments (12)
I messed around with trying to delete files based on timestamps, but it got too complicated; however, I found a different workaround that kind of works.
It would appear that when a job gets hit by SIGTERM, jug leaves the file in the folder but with zero size.
So, if you're running on a cluster with backfill/low-priority queues:
**Do not use: doesn't work :/**
```shell
#PBS blah blah
#PBS -go to correct wd-
find jobscript.jugdata/* -size 0 -print0 | xargs -0 rm
parallel jug execute jobscript.py ::: {1..16}
```
This should only delete the killed tasks' locks, since live lock files have size 28 (at least for me), thus resetting only the SIGTERM'd tasks to 'ready'.
Probably nobody is using low-priority clusters (except me, haha), but this could be useful until a better fix is found.
from jug.
There is a simple, effective solution, and I will code it up when I have some time: catch the SIGTERM and clean up the lock files.
OK, I actually had a look at the code, and it already catches SIGTERM and cleans up. I tried it on the command line, and it cleans up correctly if I do a `kill -TERM <JOBID>`.
Are you sure that your system really is sending SIGTERMs? Also, I wonder about the zero-sized files: there shouldn't be any zero-sized locks.
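(One way to check is to submit a tiny diagnostic script, sketched below, that logs every catchable signal it receives. This is not jug code; on a real cluster you would replace the self-kill at the end with a long sleep and let the scheduler deliver the signal instead.)

```python
import os
import signal
import time

received = []

def log_signal(signum, frame):
    # SIGKILL and SIGSTOP cannot be caught, so they will never show up here.
    name = signal.Signals(signum).name
    received.append(name)
    print(f"received {name}", flush=True)

for sig in (signal.SIGTERM, signal.SIGHUP, signal.SIGINT):
    signal.signal(sig, log_signal)

# On the cluster, replace this self-kill with e.g. time.sleep(3600) and
# inspect the job's output log after it gets preempted.
os.kill(os.getpid(), signal.SIGTERM)
time.sleep(0.1)
print(received)
```

If the log stays empty, the scheduler is most likely going straight to SIGKILL, which no process can catch.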
I'm in a somewhat similar situation as @danpf. In my case, I couldn't use parallel, so I did something to the effect of:
```shell
for i in {1..16}
do
    PYTHONPATH=. jug execute jugfile.py &
done
wait  # Wait for all child processes to finish
```
Now, I don't know how long the jobs take, so sometimes the cluster will kill the jobs before they are complete. In this case, `wait` will not forward any SIGTERM to the child processes. If I remember correctly, the cluster's default is to send a SIGTERM, wait a bit, and then SIGKILL the process. Hence, in my case, the SIGTERM is effectively ignored while the script sits in `wait`, and the jobs eventually get forcefully killed instead.
I tried to find out whether `parallel` forwards SIGTERM, but I couldn't find anything definite. Could it be that `parallel` simply ignores the SIGTERM in the same way as `wait` does?
On a related note, I'd be interested to hear of a better approach that correctly forwards SIGTERM to executing jug child processes!
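One sketch of such an approach (untested against a real scheduler) is to record the worker PIDs, trap SIGTERM in the batch script, and forward it to the workers by hand:

```shell
#!/bin/bash
# Start the jug workers in the background and remember their PIDs.
pids=()
for i in {1..16}
do
    PYTHONPATH=. jug execute jugfile.py &
    pids+=($!)
done

# Forward the scheduler's SIGTERM to every worker so jug can clean up.
trap 'kill -TERM "${pids[@]}" 2>/dev/null' TERM

# A trapped signal interrupts wait; wait a second time so the workers
# get a chance to finish their cleanup before the script exits.
wait
wait
```

Whether this is fast enough depends on how long the scheduler's grace period between SIGTERM and SIGKILL is.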
Ah, that explains it. If jug does not even see the SIGTERM, then it has no chance of handling it (SIGKILLs just kill the process forcibly).
Here is an alternative, which I'm not 100% sure should be part of jug, so I'm not including it at the moment: catch SIGHUP. Include the following code at the top of your jugfile and it should clean up better.
```python
import signal

def __exit_on_hup(_signum, _frame):
    import sys
    sys.exit(1)

signal.signal(signal.SIGHUP, __exit_on_hup)
```
Perhaps this is the right approach: catch SIGHUP and clean up by default, but add an option to disable it for advanced users.
Not sure yet, but in the meantime, including the code above in your jugfiles is probably a good workaround.
Interesting. Could you explain where SIGHUP comes into play? I didn't realize SIGHUP would also be sent in the case of clusters killing jobs...?
Sorry, you're right. There is probably no SIGHUP when bash exits, as the pseudo-terminal is probably not closed when bash exits. This sort of weird Unix interaction is sometimes hard to predict, though, and it depends on whether the shell is interactive or not (I started reading up on it, but it's complicated).
I edited my previous post; that doesn't work.
I think @Andlon is right: the problem here probably isn't jug, but how SIGTERM is being sent to the processes. I have sent an email to the cluster admins asking for more information about how the SIGTERM is delivered.
If the SIGTERM is simply sent to every process under my username, that would explain it: my executable would get killed, jug would see that and think it had finished, but jug itself would be killed at the same moment... kind of a weird timeline, and it all probably happens in less than a second, haha.
I don't really know much about these cluster systems, and information on the backfill system especially seems pretty sparse.
You can also manually add `trap` statements to your bash script to propagate the signals.
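A minimal form of such a `trap`, assuming a single backgrounded command (scheduler directives omitted), might look like:

```shell
#!/bin/bash
# Run the real work in the background; a foreground command would delay
# the trap handler until after it exits.
parallel jug execute jugfile.py ::: {1..16} &
child=$!

# Propagate the scheduler's SIGTERM to the child.
trap 'kill -TERM "$child" 2>/dev/null' TERM

# The first wait is interrupted by the trapped signal; the second one
# actually waits for the child to finish shutting down.
wait "$child"
wait "$child"
```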
In any case, I will make this into a documentation item and close this bug. If jug processes are getting killed with SIGKILL, there is nothing jug can do.
Just wanted to say thank you for adding the `trap` example to the documentation! I will try it out next time I run batch jobs.
No one cares, but I figured I'd follow up.
My problems were fixed by getting the cluster admins to implement a queue that allows for cleanup time after being preempted.
The following code works great, and is modeled after what Luis had suggested:
```shell
#COMMANDS FOR SCHEDULER HERE

killparalleltwice()
{
    # GNU parallel only passes SIGTERM on to its children after receiving
    # it a second time, so send it twice.
    kill -15 $ppid
    kill -15 $ppid
    echo 'received -15, terminated parallel twice'
    # Stay alive past the scheduler's ~10 s grace period so the job is
    # counted as unfinished and gets resubmitted.
    sleep 20
}

runcommand()
{
    $PATHTOGNUP/parallel $PATHTOJUG/jug execute jugfile.py ::: {1..16}
}

trap 'killparalleltwice' SIGTERM SIGINT
runcommand &    # run in the background so the trap can fire while it runs
ppid=$!
wait
```
In case you've made it this far and are still interested: GNU parallel doesn't have a flag (that I'm aware of) that will actually distribute SIGTERM to its children on the first signal, so you have to send it SIGTERM twice.
The `sleep 20` is there because we only get ~10 seconds after preemption; that way the scheduler sees that my job hasn't finished yet and resubmits it.
So, in short, jug is perfect, and clusters are hard.