Comments (14)
There's also a problem where if the leave works on a client, the kill will fail, resulting in a "FAILED" response from $retcode.
I also have observed some cases of clients in a "failed" state where they should have left, which I think is down to a race condition between issuing a leave and the subsequent 'kill -9'.
@pforman What would happen if leave_on_terminate
is set to true
, keep consul leave
, then execute kill -TERM
right after? Would that cause a race condition as well? Would the state of the client be failed
or left
?
Edit: Updated question for clarity
from puppet-consul.
If leave_on_terminate is true, then a TERM should issue another leave if it actually runs. Normally leave_on_terminate is false. (There's another default but modifiable behavior around SIGINT as well, basically SIGINT will cause a leave unless specified not to).
Normally the leave works correctly. In some rare cases, I've seen it not work, but I can't reliably duplicate it (yet). Luckily my test cluster just triggered it for me.
Here's what I got:
[root@ip-10-103-65-99 ~]# /etc/init.d/consul stop
Shutting down consul: Error leaving: client closed
Here's the log:
2015/08/21 03:25:39 [INFO] agent: Synced check 'service:nginx'
2015/08/21 03:31:53 [INFO] agent: Synced check 'service:nginx'
2015/08/21 03:34:34 [INFO] agent.rpc: Accepted client: 127.0.0.1:49830
2015/08/21 03:34:34 [INFO] agent.rpc: Accepted client: 127.0.0.1:49831
2015/08/21 03:34:34 [INFO] agent.rpc: Graceful leave triggered
2015/08/21 03:34:34 [INFO] consul: client starting leave
2015/08/21 03:34:34 [INFO] serf: EventMemberLeave: ip-10-103-65-99 10.103.65.99
2015/08/21 03:34:34 [INFO] agent: requesting shutdown
2015/08/21 03:34:34 [INFO] consul: shutting down client
2015/08/21 03:34:34 [INFO] agent: shutdown complete
This looks identical to any other client leave in the logs.
So it looks like the shutdown code returned "1", but then shut down correctly anyway. Checking from another client gives a "left" status:
[zinc@ip-10-103-67-243 ~]$ consul members
Node Address Status Type Build Protocol DC
ip-10-103-66-5 10.103.66.5:8301 alive server 0.5.2 2 pfdemo-us-west-2
ip-10-103-65-5 10.103.65.5:8301 alive server 0.5.2 2 pfdemo-us-west-2
ip-10-103-65-99 10.103.65.99:8301 left client 0.5.2 2 pfdemo-us-west-2
ip-10-103-67-243 10.103.67.243:8301 alive client 0.5.2 2 pfdemo-us-west-2
ip-10-103-67-5 10.103.67.5:8301 alive server 0.5.2 2 pfdemo-us-west-2
This may just be a consul thing from internal goroutines and the shutdown order. I'll look over on their issues. As far as I can tell, the TERM is just sort of superfluous, unless you really want to force a "failed" state instead of left.
I haven't ever seen a leave actually fail, but with the old script I did see some "failed" states come about when clients should have left. I think that's because the SIGKILL can't be caught, and the process just terminates abruptly.
Seems like having clients simply do "consul leave" and having servers do "kill -TERM" is probably the best. Having clients use INT (which the handler will generate a leave from) is also an option.
Not sure if that's more or less confusing. I spent some quality time reading the consul signals code today to get what understanding I do have.
It's at https://github.com/hashicorp/consul/blob/master/command/agent/command.go
I'll dig in a little more, and go check the issues of consul to see if they've noticed this behavior.
from puppet-consul.
As far as I can tell, the TERM is just sort of superfluous, unless you really want to force a "failed" state instead of left.
Seems like having clients simply do "consul leave" and having servers do "kill -TERM" is probably the best. Having clients use INT (which the handler will generate a leave from) is also an option.
@pforman So then why not remove consul leave
and just use INT
? Isin't that what we want, consul to leave gracefully? That way we don't have to deal with any race condition with consul leave
.
from puppet-consul.
I think that's viable. Let me beat on it a bit and see if I can get it to misbehave in any way.
The nice part about using INT is that it's controllable by config_hash, using the "skip_leave_on_interrupt" parameter.
All of this discussion about leave-vs-TERM is currently only the case for "sysv", though. Everything else appears to just use "consul leave". Seems like a lot of us with weird needs are using sysv-based systems...
I'm not sure if consistency across the init types is a good goal or not.
from puppet-consul.
@pforman Thank You! The more testing the better 😄
Currently consul leave
is only implemented for sysv, debian and upstart scripts. So if you can confirm INT
isin't doing any funny business, I will update the upstart and debian scripts to use INT
too.
So please keep us posted 👍
from puppet-consul.
Initial results with INT look good. We get this in the log (turned up to DEBUG):
==> Caught signal: interrupt
==> Gracefully shutting down agent...
2015/08/21 05:05:15 [INFO] consul: client starting leave
2015/08/21 05:05:15 [DEBUG] serf: messageLeaveType: ip-10-103-65-99
2015/08/21 05:05:15 [DEBUG] serf: messageLeaveType: ip-10-103-65-99
2015/08/21 05:05:15 [DEBUG] serf: messageLeaveType: ip-10-103-65-99
2015/08/21 05:05:15 [INFO] serf: EventMemberLeave: ip-10-103-65-99 10.103.65.99
2015/08/21 05:05:15 [DEBUG] memberlist: Initiating push/pull sync with: 10.103.65.5:8301
2015/08/21 05:05:15 [DEBUG] http: Shutting down http server (127.0.0.1:8500)
2015/08/21 05:05:15 [INFO] agent: requesting shutdown
2015/08/21 05:05:15 [INFO] consul: shutting down client
2015/08/21 05:05:15 [INFO] agent: shutdown complete
The one oddity is that the script contains this:
rm -f /var/lock/subsys/consul $PID_FILE
That's too fast, if the rm
is allowed to remove $PID_FILE, then consul itself complains in the logs when it subsequently tries to. So the signal is sent to consul, but then the script marches on. Turns out the script is quicker at issuing "rm" than consul is at shutting down.
2015/08/21 04:58:03 [WARN] agent: could not delete pid file Could not remove pid file: stat /var/run/consul/consul.pid: no such file or directory
Still looking into the original condition that causes "consul leave" to occasionally generate a return code of 1. Seems like that should be sent upstream. "consul leave" is a bit more readable than "kill -INT", so I'd hate to obfuscate everything if we can find the root issue.
from puppet-consul.
@pforman 💡 We can do one of the following:
- We can check if the process exists
pid -0 $PID
(possibly fixed loop count with a one second back off)?
- This raises another question. If after
x
amount of tries the process is still running then we have to decide whether to exit right then and echo "hey something went wrong, pid is still running" or forcefullykill -9
and clean up the pid and lock files?
OR
- We can just issue
INT
and assume consul took care of everything else but on start up ensure the pid is not running (if it does dokill -9
) and clean up the lock and pid files.
Also, I kinda understand your point about consul leave
being more clear. But the log doesn't really show where the leave was triggered from. It would been nice if consul added a flag called -reason
or something so we could do consul leave -reason 'triggered from init'
or something. Right now its quite ambiguous.
from puppet-consul.
@pforman I checked the other init scripts and most are sleeping/waiting before removing the pid file, so I think its fine to do that here as well (option 1).
Upstart is the exception to that rule, but I will add support to it as well.
from puppet-consul.
I opened hashicorp/consul#1189 about this. We'll see what happens.
I think INT and backoff is satisfactory, and if they fix this issue upstream going back to consul leave
might be preferable then.
from puppet-consul.
@pforman Awesome. Looking forward to your PR 😎 ⛵
from puppet-consul.
Was just thinking/working on this exact same issue today. I'd prefer if we can let the leave_on_terminate property determine the behavior of the cluster leave/failing otherwise there are situations in which this can cause a node to leave the cluster when you didn't really want it to because of a stop/start.
@pforman have you made headway on this and have a PR ready soon?
from puppet-consul.
PR is #181, waiting for merge.
The behavior for both clients and servers is somewhat configurable with the Consul properties, because the script just sends INT or TERM respectively.
If the signal isn't handled in a reasonable time it escalates the signal, eventually to KILL.
Consul is pretty well behaved as a daemon so I haven't seen that happen.
from puppet-consul.
Merged in #181.
I look forward to the day when we can just "use the upstream init script" :(
from puppet-consul.
Something like this could be used to standardize init scripts https://github.com/jordansissel/pleaserun
from puppet-consul.
Related Issues (20)
- Drop EoL Fedora 25/26/27 support
- Drop EoL FreeBSD 10 support
- Drop support for old SLES/SLED versions
- Drop support for EoL Puppet 5
- Migrate from master to main HOT 2
- migrate module to Vox Pupuli? HOT 6
- Module Release?
- RFC: refactor to include official hashicorp package repos HOT 1
- Systemd service does't work for 'package' install_method HOT 6
- RFC: change dependency camp2camp/systemd module to voxpupuli/systemd
- legacy ACL v1 no longer working starting from Consul version 1.11 HOT 9
- Add ability to manage consul log directory HOT 1
- Status of the project HOT 1
- Adding ACLS / Policies failes with unable to get local issuer certificate -> Puppet 6 / LetsEncrypt HOT 2
- alternative commands to Consul Reload
- module doesn't support grpc checks
- systemd Failed to parse service type, ignoring: exec
- How to access the secret_id of tokens HOT 1
- Allow to not restart consul when updating the binary
- consul_prepared_query doesn't support updating
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from puppet-consul.