
Comments (11)

ryanlecompte commented on August 19, 2024

Hmm, so one member of your ZooKeeper cluster was taken offline when you
took the app server down? Did you see the other redis_node_managers
swapping / becoming the primary, or just the one on that box? The logs
seem to indicate that there was a ZK connection error which caused that
instance to lose its ZK lock. I'm wondering if the underlying ZK client
had trouble reconnecting. Did you by any chance try restarting that
redis_node_manager?

Ryan

On Fri, Oct 5, 2012 at 1:25 PM, Max Justus Spransy <[email protected]> wrote:

We're running 7 instances of redis_node_manager in production, one per app
server. There was a connection timeout with ZooKeeper on a few clients
(ZK::Exceptions::OperationTimeOut: inputs: {:path=>"/redis_failover_nodes"})
because we took down one of our app servers to move it to a different host.
The loss of connection seemed to cause it to start switching between master
node managers about once every 20 seconds with the error:

ZK::Exceptions::LockAssertionFailedError: we do not actually hold the lock

https://gist.github.com/3feb567ff0374be12757

Any ideas?



from redis_failover.
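Ryan's hypothesis above — a ZK session hiccup causing the lock assertion to fail even though the process still believes it is primary — can be sketched with a toy model. FakeLocker and its methods below are hypothetical stand-ins for illustration only, not the ZK gem's real classes:

```ruby
# Toy model of the assert-then-act loop a node manager runs. FakeLocker
# mimics the behavior described in the thread: once the session is lost,
# asserting the lock raises "we do not actually hold the lock".
class LockAssertionFailedError < StandardError; end

class FakeLocker
  def initialize
    @held = true
  end

  # Simulate a ZK session expiry / connection timeout
  def lose_session!
    @held = false
  end

  # Passes while the lock is held; raises once the session is gone
  def assert!
    raise LockAssertionFailedError, 'we do not actually hold the lock' unless @held
  end
end

locker = FakeLocker.new
locker.assert!        # fine while the session is alive
locker.lose_session!  # connection drops; another manager grabs the lock
begin
  locker.assert!
rescue LockAssertionFailedError => e
  puts "stepping down as master: #{e.message}"
end
```

In the real gem the session loss happens asynchronously, which would explain why a manager can act as primary for a while before the assertion fires and it steps down.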

maxjustus commented on August 19, 2024

The node managers all started shuffling around, becoming primary. So app-2 will be primary for 20 seconds and then raise that exception, then app-5 will become primary and do the same thing, and so on. I didn't try restarting the managers; I can give that a shot.


ryanlecompte commented on August 19, 2024

That's really strange. I'm wondering if it had something to do with the way
the exclusive locker is implemented in the ZK gem and the fact that an
entire ZooKeeper node was moved to a different host. @slyphon, do you have
any ideas here?

Max, can you ping me on gchat also? I'm [email protected]


ryanlecompte commented on August 19, 2024

Also, how many nodes do you have in your ZooKeeper cluster? When you took
down a node and moved it to a different host, did you update your ZK nodes
config for redis_failover?
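For reference, the Node Manager takes its ZooKeeper ensemble as a comma-separated list on the command line, so a ZK host moving to a new hostname means restarting the managers with an updated list. A sketch of the invocation (hostnames are placeholders, and the exact flags should be checked against the redis_failover README):

```
redis_node_manager -z zk1:2181,zk2:2181,zk3:2181 -n redis1:6379,redis2:6379
```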


maxjustus commented on August 19, 2024

We've got 7 zookeeper nodes (running on the same servers as the node managers). I didn't update the config since I didn't expect the server to be down for long and assumed it'd be somewhat resilient to a node going down now and again.
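That expectation is reasonable on the ensemble side: ZooKeeper keeps serving as long as a majority of the ensemble is up, so a 7-node cluster tolerates up to 3 failures. The arithmetic can be checked directly (helper names below are illustrative):

```ruby
# Majority-quorum arithmetic for a ZooKeeper ensemble.
# quorum_size: minimum number of servers that must agree;
# tolerated_failures: how many servers can be lost while the
# ensemble keeps serving requests.
def quorum_size(ensemble_size)
  ensemble_size / 2 + 1
end

def tolerated_failures(ensemble_size)
  ensemble_size - quorum_size(ensemble_size)
end

puts quorum_size(7)        # => 4
puts tolerated_failures(7) # => 3
```

So losing one of seven ZK nodes should not break quorum by itself, which points at client-side session/lock handling rather than the ensemble.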


maxjustus commented on August 19, 2024

And yeah, one of the zookeeper nodes was taken offline, along with one of the redis_node_manager instances.


ryanlecompte commented on August 19, 2024

Gotcha. So, that would leave 6 zookeeper nodes running. When you brought
the zookeeper node back online, did the hostname change at all?


maxjustus commented on August 19, 2024

I dunno, it hasn't come back up yet :)


maxjustus commented on August 19, 2024

OK, so it's back up and the zk instance correctly rejoined the cluster. I'm still seeing the exception in zk and in redis_node_manager, gisted above. It looks like it's swapping between masters a lot faster than I thought: https://gist.github.com/afce95a96838a4becd94
I'll see if bringing it down to one manager and bringing each one back up one by one will fix it. I had this same issue on our staging servers and that seemed to do the trick.


ryanlecompte commented on August 19, 2024

Great, yes please give that a shot. Also, are you using redis_failover with
ZK 1.7? That hasn't been tested and may be a cause of some of your issues,
since I only test redis_failover with 1.6.x (which is what the gemspec
specifies).

Keep me posted!
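If the app picked up ZK 1.7 through an open-ended dependency, pinning it back to the tested series in the application's Gemfile might look like this (the exact constraint is illustrative of the 1.6.x series mentioned above):

```ruby
# Gemfile: keep the ZK client on the series redis_failover is tested against
gem 'zk', '~> 1.6.0'
gem 'redis_failover'
```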


ryanlecompte commented on August 19, 2024

I'm going to close this issue as fixed in redis_failover 1.0 (to be released tomorrow AM). It relies on a new ZK version that has better locking cleanup. I was not able to repro this behavior with the latest ZK version.

