
Comments (11)

ryanlecompte commented on August 19, 2024

Hmm, so one member of your ZooKeeper cluster was taken offline when you
took the app server down? Did you see the other redis_node_managers
swapping / becoming the primary, or just the one on that box? The logs
seem to indicate that there was a ZK connection error which caused that
instance to lose its ZK lock. I'm wondering if the underlying ZK client
had trouble reconnecting. Did you by any chance try restarting that
redis_node_manager?

Ryan

On Fri, Oct 5, 2012 at 1:25 PM, Max Justus Spransy <[email protected]> wrote:

We're running 7 instances of redis_node_manager in production, one per app
server. There was a connection timeout with ZooKeeper on a few clients
(ZK::Exceptions::OperationTimeOut: inputs: {:path=>"/redis_failover_nodes"})
because we took down one of our app servers to move it to a different host.
The loss of connection seemed to cause it to start switching between master
node managers about once every 20 seconds with the error:

ZK::Exceptions::LockAssertionFailedError: we do not actually hold the lock

https://gist.github.com/3feb567ff0374be12757

Any ideas?



from redis_failover.
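Ryan's hypothesis above — a ZK session hiccup causing the lock assertion to fail even though the process still believes it is primary — can be sketched with a toy model. FakeLocker and its methods below are hypothetical stand-ins for illustration only, not the ZK gem's real classes:

```ruby
# Toy model of the assert-then-act loop a node manager runs. FakeLocker
# mimics the behavior described in the thread: once the session is lost,
# asserting the lock raises "we do not actually hold the lock".
class LockAssertionFailedError < StandardError; end

class FakeLocker
  def initialize
    @held = true
  end

  # Simulate a ZK session expiry / connection timeout
  def lose_session!
    @held = false
  end

  # Passes while the lock is held; raises once the session is gone
  def assert!
    raise LockAssertionFailedError, 'we do not actually hold the lock' unless @held
  end
end

locker = FakeLocker.new
locker.assert!        # fine while the session is alive
locker.lose_session!  # connection drops; another manager grabs the lock
begin
  locker.assert!
rescue LockAssertionFailedError => e
  puts "stepping down as master: #{e.message}"
end
```

In the real gem the session loss happens asynchronously, which would explain why a manager can act as primary for a while before the assertion fires and it steps down.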

maxjustus commented on August 19, 2024

The node managers all started shuffling around, becoming primary. So app-2 will be primary for 20 seconds and then raise that exception, then app-5 will become primary and do the same thing, and so on. I didn't try restarting the managers; I can give that a shot.


ryanlecompte commented on August 19, 2024

That's really strange. I'm wondering if it had something to do with the way
the exclusive locker is implemented in the ZK gem and the fact that an
entire ZooKeeper node was moved to a different host. @slyphon, do you have
any ideas here?

Max, can you ping me on gchat also? I'm [email protected]


ryanlecompte commented on August 19, 2024

Also, how many nodes do you have in your ZooKeeper cluster? When you took
down a node and moved it to a different host, did you update your ZK nodes
config for redis_failover?
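For reference, the Node Manager takes its ZooKeeper ensemble as a comma-separated list on the command line, so a ZK host moving to a new hostname means restarting the managers with an updated list. A sketch of the invocation (hostnames are placeholders, and the exact flags should be checked against the redis_failover README):

```
redis_node_manager -z zk1:2181,zk2:2181,zk3:2181 -n redis1:6379,redis2:6379
```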


maxjustus commented on August 19, 2024

We've got 7 zookeeper nodes (running on the same servers as the node managers). I didn't update the config since I didn't expect the server to be down for long and assumed it'd be somewhat resilient to a node going down now and again.
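That expectation is reasonable on the ensemble side: ZooKeeper keeps serving as long as a majority of the ensemble is up, so a 7-node cluster tolerates up to 3 failures. The arithmetic can be checked directly (helper names below are illustrative):

```ruby
# Majority-quorum arithmetic for a ZooKeeper ensemble.
# quorum_size: minimum number of servers that must agree;
# tolerated_failures: how many servers can be lost while the
# ensemble keeps serving requests.
def quorum_size(ensemble_size)
  ensemble_size / 2 + 1
end

def tolerated_failures(ensemble_size)
  ensemble_size - quorum_size(ensemble_size)
end

puts quorum_size(7)        # => 4
puts tolerated_failures(7) # => 3
```

So losing one of seven ZK nodes should not break quorum by itself, which points at client-side session/lock handling rather than the ensemble.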


maxjustus commented on August 19, 2024

And yeah, one of the zookeeper nodes was taken offline, along with one of the redis_node_manager instances.


ryanlecompte commented on August 19, 2024

Gotcha. So, that would leave 6 zookeeper nodes running. When you brought
the zookeeper node back online, did the hostname change at all?


maxjustus commented on August 19, 2024

I dunno, it hasn't come back up yet :)


maxjustus commented on August 19, 2024

OK, so it's back up and the zk instance correctly rejoined the cluster. I'm still seeing the exception in zk and in redis_node_manager, gisted above. It looks like it's swapping between masters a lot faster than I thought: https://gist.github.com/afce95a96838a4becd94
I'll see if bringing it down to one manager and bringing each one back up one by one will fix it. I had this same issue on our staging servers and that seemed to do the trick.


ryanlecompte commented on August 19, 2024

Great, yes please give that a shot. Also, are you using redis_failover with
ZK 1.7? That hasn't been tested and may be a cause of some of your issues,
since I only test redis_failover with 1.6.x (which is what the gemspec
specifies).

Keep me posted!
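If the app picked up ZK 1.7 through an open-ended dependency, pinning it back to the tested series in the application's Gemfile might look like this (the exact constraint is illustrative of the 1.6.x series mentioned above):

```ruby
# Gemfile: keep the ZK client on the series redis_failover is tested against
gem 'zk', '~> 1.6.0'
gem 'redis_failover'
```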


ryanlecompte commented on August 19, 2024

I'm going to close this issue as fixed in redis_failover 1.0 (to be released tomorrow AM). It relies on a new ZK version that has better locking cleanup. I was not able to repro this behavior with the latest ZK version.

