
Upgrading to 3.0 (riak issue, CLOSED)

MikaAK commented on August 11, 2024
Upgrading to 3.0

from riak.

Comments (5)

martinsumner commented on August 11, 2024

The standard way for any upgrade is to stop/update/start one node at a time across the cluster. There shouldn't be a need to do it by adding nodes unless you're changing storage backends.
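
As a rough illustration of the stop/update/start order described above, here is a minimal dry-run sketch that only builds the list of steps and executes nothing. The node names and the `upgrade_step` placeholder are hypothetical, and the `riak admin transfers` check between nodes is an added assumption rather than something stated in this thread.

```python
def rolling_upgrade_plan(nodes, upgrade_step="<install the new Riak package>"):
    """Return (node, step) pairs for a one-node-at-a-time upgrade.

    Nothing is executed; the result is a checklist to review.
    """
    plan = []
    for node in nodes:
        plan.append((node, "riak stop"))
        # Placeholder only: the actual package command depends on the OS.
        plan.append((node, upgrade_step))
        plan.append((node, "riak start"))
        # Assumption: confirm no handoffs are pending before the next node.
        plan.append((node, "riak admin transfers"))
    return plan


# Hypothetical node names, for illustration only.
for node, step in rolling_upgrade_plan(["riak@node1", "riak@node2"]):
    print(f"{node}: {step}")
```

Each node is fully restarted before the next one is touched, so only one node is out of the cluster at any time.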

Whichever way you go though, I wouldn't expect out of memory issues. This is something going unexpectedly wrong, as if you have triggered a bug. Do you have some information on your cluster you can share?

How many nodes;
Ring size;
Storage backend;
Number of clusters replicating;
Replication version used;
AAE version used;
Approximate key count;
Approximate mean object size;
Precise version migrating from and to;
Operating system;
Physical configuration of each node (CPU, memory, storage type).

It would be useful to know:

Are the OOM issues on all nodes, or just updated nodes;
If you run riak admin top (3.0) or riak-admin top (2.9) sorted by memory, which processes are hogging memory?


MikaAK commented on August 11, 2024

Here you go! I got most of this info from our DevOps; let me know if there's more I can get.

How many nodes; 5
Ring size; 128
Storage backend; multi
Number of clusters replicating; 5-6
Replication version used; not sure
AAE version used; not sure
Approximate key count; not sure how to get this either, but maybe half a billion or more. We do around 100k puts daily
Approximate mean object size; not sure how to get this. If I had to guess I'd say mostly under 1 KB, except one bucket which is full of 300 KB blobs
Precise version migrating from and to; 2.9 -> 3.0.10
Operating system; Debian 9 on 2.9, Debian 10 on 3.0.10
Physical configuration of each node (CPU, memory, storage type); 16 CPU, 72 GB RAM, 5 TB SSD data disk

Are the OOM issues on all nodes, or just updated nodes; all nodes OOM and crash
If you run riak admin top (3.0) or riak-admin top (2.9) sorted by memory, which processes are hogging memory; this causes a severe outage, so we did not run these commands and cannot induce it again to run them.

This did not happen in a staging cluster cloned from the 5 prod nodes, where we added 5 new nodes one at a time and removed the old ones one at a time. Same data and specs. The only difference is prod traffic during the crash.


martinsumner commented on August 11, 2024

I don't understand this. There's no obvious reason for this behaviour.

The process of adding a node and removing a node is much more expensive than stop/update/start, though I wouldn't immediately expect it to blow up in terms of memory. Is there a reason why you're doing the update this way rather than simply stop/update/start?

There have been problems in Riak with leveled backends and excessive memory use. You can have a leveled backend if you enable tictac_aae, or if you set one of your backends to leveled in multi backend. Is leveled in play here?


MikaAK commented on August 11, 2024

From our DevOps:

I do not believe we are using leveled. The main reason we're adding nodes to the cluster is that if we stop and upgrade one of our five nodes and the upgrade fails, we have just lost a node and have to take it out of the load balancer, so we would take a performance hit and a possible outage.

We're going to attempt a stop/upgrade/start in a test cluster though!


MikaAK commented on August 11, 2024

This is now fixed! Thanks for the support. We took a hybrid approach with the following steps and were successful:

  1. remove old riak being replaced from load balancer
  2. spin up new debian 10 with new riak
  3. join cluster staged
  4. run replace cmd on current old to new riak
  5. replace staged
  6. run commit to have the old direct transfer to the new only while both are out of load balancer
  7. once done add new to lb and turn off old
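
The cluster commands behind steps 3 to 6 might look roughly like the following dry-run sketch, which only builds the command strings and runs nothing. It assumes the 3.0-style `riak admin cluster` subcommands (`join`, `replace`, `plan`, `commit`) are issued from the new node; the node names are hypothetical, and the final `transfers` check is an added assumption for watching the handoff finish.

```python
def replace_plan(old_node, new_node, cluster_member):
    """Return (node, command) pairs for a one-to-one node replacement.

    Nothing is executed; the load balancer changes (steps 1 and 7)
    happen outside Riak and are not modelled here.
    """
    return [
        # Step 3: stage a join through any existing cluster member.
        (new_node, f"riak admin cluster join {cluster_member}"),
        # Steps 4-5: stage the replacement of the old node by the new one.
        (new_node, f"riak admin cluster replace {old_node} {new_node}"),
        # Review the staged changes before committing.
        (new_node, "riak admin cluster plan"),
        # Step 6: commit, which starts the old-to-new transfer.
        (new_node, "riak admin cluster commit"),
        # Assumption: poll until no transfers remain before step 7.
        (new_node, "riak admin transfers"),
    ]


# Hypothetical node names, for illustration only.
for node, command in replace_plan("riak@old1", "riak@new1", "riak@member2"):
    print(f"{node}: {command}")
```

Because `replace` hands the old node's partitions directly to the new node, the plan should show a transfer between just those two nodes, which matches the 1-1 transfer described here.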

Since it was a 1-1 transfer, this prevented the OOM it seems.

The two other approaches we tried:

  • adding the new node to the cluster while it was in the lb: failed 100% of the time
  • adding the new node to the cluster while it was out of the lb: failed 50% of the time for us (one attempt worked, one did not)
