Is there a guide anywhere on upgrading to 3.0 from earlier versions like 2.9?

Upgrading to 3.0 about riak HOT 5 CLOSED

MikaAK commented on August 11, 2024

Upgrading to 3.0

from riak.

Comments (5)

martinsumner commented on August 11, 2024

The standard way for any upgrade is to stop/update/start one node at a time across the cluster. There shouldn't be a need to do it by adding nodes unless you're changing storage backends.
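As a rough illustration of the stop/update/start ordering described above, here is a minimal dry-run sketch. It only builds the sequence of steps and runs nothing. `riak stop` and `riak start` are the standard service commands; the node names and the `<upgrade-riak-package>` placeholder are hypothetical, since the thread does not say how the package is installed.

```python
from typing import Iterator, List, Tuple

# (node, argv) pairs; nothing is executed here.
Step = Tuple[str, List[str]]


def rolling_upgrade_steps(nodes: List[str]) -> Iterator[Step]:
    """Yield stop/update/start for each node, finishing one node
    completely before touching the next."""
    for node in nodes:
        yield node, ["riak", "stop"]
        # Placeholder: the actual package upgrade depends on the OS and
        # packaging in use, which the thread does not specify.
        yield node, ["<upgrade-riak-package>"]
        yield node, ["riak", "start"]


if __name__ == "__main__":
    # Hypothetical node names for illustration only.
    for node, argv in rolling_upgrade_steps(["riak@node1", "riak@node2"]):
        print(f"[{node}] {' '.join(argv)}")
```

In practice you would also wait for each node to be healthy again before moving to the next one; that check is left out here because the thread does not describe it.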

Whichever way you go though, I wouldn't expect out of memory issues. This is something going unexpectedly wrong, as if you have triggered a bug. Do you have some information on your cluster you can share?

How many nodes;
Ring size;
Storage backend;
Number of clusters replicating;
Replication version used;
AAE version used;
Approximate key count;
Approximate mean object size;
Precise version migrating from and to;
Operating system;
Physical configuration of each node (CPU, memory, storage type).

It would be useful to know:

Are the OOM issues on all nodes, or just updated nodes;
If you run riak admin top (3.0) or riak-admin top (2.9) sorted by memory, which processes are hogging memory?
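In a mixed-version cluster, the command spelling differs per node, as noted above. A small sketch of picking the right one from a version string follows. The helper name `top_command` is hypothetical, and the option for sorting by memory is deliberately left out because the thread does not give its exact form.

```python
from typing import List


def top_command(version: str) -> List[str]:
    """Return the argv for the 'top' tool on a node of the given Riak
    version: 'riak admin top' on 3.x, 'riak-admin top' on 2.x."""
    major = int(version.split(".")[0])
    if major >= 3:
        return ["riak", "admin", "top"]
    return ["riak-admin", "top"]


if __name__ == "__main__":
    for v in ("2.9", "3.0.10"):
        print(v, "->", " ".join(top_command(v)))
```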


MikaAK commented on August 11, 2024

Here you go! I got most of this info from our dev-ops, lemme know if there's more I can get.

How many nodes; 5
Ring size; 128
Storage backend; multi
Number of clusters replicating; 5-6
Replication version used; not sure
AAE version used; not sure
Approximate key count; Not sure how to get this either, but maybe half a billion or more; we do around 100k puts daily
Approximate mean object size; Not sure how to get this. If I had to guess, I'd say mostly under 1 KB, except one bucket which is full of 300 KB blobs
Precise version migrating from and to; 2.9 -> 3.0.10
Operating system; Debian 9 on 2.9, Debian 10 on 3.0.10
Physical configuration of each node (CPU, memory, storage type); 16 CPUs, 72 GB RAM, 5 TB SSD data disk

Are the OOM issues on all nodes, or just updated nodes; all nodes OOM and crash
If you run riak admin top (3.0) or riak-admin top (2.9) sorted by memory, which processes are hogging memory?
This causes a severe outage, so we did not run these commands and cannot induce it again to run them.

This did not happen in a staging cluster cloned from the 5 prod nodes, where we added 5 new nodes one at a time and removed the old ones one at a time. Same data and specs. The only difference is prod traffic during the crash.


martinsumner commented on August 11, 2024

I don't understand this. There's no obvious reason for this behaviour.

The process of adding a node and removing a node is much more expensive than stop/update/start, though I wouldn't immediately expect it to blow up in terms of memory. Is there a reason why you're doing the update this way rather than simply stop/update/start?

There have been problems in Riak with leveled backends and excessive memory use. You can have a leveled backend if you enable tictac_aae, or if you set one of your backends to leveled in multi backend. Is leveled in play here?


MikaAK commented on August 11, 2024

From our DevOps:

I do not believe we are using leveld. We are mostly adding nodes to the cluster because, if we stop and upgrade one of our five nodes and the upgrade fails, we have just lost a node and have to take it out of the load balancer, so we would take a performance hit and a possible outage.

We're going to attempt a stop/upgrade/start in a test cluster though!


MikaAK commented on August 11, 2024

This is now fixed! Thanks for the support. We took a hybrid approach with the following steps and were successful:

remove the old Riak node being replaced from the load balancer
spin up a new Debian 10 host with the new Riak
join it to the cluster (staged)
run the replace command from the current old node to the new Riak node
replace is staged
run commit so that the old node transfers directly to the new one only, while both are out of the load balancer
once done, add the new node to the load balancer and turn off the old one

Since it was a 1-1 transfer, it seems this prevented the OOM.
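The cluster side of the steps above could be sketched roughly as follows. This is a dry run that only builds the command list, not the exact commands we ran. It uses the standard staged cluster commands (`cluster join`, `cluster replace`, `cluster plan`, `cluster commit`) in the 3.0 `riak admin` spelling. The node names, the choice to run everything from the new node, and the explicit `plan` step are assumptions. The load balancer steps are manual and not modelled.

```python
from typing import List, Tuple

# (node to run on, argv); nothing is executed here.
Step = Tuple[str, List[str]]


def replace_node_steps(old_node: str, new_node: str) -> List[Step]:
    """Ordered staged-cluster commands for a 1:1 node replacement.

    Assumes both nodes are already out of the load balancer."""
    admin = ["riak", "admin"]  # 3.0 spelling; a 2.9 node uses "riak-admin"
    return [
        # Stage the join of the new node via an existing member.
        (new_node, admin + ["cluster", "join", old_node]),
        # Stage the replacement: old node hands its partitions to the new one.
        (new_node, admin + ["cluster", "replace", old_node, new_node]),
        # Review the staged changes before committing (assumed step).
        (new_node, admin + ["cluster", "plan"]),
        # Commit, which starts the direct old -> new transfer.
        (new_node, admin + ["cluster", "commit"]),
    ]


if __name__ == "__main__":
    # Hypothetical node names for illustration only.
    for host, argv in replace_node_steps("riak@old1", "riak@new1"):
        print(f"[{host}] {' '.join(argv)}")
```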

The two other approaches we tried:

adding the new node to the cluster while in the load balancer: failed 100% of the time
adding the new node to the cluster while out of the load balancer: failed 50% of the time for us (one attempt worked, one did not)
