Comments (4)

martinsumner commented on August 11, 2024

Given that these are hinted handoffs, I think it would be expected that they are handoffs from secondary partitions (i.e. fallback vnodes that were temporarily created to maintain n_val during an outage).

There's been a lot of work done in the last few versions of Riak to try and improve handoff reliability, as there were a lot of problems with handoff timeouts, particularly when handoffs are occurring during busy periods or vnodes are particularly large.

In your version, the first thing is probably to reduce the riak_core handoff_acksync_threshold across your cluster. This reduces the number of batches between acknowledgements.

There may also be value in increasing the riak_core handoff_timeout across the cluster.

There may also be value in increasing the riak_core handoff_receive_vnode_timeout.

These changes can all be made via riak attach and application:set_env (taking effect from the next handoff). You can also add the settings to advanced.config (taking effect after the node is restarted).
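
As a rough sketch, the runtime changes might look like this from the Erlang shell that riak attach opens. The values are illustrative assumptions rather than recommendations (the timeouts are assumed here to be in milliseconds), so check the defaults for your version before picking numbers:

```erlang
%% Run from riak attach; all values below are illustrative assumptions only.

%% Fewer batches between acknowledgements:
application:set_env(riak_core, handoff_acksync_threshold, 5).

%% Longer timeouts (assumed to be milliseconds):
application:set_env(riak_core, handoff_timeout, 120000).
application:set_env(riak_core, handoff_receive_vnode_timeout, 120000).

%% Read a value back to confirm what the node now holds:
application:get_env(riak_core, handoff_acksync_threshold).
```

Since set_env only affects the node it is run on, the same calls would need repeating on every node in the cluster.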

Finally, if you have increased the riak_core handoff_concurrency from the default setting, there may be value in reducing back to the default again.
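
To make the settings mentioned above persist across restarts, an advanced.config fragment could look roughly like the following. This is a sketch with assumed values; if your advanced.config already has a riak_core section, the entries would need merging into it rather than adding a second section:

```erlang
%% advanced.config (fragment); values are illustrative assumptions only.
[
 {riak_core, [
   %% Acknowledge more often during handoff
   {handoff_acksync_threshold, 5},
   %% Longer timeouts (assumed to be milliseconds)
   {handoff_timeout, 120000},
   {handoff_receive_vnode_timeout, 120000},
   %% Assumed default; verify the default for your Riak version
   {handoff_concurrency, 2}
 ]}
].
```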

Monitoring of these handoffs has been improved in recent versions, as working out what exactly is going wrong in older Riak versions is hard. When a handoff fails, it starts to re-send all the data from the beginning, so if the fallback vnodes were created as part of an extended outage (and are quite large) then continuous failures are possible.

If you are confident that all the data is sufficiently covered in your cluster (due to other replicas and anti-entropy mechanisms), in the worst case scenario you can stop each node in turn and manually delete the fallback vnodes. Obviously though, it would be more sustainable to find a configuration which will work for future handoffs.

from riak.

patrickkokou commented on August 11, 2024

Thanks Martin, I'll try these config changes and see how it goes. Will keep you updated.

patrickkokou commented on August 11, 2024

I made some changes via riak attach and application set_env, and restarted Riak. That kicked off the transfer again, but now I'm seeing a different error in the Riak error logs:

2023-05-03 01:34:09.787 [error] <0.304.0>@riak_core_ring:check_tainted:263 Error: riak_core_ring/ring_ready called on tainted ring
2023-05-03 01:34:09.787 [error] <0.304.0>@riak_core_ring:check_tainted:263 Error: riak_core_ring/ring_ready called on tainted ring

The transfer seems to be in progress, but I don't understand how to fix this riak_core_ring:check_tainted error.

I need your help again, thanks

martinsumner commented on August 11, 2024

I don't really know. I believe the tainted flag was added so that a read-only cache of the ring is marked as tainted before it is exported (using mochiglobal). The check then confirms that such a cached ring is never mistakenly used as the basis for an updated ring, i.e. that code updating the ring starts from get_raw_ring, not get_my_ring.

So the tainted state and the error messages were a check to make sure this never happens. But clearly, in some rare circumstances, it can. Because of this, the unset_tainted function was added so that the state could be fixed from remote_console ... but that isn't available in older versions of Riak.

If the error logs don't go away, there might be another way to clear this status. I don't think it will work, but riak_core_ring_manager:force_update/0 might be worth a shot. Alternatively, you could compile a new version of the riak_core_ring module with the exported unset_tainted function added, hot code load it, and then use the function to unset the flag.

