Potential stale reads (yugabyte-db issue, closed)

yugabyte commented on May 18, 2024
Potential stale reads

from yugabyte-db.

Comments (3)

mbautin commented on May 18, 2024

Hi @rystsov,

Thank you for your detailed post! Our leader lease algorithm works as follows:

  • With every leader-to-follower message (AppendEntries in Raft's terminology), whether it replicates new entries or is just an empty heartbeat, the leader sends a "leader lease" request expressed as a time interval, e.g. "I want a 2 second lease". This interval is usually a system-wide parameter. For each peer, the leader also keeps track of the lease expiration time corresponding to each pending request (i.e. the time when the request was sent + the lease duration), stored in terms of its local monotonic time (CLOCK_MONOTONIC in Linux). The leader treats itself as a special case of a "peer" for this purpose. Then, as it receives responses from followers, it maintains the majority-replicated watermark of these expiration times as recorded at request sending time. The leader adopts this majority-replicated watermark as its lease expiration time, and uses it when deciding whether it can serve consistent read requests or accept writes.

  • When a follower receives the above Raft RPC, it reads the value of its current monotonic clock, adds the provided lease interval to that, and remembers this lease expiration time, also in terms of its local monotonic time. If this follower becomes the new leader, it is not allowed to serve consistent reads or accept writes until any potential old leader's lease expires.

  • To guarantee that any new leader is aware of any old leader's lease expiration, another bit of logic is necessary. Each Raft group member records the latest expiration time of an old leader's lease that it knows about (in terms of this server's local monotonic time). Whenever a server sends a response to a RequestVote RPC, it includes with its vote the largest remaining amount of time of any known old leader's lease. The receiving server handles this similarly to the lease duration from AppendEntries: at least this amount of time has to pass after the receipt of this response before the recipient can service requests, in case it becomes a leader. This part of the algorithm is needed so that we can prove that a new leader will always know about any old leader's majority-replicated lease. This is analogous to Raft's correctness proof: there is always a server ("the voter") that received a lease request from the old leader and voted for the new leader, because the two majorities must overlap.
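The three steps above can be sketched in a few lines of Python. This is only an illustration of the description in this comment, not YugaByte's actual code: the class and method names (`LeaderLease`, `NodeLeaseState`, `on_ack`, and so on) are hypothetical, and the current time is passed in explicitly as `now` so the logic is easy to follow. A real server would read it from its local monotonic clock, e.g. `time.monotonic()`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LeaderLease:
    """Leader side: per-peer lease expirations, in the leader's monotonic time."""
    peers: List[str]                 # all group members, including the leader itself
    lease_duration: float            # system-wide lease interval, in seconds
    acked_expiry: Dict[str, float] = field(default_factory=dict)

    def expiry_for_request(self, now: float) -> float:
        # Expiration recorded at request *sending* time.
        return now + self.lease_duration

    def on_ack(self, peer: str, expiry_at_send: float) -> None:
        # A peer acknowledged a request that was sent with this expiration.
        prev = self.acked_expiry.get(peer, float("-inf"))
        self.acked_expiry[peer] = max(prev, expiry_at_send)

    def lease_expiry(self) -> float:
        # Majority-replicated watermark: the largest expiration that at
        # least a majority of peers have acknowledged.
        values = sorted(
            (self.acked_expiry.get(p, float("-inf")) for p in self.peers),
            reverse=True,
        )
        majority = len(self.peers) // 2 + 1
        return values[majority - 1]

    def can_serve(self, now: float) -> bool:
        return now < self.lease_expiry()


@dataclass
class NodeLeaseState:
    """Any group member: latest known old-leader lease expiration (local time)."""
    old_leader_lease_expiry: float = 0.0

    def on_append_entries(self, now: float, lease_duration: float) -> None:
        # Follower adds the requested interval to its own monotonic clock.
        self.old_leader_lease_expiry = max(
            self.old_leader_lease_expiry, now + lease_duration
        )

    def remaining_lease(self, now: float) -> float:
        # Interval included in a RequestVote response.
        return max(0.0, self.old_leader_lease_expiry - now)

    def on_vote_response(self, now: float, remaining_from_voter: float) -> None:
        # Candidate treats the voter's remaining interval like a lease request.
        self.old_leader_lease_expiry = max(
            self.old_leader_lease_expiry, now + remaining_from_voter
        )

    def may_serve_as_new_leader(self, now: float) -> bool:
        return now >= self.old_leader_lease_expiry
```

In a three-node group, the leader's own acknowledgement is not enough: `can_serve` only becomes true once a second peer has acknowledged a lease request, and only intervals (never absolute timestamps) cross the network.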

Note that we are not relying on any kind of clock synchronization for this leader lease implementation, as we're only sending time intervals over the network, and each server operates in terms of its local monotonic clock. The only two requirements for the clock implementation are:

  • Bounded monotonic clock drift rate between different servers. E.g. if we use the standard Linux assumption of less than 500us per second drift rate, we could account for it by multiplying all delays mentioned above by 1.001.

  • The monotonic clock does not freeze. E.g. if we're running on a VM which freezes temporarily, the hypervisor needs to refresh the VM's clock from the hardware clock when it starts running again.
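As a rough illustration of the drift adjustment in the first requirement: if each clock drifts by at most 500us per second, two clocks drifting in opposite directions diverge by roughly 1000us per second, which is where a factor of about 1.001 comes from. The sketch below assumes that reading of the bound; the names `MAX_DRIFT_PER_CLOCK`, `relative_drift_factor` and `safe_wait` are made up for this example.

```python
# Assumed per-clock drift bound: 500 microseconds per second.
MAX_DRIFT_PER_CLOCK = 500e-6


def relative_drift_factor(per_clock_drift: float = MAX_DRIFT_PER_CLOCK) -> float:
    # Worst case: one clock runs fast by the bound, the other slow by the bound.
    # For small drift this is approximately 1 + 2 * per_clock_drift.
    return (1.0 + per_clock_drift) / (1.0 - per_clock_drift)


def safe_wait(interval: float, per_clock_drift: float = MAX_DRIFT_PER_CLOCK) -> float:
    # How long a server should wait, by its own clock, to be sure that
    # `interval` seconds have elapsed on any other server's clock.
    return interval * relative_drift_factor(per_clock_drift)
```

With these assumptions, a 2 second lease would be waited out for about 2.002 seconds on the local clock.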

Also, I would like to point out that a faulty local clock will likely cause problems in many different software systems, including the user's application. I believe that our assumption of the availability of a local monotonic clock with a bounded drift relative to any other such clock within the same cluster is quite reasonable. And to quote the Spanner paper:

Our machine statistics show that bad CPUs are 6 times more likely than bad clocks. That is, clock issues are extremely infrequent, relative to much more serious hardware problems. As a result, we believe that TrueTime’s implementation is as trustworthy as any other piece of software upon which Spanner depends.

That said, we are working on verifying our assumptions about the clock performance in a variety of configurations. However, if you could clarify what you meant by "clock freezing" more precisely or point out scenarios in which it is likely to happen, or ways to reproduce it, that would be very helpful!

rystsov commented on May 18, 2024

As I understand it, one situation in which time may freeze is a cloud deployment where live migration happens behind the scenes. In this case we can't guarantee that:

  • the hypervisor refreshes the VM's clock
  • the clocks on both hypervisors are in sync

mbautin commented on May 18, 2024

@rystsov: thanks for pointing out the live migration case. Looking at this thread, AWS did not support live migration as of Aug 2016, and I could not find any indication that this has changed. Google Cloud Platform supports live migration, which can be turned on or off for each instance. We will make sure to update our documentation to mention the interaction between clocks and live migration, and will look into what the various cloud environments that support live migration guarantee regarding clock behavior in that case. However, given that the underlying platform provides bounded-drift monotonic clocks, we believe that YugaByteDB's architecture allows reading the latest value without querying a majority of replicas (a "quorum read"), so I will close this issue for now.
