ktoso / akka-raft Goto Github PK

A toy project implementing RAFT on top of Akka Cluster (not prod ready)

License: Apache License 2.0

Scala 100.00%

akka-raft's Introduction

akka-raft

This is an akka based implementation of the Raft consensus algorithm. It is generic enough that you can build your own replicated state machines on top of it (with raft keeping track of the consensus part of it).

This implementation is akka-cluster aware, so it can be easily deployed on multiple machines. Implementation wise, all parts of the raft whitepaper are covered:

Leader election
Dynamic membership changes (with transitioning periods implemented as Joint Consensus)
Deployment across multiple nodes
Log compaction, via snapshotting

Disclaimer

💥 💥

This project is a side-project of mine and is still work in progress (treat it as EARLY PREVIEW) and has a number of known protocol bugs (see Issues). It is NOT recommended to be used in production, however it's a great project to play around with implementing and discussing the Raft protocol.

💥 💥

In other words: Use at own risk, best not on any production-like environments (for now).

Basic info

Raft is a distributed consensus algorithm, much like Paxos (but simpler). This implementation is fully akka (and akka-cluster) based, and can be used to deploy a replicated state machine on top of akka clusters.

THIS API IS STILL SUBJECT TO CHANGE

class WordConcatRaftActor extends RaftActor {

  type Command = Cmnd

  var words = Vector[String]()

  /** 
   * Called when a command is determined by Raft to be safe to apply; 
   * Application results are sent back to the client issuing the command.
   */
  def apply = { 
    case AppendWord(word) =>
      words +: word
      log.info("Applied command [{}], full words is: {}", command, words)

      word // will be sent back to original actor, who sent the AppendWord command

    case GetWords =>
      val res = words.toList
      log.info("Replying with {}", res)
      res
  }
}

// ...

val members = (1 to 3) map { i => system.actorOf(Props[WordConcatRaftActor], name = s"raft-member-$i") }
val clusterConfiguration = ClusterConfiguration(raftConfiguration.members + additionalActor) // 0, 1

members foreach { _ ! ChangeConfiguration(clusterConfiguration)

// todo implement re-routing if you send to a non-leader
// then send messages to it; the state machine will only be applied when consensus has been reached about a value
leader ! ClientRequest(AppendWord("I"))
leader ! ClientRequest(AppendWord("like"))
leader ! ClientRequest(AppendWord("capybaras"))

// ... after some time
leader ! GetWords

expectMsg(List("I", "like", "capybaras"))

And if you want to enable snapshotting support it's as simple as implementing one method and matching for InstallSnapshot in your Actor:

class SnapshottingWordConcatRaftActor extends RaftActor {

  type Command = Cmnd

  var words = Vector[String]()

  def apply = {
    case AppendWord(word) =>
      words +: word
      word

    case GetWords =>
      val res = words.toList
      log.info("Replying with {}", res)
      res

    case InstallSnapshot(snapshot) =>
      words = snapshot.data.asInstanceOf[Vector[String]]
  }

  override def prepareSnapshot(meta: RaftSnapshotMetadata) =
    Future.successful(Some(RaftSnapshot(meta, words)))
}

RaftClientActor

In the above examples, the client implementation is very naive, and assumes you have some way of finding out who the current Leader is (as this is a requirement to interact with any Raft cluster). Thankfully, you can use the provided RaftClientActor, which works like a proxy that forwards all your messages to the current Leader, or stashes them if the cluster has no Leader at the moment (is undergoing an election) and sends the messages once the Leader becomes available.

License

Simply: Apache 2.0

Issues, Pull Requests as well as Tweets and Emails are more than welcome!

Links & kudos

An excellent analysis using fuzz testing of akka-raft, discovering a number of protocol bugs by @colin-scott, blog post Fuzzing Raft for Fun and Publication and linked inside it the whitepaper draft.
Raft - In Search of an Understandable Consensus Algorithm whitepaper
See other implementations (many lanugages) on raftconsensus.github.io
Docs on akka-cluster in 2.3.0-RC2.

We have discussed this paper both in Kraków and London, on these awesome reading clubs (drop by if you're into CS papers!):

akka-raft's People

Stargazers

Watchers

Forkers

bambuchaadm aritradhar ngbinh moizr shell0dh inexplicable yzernik revcbh dimitar9 wangsan njosephef linearregression piotrtrzpil splade dfrsg dmitraver colin-scott heikoheiko boulevard23 zx247549135 tigerqiu712 thirugnanamsubbiah shobhitmaheshwari yinchunguang knowledgehacker carrot-garden linpingchuan heiqiaoxiang phaller springasa kovrizhin negokaz leetoo vipnlock myxof mdf369 gollen833 jxnu-liguobin yuetaoz burcuku svilen-ivanov leili2017

akka-raft's Issues

DeclineCandidate from Candidate includes wrong term

When a candidate responds to a RequestVote with DeclineCandidate, it should include its own term in the response instead of the request's term. This allows the candidate that sent the request to update its own term.

Change `ReplicatedLog` impl so that we don't rely on erasure to store "special" messages

Storing "special messages" is a hack currently, MUST fix before releasing.

We store Entry[Cmnd], where type Cmnd = String for example, but sometimes a Snapshot may be contained within the Entry[_]; safety is currently only assured because of matches "because we know" it might be there => must be moved into a typesafe representation.

Delayed messages to self can cause leader to be elected without quorum

Upon receiving an ElectionTimeout message, the Candidate sends a BeginElection message to its clusterSelf.

It's possible that if the BeginElection message is delayed, the Candidate might vote for itself twice.

Specifically, consider the following scenario:

Candidate receives ElectionTimeout message
Candidate sends BeginElection message [which is delayed]
Candidate receives ElectionTimeout message
Candidate sends BeginElection message
Akka dispatcher delivers first BeginElection message
Akka dispatcher delivers second BeginElection message

Upon receiving the BeginElection message, the Candidate increments the number of votes it has received for the current Term (without checking whether it has already voted for itself).

However, the Candidate does not change its current Term without receiving an ElectionTimeout message. Consequently, in the scenario above, the Candidate would vote for itself twice in the same Term.

This scenario would admittedly be triggered very rarely in practice. But I do believe that it would possible to trigger it, especially if the election timeout value was set to a low value; akka's dispatcher doesn't AFAICT provide guarentees on when particular messages are delivered.

Make raft actor names configurable

Actors should take some name pattern from the config. "member-1" is not very descriptive when you have loads of them (or other actors which use this naming ;-)).

Improve auto discovery and cluster resizes in clustered mode

There's an initial idea about using cluster events to add members automatically.
Currently this is not used during the raft-clusters lifetime, but only during initialisation.

Cleanup this code and reuse this mechanism when already initialised.

Simple example continually logging: Consensus for persisted index: 1. (Comitted index: 1)...

I created a simple example using the WordConcatRaftActor.

When I send a command using the RaftClientActor the update appears to work but then the following messages are logged continually:

2014-06-07 10:34:22.265 INFO --- [lt-dispatcher-6] n.c.grainger.main.WordConcatRaftActor : Follower Actor[akka://eventStore/user/raft-member-2#1579065873] took write in term: Term(1), index: 1
2014-06-07 10:34:22.266 INFO --- [lt-dispatcher-6] n.c.grainger.main.WordConcatRaftActor : Consensus for persisted index: 1. (Comitted index: 1)

Any idea why?

Log compaction can cause append to take too many entries

The call to ReplicatedLog.append in the Follower's append code does not take log compaction into account. If a log has been compacted, then taking a number of entries equal to the log index and appending the new entries to that will leave too many extra entries in the log (because the number of entries is not equal to the index of the last log entry in this case).

Update: 2 actors (cluster / raft) make it possible to leak raft "self", where clusterSelf needed, add checks?

Since an raft actor's matchIndex etc are ActorRef => Int we must be warry of not exposing "sometimes the cluster ref and sometimes the raft ref". This makes the raft impl a bit weird at times..

The problem is very visible in: d65a00e#diff-9ad0897621d0099532d800ab73511ac5R57 for example.

Minor review comments

Great work Konrad! I was curious to see how you had used Akka in this interesting project. Wrote a few minor comments. Let me know if you have any questions.

--- a/README.md
+++ b/README.md
@@ -7,6 +7,8 @@ This is an akka based implementation of the Raft consensus algorithm.

 It is akka-cluster (which is _experimental_) aware, and supports the additional features of raft such as: mambership changes and (_not yet_) snapshotting.

+PN: akka-cluster is not experimental
+
 **This is still work in progress and has not been stress tested (athough it is tested on multiple nodes already)**

 Basic info
@@ -24,6 +26,8 @@ class WordConcatRaftActor extends RaftActor {

   var words = ListBuffer[String]()

+PN: val + mutable collection, or better var + immutable collection. I would use var Vector here  
+
   /** 
    * Called when a command is determined by Raft to be safe to apply; 
    * Application results are sent back to the client issuing the command.
@@ -38,6 +42,7 @@ class WordConcatRaftActor extends RaftActor {
     case GetWords =>
       log.info("Replying with {}", words.toList)
       words.toList
+      
   }
 }

@@ -51,6 +56,9 @@ members foreach { _ ! ChangeConfiguration(clusterConfiguration)
 // todo implement re-routing if you send to a non-leader
 // then send messages to it; the state machine will only be applied when consensus has been reached about a value
 leader ! ClientRequest(AppendWord("I"))
+
+PN: perhaps illustrate how the leader ref is created/retrieved?
+
 leader ! ClientRequest(AppendWord("like"))
 leader ! ClientRequest(AppendWord("capybaras"))

diff --git a/src/main/resources/application.conf b/src/main/resources/application.conf
index d6ddc00..8a4f1bb 100644
--- a/src/main/resources/application.conf
+++ b/src/main/resources/application.conf
@@ -1,3 +1,4 @@
+# I think it's dangerous to include an application.conf in a library like this
 akka {
   loglevel = "INFO"
   stdout-loglevel = "INFO"
diff --git a/src/main/resources/reference.conf b/src/main/resources/reference.conf
index df79cf7..cabdb27 100644
--- a/src/main/resources/reference.conf
+++ b/src/main/resources/reference.conf
@@ -19,6 +19,7 @@ akka {

     # When propagating entries among members, AppendEntries can carry multiple log entries.
     # Use this valud to tweak this number as it depends on the characteristics of your Commands.
+#PN: s/valud/value/    
     default-append-entries-batch-size = 5

     # When turned on, will push events like "entry 1 committed" onto the eventStream, mostly designed for testing,
diff --git a/src/main/scala/pl/project13/scala/akka/raft/Candidate.scala b/src/main/scala/pl/project13/scala/akka/raft/Candidate.scala
index 9c74cf1..c2a5819 100644
--- a/src/main/scala/pl/project13/scala/akka/raft/Candidate.scala
+++ b/src/main/scala/pl/project13/scala/akka/raft/Candidate.scala
@@ -18,6 +18,7 @@ private[raft] trait Candidate {
         goto(Follower) using m.forFollower
       } else {
         log.info(s"Initializing election (among ${m.config.members.size} nodes) for ${m.currentTerm}")
+        // PN: log.info("Initializing election (among {} nodes) for {}", m.config.members.size, m.currentTerm) 

         val request = RequestVote(m.currentTerm, self, replicatedLog.lastTerm, replicatedLog.lastIndex)
         m.membersExceptSelf foreach { _ ! request }
diff --git a/src/main/scala/pl/project13/scala/akka/raft/RaftActor.scala b/src/main/scala/pl/project13/scala/akka/raft/RaftActor.scala
index 3f9eed3..3d3a08b 100644
--- a/src/main/scala/pl/project13/scala/akka/raft/RaftActor.scala
+++ b/src/main/scala/pl/project13/scala/akka/raft/RaftActor.scala
@@ -134,10 +134,13 @@ abstract class RaftActor extends Actor with LoggingFSM[RaftState, Metadata]
     require(toMs > fromMs, s"to ($to) must be greater than from ($from) in order to create valid election timeout.")

     (fromMs + Random.nextInt(toMs.toInt - fromMs.toInt)).millis
+    //PN: scala.concurrent.forkjoin.ThreadLocalRandom
   }

   @inline private[raft] def electionTimeoutStillValid(since: Long) = {
     val stillValid = electionTimeoutDieOn < System.currentTimeMillis()
+    //PN: I would use System.nanoTime for measuring durations (currentTimeMillis may jump)
+    //    Here you might find scala.concurrent.duration.Deadline useful

     if (stillValid)
       log.info(s"Timeout reached (since: $since, ago: ${System.currentTimeMillis() - since})")
diff --git a/src/main/scala/pl/project13/scala/akka/raft/cluster/ClusterRaftActor.scala b/src/main/scala/pl/project13/scala/akka/raft/cluster/ClusterRaftActor.scala
index 9eb375d..cbc37ba 100644
--- a/src/main/scala/pl/project13/scala/akka/raft/cluster/ClusterRaftActor.scala
+++ b/src/main/scala/pl/project13/scala/akka/raft/cluster/ClusterRaftActor.scala
@@ -38,6 +38,7 @@ trait ClusterRaftActor extends RaftActor {
     case MemberUp(member) =>
         log.info("Node is Up: {}, selecting and adding actors to Raft cluster..", member.address)
         val memberSelection = context.actorSelection(RootActorPath(member.address) / "user" / "member-*")
+        //PN: make the name configurable?
         memberSelection ! Identify(member.address)

     case ActorIdentity(address, Some(raftActorRef)) =>
@@ -51,10 +52,12 @@ trait ClusterRaftActor extends RaftActor {
     case UnreachableMember(member) =>
       log.info("Node detected as unreachable: {}", member)
       // todo remove from raft ???
+      // PN: perhaps, but note that it can become reachable again

     case MemberRemoved(member, previousStatus) =>
       log.info("Member is Removed: {} after {}", member.address, previousStatus)
       // todo remove from raft ???
+      // PN: yes, but note that an ActorIdentity for an removed member might come in after MemberRemoved

     case _: MemberEvent =>
       // ignore
diff --git a/src/main/scala/pl/project13/scala/akka/raft/example/cluster/WordConcatClusterApp.scala b/src/main/scala/pl/project13/scala/akka/raft/example/cluster/WordConcatClusterApp.scala
index bda9dfd..129b097 100644
--- a/src/main/scala/pl/project13/scala/akka/raft/example/cluster/WordConcatClusterApp.scala
+++ b/src/main/scala/pl/project13/scala/akka/raft/example/cluster/WordConcatClusterApp.scala
@@ -14,7 +14,9 @@ object WordConcatClusterApp extends App {
   val system = ActorSystem("RaftSystem", config)

   val member = system.actorOf(Props(classOf[WordConcatClusterRaftActor]))
+  //PN: wasn't the expected name "member-*" ?

   Cluster(system).subscribe(member)
+  //PN: remove, subscribe is done inside ClusterRaftActor

 }
diff --git a/src/main/scala/pl/project13/scala/akka/raft/example/cluster/WordConcatClusterRaftActor.scala b/src/main/scala/pl/project13/scala/akka/raft/example/cluster/WordConcatClusterRaftActor.scala
index eff52f6..cfbb1e6 100644
--- a/src/main/scala/pl/project13/scala/akka/raft/example/cluster/WordConcatClusterRaftActor.scala
+++ b/src/main/scala/pl/project13/scala/akka/raft/example/cluster/WordConcatClusterRaftActor.scala
@@ -10,6 +10,7 @@ class WordConcatClusterRaftActor extends ClusterRaftActor {
    type Command = Cmnd

    var words = ListBuffer[String]()
+   // PN: I would use a Vector

    /** Called when a command is determined by Raft to be safe to apply */
    def apply = {
diff --git a/src/multi-jvm/scala/pl/project13/scala/akka/raft/cluster/ClusterWithManyMembersOnEachNodeElectionSpec.scala b/src/multi-jvm/scala/pl/project13/scala/akka/raft/cluster/ClusterWithManyMembersOnEachNodeElectionSpec.scala
index f4fda3d..e99a608 100644
--- a/src/multi-jvm/scala/pl/project13/scala/akka/raft/cluster/ClusterWithManyMembersOnEachNodeElectionSpec.scala
+++ b/src/multi-jvm/scala/pl/project13/scala/akka/raft/cluster/ClusterWithManyMembersOnEachNodeElectionSpec.scala
@@ -42,6 +42,9 @@ abstract class ClusterWithManyMembersOnEachNodeElectionSpec extends RaftClusterS
         system.actorOf(Props[WordConcatClusterRaftActor], s"member-$idx")
       }
     }
+    //PN: I think you should start the raft actors first, and then have a barrier,
+    //    otherwise there is a risk that some node receives MemberUp and sends Identify
+    //    before the target actor is started

     // start additional members
     runOn(first) {
--

Use akka-persistence in order to persist log

The replicatedLog could be persisted using akka-persitence.

Should reject AppendEntries if term < currentTerm

Whoops, nevermind:

https://github.com/ktoso/akka-raft/blob/master/src/main/scala/pl/project13/scala/akka/raft/Follower.scala#L81

Nodes should not forget who they voted for

When a Candidate receives an AppendEntries message from a leader in the same term, it correctly steps down to Follower.

However, it currently forgets who it voted for that term.

This can lead to two leaders being elected in the same term: the Candidate that stepped down may now vote for another Candidate who will eventually become the second leader.

This behavior can occur even after merging in #55.

I have a test case that triggers this behavior. Here are the messages delivered to trigger this bug [format is sender,receiver,message, and the cluster has 4 nodes in it]:

MsgEvent(deadLetters,raft-member-4,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-2, raft-member-3, raft-member-4))))
MsgEvent(deadLetters,raft-member-1,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-2, raft-member-3, raft-member-4))))
MsgEvent(deadLetters,raft-member-1,Timer(election-timer,ElectionTimeout,false,1))
MsgEvent(deadLetters,raft-member-4,Timer(election-timer,ElectionTimeout,false,1))
MsgEvent(deadLetters,raft-member-3,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-2, raft-member-3, raft-member-4))))
MsgEvent(deadLetters,raft-member-1,Timer(election-timer,ElectionTimeout,false,3))
MsgEvent(deadLetters,raft-member-4,Timer(election-timer,ElectionTimeout,false,3))
MsgEvent(raft-member-4,raft-member-4,BeginElection)
MsgEvent(deadLetters,raft-member-2,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-2, raft-member-3, raft-member-4))))
MsgEvent(raft-member-4,raft-member-1,RequestVote(Term(2),Actor[akka://new-system-0/user/raft-member-4#-1663394226],Term(0),0))
MsgEvent(deadLetters,raft-member-3,Timer(election-timer,ElectionTimeout,false,1))
MsgEvent(deadLetters,raft-member-2,Timer(election-timer,ElectionTimeout,false,1))
MsgEvent(deadLetters,raft-member-3,Timer(election-timer,ElectionTimeout,false,3))
MsgEvent(raft-member-3,raft-member-3,BeginElection)
MsgEvent(raft-member-3,raft-member-3,BeginElection)
MsgEvent(raft-member-4,raft-member-4,BeginElection)
MsgEvent(deadLetters,raft-member-2,Timer(election-timer,ElectionTimeout,false,3))
MsgEvent(raft-member-1,raft-member-4,VoteCandidate(Term(2)))
MsgEvent(raft-member-4,raft-member-2,RequestVote(Term(2),Actor[akka://new-system-0/user/raft-member-4#-1663394226],Term(0),0))
MsgEvent(raft-member-2,raft-member-4,VoteCandidate(Term(2)))
MsgEvent(raft-member-4,raft-member-4,ElectedAsLeader)
MsgEvent(raft-member-4,raft-member-1,AppendEntries(term:Term(2),prevLog:(Term(1),1),entries:List()))
MsgEvent(raft-member-3,raft-member-1,RequestVote(Term(2),Actor[akka://new-system-0/user/raft-member-3#-463317739],Term(0),0))
MsgEvent(raft-member-4,raft-member-2,AppendEntries(term:Term(2),prevLog:(Term(1),1),entries:List()))
MsgEvent(raft-member-1,raft-member-3,VoteCandidate(Term(2)))
MsgEvent(raft-member-3,raft-member-2,RequestVote(Term(2),Actor[akka://new-system-0/user/raft-member-3#-463317739],Term(0),0))
MsgEvent(raft-member-2,raft-member-3,VoteCandidate(Term(2)))

At this point, both raft-member-3 and raft-member-4 are leaders for Term(2).

Let me know if you want more detailed reproducing steps.

LogCompactation / Snapshots

Should be similar to akka persistence, in order to integrate https://github.com/ktoso/akka-persistence-hbase

See §5.3 in https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf

Not all states handle all protocol messages

There are a few instances in which a server in a given state will not handle a given protocol message. Usually these messages would not be sent to servers in such states, but if a message is delayed long enough for the recipient to change states before receiving it, the message might arrive in that later state.

Instances found:

Follower does not handle VoteCandidate, DeclineCandidate, AppendSuccessful, or AppendRejected
Leader does not handle RequestVote
Candidate does not handle AppendSuccessful or AppendRejected

Simple clustered code example?

Do you have a simple example that shows how to use Akka-raft in an Akka cluster?

verify flow when catching up from "very behind"

Follower does not check consistency of AppendEntries

If a follower has fallen very far behind, it might be the case that the leader sends it a batch of messages that starts past the end of the follower's log. In this case, the follower should reject the request, and the leader should send a new request starting at an earlier point in the log.

See section 5.3 (towards the end of that section) of the tech report for more details.

Support cluster node roles

In a large cluster you probably want to run raft on a few nodes, not all. Let ClusterRaftActor only care about nodes with a specified cluster role.
http://doc.akka.io/docs/akka/2.3.0-RC2/scala/cluster-usage.html#Node_Roles

Distributed "like zookeeper" example

Two Leaders Elected in the Same Term

Hello,

I have a test case that, as far as I can tell, causes akka-raft to violate Raft's "Election Safety" property (see Figure 3 from the paper), i.e. it appears that two leaders are elected for the same term.

The test case consists of the following external events:

Start 9 RaftActors, named "raft-member-{1-9}".
Bootstrap all but one of the RaftActors, i.e. send them ChangeConfiguration messages
containing ActorRefs for all 9 RaftActors. raft-member-8 does not receive a ChangeConfiguration message.

Upon running the test, I see the following in the console output:

[INFO] [04/24/2015 23:52:24.490] [new-system-0-akka.actor.default-dispatcher-5] [akka://new-system-0/user/raft-member-9] Initializing election (among 9 nodes) for Term(2)
[INFO] [04/24/2015 23:52:24.492] [new-system-0-akka.actor.default-dispatcher-7] [akka://new-system-0/user/raft-member-7] Initializing election (among 9 nodes) for Term(2)
...
[INFO] [04/24/2015 23:52:24.497] [new-system-0-akka.actor.default-dispatcher-5] [akka://new-system-0/user/raft-member-7] Received vote by Actor[akka://new-system-0/user/raft-member-4#-2022599620]; Won election with 5 of 9 votes
[INFO] [04/24/2015 23:52:24.496] [new-system-0-akka.actor.default-dispatcher-5] [akka://new-system-0/user/raft-member-9] Received vote by Actor[akka://new-system-0/user/raft-member-2#-1430451575]; Won election with 5 of 9 votes

For what it's worth, rather than inspecting the console output to detect this bug, we took a distributed snapshot of all RaftActor's states and found that in the same snapshot raft-member-9 is in state
LeaderMeta(Actor[akka://new-system-0/user/raft-member-9],Term(2)) while
raft-member-7 is in state LeaderMeta(Actor[akka://new-system-0/user/raft-member-7],Term(2)).

In our failing execution, we have the akka runtime deliver 27 total messages, including the 8 ChangeConfiguration messages. The delivery order is as follows (format is sender,receiver,message):

null,raft-member-7,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-7, raft-member-2, raft-member-6, raft-member-5, raft-member-9, raft-member-8, raft-member-3, raft-member-4)))
null,raft-member-4,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-7, raft-member-2, raft-member-6, raft-member-5, raft-member-9, raft-member-8, raft-member-3, raft-member-4)))
null,raft-member-2,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-7, raft-member-2, raft-member-6, raft-member-5, raft-member-9, raft-member-8, raft-member-3, raft-member-4)))
null,raft-member-1,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-7, raft-member-2, raft-member-6, raft-member-5, raft-member-9, raft-member-8, raft-member-3, raft-member-4)))
null,raft-member-9,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-7, raft-member-2, raft-member-6, raft-member-5, raft-member-9, raft-member-8, raft-member-3, raft-member-4)))
null,raft-member-6,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-7, raft-member-2, raft-member-6, raft-member-5, raft-member-9, raft-member-8, raft-member-3, raft-member-4)))
raft-member-9,raft-member-9,BeginElection
null,raft-member-3,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-7, raft-member-2, raft-member-6, raft-member-5, raft-member-9, raft-member-8, raft-member-3, raft-member-4))
raft-member-9,raft-member-1,(RequestVote,Term(1),raft-member-9,Term(0),0)
raft-member-9,raft-member-2,(RequestVote,Term(1),raft-member-9,Term(0),0)
raft-member-9,raft-member-6,(RequestVote,Term(1),raft-member-9,Term(0),0)
raft-member-1,raft-member-9,VoteCandidate(Term(0))
raft-member-6,raft-member-9,VoteCandidate(Term(0))
null,raft-member-5,ChangeConfiguration(StableRaftConfiguration(Set(raft-member-1, raft-member-7, raft-member-2, raft-member-6, raft-member-5, raft-member-9, raft-member-8, raft-member-3, raft-member-4)))
raft-member-9,raft-member-9,BeginElection
raft-member-9,raft-member-3,(RequestVote,Term(2),raft-member-9,Term(0),0)
raft-member-7,raft-member-7,BeginElection
raft-member-7,raft-member-1,(RequestVote,Term(2),raft-member-7,Term(0),0)
raft-member-1,raft-member-1,BeginElection
raft-member-7,raft-member-5,(RequestVote,Term(2),raft-member-7,Term(0),0)
raft-member-7,raft-member-4,(RequestVote,Term(2),raft-member-7,Term(0),0)
raft-member-3,raft-member-9,VoteCandidate(Term(2))
raft-member-7,raft-member-7,BeginElection
raft-member-1,raft-member-7,VoteCandidate(Term(2))
raft-member-5,raft-member-7,VoteCandidate(Term(1))
raft-member-2,raft-member-9,VoteCandidate(Term(0))
raft-member-4,raft-member-7,VoteCandidate(Term(1))

Based on that delivery order, it appears that raft-member-9 receives votes from raft-member-{1,6,3,2,9}, and raft-member-7 receives votes from raft-member-{1,5,4,7}. A few things are strange about this: the votes received by raft-member-9 appear to be from different Terms; and raft-member-7 does not actually receive a quorum of votes (regardless of Term). I'm not exactly sure what the root cause is here.

We made akka's message scheduler deterministic so that you can easily reproduce the bug for yourself. Steps to reproduce:

$ git clone -b raft-leader-safety [email protected]:NetSys/sts2-applications.git
$ cd sts2-applications
$ git remote add interposition [email protected]:NetSys/sts2-interposition.git
$ git subtree pull --prefix=interposition interposition master
$ git clone [email protected]:NetSys/sts2-experiments.git experiments
$ sbt assembly
$ java -d64 -Xmx15g -cp target/scala-2.11/randomSearch-assembly-0.1.jar Main 2>&1 | tee console.out

From there you should be able to add logging statements and continue replaying as many times as needed.

We made a few small changes to akka-raft to generate this test case:

We
use a seeded random number generator to make the execution
deterministic.
We added a receive() override to RaftActor.scala to allow us to take
distributed snapshots.
We modified
Follower.scala to always begin an election
rather than check if electionDeadline.isOverdue(). This is again to make
the execution deterministic (since electionDeadline.isOverdue() calls
gettimeofday), but that shouldn't affect correctness as far as I can
tell.

Let me know if you have any questions about how we ran this test.

Thanks!
-Colin

Follower does not reset election deadline after voting for a candidate

Implement joint consensus on configuration changes

Should implement joint consensus on cluster configuration to remedy this, see §6 in https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf

Would get better performance if akka-raft used udp

It looks like akka-raft is configured by default to use TCP between nodes.

I would bet that the implementation could get notably better performance if it used UDP, by avoiding all of the extra network delays implied by TCP. The raft protocol should be resilient to message reorderings, duplicates, drops, etc.

Upgrade sbt built to sbt 0.13

upgrade sbt
find the right multi-jvm plugin version

All nodes should transition to Follower if term > currentTerm

From the raft paper:

• If RPC request or response contains term T > currentTerm:
set currentTerm = T, convert to follower (§5.1)

This is not implemented in all of the akka-raft states for all messages. It does seem to be implemented for AppendEntries (Follower, Leader, Candidate) but there are still three remaining issues:

It is not implemented for other message types.
Upon receiving AppendEntries from a higher term in Candidate state, the actor does convert to Follower state, but it does not update its Term
Upen receiving AppendEntries from a higher term in Leader state, the actor does convert to Follower state, but it does not update its Term

LeaderTest starts log at wrong index

This test puts the first entry in the log at index 1, even though the log really starts at index 0. Again, this seems related to the inconsistent use of index numbers.

RPC response not always sent

There is a number of cases in which a server does not send a response to an RPC request (even though every request should receive a response). Some of the cases I've seen:

Follower does not respond to heartbeats
Candidate does not respond to AppendEntries from a lagging leader
Leader does not respond to RequestVote
Leader does not respond to AppendEntries from a leader with a later term

Fail nicely, when the client's apply() is not definedAt

indexOnMajority is wrong

Instead of calculating the newest index that a majority of the servers have in their logs, indexOnMajority appears to calculate something like the "mode" of matching indices: it returns the index that the largest number of members have as their match index.

For example, say that a cluster of five servers have match indices 1, 1, 2, 3, and 4. The majority of servers have at least entry 2 in their logs, but this function will return 1 because more servers have 1 as their match index.

I think this algorithm (pseudocode) should work instead:
sortedIndices = sortDescending(matchIndices)
return sortedIndices[ceiling(length(config.members) / 2) - 1]

Inconsistent use of index numbers

The index numbers used in the code are confusing and seem to be inconsistent. The indices in the log will start at 0. However, an AppendEntries with an empty list of entries to send will indicate its last index as 1, which would actually be the second index in the log (I would expect this to be 0). I seem to recall similar issues, but can't seem to find them now.

The code should be consistent on what the start index of the log is, and what number indicates "this is before the start of the log". The tech report uses 0 as a special index value to indicate a point before the beginning of the log and starts the log at index 1, but -1 and 0 could be used equally well. Whichever way is used should be documented somewhere in the code.

Empty batch is sent with term 1/index 1

In AppendEntries.apply, if a batch is empty (e.g. because the follower is completely caught up and we're sending a heartbeat), the message is sent with prevLogTerm 1 and prevLogIndex 1. This is not necessarily correct; the numbers should be based off of what is actually at the end of the log.

Crash: "Unable to find log entry at index 2"

I found the following stacktrace while fuzz testing akka-raft. Unfortunately I do not have a reproducible test case for this one:

[ERROR] [03/12/2015 15:04:50.367] [new-system-0-akka.actor.default-dispatcher-12] [akka://new-system-0/user/raft-member-3] Unable to find log entry at index 2
java.lang.RuntimeException: Unable to find log entry at index 2
        at pl.project13.scala.akka.raft.model.ReplicatedLog$$anonfun$termAt$2.apply(ReplicatedLog.scala:107)
        at pl.project13.scala.akka.raft.model.ReplicatedLog$$anonfun$termAt$2.apply(ReplicatedLog.scala:107)
        at scala.Option.getOrElse(Option.scala:120)
        at pl.project13.scala.akka.raft.model.ReplicatedLog.termAt(ReplicatedLog.scala:107)
        at pl.project13.scala.akka.raft.protocol.RaftProtocol$AppendEntries$.apply(RaftProtocol.scala:49)
        at pl.project13.scala.akka.raft.Leader$$anonfun$replicateLog$1.apply(Leader.scala:118)
        at pl.project13.scala.akka.raft.Leader$$anonfun$replicateLog$1.apply(Leader.scala:114)
        at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:322)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:978)
        at pl.project13.scala.akka.raft.Leader$class.replicateLog(Leader.scala:114)
        at pl.project13.scala.akka.raft.RaftActor.replicateLog(RaftActor.scala:13)
        at pl.project13.scala.akka.raft.Leader$$anonfun$1.applyOrElse(Leader.scala:43)
        at pl.project13.scala.akka.raft.Leader$$anonfun$1.applyOrElse(Leader.scala:16)
        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
        at akka.actor.FSM$class.processEvent(FSM.scala:604)
        at pl.project13.scala.akka.raft.RaftActor.akka$actor$LoggingFSM$$super$processEvent(RaftActor.scala:13)
        at akka.actor.LoggingFSM$class.processEvent(FSM.scala:734)
        at pl.project13.scala.akka.raft.RaftActor.processEvent(RaftActor.scala:13)
        at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:598)
        at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:592)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at pl.project13.scala.akka.raft.RaftActor.aroundReceive(RaftActor.scala:13)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Investigate ordering issue while replay

Wrong project ;)

verify our logic behind the "did the right timeout trigger" makes sense IRL

General cleanup before 1.0

Todo:

Update Entry[] to be explicitly aware of "special" entries, such as snapshots etc. These should not be applied to the
Cleanup log replication - there's two places we could replicate entries from; make it only rely on the heartbeat = less noise in the network.
Increase the time for heartbeat - 50ms is way too often. Although ok for testing locally

Election timeout reset twice on Follower->Candidate transition

When a follower transitions to the candidate state because of an election timeout, the server resets its election timeout two times: once in onTransition, and once in beginElection. This isn't a correctness issue per se, but it's at least a code smell that indicates there should be a more consistent set of rules for timeout resets.

Raft based auto-downing for the cluster

Would be awesome.

Members should not respond with Vote if sender is not in Configuration

Suppose there are 5 nodes, labeled raft-member-{1-5}.

In the beginning, we send ChangeConfiguration(raft-member-1, raft-member-2, raft-member-3) [bootstrap] messages to raft-member-{1-3}.

Later, we add two nodes raft-member-{4-5}. We send new ChangeConfiguration(raft-member-1, raft-member-2, raft-member-3, raft-member-4, raft-member-5) messages to all 5 members, except, the messages arrive to raft-member-{4-5} much more quickly than they arrive to raft-member-{1-3}.

The current behavior in this scenario is that raft-member-4 or raft-member-5 will start an election, and send RequestVote messages to all members. Strangely, raft-member-{1-3} respond with votes (even though they are not aware of the existence of raft-member-{4-5})! This seems broken, as it can cause multiple leaders to be elected for the same term.

I realize this is probably not the proper way to initiate joint consensus, so maybe the fix is just to document the proper way to do so? Incidentally, how is joint consensus supposed to be triggered?

Thanks!

akka-cluster awareness

Incorrect next index calculation for log

If I'm reading the code correctly, nextIndex is based on the number of entries in the log. This won't work when log compaction is used, because a single compacted log entry counts for multiple index numbers.

Servers should check log consistency before granting vote

Votes should only be granted when the candidate's log is at least as up-to-date as the voting server's log. See section 5.4.1 of the tech report.

Follower should be able to vote more than once in a term

Because Raft RPCs are idempotent, Followers should be able to vote for the same candidate more than once in a given term. This allows the candidate to receive a response in case the first response was dropped.

Follower does not commit entries on receiving heartbeat

If no new requests come in for a long time, it might be that the only way a follower knows to commit a request is to do so on a heartbeat message, so the commit logic should happen there, too.

Really, there should be no difference in the processing for heartbeats and non-heartbeats. A heartbeat is just an AppendEntries in which the list of new entries is empty, so an empty list is appended to the log.

Candidate should check term in VoteCandidate messages

Candidates currently don't check that a VoteCandidate message comes from a server in the expected term. This could allow a candidate to be elected by votes from a previous term whose delivery was severely delayed.

Follower should not vote if Candidate has out-of-date log

The raft paper states that a node should only grant a vote to a Candidate if two conditions hold:

the follower hasn't already voted in its current Term (checked here)
The candidate’s log is at least as up-to-date as receiver’s log

The second condition is not currently checked by either the Follower or the Candidate states. In other words, the "lastLogIndex" and "lasLogTerm" fields of the RequestVote message are never read.

Candidate should have election restart timeout

In the unlikely event that no candidate receives the majority vote for a given term, the candidates should timeout and start a new election. Currently there is no timeout, so eventually all servers in the cluster would transition to the Candidate state for the same term and get stuck.

termAt can return wrong term for first item

The termAt method will return the wrong term in the case where the first entry in the log actually comes from a term other than term 1 (could happen if multiple elections occur before a request is sent). This is related to the inconsistent use of index numbers - the test should be for an index less than 0, not less than or equal to.

Stress and long running tests

Implement some long running tests using multi-jvm tests.
Also, make sure we can handle big numbers of nodes / members.

Candidate does not distinguish between votes from same follower

The Candidate state does not distinguish between multiple votes in the same term from the same follower, although it should. Otherwise, a candidate might get elected by the same follower (this will only happen if bug #29 is fixed as well to allow multiple votes from same follower).

Simplify usage of akka-raft - RaftClientActor

Simplify usage so that the client does not need to care too much about "who is the leader right now?", when sending messages to raft to process them