lalithsuresh / rapid

Rapid is a scalable distributed membership service

License: Apache License 2.0

Language: Java (100%)
Topics: distributed-systems, membership-management, failure-detection, strong-consistency

rapid's Introduction


What is Rapid?

Rapid is a distributed membership service. It allows a set of processes to easily form clusters and receive notifications when the membership changes.

We observe that datacenter failure scenarios are not always crash failures, but commonly involve misconfigured firewalls, one-way connectivity loss, flip-flops in reachability, and some-but-not-all packets being dropped. However, existing membership solutions struggle with these common failure scenarios, despite being able to cleanly detect crash faults. In particular, existing tools take a long time to converge, or never converge, to a stable state where the faulty processes are removed.

To address the above challenge, we present Rapid, a scalable, distributed membership system that is stable in the face of a diverse range of failure scenarios, and provides participating processes a strongly consistent view of the system's membership.

How does Rapid work?

Rapid achieves its goals through the following three building blocks:

  • Expander-based monitoring edge overlay. To scale monitoring load, Rapid organizes a set of processes (a configuration) into a stable failure-detection topology comprising observers that monitor and disseminate reports about their communication edges to their subjects. The monitoring relationships between processes form a directed expander graph with strong connectivity properties, which ensures with high probability that healthy processes detect failures. We interpret multiple reports about a subject's edges as a high-fidelity signal that the subject is faulty.

  • Multi-process cut detection. For stability, processes in Rapid (i) suspect a faulty process p only upon receiving alerts from multiple observers of p, and (ii) delay acting on alerts about different processes until the churn stabilizes, thereby converging to detect a global, possibly multi-node cut of processes to add or remove from the membership. This filter is remarkably simple to implement, yet it suffices by itself to achieve almost-everywhere agreement -- unanimity among a large fraction of processes about the detected cut.

  • Practical consensus. For consistency, we show that converting almost-everywhere agreement into full agreement is practical even in large-scale settings. Rapid drives configuration changes using a low-overhead, leaderless consensus protocol in the common case: every process simply validates consensus by counting the number of identical cut detections it receives. If there is a quorum containing three-quarters of the membership set with the same cut, then, without a leader or further communication, this is a safe consensus decision.
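The fast-path decision rule above can be sketched as a simple count. The class and method names below are illustrative only (they are not Rapid's API); each process is assumed to have collected the cut proposals it received from its peers:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Sketch of Rapid's fast path: a cut proposal becomes a safe consensus
// decision once identical proposals from at least 3/4 of the membership
// have been observed. No leader, no extra communication rounds.
public class FastPathQuorum {
    // Returns the agreed-upon cut if some proposal reaches the 3/4 quorum.
    static Optional<Set<String>> decide(List<Set<String>> proposals, int clusterSize) {
        int quorum = (int) Math.ceil(clusterSize * 3.0 / 4.0);
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> proposal : proposals) {
            counts.merge(proposal, 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= quorum)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // 7 of 9 processes report the same cut; quorum is ceil(9 * 3/4) = 7.
        List<Set<String>> proposals = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            proposals.add(Set.of("127.0.0.1:1235"));
        }
        System.out.println(decide(proposals, 9).isPresent());  // prints true
    }
}
```

If no proposal reaches the quorum, Rapid falls back to a classic consensus round instead; that fallback path is not shown here.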

Pluggable failure detectors

A powerful feature of Rapid is that it allows users to plug in custom failure detectors. Users tell Rapid how an observer o decides that its monitoring edge to a subject s is faulty by implementing a simple interface (IEdgeFailureDetectorFactory). Rapid then builds the expander-based monitoring overlay using the user-supplied template for a monitoring edge.
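To illustrate the shape of such a plug-in, here is a toy sketch. Note that Rapid's real IEdgeFailureDetectorFactory has a different signature (it works with Rapid's endpoint and callback types); the simplified interfaces below are stand-ins used purely for illustration:

```java
// Illustrative stand-ins for Rapid's pluggable failure-detector interfaces;
// the real IEdgeFailureDetectorFactory differs in its exact signature.
interface EdgeFailureDetector {
    void checkMonitoree();  // probe the subject once; report if it looks faulty
}

interface EdgeFailureDetectorFactory {
    // One detector instance per monitoring edge (observer -> subject).
    EdgeFailureDetector create(String subject, Runnable notifyFailure);
}

// A toy detector that declares the edge faulty after N consecutive failed probes.
class CountingDetectorFactory implements EdgeFailureDetectorFactory {
    private final int threshold;

    CountingDetectorFactory(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public EdgeFailureDetector create(String subject, Runnable notifyFailure) {
        return new EdgeFailureDetector() {
            private int consecutiveFailures = 0;

            @Override
            public void checkMonitoree() {
                boolean ok = probe(subject);  // e.g. a ping or heartbeat check
                consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
                if (consecutiveFailures >= threshold) {
                    notifyFailure.run();      // alert the membership layer about this edge
                }
            }
        };
    }

    // Placeholder; a real detector would actually contact the subject.
    boolean probe(String subject) {
        return true;
    }
}
```

The key idea carries over to the real interface: the factory is a template for one monitoring edge, and Rapid instantiates it once per observer-subject pair in the overlay.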

Pluggable messaging

When embedding a membership service in a larger system, there is no reason for the membership service to use its own messaging implementation. For this reason, Rapid also allows users to plug in their own messaging implementations by implementing two interfaces, IMessagingClient and IMessagingServer. For example usage, have a look at examples/src/main/.../AgentWithNettyMessaging.java.
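As a rough illustration of the split between the two roles, here is a minimal in-memory transport. The interfaces below are simplified stand-ins: Rapid's actual IMessagingClient and IMessagingServer operate on Rapid's protobuf request/response types, not plain strings:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Simplified stand-ins for Rapid's IMessagingServer / IMessagingClient;
// the real interfaces use Rapid's protobuf message types.
interface MessagingServer {
    void start(Function<String, String> handler);  // install the request handler
}

interface MessagingClient {
    String sendMessage(String destination, String request);  // request/response call
}

// A trivial in-process transport: servers register their handler in a shared
// registry, and clients look up the destination by address instead of opening
// sockets. Useful for tests; a real implementation would wrap gRPC or Netty.
class InMemoryTransport {
    private final Map<String, Function<String, String>> registry = new ConcurrentHashMap<>();

    MessagingServer serverAt(String address) {
        return handler -> registry.put(address, handler);
    }

    MessagingClient client() {
        return (destination, request) -> {
            Function<String, String> handler = registry.get(destination);
            if (handler == null) {
                throw new IllegalStateException("unreachable: " + destination);
            }
            return handler.apply(request);
        };
    }
}
```

The point of the abstraction is that the membership layer only ever sees the two interfaces, so the same protocol logic runs unchanged over gRPC, Netty, or an in-memory harness.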

Where can I read more?

We suggest you start with our USENIX ATC 2018 paper. The paper and an accompanying tech report are both available in the docs folder.

How do I use Rapid?

Clone this repository and install rapid into your local maven repository:

   $: mvn install

If your project uses maven, add the following dependency into your project's pom.xml:

  <dependency>
     <groupId>com.github.lalithsuresh</groupId>
     <artifactId>rapid</artifactId>
     <version>0.8.0</version>
  </dependency>

For a simple example project that uses Rapid's APIs, see examples/.

Running Rapid

For the following steps, ensure that you've built or installed Rapid:

  $: mvn package  # or mvn install

To launch a simple Rapid-based agent, run the following commands in your shell from the top-level directory:

  $: java -jar examples/target/standalone-agent.jar \
          --listenAddress 127.0.0.1:1234 \
          --seedAddress 127.0.0.1:1234

From two other terminals, try adding a few more nodes on different listening addresses, but using the same seed address of "127.0.0.1:1234". For example:

  $: java -jar examples/target/standalone-agent.jar \
          --listenAddress 127.0.0.1:1235 \
          --seedAddress 127.0.0.1:1234

  $: java -jar examples/target/standalone-agent.jar \
          --listenAddress 127.0.0.1:1236 \
          --seedAddress 127.0.0.1:1234

Or use the following script to start multiple agents in the background that bootstrap via node 127.0.0.1:1234.

  #! /bin/bash
  for each in `seq 1235 1245`;
  do
        java -jar examples/target/standalone-agent.jar \
             --listenAddress 127.0.0.1:$each \
             --seedAddress 127.0.0.1:1234 &> /tmp/rapid.$each &
  done

To run the AgentWithNettyMessaging example, replace standalone-agent.jar in the above commands with netty-based-agent.jar.

rapid's People

Contributors

dependabot[bot], lalithsuresh, manuelbernhardt


rapid's Issues

consensus liveness of the implementation

Hi, I'm diving into the implementation of the rapid membership protocol, and I have a question about the consensus liveness. As described below:

There are 9 nodes in the cluster, so Fast Paxos needs at least 7 votes. Let's assume that 3 nodes are unreachable, which means every consensus round will fall back to classic Paxos. Now:

  1. one of the nodes (let's say node A) starts classic Paxos because of a fast-round timeout; assume that node A has the
    largest node index among the remaining 6 nodes
  2. node A sends phase1a messages to the other nodes
  3. the other nodes handle the phase1a message and set their rnd to Rank(2, A.NodeIndex)
  4. now node A crashes for some unknown reason, and we have 5 nodes left in the cluster
  5. the remaining 5 nodes continue classic Paxos, but no one accepts any phase1a message at this point, because every node's rnd is already larger, and the consensus gets stuck

Am I misunderstanding something?

Partition detection

Summarizing the discussion I've had with @lalithsuresh per e-mail on this topic so far:

In the event of a 2-way network partition, consensus can be reached on the majority side but not on the minority side. Note that in order to do so reliably, the reinforcement mechanism described in the last paragraph of section 4.2 of the paper needs to be implemented.

On the minority side, no consensus can be reached. I currently have two ideas that would allow us to make progress (where progress means informing the application that we are out).

Consensus timeout

After Fast Paxos times out and classic Paxos times out, conclude that we can't reach consensus fast enough and leave. Note that I am not yet sure that even with the reinforcement mechanism as described in the paper there'd be a way to reach a cut detection under those circumstances.

Extend the reinforcement mechanism to include partition detection

I think it may be possible to evolve the reinforcement mechanisms so that:

  • the timeout starts to count down only once there have not been any new alerts received as a whole (and not on a per-subject basis)
  • all reinforced-REMOVEs are broadcasted at once. This broadcast retries alerts and/or has a higher timeout (i.e. we really would like to get a good sense of who else is there with us)
  • when it broadcasts the reinforced-REMOVEs, it keeps a count of how many responses it gets / how many requests made it through
  • if a majority of nodes can be reached, proceed as usual
  • if only a minority can be reached, infer / detect that there has been a partition and that we are in the minority side

I think there could be a way for the system to detect that there has been a partition and go into "partition consensus" mode. For example, I think that by counting the reinforced-REMOVEs it would be possible to confirm that there are only as many nodes left on the same side of the partition as inferred by the broadcast count. I.e. if I broadcast to m nodes and I also receive at most m reinforced-REMOVE alerts, then that should give me high confidence that we're the only ones left on that side. If I receive more than m, then a two-way partition cannot be assumed. (Note that I think reinforced-REMOVE needs to become its own alert type to be able to distinguish late REMOVEs from reinforced ones.)

Once a minority partition has been detected the minority side can proceed to shut down.

I have not yet thought this through for 3-way partitions.
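The counting argument above can be sketched roughly as follows. This is purely a sketch of the idea under discussion, not part of Rapid, and all names are hypothetical:

```java
// Sketch of the partition-detection idea discussed above (not part of Rapid):
// after broadcasting reinforced-REMOVEs, a node compares how many peers it
// reached against the full membership to classify which side of a suspected
// partition it is on.
public class PartitionGuess {
    enum Side { MAJORITY, MINORITY, UNKNOWN }

    // reached: peers that answered the reinforced-REMOVE broadcast (excluding self)
    // reinforcedRemovesSeen: reinforced-REMOVE alerts received from other nodes
    static Side classify(int clusterSize, int reached, int reinforcedRemovesSeen) {
        int reachableIncludingSelf = reached + 1;
        if (reachableIncludingSelf > clusterSize / 2) {
            return Side.MAJORITY;   // a majority is reachable: proceed as usual
        }
        // Minority is only inferred when the alert count is consistent with a
        // clean two-way split: no more reinforced-REMOVEs than peers reached.
        if (reinforcedRemovesSeen <= reached) {
            return Side.MINORITY;   // this side can proceed to shut down
        }
        return Side.UNKNOWN;        // e.g. a 3-way partition; cannot conclude
    }
}
```

For a 9-node cluster: reaching 5 peers implies the majority side; reaching 3 peers with at most 3 reinforced-REMOVE alerts implies the minority side; reaching 3 peers but seeing 5 alerts rules out a clean two-way split.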

Exception while using the Rapid Library

Hi Lalith,

I'm trying to use Rapid for membership information. Trying to execute the StandaloneAgent example provided in the examples directory. Below are the steps I'm executing.

Step 1: On Terminal1, I executed the following command

$>java -cp standalone-agent.jar:. com.test.StandaloneAgent --listenAddress 127.0.0.1:1234 --seedAddress 127.0.0.1:1234

Output I'm seeing as follows

[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 1

Step 2: Opened Terminal2 and executed the following command

$>java -cp standalone-agent.jar:. com.test.StandaloneAgent --listenAddress 127.0.0.1:1235 --seedAddress 127.0.0.1:1234

Output:

Terminal1

[main] INFO com.vrg.rapid.Cluster - 127.0.0.1:1235 is sending a join-p2 to 127.0.0.1:1234 for config 3713649891269931577
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1235 -- cluster size 2
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1235 -- cluster size 2
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1235 -- cluster size 2

and the output on Terminal2 is as follows

[protocol-127.0.0.1:1234-0] INFO com.vrg.rapid.MembershipService - Join at seed for {seed:127.0.0.1:1234, sender:127.0.0.1:1235, config:3713649891269931577, size:1}
[protocol-127.0.0.1:1234-0] INFO com.vrg.rapid.MembershipService - Proposing membership change of size 1
[protocol-127.0.0.1:1234-0] INFO com.test.StandaloneAgent - The condition detector has outputted a proposal: ClusterStatusChange{configurationId=3713649891269931577, membership=[hostname: "127.0.0.1"
port: 1234
], delta=[127.0.0.1:1235:UP:]}
[protocol-127.0.0.1:1234-0] INFO com.vrg.rapid.MembershipService - Decide view change called in current configuration 3713649891269931577 (1 nodes), for proposal [127.0.0.1:1235, ]
[protocol-127.0.0.1:1234-0] INFO com.test.StandaloneAgent - View change detected: ClusterStatusChange{configurationId=-4337195239393783641, membership=[hostname: "127.0.0.1"
port: 1235
, hostname: "127.0.0.1"
port: 1234
], delta=[127.0.0.1:1235:UP:]}
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 2
[main] INFO com.test.StandaloneAgent - Node 127.0.0.1:1234 -- cluster size 2

Step 3:

I have killed the process in the Terminal2 and moved to the Terminal1. I'm getting following exception in the Terminal1

[bg-127.0.0.1:1234-0] ERROR com.vrg.rapid.messaging.impl.Retries - Retrying call to hostname: "127.0.0.1"
port: 1235
because of exception {}
io.grpc.StatusRuntimeException: UNAVAILABLE
at io.grpc.Status.asRuntimeException(Status.java:526)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:433)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:41)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:339)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:443)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:525)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:446)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:557)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:107)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:1235
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.net.ConnectException: Connection refused
... 11 more
[bg-127.0.0.1:1234-0] ERROR com.vrg.rapid.messaging.impl.Retries - Retrying call to hostname: "127.0.0.1"
port: 1235
because of exception {}

What I expected on Terminal1 was cluster size 1. However, I'm getting the above exception.

Step 4:

On Terminal2, I tried to restart the process that I killed in step 3, and I'm getting the following exception

[main] ERROR com.vrg.rapid.Cluster - Join message to seed 127.0.0.1:1234 returned an exception: {}
com.vrg.rapid.Cluster$JoinPhaseTwoException
at com.vrg.rapid.Cluster$Builder.joinAttempt(Cluster.java:400)
at com.vrg.rapid.Cluster$Builder.join(Cluster.java:315)
at com.vrg.rapid.Cluster$Builder.join(Cluster.java:294)
at com.test.StandaloneAgent.startCluster(StandaloneAgent.java:44)
at com.test.StandaloneAgent.main(StandaloneAgent.java:100)
[main] INFO com.vrg.rapid.Cluster - 127.0.0.1:1235 is sending a join-p2 to 127.0.0.1:1234 for config -1

What I expect here is that, after restarting the process again, it will rejoin the cluster and cluster size should be 2.

Is the behaviour in step 3 and step 4 expected? Am I using the membership library correctly?
Can you please help me here?

Thanks
Ravi

Integration tests

Add support for multi-JVM cluster tests and a means to detect performance regressions. Related to #3

Metadata not set for members that join whilst already in the membership list

It is possible for members to issue two phase 2 join requests if the response to the first one timed out (but was successfully processed nonetheless). In that case the seed will reply with a configuration id of -1, prompting the client to fetch the list of members. When that happens, the metadata isn't sent along with the list.

See this commit which appears to fix the issue. Please disregard the changes to the tests as those do not appear to be working (I tried to simulate the scenario but the way I went about it doesn't appear to work)

Docker integration

Hi all,

I was trying to launch some Rapid nodes using Docker, but every node is throwing an exception saying that it cannot join.
Basically, I've compiled using maven and Java 8 and copied the JAR files to a Docker image. From what I could inspect, the JARs have all the dependencies needed.

I've attached a zip file with my Docker file, the container's script and a log of a previous attempt.
files.zip
Do you have any idea of what is causing this issue?

Thank you
