
library's Introduction

Byzantine Fault-Tolerant (BFT) State Machine Replication (SMaRt) v2.0

This is a Byzantine fault-tolerant state machine replication project named BFT-SMaRt, a Java open source library maintained by the LASIGE Computer Science and Engineering Research Centre at the University of Lisbon.

This package contains the source code (src/), dependencies (lib/), documentation (docs/), running scripts (runscripts/), and configuration files (config/) for version 2.0 of the project.

Quick start

To run any demonstration you first need to configure BFT-SMaRt to define the protocol behavior and the location of each replica.

The servers must be specified in the configuration file (see config/hosts.config):

#server id, address and port (the ids from 0 to n-1 are the service replicas) 
0 127.0.0.1 11000 11001
1 127.0.0.1 11010 11011
2 127.0.0.1 11020 11021
3 127.0.0.1 11030 11031
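
As an illustration of the file layout above, the following is a minimal sketch of a parser for hosts.config-style lines. It is not part of BFT-SMaRt: the class HostsConfigSketch and its field names are hypothetical, and the reading of the third and fourth columns as the client-to-replica and replica-to-replica ports is an assumption based on the argument order of the reconfiguration command later in this document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a BFT-SMaRt class): parses hosts.config-style text. */
final class HostsConfigSketch {

    /** One replica entry as it appears on a single line of the file. */
    static final class Host {
        final int id;
        final String address;
        final int clientPort;   // assumed meaning: client-to-replica port
        final int replicaPort;  // assumed meaning: replica-to-replica port

        Host(int id, String address, int clientPort, int replicaPort) {
            this.id = id;
            this.address = address;
            this.clientPort = clientPort;
            this.replicaPort = replicaPort;
        }
    }

    /** Skips blank lines and '#' comments; expects exactly four fields otherwise. */
    static List<Host> parse(String text) {
        List<Host> hosts = new ArrayList<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (fields.length != 4) {
                throw new IllegalArgumentException("Expected 4 fields: " + line);
            }
            hosts.add(new Host(
                    Integer.parseInt(fields[0]),
                    fields[1],
                    Integer.parseInt(fields[2]),
                    Integer.parseInt(fields[3])));
        }
        return hosts;
    }
}
```

Calling HostsConfigSketch.parse on the four lines above would yield four entries with ids 0 to 3, which is a quick way to sanity-check a hand-edited file before starting the replicas.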

Important tip #1: Always provide IP addresses instead of hostnames. If a machine running a replica is not correctly configured, BFT-SMaRt may fail to bind to the appropriate IP address and use the loopback address instead (127.0.0.1). This phenomenon may prevent clients and/or replicas from successfully establishing a connection among them.

Important tip #2: Client requests should not be issued before all replicas have been properly initialized. Replicas are ready to process client requests when each one outputs Ready to process operations in the console.

The system configuration also has to be specified (see config/system.config). Most of the parameters are self-explanatory.

Important tip #3: When using the library in real systems, always make sure to set system.communication.defaultkeys to false and system.communication.useSignatures to 1. Also make sure that the config/keys directory of each process contains only the private key of that replica/client, not the private keys of other processes.

Compiling

Type ./gradlew installDist in the main directory. The required jar files and default configuration files will be available in the build/install/library directory.

WARNING: You might need to give execution permission to the gradlew script.

Copy the contents of build/install/library into multiple folders (for local testing) or onto multiple machines (for distributed testing).

Running the counter demonstration

You can run the counter demonstration by executing the following commands, from within the folders containing compiled code across four different consoles (4 replicas, to tolerate 1 fault):

./smartrun.sh bftsmart.demo.counter.CounterServer 0
./smartrun.sh bftsmart.demo.counter.CounterServer 1
./smartrun.sh bftsmart.demo.counter.CounterServer 2
./smartrun.sh bftsmart.demo.counter.CounterServer 3

Important tip #4: If you are getting timeout messages, the application you are running may be taking too long to process requests, or the network delay may be so high that PROPOSE messages from the leader do not arrive in time, causing replicas to start the leader change protocol. To prevent that, try increasing the system.totalordermulticast.timeout parameter in config/system.config.

Important tip #5: Never forget to delete the config/currentView file after you modify config/hosts.config or config/system.config. If config/currentView exists, BFT-SMaRt always fetches the group configuration from this file first. Otherwise, BFT-SMaRt fetches information from the other files and creates config/currentView from scratch. Note that config/currentView only stores information related to the group of replicas. You do not need to delete this file if, for instance, you want to change the value of the request timeout.

Once all replicas are ready, the client can be launched as follows:

./smartrun.sh bftsmart.demo.counter.CounterClient 1001 <increment> [<number of operations>]

If <increment> equals 0 the request will be read-only. Default <number of operations> equals 1000.

Important tip #6: Always make sure that each client uses a unique ID. Otherwise, clients may not be able to complete their operations.

Read-only optimization

BFT-SMaRt implements a read-only optimization that allows replicas to process read-only requests without executing the consensus protocol. Recent work (see section Additional information and publications) has shown that this optimization could violate the liveness property of the system.

The current BFT-SMaRt version implements the proposed solution, which guarantees that the system does not violate the liveness property when using the read-only optimization. However, due to the high memory consumption of the current implementation, this optimization is turned off by default. It can be enabled by setting the system.optimizations.readonly_requests parameter to true in the config/system.config file.

State transfer protocol(s)

BFT-SMaRt offers two state transfer protocols. The first is a basic protocol that can be used by extending the classes bftsmart.tom.server.defaultservices.DefaultRecoverable and bftsmart.tom.server.defaultservices.DefaultSingleRecoverable. These classes log requests in memory and periodically take snapshots of the application state.

The second, more advanced protocol can be used by extending the class bftsmart.tom.server.defaultservices.durability.DurabilityCoordinator. This protocol stores its logs to disk. To mitigate the latency of writing to disk, such tasks are done in batches and in parallel with the requests' execution. Additionally, the snapshots are taken at different points of the execution in different replicas.

Important tip #7: We recommend that developers use bftsmart.tom.server.defaultservices.DefaultRecoverable, since it is the most stable of the three classes.

Important tip #8: Regardless of the chosen protocol, developers must avoid using Java API objects like HashSet or HashMap, and use TreeSet or TreeMap instead. This is because serialization of Hash* objects is not deterministic, i.e., it can generate different byte arrays for equal objects. This will lead to problems after more than f replicas have used the state transfer protocol to recover from failures.
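
To illustrate tip #8, here is a minimal sketch of a snapshot routine that encodes a TreeMap by walking its entries in key order. The class DeterministicSnapshotSketch and the String-to-Integer state are hypothetical and only stand in for real application state; the manual DataOutputStream encoding is one possible choice, not the method the library prescribes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical example (not a BFT-SMaRt class): order-independent snapshots. */
final class DeterministicSnapshotSketch {

    /**
     * Encodes the state as: entry count, then each (key, value) pair.
     * TreeMap iterates in sorted key order, so two maps with equal contents
     * produce the same bytes regardless of insertion order.
     */
    static byte[] snapshot(TreeMap<String, Integer> state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(state.size());
            for (Map.Entry<String, Integer> entry : state.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Two replicas that insert the same keys in different orders would obtain byte-identical snapshots from this routine, which is the property state transfer relies on when comparing state received from different replicas.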

Group reconfiguration

The library also implements a reconfiguration protocol that can be used to add/remove replicas from the initial group.

You can add a replica to the group on-the-fly by executing the following command:

./smartrun.sh bftsmart.reconfiguration.util.DefaultVMServices <smart id> <ip address> <port client-to-replica> <port replica-to-replica>

You can remove a replica from the group on-the-fly by executing the following command:

./smartrun.sh bftsmart.reconfiguration.util.DefaultVMServices <smart id>

Important tip #9: Every time you use the reconfiguration protocol, you must make sure that all replicas and the host where you invoke the above commands have the latest config/currentView file. The current implementation of BFT-SMaRt does not provide any mechanism to distribute this file, so you will need to distribute it on your own (e.g., using the scp command). You also need to make sure that any client that starts executing can read from the latest config/currentView file.

BFT-SMaRt under crash faults

You can run BFT-SMaRt in crash-faults only mode by setting the system.bft parameter in the configuration file to false. This mode requires fewer replicas to execute, but will not withstand full Byzantine behavior from compromised replicas.
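
The replica counts behind this trade-off can be sketched as follows, using the standard bounds of n >= 3f + 1 replicas for Byzantine faults and n >= 2f + 1 for crash faults. The class ReplicaCountSketch is hypothetical and does not read or write the system.bft parameter; it only shows the arithmetic.

```java
/** Hypothetical helper (not a BFT-SMaRt class): minimum group sizes. */
final class ReplicaCountSketch {

    /**
     * Returns the smallest group size that tolerates f faulty replicas.
     * bft = true uses the Byzantine bound 3f + 1,
     * bft = false uses the crash-only bound 2f + 1.
     */
    static int minReplicas(int f, boolean bft) {
        if (f < 0) {
            throw new IllegalArgumentException("f must be non-negative");
        }
        return bft ? 3 * f + 1 : 2 * f + 1;
    }
}
```

For f = 1 this gives 4 replicas in Byzantine mode, matching the counter demonstration above, and 3 replicas in crash-only mode.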

Generating public/private key pairs

If you need to generate public/private keys for more replicas or clients, you can use the following commands.

To generate RSA key pairs, execute the following command:

./smartrun.sh bftsmart.tom.util.RSAKeyPairGenerator <id> <key length> [config dir]

To generate ECDSA key pairs, execute the following command:

./smartrun.sh bftsmart.tom.util.ECDSAKeyPairGenerator <id> <domain parameter> [config dir]

The default config dirs are config/keysRSA and config/keysECDSA, respectively. The commands above create key pairs for both clients and replicas. Alternatively, you can set system.communication.defaultkeys to true in the config/system.config file to force all processes to use the same public/private key pair and secret key. This is useful when deploying experiments and benchmarks, because it lets the programmer avoid generating keys for all principals involved in the system. However, this must not be used in a real deployment.

Additional information and publications

If you are interested in learning more about BFT-SMaRt, you can read:

  • The paper about its state machine protocol published in EDCC'12;
  • The paper about its advanced state transfer protocol published in Usenix'13;
  • The tool description published in DSN'14;
  • The paper about read-only optimization published in SRDS'21.

Feel free to contact us if you have any questions!

library's People

Contributors

andreneonet, ben-rubin, bergerch, bessani, broschch, fmrsabino, manishkk, maximilian-seitz, mynacol, njs97, rvassantlal, sandroplopes, tiagorncarvalho, ylht


library's Issues

Reproducing performance results

Hi, I'm trying to reproduce the results described in the paper using the ThroughputLatency microbenchmark, but I can't get any more than ~20 ops/s, which is nowhere near the ~70 kops/s mentioned in it. The paper describes a setup where each replica is running on its own Xeon machine, while mine are all running on the same VM, but I have a hard time believing that could be the problem when the CPU is mostly idling. Can you think of what could be causing this discrepancy?

Bug Report

hi,

my name is Ricardo Mendes and I am using SMaRt to develop a project.

by request of professor Allysson here are some bugs I found:

- in class navigators.smart.tom.ServiceProxy, at line ~191, we don't want to 
receive only n-f messages because the faulty server that we want to tolerate 
can be one of the first n-f servers.

- in class navigators.smart.tom.core.DeliveryThread, at line ~211, we can't 
simply replace the request[0] by cons.firstMessage because we can't lose the 
timestamp of the request.

Regards
RicardoMendes

Original issue reported on code.google.com by ricardo.mends on 27 Oct 2011 at 1:22

Netty issues when connecting replicas

I followed the Getting Started with BFT-SMaRt wiki page and implemented my own version of DefaultRecoverable. Everything compiles without an issue - but running server replicas results in seeing a rather long stack trace. As far as I know I'm following instructions correctly, but I can't be certain.

I tried using the latest version by cloning the repository and using the .jar files from master, but still no luck. Here's what happens:

Replica 0 boots and operates correctly.
Replica 1 boots and reports connection with 0 was successful.
When I start Replica 2, only #Using view stored on disk is seen on the output, and Replica 1 outputs:

Session Created, active clients=0
io.netty.handler.codec.DecoderException: java.lang.NegativeArraySizeException
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:299)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:168)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NegativeArraySizeException
    at bftsmart.communication.client.netty.NettyTOMMessageDecoder.decode(NettyTOMMessageDecoder.java:121)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:268)
    ... 12 more

Broken authentication? Insecure authkey for MAC creation

In https://github.com/bft-smart/library/blob/master/src/bftsmart/communication/client/netty/NettyTOMMessageDecoder.java#L181 and https://github.com/bft-smart/library/blob/master/src/bftsmart/communication/client/netty/NettyClientServerCommunicationSystemClientSide.java#L199

The authkey is created from a string which is clientID:replicaID
However, since these IDs (and thus the authkeys) are known to any replica, a single malicious replica could spoof MACs and imitate any other replica. By sending n response TOMMessages, each of them with a spoofed senderID and the respective correct MAC, the malicious replica can imitate all other replicas and thus "circumvent" the bft quorum and force the client to accept its result.

I think it is necessary to use a key-agreement protocol for client-replica connections such as Diffie-Hellman to establish a secure shared secret, that no other system participant can know. Or perhaps did I overlook something here?
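
The concern can be made concrete with a minimal sketch. It assumes, purely for illustration, that the MAC key is the UTF-8 bytes of the string clientID:replicaID and that the MAC is HmacSHA256; the library's actual key derivation and MAC algorithm may differ, and the class PredictableAuthKeySketch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical illustration (not BFT-SMaRt code) of a key built from public IDs. */
final class PredictableAuthKeySketch {

    /**
     * Computes a MAC whose key depends only on the two IDs.
     * Every input is known to all participants, so any replica can call this
     * with another replica's ID and obtain a MAC the client will accept.
     */
    static byte[] mac(int clientId, int replicaId, byte[] message) {
        try {
            // Assumed derivation: key material is just "clientID:replicaID".
            byte[] key = (clientId + ":" + replicaId).getBytes(StandardCharsets.UTF_8);
            Mac hmac = Mac.getInstance("HmacSHA256");
            hmac.init(new SecretKeySpec(key, "HmacSHA256"));
            return hmac.doFinal(message);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC unavailable", e);
        }
    }
}
```

Under these assumptions, a malicious replica 3 that calls mac(1001, 2, reply) produces exactly the bytes replica 2 would produce, which is why a secret established per connection (for example via a key-agreement protocol) is needed instead.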

[BUG] Messages to multiple targets

While building a Forwarder with BFT-SMaRt, I encountered a severe bug that occurs when sending a single message to multiple targets:

https://github.com/bft-smart/library/blob/6892ab38135a222a2cf5d54fbbd5c586ec0aef6f/src/bftsmart/communication/client/netty/NettyClientServerCommunicationSystemServerSide.java#L274:L283

The bug is here:

for (int i = 0; i < targets.length; i++) {
    rl.readLock().lock();
    //sendLock.lock();
    try {
        NettyClientServerSession ncss = (NettyClientServerSession) sessionTable.get(targets[i]);
        if (ncss != null) {
            Channel session = ncss.getChannel();
            sm.destination = targets[i];
            //send message
            session.writeAndFlush(sm);

Problem:
writeAndFlush() works asynchronously, so you cannot assume that a message has already been sent when the for loop moves to the next iteration. The next iteration, however, modifies the destination field of the same object (sm). Because every call receives a reference to that one object, both writes can observe the same destination value (targets[1] instead of targets[0]), so both messages are encoded identically, with the same MAC, even though they go to different clients. One client then rejects its message because it believes the MAC is corrupt.

Solution:
It is sufficient to make a shallow copy and pass the copied object to writeAndFlush()
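
The problem and the proposed fix can be reproduced without Netty in a minimal sketch. The classes Message and DeferredChannel below are hypothetical stand-ins: DeferredChannel queues writes and reads the destination only later, which makes the worst-case interleaving of an asynchronous write deterministic. With a real channel, the outcome depends on timing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration (not BFT-SMaRt code) of the shared-object bug. */
final class SharedMessageSketch {

    static final class Message {
        int destination;
        final byte[] payload;

        Message(byte[] payload) {
            this.payload = payload;
        }

        /** Copies the fields but shares the payload array. */
        Message shallowCopy() {
            Message copy = new Message(payload);
            copy.destination = destination;
            return copy;
        }
    }

    /** Stand-in for an asynchronous channel: encoding happens after the loop. */
    static final class DeferredChannel {
        private final List<Message> pending = new ArrayList<>();

        void writeAndFlush(Message m) {
            pending.add(m); // only the reference is stored
        }

        List<Integer> encodeAll() {
            List<Integer> encoded = new ArrayList<>();
            for (Message m : pending) {
                encoded.add(m.destination); // destination is read at encode time
            }
            pending.clear();
            return encoded;
        }
    }

    /** Buggy variant: one object is mutated and queued for every target. */
    static List<Integer> sendShared(int[] targets) {
        DeferredChannel channel = new DeferredChannel();
        Message sm = new Message(new byte[] {42});
        for (int target : targets) {
            sm.destination = target;
            channel.writeAndFlush(sm);
        }
        return channel.encodeAll();
    }

    /** Fixed variant: each target gets its own shallow copy. */
    static List<Integer> sendCopies(int[] targets) {
        DeferredChannel channel = new DeferredChannel();
        Message sm = new Message(new byte[] {42});
        for (int target : targets) {
            sm.destination = target;
            channel.writeAndFlush(sm.shallowCopy());
        }
        return channel.encodeAll();
    }
}
```

For targets {0, 1}, sendShared encodes destination 1 twice, while sendCopies encodes 0 and then 1. A shallow copy is enough here because only the destination field changes between iterations.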

Why do we need to cache replies?

There is legacy code in the ClientManager object that caches the 5 last replies 
to the clients. Is this really necessary, given that nowadays we use a client 
session, and that correct clients never re-use a sequence number?


Original issue reported on code.google.com by [email protected] on 16 Jan 2015 at 8:43

Processing NOOP operations from the recover / MessageContexts in DefaultRecoverable

If a replica asks for a state, DefaultRecoverable and DefaultSingleRecoverable might try to make the application parse a NOOP operation that is used only within the logs of the state. Currently the demos have a workaround for this issue, but it is necessary to add code that explicitly signals that this operation is not meant to be delivered.

I believe this can be done if we also log the MessageContext objects, as is already done in DurabilityCoordinator. If we add this logging to DefaultRecoverable plus a new getter to indicate a NOOP, we fix 2 issues simultaneously.

Original issue reported on code.google.com by [email protected] on 16 Jan 2015 at 8:48

ServiceProxy.invoke hangs if only one node is running

Problem

When only one node is running and no other nodes are connected, the second call to ServiceProxy.invoke hangs.

Reproduction Step

  1. Start only one node.
  2. Call ServiceProxy.invoke twice with different data.
  3. The second call hangs.

Root cause

When an exception is raised, the canSend lock is not unlocked.

exception : java.lang.RuntimeException: Server not connected, message : Server not connected, occurred at : java.lang.RuntimeException: Server not connected
	at bftsmart.communication.client.netty.NettyClientServerCommunicationSystemClientSide.send(NettyClientServerCommunicationSystemClientSide.java:383)
	at bftsmart.tom.TOMSender.TOMulticast(TOMSender.java:150)
	at bftsmart.tom.ServiceProxy.invoke(ServiceProxy.java:197)
	at bftsmart.tom.ServiceProxy.invokeOrdered(ServiceProxy.java:143)

Solution

We need to unlock the canSend lock in the ServiceProxy.invoke method in the finally clause.
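
The proposed fix follows the usual lock/try/finally pattern, sketched below. The class InvokeLockSketch is hypothetical, the serverConnected flag merely simulates the failing send, and modeling canSend as a ReentrantLock is an assumption about the real field.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration (not the real ServiceProxy) of unlocking in finally. */
final class InvokeLockSketch {

    // Assumption: canSend behaves like a java.util.concurrent lock.
    private final ReentrantLock canSend = new ReentrantLock();

    /**
     * Acquires the lock, attempts the send, and always releases the lock,
     * even when the send throws.
     */
    byte[] invoke(byte[] request, boolean serverConnected) {
        canSend.lock();
        try {
            if (!serverConnected) {
                throw new RuntimeException("Server not connected");
            }
            return request; // placeholder for the reply of a real invocation
        } finally {
            canSend.unlock(); // runs on both the normal and the exceptional path
        }
    }

    boolean isLocked() {
        return canSend.isLocked();
    }
}
```

After a failed invoke the lock is free again, so a later call can proceed instead of waiting on a lock that will never be released.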

DES has been obsolete since 2001 and is easily broken

ServiceReplica#leave throws NullPointerException

What steps will reproduce the problem?
1. instantiate 4 service replicas 
2. send ordered messages
3. call ServiceReplica#leave to teardown my test cases

What is the expected output? What do you see instead?

I would expect the replica to be removed from the test set. My idea is to call 
this for each replica so that I can start new replica servers with different 
business logic (for other test cases).

What version of the product are you using? On what operating system?

bft-smart 0.8, Ubuntu Desktop 13.04 as well as Ubuntu Server 13.04

Please provide any additional information below.

A short investigation showed that the ServiceProxy within the generated 
Reconfiguration object is never initialized (it was commented out in the 
constructor, and the new (?) connect method is not called).

java.lang.NullPointerException
    at bftsmart.reconfiguration.Reconfiguration.execute(Reconfiguration.java:59)
    at bftsmart.tom.ServiceReplica.leave(ServiceReplica.java:376)
    at at.ac.ait.archistar.storage.bft.FakeBftRemoteStorageServer.shutdown(FakeBftRemoteStorageServer.java:199)
    at at.ac.ait.archistar.bft.BftS3Test.destroyReplicas(BftS3Test.java:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
    at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175)
    at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68)

Original issue reported on code.google.com by [email protected] on 3 Jul 2013 at 3:36

Timeout on multithreaded load

I am doing load tests on BFT-SMART and I am having timeouts when I load more than one thread.
1000 // 0 // TIMEOUT // Replies received: 0
1000 // 0 // TIMEOUT // Replies received: 0
1000 // 0 // TIMEOUT // Replies received: 0

ProcessorClient_NOP_LOAD - Cópia.txt

Everything works fine with one thread, but with more than one thread I get long waits when calling the appExecuteOrdered method.

Denial of Service and Timeout Manipulation possible?

Attacks that impact message delay, such as denial-of-service attacks, can be used to increase the timeout value used to detect a faulty leader; e.g., in such an attack scenario the timeout doubles each time the view changes. The adversary can exploit this behavior by stopping its attack when a malicious replica is the leader, so the malicious leader can slow down the system's throughput dramatically. [1]

My question is:
Is the request timer in BFT-SMaRt increased (e.g. doubled) when the leader changes and if so, is it being reset later on? (So basically would a timeout manipulation attack be possible?)

[1] compare Amir, Yair, et al. "Prime: Byzantine replication under attack." IEEE Transactions on Dependable and Secure Computing 8.4 (2011): 564-577.

DefaultRecoverable is the only stable executor/recoverer

DefaultSingleRecoverable apparently is unable to process requests and 
DurabilityCoordinator's state transfer is unstable.

The only executor/recoverer that offers satisfactory stability is 
DefaultRecoverable. Until we properly test/fix the others, we have to include a 
note in the readme file warning users about this issue.

Original issue reported on code.google.com by [email protected] on 26 Jan 2015 at 6:20

Bft smart primary stops working

I have some problems with the Bft-smart primary at the moment.
Everything goes well for about 2-3 minutes, then I receive several of these:

(MessageHandler.processData) LC_MSG received: type LOCAL, regency -1, (replica -1)
(MessageHandler.processData) LC_MSG received: type LOCAL, regency -1, (replica -1)
(MessageHandler.processData) LC_MSG received: type LOCAL, regency -1, (replica -1)

And it seems like the communication of other bft-smart members to this one stops, this member doesn't receive their requests anymore.

It is quite normal to receive one of them once in a while, but the moment it really goes down is when I receive like 5-6 of them at a time.

BftSmart, extract additional information

Hi,

I'm running bft smart in a bigger environment and I would like to send to other servers a proof of the order process.
Since the bftSmart nodes interact to define an order, each server probably has to sign its decision and send it to the others.
Is there a way to obtain these signatures at one of the replicas?

public byte[][] appExecuteBatch(final byte[][] bytes, final MessageContext[] messageContexts, final boolean noop)

It would be great for example if we could find these signatures in the messageContext.

Livelock after partitioning 1 node?

I’m trying to understand how BFT-SMaRt behaves under some adversarial schedulers.

I hacked the code so that the servers listen on an alternate port, so I can interpose a software scheduler, written in python. The script I’m using to run the experiment is here:
https://github.com/amiller/library/blob/amiller-bug-oct15/run-bug.sh

But I’m encountering some unexpected behavior - the system stops making progress even after partitioning only 1 node. I’m wondering if you could help interpret the log files to understand what’s happening, or suggest how I could go about diagnosing it?

More details:

Essentially I run the following experiments:

TTP question/problem

If we try to use the TTP without having already the most up-to-date currentView 
file, the TTP's operation is discarded by the replicas and the TTP gets no 
replies.

Should the TTP be used always making the assumption that it always has the most 
up-to-date view before issuing operations, or should it be able to update its 
view like regular clients already do?

Original issue reported on code.google.com by [email protected] on 26 Jan 2015 at 6:24

getReplyQuorum() computed wrong in ServiceProxy?

In https://github.com/bft-smart/library/blob/ff5f027ab7f110abc6f56900dc74f243cdf3107b/src/bftsmart/tom/ServiceProxy.java#L399:L406

The BFT quorum is computed by q = ceil((n+f)/2)+1 which should be wrong, because in a n=4 f=1 configuration this equals ceil((4+1)/2)+1 =ceil(2.5)+1 = 4
and not a single faulty replica can be tolerated!

This formula should use floor() instead of ceil() and should be used for input validation and not for output validation?

The CFT quorum is computed by q=ceil(n/2)+1 which should be wrong too, since in a n=3 f=1 config this equals ceil(3/2)+1 = 3 and not a single faulty replica can be tolerated!

Also: Shouldn't the correct quorum for output validation in BFT mode be f+1? (thus 2 in a n=4 f=1 config) Because we can assume that at most f replicas are faulty?

In a CFT model, the quorum for output validation can be even constant 1? As replicas are only assumed to crash, if a response arrives at the client it must be correct?

I am seriously confused right now, so please explain if I am wrong because this error would lead to the BFT-SMaRt system's implementation not being fault tolerant at all.
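
One detail that may be relevant to the arithmetic above is how Java evaluates such a formula. The sketch below compares the expression with real-valued division, as computed by hand in this issue, against the same expression with int operands, where the division truncates before Math.ceil is applied. Whether the linked ServiceProxy code uses int operands is an assumption that should be checked against the source; the class ReplyQuorumSketch is hypothetical.

```java
/** Hypothetical illustration (not the real ServiceProxy) of the quorum arithmetic. */
final class ReplyQuorumSketch {

    /** BFT formula with real-valued division, as evaluated by hand in the issue. */
    static int bftQuorumRealDivision(int n, int f) {
        return (int) Math.ceil((n + f) / 2.0) + 1;
    }

    /** Same formula with int operands: (n + f) / 2 truncates before Math.ceil. */
    static int bftQuorumIntDivision(int n, int f) {
        return (int) Math.ceil((n + f) / 2) + 1;
    }

    /** CFT formula with real-valued division. */
    static int cftQuorumRealDivision(int n) {
        return (int) Math.ceil(n / 2.0) + 1;
    }

    /** CFT formula with int operands: n / 2 truncates before Math.ceil. */
    static int cftQuorumIntDivision(int n) {
        return (int) Math.ceil(n / 2) + 1;
    }
}
```

For n = 4 and f = 1, the real-valued variant yields 4 while the int variant yields 3. For n = 3 in crash-only mode, the values are 3 and 2 respectively. If the code does use int operands, Math.ceil has no effect and the expression behaves like the floor-based formula suggested above.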

Bft smart dropping currentView in config directory no matter what

Bft smart should actually take the config home defined on startup of each replica and post the currentView there and not just in /config.

" path = System.getProperty("user.dir") + sep + "config";"

There should be 2 constructors:

public DefaultViewStorage(final String alternativePath) {
    String sep = System.getProperty("file.separator");
    path = System.getProperty("user.dir") + sep + alternativePath;
    File f = new File(path);
    if (!f.exists()) {
        f.mkdirs();
    }
    path = path + sep + "currentView";
}

Two reconfigurations on the same machine interfere

I'm running two instances of bft-smart on the same machine.
I noticed that when I try to reconfigure both at the same moment they will interfere.

final ViewManager viewManager = new ViewManager(configLocation);
viewManager.removeServer(idToCheck);
viewManager.executeUpdates();
Thread.sleep(2000L);
viewManager.close();

It stops at executeUpdates() on one server, and on the other it makes both changes.

DefaultRecoverable doesn't seem to deal with nodes going out-to-lunch and then returning

I've written a simple test application with BFT-Smart, as a warmup for doing something real. This application consists of:

(a) a simple client that sends "commands" consisting of some fixed amount of random bytes (1kbytes in this case). Perhaps one command, perhaps 100k commands serially, depending on the test

(b) The server-side application starts off with a SHA1 hash that's all zeroes, and for each command, appends the hash and the command, takes the hash of it, and sets that to be the new hash.

--> So: a primitive sort of "state" that can be affected by a large "command".
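
The server-side state described in (b) can be sketched as a hash chain. The class HashChainStateSketch is hypothetical and omits everything related to BFT-SMaRt (no DefaultRecoverable, no snapshots); it only shows the state update new hash = SHA1(old hash || command).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration of the test application's state (not the poster's code). */
final class HashChainStateSketch {

    // A SHA-1 digest is 20 bytes; the state starts as all zeroes.
    private byte[] hash = new byte[20];

    /** Appends the command to the current hash, hashes the result, stores it. */
    byte[] apply(byte[] command) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(hash);
            sha1.update(command);
            hash = sha1.digest();
            return hash.clone();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 unavailable", e);
        }
    }

    /** Returns a copy of the current state. */
    byte[] current() {
        return hash.clone();
    }
}
```

Because each new hash depends on every earlier command, two replicas hold the same state only if they applied the same commands in the same order, which makes this a compact divergence check.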

I'm running with 4 replicas, and the server extends DefaultRecoverable.

If I start 4 replicas, fire up a client doing 100k writes, and then after some thousands of writes, "control-Z" (stop, not kill) one replica, the other replicas keep going. MUCH, much later, if I "fg" (resume) the stopped replica, it attempts to catch up, and many, many bad things start happening:

(i) initially, {in,out}QueueSize was 500k msgs, and I got OutOfMemory errors. I reduced those to 5000 messages.

(ii) Then the stopped/resumed replica gets errors b/c it's run out of buffers, so it discards messages, and then asks for state-transfer. This state-transfer never successfully completes.

(iii) Eventually, even in this much-more-constrained scenario, two replicas end up suffering OOM errors (NOT the stopped/started replica) and I killed the test off.

I'm not sure how to go about debugging this, and/or how to go about providing enough detail that you can reproduce it. I'm happy to share code and steps to repro, if you'd like.

Also, are the sort of faults I induced supposed to be tolerated by a 4-replica DefaultRecoverable setup?

Thanks,
--chet--

Where is BFTMap?

It seems like some tests (test/bftsmart/demo/bftmapjunit/ConsoleTest.java for example) are using BFTMap, but I can't find the code for this - only the class module is present: /bin/bftsmart/demo/bftmap/BFTMap.class. Is that by design?

Problem when clients constantly change connections

There is a race condition in the "send" method from 
NettyClientServerCommunicationSystemServerSide.

If for instance, a client is always re-starting its connection, the 
aforementioned object might not have yet the "sessionTable" attribute ready 
with the connection to the client, and thus will discard the reply and not send 
anything to the client. This can happen if the arrival of the batch, the 
execution of consensus and delivery to application ends faster than setting up 
a connection between a client and Netty. A possible solution is to force 
clients to send an initial readonly to the replicas just to force a quorum of 
replicas to establish a connection to it.

However, this issue does not manifest very often if the client is always 
connected. This must be tested in cases where the client processes end and then 
restart.

Original issue reported on code.google.com by [email protected] on 16 Jan 2015 at 8:41

Contact information

Hey,

I've been trying to contact you for some questions but wasn't able to find any contact information anywhere. Would you be so kind to post it. Thank you :)

PS: Sorry for inappropriate issue.

Byzantine simulation on appExecuteUnordered

I'm simulating a byzantine behaviour in one of my replicas in a get request in appExecuteUnordered.

It seems that if one replica returns a different value, every replica tries to execute appExecuteBatch with that same get request.

This is my implementation of appExecuteBatch:

@Override
public byte[][] appExecuteBatch(byte[][] bytes, MessageContext[] messageContexts) {
    byte[][] replies = new byte[bytes.length][];
    for (int i = 0; i < bytes.length; i++) {
        // Execute each request with its matching context, preserving batch order.
        replies[i] = executeSingle(bytes[i], messageContexts[i]);
    }
    return replies;
}

For some reason, if one replica returns a different value, the ServiceProxy object gets a null value. Is that supposed to happen?
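For readers trying to reason about the null result, the sketch below shows how a generic client-side reply voter can end up with no answer when replicas disagree. It is an illustration only and not BFT-SMaRt's implementation: ReplyVoter, vote, and the `required` threshold are hypothetical, and the threshold the library actually applies to unordered requests is not stated in this issue.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reply voter (not the library's code): accepts a value only if enough replicas agree. */
class ReplyVoter {
    /**
     * Returns the first reply whose content was sent by at least `required` replicas,
     * or null if no value reaches that threshold.
     */
    static byte[] vote(List<byte[]> replies, int required) {
        // ByteBuffer compares by content, so equal byte arrays share one counter.
        Map<ByteBuffer, Integer> counts = new HashMap<>();
        for (byte[] reply : replies) {
            if (reply == null) {
                continue; // a replica that sent nothing casts no vote
            }
            int seen = counts.merge(ByteBuffer.wrap(reply), 1, Integer::sum);
            if (seen >= required) {
                return reply;
            }
        }
        return null;
    }
}
```

With four replicas and a threshold of three, three matching replies plus one divergent reply still yield a result, while a 2/2 split yields null. If a voter demanded agreement from every replica, a single divergent reply would be enough to produce null.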

Implement the ClientContext and ReplicaContext on clients and servers

Modify the API to access these objects.

From these objects one should be able to:
- send and receive messages using the communication system
- get the private key of the local process, the public keys from other 
processes and the secret key shared between this process and the other

Original issue reported on code.google.com by [email protected] on 22 Sep 2011 at 10:06

Faulty smartrun.bat

The smartrun.bat script uses the wrong name for the Netty library, so you get a ClassNotFoundException if you try to run it.

There is also a minor fault in the README concerning the name of the .bat file. It contains the line "You can use the './runscripts/runsmart.bat' script in Windows, and the './runscripts/runsmart.sh' script in Linux.", but the .bat and .sh scripts are both named smartrun.

Bft smart getting stuck

Hey there,

I am working with BFT-Smart in the context of my master thesis and it started to show some strange behavior after I set it up on 4 different servers.

It gets stuck at:

this.replica = new ServiceReplica(id,configDirectory, this, this, null, new DefaultReplier());

At all the servers:

Config home: global/config
Config home in getViewStore: global/config
Trying with alternative part: /home/ubuntu/thesis/global/config/currentView
-- Creating current view from configuration file
-- ID = 0
-- N = 4
-- F = 1
-- Port = 11300
-- requestTimeout = 2000
-- maxBatch = 400
-- Using MACs
-- In current view: ID:0; F:1; Processes:0(/172.31.0.18:11300),1(/172.31.0.19:11310),2(/172.31.0.20:11320),3(/172.31.0.23:11330),
(17/05/03 16:09:05 - TOM Layer) Running.
(17/05/03 16:09:05 - TOM Layer) Next leader for CID=0: 0
(17/05/03 16:09:05 - TOM Layer) (TOMLayer.run) I'm the leader.
-- Diffie-Hellman complete with 1
-- Diffie-Hellman complete with 3
-- Diffie-Hellman complete with 2

All servers get as far as completing Diffie-Hellman, but then they stop and never progress to the lines that follow.

I'm quite close to finishing my thesis and help would be very welcome.

Thanks already,

Ray

I can't connect to the server


4 virtual machines (Ubuntu 10.04) as servers

1 client (Ubuntu 11.04)
I can't connect to the servers.
Please help me.


Connecting to replica 0 at /192.168.226.101:11000
Impossible to connect to 0
Connecting to replica 1 at /192.168.226.102:11010
Impossible to connect to 1
Connecting to replica 2 at /192.168.226.103:11020
Impossible to connect to 2
Connecting to replica 3 at /192.168.226.104:11030
Impossible to connect to 3
Counter sending: 0
(12/03/30 10:40:51 - main) Channel to 3 is not connected
(12/03/30 10:40:51 - main) Channel to 2 is not connected
(12/03/30 10:40:51 - main) Channel to 1 is not connected
(12/03/30 10:40:51 - main) Channel to 0 is not connected
java.lang.RuntimeException: Impossible to connect to servers!
    at navigators.smart.communication.client.netty.NettyClientServerCommunicationSystemClientSide.send(NettyClientServerCommunicationSystemClientSide.java:373)
    at navigators.smart.tom.TOMSender.TOMulticast(TOMSender.java:168)
    at navigators.smart.tom.ServiceProxy.invoke(ServiceProxy.java:171)
    at navigators.smart.tom.ServiceProxy.invoke(ServiceProxy.java:148)
    at navigators.smart.tom.demo.counter.CounterClient.main(CounterClient.java:80)

Original issue reported on code.google.com by [email protected] on 30 Mar 2012 at 2:57
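When every replica reports "Impossible to connect", it can help to first rule out plain network problems (firewall, wrong address, replica not listening) before looking at the library. The following is a minimal sketch using only the standard java.net API; ReplicaPortCheck and isReachable are hypothetical names, and the addresses and ports are the ones shown in the log above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical diagnostic (not part of BFT-SMaRt): checks whether a TCP port accepts connections. */
class ReplicaPortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or unroutable
        }
    }

    public static void main(String[] args) {
        // Replica addresses and client-facing ports taken from the log in this report.
        String[] hosts = {"192.168.226.101", "192.168.226.102", "192.168.226.103", "192.168.226.104"};
        int[] ports = {11000, 11010, 11020, 11030};
        for (int i = 0; i < hosts.length; i++) {
            boolean ok = isReachable(hosts[i], ports[i], 2000);
            System.out.println("replica " + i + " at " + hosts[i] + ":" + ports[i]
                    + (ok ? " is reachable" : " is NOT reachable"));
        }
    }
}
```

If this check fails from the client machine while the replicas are running, the problem is likely in the network or host configuration rather than in the client code.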


BftSmart not ordering well when too many requests come in.

I noticed that the order of the requests in executeBatch can vary depending on the number of client requests.

The sequence ID and the consensus ID look correct, but the array of messages is out of order.

Meaning I can't just

    @Override
    public byte[][] appExecuteBatch(final byte[][] bytes, final MessageContext[] messageContexts, final boolean noop)
    {
        final byte[][] allResults = new byte[bytes.length][];
        for (int i = 0; i < bytes.length; ++i)
        {
            consume(bytes[i]);
        }
        return allResults;
    }

execute them like this, because they are not guaranteed to be in the right order.

Java 1.6 support

Hi guys, I am using bft-smart for my project and my customer is asking for Java 6 support.
Do you have any plans to support Java 6? (Oracle extended the support of Java 6 until the end of next year.)
If that is OK with you, I will create a pull request. The changes all amount to specifying explicit type arguments when instantiating a generic class.

     public BaseStateManager() {
-        senderStates = new HashMap<>();
-        senderViews = new HashMap<>();
-        senderRegencies = new HashMap<>();
-        senderLeaders = new HashMap<>();
-        senderProofs = new HashMap<>();
+        senderStates = new HashMap<Integer, ApplicationState>();
+        senderViews = new HashMap<Integer, View>();
+        senderRegencies = new HashMap<Integer, Integer>();
+        senderLeaders = new HashMap<Integer, Integer>();
+        senderProofs = new HashMap<Integer, CertifiedDecision>();
     }

BFT-SMART not compatible with latest versions of Netty

What steps will reproduce the problem?
1. Install a Netty version higher than 3.2.0 ALPHA4

What is the expected output? What do you see instead?
Clients are disconnected without obtaining a response.

What version of the product are you using? On what operating system?
SMaRt-v0.6.zip

Please provide any additional information below.
The next version, Netty 3.2.0 BETA1, is already incompatible. The changelog for this version is at:
https://issues.jboss.org/secure/ReleaseNote.jspa?projectId=12310721&version=12314480

Original issue reported on code.google.com by andrefcruz on 28 Oct 2011 at 3:46

TimeOut bft-smart multiple clients

I have a quite big number of clients, but after 1-2 minutes I get:

Client 41: 141 // 892 // TIMEOUT // Replies received: 3
Client 40: 140 // 892 // TIMEOUT // Replies received: 3
Client 42: 142 // 892 // TIMEOUT // Replies received: 3
Client 44: 144 // 896 // TIMEOUT // Replies received: 3
Client 38: 138 // 893 // TIMEOUT // Replies received: 3
Client 45: 145 // 895 // TIMEOUT // Replies received: 3

Now, I configured it to 4 replicas (bft) with f = 1.

I don't know what the issue is.

Changing view on the run

Is there any way to automatically launch a new replica after a detected crash without actually having to execute the group reconfiguration commands in README.txt?
