
BFT-SMaRt's project home page

Home Page: http://bft-smart.github.io/library/

License: Apache License 2.0


library's Introduction

Byzantine Fault-Tolerant (BFT) State Machine Replication (SMaRt) v2.0

This is a Byzantine fault-tolerant state machine replication project named BFT-SMaRt, a Java open source library maintained by the LASIGE Computer Science and Engineering Research Centre at the University of Lisbon.

This package contains the source code (src/), dependencies (lib/), documentation (docs/), running scripts (runscripts/), and configuration files (config/) for version 2.0 of the project.

Quick start

To run any demonstration you first need to configure BFT-SMaRt to define the protocol behavior and the location of each replica.

The servers must be specified in the configuration file (see config/hosts.config):

#server id, address and port (the ids from 0 to n-1 are the service replicas) 
0 127.0.0.1 11000 11001
1 127.0.0.1 11010 11011
2 127.0.0.1 11020 11021
3 127.0.0.1 11030 11031
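
As a rough illustration of the file layout above, the following sketch parses lines of this shape in plain Java. It is not BFT-SMaRt's own configuration loader: the names HostsConfigSketch and HostEntry are hypothetical, and reading the two trailing columns as the client-to-replica and replica-to-replica ports is an assumption based on the argument order of the reconfiguration command described later in this document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of BFT-SMaRt: parses text in the hosts.config layout.
class HostsConfigSketch {

    // One replica entry: id, address, and the two ports listed on each line.
    static final class HostEntry {
        final int id;
        final String address;
        final int clientPort;   // assumed: client-to-replica port
        final int replicaPort;  // assumed: replica-to-replica port

        HostEntry(int id, String address, int clientPort, int replicaPort) {
            this.id = id;
            this.address = address;
            this.clientPort = clientPort;
            this.replicaPort = replicaPort;
        }
    }

    // Skips blank lines and '#' comments; every other line must have four fields.
    static List<HostEntry> parse(String text) {
        List<HostEntry> entries = new ArrayList<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split("\\s+");
            if (fields.length < 4) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            entries.add(new HostEntry(
                    Integer.parseInt(fields[0]),
                    fields[1],
                    Integer.parseInt(fields[2]),
                    Integer.parseInt(fields[3])));
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "#server id, address and port",
                "0 127.0.0.1 11000 11001",
                "1 127.0.0.1 11010 11011");
        for (HostEntry e : parse(sample)) {
            System.out.println(e.id + " -> " + e.address + ":" + e.clientPort + "/" + e.replicaPort);
        }
    }
}
```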

Important tip #1: Always provide IP addresses instead of hostnames. If a machine running a replica is not correctly configured, BFT-SMaRt may fail to bind to the appropriate IP address and use the loopback address instead (127.0.0.1). This phenomenon may prevent clients and/or replicas from successfully establishing a connection among them.

Important tip #2: Client requests should not be issued before all replicas have been properly initialized. Replicas are ready to process client requests when each one outputs Ready to process operations in the console.

The system configuration also has to be specified (see config/system.config). Most of the parameters are self-explanatory.

Important tip #3: When using the library in real systems, always make sure to set system.communication.defaultkeys to false and system.communication.useSignatures to 1. Also make sure that only the config/keys directory has the private key for the respective replica/client.

Compiling

Type ./gradlew installDist in the main directory. The required jar files and default configuration files will be available in the build/install/library directory.

WARNING: You might need to give execution permission to the gradlew script.

Copy the contents of build/install/library into multiple folders for local testing, or onto multiple machines for distributed testing.

Running the counter demonstration

You can run the counter demonstration by executing the following commands, from within the folders containing compiled code across four different consoles (4 replicas, to tolerate 1 fault):

./smartrun.sh bftsmart.demo.counter.CounterServer 0
./smartrun.sh bftsmart.demo.counter.CounterServer 1
./smartrun.sh bftsmart.demo.counter.CounterServer 2
./smartrun.sh bftsmart.demo.counter.CounterServer 3

Important tip #4: If you are getting timeout messages, it is possible that the application you are running takes too long to process the requests, or that the network delay is too high and PROPOSE messages from the leader do not arrive in time, so replicas may start the leader change protocol. To prevent that, try increasing the system.totalordermulticast.timeout parameter in config/system.config.

Important tip #5: Never forget to delete the config/currentView file after you modify config/hosts.config or config/system.config. If config/currentView exists, BFT-SMaRt always fetches the group configuration from this file first. Otherwise, BFT-SMaRt fetches information from the other files and creates config/currentView from scratch. Note that config/currentView only stores information related to the group of replicas. You do not need to delete this file if, for instance, you want to change the value of the request timeout.

Once all replicas are ready, the client can be launched as follows:

./smartrun.sh bftsmart.demo.counter.CounterClient 1001 <increment> [<number of operations>]

If <increment> equals 0 the request will be read-only. Default <number of operations> equals 1000.

Important tip #6: Always make sure that each client uses a unique ID. Otherwise, clients may not be able to complete their operations.

Read-only optimization

BFT-SMaRt implements a read-only optimization that allows replicas to process read-only requests without executing the consensus protocol. Recent work (see section Additional information and publications) has shown that this optimization could violate the liveness property of the system.

The recent BFT-SMaRt version implements the proposed solution, which guarantees that the system will not violate the liveness property when using the read-only optimization. However, due to the high memory consumption of the current implementation, this optimization is turned off by default; it can be enabled by setting the system.optimizations.readonly_requests parameter to true in the config/system.config file.

State transfer protocol(s)

BFT-SMaRt offers two state transfer protocols. The first is a basic protocol that can be used by extending the classes bftsmart.tom.server.defaultservices.DefaultRecoverable and bftsmart.tom.server.defaultservices.DefaultSingleRecoverable. These classes log requests into memory and periodically take snapshots of the application state.

The second, more advanced protocol can be used by extending the class bftsmart.tom.server.defaultservices.durability.DurabilityCoordinator. This protocol stores its logs to disk. To mitigate the latency of writing to disk, such tasks are done in batches and in parallel with the requests' execution. Additionally, the snapshots are taken at different points of the execution in different replicas.

Important tip #7: We recommend that developers use bftsmart.tom.server.defaultservices.DefaultRecoverable, since it is the most stable of the three classes.

Important tip #8: Regardless of the chosen protocol, developers must avoid using Java API objects like HashSet or HashMap, and use TreeSet or TreeMap instead. This is because serialization of Hash* objects is not deterministic, i.e., it generates different byte arrays for equal objects. This will lead to problems after more than f replicas have used the state transfer protocol to recover from failures.
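
To make the determinism point concrete, here is a minimal sketch of a snapshot encoder that relies on TreeMap's sorted iteration order. The class DeterministicStateSketch and its hand-written byte layout are assumptions for illustration only; they are not the snapshot mechanism used by BFT-SMaRt. The sketch shows that two TreeMaps holding the same entries, filled in different insertion orders, encode to identical bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example class, not part of BFT-SMaRt.
class DeterministicStateSketch {

    // Encodes the map by iterating over it. With a TreeMap the iteration order
    // is the sorted key order, so equal maps always yield equal byte arrays.
    static byte[] encode(Map<String, Integer> state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(state.size());
            for (Map.Entry<String, Integer> entry : state.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Builds the same logical state on two "replicas" with different insertion orders.
    static boolean sameSnapshot() {
        Map<String, Integer> replicaA = new TreeMap<>();
        replicaA.put("x", 1);
        replicaA.put("y", 2);
        replicaA.put("z", 3);

        Map<String, Integer> replicaB = new TreeMap<>();
        replicaB.put("z", 3);
        replicaB.put("x", 1);
        replicaB.put("y", 2);

        return Arrays.equals(encode(replicaA), encode(replicaB));
    }

    public static void main(String[] args) {
        System.out.println("identical snapshots: " + sameSnapshot());
    }
}
```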

Group reconfiguration

The library also implements a reconfiguration protocol that can be used to add/remove replicas from the initial group.

You can add a replica to the group on-the-fly by executing the following command:

./smartrun.sh bftsmart.reconfiguration.util.DefaultVMServices <smart id> <ip address> <port client-to-replica> <port replica-to-replica>

You can remove a replica from the group on-the-fly by executing the following command:

./smartrun.sh bftsmart.reconfiguration.util.DefaultVMServices <smart id>

Important tip #9: Every time you use the reconfiguration protocol, you must make sure that all replicas and the host where you invoke the above commands have the latest config/currentView file. The current implementation of BFT-SMaRt does not provide any mechanism to distribute this file, so you will need to distribute it on your own (e.g., using the scp command). You also need to make sure that any client that starts executing can read from the latest config/currentView file.

BFT-SMaRt under crash faults

You can run BFT-SMaRt in crash-faults only mode by setting the system.bft parameter in the configuration file to false. This mode requires fewer replicas to execute, but will not withstand full Byzantine behavior from compromised replicas.
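
The difference in replica count can be worked out from the standard bounds: tolerating f Byzantine faults needs at least 3f + 1 replicas (hence the 4 replicas for 1 fault in the counter demonstration), while tolerating f crash faults needs at least 2f + 1. The helper below is a hypothetical sketch of that arithmetic, not a BFT-SMaRt API.

```java
// Hypothetical helper, not part of BFT-SMaRt: minimum group size for f faults.
class ReplicaCountSketch {

    // Standard bounds: n >= 3f + 1 for Byzantine faults, n >= 2f + 1 for crash faults.
    static int minReplicas(int f, boolean byzantine) {
        if (f < 0) {
            throw new IllegalArgumentException("f must be non-negative");
        }
        return byzantine ? 3 * f + 1 : 2 * f + 1;
    }

    public static void main(String[] args) {
        for (int f = 1; f <= 3; f++) {
            System.out.println("f=" + f
                    + " BFT n=" + minReplicas(f, true)
                    + " CFT n=" + minReplicas(f, false));
        }
    }
}
```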

Generating public/private key pairs

If you need to generate public/private keys for more replicas or clients, you can use the commands below.

To generate RSA key pairs, execute the following command:

./smartrun.sh bftsmart.tom.util.RSAKeyPairGenerator <id> <key length> [config dir]

To generate ECDSA key pairs, execute the following command:

./smartrun.sh bftsmart.tom.util.ECDSAKeyPairGenerator <id> <domain parameter> [config dir]

The default config directories are config/keysRSA and config/keysECDSA, respectively. The commands above create key pairs for both clients and replicas. Alternatively, you can set system.communication.defaultkeys to true in the config/system.config file to force all processes to use the same public/private key pair and secret key. This is useful when deploying experiments and benchmarks, because it saves the programmer from generating keys for all principals involved in the system. However, this must not be used in real deployments.

Additional information and publications

If you are interested in learning more about BFT-SMaRt, you can read:

The paper about its state machine protocol published in EDCC'12;
The paper about its advanced state transfer protocol published in Usenix'13;
The tool description published in DSN'14;
The paper about read-only optimization published in SRDS'21.

Feel free to contact us if you have any questions!


library's Issues

BFT Smart in crash mode leaderId = -1

Why does getLeader() on the primaryId of messageContext return -1 when reading it in the 2f+1 mode?

Reproducing performance results

Hi, I'm trying to reproduce the results described in the paper using the ThroughputLatency microbenchmark, but I can't get any more than ~20 ops/s, which is nowhere near the ~70 kops/s mentioned in it. The paper describes a setup where each replica is running on its own Xeon machine, while mine are all running on the same VM, but I have a hard time believing that could be the problem when the CPU is mostly idling. Can you think of what could be causing this discrepancy?

Bug Report

hi,

my name is Ricardo Mendes and I am using SMaRt to develop a project.

by request of professor Allysson here are some bugs I found:

- in class navigators.smart.tom.ServiceProxy, at line ~191, we don't want to 
receive only n-f messages because the faulty server that we want to tolerate 
can be one of the first n-f servers.

- in class navigators.smart.tom.core.DeliveryThread, at line ~211, we can't 
simply replace the request[0] by cons.firstMessage because we can't lose the 
timestamp of the request.

Regards
RicardoMendes

Original issue reported on code.google.com by ricardo.mends on 27 Oct 2011 at 1:22

Netty issues when connecting replicas

I followed the Getting Started with BFT-SMaRt wiki page and implemented my own version of DefaultRecoverable. Everything compiles without an issue - but running server replicas results in seeing a rather long stack trace. As far as I know I'm following instructions correctly, but I can't be certain.

I tried using the latest version by cloning the repository and using the .jar files from master, but still no luck. Here's what happens:

Replica 0 boots and operates correctly.
Replica 1 boots and reports connection with 0 was successful.
When I start Replica 2, only #Using view stored on disk is seen on the output, and Replica 1 outputs:

Session Created, active clients=0
io.netty.handler.codec.DecoderException: java.lang.NegativeArraySizeException
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:299)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:168)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NegativeArraySizeException
    at bftsmart.communication.client.netty.NettyTOMMessageDecoder.decode(NettyTOMMessageDecoder.java:121)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:268)
    ... 12 more

Broken authentication? Insecure authkey for MAC creation

In https://github.com/bft-smart/library/blob/master/src/bftsmart/communication/client/netty/NettyTOMMessageDecoder.java#L181 and https://github.com/bft-smart/library/blob/master/src/bftsmart/communication/client/netty/NettyClientServerCommunicationSystemClientSide.java#L199

The authkey is created from a string which is clientID:replicaID
However, since these IDs (and thus the authkeys) are known to any replica, a single malicious replica could spoof MACs and imitate any other replica. By sending n response TOMMessages, each of them with a spoofed senderID and the respective correct MAC, the malicious replica can imitate all other replicas and thus "circumvent" the bft quorum and force the client to accept its result.

I think it is necessary to use a key-agreement protocol for client-replica connections such as Diffie-Hellman to establish a secure shared secret, that no other system participant can know. Or perhaps did I overlook something here?

[BUG] Messages to multiple targets

While building a Forwarder with BFT-SMaRt, I encountered a severe bug that occurs when sending a single message to multiple targets:

https://github.com/bft-smart/library/blob/6892ab38135a222a2cf5d54fbbd5c586ec0aef6f/src/bftsmart/communication/client/netty/NettyClientServerCommunicationSystemServerSide.java#L274:L283

The bug is here:

for (int i = 0; i < targets.length; i++) {
    rl.readLock().lock();
    //sendLock.lock();
    try {
        NettyClientServerSession ncss = (NettyClientServerSession) sessionTable.get(targets[i]);
        if (ncss != null) {
            Channel session = ncss.getChannel();
            sm.destination = targets[i];
            //send message
            session.writeAndFlush(sm);

Problem:
writeAndFlush() works asynchronously, so you cannot assume that a message has already been sent when the for loop moves on to the next iteration. The next iteration modifies the destination field of the same object (sm), and because Java passes object references, both queued writes see the same value (targets[1] instead of targets[0]). Both messages are therefore encoded in exactly the same way, with the same MAC applied, although they are being sent to different clients. This leads to one client rejecting the message because it thinks that the MAC is corrupt.

Solution:
It is sufficient to make a shallow copy and pass the copied object to writeAndFlush()
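
The aliasing problem and the proposed fix can be reproduced without Netty. In the sketch below, a plain list stands in for the asynchronous write queue and the "encoding" step runs only after the loop has finished. The Message class, its copy() method, and the send() helper are hypothetical stand-ins, not BFT-SMaRt or Netty types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reproduction of the shared-object bug; no BFT-SMaRt or Netty types involved.
class SharedMessageSketch {

    static final class Message {
        int destination;
        final String payload;

        Message(String payload) {
            this.payload = payload;
        }

        // Shallow copy: a new object with the same field values.
        Message copy() {
            Message m = new Message(payload);
            m.destination = destination;
            return m;
        }
    }

    // Queues one message per target and "encodes" them only after the loop,
    // mimicking a write that completes asynchronously.
    static List<Integer> send(int[] targets, boolean copyPerTarget) {
        List<Message> pending = new ArrayList<>();
        Message sm = new Message("reply");
        for (int target : targets) {
            Message out = copyPerTarget ? sm.copy() : sm;
            out.destination = target;   // mutates the shared object when no copy is made
            pending.add(out);
        }
        List<Integer> encodedDestinations = new ArrayList<>();
        for (Message m : pending) {
            encodedDestinations.add(m.destination);
        }
        return encodedDestinations;
    }

    public static void main(String[] args) {
        System.out.println("shared object: " + send(new int[] {7, 8}, false));
        System.out.println("copy per target: " + send(new int[] {7, 8}, true));
    }
}
```

With targets 7 and 8, the shared-object variant encodes destination 8 twice, while the copying variant keeps 7 and 8 apart.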

Why do we need to cache replies?

There is legacy code in the ClientManager object that caches the 5 last replies 
to the clients. Is this really necessary, given that nowadays we use a client 
session, and that correct clients never re-use a sequence number?

Original issue reported on code.google.com by [email protected] on 16 Jan 2015 at 8:43

Processing NOOP operations from the recover / MessageContexts in DefaultRecoverable

If a replica asks for a state, DefaultRecoverable and 
DefaultSingleRecoverable might try to make the application parse a NOOP 
operation that is used only within the logs of the state. Currently the demos 
have a workaround for this issue, but it is necessary to add code to explicitly 
signal that this is not meant to be delivered.

I believe this can be done if we also log the MessageContext objects, as is 
already being done in DurabilityCoordinator. If we add this logging to 
DefaultRecoverable plus a new getter to indicate a NOOP, we fix 2 issues 
simultaneously.

Original issue reported on code.google.com by [email protected] on 16 Jan 2015 at 8:48

ServiceProxy.invoke hangs if only one node is running

Problem

When only one node is running and no other nodes are connected, the second call to ServiceProxy.invoke hangs.

Reproduction steps

Start only one node.
Call ServiceProxy.invoke twice with different data.
The second call hangs.

Root cause

When an exception is raised, the canSend lock is not unlocked.

exception : java.lang.RuntimeException: Server not connected, message : Server not connected, occurred at : java.lang.RuntimeException: Server not connected
	at bftsmart.communication.client.netty.NettyClientServerCommunicationSystemClientSide.send(NettyClientServerCommunicationSystemClientSide.java:383)
	at bftsmart.tom.TOMSender.TOMulticast(TOMSender.java:150)
	at bftsmart.tom.ServiceProxy.invoke(ServiceProxy.java:197)
	at bftsmart.tom.ServiceProxy.invokeOrdered(ServiceProxy.java:143)

Solution

We need to unlock the canSend lock in the ServiceProxy.invoke method in the finally clause.
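
A minimal sketch of the proposed pattern follows. It assumes canSend behaves like a java.util.concurrent.locks.ReentrantLock; the class UnlockInFinallySketch and its simplified invoke(request, serverConnected) signature are hypothetical and do not mirror the real ServiceProxy.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the proxy; only the locking pattern is of interest.
class UnlockInFinallySketch {

    // Assumption: the real canSend lock behaves like a ReentrantLock.
    private final ReentrantLock canSend = new ReentrantLock();

    // Simulates an invocation that fails when no server is connected.
    byte[] invoke(byte[] request, boolean serverConnected) {
        canSend.lock();
        try {
            if (!serverConnected) {
                throw new RuntimeException("Server not connected");
            }
            return request; // echo the request as a fake reply
        } finally {
            // Runs on both the normal and the exceptional path,
            // so a failed call cannot leave the lock held.
            canSend.unlock();
        }
    }

    boolean lockHeld() {
        return canSend.isLocked();
    }

    public static void main(String[] args) {
        UnlockInFinallySketch proxy = new UnlockInFinallySketch();
        try {
            proxy.invoke(new byte[] {1}, false);
        } catch (RuntimeException e) {
            System.out.println("first call failed: " + e.getMessage());
        }
        System.out.println("lock still held: " + proxy.lockHeld());
    }
}
```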

DES has been obsolete since 2001 and is easily broken

https://github.com/bft-smart/library/blob/master/src/bftsmart/communication/client/netty/NettyClientServerCommunicationSystemClientSide.java#L98 and https://github.com/bft-smart/library/blob/master/src/bftsmart/communication/client/netty/NettyClientServerCommunicationSystemClientSide.java#L177

And MD5 is also obsolete.

Also, password based encryption usually uses CBC mode, and is not suitable for transport protocols, which is what it is being used for here.

See: https://crypto.stackexchange.com/questions/24592/is-it-safe-to-use-pbewithmd5anddes

ServiceReplica#leave throws NullPointerException

What steps will reproduce the problem?
1. instantiate 4 service replicas 
2. send ordered messages
3. call ServiceReplica#leave to teardown my test cases

What is the expected output? What do you see instead?

I would expect the replica to be removed from the test set. My idea is to call 
this for each replica so that I can start new replica servers with different 
business logic (for other test cases).

What version of the product are you using? On what operating system?

bft-smart 0.8, Ubuntu Desktop 13.04 as well as Ubuntu Server 13.04

Please provide any additional information below.

A short investigation showed that the ServiceProxy within the generated 
Reconfiguration object is never initialized (it was commented out in the 
constructor, the new (?) connect method is not called)

java.lang.NullPointerException
    at bftsmart.reconfiguration.Reconfiguration.execute(Reconfiguration.java:59)
    at bftsmart.tom.ServiceReplica.leave(ServiceReplica.java:376)
    at at.ac.ait.archistar.storage.bft.FakeBftRemoteStorageServer.shutdown(FakeBftRemoteStorageServer.java:199)
    at at.ac.ait.archistar.bft.BftS3Test.destroyReplicas(BftS3Test.java:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
    at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175)
    at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68)

Original issue reported on code.google.com by [email protected] on 3 Jul 2013 at 3:36

Timeout on multithreaded load

I am doing load tests on BFT-SMART and I am having timeouts when I load more than one thread.
1000 // 0 // TIMEOUT // Replies received: 0
1000 // 0 // TIMEOUT // Replies received: 0
1000 // 0 // TIMEOUT // Replies received: 0

ProcessorClient_NOP_LOAD - Cópia.txt

Everything works fine with one thread, but with more than one thread I get long waits when calling the appExecuteOrdered method.

Create a LICENSE file in the root of the project

Thanks :-)

Possible timeout issue with too many clients.

In RequestsTimer.java, I do not think unwatch will ever call stopTimer if the watched queue is continuously populated. Therefore, the timer will expire even during normal operation and cause a leader change. Is there some other mechanism that ensures that the timer is reset?

https://github.com/bft-smart/library/blob/master/src/bftsmart/tom/leaderchange/RequestsTimer.java#L128

Make the DeliveryThread deliver an array of messages, instead of one

Create the operation deliveryUnordered(TOMMessage[] messages). The idea is to 
make it easy for an application to deal with a batch. The reason for this is to 
make it easy for writing data for the whole batch to the disk (allowing 
efficient logging).

Original issue reported on code.google.com by [email protected] on 22 Sep 2011 at 10:11

Denial of Service and Timeout Manipulation possible?

Attacks that impact message delay, such as denial-of-service attacks, can be used to increase the timeout value used to detect a faulty leader; e.g., in such an attack scenario the timeout doubles each time the view changes. The adversary can exploit this behavior by stopping its attack when a malicious replica is the leader, so that the malicious leader can slow down the system's throughput dramatically. [1]

My question is:
Is the request timer in BFT-SMaRt increased (e.g. doubled) when the leader changes and if so, is it being reset later on? (So basically would a timeout manipulation attack be possible?)

[1] compare Amir, Yair, et al. "Prime: Byzantine replication under attack." IEEE Transactions on Dependable and Secure Computing 8.4 (2011): 564-577.

DefaultRecoverable is the only stable executor/recoverer

DefaultSingleRecoverer apparently is unable to process requests and 
DurabilityCoordinator's state transfer is unstable.

The only executor/recoverer that offers satisfactory stability is 
DefaultRecoverable. Until we properly test/fix the others, we have to include a 
note in the readme file warning users about this issue.

Original issue reported on code.google.com by [email protected] on 26 Jan 2015 at 6:20

Bft smart primary stops working

I have some problems with the Bft-smart primary at the moment.
Everything goes well for about 2-3 minutes, then I receive several of these:

(MessageHandler.processData) LC_MSG received: type LOCAL, regency -1, (replica -1)
(MessageHandler.processData) LC_MSG received: type LOCAL, regency -1, (replica -1)
(MessageHandler.processData) LC_MSG received: type LOCAL, regency -1, (replica -1)

And it seems like the communication of other bft-smart members to this one stops, this member doesn't receive their requests anymore.

It is quite normal to receive one of them once in a while, but the moment it really goes down is when I receive like 5-6 of them at a time.

BftSmart, extract additional information

Hi,

I'm running bft smart in a bigger environment and I would like to send to other servers a proof of the order process.
Since the bftSmart nodes interact to define an order, each server probably has to sign its decision and send it to the others.
Is there a way to obtain these signatures at one of the replicas?

public byte[][] appExecuteBatch(final byte[][] bytes, final MessageContext[] messageContexts, final boolean noop)

It would be great for example if we could find these signatures in the messageContext.

Livelock after partitioning 1 node?

I’m trying to understand how BFT-SMaRt behaves under some adversarial schedulers.

I hacked the code so that the servers listen on an alternate port, so I can interpose a software scheduler, written in python. The script I’m using to run the experiment is here:
https://github.com/amiller/library/blob/amiller-bug-oct15/run-bug.sh

But I’m encountering some unexpected behavior - the system stops making progress even after partitioning only 1 node. I’m wondering if you could help interpret the log files to understand what’s happening, or suggest how I could go about diagnosing it?

More details:

I modified one line of code of BFT-SMaRt so that if a node is supposed to be reachable at port 100XX, then it actually listens on port 110XX (adds 1000). amiller@e9bf110#diff-1fab8f649d6de8298fdf534e08c832c1R82
Here is the python script that forwards from port 100XX to port 110XX, and can be scripted to create or heal partitions https://github.com/amiller/library/blob/amiller-bug-oct15/amiller-bug.py
The scheduler can inspect the messages to determine when a view-change is about to occur.

Essentially I run the following experiments:

Run 1: (OK) Ordinary (without any modified code) Ordinary progress. https://github.com/amiller/library/tree/amiller-bug-oct15/run1
Run 2: (OK) Ordinary (with patched code, and python scheduler) Same as previous. https://github.com/amiller/library/tree/amiller-bug-oct15/run2
Run 3: (OK) After about 20 seconds, the scheduler isolates the leader (Node 0). As expected, a view-change occurs after 60 more seconds, and the system continues making progress.
https://github.com/amiller/library/tree/amiller-bug-oct15/run3
Run 4: (OK) After about 20 seconds, the scheduler isolates the leader (Node 0). After the scheduler detects a view-change, the scheduler heals the partition. Progress continues as expected.
https://github.com/amiller/library/tree/amiller-bug-oct15/run4
Run 5: (Fails!) After about 20 seconds, the scheduler isolates the leader (Node 0). After the scheduler detects a view-change, the scheduler delivers messages between Nodes {0,2,3}, but the would-be-new-Leader (Node 1) is isolated.
The system thereafter does not make any progress, but instead the nodes {0,2,3} continually attempt view-changes that do not succeed.
https://github.com/amiller/library/tree/amiller-bug-oct15/run5

TTP question/problem

If we try to use the TTP without having already the most up-to-date currentView 
file, the TTP's operation is discarded by the replicas and the TTP gets no 
replies.

Should the TTP be used always making the assumption that it always has the most 
up-to-date view before issuing operations, or should it be able to update its 
view like regular clients already do?

Original issue reported on code.google.com by [email protected] on 26 Jan 2015 at 6:24

getReplyQuorum() computed wrong in ServiceProxy?

In https://github.com/bft-smart/library/blob/ff5f027ab7f110abc6f56900dc74f243cdf3107b/src/bftsmart/tom/ServiceProxy.java#L399:L406

The BFT quorum is computed by q = ceil((n+f)/2)+1 which should be wrong, because in a n=4 f=1 configuration this equals ceil((4+1)/2)+1 =ceil(2.5)+1 = 4
and not a single faulty replica can be tolerated!

This formula should use floor() instead of ceil() and should be used for input validation and not for output validation?

The CFT quorum is computed by q=ceil(n/2)+1 which should be wrong too, since in a n=3 f=1 config this equals ceil(3/2)+1 = 3 and not a single faulty replica can be tolerated!

Also: Shouldn't the correct quorum for output validation in BFT mode be f+1? (thus 2 in a n=4 f=1 config) Because we can assume that at most f replicas are faulty?

In a CFT model, the quorum for output validation can be even constant 1? As replicas are only assumed to crash, if a response arrives at the client it must be correct?

I am seriously confused right now, so please explain if I am wrong because this error would lead to the BFT-SMaRt system's implementation not being fault tolerant at all.
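
The arithmetic in this question depends on how the division is evaluated. As a hedged illustration (it is an assumption here that the linked method works on int values; the sketch does not reproduce the library's code), the following compares the formulas as written above under real-valued division and under Java int division, where (n + f) / 2 truncates before Math.ceil is applied. The class and method names are hypothetical.

```java
// Hypothetical arithmetic sketch; not the library's getReplyQuorum() implementation.
class QuorumArithmeticSketch {

    // q = ceil((n + f) / 2) + 1 evaluated with real-valued division.
    static int bftQuorumRealDivision(int n, int f) {
        return (int) Math.ceil((n + f) / 2.0) + 1;
    }

    // Same expression with int operands: the division truncates first,
    // so Math.ceil receives an already-whole number.
    static int bftQuorumIntDivision(int n, int f) {
        return (int) Math.ceil((n + f) / 2) + 1;
    }

    // q = ceil(n / 2) + 1 evaluated with real-valued division.
    static int cftQuorumRealDivision(int n) {
        return (int) Math.ceil(n / 2.0) + 1;
    }

    // Same expression with int division.
    static int cftQuorumIntDivision(int n) {
        return (int) Math.ceil(n / 2) + 1;
    }

    public static void main(String[] args) {
        System.out.println("BFT n=4 f=1 real: " + bftQuorumRealDivision(4, 1));
        System.out.println("BFT n=4 f=1 int:  " + bftQuorumIntDivision(4, 1));
        System.out.println("CFT n=3 real:     " + cftQuorumRealDivision(3));
        System.out.println("CFT n=3 int:      " + cftQuorumIntDivision(3));
    }
}
```

For n=4, f=1 the real-valued version gives 4, as computed in the question, whereas the int version gives 3. For n=3 in the crash model the two versions give 3 and 2 respectively.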

Bft smart dropping currentView in config directory no matter what

Bft smart should actually take the config home defined on startup of each replica and post the currentView there and not just in /config.

path = System.getProperty("user.dir") + sep + "config";

There should be 2 constructors:

public DefaultViewStorage(final String alternativePath) {
    String sep = System.getProperty("file.separator");
    path = System.getProperty("user.dir") + sep + alternativePath;
    File f = new File(path);
    if (!f.exists()) {
        f.mkdirs();
    }
    path = path + sep + "currentView";
}

Two reconfigurations on the same machine interfere

I'm running two instances of bft-smart on the same machine.
I noticed that when I try to reconfigure both at the same moment they will interfere.

final ViewManager viewManager = new ViewManager(configLocation);
viewManager.removeServer(idToCheck);
viewManager.executeUpdates();
Thread.sleep(2000L);
viewManager.close();

It stops at executeUpdates() on one server, and on the other it makes both changes.

DefaultRecoverable doesn't seem to deal with nodes going out-to-lunch and then returning

I've written a simple test application with BFT-Smart, as a warmup for doing something real. This application consists of:

(a) a simple client that sends "commands" consisting of some fixed amount of random bytes (1kbytes in this case). Perhaps one command, perhaps 100k commands serially, depending on the test

(b) The server-side application starts off with a SHA1 hash that's all zeroes, and for each command, appends the hash and the command, takes the hash of it, and sets that to be the new hash.

--> So: a primitive sort of "state" that can be affected by a large "command".
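
The state update described in (b) can be sketched as a hash chain. This is a reconstruction from the description only, not the reporter's actual code; the class name HashChainSketch and its methods are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical reconstruction of the described server-side state.
class HashChainSketch {

    // SHA-1 digests are 20 bytes long; the state starts as all zeroes.
    private byte[] state = new byte[20];

    // New state = SHA-1(old state || command).
    byte[] apply(byte[] command) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(state);
            sha1.update(command);
            state = sha1.digest();
            return state.clone();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    byte[] current() {
        return state.clone();
    }

    public static void main(String[] args) {
        HashChainSketch replica = new HashChainSketch();
        byte[] digest = replica.apply(new byte[1024]); // a 1 kB command
        System.out.println("state length after one command: " + digest.length);
    }
}
```

Because each state depends on every earlier command, two replicas end up with the same hash only if they applied the same commands in the same order.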

I'm running with 4 replicas, and the server extends DefaultRecoverable.

If I start 4 replicas, fire up a client doing 100k writes, and then after some thousands of writes, "control-Z" (stop, not kill) one replica, the other replicas keep going. MUCH, much later, if I "fg" (resume) the stopped replica, it attempts to catch up, and many, many bad things start happening:

(i) initially, {in,out}QueueSize was 500k msgs, and I got OutOfMemory errors. I reduced those to 5000 messages.

(ii) Then the stopped/resumed replica gets errors b/c it's run out of buffers, so it discards messages, and then asks for state-transfer. This state-transfer never successfully completes.

(iii) Eventually, even in this much-more-constrained scenario, two replicas end up suffering OOM errors (NOT the stopped/started replica) and I killed the test off.

I'm not sure how to go about debugging this, and/or how to go about providing enough detail that you can reproduce it. I'm happy to share code and steps to repro, if you'd like.

Also, are the sorts of faults I induced supposed to be tolerated by a 4-replica DefaultRecoverable?

Thanks,
--chet--

Make a new ServiceReplica that can forward messages and associated proxies

We need to make a ForwarderReplica that can process requests and send the reply 
to other processes. 

At client side we need to implement a SenderProxy and ReceiverProxy, each one 
taking parts of the code of the ServiceProxy.

Original issue reported on code.google.com by [email protected] on 22 Sep 2011 at 10:18

Refactor ServiceReplica & ServiceProxy constructors

Make at most 4 constructors for each of them.

ConfigHome must be passed as a system parameter (e.g., -Dbftsmart.configHome=...).
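
A minimal sketch of reading such a system parameter, assuming the property name `bftsmart.configHome` from the example above and a fallback to `config` when it is absent. The class name `ConfigHomeSketch` and the fallback value are assumptions made for illustration:

```java
// Hypothetical helper: resolves the config home from a JVM system property.
final class ConfigHomeSketch {
    static final String PROPERTY = "bftsmart.configHome";

    // Returns the value passed with -Dbftsmart.configHome, or "config" if unset or empty.
    static String resolve() {
        String home = System.getProperty(PROPERTY, "");
        return home.isEmpty() ? "config" : home;
    }

    public static void main(String[] args) {
        System.out.println("Config home: " + resolve());
    }
}
```

Reading the value in one place like this would let the constructors drop their config-home argument, which fits the goal of having fewer constructors.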

Original issue reported on code.google.com by [email protected] on 2 Mar 2012 at 2:25

Where is BFTMap?

It seems like some tests (test/bftsmart/demo/bftmapjunit/ConsoleTest.java for example) are using BFTMap, but I can't find the code for this - only the class module is present: /bin/bftsmart/demo/bftmap/BFTMap.class. Is that by design?

Separate leader thread from TOMLayer

Remove the "leader" thread code from the TOMLayer and put it in the 
...roles.Proposer class or a new class created just for it.

Original issue reported on code.google.com by [email protected] on 22 Sep 2011 at 10:00

Merged into: #2

Problem when clients constantly change connections

There is a race condition in the "send" method from 
NettyClientServerCommunicationSystemServerSide.

If for instance, a client is always re-starting its connection, the 
aforementioned object might not have yet the "sessionTable" attribute ready 
with the connection to the client, and thus will discard the reply and not send 
anything to the client. This can happen if the arrival of the batch, the 
execution of consensus and delivery to application ends faster than setting up 
a connection between a client and Netty. A possible solution is to force 
clients to send an initial readonly to the replicas just to force a quorum of 
replicas to establish a connection to it.

However, this issue does not manifest very often if the client is always 
connected. This must be tested in cases where the client processes end and then 
restart.

Original issue reported on code.google.com by [email protected] on 16 Jan 2015 at 8:41

Malicious Byzantine fault support

To what extent are malicious Byzantine faults tolerated in the current version?
I've seen the config options for enabling the bft mode and message signing, but looking at the code, for example, https://github.com/bft-smart/library/blob/master/src/bftsmart/communication/client/netty/NettyClientServerCommunicationSystemServerSide.java#L265,
replies from replicas seem to never get signed.

Bft smart creates ordered message out of unordered response

In our project we sent a semiEmpty response with only one byte "1" to the server.
This somehow triggered an ordered message.

I believe that shouldn't be this way.

Contact information

Hey,

I've been trying to contact you with some questions but wasn't able to find any contact information anywhere. Would you be so kind as to post it? Thank you :)

PS: Sorry for the inappropriate issue.

Has bft-smart been used in a production environment?

Has bft-smart been used in a production environment?
Which projects use bft-smart?

thx

Confused by an always-true test in run_lc_protocol()

https://github.com/bft-smart/library/blob/master/src/bftsmart/tom/leaderchange/RequestsTimer.java#L162

if ((request.receptionTime + System.currentTimeMillis()) > t) {  // So this test will always be True, isn't it?
    pendingRequests.add(request);
}

What's the meaning of adding two time values with different units?
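
For comparison, here is a minimal sketch of the check the question seems to expect: a request counts as timed out once the time elapsed since its reception reaches the timeout. All names are hypothetical, `t` is assumed to be an epoch-millisecond timestamp, and nothing here is claimed to be what `RequestsTimer` does or should do.

```java
// Hypothetical illustration of the two conditions discussed in the issue.
final class TimeoutCheckSketch {
    // Elapsed-time check: true once the request has waited at least timeoutMillis.
    static boolean hasTimedOut(long receptionTimeMillis, long nowMillis, long timeoutMillis) {
        return nowMillis - receptionTimeMillis >= timeoutMillis;
    }

    // Shape of the quoted condition: the sum of two timestamps compared with t.
    static boolean sumExceeds(long receptionTimeMillis, long nowMillis, long t) {
        return receptionTimeMillis + nowMillis > t;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long received = now - 500; // request arrived 500 ms ago

        System.out.println(hasTimedOut(received, now, 2000)); // false: 500 ms < 2000 ms
        System.out.println(hasTimedOut(received, now, 200));  // true: 500 ms >= 200 ms
        System.out.println(sumExceeds(received, now, now));   // true for any positive reception time
    }
}
```

If both operands are epoch timestamps, their sum is roughly twice the current time, so comparing it against a single timestamp cannot distinguish old requests from new ones.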

Severe issues with invokeAsynchRequest

In the Byzantine setting, if a replica in the cluster fails and we then send a message with invokeAsynchRequest, the client gets stuck indefinitely.

Byzantine simulation on appExecuteUnordered

I'm simulating a byzantine behaviour in one of my replicas in a get request in appExecuteUnordered.

It seems that if one replica returns a different value, every replica tries to execute appExecuteBatch with that same get request.

This is my implementation of appExecuteBatch:

@Override
public byte[][] appExecuteBatch(byte[][] bytes, MessageContext[] messageContexts) {
    byte[][] replies = new byte[bytes.length][];
    for (int i = 0; i < bytes.length; i++)
        replies[i] = executeSingle(bytes[i], messageContexts[i]);
    return replies;
}

For some reason, if one replica returns a different value, the ServiceProxy object gets a null value. Is that supposed to happen?
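
To make the question concrete, the following sketch shows how a client-side comparison of replies can end up with no value at all. It is an assumption for illustration only, not the logic of `ServiceProxy`: the class `ReplyMatchSketch`, the method `decide` and the quorum values are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical reply comparison: accept a value only if enough replicas agree on it.
final class ReplyMatchSketch {
    // Returns a reply that at least `quorum` replicas sent, or null if no value reaches it.
    static byte[] decide(byte[][] replies, int quorum) {
        for (byte[] candidate : replies) {
            if (candidate == null) {
                continue; // replica did not answer
            }
            int matches = 0;
            for (byte[] other : replies) {
                if (other != null && Arrays.equals(candidate, other)) {
                    matches++;
                }
            }
            if (matches >= quorum) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[][] replies = { {1}, {1}, {1}, {9} }; // the fourth replica deviates
        System.out.println(decide(replies, 3) != null); // true: three replies match
        System.out.println(decide(replies, 4) != null); // false: the replies are not unanimous
    }
}
```

Under this model, a single deviating replica is harmless when the required quorum is smaller than the number of replicas, but produces a null result whenever all replies must match.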

Implement the ClientContext and ReplicaContext on clients and servers

Modify the API to access these objects.

From these objects one should be able to:
- send and receive messages using the communication system
- get the private key of the local process, the public keys from other 
processes and the secret key shared between this process and the other

Original issue reported on code.google.com by [email protected] on 22 Sep 2011 at 10:06

Faulty smartrun.bat

The smartrun.bat script has the wrong name for the netty library, so you get a ClassNotFoundException if you try to run it.

There is also a minor fault in the README concerning the name of the .bat file. It contains the line "You can use the './runscripts/runsmart.bat' script in Windows, and the './runscripts/runsmart.sh' script in Linux.", but the .bat and .sh scripts are both named smartrun.

Leader replay attack

The current leader may perform a replay attack by proposing a batch that has already been proposed and decided in the past.

Some PoC patches:

oldsharp@7e04621
oldsharp@14b16e7

Bft smart getting stuck

Hey there,

I am working with BFT-Smart in the context of my master thesis and it started to show some strange behavior after I set it up on 4 different servers.

It gets stuck at:

this.replica = new ServiceReplica(id,configDirectory, this, this, null, new DefaultReplier());

At all the servers:

Config home: global/config
Config home in getViewStore: global/config
Trying with alternative part: /home/ubuntu/thesis/global/config/currentView
-- Creating current view from configuration file
-- ID = 0
-- N = 4
-- F = 1
-- Port = 11300
-- requestTimeout = 2000
-- maxBatch = 400
-- Using MACs
-- In current view: ID:0; F:1; Processes:0(/172.31.0.18:11300),1(/172.31.0.19:11310),2(/172.31.0.20:11320),3(/172.31.0.23:11330),
(17/05/03 16:09:05 - TOM Layer) Running.
(17/05/03 16:09:05 - TOM Layer) Next leader for CID=0: 0
(17/05/03 16:09:05 - TOM Layer) (TOMLayer.run) I'm the leader.
-- Diffie-Hellman complete with 1
-- Diffie-Hellman complete with 3
-- Diffie-Hellman complete with 2

All servers get as far as completing Diffie-Hellman, but then stop and do not progress to the lines after it.

I'm quite close to finishing my thesis and help would be very welcome.

Thanks already,

Ray

I can't connect to the server


4 virtual machines (Ubuntu 10.04) as servers

1 client (Ubuntu 11.04)
I can't connect to the servers.
Please help me.


Connecting to replica 0 at /192.168.226.101:11000
Impossible to connect to 0
Connecting to replica 1 at /192.168.226.102:11010
Impossible to connect to 1
Connecting to replica 2 at /192.168.226.103:11020
Impossible to connect to 2
Connecting to replica 3 at /192.168.226.104:11030
Impossible to connect to 3
Counter sending: 0
(12/03/30 10:40:51 - main) Channel to 3 is not connected
(12/03/30 10:40:51 - main) Channel to 2 is not connected
(12/03/30 10:40:51 - main) Channel to 1 is not connected
(12/03/30 10:40:51 - main) Channel to 0 is not connected
java.lang.RuntimeException: Impossible to connect to servers!
    at navigators.smart.communication.client.netty.NettyClientServerCommunicationSystemClientSide.send(NettyClientServerCommunicationSystemClientSide.java:373)
    at navigators.smart.tom.TOMSender.TOMulticast(TOMSender.java:168)
    at navigators.smart.tom.ServiceProxy.invoke(ServiceProxy.java:171)
    at navigators.smart.tom.ServiceProxy.invoke(ServiceProxy.java:148)
    at navigators.smart.tom.demo.counter.CounterClient.main(CounterClient.java:80)

Original issue reported on code.google.com by [email protected] on 30 Mar 2012 at 2:57

BftSmart not ordering well when too many requests come in.

I noticed that the order of the requests in executeBatch can vary depending on the number of client requests.

The sequence ID and the consensus ID look correct, but the array of messages is out of order.

Meaning I can't just execute them like this:

    @Override
    public byte[][] appExecuteBatch(final byte[][] bytes, final MessageContext[] messageContexts, final boolean noop)
    {
        final byte[][] allResults = new byte[bytes.length][];
        for (int i = 0; i < bytes.length; ++i)
        {
            consume(bytes[i]);
        }
        return allResults;
    }

because they are not guaranteed to be in the right order.

No key-establishment between pairs when the connections are established (uses constant keys)

A fundamental missing feature.

Helio had solved the problem, but his code was not integrated in the svn.

Original issue reported on code.google.com by [email protected] on 22 Sep 2011 at 9:57

Java 1.6 support

Hi guys, I am using bft-smart for my project and my customer is asking to support Java 6.
Do you have any plan to support Java 6? (Oracle extended the support of Java 6 until end of next year.)
If you are ok, I am going to create a pull request. The changes are all about specifying explicit types when you create an instance of a generic class.

     public BaseStateManager() {
-        senderStates = new HashMap<>();
-        senderViews = new HashMap<>();
-        senderRegencies = new HashMap<>();
-        senderLeaders = new HashMap<>();
-        senderProofs = new HashMap<>();
+        senderStates = new HashMap<Integer, ApplicationState>();
+        senderViews = new HashMap<Integer, View>();
+        senderRegencies = new HashMap<Integer, Integer>();
+        senderLeaders = new HashMap<Integer, Integer>();
+        senderProofs = new HashMap<Integer, CertifiedDecision>();
     }

BFT-SMART not compatible with latest versions of Netty

What steps will reproduce the problem?
1. Install a Netty version higher than 3.2.0 ALPHA4

What is the expected output? What do you see instead?
Clients are disconnected without obtaining a response.

What version of the product are you using? On what operating system?
SMaRt-v0.6.zip

Please provide any additional information below.
The next version, Netty 3.2.0 BETA1, is already incompatible. The changelog for this version is at:
https://issues.jboss.org/secure/ReleaseNote.jspa?projectId=12310721&version=12314480

Original issue reported on code.google.com by andrefcruz on 28 Oct 2011 at 3:46

TimeOut bft-smart multiple clients

I have a quite big number of clients, but after 1-2 minutes I get:

Client 41: 141 // 892 // TIMEOUT // Replies received: 3
Client 40: 140 // 892 // TIMEOUT // Replies received: 3
Client 42: 142 // 892 // TIMEOUT // Replies received: 3
Client 44: 144 // 896 // TIMEOUT // Replies received: 3
Client 38: 138 // 893 // TIMEOUT // Replies received: 3
Client 45: 145 // 895 // TIMEOUT // Replies received: 3

Now, I configured it to 4 replicas (bft) with f = 1.

I don't know what the issue is.

Move all code for log and checkpoint management to ServiceReplica

The idea is to make the storage of logs and checkpoints the responsibility of the 
application, with a default in-memory storage implementation, as it is 
today.

Original issue reported on code.google.com by [email protected] on 22 Sep 2011 at 10:29

Changing view on the run

Is there any way to automatically launch a new replica after a detected crash without actually having to execute the group reconfiguration commands in README.txt?