jcarreira / cirrus-kv
High-performance key-value store
License: Apache License 2.0
Should we have a mechanism to replicate data to multiple servers?
If so, should this be strongly consistent, or would eventual consistency be fine? Or should we support both?
It seems to me eventually consistent would be enough for a large class of applications, namely machine learning.
Add the other tests, such as the memory test and the simple get/put tests, to the make check command.
After installing Cirrus on two servers via the on-site instructions, I modified throughput.cpp on both servers to point to the IP address of the server running tcpservermain. Then I ran the following commands:
On one server:
nohup ./tcpservermain & disown
On the other:
nohup ./throughput & disown
throughput began running the tests and produced the following output:
throughput_128.log:
throughput 128 test
msg/s: 13.0619
bytes/s: 1671.92
throughput_4096.log:
throughput 4096 test
msg/s: 13.0543
bytes/s: 53470.2
throughput_51200.log:
throughput 51200 test
msg/s: 24.9987
bytes/s: 1.27993e+06
throughput_1048576.log:
throughput 1048576 test
msg/s: 24.9772
bytes/s: 2.61905e+07
The benchmark hung after this point. nohup.out reads:
Warming up
Warm up done
size is 128
Measuring msgs/s..
Warming up
Warm up done
size is 4096
Measuring msgs/s..
Warming up
Warm up done
size is 51200
Measuring msgs/s..
Warming up
Warm up done
size is 1048576
Measuring msgs/s..
Environment:
Ubuntu 14.04 LTS (Amazon EC2 m4.large)
g++ 6.3
Also ran automake --add-missing in order to make ./bootstrap.sh work without error.
We should be able to run make benchmark and get some numbers about latency and throughput.
The user probably has to specify the name/IP of another server where we can run the Cirrus server.
Look into the benchmarks folder to see an example of a benchmark.
Test_mult clients is not currently run by make check. Add it in.
Currently connect hangs.
Doing something like CIRRUS_LOGGING=0 should turn off logs. Check the C/C++ function getenv.
Successive async reads can end up writing to the same memory address, which can lead to buggy reads.
The same bug also exists with async writes.
The bug can be reproduced by running the iterator test in the iterator branch.
Right now TCPServer has the attribute max_objects, which is inconsistent with the interface of the RDMAServer (which uses pool_size). Both should use a raw count of bytes: pool_size.
In the same way, tcpservermain should allow setting the size of the server pool as a command line argument (but have default size).
Things to do:
The plan is to make the build system check for the existence of the Infiniband/RDMA library (e.g., using AC_CHECK_LIB [1]).
The RDMA backend can be put in between an
#ifdef HAVE_LIBRDMACM
// ... RDMA code here
#endif
The same for tests and benchmarks that depend on the RDMA/Infiniband backend.
[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.66/html_node/Libraries.html
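A sketch of what the configure.ac check could look like (rdma_create_id is used as a representative librdmacm entry point; the conditional name HAVE_RDMA is an assumption):

```
# Define HAVE_LIBRDMACM when librdmacm is present; the build still
# succeeds without it, with the RDMA backend compiled out.
AC_CHECK_LIB([rdmacm], [rdma_create_id],
             [AC_DEFINE([HAVE_LIBRDMACM], [1],
                        [Define if librdmacm is available])],
             [AC_MSG_NOTICE([librdmacm not found; disabling RDMA backend])])
AM_CONDITIONAL([HAVE_RDMA],
               [test "x$ac_cv_lib_rdmacm_rdma_create_id" = xyes])
```

The AM_CONDITIONAL can then gate the RDMA tests and benchmarks in Makefile.am.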
When the client destructor is called as the client goes out of scope, the following message is shown:
terminate called without an active exception
This may be due to not detaching or joining the two loose threads in the destructor.
We should have a way to run tests and check style for every commit.
The test is designed to ensure the system works with multiple threads on one client. However, there is a concurrency issue: all threads read and write oid 1, overwriting one another. Additionally, there is a memory leak: d2 is never freed. As it stands now, the test will likely never pass.
We currently use raw structs to pass requests through the network.
We can check for some of the dependencies at configure time (in configure.ac). If the user has not installed some of these, we want to emit a nice error message.
Let's start by checking for the existence of:
flatc
cpplint
This may be useful:
https://stackoverflow.com/questions/7490978/autoconf-check-for-program-and-fail-if-not-found
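Following the approach in that link, the configure.ac checks could look like this (the macro arguments are a sketch, and the error messages are assumptions):

```
# Fail configure with a clear message when a required tool is missing.
AC_CHECK_PROG([FLATC], [flatc], [yes], [no])
AS_IF([test "x$FLATC" != xyes],
      [AC_MSG_ERROR([flatc not found; please install FlatBuffers])])

AC_CHECK_PROG([CPPLINT], [cpplint], [yes], [no])
AS_IF([test "x$CPPLINT" != xyes],
      [AC_MSG_ERROR([cpplint not found; try: sudo pip install cpplint])])
```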
Coverity issues a sizeable set of alarms on the existing code. We should check the report.
Opening the discussion on GPU disaggregation.
Two things come to mind:
This allows us to pay for a cheaper instance. However, GPUs are so much more expensive than any instance that the savings here are likely to be negligible.
GPUs are expensive and are exclusively allocated to a single user. However, they are likely to not be fully utilized at all times. This means they could be shared among concurrent users.
We could build a service that provides high levels of GPU virtualization by keeping the dataset remote. Isolation between concurrent tasks could be enforced in software (this has been shown to work, e.g., in Singularity, but I am not sure about this adversarial context).
We need a single script that runs all the tests: make test
Coverage statistics can be useful in driving the development of tests.
We may want to use gcov to get these stats and publish them to coveralls.io.
We should be able to easily change between RDMA or Ethernet.
Additionally, we should be able to disable all the RDMA dependencies when compiling on an ethernet-only environment (e.g., EC2).
Ideally, we would like to give developers the ability to provide their own eviction policy.
This might not be the way to go (at least for now) because:
References:
Redis eviction policies: https://redis.io/topics/lru-cache
The test benchmarks/throughput.cpp runs well on object sizes up to 50 kilobytes, but occasionally stalls on larger objects. This is present in the bandwidth_benchmark branch. As logging must be disabled to get accurate speeds for the test, the cause of the stalls is not readily apparent. The benchmark had been run without resetting the server in between; could this cause issues? Errors reading "pthread_setaffinity_np error 22" were also thrown on occasion, and only in the later revisions of the test.
Current speeds: (MB/s, messages/s) (at time of issue creation)
128 bytes: 20.7 MB/s, 162072
4K bytes: 556.371 MB/s, 135833
50K bytes: 2445.7 MB/s, 47767.9
1M bytes: 4442 MB/s, 4236.22
10M bytes: 4369.74 MB/s, 416.731
100M bytes: stalled entirely
Edit: ran the benchmark once more after resetting the remote server, and all tests ran, albeit after a long delay. Strangely, despite the tests taking so long, the results for transfer speeds are still rather high. This almost makes me think that the stall is happening outside of the timed section.
100M bytes: msg/s: 42.8607 bytes/s: 4494.27MB/s
~4.5 GB/s is the highest I've seen any benchmark run.
Right now some of our tests make some assumptions specific to our development environment (e.g., IPs of servers).
We should allow make test to run anywhere.
Currently we have a dependency on cityhash/libcuckoo. This means we need to set LD_LIBRARY_PATH to run any binary that uses Cirrus. For instance:
[joao@havoc:/data/joao/ddc/tests/object_store]% LD_LIBRARY_PATH=/data/joao/ddc/third_party/libcuckoo/cityhash-1.1.1/src/.libs ./test_fullblade_store
We should remove this dependency or find a way to statically link this library.
Includes the following features
store:
cache:
We should think of getting rid of the IP/port hardcoded values scattered throughout the tests.
A benefit of the hard coded values is that they simplify testing because we only need to call the binary to run the test -- no need to create a custom launch script.
We will need to create a python script to launch these tests.
The IP/port values can come from a few places:
@TylerADavis What do you think?
The current interface does not support asynchronous gets/puts, so the code for these was commented out of some tests and removed from test_fullblade_store. These tests should be added back once the interface supports asynchronous operations.
Add back the asynchronous operations, as well as true prefetching.
Come up with a way for the server to notify the client of errors, and for the futures to convey that error back to the user.
We should have automatic generation of code documentation.
Doxygen seems like a good tool to do this.
This is due to a dependency on a local .a that is not specified. For example, in src/server:
g++ -Wall -Wextra -ansi -fPIC -std=c++1z -pthread -o bladepoolmain bladepoolmain-bladepoolmain.o -L. -lserver -lrdmacm -libverbs -L../authentication/ -lauthentication -L../utils/ -lutils -L../common/ -lcommon
/usr/bin/ld: cannot find -lserver
collect2: error: ld returned 1 exit status
make: *** [bladepoolmain] Error 1
make: *** Waiting for unfinished jobs....
/usr/bin/ld: cannot find -lserver
collect2: error: ld returned 1 exit status
make: *** [allocmain] Error 1
Deep learning libraries follow a DistBelief-like pattern for model training.
We may be able to add the ability to access remote memory within those libraries without passing that implementation cost on to developers.
Currently, we are only seeing throughput of about 20 MB/s on 128-byte puts (before the introduction of the new interface). We should be seeing speeds of about 1 GB/s.
Current speeds: (MB/s, messages/s)
128 bytes: 20.7 MB/s, 162072
4K bytes: 556.371 MB/s, 135833
50K bytes: 2445.7 MB/s, 47767.9
1M bytes: 4442 MB/s, 4236.22
10M bytes: 4369.74 MB/s, 416.731
We should make Cirrus compatible with the Apache 2 license.
This entails removing the copyright messages from the source files.
We should be able to run Cirrus from within a lambda.
Includes the following features:
Store:
CacheManager:
At the moment, all state about the store is kept locally. If a client connects to a remote store that already contains objects, it will have no knowledge of the ObjectIDs in the store or of the mem_addr/peer_rkey and other information associated with them.
Should we implement some way for the client to get state from the server, and if so, how?
We should have an easy way to turn on and off logging information. Right now this involves recompiling.
An environment variable may be a good option.
It seems there are a few open-source platforms for microfunctions. It might be worth looking into what they do.
For instance:
https://github.com/Azure/service-fabric
http://blog.kubernetes.io/2017/01/fission-serverless-functions-as-service-for-kubernetes.html
Currently supports only one client at a time.
Add cpplint installation as a requirement in the documentation:
sudo pip install cpplint
In src/BladeAllocServer.cpp, allocator->allocate(size) can fail (an exception is thrown) when there is no more space available. We should be able to send a message back to the client side with this error and propagate it all the way to whoever called put().