
cirrus-kv's People

Contributors

devloop0, jcarreira, tyleradavis


cirrus-kv's Issues

Data Replication

Should we have a mechanism to replicate data to multiple servers?

If so, should this be strongly consistent, or would eventual consistency be fine? Or both?

It seems to me eventual consistency would be enough for a large class of applications, machine learning in particular.
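As a minimal sketch of the eventually consistent option (all names here — `Replica`, `ReplicatedStore` — are hypothetical, not the Cirrus API): a put is acknowledged as soon as the primary has it, while replicas are updated by background threads, so a read from a replica may briefly return stale data.

```cpp
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch of eventually consistent replication.
struct Replica {
    std::mutex m;
    std::map<std::string, std::string> data;
    void apply(const std::string& k, const std::string& v) {
        std::lock_guard<std::mutex> g(m);
        data[k] = v;
    }
};

class ReplicatedStore {
 public:
    explicit ReplicatedStore(std::vector<Replica*> replicas)
        : replicas_(std::move(replicas)) {}
    ~ReplicatedStore() {
        for (auto& t : pending_) t.join();  // drain in-flight replication
    }
    void put(const std::string& k, const std::string& v) {
        primary_[k] = v;  // acknowledged as soon as the primary has it
        // Fan out to replicas in the background (eventual consistency).
        pending_.emplace_back([this, k, v] {
            for (auto* r : replicas_) r->apply(k, v);
        });
    }
    std::string get(const std::string& k) { return primary_[k]; }

 private:
    std::map<std::string, std::string> primary_;
    std::vector<Replica*> replicas_;
    std::vector<std::thread> pending_;
};
```

A strongly consistent variant would instead block the put until every replica (or a quorum) acknowledged the write.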

Benchmark 'Throughput.cpp' Hangs at 10MB Test

After installing cirrus via the on-site instructions on two servers, I modified throughput.cpp on both servers to point to the IP address of the server running tcpservermain. Then I ran the following commands:

On one server:
nohup ./tcpservermain & disown

On the other:
nohup ./throughput & disown

throughput.cpp began running tests, outputting the corresponding information:

throughput_128.log:

throughput 128 test
msg/s: 13.0619
bytes/s: 1671.92

throughput_4096.log:

throughput 4096 test
msg/s: 13.0543
bytes/s: 53470.2

throughput_51200.log:

throughput 51200 test
msg/s: 24.9987
bytes/s: 1.27993e+06

throughput_1048576.log:

throughput 1048576 test
msg/s: 24.9772
bytes/s: 2.61905e+07

The benchmark hung after this point. nohup.out reads:

Warming up
Warm up done
size is 128
Measuring msgs/s..
Warming up
Warm up done
size is 4096
Measuring msgs/s..
Warming up
Warm up done
size is 51200
Measuring msgs/s..
Warming up
Warm up done
size is 1048576
Measuring msgs/s..

Environment:
Ubuntu 14.04 LTS (Amazon EC2 m4.large)
g++ 6.3

Also ran automake --add-missing in order to make ./bootstrap.sh work without error.

Add benchmark suite

We should be able to do

make benchmark

and get some numbers about latency and throughput.

The user probably has to specify the name/ip of another server where we can run the Cirrus server.

Look into the benchmarks folder to see an example of a benchmark.

Concurrency bug when doing async RDMA reads

Successive async reads can end up writing to the same memory address. This can lead to buggy reads.

The same bug also exists with async writes.

This bug can be replicated by running the iterator test in the iterator branch.
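A minimal illustration of one fix, using `std::async` to stand in for the RDMA verbs (`async_read` is an illustrative name, not the Cirrus API): the buggy pattern reused one shared destination buffer across in-flight reads, so two outstanding reads aliased the same memory; allocating a fresh buffer per request avoids that.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <future>
#include <memory>
#include <vector>

// Hypothetical illustration of the fix: a fresh destination buffer per
// outstanding request, kept alive by the completion lambda, instead of
// one shared buffer that concurrent reads would alias.
std::future<std::vector<char>> async_read(uint64_t oid, size_t size) {
    auto buf = std::make_shared<std::vector<char>>(size);
    return std::async(std::launch::async, [buf, oid] {
        // Stand-in for the RDMA completion filling *buf.
        std::memset(buf->data(), static_cast<int>(oid), buf->size());
        return *buf;
    });
}
```

The same per-request buffer ownership applies to the async write path.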

Make TCPServer and BladeAllocServer consistent

Right now TCPServer has the attribute max_objects, which is inconsistent with the interface of the RDMAServer (which uses pool_size).

Both should use a raw count of bytes: pool_size.

In the same way, tcpservermain should allow setting the size of the server pool as a command line argument (but have default size).

Things to do:

  1. Make TCPServer use pool_size (number of bytes in the pool) instead of the number of objects. Also, make it an argument of the constructor as in the BladeAllocServer
  2. Make tcpservermain and bladeallocmain get arguments with the size of the pool (they both should have a default size of 10GB).
  3. Fix the memory exhaustion test to use a small value for the server pools so that the servers throw an exception earlier. Tests should be quick
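A sketch of items 1 and 2; the constructor signature and helper below are assumptions, not the actual Cirrus code:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: pool_size in bytes, with a 10 GB default that
// tcpservermain/bladeallocmain can override from the command line.
constexpr uint64_t kDefaultPoolSize = 10ULL * 1024 * 1024 * 1024;  // 10 GB

class TCPServer {
 public:
    explicit TCPServer(int port, uint64_t pool_size = kDefaultPoolSize)
        : port_(port), pool_size_(pool_size) {}
    uint64_t pool_size() const { return pool_size_; }

 private:
    int port_;
    uint64_t pool_size_;
};

// argv[1], if present, is the pool size in bytes; otherwise the default.
uint64_t parse_pool_size(int argc, char* argv[]) {
    return argc > 1 ? std::strtoull(argv[1], nullptr, 10) : kDefaultPoolSize;
}
```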

Error Message When TCPClient is destructed

When the client destructor is called as the client goes out of scope, the following message is shown:
terminate called without an active exception

This may be due to not detaching or joining the two loose threads during the destructor.
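That diagnosis fits the symptom: destroying a `std::thread` that is still joinable calls `std::terminate()`, which prints exactly this message. A minimal sketch of the fix (the member names are assumptions about TCPClient's internals):

```cpp
#include <atomic>
#include <thread>

// Destroying a joinable std::thread calls std::terminate(), which
// prints "terminate called without an active exception". Joining (or
// detaching) the background thread in the destructor fixes it.
class TCPClient {
 public:
    TCPClient()
        : done_(false),
          receiver_([this] { while (!done_) {} }) {}  // stand-in recv loop
    ~TCPClient() {
        done_ = true;
        if (receiver_.joinable())
            receiver_.join();  // without this, ~thread() terminates
    }

 private:
    std::atomic<bool> done_;  // declared before receiver_ so it is
    std::thread receiver_;    // initialized first
};
```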

Fix test_mt.cpp test

The test is designed to ensure the system works with multiple threads on one client. However, it has a concurrency issue: all threads read and write object ID 1, overwriting one another. It also has a memory leak, as d2 is never freed. As it stands, the test will likely never pass.
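A sketch of the fix, with the object store stood in by a locked map (`worker`, `store`, and the oid scheme are illustrative): give each thread a unique object ID and free d2.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: a unique oid per thread instead of the shared
// oid 1, and the read buffer d2 is freed.
std::mutex store_mutex;
std::map<uint64_t, int> store;  // stand-in for the remote object store

void worker(uint64_t thread_id) {
    uint64_t oid = 1 + thread_id;  // unique per thread, no clobbering
    int d1 = static_cast<int>(thread_id) * 10;
    {
        std::lock_guard<std::mutex> g(store_mutex);
        store[oid] = d1;  // put
    }
    int* d2 = new int;  // get into a fresh buffer
    {
        std::lock_guard<std::mutex> g(store_mutex);
        *d2 = store[oid];
    }
    assert(*d2 == d1);  // no other thread overwrote our object
    delete d2;          // the original test leaked this
}
```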

Coverity results

Coverity issues a sizeable set of alarms on the existing code. We should check the report.

Cirrus/Disaggregation for GPUs

Opening the discussion for thinking of GPU disaggregation.

Two things come to mind:

  1. Attaching GPUs to uInstances

This allows us to pay for a cheaper instance. However, GPUs are so much more expensive than any instance that the savings here are likely to be negligible.

  2. GPU as a Service model

GPUs are expensive and are exclusively allocated to a single user. However, they are likely to not be fully utilized at all times. This means they could be shared among concurrent users.

We could build a service that provides high levels of GPU virtualization by keeping the dataset remote. Isolation between concurrent tasks could be enforced in software (this has been shown to work, e.g., in Singularity, though it is unclear whether it holds in this adversarial context).

Look into coveralls.io

Coverage statistics can be useful in driving the development of tests.

We may want to use gcov to get these stats and publish them into coveralls.io

Segfault in Throughput.cpp

The Throughput.cpp benchmark crashes with a segfault when attempting the 10 MB put test.

The offending line is

test_throughput<10   * 1024 * 1024>(num_runs / 100);

GDB output: (screenshot attached to the issue, dated 2017-07-10)

Make Ethernet+RDMA work seamlessly

We should be able to easily change between RDMA or Ethernet.

Additionally, we should be able to disable all the RDMA dependencies when compiling on an ethernet-only environment (e.g., EC2).
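One way to sketch this (`Client`, `EthClient`, `RDMAClient`, and the `HAVE_RDMA` macro are assumptions, not the current code): hide both transports behind one interface, and guard the RDMA backend, along with its ibverbs/rdmacm link dependencies, behind a configure-time flag.

```cpp
#include <memory>
#include <string>

// Hypothetical sketch: common interface, with the RDMA backend compiled
// out on ethernet-only builds (e.g., EC2) via a configure-time flag.
class Client {
 public:
    virtual ~Client() = default;
    virtual std::string transport() const = 0;
};

class EthClient : public Client {
 public:
    std::string transport() const override { return "tcp"; }
};

#ifdef HAVE_RDMA  // set by ./configure when ibverbs/rdmacm are present
class RDMAClient : public Client {
 public:
    std::string transport() const override { return "rdma"; }
};
#endif

std::unique_ptr<Client> make_client(bool prefer_rdma) {
#ifdef HAVE_RDMA
    if (prefer_rdma) return std::make_unique<RDMAClient>();
#else
    (void)prefer_rdma;  // RDMA support compiled out
#endif
    return std::make_unique<EthClient>();
}
```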

Implement CacheManager eviction policy

Ideally, we would like to allow the developer to provide their own eviction policy.

This might not be the way to go (at least for now) because:

  1. it is hard to abstract away the internals of the cache from the eviction policy
  2. an eviction policy that is not tightly integrated with the cache becomes inefficient

References:
Redis eviction policies: https://redis.io/topics/lru-cache
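To make the trade-off concrete, here is a minimal sketch of a pluggable policy interface with an LRU implementation (`EvictionPolicy`/`LRUPolicy` are hypothetical names, not the CacheManager API). Note how the policy keeps its own recency bookkeeping, duplicating state the cache could track internally, which is exactly the inefficiency point 2 warns about.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Hypothetical pluggable-policy sketch.
class EvictionPolicy {
 public:
    virtual ~EvictionPolicy() = default;
    virtual void touch(uint64_t oid) = 0;   // called on every get/put
    virtual uint64_t choose_victim() = 0;   // called when the cache is full
};

class LRUPolicy : public EvictionPolicy {
 public:
    void touch(uint64_t oid) override {
        auto it = pos_.find(oid);
        if (it != pos_.end()) order_.erase(it->second);
        order_.push_front(oid);
        pos_[oid] = order_.begin();
    }
    uint64_t choose_victim() override {
        uint64_t victim = order_.back();  // least recently used
        order_.pop_back();
        pos_.erase(victim);
        return victim;
    }

 private:
    std::list<uint64_t> order_;  // front = most recently used
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos_;
};
```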

Bandwidth Benchmark sometimes stalls on >=1MB puts

The test benchmarks/throughput.cpp runs well on object sizes up to 50 kilobytes, but occasionally stalls on larger objects (present in the bandwidth_benchmark branch). Because logging must be disabled to get accurate speeds, the cause of the stalls is not readily apparent. The benchmark had been run without resetting the server in between runs; could this cause issues? Errors reading "pthread_setaffinity_np error 22" were also thrown on occasion, and only in later revisions of the test.

Current speeds: (MB/s, messages/s) (at time of issue creation)
128 bytes: 20.7 MB/s, 162072
4K bytes: 556.371 MB/s, 135833
50K bytes: 2445.7 MB/s, 47767.9
1M bytes: 4442 MB/s, 4236.22
10M bytes: 4369.74 MB/s, 416.731
100M bytes: stalled entirely

Edit: I ran the benchmark once more after resetting the remote server, and all tests ran, albeit after a long delay. Strangely, despite the tests taking so long, the reported transfer speeds are still rather high. This almost makes me think the stall is happening outside of the timed section.

100M bytes: msg/s: 42.8607 bytes/s: 4494.27MB/s

~4.5 GB/s is the highest I've seen any benchmark reach.

Make test of Cirrus in any cluster

Right now some of our tests make some assumptions specific to our development environment (e.g., IPs of servers).

We should allow make test to run anywhere.

Statically link against cityhash/libcuckoo or remove dependency

Currently we have a dependency on cityhash/libcuckoo. This means we need to set LD_LIBRARY_PATH to run any binary that uses Cirrus. For instance:

[joao@havoc:/data/joao/ddc/tests/object_store]% LD_LIBRARY_PATH=/data/joao/ddc/third_party/libcuckoo/cityhash-1.1.1/src/.libs ./test_fullblade_store

We should remove this dependency or find a way to statically link this library.

Fix ip/port hardcoded values

We should think of getting rid of the IP/port hardcoded values scattered throughout the tests.

A benefit of the hardcoded values is that they simplify testing: we only need to call the binary to run the test -- no need to create a custom launch script.

We will need to create a python script to launch these tests.

The IP/port values can come from a few places:

  1. ./configure --test_IP=127.0.0.1 --port=18723
  2. make IP=127.0.0.1 PORT=18723 test

@TylerADavis What do you think?
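A third option that keeps the binaries standalone: the tests read the values from the environment and fall back to the current defaults, and `make test` just exports them. A sketch (`test_ip`/`test_port` and the variable names are illustrative):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical sketch: tests read TEST_IP/TEST_PORT from the
// environment (e.g., exported by `make IP=... PORT=... test`) and fall
// back to defaults, so running the binary by itself still works.
std::string test_ip() {
    const char* ip = std::getenv("TEST_IP");
    return ip ? ip : "127.0.0.1";
}

int test_port() {
    const char* p = std::getenv("TEST_PORT");
    return p ? std::atoi(p) : 18723;
}
```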

Add async operations

The current interface does not support asynchronous gets/puts, so the code exercising them was commented out in some tests and removed from test_fullblade_store. These tests should be added back once the interface supports asynchronous operations.

Add back the asynchronous operations, as well as true prefetching.
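A sketch of the interface those tests need, using `std::future` so callers can overlap many requests (`Store` and the `*_async` names are assumptions; the real Cirrus interface may differ):

```cpp
#include <cstdint>
#include <future>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical async interface sketch: put/get return futures that the
// caller can wait on later, allowing many requests in flight at once.
class Store {
 public:
    std::future<bool> put_async(uint64_t oid, std::string val) {
        return std::async(std::launch::async,
                          [this, oid, v = std::move(val)] {
            std::lock_guard<std::mutex> g(m_);
            data_[oid] = v;
            return true;
        });
    }
    std::future<std::string> get_async(uint64_t oid) {
        return std::async(std::launch::async, [this, oid] {
            std::lock_guard<std::mutex> g(m_);
            return data_[oid];
        });
    }

 private:
    std::mutex m_;
    std::map<uint64_t, std::string> data_;
};
```

Prefetching then falls out naturally: issue `get_async` early and only call `.get()` when the object is actually needed.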

Add better documentation

We should have automatic generation of code documentation.

Doxygen seems like a good tool to do this.

Make -j may not work

This is due to a dependency on a local .a archive that is not declared in the Makefile, so a parallel build can try to link before the archive has been built.

Example, in src/server:

g++ -Wall -Wextra -ansi -fPIC -std=c++1z -pthread  -o bladepoolmain bladepoolmain-bladepoolmain.o -L. -lserver -lrdmacm -libverbs -L../authentication/ -lauthentication -L../utils/ -lutils -L../common/ -lcommon 
/usr/bin/ld: cannot find -lserver
collect2: error: ld returned 1 exit status
make: *** [bladepoolmain] Error 1
make: *** Waiting for unfinished jobs....
/usr/bin/ld: cannot find -lserver
collect2: error: ld returned 1 exit status
make: *** [allocmain] Error 1

Throughput at 128 byte level in benchmarks low

Currently, we are only seeing throughput of about 20 MB/s on 128-byte puts (before the introduction of the new interface). We should be seeing speeds of about 1 GB/s.

Current speeds: (MB/s, messages/s)
128 bytes: 20.7 MB/s, 162072
4K bytes: 556.371 MB/s, 135833
50K bytes: 2445.7 MB/s, 47767.9
1M bytes: 4442 MB/s, 4236.22
10M bytes: 4369.74 MB/s, 416.731

Change license to Apache 2

We should make Cirrus compatible with the Apache 2 license.

This entails removing the copyright messages from the source files.

Behavior for when connecting to already populated store

At the moment, all state about the store is kept locally. If a client connects to a remote store that already contains objects, it has no knowledge of the ObjectIDs in the store, nor of the mem_addr/peer_rkey and other information associated with them.

Should we implement some way for the client to get state from the server, and if so, how?
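One possible direction, sketched with hypothetical names (a SYNC request, `ObjectInfo`, `handle_sync`): on connect, the client asks the server for its current contents and rebuilds its local metadata from the reply.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch of the server side of a SYNC request: reply with
// the (oid, size) pairs already in the store so a freshly connected
// client can rebuild its local metadata (per-object details such as
// addresses/rkeys would be fetched lazily as objects are accessed).
struct ObjectInfo {
    uint64_t oid;
    uint64_t size;
};

std::vector<ObjectInfo> handle_sync(
        const std::map<uint64_t, std::vector<char>>& store) {
    std::vector<ObjectInfo> out;
    out.reserve(store.size());
    for (const auto& kv : store)
        out.push_back({kv.first, kv.second.size()});
    return out;
}
```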
