
# Machi: a distributed, decentralized blob/large file store


## Outline

  1. Why another blob/file store?
  2. Where to learn more about Machi
  3. Development status summary
  4. Contributing to Machi's development

## 1. Why another blob/file store?

Our goal is a robust & reliable, distributed, highly available, large file and blob store. Such stores already exist, both in the open source world and in the commercial world. Why reinvent the wheel? We believe there are three reasons, ordered by decreasing rarity.

  1. We want end-to-end checksums for all file data, from the initial file writer to every file reader, anywhere, all the time.
  2. We need flexibility to trade consistency for availability: e.g. weak consistency in exchange for being available in cases of partial system failure.
  3. We want to manage file replicas in a way that's provably correct and also easy to test.

Criterion #3 is difficult to find in the open source world, but perhaps not impossible.

If we have application use cases where availability is more important than consistency, then systems that meet criterion #2 are also rare. Most file stores provide only strong consistency and therefore are unavoidably unavailable when parts of the system fail. What if we want a file store that is always available for writing new file data and attempts best-effort file reads?

If we really do care about data loss and/or data corruption, then we really want both #3 and #1. Unfortunately, systems that meet criterion #1 are very rare. (Nonexistent?) Why? This is 2015. We have decades of research showing that computer hardware can (and indeed does) corrupt data at nearly every level of the modern client/server application stack. Systems with end-to-end data corruption detection should be ubiquitous today. Alas, they are not.

Machi is an effort to change the deplorable state of the world, one Erlang function at a time.

## 2. Where to learn more about Machi

The two major design documents for Machi are now mostly stable. Please see the doc directory's README for details.

We also have a Frequently Asked Questions (FAQ) list.

Scott recently (November 2015) gave a presentation at the RICON 2015 conference about one of the techniques used by Machi; "Managing Chain Replication Metadata with Humming Consensus" is available online now.

See later in this document for how to run the Humming Consensus demos, including the network partition simulator.

## 3. Development status summary

Mid-March 2016: The Machi development team has been downsized in recent months, and the pace of development has slowed. Here is a summary of the status of Machi's major components.

  • Humming Consensus and the chain manager

    • No new safety bugs have been found by model-checking tests.
    • A new document, Hands-on experiments with Machi and Humming Consensus, is now available. It is a tutorial for setting up a three-virtual-machine Machi cluster and demonstrating the chain manager's reactions to server stops & starts, crashes & restarts, and pauses (simulated by SIGSTOP and SIGCONT).
    • The chain manager can still make suboptimal-but-safe choices for chain transitions when a server hangs/pauses temporarily.
      • Recent chain manager changes have made the instability window much shorter when the slow/paused server resumes execution.
      • Scott believes that a modest change to the chain manager's calculation of a new projection can make flapping less likely in this case (and many others). Currently, the new local projection is calculated using only local state (i.e., the chain manager's internal state + the fitness server's state). However, if the "latest" projection read from the public projection stores were also an input to the new projection calculation function, then many obviously bad projections could be avoided without needing rounds of Humming Consensus to demonstrate that a bad projection is bad. A sketch of this idea appears after this list.
  • FLU/data server process

    • All known correctness bugs have been fixed.
    • Performance has not yet been measured. Performance measurement and enhancements are scheduled to start in the middle of March 2016. (This will include a much-needed update to the basho_bench driver.)
  • Access protocols and client libraries

    • The protocol used both by external clients and internally (in place of Erlang's native message passing mechanisms) is based on Protocol Buffers.
      • [Machi PB protocol specification: ./src/machi.proto](./src/machi.proto)
      • At the moment, the PB specification contains two protocols. Sometime in the near future, the spec will be split to separate the external client API (the "high" protocol) from the internal communication API (the "low" protocol).
  • Recent conference talks about Machi
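
To make the chain manager idea above concrete, here is a minimal sketch in Erlang of the proposed change. Every name in it is hypothetical rather than Machi's actual API; it only illustrates a new-projection calculation that takes the latest public projections as an extra input and refuses to propose an obviously stale epoch.

```erlang
%% Minimal sketch of the idea above; all names are hypothetical, not
%% Machi's actual API. A projection here is just {Epoch, UPI, Repairing}.
-module(chmgr_sketch).
-export([calc_new_projection/3]).

calc_new_projection(LocalState, FitnessState, LatestPublicProjs) ->
    %% Highest epoch already visible in any public projection store.
    LatestEpoch = lists:max([E || {E, _UPI, _Rep} <- LatestPublicProjs]),
    {E0, UPI, Rep} = calc_from_local(LocalState, FitnessState),
    case E0 > LatestEpoch of
        true ->
            {E0, UPI, Rep};    % not obviously stale: propose it
        false ->
            %% Obviously bad: the world has moved on. Jump past the newest
            %% public epoch instead of spending rounds of Humming Consensus
            %% discovering that this projection is bad.
            {LatestEpoch + 1, UPI, Rep}
    end.

%% Stub standing in for today's local-state-only calculation.
calc_from_local({Epoch, UPI, Rep}, _FitnessState) ->
    {Epoch + 1, UPI, Rep}.
```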

## 4. Contributing to Machi's development

Basho Technologies, Inc. has committed to licensing all work for Machi under the Apache Public License version 2. All authors of source code and documentation who agree with these licensing terms are welcome to contribute their ideas in any form: suggested designs or features, documentation, and source code.

Machi is still a very young project within Basho, with a small team of developers; please bear with us as we grow out of "toddler" stage into a more mature open source software project. We invite all contributors to review the CONTRIBUTING.md document for guidelines for working with the Basho development team.

### 4.2 Development environment requirements

All development to date has been done with Erlang/OTP version 17 on OS X. The only known limitations for using R16 are minor type specification differences between R16 and 17, but we strongly suggest continuing development with version 17.

We also assume that you have the standard UNIX/Linux developer tool chain for C and C++ applications. Also, we assume that Git and GNU Make are available. The utility used to compile the Machi source code, rebar, is pre-compiled and included in the repo. For more details, please see the Machi development environment prerequisites doc.

Machi has a dependency on the ELevelDB library. ELevelDB supports only UNIX/Linux OSes and only 64-bit versions of Erlang/OTP; we apologize to Windows-based and 32-bit-based Erlang developers for this restriction.

### 4.3 New protocols and features

If you'd like to work on a protocol such as Thrift, UBF, msgpack over UDP, or some other protocol, let us know by opening an issue to discuss it.

## machi's People

Contributors: cmeiklejohn, jadeallenx, kuenishi, shino, slfritchie


## machi's Issues

### Possible disk space issues

Problems:

  1. Clients die and lose the file names and chunk locations assigned to them at append time, so those chunks are never trimmed.
  2. Clients die, and the unwritten space reserved via chunk_extra at append time never gets written.
  3. A 20 GB file with 19.8 GB of chunks trimmed and 0.2 GB written still consumes 20 GB of disk for 0.2 GB of data until the file itself is trimmed.

Possible resolutions:

  1. Application-side garbage collection that lists all files belonging to a prefix+namespace?
  2. Same as above?
  3. Compaction, or some other way to turn the file into a sparse file? fallocate(2) on Linux has a FALLOC_FL_PUNCH_HOLE option that lets us reclaim trimmed space. Wow, I found an answer right after I wrote the question.

Problems 1 and 2 differ from 3: with FALLOC_FL_PUNCH_HOLE, disk space is no longer the issue, but there is still the overhead of keeping such files around forever without collection (inodes and so on).

A shortcoming of resolution 3 is that it makes Machi Linux-specific. A quick search suggests that Solaris and FreeBSD may also have APIs for punching holes in files.

### {error, written} handling in cr client when writing

Spin-off from #33 .

For the first error, bad_return_value, a quick fix is at 903e939: a reply tag was missing from a handle_call return. I will create another PR with a small eunit test unless it is too difficult.

The superficial reason for bad_return_value is the missing reply tag, as explained by the quick fix [1]. But the root cause seems to be a double write, as with the other two errors found.

bad_return_value occurred in the call from do_append_midtail2 to do_repair_chunk in machi_cr_client. The current code triggers repair on {error, written} [2], but the FLU returns ok when the "correct" chunk is written [3].

The quick fix [1] will hide conflicts in the data and probably return a false ok.

[1] 903e939
[2] https://github.com/basho/machi/blob/master/src/machi_cr_client.erl#L434
[3] https://github.com/basho/machi/blob/master/src/machi_file_proxy.erl#L683-L696
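
For illustration, here is a minimal sketch of the direction hinted at above: distinguish a benign double write (identical bytes) from a real conflict instead of papering over {error, written}. WriteF and ReadF are hypothetical funs standing in for the FLU client calls; this is not Machi's actual API.

```erlang
%% Sketch only (not Machi's actual API): WriteF and ReadF are funs that
%% perform the FLU write and read, so the conflict logic is isolated.
write_with_conflict_check(WriteF, ReadF, File, Offset, Chunk) ->
    case WriteF(File, Offset, Chunk) of
        ok ->
            ok;
        {error, written} ->
            %% Some writer got here first: acceptable only if the bytes on
            %% disk are exactly the bytes we tried to write.
            case ReadF(File, Offset, byte_size(Chunk)) of
                {ok, Chunk}  -> ok;                 % benign double write
                {ok, _Other} -> {error, conflict};  % real conflict: surface it
                Error        -> Error
            end
    end.
```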

### doublewrite_diff error

This is on today's 'master' branch at commit 3a35fe3. The error report is too big to include here; see instead https://gist.github.com/slfritchie/9d8a60cbf6953ad76f2f

Final summary:

Reason:
  exception:
    exit({critical_error,
            {doublewrite_diff,
               {1054, 10,
                <<112, 114, 101, 94, 48, 51, 97, 52, 53, 97, 51, 99, 45, 97, 52,
                  57, 54, 45, 52, 100, 48, 56, 45, 98, 98, 53, 55, 45, 51, 54,
                  102, 57, 56, 97, 51, 102, 48, 100, 49, 99, 94, 49>>},
               {<<121, 61, 156, 3, 58, 16, 120, 36, 36, 202>>,
                <<61, 177, 252, 82, 12, 242, 189, 153, 181, 156>>}}})
      in machi_ap_repair_eqc:append/3 (test/machi_ap_repair_eqc.erl:147)
         eqc_statem:run_parallel_commands/2 (../src/eqc_statem.erl:1445)
         machi_ap_repair_eqc:-prop_repair_par/1-fun-2-/3 (test/machi_ap_repair_eqc.erl:266)
*failed*

### client error when chain UPI=[] (unavailable), also perhaps file descriptor/other leak?

While I was at dinner, running env EQC_TIMEOUT=3600 rebar skip_deps=true -v eunit suites=machi_ap_repair_eqc tests=prop_repair_par_test_ on commit b5005c3 (branch ss-repair-with-partition-simulator) with the https://gist.github.com/slfritchie/12e40859a08d5e4a678a patch applied, I saw this when I returned.

At a minimum, the machi_cr_client should do something less silly when the chain is not available.

Also, there may be a resource leak (e.g. file descriptor) that caused this test to fail after about 52 iterations (estimate, based on test output).

.  Got stable chain: [{a,{1113,[a,b,c],[],[]}},{b,{1113,[a,b,c],[],[]}},{c,{1113,[a,b,c],[],[]}}]
==== Start post operations, stabilize and confirm results
TODO: Using ?WORST_PROJ, chain is not available <0.31552.2>
TODO: Using ?WORST_PROJ, chain is not available <0.31538.2>
  NOT YET stable chain: [{a,{1117,[c],[a,b],[]}},{b,{1117,[c],[a,b],[]}},{c,{1117,[c],[a,b],[]}}]
  NOT YET stable chain: [{a,{1117,[c],[a,b],[]}},{b,{1117,[c],[a,b],[]}},{c,{1117,[c],[a,b],[]}}]
  NOT YET stable chain: [{a,{1117,[c],[a,b],[]}},{b,{1117,[c],[a,b],[]}},{c,{1117,[c],[a,b],[]}}]
MissingFileSummary [{<<"pre^3ef5a24e-3833-4ce3-9b32-e24aaf968795^1">>,
                     {1094,[a,b]}},
                    {<<"pre^69061121-5f50-4610-ac0e-4d9df67e249d^2">>,
                     {1044,[]}},
                    {<<"pre^b6be4085-b1a4-44b7-93a2-7f907acf9c95^1">>,
                     {1084,[b,c]}}]
Make repair directives: ... done
Out-of-sync data for FLU a: 0.1 MBytes
Out-of-sync data for FLU b: 0.1 MBytes
Out-of-sync data for FLU c: 0.1 MBytes
Execute repair directives: .. done
  Got stable chain: [{a,{1119,[c,a,b],[],[]}},{b,{1119,[c,a,b],[],[]}},{c,{1119,[c,a,b],[],[]}}]
  Written=10, DATALOSS=0, Acceptable=5
  Failed=0, Critical=0

.  Got stable chain: [{a,{1113,[a,b,c],[],[]}},{b,{1113,[a,b,c],[],[]}},{c,{1113,[a,b,c],[],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32487.2>
[... ~60 more identical "TODO: Using ?WORST_PROJ" lines for the same six pids elided ...]
==== Start post operations, stabilize and confirm results
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
MissingFileSummary [{<<"pre^58d8ff5c-2371-4e5e-85f8-4fca1f44eedf^1">>,
                     {1034,[]}},
                    {<<"pre^8e67a7c2-570d-41e6-bf48-9405bfd777c2^1">>,
                     {1054,[c]}},
                    {<<"pre^979fad11-8e0e-4674-af5f-58aa3e5be461^2">>,
                     {1034,[a]}},
                    {<<"pre^e7670123-4c9e-41a3-8c20-93518dca6e01^1">>,
                     {1064,[a,b]}}]
Make repair directives: .... done
Out-of-sync data for FLU a: 0.1 MBytes
Out-of-sync data for FLU b: 0.1 MBytes
Out-of-sync data for FLU c: 0.1 MBytes
Execute repair directives: ... done
  Got stable chain: [{a,{1128,[c,b,a],[],[]}},{b,{1128,[c,b,a],[],[]}},{c,{1128,[c,b,a],[],[]}}]
  Written=8, DATALOSS=0, Acceptable=6
  Failed=0, Critical=0

.TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
*failed*
in function eqc:quickcheck/1 (../src/eqc.erl, line 1270)
in call from machi_ap_repair_eqc:'-prop_repair_par_test_/0-fun-1-'/3 (test/machi_ap_repair_eqc.erl, line 91)
**exit:{{badmatch,[]},
 [{machi_cr_client,do_append_head2,7,
                   [{file,"src/machi_cr_client.erl"},{line,330}]},
  {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,607}]},
  {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,639}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]}
  output:<<"Starting Quviq QuickCheck version 1.35.2
   (compiled at {{2015,6,23},{16,47,22}})
Licence for Basho reserved until {{2015,10,31},{2,57,8}}
">>

### filename manager is reusing file names after an epoch change

This bug was originally found by @shino's work on #33 ... that QuickCheck model is working! Shino's QuickCheck model almost always finds this error within a couple of minutes.

Description of an actual counter-example:

  • During epoch 1113, UPI=[a,b,c] ... and there is some data written to this file, and it is written successfully to all a,b,c.
  • Then epoch changes to 1115. B believes that UPI=[b].
  • On the first append to b, the epoch is different, so the filename_mgr takes the line 155 case and creates a new name (correctly, yay!).
  • On the second append to b in epoch 1115, handle_find_file() is ... looking on disk for a file to append to, finds the file written during epoch 1113, and uses it for this second append. This is when Machi's eventual consistency "files are always mergeable" property fails.

In eventual consistency mode, the filename manager must always pick a new & unique file name after an epoch change. In strong consistency mode, reuse after an epoch change is OK because we don't have the same merging requirement.

However, we'd still like to preserve the ability to append to a file after a period of inactivity. The current implementation (which uses filelib:wildcard() and so needs refactoring sometime for better efficiency) appears to do that, AFAICT. So a modification that does the right thing with eventual mode + epoch change seems best? A sketch of that shape follows.
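
For eventual consistency mode, the fix's shape might look like this minimal sketch. The cache, the function names, and the naming scheme are all hypothetical, not the real machi_flu_filename_mgr code.

```erlang
%% Sketch only (hypothetical names and naming scheme): in eventual
%% consistency mode, never hand out a file name minted under an older epoch.
find_file_for_append(Prefix, EpochID, Cache) ->
    case maps:find(Prefix, Cache) of
        {ok, {File, EpochID}} ->
            %% Cached file was minted in this same epoch: still mergeable.
            {File, Cache};
        _OlderEpochOrMissing ->
            %% Epoch changed (or first append): mint a new unique name so
            %% the "files are always mergeable" property is preserved.
            File = new_unique_name(Prefix, EpochID),
            {File, maps:put(Prefix, {File, EpochID}, Cache)}
    end.

new_unique_name(Prefix, EpochID) ->
    Uniq = erlang:unique_integer([positive]),
    iolist_to_binary([Prefix, "^", integer_to_list(Uniq),
                      "^", integer_to_list(EpochID)]).
```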

### Per-append overhead is too high

On a chain with a single FLU, the per-append overhead is far too high for production use. The current write-once enforcement and checksum management tests all pass, but basho_bench performance measurement shows that the serialization in both the process structure and the eleveldb iterator usage is too costly.

For example:

cd /path/to/machi/source/repo/machi
make clean
make stagedevrel
rm -f /tmp/setup
cat <<EOF > /tmp/setup
{host, "localhost", []}.
{flu,f1,"localhost",20401,[]}.
{chain,c1,[f1],[]}.
EOF
./dev/dev1/bin/machi start
sleep 5
./dev/dev1/bin/machi-admin quick-admin-apply /tmp/setup localhost

Then use this basho_bench config, which uses 25 concurrent clients to append 4 KByte chunks to a single prefix (i.e., worst-case serialization).

%% Mandatory: adjust this code path to top of your compiled Machi source distro
{code_paths, ["/path/to/machi/source/repo/machi"]}.
{driver, machi_basho_bench_driver}.

%% Choose your maximum rate (per worker proc, see 'concurrent' below)
%{mode, {rate,10}}.
%{mode, {rate,20}}.
%{mode, {rate,20}}.
{mode, max}.

%% Runtime & reporting interval
{duration, 10}.         % minutes
{report_interval, 1}.   % seconds

%% Choose your number of worker procs
%{concurrent, 1}.
%{concurrent, 5}.
%{concurrent, 10}.
{concurrent, 25}.
%{concurrent, 100}.

%% Here's a chain of (up to) length 3, all on localhost
%% Note: if any servers are down, and your OS/TCP stack has an
%% ICMP response limit such as OS X's "net.inet.icmp.icmplim" setting,
%% then if that setting is very low (e.g., OS X's limit is 50), then
%% you can have big problems with ICMP/RST responses being delayed and
%% interacting *very* badly with your test.
%% For OS X, fix using "sudo sysctl -w net.inet.icmp.icmplim=9999"
{machi_server_info,
 [
  {p_srvr,f1,machi_flu1_client,"localhost",20401,[]}
 ]}.
{machi_ets_key_tab_type, set}.   % set | ordered_set

%% Workload-specific definitions follow....

%% 10 parts 'append' operation + 0 parts anything else = 100% 'append' ops
{operations, [{append, 10}]}.
%{operations, [{read, 10}]}.

%% For append, key = Machi file prefix name
{key_generator, {to_binstr, "prefix~w", {uniform_int, 1}}}.
%{key_generator, {to_binstr, "prefix~w", {uniform_int, 30}}}.
%{key_generator, {to_binstr, "prefix~w", {uniform_int, 200}}}.
%{key_generator, {uniform_int, 3333222111}}.

%% Increase size of value_generator_source_size if value_generator is big!!
{value_generator_source_size, 2111000}.
{value_generator, {fixed_bin, 4096}}.
%{value_generator, {fixed_bin, 32768}}.   %  32 KB
%{value_generator, {fixed_bin, 256000}}.
%{value_generator, {fixed_bin, 1011011}}.

Here is the result running on a TRIM-enabled external Thunderbolt+SSD on my MacBook. Units are seconds (elapsed & window) or microseconds (all other columns).

% head tests/current/append_latencies.csv
elapsed, window, n, min, mean, median, 95th, 99th, 99_9th, max, errors
1.000938, 1.000938, 1164, 16611, 21263.9, 18841, 26650, 93227, 103340, 104752, 0
2.002061, 1.001123, 1249, 16509, 19753.0, 18629, 27024, 32586, 34494, 34910, 0
3.000928, 0.998867, 1165, 16303, 20899.6, 19478, 29138, 32377, 33866, 33902, 0
4.001929, 1.001001, 1256, 16303, 20746.9, 19403, 28808, 31151, 35461, 35519, 0
5.001932, 1.000003, 1218, 16423, 20174.0, 19031, 26810, 34556, 39735, 40749, 0
6.001932, 1.0, 1199, 17108, 20681.5, 19256, 29062, 40275, 43867, 44428, 0
7.001991, 1.000059, 1272, 17720, 20282.8, 19305, 25013, 40179, 43867, 44428, 0
8.001922, 0.999931, 1258, 17520, 19852.3, 19050, 25236, 26580, 27807, 27860, 0
9.001923, 1.000001, 1315, 17520, 19395.7, 18472, 25271, 26381, 28369, 28691, 0

Using 1 MByte chunks, this same load (except for a keygen of 30 file prefixes) is happy to sustain about 340 MByte/sec of throughput. That is encouraging, since it's about the maximum throughput of the Thunderbolt+SSD device combination, but it also avoids the main serialization bottleneck(s) in the workload described in detail above.

### code structure for append needs refactoring

Hrm. With the introduction of eleveldb for max offset & checksum storage, the overall flow for append operations is now messy IMHO and needs refactoring.

For example, given max-file-size rollover, there's an icky race between machi_flu_filename_mgr, machi_flu_append_server, and machi_file_proxy. It's the file proxy proc that decides when it's time to roll over, but by that point machi_flu_append_server (in collusion with the filename server) has already made append filename assignments, though not yet file offset assignments (those are made by the file proxy).

For example, try a stress test using basho_bench, large-ish appends (e.g. 256KB or 1MB), and a small max_file_size value in app.config (e.g. 10MB). It works, but max latencies are absurdly high while a rollover is in progress. Also, the logs show that a file is likely to be closed for rollover twice; the second close is caused by racing straggler operations that are forwarded to a new file proxy proc.

### Chain manager repair for strong consistency remains unfinished

Current chain manager repair is eventual consistency mode only. Strong consistency remains future work.

One component that's missing is a FLU operation to undo partial writes: e.g. chain is [A,B], a partial write to A at {File,Offset} is not successfully replicated to B. Then the chain transitions to [B] and then to [B] ++ repairing=[A]. The bytes at {File,Offset} are unwritten on B and therefore must be un-written on A by repair.
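
A minimal sketch of what that missing operation's contract could look like. ReadHeadF and UnwriteRepF are hypothetical funs standing in for FLU operations; an unwrite operation does not exist in Machi yet.

```erlang
%% Sketch only: the contract of the missing "unwrite" repair step. During
%% repair of chain [B] ++ repairing=[A], any byte range that is unwritten
%% on the authoritative head B must be un-written on A before A rejoins,
%% otherwise A resurrects a failed partial write.
repair_partial_write(ReadHeadF, UnwriteRepF, File, Offset, Size) ->
    case ReadHeadF(File, Offset, Size) of
        {error, not_written} ->
            %% The head (B) never accepted the write, so the repairing
            %% server (A) must have its partial write erased.
            UnwriteRepF(File, Offset, Size);
        {ok, _Bytes} ->
            %% The head has the data; ordinary copy-based repair covers it.
            ok
    end.
```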

### Non-deterministic failure of machi_merkle_tree_test

On master branch at commit e9b1134

% sh -c 'for i in `seq 1 10`; do echo number $i; rebar skip_deps=true eunit suites=machi_merkle_tree_test; done'
number 1
==> machi (eunit)
  Test passed.
number 2
==> machi (eunit)
  Test passed.
number 3
==> machi (eunit)
  Test passed.
number 4
==> machi (eunit)
  Test passed.
number 5
==> machi (eunit)
machi_merkle_tree_test: basic_test (module 'machi_merkle_tree_test')...*failed*
in function machi_merkle_tree_test:'-basic_test/0-fun-1-'/3 (test/machi_merkle_tree_test.erl, line 45)
**error:{assertEqual_failed,[{module,machi_merkle_tree_test},
                     {line,45},
                     {expression,"length ( machi_merkle_tree : naive_diff ( T1 , T2 ) )"},
                     {expected,1},
                     {value,8}]}

### Witness servers and CP mode clarification needed in design document

The role of witness servers in Section 11 of the chain manager document should be clarified.

From my initial reading, it seems that the technique of only accepting writes on the majority side of the partition can be presented completely with only "real" servers. "Witness" servers appear to be an optimization that allows continued operation in CP mode when only a minority of real servers are available, which I believe should be presented separately. Am I missing something obvious?

The reasoning for why witness servers should be placed at the front of the chain is mentioned only briefly and not presented clearly enough. Why is this required?

Finally, in Figure 3, given the reasoning in the document, it appears that the following:

[W_0, W_1, S_0, S_1, S_2] : cluster
[W_1, W_0, S_1] : majority partition
[S_0, S_2] : minority partition

will continue to accept writes on the majority-partition side, given the presence of two witnesses and one real node. However, given that the witnesses store no data, only metadata regarding the current projection, a single failure of S_1 before the partition heals results in data loss.

Shouldn't you require that a majority of the majority side of the partition be real servers for durability in CP mode? (instead of the requirement for only 1, which feels like it should be the invariant for AP mode, not CP mode)

### Fix TravisCI failure 15

This error happened during a TravisCI run.

** Generic server a_filename_mgr terminating 
** Last message in was {find_filename,{0,
                                       <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                                         0,0>>},
                                      "prefix"}
** When Server state == {state,a,a_filename_mgr,"./data.api_smoke_flu2",0}
** Reason for termination == 
** {{badmatch,
        {error,
            {{badmatch,{error,enoent}},
             [{machi_util,increment_max_filenum,2,
                  [{file,"src/machi_util.erl"},{line,168}]},
              {machi_flu_filename_mgr,increment_and_cache_filename,3,
                  [{file,"src/machi_flu_filename_mgr.erl"},{line,237}]},
              {machi_flu_filename_mgr,handle_call,3,
                  [{file,"src/machi_flu_filename_mgr.erl"},{line,150}]},
              {gen_server,try_handle_call,4,
                  [{file,"gen_server.erl"},{line,607}]},
              {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,639}]},
              {proc_lib,init_p_do_apply,3,
                  [{file,"proc_lib.erl"},{line,237}]}]}}},
    [{machi_flu_filename_mgr,increment_and_cache_filename,3,
         [{file,"src/machi_flu_filename_mgr.erl"},{line,237}]},
     {machi_flu_filename_mgr,handle_call,3,
         [{file,"src/machi_flu_filename_mgr.erl"},{line,150}]},

This happens because the filename manager assumes that a sequence-number file exists when it clearly may not. A sketch of a more tolerant fix follows.
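
One tolerant shape for the fix, as a sketch: treat a missing sequence-number file as "start from zero" instead of crashing on enoent. The on-disk layout here is hypothetical, and the real machi_util:increment_max_filenum/2 differs.

```erlang
%% Sketch only (hypothetical file layout): create the per-prefix
%% sequence-number file on first use rather than letting {error, enoent}
%% crash the filename manager.
increment_max_filenum(DataDir, Prefix) ->
    Path = filename:join(DataDir, "max_file_num_" ++ Prefix),
    N = case file:read_file(Path) of
            {ok, Bin}       -> binary_to_integer(Bin);
            {error, enoent} -> 0    % first append ever for this prefix
        end,
    ok = file:write_file(Path, integer_to_binary(N + 1)),
    {ok, N + 1}.
```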

### Remember eofp of all files

Work on #45 revealed that a file proxy's eofp (end-of-file pointer) is forgotten after a restart of the file proxy or of the whole FLU server process, when it should be remembered. This is especially problematic when a client has reserved space via chunk_extra for later writes but is so slow that the file proxy process "sleeps" by stopping. My idea is to store the pointers in eleveldb, which raises the priority of introducing eleveldb.

Below is a quote from Scott's mail:

Background:

In the append operation, the client can request an allocation of extra space, e.g. the client wants to write 500MB eventually but only has 64KB now.  The client can do something like:

1. {ok, File, Offset} = append(Prefix, Chunk64K, Extra=(500MB-64KB))
2. read next 64K chunk
3. write(File, Offset+64K, Chunk64K_2)
4. goto 2 and continue advancing the offset until all 500MB are written

Clarification:

The FLU's sequencer must not forget any extra space allocations requested by a client.  The client can issue those write() ops much later, e.g. a month later.

Thoughts?
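
Assuming the eleveldb idea above, here is a minimal sketch of the persistence side; the key layout and function names are hypothetical.

```erlang
%% Sketch only (hypothetical key layout and names): persist each file's
%% eofp, including space reserved via chunk_extra, so that a restarted
%% file proxy or FLU can recover it from eleveldb.
remember_eofp(DB, File, Eofp) when is_binary(File), is_integer(Eofp) ->
    ok = eleveldb:put(DB, <<"eofp/", File/binary>>,
                      term_to_binary(Eofp), []).

recall_eofp(DB, File) when is_binary(File) ->
    case eleveldb:get(DB, <<"eofp/", File/binary>>, []) of
        {ok, Bin} -> {ok, binary_to_term(Bin)};
        not_found -> {error, not_found}
    end.
```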
