basho / bitcask Goto Github PK

View Code? Open in Web Editor NEW

1.3K 1.3K 168.0 6.15 MB

because you need another a key/value storage engine

Makefile 1.03% C 22.04% Erlang 76.75% Shell 0.18%

bitcask's People

Contributors

Stargazers

Watchers

Forkers

pkaleta sonian argv0 djui nicom slfritchie tmoertel rampage socrateslee seomoz hdima truesef-dev quviq haoting iwamatsu gar1t hfeeki d2fn edwardt hanshenu newsky zhoufeng1989 gejigeji irwenqiang pengliang liule lemenkov chinnurtb novalis lebinhe licenser hustfisher patrikbwi jackliu8722 bdionne liveforeverx kuenishi cloudamqp shjgiser andzi kevindobest tnt-dev paulperegud wuchuguang jj1bdx nofuqs herberteuler tony-x gabrielnicolasavellaneda neo-v libichong piboye think-95 dailypipsgxj ricardobcl knowledgehacker no-problemo project-fifo welling88 johnchancfz matthewkantemir mpresker jsvisa davidalphafox cloudxtreme martinsumner hemel-cse leo-project suhuaguo binarytemple-external jaredmorrow lenary shino jasobrown jrwest bashosync sel-fish desperado1992 jadeallenx the41 matsuoremembered x4lldux tmcgilchrist rnowak-basho-forks legend147 zetaops edwardbetts emnvn martincox thoughtwire merchise lin-lee panii senturk andytill xxjegomezxx16 smileclound xlwh bet365 flstar

bitcask's Issues

Consider moving logging over to lager. [JIRA: RIAK-1918]

Because we're getting filtered/rate-limited in certain situations (see error messages in basho/riak#517), and backpressure might serve us better. Lager is already a dep, so maybe might as well use it? Or perhaps a logging macro, so the user can decide at compile time which system they'd like to use.

Backport file server serialization fix

Sparrow noticed that some customers with very high merge activity are affected by extremely high latencies due to the file server serialization issue fixed for 2.0 in #85 and #118. For example, in ZD ticket 7618. When this occurs, nodes need a restart. For customers for which this is not an acceptable workaround and are schedule sensitiv, we may need to backport this fix. This is a ping to @evanmcc, @jonmeredith and @michellep to chime in on that, as I believe some are discussing another 1.4.9/1.5 release.

File descriptor leak + zero byte file leak on aggressive merge activity

Setup: 1 node Riak cluster, default config for app.config and vm.args, plus: add {max_file_size, 1048576} to the bitcask section, to create artifically-high numbers of bitcask files.

Then drive it with basho_bench and this config:

{mode, max}.

{duration, 6000}.

{concurrent, 100}.

{driver, basho_bench_driver_riakc_pb}.

{key_generator, {int_to_bin, {uniform_int, 10000}}}.

{value_generator, {fixed_bin, 1024}}.

{riakc_pb_ips, [{127,0,0,1}]}.

{riakc_pb_replies, 1}.

{operations, [{put, 1}]}.

%% Use {auto_reconnect, false} to get "old" behavior (prior to April 2013).
%% See deps/riakc/src/riakc_pb_socket.erl for all valid socket options.
{pb_connect_options, [{auto_reconnect, true}]}.

%% Overrides for the PB client's default 60 second timeout, on a
%% per-type-of-operation basis.  All timeout units are specified in
%% milliseconds.  The pb_timeout_general config item provides a
%% default timeout if the read/write/listkeys/mapreduce timeout is not
%% specified.

{pb_timeout_general, 30000}.
{pb_timeout_read, 5000}.
{pb_timeout_write, 5000}.
{pb_timeout_listkeys, 50000}.
%% The general timeout will be used because this specific item is commented:
%% {pb_timeout_mapreduce, 50000}.

My theory is that opening a new cask is racing with merge activity and opening data files that are unused due to re-opening yet another cask?

% lsof -nP -p 37788 | egrep '\.data' | sed 's/.* //' | sort | uniq | wc
     999     999  112477

% lsof -nP -p 37788 | egrep '\.hint' | sed 's/.* //' | sort | uniq | wc
   13082   13082 1451053

% find /tmp/riak-1.4.2/rel/riak/data/bitcask -type f -size 0 -name \*.data -ls | head -3
40232003        0 -rw-rw-r--    1 fritchie         wheel                   0 Oct 23 18:57 /tmp/riak-1.4.2/rel/riak/data/bitcask/0/1015.bitcask.data
40236345        0 -rw-rw-r--    1 fritchie         wheel                   0 Oct 23 19:03 /tmp/riak-1.4.2/rel/riak/data/bitcask/0/1046.bitcask.data
40240520        0 -rw-rw-r--    1 fritchie         wheel                   0 Oct 23 19:10 /tmp/riak-1.4.2/rel/riak/data/bitcask/0/1076.bitcask.data

% find /tmp/riak-1.4.2/rel/riak/data/bitcask -type f -size 0 -name \*.data -ls | wc -l
   24236

% find /tmp/riak-1.4.2/rel/riak/data/bitcask -type f -size 0 -name \*.hint -ls | wc -l
       0

fold_visits_frozen_test(true) occasionally freezes on the builders

fold_visits_frozen_test with RollOver == true seems to freeze sometimes. I looked into this on a colo machine and my laptop but I suspect that maybe they're too fast. It shouldn't take more than ~2 seconds to run. We should get to the bottom of it before the RC.

======================== EUnit ========================
module 'bitcask'
  bitcask: a0_test...[0.100 s] ok
  bitcask: roundtrip_test...
=INFO REPORT==== 16-Apr-2014::19:43:58 ===
Bitcask IO mode is: bitcask_file
[0.505 s] ok
  bitcask: write_lock_perms_test...[0.266 s] ok
  bitcask: list_data_files_test...[0.024 s] ok
  bitcask: fold_test...[1.874 s] ok
  bitcask: iterator_test...[0.244 s] ok
  bitcask: fold_corrupt_file_test...
=ERROR REPORT==== 16-Apr-2014::19:44:02 ===
Trailing data, discarding (10 bytes)

=ERROR REPORT==== 16-Apr-2014::19:44:02 ===
Trailing data, discarding (14 bytes)
[0.550 s] ok
  bitcask:1687: fold_visits_frozen_test_...[2.301 s] ok
  bitcask:1688: fold_visits_frozen_test_...process killed by signal 11
program finished with exit code -1
elapsedTime=29.704782

Incorrectly detected write lock

After crash, bitcask does not always detect stale locks correctly. Here is an example from a crashed Riak node.

07:25:13.738 [error] Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.25793.657> exit with reason bad return value: {error,{write_locked,locked,"/var/lib/riak/bitcask/536645132457440915277915524513010171279912730624"}} in context child_terminated

Folds miss values due to race with deletes

Folds work over a snapshot of the data. When a folder starts, it fetches the current epoch number, which will be used to resolve which version of an object it is allowed to use. This multiple version mechanism was introduced to allow multiple folders to work over the same data. Before this, only one snapshot of the data was available. Monotonically increasing epochs were introduced recently to replace the racy timestamp based mechanism we had before. Each operation generates a new epoch which is saved on each modified entry. So an entry in the keydir may have different versions for different epochs, each pointing to potentially different files.

Merges delete their input files once they complete. The race here is that a fold fetches an epoch, then proceeds to open files in the Bitcask directory. See

bitcask/src/bitcask.erl

Lines 386 to 389 in 4a048be

    
           CurrentEpoch = bitcask_nifs:keydir_get_epoch(State#bc_state.keydir), 
        
           PendingEpoch = pending_epoch(State#bc_state.keydir), 
        
           FoldEpoch = min(CurrentEpoch, PendingEpoch), 
        
           case open_fold_files(State#bc_state.dirname, ?OPEN_FOLD_RETRIES) of

. It will keep retrying until it can successfully open all files listed in the directory. In between grabbing the epoch and opening the files, some of the files needed for the snapshot might have been deleted. Even if new values exist in files created by the merges, they will have higher epochs and not be present in the fold's snapshot. So the fold will miss those entries. This is demonstrated in this branch, where the code has been modified to artificially delay the file opening, allowing a merge to delete some files. The fold misses all values: https://github.com/basho/bitcask/compare/fold-open-delete-race-demo.

This is what I was trying to explain in this comment and this other one.

Disable merges on startup to prevent high disk io with heavy requests

Sometimes, bitcask will merge on node startup. This can cause very high disk IO when paired with lots of requests and can result in request timeouts and very high FSM times. Adding a timer on startup for bitcask merges may reduce this occurrence. Another idea is to disable bitcask merging during full-syncs or high volume of client requests.

Delete found in fold operation (counterexample Cp8)

Here's counterexample Cp8:

Cp8 = [[{set,{var,1},{call,bitcask_pulse,bc_open,[true,{true,{62,59,175},26}]}},{set,{var,2},{call,bitcask_pulse,puts,[{var,1},{2,3},<<188,81,240,255,122,185,227,8,104,196,56,142,250,34,8>>]}},{set,{var,6},{call,bitcask_pulse,bc_close,[{var,1}]}},{set,{var,10},{call,bitcask_pulse,fork,[[{init,{state,undefined,false,false,[]}},{set,{not_var,1},{not_call,bitcask_pulse,bc_open,[false,{true,{282,224,221},100}]}},{set,{not_var,2},{not_call,bitcask_pulse,fold,[{not_var,1}]}},{set,{not_var,3},{not_call,bitcask_pulse,fold,[{not_var,1}]}}]]}},{set,{var,12},{call,bitcask_pulse,bc_open,[true,{true,{270,229,202},55}]}},{set,{var,18},{call,bitcask_pulse,delete,[{var,12},2]}},{set,{var,19},{call,bitcask_pulse,merge,[{var,12}]}},{set,{var,20},{call,bitcask_pulse,fork_merge,[{var,12}]}},{set,{var,23},{call,bitcask_pulse,puts,[{var,12},{3,43},<<0,0,0,0>>]}},{set,{var,25},{call,bitcask_pulse,fold,[{var,12}]}},{set,{var,26},{call,bitcask_pulse,bc_close,[{var,12}]}},{set,{var,29},{call,bitcask_pulse,bc_open,[true,{true,{246,190,50},100}]}},{set,{var,35},{call,bitcask_pulse,delete,[{var,29},5]}},{set,{var,36},{call,bitcask_pulse,fold,[{var,29}]}}],{43245,66780,7696},[{events,[]}]].
[begin Now = {1405,406317,535184}, io:format("Now ~w ", [Now]), true = eqc:check(bitcask_pulse:prop_pulse(), [lists:nth(1,Cp8), Now, lists:nth(3,Cp8)]) end || _ <- lists:seq(1,100)].

That seed of {1405,406317,535184} is nice & deterministic on my Mac. YMMV, substitute now() in its place and run until you find something that fails very consistently.

The last three ops of the main thread are:

  {set,{var,29},
       {call,bitcask_pulse,bc_open,[true,{true,{246,190,50},100}]}},
  {set,{var,35},{call,bitcask_pulse,delete,[{var,29},5]}},
  {set,{var,36},{call,bitcask_pulse,fold,[{var,29}]}}],

The failure is key 5 should not be found by step 36's fold, but it is.

Bad:
[{0,410681,[]},
 {410681,infinity,
  [{bad,<<"pid_1">>,
        {fold,[{2,[not_found]},
               {3,[<<0,0,0,0>>]},
               {4,[<<0,0,0,0>>]},
               {5,[not_found]},
[...]
               {43,[<<0,0,0,0>>]}],
              [{3,<<0,0,0,0>>},
               {4,<<0,0,0,0>>},
               {5,<<0,0,0,0>>},
[...]

More research required. There's a possible race with a forked merge and/or with merge at open time via bitcask:make_merge_file().

{badrecord,mstate} in bitcask:merge_files/1

For example:

2012-10-16 22:18:44.022 UTC [error] <0.8433.0> Failed to merge ["/var/lib/riak/bitcask/1168915860326181142586736208979136516697469485056",[{data_root,"/var/lib/riak/bitcask"},{read_write,true}],["/var/lib/riak/bitcask/1168915860326181142586736208979136516697469485056/3.bitcask.data"]]: {{badrecord,mstate},[{bitcask,merge_files,1,[{file,"src/bitcask.erl"},{line,835}]},{bitcask,merge1,3,[{file,"src/bitcask.erl"},{line,509}]},{bitcask_merge_worker,do_merge,1,[{file,"src/bitcask_merge_worker.erl"},{line,130}]}]}

Riak version 1.2.0, which translates to bitcask tag "1.5.1".

Improper Shutdowns can lead to permanent write locks (until proper restart)

When Riak is improperly shut down or its process is killed, the cleanup processes that release lock files are not triggered. If another OS PID has been created that shares Riak's old process id, and Riak is started again, the subsequent checks in Riak will see the original OS PID that was written to the write lock file is still active, and will not release the lock (even though the process id in question does not refer to Riak anymore).

This can be replicated by:

Starting Riak.
Shutting down Riak improperly by issuing a kill on the Riak process.
Starting a process (that is not Riak) that uses the same Pid as the killed Riak process.
Starting Riak.
Attempting a PUT.

Could the operation that checks for the OS PID's existence also confirm that it is in fact a beam.smp process, to lessen the likelihood of this stuck lock file?

For additional context, see the following:
https://basho.zendesk.com/agent/#/tickets/3873
https://basho.zendesk.com/agent/#/tickets/5336

CRC check should be performed before unpacking data

Currently Bitcask performs a CRC check after unpacking data. This may lead to a badmatch error when unpacking the data if the data is corrupt. Checking the CRC before unpacking the data would avoid this badmatch error.

Relevant section of code:
https://github.com/basho/bitcask/blob/d8958d98d6619a1a8b9e71a7a7b19a7d9fb38ef0/src/bitcask_fileops.erl#L171-181

Cluster wide Bitcask merge synchronization

Large amounts of concurrent Bitcask merging activity across a cluster can have a significant impact on latencies and performance. One of the recommended ways to get around this is to use staggered, non-overlapping merge windows, resulting in at most one node performing merging at any point in time. A potential issue with this approach is however that it means merging on each node occurs less frequently, leading to a significant increase in disk usage between windows if a lot of data is inserted/updated.

Given that the aim of this approach is to ensure concurrent merging is kept to a minimum, would it make sense to introduce a global Bitcask merge coordinator that automatically serialises merges across the cluster?

Each individual node could still observe any configured merge windows, but the coordinator would ensure that merges are not performed concurrently. As concurrent merges are undesirable but not necessarily causes problems, a best effort approach might be sufficient.

Key transformations can break

This problem was observed as merges crashing in Riak when using the Bitcask backend. The new code may pass a {tombstone, Key} argument to a key transformation callback, which normally just take a binary key. Key transformations are used in the riak_kv_bitcask_backend to support the traditional and new key encoding format introduced in Bitcask 1.7.0/Riak 2.0.

All paths that call a key transformer must make sure that it receives only the binary key to transform.

faulterl is private, and shouldn't be included in rebar.config by default until that changes

I've pushed 25cf7b6 to comment it out to resolve the problem for now, but we should likely make a decision at some point as to what we want to do wrt that repo and its inclusion in our tests.

Race makes opening Bitcask dir impossible

I have reproduced a problem where Bitcask gets stuck, unable to re-open a cask in a certain directory.

On open, first a keydir object is created, but not marked as ready. Then the files are scanned to populate it, at which point it is marked ready here bitcask.erl#L1248 and things are good. However, if the scan errors out, we hit this branch instead in bitcask.erk#1244, which does not mark the keydir as ready, but leaves it behind in that state. Calling open again on the same directory finds this existing keydir, but detects it is not ready, so tries to wait for it to load in bitcask.erl#L1252, eventually timing out.

When the error happens, the newly created keydir should probably be released.

Now, the fact that the error happens on scan might lead to a different bug. What has been observed is that the function to list files in a directory, bitcask_fileops:list_dir/1 returns {error, einval}, which is not handled in bitcask_fileops:data_file_tstamps/1, causing the error that leads to the stuck keydir. Notice how this function is trying to avoid a call to the file server by calling the efile port directly, which might be part of the reason. I'm currently investigating the exact sequence of events that leads to this.

0-length file can indefintitely hold up puts.

This issue affects 2.0 and the 1.4 branch post 1.4.4.

The fix to avoid file server usage (#118) changed the way that the largest file ID is accounted for. Instead of listing the entire directory (potentially very expensive, directly in the put path), it keeps track of the file id in the keydir. Unfortunately, there is a corner case.

On the initial population of the keydir, each file is folded over in turn. If it has any keys at all, the keydir's largest id counter will be increased. However, if it has no keys, it will not be, because no keydir_put will happen. This can cause a problem when a keyless hintfile/datafile pair has the largest file id in the directory on startup, e.g.:

-rw-rw-r--  1 riak riak 1453263 Feb  3 14:11 186.bitcask.data
-rw-rw-r--  1 riak riak   60798 Feb  3 14:11 186.bitcask.hint
-rw-rw-r--  1 riak riak 1613200 Feb  3 14:10 187.bitcask.data
-rw-rw-r--  1 riak riak  124681 Feb  3 14:10 187.bitcask.hint
-rw-rw-r--  1 riak riak 1469163 Feb  3 22:24 188.bitcask.data
-rw-rw-r--  1 riak riak   59843 Feb  3 22:54 188.bitcask.hint
-rw-------  1 riak riak       0 Jan 30 19:57 189.bitcask.data
-rw-------  1 riak riak      18 Jan 30 19:57 189.bitcask.hint

Since in this case the biggest_file_id will be 188, the code here:
https://github.com/basho/bitcask/blob/develop/src/bitcask_fileops.erl#L73-L88
will fail, causing the vnode to crash and be restarted, leaving us exactly where we started.

A workaround is to manually remove the empty hint and data file pair, a fix should be forthcoming shortly.

Test downgrade

The 1.7.0 version now has different file formats:

Hint files now use a bit in the size field to signal a tombstone. To old code, it would look like a massive file offset and therefore a corrupted hint file. Users would likely have to delete hint files in case of a downgrade.
Data files now contain 2 new tombstone formats. To Riak, they would look like invalid objects, which would hopefully always end up as a not found.

We need to verify that there is a path to downgrade, however annoying.

/cc @jonmeredith

Bitcask cannot recover from invalid or empty lock files

Repro steps: Start a single node with bitcask, fill with enough data to have a few data files, stop node, create empty lock file, start node.

Errors:

09:38:29.529 [error] Failed to read lock data from ./data/bitcask/0/bitcask.create.lock: {invalid_data,<<>>}
09:38:29.529 [error] gen_fsm <0.692.0> in state active terminated with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106
09:38:29.529 [error] CRASH REPORT Process <0.692.0> with 1 neighbours exited with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106 in gen_fsm:terminate/7 line 611
09:38:29.530 [error] Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.692.0> exit with reason lock_failure in bitcask_fileops:get_create_lock/2 line 106 in context child_terminated
09:38:29.530 [error] gen_fsm <0.800.0> in state ready terminated with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106
09:38:29.530 [error] CRASH REPORT Process <0.800.0> with 10 neighbours exited with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106 in gen_fsm:terminate/7 line 611
09:38:29.531 [error] Supervisor {<0.801.0>,poolboy_sup} had child riak_core_vnode_worker started with riak_core_vnode_worker:start_link([{worker_module,riak_core_vnode_worker},{worker_args,[0,[],worker_props,<0.798.0>]},{worker_callback_mod,...},...]) at undefined exit with reason lock_failure in bitcask_fileops:get_create_lock/2 line 106 in context shutdown_error
09:38:29.531 [error] gen_server <0.801.0> terminated with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106
09:38:29.531 [error] CRASH REPORT Process <0.801.0> with 0 neighbours exited with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106 in gen_server:terminate/6 line 747

The node sits in a sad loop, never recovering.

Remove expired on get race

While documenting the read code path, I came across a race when reads find an expired value. The code then unconditionally removes that entry. BUT, this is not an atomic operation. A new value could have been written between the time we found the expired entry and the time we delete it. This needs to be changed to a conditional delete that will only remove the entry if it's exactly the same one we found, or try again if it has changed since the first get.

This is the line in bitcask:get/3 that does it: bitcask.erl#L238

Adds "grace timeout" to expiry.

bitcask:init_keydir(): Avoid unnecessary wait by bitcask:poll_deferred_delete_queue_empty/0

It is possible for Bitcask instance A's open(), which calls init_keydir(), to block for long periods of time while Bitcask instance B performs a long-running merge. In turn, such a long-blocking open() can make the riak_core_vnode_manager process very unhappy, since it is the process that starts new Riak vnode instances which then might open a Bitcask instance.

Below is a stack trace from a blocked Bitcask instance (instance A in the description above). The same cluster-info report shows that the bitcask_merge_delete server has a queue of hundreds of files from instance B to delete. The report shows that there's a 3rd instance which is doing a long-running fold (repair handoff) that is preventing the bitcask_merge_delete server from deleting B's old merge files.

https://github.com/basho/bitcask/blob/1.6.3-release/src/bitcask_merge_delete.erl#L109

One possible solution is to split the monolithic bitcask_merge_delete server into per-bitcask-instance servers?

=proc:&lt;0.9668.5878>
State: Waiting
Spawned as: proc_lib:init_p/5
Spawned by: &lt;0.247.0>
Started: Thu Mar 20 11:25:11 2014
Message queue length: 1
Message queue: [{'$gen_sync_event',{&lt;0.9667.5878>,#Ref&lt;0.0.4474.115775>},wait_for_init}]
Number of heap fragments: 0
Heap fragment data: 0
Link list: []
Dictionary: [{'$initial_call',{riak_core_vnode,init,1}},{'$ancestors',[riak_core_vnode_sup,riak_core_sup,&lt;0.244.0>]},{random_seed,{26833,7398,23484}}]
Reductions: 128073
Stack+heap: 987
OldHeap: 4181
Heap unused: 683
OldHeap unused: 3540
Stack dump:
Program counter: 0x00007f8539fccc28 (bitcask:poll_deferred_delete_queue_empty/0 + 88)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007f83b256aeb8 Return addr 0x00007f8539fc9570 (bitcask:init_keydir/3 + 632)

0x00007f83b256aec0 Return addr 0x00007f8539fc3c78 (bitcask:open/2 + 568)
y(0)     &lt;&lt;0 bytes>>
y(1)     Catch 0x00007f8539fc9660 (bitcask:init_keydir/3 + 872)
y(2)     true
y(3)     &lt;&lt;0 bytes>>
y(4)     "/var/lib/riak/bitcask/702205864811332261480676696969151607100311339008"

0x00007f83b256aef0 Return addr 0x00007f853baef280 (riak_kv_bitcask_backend:start/2 + 1336)
y(0)     []
y(1)     []
y(2)     2147483648
y(3)     fresh
y(4)     [{data_root,"/var/lib/riak/bitcask"},{read_write,true}]
y(5)     "/var/lib/riak/bitcask/702205864811332261480676696969151607100311339008"

0x00007f83b256af28 Return addr 0x00007f853b1d4160 (riak_cs_kv_multi_backend:start_backend/4 + 152)
y(0)     []
y(1)     []
y(2)     [{data_root,"/var/lib/riak/bitcask"},{read_write,true}]
y(3)     "/var/lib/riak/bitcask"
y(4)     "702205864811332261480676696969151607100311339008"
y(5)     702205864811332261480676696969151607100311339008

0x00007f83b256af60 Return addr 0x00007f853b1d9208 (riak_cs_kv_multi_backend:'-start_backend_fun/1-fun-0-'/3 + 256)
y(0)     be_blocks
y(1)     riak_kv_bitcask_backend
y(2)     Catch 0x00007f853b1d4258 (riak_cs_kv_multi_backend:start_backend/4 + 400)

0x00007f83b256af80 Return addr 0x00007f855c674960 (lists:foldl/3 + 120)
y(0)     riak_kv_bitcask_backend
y(1)     []
y(2)     [{be_default,riak_kv_eleveldb_backend,{state,&lt;&lt;0 bytes>>,"/var/lib/riak/leveldb/702205864811332261480676696969151607100311339008",[{create_if_missing,true},{max_open_files,50},{use_bloomfilter,true},{write_buffer_size,59927635}],[{create_if_missing,true},{data_root,"/var/lib/riak/leveldb"},{included_applications,[]},{max_open_files,50},{use_bloomfilter,true},{write_buffer_size,59927635}],[],[],[{fill_cache,false}],true,false}}]
y(3)     Catch 0x00007f853b1d9320 (riak_cs_kv_multi_backend:'-start_backend_fun/1-fun-0-'/3 + 536)

0x00007f83b256afa8 Return addr 0x00007f853b1d3e90 (riak_cs_kv_multi_backend:start/2 + 808)
y(0)     #Fun&lt;riak_cs_kv_multi_backend.1.32308858>
y(1)     []

0x00007f83b256afc0 Return addr 0x00007f853ae86778 (riak_kv_vnode:init/1 + 872)
y(0)     be_default
y(1)     [{&lt;&lt;3 bytes>>,be_blocks}]
y(2)     []
y(3)     []

0x00007f83b256afe8 Return addr 0x00007f853ae6d220 (riak_core_vnode:do_init/1 + 248)
y(0)     Catch 0x00007f853ae86778 (riak_kv_vnode:init/1 + 872)
y(1)     3000
y(2)     &lt;&lt;8 bytes>>
y(3)     10
y(4)     100
y(5)     100
y(6)     1000
y(7)     true
y(8)     riak_cs_kv_multi_backend
y(9)     702205864811332261480676696969151607100311339008

0x00007f83b256b040 Return addr 0x00007f853ae6ce28 (riak_core_vnode:started/2 + 200)
y(0)     []
y(1)     []
y(2)     []
y(3)     []
y(4)     []
y(5)     []
y(6)     []
y(7)     []
y(8)     []
y(9)     []
y(10)    undefined
y(11)    702205864811332261480676696969151607100311339008
y(12)    riak_kv_vnode
y(13)    {state,702205864811332261480676696969151607100311339008,riak_kv_vnode,undefined,undefined,none,undefined,undefined,undefined,undefined,undefined,0}

0x00007f83b256b0b8 Return addr 0x00007f853ae64e98 (gen_fsm:handle_msg/7 + 224)
y(0)     0

0x00007f83b256b0c8 Return addr 0x00007f855c6675b8 (proc_lib:init_p_do_apply/3 + 56)
y(0)     undefined
y(1)     Catch 0x00007f853ae64e98 (gen_fsm:handle_msg/7 + 224)
y(2)     riak_core_vnode
y(3)     {state,702205864811332261480676696969151607100311339008,riak_kv_vnode,undefined,undefined,none,undefined,undefined,undefined,undefined,undefined,0}
y(4)     started
y(5)     &lt;0.9668.5878>
y(6)     &lt;0.247.0>
y(7)     {'$gen_event',timeout}

0x00007f83b256b110 Return addr 0x0000000000847c38 (&lt;terminate process normally>)
y(0)     Catch 0x00007f855c6675d8 (proc_lib:init_p_do_apply/3 + 88)

R13B04 Compatibility

In the Readme it says "Bitcask requires Erlang R13B04 or later." but it won't compile with R13B04 because it references ErlNifPid.

c_src/bitcask_nifs.c:105: error: expected specifier-qualifier-list before ‘ErlNifPid’

Bitcask should check for missing hintfiles on close and write them out.

If your hintfiles are deleted, and you shut down cleanly the node before any merges happen, your hintfiles will not be recreated. This is undesirable on very large nodes, so since we already have the hintfile information, we should make sure that it is up to date on bitcask:close().

Bitcask may get stuck due to caching port

There is code in Bitcask to list files in a directory that tries to avoid serializing on the file server by caching an efile port in the process dictionary, which it uses to list directories when needed. The problem is that the port may go away. The process that opened it would get a notification if it happened, but nothing would tie that back to Bitcask to release the cached port. After that point, calls to list directory contents to, say, open a bitcask, would fail forever.

The caching function is bitcask_fileops: get_efile_port/0, used by bitcask_fileops:list_dir/1.

I believe the best thing to do at this point is to use a regular directory listing operation here. The initial file server serialization problem was caused by merges piling up on the Riak side when things got slow. The Bitcask backend code will now avoid issuing merge requests until the last one has finished, which should prevent that from happening again.

dialyzer analysis found a few interesting things to address

10:58:31:bitcask_pull(master_+|MERGING) $ dialyzer --plt .bitcask.plt -pa deps//ebin -pa ebin --src src/*.erl
Checking whether the PLT .bitcask.plt is up-to-date... yes
Proceeding with analysis...
bitcask.erl:474: The pattern [U | ] can never match the type []
bitcask.erl:574: The pattern 'true' can never match the type 'false'
bitcask.erl:590: Function will never be called
bitcask.erl:591: Function will never be called
bitcask.erl:593: Function will never be called
bitcask.erl:607: Function will never be called
bitcask.erl:614: Function frag_threshold/1 will never be called
bitcask.erl:617: Guard test '>='(any(),FragThreshold::none()) can never succeed
bitcask.erl:624: Function dead_bytes_threshold/1 will never be called
bitcask.erl:627: Guard test '>='(any(),DeadBytesThreshold::none()) can never succeed
bitcask.erl:634: Function small_file_threshold/1 will never be called
bitcask.erl:642: Guard test '<'(any(),Threshold::none()) can never succeed
bitcask.erl:652: Function expired_threshold/1 will never be called
bitcask.erl:654: Guard test '<'(any(),Cutoff::none()) can never succeed
bitcask.erl:668: The pattern [F | ] can never match the type []
bitcask.erl:698: The call bitcask:summarize(any(),S::{non_neg_integer(),pos_integer(),pos_integer(),pos_integer(),pos_integer()}) will never return since it differs in the 2nd argument from the success typing arguments: (string(),{integer(),number(),number(),number(),number(),})
bitcask.erl:702: The pattern [S | ] can never match the type []
bitcask.erl:711: Function summarize/2 has no local return
bitcask.erl:711: The pattern <Dirname, {FileId, LiveCount, TotalCount, LiveBytes, TotalBytes, OldestTstamp}> can never match the type <,{non_neg_integer(),pos_integer(),pos_integer(),pos_integer(),pos_integer()}>
bitcask.erl:781: The call bitcask_fileops:fold_keys(File::#filestate{mode::'read_only',filename::string() | #filestate{},tstamp::integer(),hintcrc::0,ofs::0},F::fun((,,,) -> 'already_exists' | 'ok'),'undefined','recovery') breaks the contract ('fresh' | #filestate{},fun((binary(),integer(),{integer(),integer()},any()) -> any()),any(),'datafile' | 'hintfile' | 'default') -> any()
Unknown functions:
erlang:max/2
done in 0m1.53s
done (warnings were emitted)

Bitcask keydir remembers deleted files

Symptom: Riak's 100%ile put latencies grow worse over time, rising to 200+ milliseconds while the 99%ile remains under 10 milliseconds.

This issue may or may not be intertwined with this ticket: #113 ... after capturing lsof -nP -p {riak pid} output, I then check to see the number of open file handles that refer to deleted files:

% grep '(deleted)' ~/slf/lsof.out1 | wc -l
191

... which is a much smaller number than issue #113 shows. And Bitcask does have a mechanism to close them. (The closing time can be decoupled from the deleting time.)

However, it's pretty clear from circumstantial evidence that the keydir is "remembering" cask files that no longer exist. Tracing shows that the status() call taking over 150 milliseconds each. Watching the time spent by the file_server_2 process during these calls, 90+ milliseconds is spent handling read_file_info calls. If I use https://gist.github.com/slfritchie/159a8ce1f49fc03c77c6 to trace the arguments to the file_server:handle_call/3 for 5 seconds:

func_args_tracer:start(file_server, handle_call, 3, 5, fun([X, _, _]) -> catch {element(1,X), element(2, X) end).

... or however long is necessary to capture a burst of file_server_2 activity. Then scape the output to a file and then grep for read_file_info and a specific vnode number (just to avoid counting operations for different vnodes), I see this:

% grep read_file_info ~/foo2 | grep  274031556999544297163190906134303066185487351808 | sort -u | wc -l
2400

So, it's 2400 unique files. But ls in that vnode's data dir shows less than 100 files, both data & hint files combined. Drat.

Move to CFLAG=-pedantic-errors

since it's the default on the erlang side, we should be doing whatever static analysis we can on the C side. We can also consider going to C99 and look at places where that might make the code cleaner.

I am assigning this to 2.1 because there's some erlang/rebar magic going on that makes "erl_nifs.h" not findable at check time. If there's a simple solution to that (I haven't even googled) then we might consider making this 2.0-final or 2.0.1, although if we do that, we should make going to C99 another issue, since that's too much change right now.

To do this, all that we have to do is add {"DRV_CFLAGS", "-pedantic-errors"}, or {"DRV_CFLAGS", "-pedantic-errors" -std=c99}, to rebar.config, and do the required cleanup.

Feature Request: Expiration per Bucket/Object (Not per backend)

Customer needs support for more than a handful of different expiration settings. They would like support for dozens to < 100 of different expiration settings without configuring backends for each of them, hence the support by bucket/object.

Some brainstorming brought up the possibility of attaching an expiry date timestamp next to the creation timestamp in the bitcask entry, or attaching it as metadata. Some further brainstorming brought up the possibility of bumping this metadata up a level to riak_kv, so all backends could support this.

From ZD:4865

At any rate, we do currently map buckets to specific multi-backends that have different expiration settings, but we have product requests to do more expiration options than the handful, 7 or 60 days that we currently offer. We don't want to get into a cycle of adding new backend configurations to app.config.

...Breaking those datasets further by a variety of retention policies would cause a multiplication of backends that Riak is not currently built to handle. (dozens, if not more than a hundred)

Add is_empty_estimate/1 to PULSE model [JIRA: RIAK-2411]

@jtuple's recent addition of the is_empty_estimate/1 function should be added to the QuickCheck + PULSE model. It's currently not exercised by the model.

QuickCheck failure being masked by try/catch/after

See this gist for example failure of the prop_expiry_test_ test:

https://gist.github.com/2f904b7a961c23a62982

A quick code read suggests that the try/catch/after in test/bitcask_qc_expiry.erl around lines 112-140 is masking the exception that ?assertEqual() is attempting to throw, probably introduced by commit ae7810b?

improve the behavior of needs_merge

A few things here:

We shouldn't even call needs merge outside of merge windows, as it's pointless, repeated work and blocks the vnode, raising tail latencies.
We should add some randomness to the scheduling of needs merge to make lockstep less likely.
Reconsider the 3 minute needs_merge interval. We should at least make it tunable.

Bitcask should not call application:start(bitcask) when opening a cask

Bitcask calls application:start(bitcask) every time bitcask:open/1,2 is called:
https://github.com/basho/bitcask/blob/master/src/bitcask.erl#L92
https://github.com/basho/bitcask/blob/master/src/bitcask.erl#L714

The responsibility for loading/starting the bitcask application should lie with the application which is using bitcask.

The current implementation can cause problems for applications, such as Riak, that open casks without blocking the main application's start up. For example, in Riak, the call to application:start(bitcask) within a vnode can deadlock with init:stop(). When processing init:stop() the application controller will wait for vnodes to shutdown. If the vnodes are still starting when init:stop() is called they will block on the call to application:start(bitcask). The deadlock eventually times out, after 5 minutes, and the application controller can complete the shutdown of Riak.

Steps to reproduce:

Run riak start; riak stop
Repeat step 1 until riak stop hangs for about 5 minutes

split or seperate mutlibackend from .schema

Same as basho/eleveldb#80.

Include DTrace scripts in source repo for tracing latency bugs

Scott Fritchie, 2:41 PM, I should open an issue to have you add that script(s) to the eleveldb repo.
Jon Meredith, 2:41 PM, that's a good idea

Ability to change bitcask key expiry without a node restart [JIRA: RIAK-2413]

customer "must have" request received 24 July 13

Data which fails CRC check should be deleted and not_found returned.

There isn't anything we can do when a value is corrupted; toss it, bitcask tombstone it, delete it from the keydir and return not_found.

Adds "grace timeout" to expiry.

Corrects for a situation in which there's always a key left after we have a merge start.

Robustify bitcask lock file creation

The current method of creating lock files is mostly good enough, especially with the recent testing by QuickCheck. However, it's probably a good idea to use the rename method used by riak_core_util:replace_file/2. h/t @gburd for reminding me.

require_hint_crc value is ignored

https://github.com/basho/bitcask/blob/1.6/src/bitcask_fileops.erl#L477 should return Acc, not <<>>. The empty binary looks sensible but it actually breaks the fold. This essentially forces require_hint_crc to true even though it defaults to false.

To fix this we need to change the line to return Acc, change the bulk crc check to not happen if require_hint_crc is false, and also to maintain safety, we should default require_hint_crc to true.

Build Breaks With Current Rebar Version

Due to so_name being deprecated and now removed from rebar in favor of port_specs, the shared library is not created when building bitcask with a current version of rebar. In this case, the files under the c_src-directory are getting compiled but not linked.

In fact, this is a problem for everybody who wants to depend on bitcask but also use a current rebar version.

Note for reference: issue #49 (pull request "update rebar to 2.0.0") contains a commit ("Use port_specs" 292819e) that fixes this issue.

Investigate possible startup race between open and merge.

Seen at a customer over the weekend, it looks like if a needs_merge call hits a vnode just as it's starting up, it can cause the main vnode bitcask:open to fail, taking the vnode and potentially the system down. Not sure if this needs to be solved in bitcask, kv, or core, or possibly all three. The simplest potential solution I can think of is to make a 'maybe' version of init_keydir that gets used by needs_merge, that won't trigger a keydir creation, but solutions should wait for fuller understanding.

cc @bsparrow435 @jcapricebasho

Corrupted lock files should be deleted

At a customer site we seem to have a situation where a number of creation lock files end up created empty, and the retry mechanism gets stuck in a loop because corrupted lock files are not considered stale, which would remove them and allow the retry.

That decision is taken here: https://github.com/basho/bitcask/blob/1.6.6/src/bitcask_lockops.erl#L147

The error log would have messages like these:

2014-05-05 00:38:22.537 [error] <0.6561.379> Failed to read lock data from /data/riak/riak-data/riak/bitcask/696496874040508421956443553091353626554780352512/bitcask.create.lock: {invalid_data,<<>>}

I think a corrupted lock file should be treated the same as a stale file to avoid getting permanently stuck.

Another way for fstats to leak

thanks to @joecaswell for finding this.

This leak is slightly hard to trigger, but certain pathological behavior patterns might trigger it.

Merge creates files, and thus creates fstats entries. These are not synchronized with the write thread in any way, but once a value within them is read, they're added to the write thread's read_files state, from where it is trimmed. So, to reproduce this, you need to have a file that is created by merge, never read from, and then merged away again.

The scenario where we saw this in the wild seemed to be:

a few very hot keys in a partition
a remainder of other values in file smaller than the small files threshold.
since we have hot keys, the main writing file gets read as extremely fragmented, so when needs_merge comes around, both files are selected for the partial merge.
a new merge file is created, containing the data from the small file, thus creating another small file.
the cycle repeats, as nothing is ever added to the small file, other than the occasional value, since the hot keys represent most writes to the partition.

A truncated hintfile can cause expiration of a file to repeatedly fail

A truncated hintfile causes the hintfile fold in the expiration logic to fail. It does not seem to be defaulting back to a data file fold like it is supposed to. Example logs:

2014-06-05 03:20:31.217 [error] <0.27283.8119> Hintfile 'DATA_ROOT/10836.bitcask.hint' contains pointer 18446744073709551615 3206534561 that is greater than total data size 15032468 
2014-06-05 03:20:31.217 [error] <0.27283.8119> Error folding keys for "DATA_ROOT/10836.bitcask.data": {trunc_hintfile,ok}

The only mitigation is to remove the hintfile, then the default fold is correctly utilized.

monitoring of / autorecovery from Bitcask file corruption errors

We occasionally encounter issues with nodes crashing due to Bitcask file corruption issues. Although Erlang recovery scripts are available, the process of detecting corruption in the logs and applying the fix requires manual intervention. A better method for identifying and recovering from these errors would be helpful.

Bitcask cannot compile, stdbool.h not available on Solaris10

The following error happens on compile on Solaris 10u9.

Compiling /export/home/buildbot/masters/riak/riak-build-solaris-10u9-64/build/distdir/riak-2.0.0pre2/deps/bitcask/c_src/bitcask_nifs.c
In file included from /export/home/buildbot/masters/riak/riak-build-solaris-10u9-64/build/distdir/riak-2.0.0pre2/deps/bitcask/c_src/bitcask_nifs.c:39:
/usr/include/stdbool.h:42:2: #error "Use of <stdbool.h> is valid only in a c99 compilation environment."

Deleted keys can be resurrected after restart

when the following conditions are met:

key is written to bitcask data file 1
data file 1 is cutoff and a new write file started
key is deleted, writing tombstone to data file 2 and removing key from keydir
data file 2 is cutoff and a new write file started
Bitcask merges, data file 2 meets threshold, data file 1 does not - removes tombstone
Bitcask is restarted and reopens the directory - previously deleted value is found while scanning datafiles to build keydir, no tombstone is found to cause its removal

Possible Race Condition in bitcask:merge

I don't understand the merging code very well. But I'm working under the assumption that if a merge is performed on a file that is currently being written to then data can be lost and this is why the merge function tries to exclude the file currently being written to.

when you don't pass in an array of files merge calls readable_files:

-spec merge(Dirname::string()) -> ok.
merge(Dirname) ->
    merge(Dirname, [], readable_files(Dirname)).

readable_files gets the current file being written to so it can ignore that files in list_data_files.

readable_files(Dirname) ->
    %% Check the write and/or merge locks to see what files are currently
    %% being written to. Generate our list excepting those.
    WritingFile = bitcask_lockops:read_activefile(write, Dirname),
    MergingFile = bitcask_lockops:read_activefile(merge, Dirname),
    list_data_files(Dirname, WritingFile, MergingFile).

list_data_files(Dirname, WritingFile, Mergingfile) ->
    %% Get list of {tstamp, filename} for all files in the directory then
    %% reverse sort that list and extract the fully-qualified filename.
    Files = bitcask_fileops:data_file_tstamps(Dirname),
    [F || {_Tstamp, F} <- reverse_sort(Files),
          F /= WritingFile,
          F /= Mergingfile].

however, i think there is a small race condition because the following can happen:

current write file is F1
merge function reads current write file as F1
current write file is switched from F1 to F2
list_data_files finds F1 and F2 and excludes F1
merge starts working on F2 which is currently being written to
data loss?

Avoid file_server serialized functions

For maximum parallelism, Bitcask should be avoiding use of file I/O-related functions that end up calling the local file_server process.

For example, bitcask:needs_merge() calls filelib:is_file(), which is handled by and serialized by a single Erlang process: registered name file_server_2, code module file_server.

IIRC, the Hibari code uses some prim_file hackery to avoid sending file I/O to the local file_server_2 proc.

Bitcask merge visibility [JIRA: RIAK-2412]

It would be nice to have better visibility into bitcask merge activity.

Log events when merge starts?
A public ETS table (very low update frequency, to avoid worries about performance impact) that tracks which bitcask instances are merging (and perhaps also list of input files)?

	CurrentEpoch = bitcask_nifs:keydir_get_epoch(State#bc_state.keydir),
	PendingEpoch = pending_epoch(State#bc_state.keydir),
	FoldEpoch = min(CurrentEpoch, PendingEpoch),
	case open_fold_files(State#bc_state.dirname, ?OPEN_FOLD_RETRIES) of