basho / bitcask
because you need another key/value storage engine
Because we're getting filtered/rate-limited in certain situations (see error messages in basho/riak#517), and backpressure might serve us better. Lager is already a dep, so we might as well use it? Or perhaps a logging macro, so the user can decide at compile time which system they'd like to use.
Sparrow noticed that some customers with very high merge activity are affected by extremely high latencies due to the file server serialization issue fixed for 2.0 in #85 and #118. For example, in ZD ticket 7618. When this occurs, nodes need a restart. For customers for whom this is not an acceptable workaround and who are schedule-sensitive, we may need to backport this fix. This is a ping to @evanmcc, @jonmeredith and @michellep to chime in on that, as I believe some are discussing another 1.4.9/1.5 release.
Setup: 1 node Riak cluster, default config for app.config and vm.args, plus add {max_file_size, 1048576} to the bitcask section, to create artificially high numbers of bitcask files.
Then drive it with basho_bench and this config:
{mode, max}.
{duration, 6000}.
{concurrent, 100}.
{driver, basho_bench_driver_riakc_pb}.
{key_generator, {int_to_bin, {uniform_int, 10000}}}.
{value_generator, {fixed_bin, 1024}}.
{riakc_pb_ips, [{127,0,0,1}]}.
{riakc_pb_replies, 1}.
{operations, [{put, 1}]}.
%% Use {auto_reconnect, false} to get "old" behavior (prior to April 2013).
%% See deps/riakc/src/riakc_pb_socket.erl for all valid socket options.
{pb_connect_options, [{auto_reconnect, true}]}.
%% Overrides for the PB client's default 60 second timeout, on a
%% per-type-of-operation basis. All timeout units are specified in
%% milliseconds. The pb_timeout_general config item provides a
%% default timeout if the read/write/listkeys/mapreduce timeout is not
%% specified.
{pb_timeout_general, 30000}.
{pb_timeout_read, 5000}.
{pb_timeout_write, 5000}.
{pb_timeout_listkeys, 50000}.
%% The general timeout will be used because this specific item is commented:
%% {pb_timeout_mapreduce, 50000}.
My theory is that opening a new cask races with merge activity, and that data files end up unused because yet another cask is re-opened?
% lsof -nP -p 37788 | egrep '\.data' | sed 's/.* //' | sort | uniq | wc
999 999 112477
% lsof -nP -p 37788 | egrep '\.hint' | sed 's/.* //' | sort | uniq | wc
13082 13082 1451053
% find /tmp/riak-1.4.2/rel/riak/data/bitcask -type f -size 0 -name \*.data -ls | head -3
40232003 0 -rw-rw-r-- 1 fritchie wheel 0 Oct 23 18:57 /tmp/riak-1.4.2/rel/riak/data/bitcask/0/1015.bitcask.data
40236345 0 -rw-rw-r-- 1 fritchie wheel 0 Oct 23 19:03 /tmp/riak-1.4.2/rel/riak/data/bitcask/0/1046.bitcask.data
40240520 0 -rw-rw-r-- 1 fritchie wheel 0 Oct 23 19:10 /tmp/riak-1.4.2/rel/riak/data/bitcask/0/1076.bitcask.data
% find /tmp/riak-1.4.2/rel/riak/data/bitcask -type f -size 0 -name \*.data -ls | wc -l
24236
% find /tmp/riak-1.4.2/rel/riak/data/bitcask -type f -size 0 -name \*.hint -ls | wc -l
0
fold_visits_frozen_test with RollOver == true seems to freeze sometimes. I looked into this on a colo machine and my laptop, but I suspect they may be too fast to reproduce it. It shouldn't take more than ~2 seconds to run. We should get to the bottom of it before the RC.
======================== EUnit ========================
module 'bitcask'
bitcask: a0_test...[0.100 s] ok
bitcask: roundtrip_test...
=INFO REPORT==== 16-Apr-2014::19:43:58 ===
Bitcask IO mode is: bitcask_file
[0.505 s] ok
bitcask: write_lock_perms_test...[0.266 s] ok
bitcask: list_data_files_test...[0.024 s] ok
bitcask: fold_test...[1.874 s] ok
bitcask: iterator_test...[0.244 s] ok
bitcask: fold_corrupt_file_test...
=ERROR REPORT==== 16-Apr-2014::19:44:02 ===
Trailing data, discarding (10 bytes)
=ERROR REPORT==== 16-Apr-2014::19:44:02 ===
Trailing data, discarding (14 bytes)
[0.550 s] ok
bitcask:1687: fold_visits_frozen_test_...[2.301 s] ok
bitcask:1688: fold_visits_frozen_test_...process killed by signal 11
program finished with exit code -1
elapsedTime=29.704782
After crash, bitcask does not always detect stale locks correctly. Here is an example from a crashed Riak node.
07:25:13.738 [error] Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.25793.657> exit with reason bad return value: {error,{write_locked,locked,"/var/lib/riak/bitcask/536645132457440915277915524513010171279912730624"}} in context child_terminated
Folds work over a snapshot of the data. When a folder starts, it fetches the current epoch number, which is used to resolve which version of an object it is allowed to see. This multi-version mechanism was introduced to allow multiple folders to work over the same data; before it, only one snapshot of the data was available. Monotonically increasing epochs were introduced recently to replace the racy timestamp-based mechanism we had before. Each operation generates a new epoch, which is saved on each modified entry. So an entry in the keydir may have different versions for different epochs, each pointing to potentially different files.
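As a rough illustration (a Python model with an invented in-memory structure, not Bitcask's actual NIF keydir): each put bumps a global epoch, and a fold resolves each key against the epoch it pinned at start, so concurrent writes don't change what the fold sees.

```python
# Toy model of an epoch-versioned keydir. A fold pins the epoch at
# which it starts and resolves each key to the newest version at or
# below that epoch. All names here are illustrative.
class Keydir:
    def __init__(self):
        self.epoch = 0
        self.entries = {}  # key -> list of (epoch, file_id, offset)

    def put(self, key, file_id, offset):
        self.epoch += 1
        self.entries.setdefault(key, []).append((self.epoch, file_id, offset))

    def snapshot_get(self, key, at_epoch):
        versions = [v for v in self.entries.get(key, []) if v[0] <= at_epoch]
        return max(versions) if versions else None

kd = Keydir()
kd.put("k", 1, 0)          # epoch 1: k lives in file 1
fold_epoch = kd.epoch      # a fold starts here
kd.put("k", 2, 0)          # epoch 2: k rewritten into file 2
assert kd.snapshot_get("k", fold_epoch) == (1, 1, 0)   # fold still sees file 1
assert kd.snapshot_get("k", kd.epoch) == (2, 2, 0)     # new reader sees file 2
```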
Merges delete their input files once they complete. The race here is that a fold fetches an epoch, then proceeds to open files in the Bitcask directory. See lines 386 to 389 in 4a048be.
This is what I was trying to explain in this comment and this other one.
Sometimes, bitcask will merge on node startup. This can cause very high disk IO when paired with lots of requests and can result in request timeouts and very high FSM times. Adding a timer on startup for bitcask merges may reduce this occurrence. Another idea is to disable bitcask merging during full-syncs or high volume of client requests.
Here's counterexample Cp8:
Cp8 = [[{set,{var,1},{call,bitcask_pulse,bc_open,[true,{true,{62,59,175},26}]}},{set,{var,2},{call,bitcask_pulse,puts,[{var,1},{2,3},<<188,81,240,255,122,185,227,8,104,196,56,142,250,34,8>>]}},{set,{var,6},{call,bitcask_pulse,bc_close,[{var,1}]}},{set,{var,10},{call,bitcask_pulse,fork,[[{init,{state,undefined,false,false,[]}},{set,{not_var,1},{not_call,bitcask_pulse,bc_open,[false,{true,{282,224,221},100}]}},{set,{not_var,2},{not_call,bitcask_pulse,fold,[{not_var,1}]}},{set,{not_var,3},{not_call,bitcask_pulse,fold,[{not_var,1}]}}]]}},{set,{var,12},{call,bitcask_pulse,bc_open,[true,{true,{270,229,202},55}]}},{set,{var,18},{call,bitcask_pulse,delete,[{var,12},2]}},{set,{var,19},{call,bitcask_pulse,merge,[{var,12}]}},{set,{var,20},{call,bitcask_pulse,fork_merge,[{var,12}]}},{set,{var,23},{call,bitcask_pulse,puts,[{var,12},{3,43},<<0,0,0,0>>]}},{set,{var,25},{call,bitcask_pulse,fold,[{var,12}]}},{set,{var,26},{call,bitcask_pulse,bc_close,[{var,12}]}},{set,{var,29},{call,bitcask_pulse,bc_open,[true,{true,{246,190,50},100}]}},{set,{var,35},{call,bitcask_pulse,delete,[{var,29},5]}},{set,{var,36},{call,bitcask_pulse,fold,[{var,29}]}}],{43245,66780,7696},[{events,[]}]].
[begin Now = {1405,406317,535184}, io:format("Now ~w ", [Now]), true = eqc:check(bitcask_pulse:prop_pulse(), [lists:nth(1,Cp8), Now, lists:nth(3,Cp8)]) end || _ <- lists:seq(1,100)].
That seed of {1405,406317,535184} is nice & deterministic on my Mac. YMMV; substitute now() in its place and run until you find something that fails very consistently.
The last three ops of the main thread are:
{set,{var,29},
{call,bitcask_pulse,bc_open,[true,{true,{246,190,50},100}]}},
{set,{var,35},{call,bitcask_pulse,delete,[{var,29},5]}},
{set,{var,36},{call,bitcask_pulse,fold,[{var,29}]}}],
The failure is that key 5 should not be found by step 36's fold, but it is.
Bad:
[{0,410681,[]},
{410681,infinity,
[{bad,<<"pid_1">>,
{fold,[{2,[not_found]},
{3,[<<0,0,0,0>>]},
{4,[<<0,0,0,0>>]},
{5,[not_found]},
[...]
{43,[<<0,0,0,0>>]}],
[{3,<<0,0,0,0>>},
{4,<<0,0,0,0>>},
{5,<<0,0,0,0>>},
[...]
More research required. There's a possible race with a forked merge and/or with merge at open time via bitcask:make_merge_file().
For example:
2012-10-16 22:18:44.022 UTC [error] <0.8433.0> Failed to merge ["/var/lib/riak/bitcask/1168915860326181142586736208979136516697469485056",[{data_root,"/var/lib/riak/bitcask"},{read_write,true}],["/var/lib/riak/bitcask/1168915860326181142586736208979136516697469485056/3.bitcask.data"]]: {{badrecord,mstate},[{bitcask,merge_files,1,[{file,"src/bitcask.erl"},{line,835}]},{bitcask,merge1,3,[{file,"src/bitcask.erl"},{line,509}]},{bitcask_merge_worker,do_merge,1,[{file,"src/bitcask_merge_worker.erl"},{line,130}]}]}
Riak version 1.2.0, which translates to bitcask tag "1.5.1".
When Riak is improperly shut down or its process is killed, the cleanup processes that release lock files are not triggered. If another OS PID has been created that shares Riak's old process id, and Riak is started again, the subsequent checks in Riak will see the original OS PID that was written to the write lock file is still active, and will not release the lock (even though the process id in question does not refer to Riak anymore).
This can be replicated by:
Could the operation that checks for the OS PID's existence also confirm that it is in fact a beam.smp process, to lessen the likelihood of this stuck lock file?
For additional context, see the following:
https://basho.zendesk.com/agent/#/tickets/3873
https://basho.zendesk.com/agent/#/tickets/5336
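A sketch of what that check could look like (Python, Linux-only /proc lookup; the names and exact behavior are assumptions, not Bitcask's current code): a lock counts as stale unless the recorded OS PID both exists and is running the expected executable, which guards against PID reuse after a crash.

```python
import os

def lock_is_stale(pid, expected_name="beam.smp",
                  read_comm=lambda p: open("/proc/%d/comm" % p).read().strip()):
    """True if the lock's recorded OS PID no longer refers to a live
    process running expected_name (guards against PID reuse).
    Illustrative only; not Bitcask's actual lock code."""
    try:
        os.kill(pid, 0)                  # signal 0: existence check only
    except ProcessLookupError:
        return True                      # no such process: definitely stale
    except PermissionError:
        pass                             # process exists, owned by another user
    try:
        return read_comm(pid) != expected_name   # PID reused by something else?
    except OSError:
        return True
```

On Linux, /proc/&lt;pid&gt;/comm gives the executable name; other platforms would need a different lookup, which is part of why this is only a sketch.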
Currently Bitcask performs a CRC check after unpacking data. This may lead to a badmatch error when unpacking the data if the data is corrupt. Checking the CRC before unpacking the data would avoid this badmatch error.
Relevant section of code:
https://github.com/basho/bitcask/blob/d8958d98d6619a1a8b9e71a7a7b19a7d9fb38ef0/src/bitcask_fileops.erl#L171-181
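To illustrate the ordering (with an invented record layout, not Bitcask's actual on-disk format): verify the checksum over the raw bytes first, and decode the fields only once the record is known to be intact.

```python
import struct, zlib

def unpack_record(raw):
    """CRC-before-unpack: refuse to decode fields of a corrupt record.
    The layout (4-byte CRC, 2-byte key len, 2-byte value len) is
    invented for this example."""
    crc, = struct.unpack(">I", raw[:4])
    body = raw[4:]
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        return None                       # corrupt: never unpack, no badmatch
    klen, vlen = struct.unpack(">HH", body[:4])
    key = body[4:4 + klen]
    value = body[4 + klen:4 + klen + vlen]
    return key, value

body = struct.pack(">HH", 1, 2) + b"k" + b"vv"
rec = struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF) + body
assert unpack_record(rec) == (b"k", b"vv")
assert unpack_record(rec[:-1] + b"\x00") is None   # corruption caught up front
```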
Large amounts of concurrent Bitcask merging activity across a cluster can have a significant impact on latencies and performance. One of the recommended ways to get around this is to use staggered, non-overlapping merge windows, resulting in at most one node performing merging at any point in time. A potential issue with this approach is however that it means merging on each node occurs less frequently, leading to a significant increase in disk usage between windows if a lot of data is inserted/updated.
Given that the aim of this approach is to ensure concurrent merging is kept to a minimum, would it make sense to introduce a global Bitcask merge coordinator that automatically serialises merges across the cluster?
Each individual node could still observe any configured merge windows, but the coordinator would ensure that merges are not performed concurrently. As concurrent merges are undesirable but do not necessarily cause problems, a best-effort approach might be sufficient.
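A best-effort coordinator could be as simple as a single advisory token. This toy model (all names invented; a real version would live in a distributed service) shows the intended behavior: a node that can't take the token simply skips its merge until the next check, which is acceptable given merges are retried periodically anyway.

```python
class MergeCoordinator:
    """Best-effort, single-token merge serializer (illustrative only)."""
    def __init__(self):
        self.holder = None

    def try_acquire(self, node):
        # Grant the token if free (or re-granted to the same holder).
        if self.holder is None or self.holder == node:
            self.holder = node
            return True
        return False   # someone else is merging: caller skips this round

    def release(self, node):
        if self.holder == node:
            self.holder = None

coord = MergeCoordinator()
assert coord.try_acquire("node1")
assert not coord.try_acquire("node2")   # merge skipped, best effort
coord.release("node1")
assert coord.try_acquire("node2")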
This problem was observed as merges crashing in Riak when using the Bitcask backend. The new code may pass a {tombstone, Key} argument to a key transformation callback, which normally just takes a binary key. Key transformations are used in the riak_kv_bitcask_backend to support the traditional and new key encoding formats introduced in Bitcask 1.7.0/Riak 2.0.
All paths that call a key transformer must make sure that it receives only the binary key to transform.
I've pushed 25cf7b6 to comment it out to resolve the problem for now, but we should likely make a decision at some point as to what we want to do wrt that repo and its inclusion in our tests.
I have reproduced a problem where Bitcask gets stuck, unable to re-open a cask in a certain directory.
On open, first a keydir object is created, but not marked as ready. Then the files are scanned to populate it, at which point it is marked ready here bitcask.erl#L1248 and things are good. However, if the scan errors out, we hit this branch instead in bitcask.erl#L1244, which does not mark the keydir as ready, but leaves it behind in that state. Calling open again on the same directory finds this existing keydir, but detects it is not ready, so tries to wait for it to load in bitcask.erl#L1252, eventually timing out.
When the error happens, the newly created keydir should probably be released.
Now, the fact that the error happens on scan might lead to a different bug. What has been observed is that the function to list files in a directory, bitcask_fileops:list_dir/1 returns {error, einval}, which is not handled in bitcask_fileops:data_file_tstamps/1, causing the error that leads to the stuck keydir. Notice how this function is trying to avoid a call to the file server by calling the efile port directly, which might be part of the reason. I'm currently investigating the exact sequence of events that leads to this.
This issue affects 2.0 and the 1.4 branch post 1.4.4.
The fix to avoid file server usage (#118) changed the way that the largest file ID is accounted for. Instead of listing the entire directory (potentially very expensive, directly in the put path), it keeps track of the file id in the keydir. Unfortunately, there is a corner case.
On the initial population of the keydir, each file is folded over in turn. If it has any keys at all, the keydir's largest id counter will be increased. However, if it has no keys, it will not be, because no keydir_put will happen. This can cause a problem when a keyless hintfile/datafile pair has the largest file id in the directory on startup, e.g.:
-rw-rw-r-- 1 riak riak 1453263 Feb 3 14:11 186.bitcask.data
-rw-rw-r-- 1 riak riak 60798 Feb 3 14:11 186.bitcask.hint
-rw-rw-r-- 1 riak riak 1613200 Feb 3 14:10 187.bitcask.data
-rw-rw-r-- 1 riak riak 124681 Feb 3 14:10 187.bitcask.hint
-rw-rw-r-- 1 riak riak 1469163 Feb 3 22:24 188.bitcask.data
-rw-rw-r-- 1 riak riak 59843 Feb 3 22:54 188.bitcask.hint
-rw------- 1 riak riak 0 Jan 30 19:57 189.bitcask.data
-rw------- 1 riak riak 18 Jan 30 19:57 189.bitcask.hint
Since in this case the biggest_file_id will be 188, the code here:
https://github.com/basho/bitcask/blob/develop/src/bitcask_fileops.erl#L73-L88
will fail, causing the vnode to crash and be restarted, leaving us exactly where we started.
A workaround is to manually remove the empty hint and data file pair; a fix should be forthcoming shortly.
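The corner case could be avoided by deriving the largest id from the directory listing itself, since even a keyless data/hint pair still claims its file id. A sketch (Python; helper name invented, not the actual fix):

```python
import os, re

def largest_file_id(dirname):
    """Largest NNN among NNN.bitcask.data / NNN.bitcask.hint files.
    Counts empty files too, so a keyless pair like 189.bitcask.data
    still advances the counter past 188. Illustrative only."""
    ids = [int(m.group(1))
           for f in os.listdir(dirname)
           if (m := re.match(r"(\d+)\.bitcask\.(data|hint)$", f))]
    return max(ids, default=0)
```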
The 1.7.0 version now has different file formats:
We need to verify that there is a path to downgrade, however annoying.
/cc @jonmeredith
Repro steps: Start a single node with bitcask, fill with enough data to have a few data files, stop node, create empty lock file, start node.
Errors:
09:38:29.529 [error] Failed to read lock data from ./data/bitcask/0/bitcask.create.lock: {invalid_data,<<>>}
09:38:29.529 [error] gen_fsm <0.692.0> in state active terminated with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106
09:38:29.529 [error] CRASH REPORT Process <0.692.0> with 1 neighbours exited with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106 in gen_fsm:terminate/7 line 611
09:38:29.530 [error] Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.692.0> exit with reason lock_failure in bitcask_fileops:get_create_lock/2 line 106 in context child_terminated
09:38:29.530 [error] gen_fsm <0.800.0> in state ready terminated with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106
09:38:29.530 [error] CRASH REPORT Process <0.800.0> with 10 neighbours exited with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106 in gen_fsm:terminate/7 line 611
09:38:29.531 [error] Supervisor {<0.801.0>,poolboy_sup} had child riak_core_vnode_worker started with riak_core_vnode_worker:start_link([{worker_module,riak_core_vnode_worker},{worker_args,[0,[],worker_props,<0.798.0>]},{worker_callback_mod,...},...]) at undefined exit with reason lock_failure in bitcask_fileops:get_create_lock/2 line 106 in context shutdown_error
09:38:29.531 [error] gen_server <0.801.0> terminated with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106
09:38:29.531 [error] CRASH REPORT Process <0.801.0> with 0 neighbours exited with reason: lock_failure in bitcask_fileops:get_create_lock/2 line 106 in gen_server:terminate/6 line 747
The node sits in a sad loop, never recovering.
While documenting the read code path, I came across a race when reads find an expired value. The code then unconditionally removes that entry. BUT, this is not an atomic operation. A new value could have been written between the time we found the expired entry and the time we delete it. This needs to be changed to a conditional delete that will only remove the entry if it's exactly the same one we found, or try again if it has changed since the first get.
This is the line in bitcask:get/3 that does it: bitcask.erl#L238
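A toy model of the proposed conditional delete (Python; a dict stands in for the keydir, and object identity stands in for the entry's file id/offset/epoch version). An unconditional delete here could discard a value written between the expiry check and the removal; the conditional form re-reads instead.

```python
def get(keydir, key, now):
    """Read with expiry, deleting an expired entry only if it is still
    exactly the version we examined. Illustrative model, not the NIF."""
    while True:
        entry = keydir.get(key)
        if entry is None:
            return "not_found"
        if entry["expiry"] > now:
            return entry["value"]
        # Conditional delete: remove only if unchanged since we read it;
        # otherwise loop and re-read rather than clobbering a new value.
        if keydir.get(key) is entry:
            del keydir[key]
            return "not_found"

kd = {"k": {"value": b"v1", "expiry": 10}}
assert get(kd, "k", 5) == b"v1"         # not yet expired
assert get(kd, "k", 11) == "not_found"  # expired: entry removed
assert "k" not in kd
```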
It is possible for Bitcask instance A's open(), which calls init_keydir(), to block for long periods of time while Bitcask instance B performs a long-running merge. In turn, such a long-blocking open() can make the riak_core_vnode_manager process very unhappy, since it is the process that starts new Riak vnode instances, which then might open a Bitcask instance.
Below is a stack trace from a blocked Bitcask instance (instance A in the description above). The same cluster-info report shows that the bitcask_merge_delete server has a queue of hundreds of files from instance B to delete. The report shows that there's a 3rd instance doing a long-running fold (repair handoff) that is preventing the bitcask_merge_delete server from deleting B's old merge files.
https://github.com/basho/bitcask/blob/1.6.3-release/src/bitcask_merge_delete.erl#L109
One possible solution is to split the monolithic bitcask_merge_delete server into per-bitcask-instance servers?
=proc:<0.9668.5878>
State: Waiting
Spawned as: proc_lib:init_p/5
Spawned by: <0.247.0>
Started: Thu Mar 20 11:25:11 2014
Message queue length: 1
Message queue: [{'$gen_sync_event',{<0.9667.5878>,#Ref<0.0.4474.115775>},wait_for_init}]
Number of heap fragments: 0
Heap fragment data: 0
Link list: []
Dictionary: [{'$initial_call',{riak_core_vnode,init,1}},{'$ancestors',[riak_core_vnode_sup,riak_core_sup,<0.244.0>]},{random_seed,{26833,7398,23484}}]
Reductions: 128073
Stack+heap: 987
OldHeap: 4181
Heap unused: 683
OldHeap unused: 3540
Stack dump:
Program counter: 0x00007f8539fccc28 (bitcask:poll_deferred_delete_queue_empty/0 + 88)
CP: 0x0000000000000000 (invalid)
arity = 0
0x00007f83b256aeb8 Return addr 0x00007f8539fc9570 (bitcask:init_keydir/3 + 632)
0x00007f83b256aec0 Return addr 0x00007f8539fc3c78 (bitcask:open/2 + 568)
y(0) <<0 bytes>>
y(1) Catch 0x00007f8539fc9660 (bitcask:init_keydir/3 + 872)
y(2) true
y(3) <<0 bytes>>
y(4) "/var/lib/riak/bitcask/702205864811332261480676696969151607100311339008"
0x00007f83b256aef0 Return addr 0x00007f853baef280 (riak_kv_bitcask_backend:start/2 + 1336)
y(0) []
y(1) []
y(2) 2147483648
y(3) fresh
y(4) [{data_root,"/var/lib/riak/bitcask"},{read_write,true}]
y(5) "/var/lib/riak/bitcask/702205864811332261480676696969151607100311339008"
0x00007f83b256af28 Return addr 0x00007f853b1d4160 (riak_cs_kv_multi_backend:start_backend/4 + 152)
y(0) []
y(1) []
y(2) [{data_root,"/var/lib/riak/bitcask"},{read_write,true}]
y(3) "/var/lib/riak/bitcask"
y(4) "702205864811332261480676696969151607100311339008"
y(5) 702205864811332261480676696969151607100311339008
0x00007f83b256af60 Return addr 0x00007f853b1d9208 (riak_cs_kv_multi_backend:'-start_backend_fun/1-fun-0-'/3 + 256)
y(0) be_blocks
y(1) riak_kv_bitcask_backend
y(2) Catch 0x00007f853b1d4258 (riak_cs_kv_multi_backend:start_backend/4 + 400)
0x00007f83b256af80 Return addr 0x00007f855c674960 (lists:foldl/3 + 120)
y(0) riak_kv_bitcask_backend
y(1) []
y(2) [{be_default,riak_kv_eleveldb_backend,{state,<<0 bytes>>,"/var/lib/riak/leveldb/702205864811332261480676696969151607100311339008",[{create_if_missing,true},{max_open_files,50},{use_bloomfilter,true},{write_buffer_size,59927635}],[{create_if_missing,true},{data_root,"/var/lib/riak/leveldb"},{included_applications,[]},{max_open_files,50},{use_bloomfilter,true},{write_buffer_size,59927635}],[],[],[{fill_cache,false}],true,false}}]
y(3) Catch 0x00007f853b1d9320 (riak_cs_kv_multi_backend:'-start_backend_fun/1-fun-0-'/3 + 536)
0x00007f83b256afa8 Return addr 0x00007f853b1d3e90 (riak_cs_kv_multi_backend:start/2 + 808)
y(0) #Fun<riak_cs_kv_multi_backend.1.32308858>
y(1) []
0x00007f83b256afc0 Return addr 0x00007f853ae86778 (riak_kv_vnode:init/1 + 872)
y(0) be_default
y(1) [{<<3 bytes>>,be_blocks}]
y(2) []
y(3) []
0x00007f83b256afe8 Return addr 0x00007f853ae6d220 (riak_core_vnode:do_init/1 + 248)
y(0) Catch 0x00007f853ae86778 (riak_kv_vnode:init/1 + 872)
y(1) 3000
y(2) <<8 bytes>>
y(3) 10
y(4) 100
y(5) 100
y(6) 1000
y(7) true
y(8) riak_cs_kv_multi_backend
y(9) 702205864811332261480676696969151607100311339008
0x00007f83b256b040 Return addr 0x00007f853ae6ce28 (riak_core_vnode:started/2 + 200)
y(0) []
y(1) []
y(2) []
y(3) []
y(4) []
y(5) []
y(6) []
y(7) []
y(8) []
y(9) []
y(10) undefined
y(11) 702205864811332261480676696969151607100311339008
y(12) riak_kv_vnode
y(13) {state,702205864811332261480676696969151607100311339008,riak_kv_vnode,undefined,undefined,none,undefined,undefined,undefined,undefined,undefined,0}
0x00007f83b256b0b8 Return addr 0x00007f853ae64e98 (gen_fsm:handle_msg/7 + 224)
y(0) 0
0x00007f83b256b0c8 Return addr 0x00007f855c6675b8 (proc_lib:init_p_do_apply/3 + 56)
y(0) undefined
y(1) Catch 0x00007f853ae64e98 (gen_fsm:handle_msg/7 + 224)
y(2) riak_core_vnode
y(3) {state,702205864811332261480676696969151607100311339008,riak_kv_vnode,undefined,undefined,none,undefined,undefined,undefined,undefined,undefined,0}
y(4) started
y(5) <0.9668.5878>
y(6) <0.247.0>
y(7) {'$gen_event',timeout}
0x00007f83b256b110 Return addr 0x0000000000847c38 (<terminate process normally>)
y(0) Catch 0x00007f855c6675d8 (proc_lib:init_p_do_apply/3 + 88)
See also: #95
In the README it says "Bitcask requires Erlang R13B04 or later." but it won't compile with R13B04 because it references ErlNifPid.
c_src/bitcask_nifs.c:105: error: expected specifier-qualifier-list before ‘ErlNifPid’
If your hintfiles are deleted, and you shut down the node cleanly before any merges happen, your hintfiles will not be recreated. This is undesirable on very large nodes, so since we already have the hintfile information, we should make sure that it is up to date on bitcask:close().
There is code in Bitcask to list files in a directory that tries to avoid serializing on the file server by caching an efile port in the process dictionary, which it uses to list directories when needed. The problem is that the port may go away. The process that opened it would get a notification if it happened, but nothing would tie that back to Bitcask to release the cached port. After that point, calls to list directory contents to, say, open a bitcask, would fail forever.
The caching function is bitcask_fileops:get_efile_port/0, used by bitcask_fileops:list_dir/1.
I believe the best thing to do at this point is to use a regular directory listing operation here. The initial file server serialization problem was caused by merges piling up on the Riak side when things got slow. The Bitcask backend code will now avoid issuing merge requests until the last one has finished, which should prevent that from happening again.
10:58:31:bitcask_pull(master_+|MERGING) $ dialyzer --plt .bitcask.plt -pa deps/*/ebin -pa ebin --src src/*.erl
Checking whether the PLT .bitcask.plt is up-to-date... yes
Proceeding with analysis...
bitcask.erl:474: The pattern [U | _] can never match the type []
bitcask.erl:574: The pattern 'true' can never match the type 'false'
bitcask.erl:590: Function will never be called
bitcask.erl:591: Function will never be called
bitcask.erl:593: Function will never be called
bitcask.erl:607: Function will never be called
bitcask.erl:614: Function frag_threshold/1 will never be called
bitcask.erl:617: Guard test '>='(any(),FragThreshold::none()) can never succeed
bitcask.erl:624: Function dead_bytes_threshold/1 will never be called
bitcask.erl:627: Guard test '>='(any(),DeadBytesThreshold::none()) can never succeed
bitcask.erl:634: Function small_file_threshold/1 will never be called
bitcask.erl:642: Guard test '<'(any(),Threshold::none()) can never succeed
bitcask.erl:652: Function expired_threshold/1 will never be called
bitcask.erl:654: Guard test '<'(any(),Cutoff::none()) can never succeed
bitcask.erl:668: The pattern [F | _] can never match the type []
bitcask.erl:698: The call bitcask:summarize(any(),S::{non_neg_integer(),pos_integer(),pos_integer(),pos_integer(),pos_integer()}) will never return since it differs in the 2nd argument from the success typing arguments: (string(),{integer(),number(),number(),number(),number(),_})
bitcask.erl:702: The pattern [S | _] can never match the type []
bitcask.erl:711: Function summarize/2 has no local return
bitcask.erl:711: The pattern <Dirname, {FileId, LiveCount, TotalCount, LiveBytes, TotalBytes, OldestTstamp}> can never match the type <_,{non_neg_integer(),pos_integer(),pos_integer(),pos_integer(),pos_integer()}>
bitcask.erl:781: The call bitcask_fileops:fold_keys(File::#filestate{mode::'read_only',filename::string() | #filestate{},tstamp::integer(),hintcrc::0,ofs::0},F::fun((_,_,_,_) -> 'already_exists' | 'ok'),'undefined','recovery') breaks the contract ('fresh' | #filestate{},fun((binary(),integer(),{integer(),integer()},any()) -> any()),any(),'datafile' | 'hintfile' | 'default') -> any()
Unknown functions:
erlang:max/2
done in 0m1.53s
done (warnings were emitted)
Symptom: Riak's 100%ile put latencies grow worse over time, rising to 200+ milliseconds while the 99%ile remains under 10 milliseconds.
This issue may or may not be intertwined with this ticket: #113 ... after capturing lsof -nP -p {riak pid} output, I then check to see the number of open file handles that refer to deleted files:
% grep '(deleted)' ~/slf/lsof.out1 | wc -l
191
... which is a much smaller number than issue #113 shows. And Bitcask does have a mechanism to close them. (The closing time can be decoupled from the deleting time.)
However, it's pretty clear from circumstantial evidence that the keydir is "remembering" cask files that no longer exist. Tracing shows the status() call taking over 150 milliseconds each. Watching the time spent by the file_server_2 process during these calls, 90+ milliseconds is spent handling read_file_info calls. If I use https://gist.github.com/slfritchie/159a8ce1f49fc03c77c6 to trace the arguments to file_server:handle_call/3 for 5 seconds:
func_args_tracer:start(file_server, handle_call, 3, 5, fun([X, _, _]) -> catch {element(1,X), element(2,X)} end).
... or however long is necessary to capture a burst of file_server_2 activity. Then scrape the output to a file and grep for read_file_info and a specific vnode number (just to avoid counting operations for different vnodes). I see this:
% grep read_file_info ~/foo2 | grep 274031556999544297163190906134303066185487351808 | sort -u | wc -l
2400
So, it's 2400 unique files. But ls in that vnode's data dir shows fewer than 100 files, both data & hint files combined. Drat.
Since it's the default on the Erlang side, we should be doing whatever static analysis we can on the C side. We can also consider going to C99 and look at places where that might make the code cleaner.
I am assigning this to 2.1 because there's some erlang/rebar magic going on that makes "erl_nif.h" not findable at check time. If there's a simple solution to that (I haven't even googled), then we might consider making this 2.0-final or 2.0.1, although if we do that, we should make going to C99 another issue, since that's too much change right now.
To do this, all we have to do is add {"DRV_CFLAGS", "-pedantic-errors"}, or {"DRV_CFLAGS", "-pedantic-errors -std=c99"}, to rebar.config, and do the required cleanup.
Customer needs support for more than a handful of different expiration settings. They would like support for dozens to under 100 different expiration settings without configuring backends for each of them, hence the request for support by bucket/object.
Some brainstorming brought up the possibility of attaching an expiry date timestamp next to the creation timestamp in the bitcask entry, or attaching it as metadata. Some further brainstorming brought up the possibility of bumping this metadata up a level to riak_kv, so all backends could support this.
From ZD:4865
At any rate, we do currently map buckets to specific multi-backends that have different expiration settings, but we have product requests to do more expiration options than the handful, 7 or 60 days that we currently offer. We don't want to get into a cycle of adding new backend configurations to app.config.
...Breaking those datasets further by a variety of retention policies would cause a multiplication of backends that Riak is not currently built to handle. (dozens, if not more than a hundred)
@jtuple's recent addition of the is_empty_estimate/1 function should be added to the QuickCheck + PULSE model. It's currently not exercised by the model.
See this gist for an example failure of the prop_expiry_test_ test:
https://gist.github.com/2f904b7a961c23a62982
A quick code read suggests that the try/catch/after in test/bitcask_qc_expiry.erl around lines 112-140 is masking the exception that ?assertEqual() is attempting to throw, probably introduced by commit ae7810b?
A few things here:
Bitcask calls application:start(bitcask) every time bitcask:open/1,2 is called:
https://github.com/basho/bitcask/blob/master/src/bitcask.erl#L92
https://github.com/basho/bitcask/blob/master/src/bitcask.erl#L714
The responsibility for loading/starting the bitcask application should lie with the application which is using bitcask.
The current implementation can cause problems for applications, such as Riak, that open casks without blocking the main application's start up. For example, in Riak, the call to application:start(bitcask) within a vnode can deadlock with init:stop(). When processing init:stop(), the application controller will wait for vnodes to shut down. If the vnodes are still starting when init:stop() is called, they will block on the call to application:start(bitcask). The deadlock eventually times out, after 5 minutes, and the application controller can complete the shutdown of Riak.
Steps to reproduce:
riak start; riak stop
riak stop hangs for about 5 minutes.
Same as basho/eleveldb#80.
customer "must have" request received 24 July 13
There isn't anything we can do when a value is corrupted; toss it, bitcask tombstone it, delete it from the keydir and return not_found.
Corrects a situation in which there's always a key left behind after a merge starts.
The current method of creating lock files is mostly good enough, especially with the recent testing by QuickCheck. However, it's probably a good idea to use the rename method used by riak_core_util:replace_file/2. h/t @gburd for reminding me.
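For reference, the write-then-rename pattern behind that method can be sketched like this (Python; path handling and names are illustrative, not riak_core_util's actual code): write the contents to a temporary file, then atomically rename it into place, so readers never observe a partially written lock file.

```python
import os, tempfile

def replace_file(path, data):
    """Atomically replace path's contents: write a temp file in the
    same directory, fsync, then rename over the target. Sketch only."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)     # same dir: rename stays atomic
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # data durable before the rename
        os.rename(tmp, path)              # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```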
https://github.com/basho/bitcask/blob/1.6/src/bitcask_fileops.erl#L477 should return Acc, not <<>>. The empty binary looks sensible but it actually breaks the fold. This essentially forces require_hint_crc to true even though it defaults to false.
To fix this we need to change the line to return Acc, change the bulk crc check to not happen if require_hint_crc is false, and also, to maintain safety, we should default require_hint_crc to true.
Due to so_name being deprecated and now removed from rebar in favor of port_specs, the shared library is not created when building bitcask with a current version of rebar. In this case, the files under the c_src directory are compiled but not linked.
In fact, this is a problem for everybody who wants to depend on bitcask but also use a current rebar version.
Note for reference: issue #49 (pull request "update rebar to 2.0.0") contains a commit ("Use port_specs" 292819e) that fixes this issue.
Seen at a customer over the weekend, it looks like if a needs_merge call hits a vnode just as it's starting up, it can cause the main vnode bitcask:open to fail, taking the vnode and potentially the system down. Not sure if this needs to be solved in bitcask, kv, or core, or possibly all three. The simplest potential solution I can think of is to make a 'maybe' version of init_keydir that gets used by needs_merge, that won't trigger a keydir creation, but solutions should wait for fuller understanding.
At a customer site we seem to have a situation where a number of creation lock files end up being created empty, and the retry mechanism gets stuck in a loop because corrupted lock files are not considered stale; considering them stale would remove them and allow the retry to proceed.
That decision is taken here: https://github.com/basho/bitcask/blob/1.6.6/src/bitcask_lockops.erl#L147
The error log would have messages like these:
2014-05-05 00:38:22.537 [error] <0.6561.379> Failed to read lock data from /data/riak/riak-data/riak/bitcask/696496874040508421956443553091353626554780352512/bitcask.create.lock: {invalid_data,<<>>}
I think a corrupted lock file should be treated the same as a stale file to avoid getting permanently stuck.
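The proposed change can be sketched like this (module and function names are illustrative, not the actual bitcask_lockops code): treat unreadable lock data the same as a lock held by a dead owner.

```erlang
-module(lock_stale_sketch).
-export([is_stale/1]).

%% A lock is stale if its owner is dead *or* its contents cannot be
%% parsed at all (e.g. {invalid_data, <<>>} from an empty lock file).
%% In both cases the lock can never be released by its owner, so it is
%% safe to delete the file and retry acquisition.
is_stale({ok, OwnerPid}) ->
    not owner_alive(OwnerPid);
is_stale({error, {invalid_data, _}}) ->
    true;
is_stale({error, _Other}) ->
    false.

%% Stand-in for the real liveness check, which inspects the OS pid
%% recorded in the lock file.
owner_alive(_OwnerPid) ->
    true.
```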
thanks to @joecaswell for finding this.
This leak is slightly hard to trigger, but certain pathological behavior patterns might trigger it.
Merge creates files, and thus creates fstats entries. These are not synchronized with the write thread in any way, but once a value within one of them is read, the file is added to the write thread's read_files state, from which it is eventually trimmed. So, to reproduce this, you need a file that is created by merge, never read from, and then merged away again.
The scenario where we saw this in the wild seemed to be:
A truncated hintfile causes the hintfile fold in the expiration logic to fail. It does not seem to fall back to a data file fold as it is supposed to. Example logs:
2014-06-05 03:20:31.217 [error] <0.27283.8119> Hintfile 'DATA_ROOT/10836.bitcask.hint' contains pointer 18446744073709551615 3206534561 that is greater than total data size 15032468
2014-06-05 03:20:31.217 [error] <0.27283.8119> Error folding keys for "DATA_ROOT/10836.bitcask.data": {trunc_hintfile,ok}
The only mitigation is to remove the hintfile, then the default fold is correctly utilized.
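The intended behavior can be sketched like this (a hand-written illustration of the fallback pattern, not the actual bitcask fold code):

```erlang
-module(fold_fallback_sketch).
-export([fold_keys/2]).

%% Try the fast hintfile fold first; if it fails, either with the
%% {trunc_hintfile, _} error seen in the logs above or by crashing
%% outright, fall back to the slower but authoritative data file fold.
fold_keys(HintFold, DataFold) ->
    case catch HintFold() of
        {'EXIT', _Reason}            -> DataFold();
        {error, {trunc_hintfile, _}} -> DataFold();
        Result                       -> Result
    end.
```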
We occasionally encounter issues with nodes crashing due to Bitcask file corruption issues. Although Erlang recovery scripts are available, the process of detecting corruption in the logs and applying the fix requires manual intervention. A better method for identifying and recovering from these errors would be helpful.
The following error happens on compile on Solaris 10u9.
Compiling /export/home/buildbot/masters/riak/riak-build-solaris-10u9-64/build/distdir/riak-2.0.0pre2/deps/bitcask/c_src/bitcask_nifs.c
In file included from /export/home/buildbot/masters/riak/riak-build-solaris-10u9-64/build/distdir/riak-2.0.0pre2/deps/bitcask/c_src/bitcask_nifs.c:39:
/usr/include/stdbool.h:42:2: #error "Use of <stdbool.h> is valid only in a c99 compilation environment."
when the following conditions are met:
I don't understand the merging code very well, but I'm working under the assumption that if a merge is performed on a file that is currently being written to, then data can be lost, and this is why the merge function tries to exclude the file currently being written to.
When you don't pass in a list of files, merge calls readable_files:
-spec merge(Dirname::string()) -> ok.
merge(Dirname) ->
merge(Dirname, [], readable_files(Dirname)).
readable_files gets the current file being written to so it can exclude that file in list_data_files:
readable_files(Dirname) ->
%% Check the write and/or merge locks to see what files are currently
%% being written to. Generate our list excepting those.
WritingFile = bitcask_lockops:read_activefile(write, Dirname),
MergingFile = bitcask_lockops:read_activefile(merge, Dirname),
list_data_files(Dirname, WritingFile, MergingFile).
list_data_files(Dirname, WritingFile, Mergingfile) ->
%% Get list of {tstamp, filename} for all files in the directory then
%% reverse sort that list and extract the fully-qualified filename.
Files = bitcask_fileops:data_file_tstamps(Dirname),
[F || {_Tstamp, F} <- reverse_sort(Files),
F /= WritingFile,
F /= Mergingfile].
However, I think there is a small race condition, because the following can happen:
For maximum parallelism, Bitcask should avoid file I/O-related functions that end up calling the local file_server process. For example, bitcask:needs_merge() calls filelib:is_file(), which is handled by, and serialized through, a single Erlang process: registered name file_server_2, code module file_server.
IIRC, the Hibari code uses some prim_file hackery to avoid sending file I/O through the local file_server_2 proc.
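For example, an existence check can bypass file_server_2 by calling prim_file directly. Note that prim_file is an undocumented internal module, so this is exactly the kind of hackery mentioned above; a sketch:

```erlang
-module(raw_file_sketch).
-export([is_file_raw/1]).

%% filelib:is_file/1 routes the request through the registered
%% file_server_2 process; prim_file:read_file_info/1 performs the stat
%% in the calling process, avoiding that serialization point.
is_file_raw(Path) ->
    case prim_file:read_file_info(Path) of
        {ok, _Info} -> true;
        {error, _}  -> false
    end.
```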
It would be nice to have better visibility into bitcask merge activity.