wsrep-lib's People

Contributors

ayurchen, denis-protivensky, ottok, pacheco, sciascid, shahriyarr, sjaakola, temeo

wsrep-lib's Issues

Cluster view must be recovered from state upon SST completion

When a node receives an SST, it needs to recover the last view (the one in which the SST happened) from the received state, since it won't receive a corresponding view event: that event happened and was ordered before the SST, and its effects are encapsulated within the transferred state.

Typo in wsrep-lib/src/server_state.cpp void wsrep::server_state::on_view

It should be protocol_version instead of prococol_version. Will send PR:

2019-02-05 16:04:57 3 [Note] WSREP: ================================================
View:
  id: 115168fc-293e-11e9-b868-b2a2c5ce7818:0
  seqno: 7
  status: 0
  prococol_version: 3
  own_index: 4
  final: 0
  members

void wsrep::server_state::on_view(const wsrep::view& view,
                                  wsrep::high_priority_service* high_priority_service)
{
    wsrep::log_info()
        << "================================================\nView:\n"
        << "  id: " << view.state_id() << "\n"
        << "  seqno: " << view.view_seqno() << "\n"
        << "  status: " << view.status() << "\n"
        << "  prococol_version: " << view.protocol_version() << "\n"
        << "  own_index: " << view.own_index() << "\n"
        << "  final: " << view.final() << "\n"
        << "  members";

mutable_buffer: Assertion '__builtin_expect(__n < this->size(), true)'

How to reproduce

Compile the library with -D_GLIBCXX_ASSERTIONS. Any attempt to access the data of an empty mutable_buffer will then hit an assertion in libstdc++:

#2  0x00005601b88744c4 in std::__replacement_assert (
    __file=0x5601b89fdaa8 "/usr/include/c++/7/bits/stl_vector.h", __line=797, 
    __function=0x5601b89fdc40 <std::vector<char, std::allocator<char> >::operator[](unsigned long)::__PRETTY_FUNCTION__> "std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = char; _Alloc = std::allocator<char>; std::vector<_Tp, _Alloc>::reference = cha"..., 
    __condition=0x5601b89fda78 "__builtin_expect(__n < this->size(), true)")
    at /usr/include/x86_64-linux-gnu/c++/7/bits/c++config.h:472
#3  0x00005601b8874ea5 in std::vector<char, std::allocator<char> >::operator[]
    (this=0x7ffddfcc6fd0, __n=0) at /usr/include/c++/7/bits/stl_vector.h:797
#4  0x00005601b887460b in wsrep::mutable_buffer::data (this=0x7ffddfcc6fd0)
    at /home/teemu/work/git/wsrep-lib/include/wsrep/buffer.hpp:88

Suggested fix

Instead of &buffer_[0], use std::vector::data to access the underlying data array. This call is valid even for empty vectors, see https://en.cppreference.com/w/cpp/container/vector/data.
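
A minimal sketch of the suggested change, using a hypothetical buffer_sketch class as a stand-in for wsrep::mutable_buffer (the real class has a different interface). It only illustrates that std::vector::data() can be called on an empty vector, whereas &buffer_[0] indexes an element that does not exist:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for wsrep::mutable_buffer, for illustration only.
class buffer_sketch
{
public:
    // Append the bytes in [begin, end) to the buffer.
    void push_back(const char* begin, const char* end)
    {
        assert(begin <= end);
        buffer_.insert(buffer_.end(), begin, end);
    }

    // std::vector::data() is valid for an empty vector (the returned
    // pointer may be null and must not be dereferenced), so no element
    // is indexed here.
    const char* data() const { return buffer_.data(); }

    std::size_t size() const { return buffer_.size(); }

private:
    std::vector<char> buffer_;
};
```

Callers still have to check size() before dereferencing the returned pointer; the change only removes the out-of-range operator[] call.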

Fix `make test`

Running make test in the build directory should run the unit tests (if unit tests are compiled in).

Compilation fails on 32-bit systems with maintainer flags

/wsrep-lib/src/gtid.cpp:76:21: warning: conversion to 'ssize_t {aka int}' from 'std::streamoff {aka long long int}' may alter its value [-Wconversion]

Since the input for this function has type size_t, the return value should probably be of the same size, so it makes sense to convert std::streamoff to ssize_t with an explicit static_cast.
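
The explicit cast could look roughly like the sketch below. The function print_to_c_str is hypothetical and is not the code in gtid.cpp; it only shows a std::streamoff value from tellp() being range-checked against a size_t buffer length and then cast explicitly, which keeps -Wconversion quiet on platforms where ssize_t is narrower than std::streamoff:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>
#include <sys/types.h> // ssize_t (POSIX)

// Hypothetical helper, not the actual wsrep-lib function.
// Copies `text` into `buf` as a NUL-terminated string and returns the
// number of characters written, or -1 if the buffer is too small.
ssize_t print_to_c_str(const std::string& text, char* buf, std::size_t buf_len)
{
    assert(buf != 0);
    std::ostringstream os;
    os << text;
    // The stream position converts to std::streamoff, which is
    // 'long long' on the 32-bit platform from the warning above.
    const std::streamoff len = os.tellp();
    if (len < 0 || static_cast<std::size_t>(len) >= buf_len)
    {
        return -1;
    }
    const std::size_t n = static_cast<std::size_t>(len);
    const std::string s = os.str();
    std::memcpy(buf, s.data(), n);
    buf[n] = '\0';
    // The range check above guarantees that the value fits into ssize_t.
    return static_cast<ssize_t>(len);
}
```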

Build fails on 32-bit platform

/root/wsrep-lib/src/view.cpp: In member function 'int wsrep::view::member_index(const wsrep::id&) const':
/root/wsrep-lib/src/view.cpp:34:18: error: comparison between signed and unsigned integer expressions [-Werror=sign-compare]
         if (i != own_index_ && members_[i].id() == member_id) return i;
                  ^
cc1plus: all warnings being treated as errors
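
One way to silence this kind of error is to make the signedness conversion explicit. The sketch below uses a hypothetical view_sketch struct; the member types (an unsigned loop index and a signed own_index that may be -1) are assumptions, since the declarations in view.cpp are not shown here:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplification of wsrep::view; member types are assumptions.
struct view_sketch
{
    std::vector<std::string> members; // member ids
    std::ptrdiff_t own_index;         // signed; -1 when not a member

    // Index of the member with the given id, skipping the node's own
    // entry (mirroring the condition in the error output), or -1 if
    // no such member is found.
    int member_index(const std::string& member_id) const
    {
        assert(own_index < static_cast<std::ptrdiff_t>(members.size()));
        for (std::size_t i = 0; i < members.size(); ++i)
        {
            // Convert the unsigned loop index to the signed type
            // explicitly. Converting own_index to unsigned instead
            // would turn -1 into a large positive value.
            if (static_cast<std::ptrdiff_t>(i) != own_index &&
                members[i] == member_id)
            {
                return static_cast<int>(i);
            }
        }
        return -1;
    }
};
```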

Typo in view.cpp::void wsrep::view::print it should be protocol instead of prococol

Replace prococol with protocol:

void wsrep::view::print(std::ostream& os) const
{
    os << "  id: " << state_id() << "\n"
       << "  status: " << to_c_string(status()) << "\n"
       << "  prococol_version: " << protocol_version() << "\n"
       << "  capabilities: " << provider::capability::str(capabilities())<<"\n"
       << "  final: " << (final() ? "yes" : "no") << "\n"
       << "  own_index: " << own_index() << "\n"
       << "  members(" << members().size() << "):\n";

    for (std::vector<wsrep::view::member>::const_iterator i(members().begin());
         i != members().end(); ++i)
    {
        os << "\t" << (i - members().begin()) /* ordinal index */
           << ": " << i->id()
           << ", " << i->name() << "\n";
    }
}

Rolling Upgrade: 10.4 node failed with Assertion `active() == false' failed

Opening this issue here because the actual assertion happened in:
"/home/shako/Galera_Tests/MariaDB/wsrep-lib/src/transaction.cpp", line=159

Here is the scenario:

  • Started 5 nodes running 10.3
  • Upgraded node5 to 10.4
  • Ran the following SQL statements on one of the 10.3 nodes:
CREATE TABLE test.sbtest4(id int(10) NOT NULL, PRIMARY KEY (id))ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS test.temp4 AS(SELECT * FROM test.sbtest4)
CREATE TABLE IF NOT EXISTS test.temp5 AS (SELECT * FROM test.sbtest4)
  • Node5 is lost

Backtrace output:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ff0d9c38801 in __GI_abort () at abort.c:79
#2  0x00007ff0d9c2839a in __assert_fail_base (fmt=0x7ff0d9daf7d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x55ac37dc3457 "active() == false", 
    file=file@entry=0x55ac37dc3418 "/home/shako/Galera_Tests/MariaDB/wsrep-lib/src/transaction.cpp", line=line@entry=159, 
    function=function@entry=0x55ac37dc47e0 <wsrep::transaction::start_transaction(wsrep::ws_handle const&, wsrep::ws_meta const&)::__PRETTY_FUNCTION__> "int wsrep::transaction::start_transaction(const wsrep::ws_handle&, const wsrep::ws_meta&)") at assert.c:92
#3  0x00007ff0d9c28412 in __GI___assert_fail (assertion=0x55ac37dc3457 "active() == false", 
    file=0x55ac37dc3418 "/home/shako/Galera_Tests/MariaDB/wsrep-lib/src/transaction.cpp", line=159, 
    function=0x55ac37dc47e0 <wsrep::transaction::start_transaction(wsrep::ws_handle const&, wsrep::ws_meta const&)::__PRETTY_FUNCTION__> "int wsrep::transaction::start_transaction(const wsrep::ws_handle&, const wsrep::ws_meta&)") at assert.c:101
#4  0x000055ac3789a59c in wsrep::transaction::start_transaction (this=0x7ff0ac006c30, ws_handle=..., ws_meta=...)
    at /home/shako/Galera_Tests/MariaDB/wsrep-lib/src/transaction.cpp:159
#5  0x000055ac36f3dcac in wsrep::client_state::start_transaction (this=0x7ff0ac006bc0, wsh=..., meta=...)
    at /home/shako/Galera_Tests/MariaDB/wsrep-lib/include/wsrep/client_state.hpp:366
#6  0x000055ac36f3b4df in Wsrep_high_priority_service::start_transaction (this=0x7ff0d45395f0, ws_handle=..., ws_meta=...)
    at /home/shako/Galera_Tests/MariaDB/sql/wsrep_high_priority_service.cc:198
#7  0x000055ac378907f7 in apply_write_set (server_state=..., high_priority_service=..., ws_handle=..., ws_meta=..., data=...)
    at /home/shako/Galera_Tests/MariaDB/wsrep-lib/src/server_state.cpp:192
#8  0x000055ac37894056 in wsrep::server_state::on_apply (this=0x55ac39f1e830, high_priority_service=..., ws_handle=..., ws_meta=..., data=...)
    at /home/shako/Galera_Tests/MariaDB/wsrep-lib/src/server_state.cpp:944
#9  0x000055ac378aa045 in wsrep::high_priority_service::apply (this=0x7ff0d45395f0, ws_handle=..., ws_meta=..., data=...)
    at /home/shako/Galera_Tests/MariaDB/wsrep-lib/include/wsrep/high_priority_service.hpp:46
#10 0x000055ac378a773e in (anonymous namespace)::apply_cb (ctx=0x7ff0d45395f0, wsh=0x7ff0d4538860, flags=73, buf=0x7ff0d4538870, meta=0x7ff0d4538b20, 
    exit_loop=0x7ff0d4538abd) at /home/shako/Galera_Tests/MariaDB/wsrep-lib/src/wsrep_provider_v26.cpp:489
#11 0x00007ff0d7864492 in galera::TrxHandleSlave::apply (this=this@entry=0x7ff0ac05f980, recv_ctx=recv_ctx@entry=0x7ff0d45395f0, 
    apply_cb=0x55ac378a750f <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, 
    meta=..., exit_loop=exit_loop@entry=@0x7ff0d4538abd: false) at /home/shako/Galera_Tests/Galera/galera/src/trx_handle.cpp:418
#12 0x00007ff0d7898225 in galera::ReplicatorSMM::apply_trx (this=this@entry=0x55ac39f506c0, recv_ctx=recv_ctx@entry=0x7ff0d45395f0, ts=...)
    at /home/shako/Galera_Tests/Galera/galera/src/replicator_smm.cpp:489
#13 0x00007ff0d789c4ee in galera::ReplicatorSMM::process_trx (this=0x55ac39f506c0, recv_ctx=0x7ff0d45395f0, ts_ptr=...)
    at /home/shako/Galera_Tests/Galera/galera/src/replicator_smm.cpp:2114
#14 0x00007ff0d787ebf4 in galera::GcsActionSource::process_writeset (this=this@entry=0x55ac39f59db0, recv_ctx=0x7ff0d45395f0, act=..., exit_loop=@0x7ff0d453937e: false)
    at /home/shako/Galera_Tests/Galera/galera/src/gcs_action_source.cpp:62
#15 0x00007ff0d787ed85 in galera::GcsActionSource::dispatch (this=this@entry=0x55ac39f59db0, recv_ctx=recv_ctx@entry=0x7ff0d45395f0, act=..., 
    exit_loop=@0x7ff0d453937e: false) at /home/shako/Galera_Tests/Galera/galera/src/gcs_action_source.cpp:109
#16 0x00007ff0d787f159 in galera::GcsActionSource::process (this=0x55ac39f59db0, recv_ctx=0x7ff0d45395f0, exit_loop=@0x7ff0d453937e: false)
    at /home/shako/Galera_Tests/Galera/galera/src/gcs_action_source.cpp:182
#17 0x00007ff0d7895720 in galera::ReplicatorSMM::async_recv (this=0x55ac39f506c0, recv_ctx=0x7ff0d45395f0)
    at /home/shako/Galera_Tests/Galera/galera/src/replicator_smm.cpp:383
#18 0x00007ff0d78b16c8 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at /home/shako/Galera_Tests/Galera/galera/src/wsrep_provider.cpp:236
#19 0x000055ac378a8430 in wsrep::wsrep_provider_v26::run_applier (this=0x55ac39f1ece0, applier_ctx=0x7ff0d45395f0)
    at /home/shako/Galera_Tests/MariaDB/wsrep-lib/src/wsrep_provider_v26.cpp:690
#20 0x000055ac36f5dfb4 in wsrep_replication_process (thd=0x7ff0ac000d60, arg=0x0) at /home/shako/Galera_Tests/MariaDB/sql/wsrep_thd.cc:61
#21 0x000055ac36f4f766 in start_wsrep_THD (arg=0x55ac39fd7c60) at /home/shako/Galera_Tests/MariaDB/sql/wsrep_mysqld.cc:2768
#22 0x00007ff0dab336db in start_thread (arg=0x7ff0d453a700) at pthread_create.c:463
#23 0x00007ff0d9d1988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

client_state.cpp:121: int wsrep::client_state::before_command(): Assertion `server_state_.rollback_mode() == wsrep::server_state::rm_async' failed.

An assertion

client_state.cpp:121: int wsrep::client_state::before_command(): Assertion `server_state_.rollback_mode() == wsrep::server_state::rm_async' failed.

occurs if the client session acquires ownership via

wait_rollback_complete_and_acquire_ownership()

before the BF abort happens. Because the client session now has ownership, control is not handed to the rollbacker, and it is up to the client session to terminate the transaction. The state therefore has to remain s_must_abort, and the client session hits the wrong assertion in before_command().

Handle transient errors from desync() in desync_and_pause()

The desync operation in server_state::desync_and_pause() may fail due to transient errors from the provider, for example if the node has dropped from the primary component. This, however, should not prevent pausing the provider.

Suggested fix:

  • Ignore desync error in desync_and_pause()
  • Record desync status into a member variable which can be inspected in resume_and_resync() to decide if resume should be called or not.
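
The two suggestions above can be sketched as follows. provider_stub and server_sketch are hypothetical types, not the wsrep-lib classes, and the sketch assumes that the recorded status decides whether the provider is resynced after it has been resumed:

```cpp
#include <cassert>

// Hypothetical provider stub; the real wsrep::provider interface differs.
struct provider_stub
{
    bool fail_desync; // simulate a transient desync error
    bool desynced;
    bool paused;

    provider_stub() : fail_desync(false), desynced(false), paused(false) {}

    int desync() { if (fail_desync) return 1; desynced = true; return 0; }
    int resync() { desynced = false; return 0; }
    int pause()  { paused = true; return 0; }
    int resume() { paused = false; return 0; }
};

// Hypothetical stand-in for the relevant part of wsrep::server_state.
class server_sketch
{
public:
    explicit server_sketch(provider_stub& provider)
        : provider_(provider), desynced_on_pause_(false) {}

    int desync_and_pause()
    {
        // A desync failure is recorded but does not prevent pausing.
        desynced_on_pause_ = (provider_.desync() == 0);
        return provider_.pause();
    }

    int resume_and_resync()
    {
        assert(provider_.paused);
        int ret = provider_.resume();
        // Assumption: resync only if the earlier desync succeeded.
        if (ret == 0 && desynced_on_pause_)
        {
            ret = provider_.resync();
        }
        desynced_on_pause_ = false;
        return ret;
    }

    bool desynced_on_pause() const { return desynced_on_pause_; }

private:
    provider_stub& provider_;
    bool desynced_on_pause_;
};
```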

reporter asserts on startup

The switch case for wsrep::server_state::s_initializing:

    case wsrep::server_state::s_initializing:
        if (s_disconnected_disconnected == state_)
            return s_disconnected_initializing;
        else if (s_joining_sst == state_)
            return s_joining_initializing;
        else if (s_joining_initializing == state_)
            return s_joining_initializing; // continuation
        else
        {
            assert(0);
            return state_;
        }

asserts on startup when state_ is wsrep::reporter::s_joining_initialized.
(Note that the input is a wsrep::server_state::state and the output is a wsrep::reporter::state: one enum is translated into a different enum.)

So the reporter state somehow got past initialized_, which happens only after it receives wsrep::server_state::s_initialized, but then it somehow receives wsrep::server_state::s_initializing again.

Deadlock with high-priority transaction waiting for commit order

This was observed during multi-master testing with two high-priority transactions, T1 and T2. T1, which was ordered first, faced a lock conflict due to asymmetric locking and had to check the state of T2. However, T2 had already called provider::commit_order_enter() without releasing its mutex. T2 could not proceed because it was ordered after T1, and T1 was blocked trying to acquire T2's mutex, so a deadlock occurred.
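
A common way to avoid this kind of deadlock is to release the transaction's mutex for the duration of the blocking call. The sketch below is an assumption about the shape of such a fix, not the change made in wsrep-lib; enter_commit_order and the callable passed to it are hypothetical stand-ins for the code around provider::commit_order_enter():

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical helper: calls a potentially blocking commit-order entry
// function with the transaction mutex released, so that an earlier
// ordered transaction can lock the mutex and inspect the state.
int enter_commit_order(std::unique_lock<std::mutex>& lock,
                       const std::function<int()>& commit_order_enter)
{
    assert(lock.owns_lock());
    lock.unlock();                        // do not block while holding it
    const int ret = commit_order_enter(); // may wait for earlier seqnos
    lock.lock();                          // reacquire before touching state
    return ret;
}
```

Any state read before the call has to be rechecked after the mutex is reacquired, since another thread may have changed it in between.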

Fix server_state FSM transitions

There are two mutually dependent issues with wsrep::server_state FSM transitions:

  1. it prematurely switches to s_synced on donor when SST start fails:
    state(lock, s_synced);
  2. it arbitrarily switches to s_joined, since there is no dedicated JOINED callback/event in the API:
    state(lock, s_joined);

However, all of this can be resolved if we consider that the s_joined state is actually reached when the last committed seqno becomes greater than or equal to the connected seqno.
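
The proposed condition can be written as a small transition function. This is a sketch with a reduced, hypothetical state set and seqno type; the function name after_commit and its parameters are illustrative and not part of wsrep-lib:

```cpp
#include <cassert>

// Hypothetical, reduced state set and seqno type for illustration.
enum state_sketch { s_joiner, s_joined, s_synced };
typedef long long seqno_sketch;

// Returns the state after a commit: a joiner becomes joined once the
// last committed seqno has caught up with the connected seqno.
state_sketch after_commit(state_sketch current,
                          seqno_sketch last_committed,
                          seqno_sketch connected_seqno)
{
    assert(last_committed >= 0 && connected_seqno >= 0);
    if (current == s_joiner && last_committed >= connected_seqno)
    {
        return s_joined;
    }
    return current;
}
```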

Make origin of log message visible

Currently all messages are logged without any indication where they come from. As there will now be three different sources of wsrep logging in the application log (application patch, wsrep-lib, provider library), it may become difficult to locate the origin of the log message.

It was suggested to provide a prefix for all log messages which would indicate the origin:

  • lib - originates from wsrep-lib
  • provider library name - originates from loaded provider library

With this scheme, the log messages in the application log would look like the following, assuming that the application logging callback uses the WSREP prefix for all log messages corresponding to wsrep functionality:

WSREP: This message comes from the application wsrep patch
WSREP-lib: This message comes from wsrep-lib
WSREP-Galera: This message comes from loaded Galera provider library
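
A minimal sketch of how such prefixes could be composed. The helper with_origin is hypothetical; how the origin would actually be passed to the application logging callback is not specified in this issue:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper. An empty origin stands for the application patch,
// "lib" for wsrep-lib, and any other value for the provider library name.
std::string with_origin(const std::string& origin, const std::string& message)
{
    // The origin becomes part of the prefix, so it must not contain
    // the separator itself.
    assert(origin.find(':') == std::string::npos);
    const std::string prefix = origin.empty() ? std::string("WSREP")
                                              : "WSREP-" + origin;
    return prefix + ": " + message;
}
```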

transaction::bf_abort() does not abort non-active transaction

transaction::bf_abort() checks whether the transaction is active, i.e. whether it has been started by transaction::start_transaction(), and only aborts victims with an active transaction.
However, some client sessions on the MySQL/MariaDB side can hold locks and end up as brute force abort victims even though they have not started a transaction yet. One such example is Create Table As Select (CTAS) execution, which causes the galera.galera_concurrent_ctas test to hang forever.

before_command() wait for ongoing rollbacks leaks

wsrep::client_state::before_command() has a wait loop that pauses client execution until an external rollback of the client's transaction has completed.

The external rollbacker sets the client's transaction state to aborting, and the wait loop checks whether the transaction state is aborting:

    while (transaction_.state() == wsrep::transaction::s_aborting)
    {
        cond_.wait(lock);
    }

The rollbacker sets the transaction state to aborted in transaction::after_rollback(). This happens first; only later in the rollbacker's execution is client_state::sync_rollback_complete() called and the cond_ signal sent.
This sequence has a race condition: if the rollbacker has called transaction::after_rollback() but has not yet called client_state::sync_rollback_complete(), an incoming client will pass the before_command() wait loop, and after that both the client and the rollbacker will operate on the same client state until the rollbacker completes.
This race condition accounts for some sporadic failures in multi-master conflict testing, e.g. with galera.galera_FK_duplicate_client_insert.
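
One possible way to close the window is to make the wait loop depend on a flag that is cleared only when the rollbacker has fully finished, rather than on the transaction state. The sketch below is a simplified, hypothetical model (client_sketch and the rollback_in_progress_ flag are not wsrep-lib names) and is not necessarily how the issue was fixed:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical, simplified model; not the wsrep-lib classes.
class client_sketch
{
public:
    enum txn_state { s_executing, s_aborting, s_aborted };

    client_sketch() : state_(s_executing), rollback_in_progress_(false) {}

    // The rollbacker takes over the transaction.
    void start_rollback()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        state_ = s_aborting;
        rollback_in_progress_ = true;
    }

    // Models transaction::after_rollback(): only the state changes here.
    void after_rollback()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        state_ = s_aborted;
    }

    // Models client_state::sync_rollback_complete(): rollbacker is done.
    void sync_rollback_complete()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(state_ == s_aborted);
            rollback_in_progress_ = false;
        }
        cond_.notify_all();
    }

    // True while a client entering before_command() would block.
    bool must_wait()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return rollback_in_progress_;
    }

    void before_command()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        // Wait on the flag instead of the transaction state, so the
        // window between after_rollback() and sync_rollback_complete()
        // is covered.
        while (rollback_in_progress_)
        {
            cond_.wait(lock);
        }
    }

    txn_state state()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    txn_state state_;
    bool rollback_in_progress_;
};
```

In this model a client that arrives after after_rollback() but before sync_rollback_complete() still has to wait, because the flag is cleared in the same critical section that precedes the notification.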

cmake ignores custom boost specified by cli parameter

12:43:27 -- Could NOT find Boost: Found unsuitable version "1.53.0", but required is at least "1.54.0" (found /usr/include, found components: unit_test_framework)
12:43:27 -- Performing Test FOUND_BOOST_TEST_INCLUDED_UNIT_TEST_HPP
12:44:58 -- Performing Test FOUND_BOOST_TEST_INCLUDED_UNIT_TEST_HPP - Failed
12:44:58 CMake Error at wsrep-lib/CMakeLists.txt:179 (message):
12:44:58 Boost unit test header not found

The issue is seen in 8.0 and 8.4; for 8.4 the error is fatal.

https://jenkins.galeracluster.com/job/mysql-8.4-v26-build-debug/3/console
https://jenkins.galeracluster.com/job/mysql-8.0-v26-build-debug/8/console

Complete server_state state transitions

Not all use cases have been covered, especially changing state to disconnecting in case of errors during server initialization.

Implement unit tests to cover legitimate server state transitions and fix accordingly.

Improve diagnostic output for certain operations

Some methods in wsrep-lib still hide or ignore return codes from the provider, which complicates diagnostics and debugging, e.g.

2020-03-26 15:48:15 0 [ERROR] WSREP: Failed to create a new provider '/home/elenst/galera/galera-4.so' with options '':Failed to set encryption key

Here we have a very generic error statement which gives little insight into the issue: was there something wrong with the key, was the function not implemented, etc.

Improve error logging for:

  • sst_sent()
  • sst_received()
  • set_encryption_key()

Allow setting encryption key before joining the group

Otherwise there is a plaintext leak in the very beginning of the writeset stream.

Currently src/server_state.cpp has

wsrep::server_state::set_encryption_key(std::vector<unsigned char>& key)
{
    encryption_key_ = key;
    if (state_ != s_disconnected)
    {
      ...

i.e. setting the key is skipped if the provider is not connected. This is clearly a mistake, as the key needs to be set before the provider starts to receive and cache any data.

Making clear output in wsrep::server_state::load_provider

A space is missing between the provider path and "initial position". Current output:

[Note] WSREP: Loading provider /home/shako/Galera_Tests/Galera-4.x/libgalera_smm.soinitial position: a9d884c5-2b82-11e9-bd75-1f5b01606d57:0

Expected output:

[Note] WSREP: Loading provider /home/shako/Galera_Tests/Galera-4.x/libgalera_smm.so initial position: a9d884c5-2b82-11e9-bd75-1f5b01606d57:0
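
The fix amounts to a leading space in the string literal that follows the provider path. A sketch with a hypothetical loading_message helper (the actual statement in load_provider is not reproduced here):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper that builds the message text for illustration.
std::string loading_message(const std::string& provider,
                            const std::string& initial_position)
{
    assert(!provider.empty());
    std::ostringstream os;
    // Note the leading space in " initial position: ".
    os << "Loading provider " << provider
       << " initial position: " << initial_position;
    return os.str();
}
```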

Provide service interface call to recover streaming transactions

Streaming transaction handles must be created and destructed dynamically when the server is joining and leaving the cluster. This should be done in total order in order to avoid race conditions with fragment applying. The streaming transaction handles are already destructed when the server leaves the cluster, but the recovery in total order is still missing.

Provide a service interface call which will be called in total order whenever the streaming transactions should be recovered.
