
codership / galera


Synchronous multi-master replication library

License: GNU General Public License v2.0

Python 0.91% C++ 86.86% C 5.40% Shell 6.00% Makefile 0.03% Perl 0.06% M4 0.05% CMake 0.68% Assembly 0.02%

galera's People

Contributors

abychko, aglarendil, andreasverhoeven, ayurchen, bryanpkc, denis-protivensky, dirtysalt, eworm-de, fauust, fromdualjb, gjedeer, grooverdan, janlindstrom, javacruft, kennethpjdyer, mend-bolt-for-github[bot], mkaruza, ottok, philip-galera, philip-stoev, sciascid, shinguz, sjaakola, stefan-langenmaier, sthibaul, temeo, vasild


galera's Issues

Add Indication for User Unserviceable Parameters

There are a handful of user-unserviceable parameters in galeraparameters.rst. Use the grayed-out entries on the wiki as an indicator of which ones they are. Add some indication to galeraparameters.rst to show the reader that they are just for troubleshooting, and not for production. Strengthen the language in the entries to further stress this point.

Failure Simulation

The old configuration page carried a section on failure simulation that made three general points on ways in which it could be done. Expand this into a full tutorial on how to break the cluster in controlled ways to test replication and prepare for actual breakage. Add the new page to the Getting Started Guide, either as a section on testingcluster.rst or a separate page just after.
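
As a starting point, the kinds of controlled failures such a tutorial could cover might look like the following sketch (host names are placeholders; 4567 is the default Galera replication port, and SSH access to the test nodes is assumed):

# Simulate a network partition: drop Galera replication traffic on one node
ssh node2 "iptables -A INPUT -p tcp --dport 4567 -j DROP"
# ...observe how the cluster reacts, then heal the partition
ssh node2 "iptables -D INPUT -p tcp --dport 4567 -j DROP"

# Simulate a hung node without killing the process (cf. the stop/cont tests)
ssh node2 "kill -STOP \$(pidof mysqld)"
sleep 30
ssh node2 "kill -CONT \$(pidof mysqld)"

# Simulate a hard crash
ssh node2 "kill -9 \$(pidof mysqld)"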

wsrep_osu_method=RSU allows only one ALTER TABLE to run concurrently

Version: 5.6.15-56 Percona XtraDB Cluster (GPL), Release 25.5, Revision 759, wsrep_25.5.r4061

Running schema changes in parallel is useful: it speeds up the schema changes by using more CPU cores, and it means nodes spend less time in 'maintenance mode'.
This currently fails with RSU.

The same error is given as in #63

Example

On a single node, with 2 sessions, do...

First create 2 tables and put some data in them, so that the ALTER TABLE takes a few seconds, enough time to start another ALTER TABLE on the second table in another session.
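
A minimal setup sketch, assuming a scratch database named test (the table and column names match the transcript below; the row counts are arbitrary, chosen only to make the ALTER slow enough):

mysql test <<'EOF'
CREATE TABLE test  (id INT AUTO_INCREMENT PRIMARY KEY, a INT);
CREATE TABLE test2 (id INT AUTO_INCREMENT PRIMARY KEY, a INT);
INSERT INTO test (a) VALUES (1);
EOF
# double the row count on each pass until the table is big enough (~1M rows)
for i in $(seq 1 20); do
    mysql test -e "INSERT INTO test (a) SELECT a FROM test"
done
mysql test -e "INSERT INTO test2 (a) SELECT a FROM test"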

session1 mysql> set global wsrep_osu_method=rsu;
Query OK, 0 rows affected (0.00 sec)

RSU

session1 mysql> alter table test add key (a);

Add an index, then immediately run the other ALTER on the second table in the second session:

session2 mysql> alter table test2 add key (a);
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

You immediately get a deadlock issue.

Log output:

2014-06-17 10:26:04 3875 [Note] WSREP: Member 0.0 (node1) desyncs itself from group
2014-06-17 10:26:04 3875 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 33)
2014-06-17 10:26:04 3875 [Note] WSREP: Provider paused at 62eb8c72-f601-11e3-b42c-ab6847529d86:33 (70)
2014-06-17 10:26:05 3875 [ERROR] WSREP: Node desync failed.: 11 (Resource temporarily unavailable)
     at galera/src/replicator_smm.cpp:desync():1623
2014-06-17 10:26:05 3875 [Warning] WSREP: RSU desync failed 3 for alter table test2 add key (a)
2014-06-17 10:26:05 3875 [Warning] WSREP: ALTER TABLE isolation failure
2014-06-17 10:26:06 3875 [Note] WSREP: resuming provider at 70
2014-06-17 10:26:06 3875 [Note] WSREP: Provider resumed.
2014-06-17 10:26:06 3875 [Note] WSREP: Member 0.0 (node1) resyncs itself to group
2014-06-17 10:26:06 3875 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 33)
2014-06-17 10:26:06 3875 [Note] WSREP: Member 0.0 (node1) synced with group.
2014-06-17 10:26:06 3875 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 33)

Scriptable SST Calling Convention section is missing

On the old documentation wiki, there was a section on the Scriptable SST page that covered calling conventions, which did not get carried over to the new docs.

The rest of the Scriptable SST material can be found at the bottom of statetransfer.rst. If the addition makes the page too long, consider breaking it off into a separate scriptablesst.rst file.

inconsistent results from automated donor selection with segments

A freshly installed cluster of 9 nodes, configured into segments in the following way (a configuration sketch follows the list):

segment 1: node1, node4, node7
segment 2: node2, node5, node8
segment 3: node3, node6, node9
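
For reference, the segment a node belongs to is set through the gmcast.segment provider option; a sketch of how the layout above might be configured (the rest of the wsrep settings are omitted):

# node1, node4, node7
mysqld --wsrep_provider_options="gmcast.segment=1" ...
# node2, node5, node8
mysqld --wsrep_provider_options="gmcast.segment=2" ...
# node3, node6, node9
mysqld --wsrep_provider_options="gmcast.segment=3" ...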

Nodes were started with test suite script:

./tests/scripts/command.sh restart

First view was formed by node1 alone:

2014-05-12 13:21:08 5274 [Note] WSREP: New cluster view: global state:
 479307ce-d9d8-11e3-99d1-b6a5ab25498a:0, view# 1: Primary,
 number of nodes: 1, my index: 0, protocol version 2

All other nodes joined in the second view:

2014-05-12 13:21:32 5274 [Note] WSREP: New cluster view: global state:
 479307ce-d9d8-11e3-99d1-b6a5ab25498a:0, view# 2: Primary,
 number of nodes: 9, my index: 0, protocol version 2

All nodes except node4 and node7 managed to request and receive the state snapshot properly.

Node4 printed in its own log:

2014-05-12 13:21:35 5503 [Note] WSREP: Requesting state transfer failed: 
-11(Resource temporarily unavailable). Will keep retrying every 1 second(s)

but after this there is no trace of SST requests from node4 in any log. The SST requesting thread was sitting in:

Thread 3 (Thread 0x7fe4c9285700 (LWP 5621)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fe4c61a36d4 in gcs_replv (conn=0x358fd10, act_in=0x7fe4c92832f0, 
    act=0x7fe4c9283410, scheduled=false) at gcs/src/gcs.c:1542
#2  0x00007fe4c619e6ff in gcs_repl (conn=0x358fd10, action=0x7fe4c9283410, 
    scheduled=false) at gcs/src/gcs.h:229
#3  0x00007fe4c61a3d20 in gcs_request_state_transfer (conn=0x358fd10, 
    req=0x7fe4ac00f1b0, size=50, 
    donor=0x7fe4c81283f8 <std::string::_Rep::_S_empty_rep_storage+24> "", 
    ist_uuid=0x7fe4c9283690, ist_seqno=-1, local=0x7fe4c92834f8)
    at gcs/src/gcs.c:1643
#4  0x00007fe4c620292b in galera::Gcs::request_state_transfer (this=0x3588b60, 
    req=0x7fe4ac00f1b0, req_len=50, sst_donor=..., ist_uuid=..., ist_seqno=-1, 
    seqno_l=0x7fe4c92834f8) at galera/src/gcs.hpp:169
#5  0x00007fe4c62100e9 in galera::ReplicatorSMM::send_state_request (
    this=0x3588620, req=0x7fe4ac00f180) at galera/src/replicator_str.cpp:566
#6  0x00007fe4c6210946 in galera::ReplicatorSMM::request_state_transfer (
    this=0x3588620, recv_ctx=0x7fe4ac0009a0, group_uuid=..., group_seqno=0, 
    sst_req=0x7fe4ac00f150, sst_req_len=36)
    at galera/src/replicator_str.cpp:657
#7  0x00007fe4c61fff7f in galera::ReplicatorSMM::process_conf_change (
    this=0x3588620, recv_ctx=0x7fe4ac0009a0, view_info=..., repl_proto=5, 
    next_state=galera::Replicator::S_CONNECTED, seqno_l=2)
    at galera/src/replicator_smm.cpp:1414
#8  0x00007fe4c61dc263 in galera::GcsActionSource::dispatch (this=0x3588ca8, 
    recv_ctx=0x7fe4ac0009a0, act=..., exit_loop=@0x7fe4c92843c9: false)
    at galera/src/gcs_action_source.cpp:141
#9  0x00007fe4c61dc4d4 in galera::GcsActionSource::process (this=0x3588ca8, 
    recv_ctx=0x7fe4ac0009a0, exit_loop=@0x7fe4c92843c9: false)
    at galera/src/gcs_action_source.cpp:176
#10 0x00007fe4c61f9f94 in galera::ReplicatorSMM::async_recv (this=0x3588620, 
    recv_ctx=0x7fe4ac0009a0) at galera/src/replicator_smm.cpp:357
#11 0x00007fe4c62150aa in galera_recv (gh=0x351fc10, recv_ctx=0x7fe4ac0009a0)
    at galera/src/wsrep_provider.cpp:231
#12 0x00000000006403ac in wsrep_replication_process (thd=0x7fe4ac0009a0)
    at /home/teemu/work/bzr/codership-mysql/5.6/sql/wsrep_thd.cc:309
#13 0x000000000061d906 in start_wsrep_THD (
    arg=0x6402e5 <wsrep_replication_process(THD*)>)
    at /home/teemu/work/bzr/codership-mysql/5.6/sql/mysqld.cc:5349
#14 0x00007fe4c8130f6e in start_thread (arg=0x7fe4c9285700)
    at pthread_create.c:311
#15 0x00007fe4c763d9cd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Node7 printed:

2014-05-12 13:21:35 5504 [Note] WSREP: Requesting state transfer failed:
-11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
...
2014-05-12 13:21:43 5504 [Note] WSREP: Member 1.1 (node7) requested state
transfer from '*any*'. Selected 0.1 (node1)(SYNCED) as donor.
2014-05-12 13:21:43 5504 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 0)

but node1 said:

2014-05-12 13:21:41 5274 [Note] WSREP: Member 1.1 (node7) requested state
transfer from '*any*'. Selected 3.2 (node8)(SYNCED) as donor.
2014-05-12 13:21:42 5274 [ERROR] WSREP:
gcs/src/gcs_group.c:gcs_group_handle_state_request():1311: Member 1.1 (node7)
requested state transfer, but its state is JOINER. Ignoring.
2014-05-12 13:21:43 5274 [ERROR] WSREP:
gcs/src/gcs_group.c:gcs_group_handle_state_request():1311: Member 1.1 (node7)
requested state transfer, but its state is JOINER. Ignoring.
2014-05-12 13:21:44 5274 [Note] WSREP: 3.2 (node8): State transfer to 1.1 (node7)
complete.

and there is no trace of donating SST to node7 in node8's log, just:

2014-05-12 13:21:43 5211 [Note] WSREP: Member 1.1 (node7) requested state 
transfer from '*any*'. Selected 0.1 (node1)(SYNCED) as donor.

need library to store various states in one file

currently there are two states stored in two different files:

  1. grastate.dat stores galera replication state
  2. gvwstate.dat stores last prim view state for pc recovery. see #10

but it's just too ugly. so we need some utility or library that offers these capabilities:

  1. store various states in one file.
  2. modifying one state should not affect other states.
  3. modifications should be atomic and preserve the file's integrity.

ps: actually 'state' here could be extended to refer to any data.
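
A minimal sketch of the atomicity requirement in point 3, using the classic write-to-temp-then-rename pattern, which never leaves a half-written file behind (the state file path and the merge step are hypothetical):

STATE_FILE=/var/lib/mysql/node.state
TMP_FILE="$STATE_FILE.tmp.$$"

# merge_states is a hypothetical helper that rewrites one state section
# while copying the others through unchanged
merge_states "$STATE_FILE" new_state.frag > "$TMP_FILE"
sync
mv "$TMP_FILE" "$STATE_FILE"   # rename(2) is atomic on POSIX filesystems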

Clean up Symptoms and Solutions

At the moment, symptomsolution.rst renders as a table where the particular symptom is listed in the left column and a very detailed solution is listed in the right column. While the logic of this is pretty clear, on a webpage the table formatting makes it difficult to read and to find what you're looking for. Additionally, the table entries do not possess id references, so, in the event of a client having difficulties, it's not possible to link them to a particular symptom / solution pairing on the page.

Rewrite the table material into separate sections. If there are enough entries to justify it, set up a link table like on the parameters pages.

Remove Orphaned Files

There are a handful of files that are either part of an older draft or have had their material integrated elsewhere in the documentation. (See the Sphinx warnings for reference.) There are backups on hand if they need to be referenced later, but the files should be removed from the main docs build to avoid loose HTML files turning up on the new site.

Missing and invalid parameter descriptions

Originally reported here: https://groups.google.com/forum/#!topic/codership-team/O4vHuGBKRC0

http://galeracluster.com/documentation-webpages/galeraparameters.html - 'replication.*' should be 'repl.*'.

missing options on this page:

  • socket.checksum
  • repl.key_format
  • repl.proto_max
  • repl.max_ws_size
  • pc.announce_timeout
  • pc.wait_prim_timeout
  • gmcast.segment
  • base_host
  • base_port
  • cert.log_conflicts

other suggestions:

4.16.6 selinux - a link to https://blogs.oracle.com/jsmyth/entry/selinux_and_mysql would probably help

5.3.1. Diagnosing Multi-Master Conflicts

could include a reference to cert.log_conflicts
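
For reference, turning the option on at runtime might look like this (a sketch; whether the YES/NO or ON/OFF spelling is accepted depends on the provider's boolean parsing):

mysql -e "SET GLOBAL wsrep_provider_options='cert.log_conflicts=YES'"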

gcs hangs after exception generated self-leave from gcomm

An exception thrown from gcomm caused the gcs gcomm backend to generate a self-leave message. After processing the self-leave, only one of the gcs threads is able to exit cleanly:

140518  4:28:46 [ERROR] WSREP: exception from gcomm, backend must be restarted:evs::proto(ba547f34, GATHER, view_id(REG,ba547f34,7)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
         at gcomm/src/evs_proto.cpp:handle_install_timer():612
140518  4:28:46 [Note] WSREP: Received self-leave message.
140518  4:28:46 [Note] WSREP: Flow-control interval: [0, 0]
140518  4:28:46 [Note] WSREP: Received SELF-LEAVE. Closing connection.
140518  4:28:46 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 0)
140518  4:28:46 [Note] WSREP: RECV thread exiting 0: Success
140518  4:28:46 [Note] WSREP: New cluster view: global state: 1e57ce94-ddb7-11e3-96a4-06023ac83134:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
140518  4:28:46 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140518  4:28:46 [Note] WSREP: applier thread exiting (code:0)

All other receiver threads are stuck inside gcs waiting for messages:

Thread 2 (Thread 0x7f5be41dc700 (LWP 18645)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f5be6a54cae in fifo_lock_get (q=q@entry=0x2cb8520)
    at galerautils/src/gu_fifo.c:233
#2  0x00007f5be6a552a8 in gu_fifo_get_head (q=0x2cb8520, 
    err=err@entry=0x7f5be41db28c) at galerautils/src/gu_fifo.c:292
#3  0x00007f5be6b0d85c in gcs_recv (conn=0x2c8ffe0, action=0x7f5be41db2e0)
    at gcs/src/gcs.c:1671
#4  0x00007f5be6b39deb in galera::Gcs::recv (this=<optimized out>, act=...)
    at galera/src/gcs.hpp:103
#5  0x00007f5be6b22f98 in galera::GcsActionSource::process (this=0x2c89450, 
    recv_ctx=0x7f5bb4000a00, exit_loop=@0x7f5be41db35d: false)
    at galera/src/gcs_action_source.cpp:164
#6  0x00007f5be6b37deb in galera::ReplicatorSMM::async_recv (this=0x2c88e90, 
    recv_ctx=0x7f5bb4000a00) at galera/src/replicator_smm.cpp:390
#7  0x00007f5be6b4351d in galera_recv (gh=<optimized out>, 
    recv_ctx=<optimized out>) at galera/src/wsrep_provider.cpp:213
#8  0x000000000067eec3 in wsrep_replication_process (thd=0x7f5bb4000a00)
    at /home/teemu/work/bzr/codership-mysql/5.5/sql/wsrep_thd.cc:253
#9  0x000000000050d451 in start_wsrep_THD (
    arg=0x67ee4e <wsrep_replication_process(THD*)>)
    at /home/teemu/work/bzr/codership-mysql/5.5/sql/mysqld.cc:4482
#10 0x00007f5be9066f6e in start_thread (arg=0x7f5be41dc700)
    at pthread_create.c:311
#11 0x00007f5be814f9cd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Both gcs conn and conn->core show closed state:

(gdb) p *conn
$1 = {my_idx = -1, memb_num = 0, my_name = 0x0, channel = 0x0, socket = 0x0, 
  state = GCS_CONN_CLOSED, config = 0x2c88ea0, config_is_local = false, 
  params = {fc_resume_factor = 1, recv_q_soft_limit = 0.25, 
    max_throttle = 0.25, recv_q_hard_limit = 8301034833169298432, 
    fc_base_limit = 16, max_packet_size = 64500, fc_debug = 0, 
    fc_master_slave = false, sync_donor = false}, gcache = 0x2c890c0, 
  sm = 0x7f5be9bc3010, local_act_id = 7, global_seqno = 0, repl_q = 0x2c903a0, 
  send_thread = 0, recv_q = 0x2cb8520, recv_q_size = 0, 
  recv_thread = 140032442623744, timeout = 9223372035999999999, fc_lock = {
    __data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, 
      __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 39 times>, __align = 0}, conf_id = 4294967295, 
  stop_sent = 0, stop_count = 0, queue_len = 0, upper_limit = 0, 
  lower_limit = 0, fc_offset = 0, max_fc_state = GCS_CONN_JOINED, 
  stats_fc_sent = 0, stats_fc_received = 0, stfc = {
    hard_limit = 8301034833169298432, soft_limit = 2075258708292324608, 
    max_throttle = 0.25, init_size = 0, size = 0, last_sleep = 0, 
    act_count = 0, max_rate = 0, scale = 0, offset = 0, start = 0, debug = 0, 
    sleep_count = 0, sleeps = 0}, need_to_join = false, join_seqno = 0, 
  sync_sent = false, core = 0x2c901c0}
(gdb) p *conn->core
$2 = {config = 0x2c88ea0, cache = 0x2c890c0, prim_comp_no = 0, 
  state = CORE_CLOSED, proto_ver = 0, send_lock = {__data = {__lock = 0, 
      __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, 
      __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 39 times>, __align = 0}, send_buf = 0x2ca2500, 
  send_buf_len = 32636, send_act_no = 1, recv_msg = {buf = 0x2c924f0, 
    buf_len = 65536, size = 64, sender_idx = -1, type = GCS_MSG_COMPONENT}, 
  fifo = 0x2c8fac0, group = {cache = 0x2c890c0, act_id = 0, conf_id = -1, 
    state_uuid = {data = "\320\ri\234\336D\021\343\222\vs\301\276q<E"}, 
    group_uuid = {data = "\036WΔݷ\021㖤\006\002:\310\061\064"}, num = 0, 
    my_idx = -1, my_name = 0x2c8fbe0 "node1", 
    my_address = 0x2c8d910 "192.168.16.11:3311", 
    state = GCS_GROUP_NON_PRIMARY, last_applied = 0, last_node = 0, 
    frag_reset = true, nodes = 0x0, prim_uuid = {
      data = "\272U\370s\336D\021㈀\367\316ˋ\245a"}, prim_seqno = 1, 
    prim_num = 1, prim_state = GCS_NODE_STATE_SYNCED, 
    gcs_proto_ver = 0 '\000', repl_proto_ver = 4, appl_proto_ver = 2, 
    quorum = {group_uuid = {data = "\036WΔݷ\021㖤\006\002:\310\061\064"}, 
      act_id = 0, conf_id = 0, primary = true, version = 2, gcs_proto_ver = 0, 
      repl_proto_ver = 4, appl_proto_ver = 2}, last_applied_proto_ver = 1}, 
  msg_size = 0, backend = {conn = 0x2cc2000, open = 0x7f5be6b0f979
     <gcomm_open(gcs_backend_t*, char const*, bool)>, 
    close = 0x7f5be6b0fd68 <gcomm_close(gcs_backend_t*)>, 
    destroy = 0x7f5be6b0e968 <gcomm_destroy(gcs_backend_t*)>, 
    send = 0x7f5be6b10657 <gcomm_send(gcs_backend_t*, void const*, size_t, gcs_msg_type_t)>, 
    recv = 0x7f5be6b0f4d9 <gcomm_recv(gcs_backend_t*, gcs_recv_msg_t*, long long)>, name = 0x7f5be6b0e8b0 <gcomm_name()>, 
    msg_size = 0x7f5be6b0eadc <gcomm_msg_size(gcs_backend_t*, long)>, 
    param_set = 0x7f5be6b0ebbb <gcomm_param_set(gcs_backend_t*, char const*, char const*)>, 
    param_get = 0x7f5be6b0e8b8 <gcomm_param_get(gcs_backend_t*, char const*)>}}

Implement inconsistency voting

Currently a single inconsistent node can abort all other nodes by committing a transaction that cannot be applied on slaves. This is because a slave has no way to establish which node is at fault: itself or the master of the transaction.
The task is to implement post-factum error voting before abort. If a majority of the nodes see the same error with the same transaction, they will consider themselves consistent and force the minority to abort.

Depends on #1

Document PC recovery feature

The 3.x branch has a new wsrep provider option, pc.recovery (boolean). If the value for this parameter is set to true, the primary component state is stored to disk and recovered from there when the node is started (see the configuration sketch at the end of this issue).

The primary component is recovered automatically once all of the nodes that were present in the last known primary component have established communication with each other.

This feature solves two use cases:

  • Automatic recovery from full cluster crash (for example full data center power outage)
  • Graceful full cluster restart without the need for bootstrapping new primary component explicitly

Known limitations:

  • If the wsrep position differs between nodes during the recovery, full SST is required
  • ...
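
A configuration sketch for the option described above (everything besides the pc.recovery setting is omitted):

mysqld --wsrep_provider_options="pc.recovery=true" ...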

/etc/init.d/garb

In the script there is a statement that tries to connect to one of the cluster nodes. The statement used is:

nc -z $HOST $PORT >/dev/null && break

This statement fails because the -z option is not supported anymore.
I changed this statement into:

2>/dev/null >/dev/tcp/${HOST}/${PORT} && break

This new statement works as designed, just like the old nc -z statement did.
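
For context, the check normally sits in a polling loop; a sketch of the pattern with the replacement in place (not the literal init script):

# wait until one of the cluster nodes accepts a TCP connection
# note: the /dev/tcp redirection is a bash feature, not POSIX sh
for HOST in $CLUSTER_NODES; do
    2>/dev/null >/dev/tcp/${HOST}/${PORT} && break
done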

Backing up Cluster Data section is outdated

The page says:

"Galera Cluster backups can be performed just as regular MySQL backups, using a backup script. Since all the cluster nodes are identical, backing up one node backs up the entire cluster.

However, such backups will have no global transaction IDs associated with them. You can use these backups to recover data, but they cannot be used to recover a Galera Cluster node to a well-defined state. Furthermore, the backup procedure may block the cluster operation for the duration of backup, in the case of a blocking backup."

However:

  • Blocking cluster operation can be avoided with the wsrep_desync option, which is present in recent versions of Galera Cluster (see the sketch below)
  • In the case of an rsync backup, the Global Transaction ID can be recovered from the backed-up InnoDB tablespace
  • Xtrabackup should have an option to record the GTID consistently

This section should be expanded to cover ways to take backups without the aid of garbd.
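
A sketch of the non-blocking approach from the first point above (the backup command itself is a placeholder for whatever tool is in use):

# take the node out of flow control so the backup cannot stall the cluster
mysql -e "SET GLOBAL wsrep_desync=ON"
innobackupex /backup    # placeholder backup step
mysql -e "SET GLOBAL wsrep_desync=OFF"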

Messages sent by replicator layer in one configuration may be delivered in another

In particular this concerns the "commit cut" message. While normally this should not be a problem, imagine a situation where a PC with UUID1:1000 is followed by a PC with UUID2:10. In this case a commit cut of 100 may cause some gcache buffers to be released too soon.

The solution seems to be to accompany each message with the UUID of the configuration it was sent from. If the message is delivered in the wrong configuration, it can be discarded.

The probability of harm is low; however, this may result in data loss. The fix requires an upgrade of the GCS protocol.

General Revision for index.rst and the Overview section

The landing page for the docs and the files in the Overview section require general revisions. The landing page, index.rst, contains a brief line, the toctree, links to index and search, and the legal notice. Overview contains a brief introduction to Galera Cluster and a few files more appropriate to the Reference section.

Move the introductory material from Overview up onto index.rst. Move reference material in Overview and the legal notice on index.rst down into the Reference section.

Revise toctree Elements

By default, the Sphinx build system adds numbering to section headers. Trim this down for a cleaner look.

Adapt the header pages to the new format. Replace links with :doc: elements. Use a hidden toctree to keep the navigation links in line.

combining wsrep_OSU_method=RSU and wsrep_desync=on gives strange error #Usability

Version: 5.6.15-56 Percona XtraDB Cluster (GPL), Release 25.5, Revision 759, wsrep_25.5.r4061

node1 mysql> set global wsrep_osu_method=rsu;
Query OK, 0 rows affected (0.00 sec)

We switch to the RSU method.

node1 mysql> set global wsrep_desync=off;
Query OK, 0 rows affected (0.00 sec)
2014-06-17 10:14:26 3875 [Note] WSREP: Member 0.0 (node1) resyncs itself to group
2014-06-17 10:14:26 3875 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 33)
2014-06-17 10:14:26 3875 [Note] WSREP: Member 0.0 (node1) synced with group.
2014-06-17 10:14:26 3875 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 33)
2014-06-17 10:14:26 3875 [Note] WSREP: Synchronized with group, ready for connections

Node is primary and synced, ensuring wsrep_desync is off.

node1 mysql> alter table test add index (a);
2014-06-17 10:14:28 3875 [Note] WSREP: Member 0.0 (node1) desyncs itself from group
2014-06-17 10:14:28 3875 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 33)
2014-06-17 10:14:28 3875 [Note] WSREP: Provider paused at 62eb8c72-f601-11e3-b42c-ab6847529d86:33 (54)
2014-06-17 10:14:30 3875 [Note] WSREP: resuming provider at 54
2014-06-17 10:14:30 3875 [Note] WSREP: Provider resumed.
2014-06-17 10:14:30 3875 [Note] WSREP: Member 0.0 (node1) resyncs itself to group
2014-06-17 10:14:30 3875 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 33)
2014-06-17 10:14:30 3875 [Note] WSREP: Member 0.0 (node1) synced with group.
2014-06-17 10:14:30 3875 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 33)
2014-06-17 10:14:30 3875 [Note] WSREP: Synchronized with group, ready for connections
Query OK, 0 rows affected, 1 warning (2.21 sec)
Records: 0  Duplicates: 0  Warnings: 1

Index added (the warning is because it's a duplicate index)

node1 mysql> set global wsrep_desync=on;
2014-06-17 10:14:34 3875 [Note] WSREP: Member 0.0 (node1) desyncs itself from group
2014-06-17 10:14:34 3875 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 33)
Query OK, 0 rows affected (0.02 sec)

Let's desync the node.

node1 mysql> alter table test add index (a);
2014-06-17 10:14:35 3875 [ERROR] WSREP: Node desync failed.: 11 (Resource temporarily unavailable)
     at galera/src/replicator_smm.cpp:desync():1623
2014-06-17 10:14:35 3875 [Warning] WSREP: RSU desync failed 3 for alter table test add index (a)
2014-06-17 10:14:35 3875 [Warning] WSREP: ALTER TABLE isolation failure
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

If we then try to do an ALTER TABLE, it fails immediately.

I believe that either it should become possible to do RSU on a desynced node, or a more appropriate error should be shown (we have had customers get stuck using such scenarios).
This could be useful as part of a migration, where the node needs to be desynced during the process and a schema change is necessary.

exception from gcomm: mn.operational() == false

Happened during 3-node stop/cont test, node 1 was stopped for 5 seconds.

Try 4/10

Signaling node local1 with STOP... 
140523 02:14:55.512 Job 'signal_cmd' on 'local1' complete in 0 seconds, 
Sleeping for 5 sec.
Signaling node local1 with CONT...
140523 02:15:00.528 Job 'signal_cmd' on 'local1' complete in 0 seconds, 

Node 2 crashed with an exception:

2014-05-23 02:15:00 22990 [ERROR] WSREP: exception from gcomm, backend must be restarted: mn.operational() == false:  (FATAL)
         at gcomm/src/evs_proto.cpp:deliver_trans():2820

and a bit later node 3 crashed with

2014-05-23 02:15:38 23382 [ERROR] WSREP: exception from gcomm, backend must be restarted: evs::proto(bc98dc51, GATHER, view_id(REG,adc6d070,24)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
         at gcomm/src/evs_proto.cpp:handle_install_timer():614

which is probably another incarnation of #37.

Log from node 2 (adc6d070):

2014-05-23 02:15:00 22990 [Note] WSREP: filtering out trans message higher than 
install message hs 3253: {v=0,t=1,ut=255,o=0,s=3255,sr=0,as=3252,f=6,src=c4d1ce2
e,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=43787,nl=(
)
}
2014-05-23 02:15:00 22990 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=-1,sr=-1,as=-1,f=4,src=bc98dc51,srcvid=view_id(REG,adc6d070,25),ru=00000000,r=[-1,-1],fs=47439,nl=(
)
}
 state after handling message: evs::proto(evs::proto(adc6d070, INSTALL, view_id(REG,adc6d070,24)), INSTALL) {
current_view=view(view_id(REG,adc6d070,24) memb {
        adc6d070,0
        bc98dc51,0
        c4d1ce2e,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=3255,safe_seq=3252,node_index=node: {idx=0,range=[3257,3256],safe_seq=3255} node: {idx=1,range=[3256,3255],safe_seq=3253} node: {idx=2,range=[3257,3256],safe_seq=3252} },
fifo_seq=66581,
last_sent=3256,
known:
adc6d070 at 
{o=1,s=0,i=1,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3253,f=0,src=adc6d070,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=66575,nl=(
        adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
},
}
bc98dc51 at tcp://10.0.2.15:10031
{o=1,s=0,i=1,fs=47439,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3253,f=4,src=bc98dc51,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=47437,nl=(
        adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
},
}
c4d1ce2e at tcp://10.0.2.15:10011
{o=0,s=1,i=0,fs=43777,}
install msg={v=0,t=5,ut=255,o=1,s=3252,sr=-1,as=3253,f=4,src=adc6d070,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=66576,nl=(
        adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
}
 }2014-05-23 02:15:00 22990 [ERROR] WSREP: exception from gcomm, backend must be restarted: mn.operational() == false:  (FATAL)
         at gcomm/src/evs_proto.cpp:deliver_trans():2820

and from node 3 (bc98dc51):

G,adc6d070,24)) suspecting node: adc6d070
2014-05-23 02:15:07 23382 [Note] WSREP: evs::proto(bc98dc51, INSTALL, view_id(RE
G,adc6d070,24)) suspecting node: c4d1ce2e
2014-05-23 02:15:08 23382 [Warning] WSREP: evs::proto(bc98dc51, INSTALL, view_id
(REG,adc6d070,24)) install timer expired
evs::proto(evs::proto(bc98dc51, INSTALL, view_id(REG,adc6d070,24)), INSTALL) {
current_view=view(view_id(REG,adc6d070,24) memb {
        adc6d070,0
        bc98dc51,0
        c4d1ce2e,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=3255,safe_seq=3252,node_index=node: {idx=0,ra
nge=[3257,3256],safe_seq=3255} node: {idx=1,range=[3256,3255],safe_seq=3255} nod
e: {idx=2,range=[3257,3256],safe_seq=3252} },
fifo_seq=47455,
last_sent=3255,
known:
adc6d070 at tcp://10.0.2.15:10021
{o=1,s=1,i=0,fs=66580,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3253,f=0,src=adc6d070,srcvid=view_id(REG,adc
6d070,24),ru=00000000,r=[-1,-1],fs=66576,nl=(
        adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[32
56,3255],}
        bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
},
}
bc98dc51 at 
{o=1,s=0,i=1,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3253,f=0,src=bc98dc51,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=47437,nl=(
        adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
},
}
c4d1ce2e at tcp://10.0.2.15:10011
{o=0,s=1,i=0,fs=43777,}
install msg={v=0,t=5,ut=255,o=1,s=3252,sr=-1,as=3253,f=4,src=adc6d070,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=66576,nl=(
        adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}
)
}
 }
2014-05-23 02:15:08 23382 [Note] WSREP: evs::proto(bc98dc51, INSTALL, view_id(REG,adc6d070,24)) node c4d1ce2e failed to commit for install message, declaring inactive
...
2014-05-23 02:15:30 23382 [Note] WSREP: max install timeouts reached, will isola
te node for PT20S
2014-05-23 02:15:30 23382 [Note] WSREP: no install message received
2014-05-23 02:15:38 23382 [Warning] WSREP: evs::proto(bc98dc51, GATHER, view_id(
REG,adc6d070,24)) install timer expired
...
2014-05-23 02:15:38 23382 [Note] WSREP: going to give up, state dump for diagnos
is:
evs::proto(evs::proto(bc98dc51, GATHER, view_id(REG,adc6d070,24)), GATHER) {
current_view=view(view_id(REG,adc6d070,24) memb {
        adc6d070,0
        bc98dc51,0
        c4d1ce2e,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=3255,safe_seq=3252,node_index=node: {idx=0,range=[3257,3256],safe_seq=3255} node: {idx=1,range=[3256,3255],safe_seq=3255} node: {idx=2,range=[3257,3256],safe_seq=3252} },
fifo_seq=47489,
last_sent=3255,
known:
adc6d070 at tcp://10.0.2.15:10021
{o=0,s=1,i=0,fs=66580,}
bc98dc51 at 
{o=1,s=0,i=0,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=3252,sr=-1,as=3255,f=0,src=bc98dc51,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=47489,nl=(
        adc6d070, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3255,ir=[3257,3256],}
        bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3255,ir=[3256,3255],}
        c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3257,3256],}
)
},
}
c4d1ce2e at tcp://10.0.2.15:10011
{o=0,s=1,i=0,fs=43777,}
 }
2014-05-23 02:15:38 23382 [ERROR] WSREP: exception from gcomm, backend must be restarted: evs::proto(bc98dc51, GATHER, view_id(REG,adc6d070,24)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
         at gcomm/src/evs_proto.cpp:handle_install_timer():614

Possible sequence of events that led to this:

  1. Node 1 was stopped just when it was sending a user message with seqno 3256, but neither node 2 nor node 3 received it
  2. Nodes 2 and 3 proceeded to form a new view with highest seen seqno 3255, as indicated by the install message:

install msg={v=0,t=5,ut=255,o=1,s=3252,sr=-1,as=3253,f=4,src=adc6d070,srcvid=view_id(REG,adc6d070,24),ru=00000000,r=[-1,-1],fs=66576,nl=(
        adc6d070, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        bc98dc51, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3253,ir=[3256,3255],}
        c4d1ce2e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,adc6d070,24),ss=3252,ir=[3254,3253],}

  3. Node 1 was resumed while nodes 2 and 3 were processing the new view installation, and the user message with seqno 3256 leaked into the node 2 and node 3 input maps. Node 2 generated a completing user message 3256 too, but node 3 didn't
  4. Node 2, which crashed first, proceeded to shift to operational state and started delivering transitional view messages, detected that it was delivering a message with a higher seqno than agreed in the install message, and aborted
  5. Node 3 also declared node 2 inactive after a while and tried to form a new view but failed, possibly due to #37.

To prevent this from happening, the filtering of incoming messages in the install stage must be tightened. If an install message has been received, do stricter checking of incoming messages:

  • Message from a non-operational source: drop
  • Message from an operational source which does not match the install message state: drop and shift to gather state

This will work because:

  • If a message not matching the generated install message is received before the install message, the install message will be rejected and the representative will generate a new install message once the install timer expires
  • If a message not matching the processed install message is received, it is dropped and the node state won't be altered

Asynchronous Replication: galera_apply_status and galera_binlog_index tables

When combining asynchronous replication with Galera, we need to track the last Xid applied if we want to replicate from another "Master" node.

Currently, the only way to know the last GTID applied is to:

  • Read the slave's relay-log, and find out the Xid as a comment.
  • Go to a master, find the matching position for this Xid
  • Reconfigure the slave to replicate from position found in Step 2.

This is very manual; it could be automated, but it could also be enhanced by having the following (see the sketch after this list):

  • On Master(s):
    binlog_index table, containing server_id, Xid, binlog name, binlog start and binlog end.
  • On Slave(s):
    binlog_apply table, containing server_id, server_name, cluster_id, Xid, binlog name (..)
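
A sketch of what the proposed tables might look like; the DDL and column types here are hypothetical, and the trailing columns of binlog_apply, elided in the proposal, are left out:

mysql mysql <<'EOF'
-- on master(s): map each Xid to its binlog coordinates
CREATE TABLE binlog_index (
  server_id    INT UNSIGNED    NOT NULL,
  xid          BIGINT UNSIGNED NOT NULL,
  binlog_name  VARCHAR(255)    NOT NULL,
  binlog_start BIGINT UNSIGNED NOT NULL,
  binlog_end   BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (server_id, xid)
);
-- on slave(s): record the last Xid applied from each master
CREATE TABLE binlog_apply (
  server_id   INT UNSIGNED    NOT NULL,
  server_name VARCHAR(64)     NOT NULL,
  cluster_id  INT UNSIGNED    NOT NULL,
  xid         BIGINT UNSIGNED NOT NULL,
  binlog_name VARCHAR(255)    NOT NULL,
  PRIMARY KEY (server_id)
);
EOF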

Example of implementation:
http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-replication-schema.html

Thanks,
Joffrey

exception last prims not consistent

Nine-node multi-site setup; all nine nodes were started simultaneously. Three of them crashed with a PC layer exception after the full group of 9 was reached. Debug log from one of the crashed nodes:

140515 14:02:01 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '192.168.17.12:10081'
140515 14:02:01 [Note] WSREP: evs::proto(7ccb744c, CLOSED, view_id(TRANS,7ccb744c,0)):  state change: CLOSED -> JOINING
140515 14:02:03 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') turning message relay requesting on, nonlive peers: tcp://192.168.17.11:10011 tcp://192.168.17.11:10041 tcp://192.168.17.11:10071 tcp://192.168.17.12:10021 tcp://192.168.17.12:10051 tcp://192.168.17.13:10031 tcp://192.168.17.13:10061 
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, JOINING, view_id(TRANS,7ccb744c,0)):  detected new message source 7cfbf591
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, JOINING, view_id(TRANS,7ccb744c,0)):  shift to GATHER due to foreign message from 7cfbf591
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, JOINING, view_id(TRANS,7ccb744c,0)):  state change: JOINING -> GATHER
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)):  detected new message source 7da57d09
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)):  shift to GATHER due to foreign message from 7da57d09
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)):  detected new message source 7dc7b695
140515 14:02:03 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)):  shift to GATHER due to foreign message from 7dc7b695
140515 14:02:03 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.12368S), skipping check
140515 14:02:06 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') turning message relay requesting off
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): setting 7d1499c5 inactive in asymmetry elimination
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): before asym elimination
1 1 1 1 1 1 1 1 1 
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)): after asym elimination
1 1 1 0 1 1 1 1 1 
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(TRANS,7ccb744c,0)):  state change: GATHER -> INSTALL
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(TRANS,7ccb744c,0)):  state change: INSTALL -> OPERATIONAL
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(TRANS,7ccb744c,0)):  delivering view view(view_id(TRANS,7ccb744c,0) memb {
    7ccb744c,0
} joined {
} left {
} partitioned {
})
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,3)): delivering view view(view_id(REG,79dcfbbc,3) memb {
    79dcfbbc,0
    7ccb744c,0
    7cfbf591,0
    7d6503d1,0
    7d75423e,0
    7da57d09,0
    7dc7b695,0
    7ddadfcf,0
} joined {
    79dcfbbc,0
    7cfbf591,0
    7d6503d1,0
    7d75423e,0
    7da57d09,0
    7dc7b695,0
    7ddadfcf,0
} left {
} partitioned {
})
140515 14:02:08 [Note] WSREP: declaring 79dcfbbc at tcp://192.168.17.11:10011 stable
140515 14:02:08 [Note] WSREP: declaring 7cfbf591 at tcp://192.168.17.13:10031 stable
140515 14:02:08 [Note] WSREP: declaring 7d6503d1 at tcp://192.168.17.11:10071 stable
140515 14:02:08 [Note] WSREP: declaring 7d75423e at tcp://192.168.17.11:10041 stable
140515 14:02:08 [Note] WSREP: declaring 7da57d09 at tcp://192.168.17.12:10021 stable
140515 14:02:08 [Note] WSREP: declaring 7dc7b695 at tcp://192.168.17.12:10051 stable
140515 14:02:08 [Note] WSREP: declaring 7ddadfcf at tcp://192.168.17.12:10081 stable
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,3)):  detected new message source 7d1499c5
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,3)):  shift to GATHER due to foreign message from 7d1499c5
140515 14:02:08 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,3)):  state change: OPERATIONAL -> GATHER
140515 14:02:08 [Note] WSREP: Node 79dcfbbc state prim
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 79dcfbbc inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7d1499c5 inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7d6503d1 inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7d75423e inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7dc7b695 inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): setting 7ddadfcf inactive in asymmetry elimination
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): before asym elimination
1 1 1 1 1 1 1 1 1 
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): after asym elimination
0 1 1 0 0 0 1 0 0 
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 79dcfbbc requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 7d6503d1 requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 7d75423e requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 7dc7b695 requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): delayed 7ddadfcf requesting range [1,0]
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)): sending install message
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,79dcfbbc,3)):  state change: GATHER -> INSTALL
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(REG,79dcfbbc,3)):  state change: INSTALL -> OPERATIONAL
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(REG,79dcfbbc,3)):  delivering view view(view_id(TRANS,79dcfbbc,3) memb {
    7ccb744c,0
    7cfbf591,0
    7da57d09,0
} joined {
} left {
} partitioned {
    79dcfbbc,0
    7d6503d1,0
    7d75423e,0
    7dc7b695,0
    7ddadfcf,0
})
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,7ccb744c,4)): delivering view view(view_id(REG,7ccb744c,4) memb {
    7ccb744c,0
    7cfbf591,0
    7da57d09,0
} joined {
} left {
} partitioned {
    79dcfbbc,0
    7d6503d1,0
    7d75423e,0
    7dc7b695,0
    7ddadfcf,0
})
140515 14:02:13 [Note] WSREP: declaring 7cfbf591 at tcp://192.168.17.13:10031 stable
140515 14:02:13 [Note] WSREP: declaring 7da57d09 at tcp://192.168.17.12:10021 stable
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,7ccb744c,4)):  detected new message source 7d1499c5
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,7ccb744c,4)):  shift to GATHER due to foreign message from 7d1499c5
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,7ccb744c,4)):  state change: OPERATIONAL -> GATHER
140515 14:02:13 [Note] WSREP: Node 7cfbf591 state prim
140515 14:02:13 [Warning] WSREP: 7ccb744c sending install message failed: Resource temporarily unavailable
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)):  detected new message source 79dcfbbc
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)):  shift to GATHER due to foreign message from 79dcfbbc
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)):  detected new message source 7d6503d1
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)):  shift to GATHER due to foreign message from 7d6503d1
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)):  detected new message source 7d75423e
140515 14:02:13 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)):  shift to GATHER due to foreign message from 7d75423e
140515 14:02:14 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') turning message relay requesting on, nonlive peers: tcp://192.168.17.11:10011 tcp://192.168.17.11:10041 tcp://192.168.17.11:10071 
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 79dcfbbc (tcp://192.168.17.11:10011), attempt 0
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d75423e (tcp://192.168.17.11:10041), attempt 0
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d6503d1 (tcp://192.168.17.11:10071), attempt 0
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7dc7b695 (tcp://192.168.17.12:10051), attempt 0
140515 14:02:16 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7ddadfcf (tcp://192.168.17.12:10081), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 79dcfbbc (tcp://192.168.17.11:10011), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d75423e (tcp://192.168.17.11:10041), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d6503d1 (tcp://192.168.17.11:10071), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7ddadfcf (tcp://192.168.17.12:10081), attempt 0
140515 14:02:17 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7dc7b695 (tcp://192.168.17.12:10051), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d75423e (tcp://192.168.17.11:10041), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7d6503d1 (tcp://192.168.17.11:10071), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7dc7b695 (tcp://192.168.17.12:10051), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 7ddadfcf (tcp://192.168.17.12:10081), attempt 0
140515 14:02:18 [Note] WSREP: (7ccb744c, 'tcp://0.0.0.0:10091') reconnecting to 79dcfbbc (tcp://192.168.17.11:10011), attempt 0
140515 14:02:18 [Note] WSREP: evs::proto(7ccb744c, GATHER, view_id(REG,7ccb744c,4)):  state change: GATHER -> INSTALL
140515 14:02:19 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(REG,7ccb744c,4)):  state change: INSTALL -> OPERATIONAL
140515 14:02:19 [Note] WSREP: evs::proto(7ccb744c, INSTALL, view_id(REG,7ccb744c,4)):  delivering view view(view_id(TRANS,7ccb744c,4) memb {
    7ccb744c,0
    7cfbf591,0
    7da57d09,0
} joined {
} left {
} partitioned {
})
140515 14:02:19 [Note] WSREP: evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,5)): delivering view view(view_id(REG,79dcfbbc,5) memb {
    79dcfbbc,0
    7ccb744c,0
    7cfbf591,0
    7d1499c5,0
    7d6503d1,0
    7d75423e,0
    7da57d09,0
    7dc7b695,0
    7ddadfcf,0
} joined {
    79dcfbbc,0
    7d1499c5,0
    7d6503d1,0
    7d75423e,0
    7dc7b695,0
    7ddadfcf,0
} left {
} partitioned {
})
140515 14:02:19 [Note] WSREP: declaring 79dcfbbc at tcp://192.168.17.11:10011 stable
140515 14:02:19 [Note] WSREP: declaring 7cfbf591 at tcp://192.168.17.13:10031 stable
140515 14:02:19 [Note] WSREP: declaring 7d1499c5 at tcp://192.168.17.13:10061 stable
140515 14:02:19 [Note] WSREP: declaring 7d6503d1 at tcp://192.168.17.11:10071 stable
140515 14:02:19 [Note] WSREP: declaring 7d75423e at tcp://192.168.17.11:10041 stable
140515 14:02:19 [Note] WSREP: declaring 7da57d09 at tcp://192.168.17.12:10021 stable
140515 14:02:19 [Note] WSREP: declaring 7dc7b695 at tcp://192.168.17.12:10051 stable
140515 14:02:19 [Note] WSREP: declaring 7ddadfcf at tcp://192.168.17.12:10081 stable
140515 14:02:19 [Note] WSREP: Node 79dcfbbc state prim
140515 14:02:20 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=7ccb744c,start_prim=0,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=0,checksum=0,instances=
    79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=1
    7ccb744c,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=3
    7cfbf591,prim=1,un=0,last_seq=1,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=3
    7d1499c5,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=3
    7d6503d1,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=1
    7d75423e,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=1
    7da57d09,prim=1,un=0,last_seq=0,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=2
    7dc7b695,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=2
    7ddadfcf,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=2
,state_msgs=
    79dcfbbc,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
    7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
    7ccb744c,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=1
    7ccb744c,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=3
    7cfbf591,prim=1,un=0,last_seq=1,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=3
    7d6503d1,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=1
    7d75423e,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=1
    7da57d09,prim=1,un=0,last_seq=0,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=-1,weight=1,segment=2
    7dc7b695,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=2
    7ddadfcf,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=2
}}
    7cfbf591,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=1
    7ccb744c,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=3
    7cfbf591,prim=1,un=0,last_seq=1,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=3
    7d6503d1,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=1
    7d75423e,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=1
    7da57d09,prim=1,un=0,last_seq=0,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=2
    7dc7b695,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=2
    7ddadfcf,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=2
}}
    7d1499c5,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 7d1499c5,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=-1,weight=1,segment=3
}}
    7d6503d1,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
    7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
    7d75423e,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
    7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
    7da57d09,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=1
    7ccb744c,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=3
    7cfbf591,prim=1,un=0,last_seq=1,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=3
    7d6503d1,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=1
    7d75423e,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=1
    7da57d09,prim=1,un=0,last_seq=0,last_prim=view_id(PRIM,79dcfbbc,2),to_seq=5,weight=1,segment=2
    7dc7b695,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=2
    7ddadfcf,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000,0),to_seq=5,weight=1,segment=2
}}
    7dc7b695,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
    7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
    7ddadfcf,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 79dcfbbc,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d6503d1,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7d75423e,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=1
    7dc7b695,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
    7ddadfcf,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,79dcfbbc,4),to_seq=15,weight=1,segment=2
}}
,current_view=view(view_id(REG,79dcfbbc,5) memb {
    79dcfbbc,0
    7ccb744c,0
    7cfbf591,0
    7d1499c5,0
    7d6503d1,0
    7d75423e,0
    7da57d09,0
    7dc7b695,0
    7ddadfcf,0
} joined {
    79dcfbbc,0
    7d1499c5,0
    7d6503d1,0
    7d75423e,0
    7dc7b695,0
    7ddadfcf,0
} left {
} partitioned {
}),pc_view=view((empty)),mtu=2147483647}
140515 14:02:20 [Note] WSREP: {v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7ddadfcf,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=142,nl=(
)
} 272
140515 14:02:20 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=0,sr=-1,as=0,f=4,src=7cfbf591,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=112,nl=(
)
}
 state after handling message: evs::proto(evs::proto(7ccb744c, OPERATIONAL, view_id(REG,79dcfbbc,5)), OPERATIONAL) {
current_view=view(view_id(REG,79dcfbbc,5) memb {
    79dcfbbc,0
    7ccb744c,0
    7cfbf591,0
    7d1499c5,0
    7d6503d1,0
    7d75423e,0
    7da57d09,0
    7dc7b695,0
    7ddadfcf,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} node: {idx=3,range=[1,0],safe_seq=0} node: {idx=4,range=[1,0],safe_seq=0} node: {idx=5,range=[1,0],safe_seq=0} node: {idx=6,range=[1,0],safe_seq=0} node: {idx=7,range=[1,0],safe_seq=0} node: {idx=8,range=[1,0],safe_seq=0} ,msg_index= (8,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7ddadfcf,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=142,nl=(
)
}
,recovery_index=    (0,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=79dcfbbc,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=163,nl=(
)
}
    (1,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=0,src=7ccb744c,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=109,nl=(
)
}
    (2,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7cfbf591,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=110,nl=(
)
}
    (3,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7d1499c5,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=134,nl=(
)
}
    (4,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7d6503d1,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=151,nl=(
)
}
    (5,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7d75423e,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=145,nl=(
)
}
    (6,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7da57d09,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=120,nl=(
)
}
    (7,0),{v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7dc7b695,srcvid=view_id(REG,79dcfbbc,5),ru=00000000,r=[-1,-1],fs=139,nl=(
)
}
},
fifo_seq=111,
last_sent=0,
known:
79dcfbbc at tcp://192.168.17.11:10011
{o=1,s=0,i=1,fs=165,}
7ccb744c at 
{o=1,s=0,i=1,fs=-1,}
7cfbf591 at tcp://192.168.17.13:10031
{o=1,s=0,i=1,fs=112,}
7d1499c5 at tcp://192.168.17.13:10061
{o=1,s=0,i=1,fs=136,}
7d6503d1 at tcp://192.168.17.11:10071
{o=1,s=0,i=1,fs=153,}
7d75423e at tcp://192.168.17.11:10041
{o=1,s=0,i=1,fs=147,}
7da57d09 at tcp://192.168.17.12:10021
{o=1,s=0,i=1,fs=122,}
7dc7b695 at tcp://192.168.17.12:10051
{o=1,s=0,i=1,fs=141,}
7ddadfcf at tcp://192.168.17.12:10081
{o=1,s=0,i=1,fs=144,}
 }140515 14:02:20 [ERROR] WSREP: failed to open gcomm backend connection: 131: 7ccb744c last prims not consistent (FATAL)
     at gcomm/src/pc_proto.cpp:is_prim():736
     at gcomm/src/pc_proto.cpp:handle_msg():1349
     at gcomm/src/evs_proto.cpp:handle_gap():3286
     at gcomm/src/evs_proto.cpp:handle_msg():2108
140515 14:02:20 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -131 (State not recoverable)
140515 14:02:20 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'my_wsrep_cluster' at 'gcomm://192.168.17.12:10081?gmcast.listen_addr=tcp://0.0.0.0:10091': -131 (State not recoverable)
140515 14:02:20 [ERROR] WSREP: gcs connect failed: State not recoverable
140515 14:02:20 [ERROR] WSREP: wsrep::connect() failed: 7
140515 14:02:20 [ERROR] Aborting

140515 14:02:20 [Note] WSREP: Service disconnected.
140515 14:02:21 [Note] WSREP: Some threads may fail to exit.
140515 14:02:21 [Note] /home/vagrant/galera/local9/mysql/sbin/mysqld: Shutdown complete

Support for dynamic status variables

Implement a gu::Status class for gathering Galera-wide status variables dynamically. Append the dynamic status variables in galera::ReplicatorSMM::stats_get().
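
A minimal sketch of what such a class could look like, assuming a plain name/value map that modules fill in on demand; the interface shown here is an assumption for illustration, not the actual gu::Status API:

    // Hypothetical sketch only: a name/value collector that modules can
    // append to at stats_get() time. Member names are illustrative.
    #include <map>
    #include <string>

    namespace gu
    {
        class Status
        {
        public:
            // Record one status variable by name.
            void insert(const std::string& name, const std::string& value)
            {
                map_[name] = value;
            }

            typedef std::map<std::string, std::string>::const_iterator const_iterator;
            const_iterator begin() const { return map_.begin(); }
            const_iterator end()   const { return map_.end();   }

        private:
            std::map<std::string, std::string> map_;
        };
    }

galera::ReplicatorSMM::stats_get() could then walk the collected pairs and append them after the statically defined status variables.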

Update Documentation Site Theme

Change the Galera documentation over to the new theme.

  • Adjust the CSS for better readability.
  • Push the changes out.
  • Fix the Edit on GitHub link and add it to the lower navigation bar.

galera stuck in cleanup after exception from gcomm

Possibly related to the already closed #38.

2014-05-30 16:05:51 27259 [Note] WSREP: Node 10d5d6f4 state prim
2014-05-30 16:05:51 27259 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
...
2014-05-30 16:05:51 27259 [ERROR] WSREP: exception from gcomm, backend must be restarted: 14c8c47a last prims not consistent (FATAL)
         at gcomm/src/pc_proto.cpp:is_prim():787
         at gcomm/src/pc_proto.cpp:handle_msg():1402
         at gcomm/src/evs_proto.cpp:handle_gap():3299
         at gcomm/src/evs_proto.cpp:handle_msg():2127
2014-05-30 16:05:51 27259 [Note] WSREP: Received self-leave message.
2014-05-30 16:05:51 27259 [Note] WSREP: Flow-control interval: [0, 0]
2014-05-30 16:05:51 27259 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2014-05-30 16:05:51 27259 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 0)
2014-05-30 16:05:51 27259 [Note] WSREP: RECV thread exiting 0: Success
2014-05-30 16:05:51 27259 [Note] WSREP: New cluster view: global state: 10d6e99c-e7fe-11e3-9562-9795814c7421:2182875, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version -1
2014-05-30 16:05:51 27259 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-05-30 16:05:51 27259 [Note] WSREP: Closing send monitor...
2014-05-30 16:05:51 27259 [Note] WSREP: Closed send monitor.
2014-05-30 16:05:51 27259 [Note] WSREP: recv_thread() joined.
2014-05-30 16:05:51 27259 [Note] WSREP: Closing replication queue.
2014-05-30 16:05:51 27259 [Note] WSREP: Closing slave action queue.
2014-05-30 16:05:51 27259 [Note] WSREP: applier thread exiting (code:0)

Three threads remained:

Thread 3 (Thread 0x7f1017112700 (LWP 27264)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007f1015cf6e8d in gu::Lock::wait (this=<optimized out>, cond=...)
    at galerautils/src/gu_lock.hpp:56
#2  0x00007f1015de0962 in galera::ServiceThd::thd_func (arg=0x2c2d260)
    at galera/src/galera_service_thd.cpp:30
#3  0x00007f1017cece9a in start_thread (arg=0x7f1017112700)
    at pthread_create.c:308
#4  0x00007f10172073fd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f1006ffc700 (LWP 27289)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000005d5133 in inline_mysql_cond_wait (src_line=403, 
    src_file=0xb6e838 "/home/vagrant/codership-mysql/sql/wsrep_thd.cc", 
    mutex=<optimized out>, that=<optimized out>)
    at /home/vagrant/codership-mysql/include/mysql/psi/mysql_thread.h:1162
#2  wsrep_rollback_process (thd=0x7f0ff8000990)
    at /home/vagrant/codership-mysql/sql/wsrep_thd.cc:403
#3  0x00000000005bd137 in start_wsrep_THD (arg=0x5d4c70)
    at /home/vagrant/codership-mysql/sql/mysqld.cc:5350
#4  0x00007f1017cece9a in start_thread (arg=0x7f1006ffc700)
    at pthread_create.c:308
#5  0x00007f10172073fd in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f10191aa740 (LWP 27259)):
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000005ce5db in inline_mysql_cond_wait (mutex=<optimized out>, 
    that=<optimized out>, src_file=<optimized out>, src_line=<optimized out>)
    at /home/vagrant/codership-mysql/include/mysql/psi/mysql_thread.h:1162
#2  inline_mysql_cond_wait (src_line=199, mutex=<optimized out>, 
    that=<optimized out>, src_file=<optimized out>)
    at /home/vagrant/codership-mysql/sql/wsrep_sst.cc:193
#3  wsrep_sst_wait () at /home/vagrant/codership-mysql/sql/wsrep_sst.cc:199
#4  0x00000000005c9805 in wsrep_init_startup (first=true)
    at /home/vagrant/codership-mysql/sql/wsrep_mysqld.cc:699
#5  0x00000000005c1636 in init_server_components ()
    at /home/vagrant/codership-mysql/sql/mysqld.cc:4946
#6  0x00000000005c2235 in mysqld_main (argc=36, argv=0x2baa3d8)
    at /home/vagrant/codership-mysql/sql/mysqld.cc:6063
#7  0x00007f101713476d in __libc_start_main (
    main=0x5a14b0 <main(int, char**)>, argc=15, ubp_av=0x7fffad9c8cd8, 
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffad9c8cc8) at libc-start.c:226
#8  0x00000000005b4a5d in _start ()

However, it is a bit unclear whether this is a Galera-side or a MySQL-side issue.
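
All three backtraces show threads blocked in pthread_cond_wait() with nobody left to signal them once the applier has exited. A minimal sketch of the wake-on-shutdown pattern that appears to be missing on this error path, purely illustrative and not the actual Galera or MySQL code:

    // Sketch: a guarded wait that a shutdown path can always wake up.
    // The mutex and condition variable are assumed to be initialized.
    #include <pthread.h>

    struct Waiter
    {
        pthread_mutex_t mtx;
        pthread_cond_t  cond;
        bool            closing;
    };

    // Worker side: never block indefinitely once closing is set.
    void wait_for_work(Waiter& w)
    {
        pthread_mutex_lock(&w.mtx);
        while (!w.closing /* && no work queued */)
        {
            pthread_cond_wait(&w.cond, &w.mtx);
        }
        pthread_mutex_unlock(&w.mtx);
    }

    // Shutdown side: set the flag under the mutex, then wake all waiters.
    void close_waiter(Waiter& w)
    {
        pthread_mutex_lock(&w.mtx);
        w.closing = true;
        pthread_cond_broadcast(&w.cond);
        pthread_mutex_unlock(&w.mtx);
    }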

possible bugs

Some possible bugs, with a patch; they still need to be confirmed.

Add LOST_EVENTS Feature to Galera

Hi,

In ndbcluster, when the MySQL server is restarted, it writes a LOST_EVENTS event in the binary log. On the slave, this message is what we need in order to fail the replication channel over to another node that was up during the restart, so as not to lose transactions.

Can we add this functionality to Galera? This could be:

  • Automatic
    At each restart of a Galera-enabled node, a lost event is written in the binlog.
  • As a parameter
    If wsrep_write_lost_events = 1, then the lost event is written in the binlog.

Thanks,
Joffrey

Primary node leaves cluster and causes gcomm fatal exception in other nodes

The environment is a Percona XtraDB Cluster composed of three nodes (db1, db2 and db3). db2 was acting as the primary node when it got overloaded a few times; during the last of these the gcomm background thread was stalled for around 20 seconds:

2014-05-26 14:42:49 27548 [Warning] WSREP: last inactive check more than PT1.5S ago (PT21.1142S), skipping check

Because of that, the node dropped from the group while it was requesting an SST, but soon after it decided to abort, given that SST was not possible.

While aborting, it tried to leave the cluster gracefully by closing the gcomm connection, which caused some message exchange between the nodes and triggered a bug on db1 and db3, effectively crashing both nodes:

db1:
2014-05-26 14:43:21 2979 [Warning] WSREP: evs::proto(9bdb737e-df4a-11e3-87c9-eab020c42bd0, GATHER, view_id(REG,9bdb737e-df4a-11e3-87c9-eab020c42bd0,174)) install timer expired
2014-05-26 14:43:21 2979 [ERROR] WSREP: exception from gcomm, backend must be restarted: NodeMap::value(i).leave_message() == 0: (FATAL)

db3:
2014-05-26 14:43:21 30104 [Warning] WSREP: evs::proto(c7a2a117-daa4-11e3-8b73-863bb950f40a, GATHER, view_id(REG,9bdb737e-df4a-11e3-87c9-eab020c42bd0,174)) install timer expired
2014-05-26 14:43:21 30104 [ERROR] WSREP: exception from gcomm, backend must be restarted: NodeMap::value(i).leave_message() == 0: (FATAL)

This happened in Percona-XtraDB-Cluster-galera-3-3.5-1.216.rhel6.x86_64, built with Galera 25.3.4

evs exception: failed to form singleton view (leave processing)

Starting 9 nodes concurrently, one of the nodes timed out waiting for prim at the moment prim was being re-bootstrapped:

140518  4:28:07 [Note] WSREP: re-bootstrapping prim from partitioned components
140518  4:28:08 [Note] WSREP: evs::proto(bdefe1c5, OPERATIONAL, view_id(REG,ba547f34,7)):  state change: OPERATIONAL -> LEAVING
140518  4:28:08 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():141
140518  4:28:08 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():202: Failed to open backend connection: -110 (Connection timed out)
140518  4:28:08 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'my_wsrep_cluster' at 'gcomm://192.168.17.11:10071?gmcast.listen_addr=tcp://0.0.0.0:10081': -110 (Connection timed out)
140518  4:28:08 [ERROR] WSREP: gcs connect failed: Connection timed out
140518  4:28:08 [ERROR] WSREP: wsrep::connect() failed: 7
140518  4:28:08 [ERROR] Aborting

All the others aborted due to failure to reach consensus:

140518  4:28:45 [Note] WSREP: going to give up, state dump for diagnosis:
evs::proto(evs::proto(be2b058f, GATHER, view_id(REG,ba547f34,7)), GATHER) {
current_view=view(view_id(REG,ba547f34,7) memb {
    ba547f34,
    bded5f55,
    bdee555e,
    bdefe1c5,
    be0d179f,
    be12d899,
    be183c60,
    be25f051,
    be2b058f,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=14,safe_seq=13,node_index=
node: {idx=0,range=[15,14],safe_seq=14}
node: {idx=1,range=[15,14],safe_seq=14}
node: {idx=2,range=[15,14],safe_seq=14}
node: {idx=3,range=[15,14],safe_seq=13}
node: {idx=4,range=[15,14],safe_seq=14}
node: {idx=5,range=[15,14],safe_seq=14}
node: {idx=6,range=[15,14],safe_seq=14}
node: {idx=7,range=[15,14],safe_seq=14}
node: {idx=8,range=[15,14],safe_seq=14} ,
msg_index=  (0,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=ba547f34,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=201,nl=(
)
}
    (1,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bded5f55,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=179,nl=(
)
}
    (2,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bdee555e,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=183,nl=(
)
}
    (3,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=bdefe1c5,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=157,nl=(
)
}
    (4,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be0d179f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=199,nl=(
)
}
    (5,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be12d899,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=187,nl=(
)
}
    (6,14),{v=0,t=1,ut=5,o=4,s=14,sr=0,as=13,f=4,src=be183c60,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=162,nl=(
)
}
    (7,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=4,src=be25f051,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=170,nl=(
)
}
    (8,14),{v=0,t=1,ut=255,o=0,s=14,sr=0,as=13,f=0,src=be2b058f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=161,nl=(
)
}
,recovery_index=},
fifo_seq=224,
last_sent=14,
known:
ba547f34 at tcp://192.168.17.11:10011
{o=0,s=1,i=0,fs=231,}
bded5f55 at tcp://192.168.17.11:10071
{o=0,s=1,i=0,fs=209,}
bdee555e at tcp://192.168.17.11:10041
{o=0,s=1,i=0,fs=216,}
bdefe1c5 at tcp://192.168.17.12:10081
{o=0,s=1,i=0,fs=158,lm=
{v=0,t=6,ut=255,o=1,s=14,sr=-1,as=13,f=6,src=bdefe1c5,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=158,nl=(
)
},
}
be0d179f at tcp://192.168.17.13:10061
{o=0,s=1,i=0,fs=229,}
be12d899 at tcp://192.168.17.13:10031
{o=0,s=1,i=0,fs=216,}
be183c60 at tcp://192.168.17.12:10051
{o=0,s=1,i=0,fs=193,}
be25f051 at tcp://192.168.17.12:10021
{o=0,s=1,i=0,fs=198,}
be2b058f at 
{o=1,s=0,i=0,fs=-1,jm=
{v=0,t=4,ut=255,o=1,s=13,sr=-1,as=14,f=0,src=be2b058f,srcvid=view_id(REG,ba547f34,7),ru=00000000,r=[-1,-1],fs=224,nl=(
    ba547f34, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    bded5f55, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    bdee555e, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    bdefe1c5, {o=0,s=1,f=0,ls=14,vid=view_id(REG,ba547f34,7),ss=13,ir=[15,14],}
    be0d179f, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be12d899, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be183c60, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be25f051, {o=0,s=1,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
    be2b058f, {o=1,s=0,f=0,ls=-1,vid=view_id(REG,ba547f34,7),ss=14,ir=[15,14],}
)
},
}
 }
140518  4:28:45 [ERROR] WSREP: exception from gcomm, backend must be restarted:evs::proto(be2b058f, GATHER, view_id(REG,ba547f34,7)) failed to form singleton view after exceeding max_install_timeouts 3, giving up (FATAL)
     at gcomm/src/evs_proto.cpp:handle_install_timer():612

It looks like the leaving node failed to acknowledge all messages it had received due to the exception, and the remaining nodes failed to reach consensus because of that.

To fix this, the remaining nodes must decide at some point that the leaving node won't be sending any more messages and ignore its safe seq in the consensus computation. A raised suspected flag could be one such indication.
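
A minimal sketch of that idea, with illustrative types; the real evs code derives this from its input map, so this only shows the shape of the change:

    // Sketch: compute the group-wide safe seq while ignoring nodes whose
    // suspected flag is raised, since they will not acknowledge anything.
    #include <map>

    typedef long long seqno_t;

    struct NodeState
    {
        bool    suspected; // set once the node stops responding
        seqno_t safe_seq;  // highest seqno the node acknowledged as safe
    };

    seqno_t consensus_safe_seq(const std::map<int, NodeState>& nodes)
    {
        seqno_t ret   = -1;
        bool    first = true;
        for (std::map<int, NodeState>::const_iterator i = nodes.begin();
             i != nodes.end(); ++i)
        {
            if (i->second.suspected) continue; // skip the leaving node
            if (first || i->second.safe_seq < ret)
            {
                ret   = i->second.safe_seq;
                first = false;
            }
        }
        return ret;
    }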

Documentation with garbd

On Fedora (and maybe some other distributions) the /etc/sudoers file contains a line which says:
Defaults requiretty
You must comment out this line before you can start garbd, because /etc/init.d/garb contains a sudo -u nobody statement which fails with the standard Fedora sudoers file.

Improve evs install state processing for big groups

Originally reported in: https://bugs.launchpad.net/galera/+bug/1271918

The problematic part of the membership protocol is the moment when the install message is received. When many nodes suddenly start to see each other, it is probable that they have a different set of known nodes than the representative had at the time of processing the install message. Therefore install message handling and install state message processing should be altered somewhat to make group forming more stable.

A related issue is still open because of the missing install state improvements: https://bugs.launchpad.net/galera/+bug/1249805

gmcast relay set not updated properly

The relay set is not updated in gmcast_forget() after removing a ProtoMap entry. There seem to be other places too where a Proto entry is removed from the ProtoMap but the relay set is not updated accordingly. This may leave invalid pointer references in the relay set.
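
A hedged sketch of the invariant, assuming the relay set holds raw pointers to the same Proto objects the ProtoMap owns; the names are illustrative, not the gmcast types:

    // Sketch: every erase from the ProtoMap must also purge the relay set,
    // otherwise the set keeps a dangling pointer to the deleted Proto.
    #include <map>
    #include <set>

    struct Proto { /* connection state */ };
    typedef std::map<int, Proto*> ProtoMap; // keyed by connection id (assumption)
    typedef std::set<Proto*>      RelaySet;

    void erase_proto(ProtoMap& pm, RelaySet& relay, ProtoMap::iterator i)
    {
        Proto* const p = i->second;
        relay.erase(p); // keep the relay set consistent before freeing
        pm.erase(i);
        delete p;
    }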

refactoring of UUID

There are at least three types of UUID:

  1. wsrep_uuid_t
  2. gu_uuid_t
  3. gcomm::UUID

and three different files are involved in UUID processing:

  1. galera/src/uuid.hpp
  2. galerautils/src/gu_uuid.h
  3. gcomm/src/gcomm/uuid.hpp

We should refactor the UUID code and move the common parts into the galerautils module.
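
A minimal sketch of the direction, assuming the common representation stays the 16-byte value underlying gu_uuid_t; the class and member names are illustrative only:

    // Sketch: one shared UUID value type in galerautils that the wsrep and
    // gcomm layers could wrap or typedef instead of redefining.
    #include <stdint.h>
    #include <cstring>

    namespace gu
    {
        class UUID
        {
        public:
            UUID() { std::memset(data_, 0, sizeof(data_)); }

            explicit UUID(const void* raw) // from wsrep_uuid_t / gu_uuid_t bytes
            {
                std::memcpy(data_, raw, sizeof(data_));
            }

            bool operator==(const UUID& other) const
            {
                return 0 == std::memcmp(data_, other.data_, sizeof(data_));
            }

        private:
            uint8_t data_[16]; // same layout as the existing C structs
        };
    }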

cluster crash recovery

Copied from t848.

The typical scenario is a data center outage: after the power supply is back, cluster nodes should be promoted to the primary component automatically.

The designed process is:

  1. Recover the last known PC from disk and start all nodes in non-prim.
  2. Every time a node's membership changes, check whether the non-prim view's membership is equal to the last known PC's membership. If yes, promote to the primary component (a sketch of this check follows below).
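
A minimal sketch of the check in step 2, assuming memberships compare as sets of node UUIDs; the types and names are illustrative:

    // Sketch: bootstrap the primary component only when the current
    // non-prim view contains exactly the members of the last known PC.
    #include <set>
    #include <string>

    typedef std::set<std::string> Membership; // node UUIDs

    bool should_bootstrap_prim(const Membership& current_view,
                               const Membership& last_known_pc)
    {
        return current_view == last_known_pc;
    }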

gmcast uses incorrect outgoing IP

If I have two IPs on one node and start two mysql/galera instances with the two different IPs, they cannot work properly because they use the wrong IPs.

For example, one node has 192.168.0.100 and 192.168.0.101. Instance A uses 192.168.0.100 as its listen address, and B uses 192.168.0.101 as its listen address. When A connects to B, A uses 192.168.0.101 as the source address. And when B finds that the connection's remote/peer address is its own listen address, B ignores the connection, which causes communication problems between them.

I guess some users also have this problem. https://groups.google.com/forum/#!topic/codership-team/_DCRJIgKY20/discussion

The way to fix it is to bind the socket to the listen address IP when connecting to other nodes, as in the sketch below.
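
A minimal sketch of that fix with plain BSD sockets, purely illustrative of the idea rather than of the gmcast implementation:

    // Sketch: bind the outgoing socket to the configured listen address
    // before connect(), so the peer sees the expected source IP.
    // Error checking of inet_pton() is omitted for brevity.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <stdint.h>

    int connect_from(const char* local_ip, const char* peer_ip,
                     uint16_t peer_port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        sockaddr_in local = sockaddr_in();
        local.sin_family = AF_INET;
        local.sin_port   = 0; // ephemeral source port
        inet_pton(AF_INET, local_ip, &local.sin_addr);
        if (bind(fd, (sockaddr*)&local, sizeof(local)) != 0)
        {
            close(fd);
            return -1;
        }

        sockaddr_in peer = sockaddr_in();
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(peer_port);
        inet_pton(AF_INET, peer_ip, &peer.sin_addr);
        if (connect(fd, (sockaddr*)&peer, sizeof(peer)) != 0)
        {
            close(fd);
            return -1;
        }

        return fd; // outgoing traffic now carries the listen address
    }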

UPDATE:
related issues:
https://bugs.launchpad.net/galera/+bug/1240964
t801

GCS PC-remerge code can't handle remerge of multiple PCs

Example (happened over WiFi link):

140511 13:38:32 [Warning] WSREP: Quorum: No node with complete state:

    Version      : 2
    Flags        : 3
    Protocols    : 0 / 4 / 2
    State        : NON-PRIMARY
    Prim state   : SYNCED
    Prim UUID    : 2ab816e0-d8f8-11e3-a62c-639606db485d
    Prim seqno   : 37
    Last seqno   : 0
    Prim JOINED  : 3
    State UUID   : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
    Group UUID   : 2108a884-d873-11e3-a155-d6a734a2dc8b
    Name         : 'home0'
    Incoming addr: '10.21.32.1:3305'

    Version      : 2
    Flags        : 2
    Protocols    : 0 / 4 / 2
    State        : NON-PRIMARY
    Prim state   : SYNCED
    Prim UUID    : 2ab816e0-d8f8-11e3-a62c-639606db485d
    Prim seqno   : 37
    Last seqno   : 0
    Prim JOINED  : 3
    State UUID   : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
    Group UUID   : 2108a884-d873-11e3-a155-d6a734a2dc8b
    Name         : 'home2'
    Incoming addr: '10.21.32.1:3303'

    Version      : 2
    Flags        : 2
    Protocols    : 0 / 4 / 2
    State        : NON-PRIMARY
    Prim state   : SYNCED
    Prim UUID    : 2ab816e0-d8f8-11e3-a62c-639606db485d
    Prim seqno   : 37
    Last seqno   : 0
    Prim JOINED  : 3
    State UUID   : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
    Group UUID   : 2108a884-d873-11e3-a155-d6a734a2dc8b
    Name         : 'home1'
    Incoming addr: '10.21.32.1:3304'

    Version      : 2
    Flags        : 2
    Protocols    : 0 / 4 / 2
    State        : NON-PRIMARY
    Prim state   : SYNCED
    Prim UUID    : 351c8568-d8f6-11e3-a7dc-027f4ff287d7
    Prim seqno   : 35
    Last seqno   : 0
    Prim JOINED  : 6
    State UUID   : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
    Group UUID   : 2108a884-d873-11e3-a155-d6a734a2dc8b
    Name         : 'centos6-32'
    Incoming addr: '192.168.122.119:3306'

    Version      : 2
    Flags        : 2
    Protocols    : 0 / 4 / 2
    State        : NON-PRIMARY
    Prim state   : SYNCED
    Prim UUID    : 351c8568-d8f6-11e3-a7dc-027f4ff287d7
    Prim seqno   : 35
    Last seqno   : 0
    Prim JOINED  : 6
    State UUID   : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
    Group UUID   : 2108a884-d873-11e3-a155-d6a734a2dc8b
    Name         : 'centos5-64'
    Incoming addr: '192.168.122.195:3306'

    Version      : 2
    Flags        : 0
    Protocols    : 0 / 4 / 2
    State        : NON-PRIMARY
    Prim state   : PRIMARY
    Prim UUID    : 2ab816e0-d8f8-11e3-a62c-639606db485d
    Prim seqno   : 37
    Last seqno   : 0
    Prim JOINED  : 3
    State UUID   : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
    Group UUID   : 2108a884-d873-11e3-a155-d6a734a2dc8b
    Name         : 'centos6-64'
    Incoming addr: '192.168.122.192:3306'

    Version      : 2
    Flags        : 2
    Protocols    : 0 / 4 / 2
    State        : NON-PRIMARY
    Prim state   : SYNCED
    Prim UUID    : 351c8568-d8f6-11e3-a7dc-027f4ff287d7
    Prim seqno   : 35
    Last seqno   : 0
    Prim JOINED  : 6
    State UUID   : 65ddd7ed-d8f8-11e3-8071-1e1fd2a2b454
    Group UUID   : 2108a884-d873-11e3-a155-d6a734a2dc8b
    Name         : 'centos5-32'
    Incoming addr: '192.168.122.135:3306'

mysqld: gcs/src/gcs_state_msg.c:530: state_quorum_remerge: Assertion `states[i]->prim_joined == candidates[j].prim_joined' failed.
10:38:32 UTC - mysqld got signal 6 ;

There is a merge of two components, 35 and 37, with the same uuid:seqno. The problem is that the component ID is not taken into consideration.
In this particular case it would have merged if debug assertions were disabled, but it is easy to imagine a situation where it won't (e.g. 3 nodes from 37 + 2 nodes from 35 = 5 nodes; 35 had 6 members, so there would be no full remerge).
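
A hedged sketch of the missing distinction: remerge candidates should be keyed by the full primary component identity (Prim UUID together with Prim seqno), not matched on seqno alone. The types below are illustrative:

    // Sketch: distinguish candidate PCs by (Prim UUID, Prim seqno); a full
    // remerge of a candidate requires all of its prim_joined members.
    #include <map>
    #include <string>

    struct PrimId
    {
        std::string uuid;  // Prim UUID from the state message
        long        seqno; // Prim seqno from the state message

        bool operator<(const PrimId& other) const
        {
            if (uuid != other.uuid) return uuid < other.uuid;
            return seqno < other.seqno;
        }
    };

    struct Candidate
    {
        int prim_joined; // members the PC had when it was primary
        int present;     // members of that PC seen in this merge
    };

    typedef std::map<PrimId, Candidate> Candidates;

    // E.g. with 3 of 3 members from PC 37 present but only 2 of 6 from
    // PC 35, only PC 37 would be a viable remerge candidate.
    bool viable(const Candidate& c) { return c.present == c.prim_joined; }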
