
niova-core's Issues

PMDB server request history through ctl-interface

Similar to the ctl-interface command history, the pmdb server should present a history of recently completed operations, including information about request origination (ie raft leader, term, etc.).

PumiceDBClient writeKV returns errno, not rc

writeKV returns errno instead of rc; rc is the variable that actually holds the error number.

This is the output from errno when an error is encountered:
ERRO[0000] PMDB put error: PmdbObjPut(): 0

This is the change I applied to return rc:
@@ -229,8 +229,8 @@ func (obj *PmdbClientObj) writeKV(rncui string, key *C.char,
 	rc := C.PmdbObjPut(obj.pmdb, obj_id, key, c_key_len, &obj_stat)

 	if rc != 0 {
-		var errno syscall.Errno
-		return fmt.Errorf("PmdbObjPut(): %d", errno)
+		//var errno syscall.Errno
+		return fmt.Errorf("PmdbObjPut(): %d", rc)
 	}

 	return nil

This is the output after the change, which allowed me to see the correct error number:
ERRO[0000] PMDB put error: PmdbObjPut(): -107

Failure in recipe pmdb_client_error_demonstration1

When an invalid write is issued using a write-seqno == 1 (which is expected to fail, since '0' is the expected seqno), the expected result is status: "Invalid exchange" and pmdb-seqno: -1.

But actual result is:

"pumice_db_test_client" : {
"pmdb-test-apps" : [
{
"app-user-id" : "43516d34-da3e-11eb-b47f-7b78b41234d2:0:0:0:0",
"status" : "Invalid or incomplete multibyte or wide character",
"pmdb-seqno" : 0,
"pmdb-write-pending" : false,
"last-request" : "Thu Jul 01 07:30:54 UTC 2021",
"last-request-duration-ms" : 12,
"last-request-tag" : 1254333918,
"app-sync" : false,
"app-seqno" : 2,
"app-value" : 1211055621,
"app-validated-seqno" : 0
}
],

PumiceDB server startup scan takes too long

A recent test containing ~130 million raft entry indexes is showing variable startup scan speed, sometimes as low as several thousand entries per second. Warming the page cache does show a large improvement and makes the situation more tenable; however, it's not a great solution since:

  • the system can't prevent huge DB instances from occurring
  • more valuable page cache data may be flushed

There are a few considerations regarding the scanning of entries during startup.

  • For raft instances which operate partially or fully "in-core" (RAFT_INSTANCE_STORE_POSIX_FLAT_FILE and RAFT_INSTANCE_STORE_ROCKSDB) the scan is necessary to recreate the raft user's state machine.
  • Raft server instances of type RAFT_INSTANCE_STORE_ROCKSDB_PERSISTENT_APP guarantee that all application state has been persisted to the backend per raft state machine "apply" operation.
    • Therefore, scanning all raft entries is not absolutely necessary.
    • Scanning may be considered a useful validation of system correctness
      • RAFT_INSTANCE_STORE_ROCKSDB_PERSISTENT_APP persists the last-applied-cumulative-crc value, making it possible to verify the entire stream of entries against this CRC.
      • Future versions of RAFT_INSTANCE_STORE_ROCKSDB_PERSISTENT_APP will allow for state machine independent garbage collection of old raft log entries
        • At this point scanning will have decreased usefulness as a correctness check since history will be pruned
        • Strongly checking all of history against a single value will no longer be possible.
          • Partial checks are still possible, however.

What to do in the case of RAFT_INSTANCE_STORE_ROCKSDB_PERSISTENT_APP?
I believe the scan may be removed for now and partially reinstated once the feature for state machine independent garbage collection is implemented. This feature will make it possible to bound the number of raft log entries to some limit which is reasonable for scanning.
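
As a rough illustration of the partial-check idea above, a bounded scan could fold each entry into a running checksum and compare it against the persisted last-applied-cumulative-crc. The sketch below is only a minimal stand-in: it uses zlib's crc32() and a hypothetical entry view, not niova's actual checksum or storage APIs.

#include <stdint.h>
#include <stddef.h>
#include <zlib.h>   /* crc32(); a stand-in for whatever checksum niova actually uses */

struct scan_entry            /* hypothetical view of one raft log entry */
{
    const unsigned char *se_data;
    size_t               se_len;
};

/* Fold entries [0, nentries) into a cumulative CRC and compare it with the
 * value persisted by RAFT_INSTANCE_STORE_ROCKSDB_PERSISTENT_APP.
 * Returns 0 on match, -1 on mismatch. */
static int
scan_verify_cumulative_crc(const struct scan_entry *entries, size_t nentries,
                           uint32_t persisted_cumulative_crc)
{
    uint32_t crc = 0;

    for (size_t i = 0; i < nentries; i++)
        crc = (uint32_t)crc32(crc, entries[i].se_data,
                              (unsigned int)entries[i].se_len);

    return crc == persisted_cumulative_crc ? 0 : -1;
}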

Gossip delay causes niovakv_client to send request to invalid addr

The initial gossip delay provides empty tag data to the niovakv_client, which uses the tag data to send HTTP requests. Since the tag is empty, the niovakv_client sends the HTTP request to an invalid address. This has been solved by adding a check before using the addr to send the request.

PMDB server fails with out of bounds go slice operation

<19230.236487815:warn:niovakv_pmdbser:raft_server_leader_calculate_committed_idx@3248> idx=0 sorted-idx=-1, peer-sync-idx=-1
<19230.236495255:warn:niovakv_pmdbser:raft_server_leader_calculate_committed_idx@3248> idx=1 sorted-idx=-1, peer-sync-idx=4524
<19230.236500298:warn:niovakv_pmdbser:raft_server_leader_calculate_committed_idx@3248> idx=2 sorted-idx=-1, peer-sync-idx=4524
<19230.236503862:warn:niovakv_pmdbser:raft_server_leader_calculate_committed_idx@3248> idx=3 sorted-idx=4524, peer-sync-idx=-1
<19230.236507029:warn:niovakv_pmdbser:raft_server_leader_calculate_committed_idx@3248> idx=4 sorted-idx=4524, peer-sync-idx=-1
panic: runtime error: slice bounds out of range [:33] with length 30
goroutine 1 [running]:
niova/go-pumicedb-lib/server.PmdbLookupKey(0xc0001183a4, 0xa, 0xa, 0x52b99e, 0x9, 0xc000126230, 0x2, 0x2, 0xc000126230)
/home/lucusn/niova-core/go/pumiceDB/server/PumiceDBServer.go:221 +0x2b0
niova/go-pumicedb-lib/server.PmdbReadKV(...)
/home/lucusn/niova-core/go/pumiceDB/server/PumiceDBServer.go:282
niova/go-pumicedb-lib/server.(*PmdbServerObject).ReadKV(0xc0001316c0, 0x7f2c25ec9be0, 0xc0001183a4, 0xa, 0xa, 0x52b99e, 0x9, 0xc000159a10, 0x0, 0x64, ...)
/home/lucusn/niova-core/go/pumiceDB/server/PumiceDBServer.go:292 +0x5d
main.(*NiovaKVServer).Read(0xc000131500, 0x7f2c25ec9be0, 0x7f2c25ec9c30, 0x55, 0x29d60b0, 0x3ff9a8, 0x465925)
/home/lucusn/niova-core/go/pumiceDB/examples/niovakv/pmdb/server/niovakv_pmdbserver.go:183 +0x28e
niova/go-pumicedb-lib/server.goRead(0x7f2c25ec9be0, 0x7f2c25ec9c30, 0x55, 0x29d60b0, 0x3ff9a8, 0x295cf40, 0xa)
/home/lucusn/niova-core/go/pumiceDB/server/PumiceDBServer.go:112 +0x88
niova/go-pumicedb-lib/server._Cfunc_PmdbExec(0x29594a0, 0x294d2b0, 0xc0001013d0, 0xc0001220a0, 0x100000001, 0x295cf40, 0x0)
_cgo_gotypes.go:163 +0x48
niova/go-pumicedb-lib/server.PmdbStartServer.func2(0x29594a0, 0x294d2b0, 0xc0001013d0, 0xc0001220a0, 0x295cf40, 0x10)
/home/lucusn/niova-core/go/pumiceDB/server/PumiceDBServer.go:169 +0xda
niova/go-pumicedb-lib/server.PmdbStartServer(0xc0001316c0, 0x0, 0x0)
/home/lucusn/niova-core/go/pumiceDB/server/PumiceDBServer.go:169 +0x325
niova/go-pumicedb-lib/server.(*PmdbServerObject).Run(...)
/home/lucusn/niova-core/go/pumiceDB/server/PumiceDBServer.go:179
main.main()
/home/lucusn/niova-core/go/pumiceDB/examples/niovakv/pmdb/server/niovakv_pmdbserver.go:67 +0x3b8

The pmdb server crashed when making reads/writes with lkvt. I have been able to reproduce this error by doing puts/gets with a total of only 16 puts across 5 concurrent methods, and also by running 100 puts with only one method, using a key size of 10 bytes and a value size of 20 bytes. 5 pmdb servers were running along with one niovakv_server as Node1. The pmdb servers will not come back up until the niovakv_server is killed. No error was returned to the log.

PMDB Client shutdown results in epoll error

If the PMDB client finishes its execution and terminates, we get the following epoll error:

 <11042993.414332931:warn:LeaseApp_pmdbSe:epoll_handle_del_complete@330> epoll_ctl(epm_fd=18, eph_fd=35, EPOLL_CTL_DEL): Bad file descriptor
 <11042993.414358470:warn:LeaseApp_pmdbSe:epoll_mgr_reap_destroy_list@436> epoll_handle_del_complete(eph=0x95bcc8): Bad file descriptor

This behavior is reproducible in the leaseApplication branch by starting the PMDB Servers and sending any request from the client side. When the client finishes executing, the above error appears in the PMDB Server logs.

There can be older entries with an older term value in the cowr RB tree

The leader had some entries in the cowr RB tree (term 10); then the leader changed and became a follower (term 11).
The same peer was later elected as the new leader (term 11).

If a client retries the same writes that were attempted in term 10 and are still present in the coalesced-write RB tree, the leader will find
the older entry in the cowr RB tree with stale data.

We should remove the stale entries from the cowr RB tree whenever the leader changes.
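
A minimal sketch of the proposed cleanup, using a plain linked list and hypothetical field names as stand-ins for the real cowr RB tree; the actual fix would walk the RB tree from the leader-transition path.

#include <stdint.h>
#include <stdlib.h>

struct cowr_entry                     /* stand-in for a coalesced-write entry */
{
    int64_t            ce_term;       /* term in which the write was staged */
    struct cowr_entry *ce_next;
};

/* On a leader transition, drop every staged entry whose term predates the
 * new leader's term so a retried client write cannot match stale data. */
static void
cowr_purge_stale_entries(struct cowr_entry **head, int64_t current_term)
{
    struct cowr_entry **pp = head;

    while (*pp)
    {
        if ((*pp)->ce_term != current_term)
        {
            struct cowr_entry *stale = *pp;

            *pp = stale->ce_next;     /* unlink */
            free(stale);              /* release the stale staged write */
        }
        else
        {
            pp = &(*pp)->ce_next;
        }
    }
}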

pmdb_obj is no longer immediately written to rocksdb when coalescing mode is enabled

In the non-coalesce case, both the raft block as well as the pmdb_obj are immediately written into RocksDB, albeit the write may be asynchronous. This write is done in raft write context, which is serialized. Therefore, a second write arriving from a different client for the same rncui would be detected as invalid due to conflicting with a pending write to the same seqno. However, if the pmdb_obj is not written to RocksDB, since it's pending in the coalesce buffer, the second write may enter when it would have otherwise been denied.

Without changing the code too much (i.e. modifying filling and storage of the coalesce buffer OR writing the pmdb_objects synchronously and ahead of the raft block), it may be possible to detect the scenario in the Apply phase and return an error to all but one of the conflicting writes (the write whose information is in the pmdb_object would see OK).
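
A rough sketch of the apply-phase check being proposed, using hypothetical structures (none of these names exist in the pmdb code): only the write whose tag matches the one recorded in the pmdb_object is applied, and any other write pending against the same seqno receives an error.

#include <stdint.h>
#include <errno.h>

struct pmdb_obj_view            /* hypothetical view of the stored pmdb_obj */
{
    int64_t  po_seqno;          /* last committed sequence number */
    uint64_t po_pending_tag;    /* tag of the write recorded in the object */
};

struct pending_write            /* hypothetical view of a write being applied */
{
    int64_t  pw_seqno;
    uint64_t pw_tag;
};

/* Returns 0 if this write is the one recorded in the pmdb_object, otherwise a
 * negative errno so the duplicate/conflicting writer receives an error reply. */
static int
pmdb_apply_check_conflict(const struct pmdb_obj_view *obj,
                          const struct pending_write *wr)
{
    if (wr->pw_seqno != obj->po_seqno + 1)
        return -EBADE;                      /* stale or out-of-order seqno */

    return wr->pw_tag == obj->po_pending_tag ? 0 : -EALREADY;
}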

PmdbObjGetX should return actual size of the value to client.

PmdbObjGetX asks for a buffer in which to store the value of the key, along with the length of that buffer.
But on the client side, the application may not know the exact size of the value and may allocate a buffer that is too small
to read the value. In that case an E2BIG error is returned to the application, but currently there is no way for the
application to learn the exact size of the value.
On getting E2BIG, if it knew the exact value size, it could allocate a bigger buffer and retry the read request.

We get the size of the value from the server in rcrh_reply_size.
This value can be returned to the application.
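
A minimal sketch of how a client could use such a return value, written against a hypothetical variant of the call that always reports the full value size (the real PmdbObjGetX signature differs):

#include <errno.h>
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical read call: fills 'buf' with up to 'buf_len' bytes and always
 * reports the full value size via 'actual_len'.  Returns 0 on success or
 * -E2BIG if the buffer was too small. */
int pmdb_obj_get_x(const char *rncui, const char *key, size_t key_len,
                   char *buf, size_t buf_len, size_t *actual_len);

static char *
read_value_with_retry(const char *rncui, const char *key, size_t key_len,
                      size_t *value_len_out)
{
    size_t  buf_len = 256;              /* initial guess */
    char   *buf = malloc(buf_len);
    if (!buf)
        return NULL;

    size_t actual = 0;
    int rc = pmdb_obj_get_x(rncui, key, key_len, buf, buf_len, &actual);

    if (rc == -E2BIG && actual > buf_len)
    {
        /* The server told us the real size (taken from rcrh_reply_size), so
         * reallocate once and retry instead of failing outright. */
        char *bigger = realloc(buf, actual);
        if (!bigger)
        {
            free(buf);
            return NULL;
        }
        buf = bigger;
        buf_len = actual;
        rc = pmdb_obj_get_x(rncui, key, key_len, buf, buf_len, &actual);
    }

    if (rc != 0)
    {
        free(buf);
        return NULL;
    }

    *value_len_out = actual;
    return buf;
}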

Memory leak for followers.

As Paul investigated, there is a missing free() for the corresponding malloc() in raft_net_write_supp_add(), and it only happens for followers:

It was happening because raft_net_sm_write_supplements_merge() in raft_server_state_machine_apply() was being called
only when ri_server_sm_request_cb() returned success.

But ri_server_sm_request_cb() can return an error on a follower, since pmdb_object_lookup() returns an error for followers:
the pmdb object is created in the write phase only on the leader, as a marker.

Solution: raft_net_sm_write_supplements_merge() should always be called.
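
A self-contained sketch of the control-flow change, with stand-in types and helpers rather than the real niova functions: the merge runs unconditionally after the state-machine callback, so the supplement allocated by raft_net_write_supp_add() is always handed off instead of leaking when the callback fails on a follower.

#include <stdio.h>

struct write_supp { int nitems; };       /* stand-in for the real supplement */

static int sm_request_cb(int is_leader)  /* stand-in for ri_server_sm_request_cb() */
{
    /* On followers pmdb_object_lookup() fails, so the callback returns an error. */
    return is_leader ? 0 : -1;
}

static void
write_supplements_merge(struct write_supp *dst, struct write_supp *src)
{                                        /* stand-in for raft_net_sm_write_supplements_merge() */
    dst->nitems += src->nitems;
    src->nitems = 0;                     /* source is drained / freed by the merge */
}

static void
state_machine_apply(int is_leader)
{
    struct write_supp request_supp = { .nitems = 1 };  /* filled by raft_net_write_supp_add() */
    struct write_supp instance_supp = { 0 };

    int cb_rc = sm_request_cb(is_leader);

    /* Old behavior: merge only when cb_rc == 0, so followers leaked request_supp.
     * Fix: merge unconditionally. */
    write_supplements_merge(&instance_supp, &request_supp);

    printf("is_leader=%d cb_rc=%d leaked-items=%d\n",
           is_leader, cb_rc, request_supp.nitems);
}

int main(void)
{
    state_machine_apply(1);
    state_machine_apply(0);
    return 0;
}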

failed assertion: pmdb_cowr_tree_term == -1 || pmdb_cowr_tree_term == rncr->rncr_current_term

During testing of the niovakv application using lkvt, the following assert was hit:
<1198901.930547566:fatal:niovakv_pmdbser:pmdb_sm_handler_pmdb_sm_apply_remove_coalesce_tree_item@867> failed assertion: pmdb_cowr_tree_term == -1 || pmdb_cowr_tree_term == rncr->rncr_current_term

This assert expects that any stale entries in the cowr subapp RB tree have already been released and pmdb_cowr_tree_term has been set to -1,
OR that the pmdb_cowr_tree_term value is the same as rncr->rncr_current_term, i.e. all the elements in the cowr subapp tree are from the same term.

Multiple leader elections happened in the cluster before hitting this assert.

Analysis:
Here is how pmdb_cowr_tree_term can be neither -1 nor the same as rncr->rncr_current_term:

  • Peer1 was the leader in term 6.
  • There were some writes while Peer1 was leader, so there are entries in the cowr subapp tree on Peer1 and pmdb_cowr_tree_term is 6.
  • Another leader election happens and Peer2 becomes leader (term 7).
  • Some writes happened while Peer2 was leader.
  • Before those writes could complete, Peer1 became leader again (term 8).
  • The Apply message sent by Peer2 when it was leader (term 7) reaches Peer1 (the current leader). The cluster is at term 8, but one stale message from term 7 reaches Peer1.
  • pmdb_sm_handler_pmdb_sm_apply_remove_coalesce_tree_item() releases the stale entries from the cowr subapp tree only when the peer is a follower, assuming all entries on the leader are from the same term.
  • pmdb_cowr_tree_term had been updated on Peer1 during its transition from leader -> follower -> leader, as the write phase happened in term 7 while the Apply message reached Peer1 in term 8.
  • And the peer hits the assert.

Number_of_values parameter to client request is not working as expected

With the new changes to the client request to pass the number of values (commit 56eb7de), it shows wrong behavior in the following 3 cases.

Following are the client commands and their output:

  1. cmd: "input@f9eedfec-7817-11eb-a681-9f2e7bda734d:0:0:0:0.write:0.100.0"
    Output:
    "/0/app-seqno": 842,
    "/0/app-sync": false,
    "/0/app-user-id": "f9eedfec-7817-11eb-a681-9f2e7bda734d:0:0:0:0",
    "/0/pmdb-seqno": 0,
    "/0/pmdb-write-pending": false,
    "/0/status": "Success"

Observation:
As the number of values is 0, it should use a random value for num_vals.
And as there are 100 writes, the total writes/seqno should be a multiple of 100.
But here the app-seqno is 842.

  1. cmd: "[email protected]:ce5ba4be-781c-11eb-a693-b3e3dccb1b44:0:0:0:0.100.1"
    Output: "/0/app-seqno": 1,
    "/0/app-sync": false,
    "/0/app-user-id": "ce5ba4be-781c-11eb-a693-b3e3dccb1b44:0:0:0:0",
    "/0/pmdb-seqno": 0,
    "/0/pmdb-write-pending": false,
    "/0/status": "Success"
    Observation:
    Here the number of values is 1 for 100 write requests, but the app-seqno is 1.
    Expected result: 99

  2. cmd: "[email protected]:019f37da-781f-11eb-8b51-8ffbba46e3cd:0:0:0:0.100
    "/0/app-seqno": 1765,
    "/0/app-sync": false,
    "/0/app-user-id": "019f37da-781f-11eb-8b51-8ffbba46e3cd:0:0:0:0",
    "/0/pmdb-seqno": 0,
    "/0/pmdb-write-pending": false,
    "/0/status": "Success"
    Observation: Here the number of values is not passed at all. In this case as well, we take a random number. But as the number of write requests is 100, the app-seqno should be a multiple of 100.
    As @00pauln00 suggested, if num_of_values is not specified, it should default to 1 to keep it compatible with the older version.

Raft leader takes too long determining a recently restarted follower's "next index"

The leader requires an untenable amount of time to determine the next-idx of a newly initialized peer

  • The leader is iterating through 130 million indexes - one at a time at a rate of < 10k/sec
  • The problem is compounded when there’s a leadership change
    • The new leader must start the entire process from scratch
    • It’s possible that the discovery of the next-idx may never complete!

To deal with this issue the discovery method should employ some sort of binary reduction. Such a method would reduce the time taken from O(N) to O(logN).
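
As a rough illustration (not the actual raft_server code), the sketch below binary-searches for the highest index at which the leader's and follower's logs agree; a hypothetical probe callback stands in for the append-entries round trip that reports whether the follower holds a matching entry at a given index. The follower's next-idx is then the result plus one.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical probe: returns true if the follower holds an entry at 'idx'
 * whose term matches the leader's entry at 'idx'.  In the real server this
 * would be an append-entries exchange. */
typedef bool (*follower_has_matching_entry_t)(void *arg, int64_t idx);

/* Find the highest index at which leader and follower agree, probing
 * O(log N) indexes instead of walking back one index at a time.
 * Returns -1 if the logs share no common prefix. */
static int64_t
raft_find_highest_matching_idx(int64_t leader_last_idx,
                               follower_has_matching_entry_t probe, void *arg)
{
    int64_t lo = 0, hi = leader_last_idx, best = -1;

    while (lo <= hi)
    {
        int64_t mid = lo + (hi - lo) / 2;

        if (probe(arg, mid))
        {
            best = mid;          /* logs agree here; search higher */
            lo = mid + 1;
        }
        else
        {
            hi = mid - 1;        /* mismatch; search lower */
        }
    }

    return best;
}

This relies on the raft log-matching property: if the logs agree at some index they agree at every lower index, so the probe result is monotone and safe to binary-search.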

Server asserts when a follower's current index exceeds the leader's

Multiple servers abort with the following assert from raft_server_append_entry_log_prepare_and_check():

// At this point, current_idx should not exceed the one from the leader.
NIOVA_ASSERT(reh.reh_index <= raerq->raerqm_prev_log_index);

<17238497.184134622:warn:pumice-dict-ser:raft_server_becomes_follower@1961> r:F et[s:u]=56:56 ei[s:u]=0:0 ht=56 hs=56 c=-1 la=-1:-1 lck=-1 v=7f583666-a963-11eb-8986-bf8a88ea0b58 l=7f583666-a963-11eb-8986-bf8a88ea0b58 sender-uuid=7f58dbd4-a963-11eb-a701-8f7d081fe073 term=58 rsn=stale-leader
<17238497.188614012:warn:pumice-dict-ser:raft_server_append_entry_log_prune_if_needed@2623> r:F et[s:u]=56:56 ei[s:u]=0:0 ht=58 hs=57 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l=7f5a494c-a963-11eb-976c-030a056a075f raerqm_prev_log_term=0 raerqm_prev_log_index=-1
<17238497.189560109:fatal:pumice-dict-ser:raft_server_append_entry_log_prepare_and_check@2762> failed assertion: reh.reh_index <= raerq->raerqm_prev_log_index
SIGABRT: abort

Serf membership precedes gossip data

When the niovakv server is started, serf is started before the gossip is issued. This creates a problem in niovakv_client_api while sending the HTTP request.

Raft recipe, "Basic Leader Election", encounters sporadic failures

The Basic Leader Election recipe (https://github.com/00pauln00/niova-core/wiki/Recipe%3A-Basic-Leader-Election) is failing due to the occasional occurrence of a second election. This error was detected during a CI build in the niova-CI branch. The issue is that the recipe does not cope with a case where peer-X becomes leader but cannot complete its initial commit before peer-Y times out, becomes a candidate, and issues its term value to the network. Eventually peer-X (or another peer which has stored peer-X's initial commit) will win the ensuing election, but this will cause one or more additional elections to occur, which pushes the commit-idx past the intended value of 0.

Newly restarted follower in raft async mode does not set commit-idx or advance the last-applied value

                        "raft-uuid" : "11afb04a-f9cc-11ea-b0e5-831d7b183899",
                        "peer-uuid" : "11b0c976-f9cc-11ea-b60b-ff721df40ac4",
                        "voted-for-uuid" : "00000000-0000-0000-0000-000000000000",
                        "leader-uuid" : "11b0e7f8-f9cc-11ea-9d54-7bda2902f51c",
                        "state" : "follower",
                        "follower-reason" : "leader-already-present",
                        "client-requests" : "redirect-to-leader",
                        "term" : 125840,
 ---------->         "commit-idx" : -1,
 ---------->         "last-applied" : 130836835,
                        "last-applied-cumulative-crc" : 943050041,
                        "sync-freq-us" : 4000,
                        "sync-cnt" : 14508,
  ---------->        "sync-entry-idx" : 130978289,
                        "sync-entry-term" : 125805,
                        "sync-entry-data-size" : 128,
                        "sync-entry-crc" : 3468325892,
                        "unsync-entry-idx" : 130978307,
                        "unsync-entry-term" : 125805,
                        "unsync-entry-data-size" : 128,
                        "unsync-entry-crc" : 241182324,

Bulk recovery recipe fails to bootup new peer successfully

In the bulk recovery recipe, after some number of writes we try to start a new peer (say peer4), but it fails to boot up properly every time.

The error on console:

TASK [common : Start the peer] ******************************************************************************************************************************************
Thursday 25 November 2021 12:02:16 +0000 (0:00:00.098) 0:04:30.929 *****
ok: [localhost]

TASK [common : pause] ***************************************************************************************************************************************************
Thursday 25 November 2021 12:02:17 +0000 (0:00:00.286) 0:04:31.216 *****
Pausing for 3 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ok: [localhost]

TASK [common : Wait for server process to bootup completely] ************************************************************************************************************
Thursday 25 November 2021 12:02:20 +0000 (0:00:03.034) 0:04:34.250 *****
ERROR! A worker was found in a dead state


And the peer4 log file shows the error:

<16843861.914079839:warn:pumice-referenc:env_parse@201> env-var NIOVA_INOTIFY_BASE_PATH value /home/makshata/bulk_testing/7e4e791c-4df8-11ec-9f8c-b765ffa10fdc/ctl-interface applied from environment
<16843861.917472515:warn:pumice-referenc:env_parse@201> env-var NIOVA_LOCAL_CTL_SVC_DIR value /home/makshata/bulk_testing/7e4e791c-4df8-11ec-9f8c-b765ffa10fdc/configs applied from environment
<16843861.929415922:warn:pumice-referenc:raft_server_set_max_scan_entries@4912> max_scan_entries=100000
<16843861.929461477:warn:pumice-referenc:raft_server_set_log_reap_factor@4924> log_reap_factor=5
<16843861.929470882:warn:pumice-referenc:raft_server_set_num_checkpoints@4937> num_checkpoints=5
<16843861.929598491:warn:pumice-referenc:raft_epoll_setup_net@869> epoll for tcp for server
<16843861.929659826:warn:pumice-referenc:raft_net_epoll_setup@1051> return code of raft_epoll_setup_net is: 0
<16843861.931249832:error:pumice-referenc:rsbr_db_open_internal@2608> rocksdb_open_column_families(/home/makshata/bulk_testing/7e4e791c-4df8-11ec-9f8c-b765ffa10fdc/raftdb/7eac16c6-4df8-11ec-944f-a3d2c7f96389.raftdb/db'): Invalid argument: /home/makshata/bulk_testing/7e4e791c-4df8-11ec-9f8c-b765ffa10fdc/raftdb/7eac16c6-4df8-11ec-944f-a3d2c7f96389.raftdb/db: does not exist (create_if_missing is false) (try-create=no) <16843861.955763581:warn:pumice-referenc:rsbr_db_open_internal@2608> rocksdb_open_column_families(/home/makshata/bulk_testing/7e4e791c-4df8-11ec-9f8c-b765ffa10fdc/raftdb/7eac16c6-4df8-11ec-944f-a3d2c7f96389.raftdb/db'): Success (try-create=yes)
<16843861.958094715:warn:pumice-referenc:rsbr_prep_raft_instance_from_db@2652> entry-idxs: lowest=-1 highest=-1
<16843861.982384089:warn:pumice-referenc:raft_server_process_bulk_recovery_check@3297> r:F et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=0 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= needs bulk recovery (leader-lowest-idx=16243)
<16843861.982420611:warn:pumice-referenc:raft_server_main_loop@5745> epoll_mgr_wait_and_process_events(): Success
<16843861.982455203:warn:pumice-referenc:epoll_mgr_close@170> epm=0x7f664d60ef10 cannot be destroyed num_handles=2 (Device or resource busy)
<16843861.982469429:error:pumice-referenc:raft_net_instance_shutdown@1202> raft_net_epoll_cleanup(): Device or resource busy
<16843861.983068527:error:pumice-referenc:raft_net_evp_remove@999> raft_net_epoll_handle_remove_by_fd(1): No such file or directory
<16843861.983088491:error:pumice-referenc:raft_net_evp_remove@999> raft_net_epoll_handle_remove_by_fd(2): No such file or directory
<16843861.983097167:error:pumice-referenc:raft_net_evp_remove@999> raft_net_epoll_handle_remove_by_fd(3): No such file or directory
<16843861.983124291:error:pumice-referenc:raft_server_instance_shutdown@5668> raft_server_evp_cleanup(): No such file or directory
<16843861.984653600:error:pumice-referenc:raft_net_instance_shutdown@1211> ri_shutdown_cb(): No such file or directory
<16843861.984734368:error:pumice-referenc:raft_server_instance_run@5954> raft_net_instance_shutdown(): Device or resource busy

Request should be retried when tcp_mgr_connection_verify returns EAGAIN

tcp_mgr_connection_verify() may return EAGAIN if the connection between the client and the leader is not yet established.
But currently raft_net_send_msg() converts all error codes to -ECOMM,
and later, in raft_client_rpc_sendq_dequeue_head_and_send(), rcrh_cancel gets set on the error condition and the client never retries
that request.
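
A rough sketch of the desired behavior, with hypothetical wrappers standing in for the real raft_net_send_msg()/raft_client_rpc_sendq_dequeue_head_and_send() paths: the transient -EAGAIN is preserved and the request is requeued rather than cancelled.

#include <errno.h>

/* Hypothetical send wrapper: returns 0, -EAGAIN (connection not yet
 * established / verified), or another negative errno. */
int raft_send_msg_sketch(void *req);

/* Hypothetical queue helpers. */
void sendq_requeue_head(void *req);     /* leave the request queued for retry */
void request_cancel(void *req);         /* current behavior: the rcrh_cancel path */

/* Dequeue-and-send sketch: only hard failures cancel the request; a transient
 * -EAGAIN keeps it queued so the client retries once the connection to the
 * leader has been verified. */
static void
sendq_dequeue_head_and_send_sketch(void *req)
{
    int rc = raft_send_msg_sketch(req);

    if (rc == 0)
        return;                         /* sent */

    if (rc == -EAGAIN)
        sendq_requeue_head(req);        /* retry later instead of cancelling */
    else
        request_cancel(req);            /* genuine failure */
}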

Even if the control plane proxy is up and running, service discovery returns a "Server not available" error

Running the control plane range query recipe shows some write failures with a "server not available" error, even though the proxy is up and running and the cluster is in a good state.
The proxy does not report any write errors either.
That means the key was never sent to the proxy for the write operation.
NCPC is wrongly declaring the proxy as dead.
"nisd.075d6832-7df9-472c-8445-cd2979ec7e4d.e40ec2b0-91d9-4d58-a82a-275a9c824a42": {
"Status": 1,
"Value": "(Service discovery) Server not available"
},

Assert failing in rst_sm_handler_write after connecting client

NIOVA_ASSERT(rncr && rncr->rncr_request && rncr->rncr_is_leader &&
             rncr->rncr_type == RAFT_NET_CLIENT_REQ_TYPE_WRITE &&
             raft_net_client_request_handle_writes_raft_entry(rncr));

The condition that's failing is raft_net_client_request_handle_writes_raft_entry(rncr), which implies that rncr->rncr_write_raft_entry is false.

Reproduction steps:

  1. scripts/raft-launch-servers-in-terminal.sh -w -s 0 -e 2 scripts/raft-mk-configs.sh /tmp;
  2. ./raft-client -r $RAFT -u $CLIENT
    where:
    $RAFT is the uuid portion of $NIOVA_LOCAL_CTL_SVC_DIR, and
    $CLIENT is UUID of the first file in $NIOVA_LOCAL_CTL_SVC_DIR/*.raft_client

niovakv_client crashes with "large" -n values

$ ./niovakv_client -c ../lkvt/config -l /tmp/foo -n 200
logfile: /tmp/foo
0  /  200 write request completed
0  /  200 read request completed
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x75ca38]

goroutine 1193 [running]:
niovakv/clientapi.(*ClientAPI).Get(...)
	/home/pauln/Code/niova-core/go/pumiceDB/examples/niovakv/niovakv_client_api/niovakv_client_api.go:72
main.(*niovakv_client).sendReq(0xc000220270, 0xc000618a40, 0x0)
	/home/pauln/Code/niova-core/go/pumiceDB/examples/niovakv/niovakv_client/niovakv_client.go:95 +0x5f8
created by main.(*niovakv_client).doWrite_Read
	/home/pauln/Code/niova-core/go/pumiceDB/examples/niovakv/niovakv_client/niovakv_client.go:144 +0x1ad

regcomp leak introduced in f97565

f97565a introduces a minor leak where a compiled regex is not freed.

<574.109560627:warn:raft-net-test:raft_net_init@2357> enter
<574.110220321:warn:raft-net-test:raft_net_init@2367> exit
<574.110609917:warn:raft-net-test:raft_net_client_user_id_parse@2275> 1a636bd0-d27d-11ea-8cad-90324b2d1e89:2341523123:32452300123:1:0

=================================================================
==21243==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 256 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a4054f5 in regcomp (/lib/x86_64-linux-gnu/libc.so.6+0x1024f5)

Direct leak of 224 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
#1 0x7f338a404ab4 (/lib/x86_64-linux-gnu/libc.so.6+0x101ab4)

Indirect leak of 2048 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743dc6 in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10ddc6)
#1 0x7f338a403678 (/lib/x86_64-linux-gnu/libc.so.6+0x100678)

Indirect leak of 1888 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a403b5b (/lib/x86_64-linux-gnu/libc.so.6+0x100b5b)

Indirect leak of 1888 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a403641 (/lib/x86_64-linux-gnu/libc.so.6+0x100641)

Indirect leak of 1888 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a403b6b (/lib/x86_64-linux-gnu/libc.so.6+0x100b6b)

Indirect leak of 1296 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a404476 (/lib/x86_64-linux-gnu/libc.so.6+0x101476)

Indirect leak of 952 byte(s) in 81 object(s) allocated from:
#0 0x7f338a743ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
#1 0x7f338a404556 (/lib/x86_64-linux-gnu/libc.so.6+0x101556)

Indirect leak of 472 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a403b3a (/lib/x86_64-linux-gnu/libc.so.6+0x100b3a)

Indirect leak of 316 byte(s) in 66 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a3f7c69 (/lib/x86_64-linux-gnu/libc.so.6+0xf4c69)

Indirect leak of 256 byte(s) in 8 object(s) allocated from:
#0 0x7f338a743dc6 in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10ddc6)
#1 0x7f338a4000ee (/lib/x86_64-linux-gnu/libc.so.6+0xfd0ee)

Indirect leak of 88 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743dc6 in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10ddc6)
#1 0x7f338a3f9d93 (/lib/x86_64-linux-gnu/libc.so.6+0xf6d93)

Indirect leak of 64 byte(s) in 8 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a4042fa (/lib/x86_64-linux-gnu/libc.so.6+0x1012fa)

Indirect leak of 64 byte(s) in 16 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a40428a (/lib/x86_64-linux-gnu/libc.so.6+0x10128a)

Indirect leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
#1 0x7f338a3f41f4 (/lib/x86_64-linux-gnu/libc.so.6+0xf11f4)
#2 0x6020000053cf ()

Indirect leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
#1 0x7f338a3f41f4 (/lib/x86_64-linux-gnu/libc.so.6+0xf11f4)
#2 0x60200000568f ()

Indirect leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
#1 0x7f338a3f41f4 (/lib/x86_64-linux-gnu/libc.so.6+0xf11f4)
#2 0x60200000554f ()

Indirect leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
#1 0x7f338a3f41f4 (/lib/x86_64-linux-gnu/libc.so.6+0xf11f4)
#2 0x6020000053ef ()

Indirect leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
#1 0x7f338a3f41f4 (/lib/x86_64-linux-gnu/libc.so.6+0xf11f4)
#2 0x60200000528f ()

Indirect leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
#1 0x7f338a3f2bc3 (/lib/x86_64-linux-gnu/libc.so.6+0xefbc3)

Indirect leak of 4 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a3f2ad2 (/lib/x86_64-linux-gnu/libc.so.6+0xefad2)

Indirect leak of 4 byte(s) in 1 object(s) allocated from:
#0 0x7f338a743bc8 in malloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dbc8)
#1 0x7f338a3f9dcf (/lib/x86_64-linux-gnu/libc.so.6+0xf6dcf)

SUMMARY: AddressSanitizer: 12348 byte(s) leaked in 206 allocation(s).
FAIL test/raft-net-test (exit status: 1)
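
For reference, the fix is the usual regcomp()/regfree() pairing; a minimal, self-contained example of the pattern (not the code introduced in f97565a):

#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Compile, use, and free a POSIX regex.  The memory reported above by
 * LeakSanitizer is what regcomp() allocates inside 'preg'; it is only
 * released by a matching regfree(). */
static bool
string_matches_pattern(const char *pattern, const char *string)
{
    regex_t preg;

    if (regcomp(&preg, pattern, REG_EXTENDED | REG_NOSUB))
        return false;                     /* compile failed: nothing was allocated for us */

    const int rc = regexec(&preg, string, 0, NULL, 0);

    regfree(&preg);                       /* releases regcomp()'s allocations */

    return rc == 0;
}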

niovakv server and clients are too verbose by default

After writing just 7000 raft transactions, the niovakv server and pmdb servers are generating rather large logs full of debug info such as:
time="21-09-2021 13:11:39" level=info msg="Key passed by client: Key289"

This information should be available if requested by the user; however, such messages must be disabled by default.

pauln@niova:/tmp$ ls -lrt [0-9a-f]*.log
-rw-r--r-- 1 pauln pauln  2425139 Sep 21 13:11 776cab6a-0fea-11ec-87e0-1b1bbe9ed57b.log
-rw-r--r-- 1 pauln pauln 59028765 Sep 21 13:11 776c882e-0fea-11ec-a557-3b6c731ca67e.log
-rw-r--r-- 1 pauln pauln  2300431 Sep 21 13:11 776c638a-0fea-11ec-a731-dbaee6d9855a.log
-rw-r--r-- 1 pauln pauln  2653105 Sep 21 13:11 776c3f68-0fea-11ec-bfa8-87ef3c13525a.log
-rw-r--r-- 1 pauln pauln  2300638 Sep 21 13:11 776c1b28-0fea-11ec-adb3-5b7b4f0914f1.log

Inconsistent TTL from a LOOKUP operation after lease expiry

Sometimes the LOOKUP operation on an expired lease gives the following output, with the TTL set to some value instead of '0':

{
 "Request": {
  "Client": "529fc41a-a770-11ed-a40b-3a0bc855fd65",
  "Resource": "52638892-a770-11ed-bcf8-3a0bc855fd65",
  "Operation": "LOOKUP"
 },
 "Response": {
  "Client": "529fc41a-a770-11ed-a40b-3a0bc855fd65",
  "Resource": "52638892-a770-11ed-bcf8-3a0bc855fd65",
  "Status": "Success",
  "LeaseState": "EXPIRED",
  "TTL": 56,
  "TimeStamp": {
   "LeaderTerm": 1,
   "LeaderTime": 1146
  }
 }
}
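
A tiny sketch of the expected clamping, with hypothetical fields; whichever code fills in the LOOKUP response should report a TTL of 0 once the lease has expired instead of the raw remaining-time value.

#include <stdint.h>

struct lease_view                 /* hypothetical view of a lease record */
{
    int64_t lv_expiry_time;       /* leader-time at which the lease expires */
    int     lv_state_expired;     /* nonzero once the lease is EXPIRED */
};

/* TTL reported to the client: 0 for an expired lease, otherwise the remaining
 * time, never negative. */
static int64_t
lease_lookup_ttl(const struct lease_view *lv, int64_t leader_time_now)
{
    if (lv->lv_state_expired || leader_time_now >= lv->lv_expiry_time)
        return 0;

    return lv->lv_expiry_time - leader_time_now;
}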

Memory leak in raft / pumiceDB code caused by failing to free() RocksDB errptr

The error strings returned by the RocksDB C API are allocated using strdup(). Callers of the API must be checked to ensure that they pass in a NULL pointer address (i.e. initialized on the stack) and that, in the case of an error, free() is called.

static bool SaveError(char** errptr, const Status& s) {
  assert(errptr != nullptr);
  if (s.ok()) {
    return false;
  } else if (*errptr == nullptr) {
    *errptr = strdup(s.ToString().c_str());
  } else {
    // TODO(sanjay): Merge with existing error?                                 
    // This is a bug if *errptr is not created by malloc()                      
    free(*errptr);
    *errptr = strdup(s.ToString().c_str());
  }
  return true;
}
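
For reference, the pattern callers on the niova side should follow, shown here with the real RocksDB C API call rocksdb_put() and otherwise illustrative names:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rocksdb/c.h>

/* errptr starts as NULL on the stack, and any string RocksDB hands back
 * (allocated with strdup() inside SaveError()) is free()'d by the caller. */
static int
put_key(rocksdb_t *db, const rocksdb_writeoptions_t *wopts,
        const char *key, const char *val)
{
    char *err = NULL;   /* must be NULL-initialized per SaveError()'s contract */

    rocksdb_put(db, wopts, key, strlen(key), val, strlen(val), &err);

    if (err)
    {
        fprintf(stderr, "rocksdb_put(%s): %s\n", key, err);
        free(err);      /* the missing step that causes the leak */
        return -1;
    }

    return 0;
}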

lkvt does not appear to work with the niovakv backend

The output shows that lkvt completes without issuing any error, even though the pmdb client sees no successful puts.

$ ./lkvt -d 0 -cp ./config -pp 100.0
INFO[0000] starting the app...                          
INFO[0000] this is the random key size187               
INFO[0000] this is the random value size828720          
0  /  1 request completed
1  /  1 request completed

... (hangs here for a while) ...

INFO[0160] stopping member updater                      
INFO[0160] done                                          
average ms per get=0 
average ms per put=150002 
get per sec=-Inf 
operations completed=1 
put per sec=0.6666576526329919 
seconds to complete=150.002028185 
time for gets=0 
time for puts=150.002028185
INFO[0160] ms latency for puts                          
1.5e+05-1.5e+05  100%  █▏  1
INFO[0160] ms latency for gets 

The niovakv_server's ctl-interface reports the following after the above:

{
        "raft_client_root_entry" : [
                {
                        "raft-uuid" : "   ^",
                        "client-uuid" : "",
                        "leader-uuid" : "776cab6a-0fea-11ec-87e0-1b1bbe9ed57b",
                        "state" : "client",
                        "default-request-timeout-sec" : 60,
                        "commit-latency-msec" : {},
                        "read-latency-msec" : {},
                        "leader-viable" : true,
                        "leader-alive-cnt" : 71,
                        "last-request-sent" : "Thu Jan 01 00:00:00 UTC 1970",
                        "last-request-ack" : "Thu Jan 01 00:00:00 UTC 1970",
                        "recent-ops-wr" : [  
                                {
                                        "sub-app-user-id" : "6ec6f9c2-9008-4000-8491-9632f4732774:0:0:0:1",
                                        "rpc-id" : 8906007806612602918,
                                        "rpc-user-tag" : 0,
                                        "blocking" : true,
                                        "status" : "Communication error on send",
                                        "server" : "0.0.0.0:0",                                       <---------------------------------------------
                                        "submitted" : "Tue Sep 21 17:28:23 UTC 2021",
                                        "attempts" : 0,
                                        "completion-time-ms" : 0,
                                        "timeout-ms" : 60000,
                                        "reply-size" : 0,
                                        "op" : "write"
                                }
                        ],
                        "recent-ops-rd" : [],
                        "pending-ops" : []
                }
        ],
        "raft_net_bulk_recovery_info" : {},
        "raft_net_info" : {
                "ignore_timer_events" : false
        }
}

raft-uuid and client-uuid values being reported incorrectly in the ctl-interface

When using lkvt connected directly to pmdb, the ctl-interface of the client shows the following:

        "raft_client_root_entry" : [
                {
                        "raft-uuid" : "<8E>^^^^B",
                        "client-uuid" : "W^^^]^B",
                        "leader-uuid" : "337fda78-d7bb-11ec-9592-2ff3607520ea",
                        "state" : "client",
                        "default-request-timeout-sec" : 60,
                        "commit-latency-msec" : {},
                        "read-latency-msec" : {},
                        "leader-viable" : true,
                        "leader-alive-cnt" : 21,
                        "last-request-sent" : "Thu Jan 01 00:00:00 UTC 1970",
                        "last-request-ack" : "Thu Jan 01 00:00:00 UTC 1970",
                        "recent-ops-wr" : [],
                        "recent-ops-rd" : [],
                        "pending-ops" : []
                }
        ],

Allowing raft client activity even if there's no defined ctl-svc entry?

I'm wondering if we should create a ctl-svc entry for a client whose info had not been declared at server startup time. This would reinstate some of the functionality which existed prior to TCP support, while still (hopefully) allowing TCP operation to function properly.

If not, we need to address the comment in raft_server_client_recv_ignore_request(), which incorrectly states:

 /* Lookup the client in the ctl-svc-node tree - existence is not                                     
   * mandatory.                                                                                        
   */

basic_leader_election2 recipe fails while electing a leader

The basic_leader_election2 recipe is failing continuously on the niova-core master branch. It fails while electing a leader.

Error on console:

TASK [basic_leader_election2: Wait until leader election happens.] ****************************************************************************************************************************************
Thursday 13 January 2022  08:41:41 +0000 (0:00:00.064)       0:00:16.018 ****** 
ERROR! A worker was found in a dead state

And the error from pmdb server logs:

<21057754.661246683:warn:raft-reference-:env_parse@201> env-var NIOVA_INOTIFY_BASE_PATH value /home/makshata/holon_log/98a6d294-744c-11ec-ba45-87528381a0dc/ctl-interface applied from environment
<21057754.661311728:warn:raft-reference-:env_parse@201> env-var NIOVA_LOCAL_CTL_SVC_DIR value /home/makshata/holon_log/98a6d294-744c-11ec-ba45-87528381a0dc/configs applied from environment
<21057754.855329109:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057755.038581618:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057755.335830090:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057755.536117475:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057755.746356631:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057755.905560337:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057756.085834968:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057756.310067763:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057756.543342434:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057756.734550075:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057756.960809726:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057757.248043264:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057757.426285901:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057757.630489983:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057757.888717280:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057758.136951212:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057758.293212469:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057758.585437112:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057758.848726960:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057759.124964410:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057759.324255956:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057759.543424201:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057759.694657303:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057759.924861058:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057760.122084118:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057760.398276729:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057760.623478066:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057760.874726445:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057761.030914781:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057761.296111184:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057761.563381553:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057761.826651712:warn:raft-reference-:raft_server_become_candidate@1962> r:P et[s:u]=0:0 ei[s:u]=-1:-1 ht=0 hs=1 c=-1 la=-1:-1 lck=-1 v=00000000-0000-0000-0000-000000000000 l= 
<21057761.827759377:warn:raft-reference-:raft_server_become_candidate@1962> r:C et[s:u]=0:0 ei[s:u]=-1:-1 ht=1 hs=2 c=-1 la=-1:-1 lck=-1 v=98f1fe2c-744c-11ec-80cb-3351cd5aaf2d l= 
<21057761.828802377:warn:raft-reference-:raft_server_candidate_becomes_leader@2286> r:L et[s:u]=0:1 ei[s:u]=-1:0 ht=1 hs=2 c=-1 la=-1:-1 lck=-1 v=98f1fe2c-744c-11ec-80cb-3351cd5aaf2d l= 
<21057761.859106449:fatal:raft-reference-:rst_sm_handler_commit@180> failed assertion: rtdb->rtdb_op == RAFT_TEST_DATA_OP_WRITE
#0	0x0000556194aec2fd in niova_backtrace_dump () at src/niova_backtrace.c:66
#1	0x0000556194aee8f2 in rst_sm_handler_commit () at test/raft-reference-server.c:180
#2	0x0000556194aee8f2 in raft_server_test_rst_sm_handler () at test/raft-reference-server.c:532
#3	0x00007f664053b750 in raft_server_state_machine_apply () at src/raft_server.c:4705
#4	0x00007f664053e076 in raft_server_sm_apply_evp_cb () at src/raft_server.c:4820
#5	0x00007f66407e1925 in epoll_mgr_wait_and_process_events () at src/epoll_mgr.c:605
#6	0x00007f6640554ac5 in raft_server_main_loop () at src/raft_server.c:5744
#7	0x00007f6640554ac5 in raft_server_instance_run () at src/raft_server.c:5931
#8	0x0000556194aec07a in main () at test/raft-reference-server.c:632

Please find the attached .zip file for the logs.
basic_leader_election2.zip
