cocalele / pureflash Goto Github PK

A ServerSAN storage system designed for flash device

License: GNU General Public License v3.0

CMake 0.29% C 87.18% C++ 11.98% Shell 0.48% Makefile 0.02% Python 0.01% Dockerfile 0.04%

pureflash's Introduction

For Chinese version, please visist 中文版README

1. What's PureFlash

PureFlash is an open source ServerSAN implementation, that is, through a large number of general-purpose servers, plus PureFlash software system, to construct a set of distributed SAN storage that can meet the various business needs of enterprises.

The idea behind PureFlash comes from the fully hardware-accelerated flash array S5, so while PureFlash itself is a software-only implementation, its storage protocol is highly hardware-friendly. It can be considered that PureFlash's protocol is the NVMe protocol plus cloud storage feature enhancements, including snapshots, replicas, shards, cluster hot upgrades and other capabilities.

2. Why need a new ServerSAN software?

PureFlash is a storage system designed for the all-flash era. At present, the application of SSD disks is becoming more and more extensive, and there is a trend of fully replacing HDDs. The significant difference between SSD and HDD is the performance difference, which is also the most direct difference in user experience, and with the popularity of NVMe interface, the difference between the two is getting bigger and bigger, and this nearly hundredfold difference in quantity is enough to bring about a qualitative change in architecture design. For example, the performance of HDDs is very low, far lower than the performance capabilities of CPUs and networks, so the system design criterion is to maximize the performance of HDDs, and to achieve this goal can be at the expense of CPU and other resources. In the NVMe era, the performance relationship has been completely reversed, and the disk is no longer the bottleneck, but the CPU and network have become the bottleneck of the system. That method of consuming CPU to optimize IO is counterproductive.

Therefore, we need a new storage system architecture to fully exploit the capabilities of SSDs and improve the efficiency of the system. PureFlash is designed to simplify IO stack, separate data path and control path, and prioritize fast path as the basic principles to ensure high performance and high reliability, and provide block storage core capabilities in the cloud computing era.

3. Software design

Almost all current distributed storage systems have a very deep software stack, from the client software to the final server-side SSD disk, the IO path is very long. This deep software stack consumes a lot of system computing resources on the one hand, and on the other hand, the performance advantages of SSDs are wiped out. PureFlash is designed with the following principles in mind:

"Less is more", remove the complex logic on the IO path, use the unique BoB (Block over Bock) structure, and minimize the hierarchy
"Resource-centric", around CPU resources, SSD resource planning software structure, number of threads. Instead of planning according to the usual needs of software code logic
"Control/Data Separation", the control part is developed in Java, and the data path is developed in C++, each taking its own strengths

In addition, PureFlash "uses TCP in RDMA mode" on the network model, instead of the usual "use RDMA as faster TCP", RDMA must configure the one-sided API and the two-sided API correctly according to business needs. This not only makes RDMA used correctly, but also greatly improves the efficiency of TCP use.

Here is the structure diagram of our system:

The whole system include 5 modules (View graph with tabstop=4 and monospaced font)

			   
                                                            +---------------+
                                                            |               |
                                                       +--->+  MetaDB       |
                                                       |    |  (HA DB)      |
                             +------------------+      |    +---------------+
                             |                  +------+
                             | pfconductor      |           +---------------+
                        +---->  (Max 5 nodes)   +----------->               |
                        |    +--------+---------+           | Zookeeper     |
                        |             |                     | (3 nodes)     |
                        |             |                     +------^--------+
+-------------------+   |             |                            |
|                   +---+    +--------v---------+                  |
| pfbd/pfkd/tcmu    |        |                  |                  |
| (User and kernel  +------->+ pfs              +------------------+
| space client)     |        | (Max 1024 nodes) |
+-------------------+        +------------------+

3.1 pfs, PureFlash Store

This module is the storage service daemon that provides all data services, including:

SSD disk space management
Network Interface Services (RDMA and TCP protocols)
IO request processing

A PureFlash cluster can support up to 1024 pfs storage nodes. All PFS provide services to the outside world, so all nodes are working in the active state.

3.2 pfconductor

This module is the cluster control module. A production deployment should have at least 2 pfconductor nodes (up to 5). Key features include:
1) Cluster discovery and state maintenance, including the activity of each node, the activity of each SSD, and the capacity
2) Respond to users' management requests, create volumes, snapshots, tenants, etc
3) Cluster operation control, volume opening/closing, runtime fault handling

This module is programmed in Java, and the code repository URL： https://github.com/cocalele/pfconductor

3.3 Zookeeper

Zookeeper is a module in the cluster that implements the Paxos protocol to solve the network partitioning problem. All pfconductor and pfs instances are registered with zookeeper so that active pfconductor can discover other members in the entire cluster.

3.4 MetaDB

MetaDB is used to hold cluster metadata, and we use MariaDB here. Production deployment requires the Galaera DB plug-in to ensure it's HA.

client application

There are two types of client interfaces: user mode and kernel mode. User mode is accessed by applications in the form of APIs, which are located in libpfbd.

3.5.1 pfdd

pfdd is a dd-like tool, but has access to the PureFlash volume，https://github.com/cocalele/PureFlash/blob/master/common/src/pf_pfdd.cpp

3.5.2 fio

A FIO branch that supports PFBD. Can be used to test PureFlash with direct access to PureFlash volume。repository URL：https://github.com/cocalele/fio.git

3.5.3 qemu

A qemu branch with PFBD enabled, support to access PureFlash volume from VM. repository URL: https://github.com/cocalele/qemu.git

3.5.4 kernel driver

PureFlash provides a free Linux kernel mode driver, which can directly access pfbd volumes as block devices on bare-metal machines, and then format them into arbitrary file systems, which can be accessed by any application without API adaptation.

The kernel driver is ideal for Container PV and database scenarios.

3.5.5 nbd

A nbd implementation to support access PureFlash volume as nbd device， repository URL： https://gitee.com/cocalele/pfs-nbd.git

After compile, you can attach a volume like bellow:

    # pfsnbd  /dev/nbd3 test_v1

3.5.6 iSCSI

A LIO backend implementation to use PureFlash volume as LIO backend device，so it ban be accessed via iSCSI. repository URL：https://gitee.com/cocalele/tcmu-runner.git

networks ports

49162 store node TCP port
49160 store node RDMA port
49180 conductor HTTP port
49181 store node HTTP port

Try PureFlash

the easiest way to try PureFlash is to use docker. Suppose you have a NVMe SSD, e.g. nvme1n1, make sure the data is no more needed.

# dd if=/dev/zero of=/dev/nvme1n1 bs=1M count=100 oflag=direct
# docker pull pureflash/pureflash:latest
# docker run -ti --rm  --env PFS_DISKS=/dev/nvme1n1 --ulimit core=-1 --privileged  -e TZ=Asia/Shanghai  --network host  pureflash/pureflash:latest
# pfcli list_store
+----+---------------+--------+
| Id | Management IP | Status |
+----+---------------+--------+
|  1 |     127.0.0.1 |     OK |
+----+---------------+--------+
 
# pfcli list_disk
+----------+--------------------------------------+--------+
| Store ID |                 uuid                 | Status |
+----------+--------------------------------------+--------+
|        1 | 9ae5b25f-a1b7-4b8d-9fd0-54b578578333 |     OK |
+----------+--------------------------------------+--------+

#let's create a volume
# pfcli create_volume -v test_v1 -s 128G --rep 1

#run fio test
# /opt/pureflash/fio -name=test -ioengine=pfbd -volume=test_v1 -iodepth=16  -rw=randwrite -size=128G -bs=4k -direct=1

pureflash's People

Contributors

Stargazers

Watchers

pureflash's Issues

repo:PureFlash Global LMT vs Private LMT

Global

每个node自己加锁操作的话，数据量最大会有16M，性能不会好
使用LMT owner协助，会好一些
RDMA 直接操作owner的内存是否可行？即使能操作内存，也无法控制redolog的写入

Private

哪个volume属于哪个node，相对要是固定的，灵活性不够，
故障迁移，比如S1故障，上面的两个volume A, B打算分别迁往 S2, S3，需要的操作步骤：
1） S2 对S1 meta加锁，
2） S2读取S1 meta, 从中摘出volume A的LMT，
3）将摘除了volume A的剩余 LMT 写入S1 meta
4) 将添加了volume A的自己的LMT 写入S2 (自己） meta
5) 请求jconductor，将volume A的owner 改成S2
6) S3重复上面的过程，迁移volume B。

PfspdkEngine::submit_io固定往qpair[1]中提交io

当pfs没有prepare volume时，在进行tcp shake过程中出现coredump

rdma 轮询模式处理cq的支持

压力测试过程中，默认的事件驱动方式导致时延很高，通过轮询方式可以有效减少IO时延，通过对比iops大约提升13%

单节点双副本的支持

root@lab101:~# pfcli list_disk
+----------+--------------------------------------+--------+
| Store ID |                 uuid                 | Status |
+----------+--------------------------------------+--------+
|        1 | 87f5753a-d4b4-44dd-93cf-f67430d8ce67 |     OK |
|        1 | d992b5ee-ae23-4a1e-8d85-f0566c056390 |     OK |
+----------+--------------------------------------+--------+

root@lab101:~#  pfcli create_volume -v test_v1 -s 64G --rep 2
[main] ERROR com.netbric.s5.conductor.rpc.SimpleHttpRpc - Failed http GET http://127.0.0.1:49180/s5c/?op=create_volume&volume_name=test_v1&size=68719476736&rep_cnt=2
java.io.IOException: Failed RPC invoke, code:2, reason:only 1 stores available but replica is 2
	at com.netbric.s5.conductor.rpc.SimpleHttpRpc.invokeGET(SimpleHttpRpc.java:42)
	at com.netbric.s5.conductor.rpc.SimpleHttpRpc.invokeConductor(SimpleHttpRpc.java:75)
	at com.netbric.s5.cli.CliMain.cmd_create_volume(CliMain.java:201)
	at com.netbric.s5.cli.CliMain.access$000(CliMain.java:19)
	at com.netbric.s5.cli.CliMain$1.run(CliMain.java:92)
	at com.netbric.s5.cli.CliMain.main(CliMain.java:181)
[main] ERROR com.netbric.s5.cli.CliMain - Failed: Failed RPC invoke, code:2, reason:only 1 stores available but replica is 2

设置卷的副本数的时候，单个节点的时候好像是不允许多副本的，这个地方单节点多副本在某些场景上是有需求的,或者说没必要做这个限制或者能够通过参数来控制的

pfs 在shake sent中申请io内存失败后，会进行conn close操作导致空指针访问

fail to compile on centos7

Zookeeper compile fail,

/home/lele/eclipse-workspace/PureFlash/build/../s5afs/thirdParty/zookeeper/zookeeper-client/zookeeper-client-c/tests/TestLogClientEnv.cc:38: undefined reference to `CppUnit::Message::Message(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
s5afs/thirdParty/zookeeper/zookeeper-client/zookeeper-client-c/CMakeFiles/zktest.dir/tests/TestLogClientEnv.cc.o: In function `Zookeeper_logClientEnv::testLogClientEnv()':
/home/lele/eclipse-workspace/PureFlash/build/../s5afs/thirdParty/zookeeper/zookeeper-client/zookeeper-client-c/tests/TestLogClientEnv.cc:49: undefined reference to `CppUnit::SourceLine::SourceLine(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)'

in fact I can find the symbol in file /usr/lib64/libcppunit-1.12.so

 objdump -TC /usr/lib64/libcppunit-1.12.so | grep CppUnit::SourceLine::SourceLine                                                                                            00000000000208c0 g    DF .text  0000000000000025  Base        CppUnit::SourceLine::SourceLine()
0000000000020950 g    DF .text  000000000000005a  Base        CppUnit::SourceLine::SourceLine(std::string const&, int)
00000000000208f0 g    DF .text  000000000000005f  Base        CppUnit::SourceLine::SourceLine(CppUnit::SourceLine const&)
00000000000208c0 g    DF .text  0000000000000025  Base        CppUnit::SourceLine::SourceLine()
0000000000020950 g    DF .text  000000000000005a  Base        CppUnit::SourceLine::SourceLine(std::string const&, int)
00000000000208f0 g    DF .text  000000000000005f  Base        CppUnit::SourceLine::SourceLine(CppUnit::SourceLine const&)

Infinity loop to connection to failing node

如果在recovery期间slave节点crash，那么在primary节点上每次IO的时候就会持续连接slave。

这是因为在recovery开始的时候，primary节点上slave replica被置成了RECOVERYING状态。而在这个状态的副本，仍旧会发送replicate_write

Optimise build and test process

Need a single script to build and run project. It will be convenient.
For now, programs run as kernel model, not convenient for debug purpose. Any planns to run as user space program?

trace spdk io latency

spdk引擎同步写接口死循环

disk layout

磁盘的元数据区大小可以通过meta_size 在配置文件中配置。缺省40GB

#define META_RESERVE_SIZE (40LL<<30) //40GB, can be config in conf
#define MIN_META_RESERVE_SIZE (4LL<<30) //4GB, can be config in conf

#define S5_VERSION 0x00020000

支持PFS2需要进行下面的修改：

版本号NBS5需要修改以示不同，改为0x00030000
需要在 Redolog后面预留一个4K page, 作为ATS锁区域。（检查元数据区最小长度是否够，配置文件中的元数据长度设置作用？）
在原来的元数据区，2.5GB 以后的位置是空的，这里开辟一片锁区域，大小为32M. 每4K为一个锁。其中第一个作为global meta lock，其他的保留作为volume lock

  SSD head layout in LBA(4096 byte):
  0: length 1 LBA, head page
  LBA 1: length 1 LBA, free obj queue meta
  LBA 2: ~ 8193: length 8192 LBA: free obj queue data,
     4 byte per object item, 8192 page can contains
         8192 page x 4096 byte per page / 4 byte per item = 8 Million items
     while one object is 4M or 16M, 4 Million items means we can support disk size 32T or 128T
  offset 64MB: length 1 LBA, trim obj queue meta
  offset 64MB + 1LBA: ~ 64MB + 8192 LBA, length 8192 LBA: trim obj queue data, same size as free obj queue data
  offset 128MB: length 512MB,  lmt map, at worst case, each lmt_key and lmt_enty mapped one to one. 8 Million items need
          8M x ( 32 Byte key + 32 Byte entry) = 512M byte = 64K
  offset 1GByte - 4096, md5 of SSD meta
  offset 1G: length 1GB, duplicate of first 1G area
  offset 2G: length 512MB, redo log
  offset 2G+512M, length 32MB,  ATS lock area
     - 1st LBA(4K)  global meta lock
     - others, reserve as volume lock

当存储节点启动时要检查磁盘是否为新盘，对新盘需要进行创建元数据操作。由于共享盘挂在多个存储节点上，为避免多个节点同时对盘进行初始化，初始化前每个节点要获取global meta lock。流程如下：
1. 判断盘是否为新盘，如果是：
    a. 获取global meta lock
    b. 再次判断盘是否为新盘，如果已经被其他节点初始化了，则结束初始化
    c. 初始化global meta
    d. 释放global meta lock

pfs中存在多线程使用同一个qpair的问题，这是不允许的

pfcli return “NoSuchMethodError”

when i deploy a testing environment on my CentOS-8 VM, and use "pfcli delete_volume -v test" to delete a volume, it return error:

Exception in thread "main" java.lang.NoSuchMethodError: java.net.URLEncoder.encode(Ljava/lang/String;Ljava/nio/charset/Charset;)Ljava/lang/String;
	at com.netbric.s5.conductor.rpc.SimpleHttpRpc.invokeConductor(SimpleHttpRpc.java:73)
	at com.netbric.s5.cli.CliMain.lambda$main$0(CliMain.java:102)
	at com.netbric.s5.cli.CliMain.main(CliMain.java:181)

it may caused by different version of Package java.net. And can be fixed by change the param of URLEncoder.encode:

- sb.append("&").append(args[i]).append("=").append(URLEncoder.encode(args[i+1].toString(), StandardCharsets.UTF_8));
+ sb.append("&").append(args[i]).append("=").append(URLEncoder.encode(args[i+1].toString(), "UTF-8"));

or using the right version.
https://nowjava.com/docs/java-jdk-14/api/java.base/java/net/URLEncoder.html

Project merge

As PureFlash is combined of different projects, but only PureFlash now, need to do some changes.

cmake files:
Cmake project files need to remove unneccesary project settings.
thirdparty:
A common thirdparty is needed. move thirdparty from s5afs to root directory
compiler flags:
Need a unique compiler flags for all project, suggested to put it in root CMakeLists

err_handle 线程coredump

Zookeeper Build Problem

Need a support for libcppunit-1.14 in recent distro like ubuntu 18.04. Because cppunit-1.14 has removed cppunit.m4 from it. We need to update zookeeper to a new version, or just backport their change for cmake building support or just to support pkg-config when there is no cppunit.m4 find. see https://jira.apache.org/jira/browse/ZOOKEEPER-3034 for more details

duplicated IOCB memory pool ?

class PfAppCtx
{
public:
	ObjectMemoryPool<PfIoDesc> iod_pool;
}
struct PfClientAppCtx : public PfAppCtx
{
	ObjectMemoryPool<PfClientIocb> iocb_pool;
}

The two memory pools are duplicated or not ?

pfs-nbd编译报错

git clone  https://gitee.com/cocalele/pfs-nbd.git
./autogen.sh
./configure --enable-pfsnbd
./make

报错信息：

root@lab101:~/pfs-nbd# make
make  all-recursive
make[1]: Entering directory '/root/pfs-nbd'
Making all in .
make[2]: Entering directory '/root/pfs-nbd'
make[2]: Leaving directory '/root/pfs-nbd'
Making all in man
make[2]: Entering directory '/root/pfs-nbd/man'
make[2]: *** No rule to make target 'all'.  Stop.
make[2]: Leaving directory '/root/pfs-nbd/man'
make[1]: *** [Makefile:914: all-recursive] Error 1
make[1]: Leaving directory '/root/pfs-nbd'
make: *** [Makefile:528: all] Error 2

环境为20.04

root@lab101:~/pfs-nbd# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

内核客户端从哪里获取

根据文档内的：
3.5.4 内核态驱动
PureFlash提供了免费的内核态驱动，在物理机上可以直接将pfbd卷呈现成块设备，然后可以格式化成任意的文件系统，任何应用无需API适配就可以访问。
内核驱动非常适合容器PV和数据库场景使用。

这个内核驱动在哪里可以获取到

delete object in RECOVERYING/COPYING state on system up

objects in RECOVERYING/COPYING are stale object cause by last power outage. These should be deleted in load_meta_data

work flow

工作全流程
1. pfs初始化
对于共享盘，在配置文件里面以配置shared=1表示，比如：

[tray.0]
   dev = /dev/sdf
   shared=1

pfs首先检查是否为新盘，如果是，则按照下面步骤对盘进行初始化。初始化也即创建初始磁盘元数据，元数据布局按issue: disk layout 的描述进行。
1) 判断盘是否为新盘，如果是：
a. 获取global meta lock
b. 再次判断盘是否为新盘，如果已经被其他节点初始化了，则结束初始化
c. 初始化global meta
d. 释放global meta lock

mkfs.pfs2命令
盘可以是动态加入的，mkfs.pfs2命令就起到这个作用。这个命令执行时通知本机的store服务，把指定的盘加入到集群。
加入集群的盘在服务下次启动时应该能自动加入。因此需要动态需要对配置文件进行自动修改。
pureflash现在使用的配置文件是ini格式，不提供通过代码进行动态修改的方法。【实现不了就需要运维人员修改配置文件】

启动一个线程作为owner线程？

向zk注册共享盘，注册写入的信息包括2部分，1）自己的store_id，以声称自己和这个盘的联接关系 2） owner 锁，以竞争这个盘的owner

	    /cluster1/shared_disks/
	                    + <disk1_uuid>
	                           + <store1_id>
	                           + <store2_id>
	                           + owner_store
	                                    + <ephemeral_node1 := store_id>
	                                    + <ephemeral_node2 := store_id>

创建volume/file
通过pfcli创建volume，引入一个新的命令conductor op: pfs2_create_file，参数包括file_name, size
这里没有指定disk相关信息，
open file
客户端需要向conductor发请求打开pfs2文件，这个像普通的volume一样。但是具体操作上有区别
1. 使用open_pfs2 命令（是否有必要？ conductor通过元数据可以判断是普通的volume还是pfs2 file）
2. conductor返回的内容不是普通volume open的内容，而是prepare volume时向store发送的volume信息。这是为了在client端实现IO的下发，client需要知道每个shard的具体信息，主要是disk uuid 以及盘符。（client按照盘符打开盘后，要比较uuid是否符合预期）
IO 流程
为了在client将IO直接下发到共享盘，而不是经过store服务转发，需要将Pfs进程中PfFlashStore， PfIoEngine 这两个类以及相关依赖复制到client代码。
然后在PfClientVolume::process_event函数中处理EVT_IO_REQ事件时将IO请求发送给PfFlashStore。这里的挑战比较大，server端用的数据结构（比如PfServerIocb)和client差异很大。
元数据修改流程
元数据修改需要远程调用owner节点完成。网络请求可以有两个发送方法：
1. pfs里的慢速通道，通过HTTP请求完成。慢速通道是同步调用，编码容易。
2. pfs里的快速通道，通过IO请求完成，定义新的IO opcode。快速通道是异步调用
  通常在用户IO 过程中只会发生block allocation元数据请求。

delete block的操作发生在delete snapshot, delete volume的时候。这些请求由pfconductor发送，而不是client。

client watch zk上的owner_store节点，以获知owner的变化。

delete block的操作发生在delete snapshot, delete volume的时候。这些请求由pfconductor发送，而不是client。

create_snapshot
原来的逻辑创建快照时由conductor把meta_ver推送给store。现在没有store参与IO流程。
共享盘（RAC场景）不支持快照，thick provision ，只考虑单节点挂载pfs2文件。create snapshot把请求发给client，conductor。
client如何向client(qemu)发命令？
delete volume/snapshot
执行delete操作的时候，client进程可能已经不存在，必须由owner节点执行 lmt修改， trim。
如果client存在，删掉某个block后，需要client更新自己内存里的LMT。这种情况下，通知client，然后由client通知conductor去删除比较好。
client还是需要一个接受通知的能力！以及cli发现client的能力！
client注册机制
以volume为粒度，哪个client在访问哪个volume。可以使用zk进行注册

Give store device a formal and peculiar name

We have use 'store' as name of a storage device in AFS node. but store is too common a word, and cause chaos in later discussion and documentation.
We need a more formal and a peculiar name for store device, like the word ‘OSD’ used in ceph.

trim_proc线程出现段错误

需要元数据、空间管理、快照原理文档与讲解

缺少s5log.h的头文件

[root@lab3101 PureFlash]# grep "s5log.h" -R *
common/unittest/clt_socket.c:#include "s5log.h"
common/unittest/common_gtest.cpp:#include "s5log.h"
common/unittest/common_gtest2.cpp:#include "s5log.h"
common/unittest/session_clt_socket.c:#include "s5log.h"
common/unittest/srv_socket.c:#include "s5log.h"
common/unittest/test_s5session.cpp:#include "s5log.h"
common/unittest/test_s5sql.c:#include "s5log.h"
common/unittest/test_worker.c:#include "s5log.h"
S5bd/include/internal.h:#include "s5log.h"
S5bd/include/s5_context.h:#include "s5log.h"
S5bd/include/s5imagectx.h:#include "s5log.h"
S5bd/src/idgenerator.c:#include "s5log.h"
S5bd/src/s5session.c:#include "s5log.h"
grep: thirdParty/mongoose/examples/mbed/mongoose: warning: recursive directory loop
[root@lab3101 PureFlash]# find ./ -name "s5log.h"

[root@lab3101 PureFlash]# cat S5bd/README.txt
This is S5bd project. Block device driver for S5 storage.

引用了s5log.h，但是缺少这个文件

cluster meta data

cluster元数据，包括

zookeeper上集群注册规则，这部分影响到节点启动后的注册行为
MetaDB 里的数据，这部分影响到pfconductor 创建volume， prepare volume的操作。

有三种可能的方法：
一、所有的store 使用同一个store_id，这样在zk上仍然是一个逻辑store node, 注册到zk上时节点只是把自己的IP作为一个数据portal增加上去。
这个方法好处是pfconductor不需要做任何的改动，看到的视图和原来完全一样。
坏处在于：
1）掩盖了多个物理节点的实际情况
2）无法支持每个节点私有数据盘

二、保持每个store ID唯一，允许共享盘出现在多个store 下面
- 为表示区别，要标识这个盘是shared。
这个方法和现有的注册机制最相似。但是conductor需要处理重复出现的盘。原来disk uuid在metadb里面是不允许重复的，现在出现了重复。conductor的处理原则：
1) conductor从zk上发现shared disk时，检查这个盘是否已经在t_tray表里被其他store拥有，如果是的话，不在创建新的t_tray记录，而是将新的store做为这个盘的共同拥有人加入到coowner字段。
2) coowner字段是t_tray表新增加的
这个方法的缺点是无法清晰呈现盘和节点的关系，

三、为共享盘建立独立的注册管理体系

在zk上创建这样的结构：

    /cluster1/shared_disks/
                    + <disk1_uuid>
                           + stores
                           |   + <store1_id>
                           |    |   + dev_name := <device name, e.g. /dev/sdd>
                           |   + <store2_id>
                           |       + dev_name  := <device name>
                           + owner_store
                                    + <ephemeral_node1 := store_id>
                                    + <ephemeral_node2 := store_id>

这样共享盘成为独立的一套信息，在metadb里面也为其创建单独的t_shared_disk表，而不是共享原来的t_tray表。
从代码结构上，这样做对原来的代码影响小，不容易引入bug。共享盘的代码独立于原来的代码，包括创建shard时的选盘逻辑，共享盘和原来的逻辑完全分开。
这里顺便也解决了共享盘的Owner问题，只有owner store可以修改这个盘的元数据。

只有单副本的shard可以选择共享盘作为承载盘。

pfs存在不使用的线程池

了解一下咱们这个项目和purestorage公司的产品有关系吗

purestorage公司网站：https://www.purestorage.com/cn/；这是该公司简介：
Pure Storage是一家全球领先的全闪存存储解决方案提供商，成立于2009年10月，总部位于美国加利福尼亚州山景城。该公司专注于研发和销售企业级的全固态存储阵列以及相关数据服务，旨在帮助企业实现数据中心现代化，并通过简化数据存储架构来提升性能、降低运营成本及复杂性。

Pure Storage的核心产品线是Pure FlashArray系列，它以全闪存技术为基础，结合创新的数据压缩、重复数据删除和其他优化功能，为关键业务应用提供极高的I/O性能、低延迟以及出色的可用性和可扩展性。此外，Pure Storage还提供了FlashBlade产品，专为大规模并行非结构化数据处理设计，适用于大数据分析、机器学习和内容库等场景。

除了硬件产品，Pure Storage也发展了软件和服务方面的产品组合，包括数据保护、云集成解决方案（如支持混合云和多云环境下的数据移动与管理）以及用于容器环境的存储解决方案（如Portworx by Pure Storage）。

随着云计算和数字化转型的深入，Pure Storage持续致力于通过其先进的存储技术推动企业的数据战略，帮助客户适应不断变化的市场和技术趋势，实现数据驱动的业务增长和效率提升。

SPDK引擎下，对于网络poller线程避免在epoll_wait进行等待

相比较采用非SCHED_FIFO+等待epoll_wait这种方式，而采用SCHED_FIFO和不等待epoll_wait大概有50%的性能提升

建议/var/log/pfs.log的启动日志里面可以把git commit id加上，再使用git checkou “ commit id”，方便定位log对应代码的行数

编译报错/usr/bin/ld: pf_zk_client.cpp:(.text+0x1e2): undefined reference to `zookeeper_init'

按照编译说明里面：
https://github.com/cocalele/PureFlash/blob/master/build_and_run.txt
步骤一步步操作，x86架构，ubuntu22.04环境

在
4) run ninja to do build
$ ninja
步骤的时候会出错，提示如下：

[53/55] Linking CXX executable bin/pfs
FAILED: bin/pfs
: && /usr/bin/c++ -Wall -Wconversion -Wno-sign-compare  -fms-extensions -Wno-variadic-macros -Wno-format-truncation -I/usr/include -D_XOPEN_SOURCE  -O3 -DNDEBUG -rdynamic pfs/CMakeFiles/pfs.dir/src/pf_cluster.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_flash_store.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_main.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_s5message.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_server.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_dispatcher.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_md5.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_redolog.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_block_tray.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_replica.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_restful_server.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_restful_api.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_volume.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_error_handler.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_replicator.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_bitmap.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_bgtask_manager.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_scrub.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_atslock.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_spdk_engine.cpp.o pfs/CMakeFiles/pfs.dir/src/pf_rdma_server.cpp.o pfs/CMakeFiles/pfs.dir/__/thirdParty/mongoose/mongoose.c.o -o bin/pfs -L/root/PureFlash/pre_build_libs/ubuntu_22.04_x86_64   -L/root/PureFlash/build_deb/bin   -L/root/PureFlash/thirdParty/spdk/build/lib   -L/root/PureFlash/thirdParty/spdk/dpdk/build/lib -lrdmacm  -libverbs  -lpthread  -lzookeeper_mt  -lhashtable  -luuid  -lspdk_nvme  -lspdk_env_dpdk  -lspdk_util  -lspdk_log  -lspdk_sock  -lspdk_trace  -lspdk_json  -lspdk_jsonrpc  -lspdk_rpc  -lrte_eal  -lrte_mempool  -lrte_ring  -lrte_telemetry  -lrte_kvargs  -lrte_pci  -lrte_bus_pci  -lrte_mempool_ring  bin/libs5common.a  -laio  -lcurl  ../thirdParty/isa-l_crypto/.libs/libisal_crypto.a  -Wl,-Bstatic  -lsgutils2  -Wl,-Bdynamic  -ldl  -lrdmacm  -libverbs  -lpthread && cd /root/PureFlash/build_deb/pfs && cp -rpfu /root/PureFlash/pfs/pfs_template.conf /root/PureFlash/build_deb
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `PfZkClient::init(char const*, int, char const*)':
pf_zk_client.cpp:(.text+0x1a2): undefined reference to `zoo_set_debug_level'
/usr/bin/ld: pf_zk_client.cpp:(.text+0x1e2): undefined reference to `zookeeper_init'
/usr/bin/ld: pf_zk_client.cpp:(.text+0x1f2): undefined reference to `ZOO_CONNECTED_STATE'
/usr/bin/ld: pf_zk_client.cpp:(.text+0x214): undefined reference to `zoo_state'
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `PfZkClient::~PfZkClient()':
pf_zk_client.cpp:(.text+0x2b1): undefined reference to `zookeeper_close'
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `PfZkClient::get_data_port[abi:cxx11](int, int)':
pf_zk_client.cpp:(.text+0x394): undefined reference to `zoo_get_children'
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `PfZkClient::watch_disk_owner(char const*, std::function<void (char const*)>)':
pf_zk_client.cpp:(.text+0x6e0): undefined reference to `zoo_wget_children'
/usr/bin/ld: pf_zk_client.cpp:(.text+0x7fe): undefined reference to `zoo_get'
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `PfZkClient::create_node(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, char const*) [clone .localalias]':
pf_zk_client.cpp:(.text+0xc4b): undefined reference to `zoo_exists'
/usr/bin/ld: pf_zk_client.cpp:(.text+0xdb6): undefined reference to `ZOO_EPHEMERAL'
/usr/bin/ld: pf_zk_client.cpp:(.text+0xde0): undefined reference to `ZOO_OPEN_ACL_UNSAFE'
/usr/bin/ld: pf_zk_client.cpp:(.text+0xdf1): undefined reference to `zoo_create'
/usr/bin/ld: pf_zk_client.cpp:(.text+0xe5b): undefined reference to `zoo_exists'
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `PfZkClient::wait_lock(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*)':
pf_zk_client.cpp:(.text+0x158f): undefined reference to `ZOO_EPHEMERAL_SEQUENTIAL'
/usr/bin/ld: pf_zk_client.cpp:(.text+0x15cb): undefined reference to `ZOO_OPEN_ACL_UNSAFE'
/usr/bin/ld: pf_zk_client.cpp:(.text+0x15f0): undefined reference to `zoo_create'
/usr/bin/ld: pf_zk_client.cpp:(.text+0x186e): undefined reference to `zoo_wget_children'
/usr/bin/ld: pf_zk_client.cpp:(.text+0x1d95): undefined reference to `zoo_remove_watches'
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `std::_Function_handler<void (), PfZkClient::wait_lock(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
pf_zk_client.cpp:(.text+0x68): undefined reference to `deallocate_String_vector'
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `std::_Function_handler<void (), PfZkClient::get_data_port[abi:cxx11](int, int)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
pf_zk_client.cpp:(.text+0x78): undefined reference to `deallocate_String_vector'
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `std::_Function_handler<void (), PfZkClient::watch_disk_owner(char const*, std::function<void (char const*)>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
pf_zk_client.cpp:(.text+0x88): undefined reference to `deallocate_String_vector'
/usr/bin/ld: bin/libs5common.a(pf_zk_client.cpp.o): in function `PfZkClient::delete_node(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
pf_zk_client.cpp:(.text+0x310): undefined reference to `zoo_delete'
/usr/bin/ld: bin/libs5common.a(pf_client_api.cpp.o): in function `get_master_conductor_ip[abi:cxx11](char const*, char const*)':
pf_client_api.cpp:(.text+0x6a5d): undefined reference to `zoo_set_debug_level'
/usr/bin/ld: pf_client_api.cpp:(.text+0x6a74): undefined reference to `zookeeper_init'
/usr/bin/ld: pf_client_api.cpp:(.text+0x6b32): undefined reference to `zoo_state'
/usr/bin/ld: pf_client_api.cpp:(.text+0x6b3b): undefined reference to `ZOO_CONNECTED_STATE'
/usr/bin/ld: pf_client_api.cpp:(.text+0x6b7a): undefined reference to `zoo_get_children'
/usr/bin/ld: pf_client_api.cpp:(.text+0x6c73): undefined reference to `zoo_get'
/usr/bin/ld: bin/libs5common.a(pf_client_api.cpp.o): in function `std::_Function_handler<void (), get_master_conductor_ip[abi:cxx11](char const*, char const*)::{lambda()#1}>::_M_invoke(std::_Any_data const&)':
pf_client_api.cpp:(.text+0x518): undefined reference to `zookeeper_close'
/usr/bin/ld: bin/libs5common.a(pf_client_api.cpp.o): in function `std::_Function_handler<void (), get_master_conductor_ip[abi:cxx11](char const*, char const*)::{lambda()#2}>::_M_invoke(std::_Any_data const&)':
pf_client_api.cpp:(.text+0x528): undefined reference to `deallocate_String_vector'
collect2: error: ld returned 1 exit status
[54/55] Building CXX object common/CMakeFiles/pfdd.dir/src/pf_pfdd.cpp.o
/root/PureFlash/common/src/pf_pfdd.cpp: In function ‘int main(int, char**)’:
/root/PureFlash/common/src/pf_pfdd.cpp:266:17: warning: unused variable ‘offset_in_file’ [-Wunused-variable]
  266 |         int64_t offset_in_file = 0;
      |                 ^~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.

rdma connection hang during release

Thread 44 (LWP 66119 "vol_proc"):
#0  0x0000ffff53779df8 in ?? () from target:/lib/aarch64-linux-gnu/libc.so.6
#1  0x0000ffff5377c8fc in pthread_cond_wait () from target:/lib/aarch64-linux-gnu/libc.so.6
#2  0x0000ffff53cd74b4 in rdma_destroy_id () from target:/lib/aarch64-linux-gnu/librdmacm.so.1
#3  0x0000aaaacf6dc778 in PfRdmaConnection::~PfRdmaConnection (this=0xfff424000c60, __in_chrg=<optimized out>) at /root/v2/PureFlash/common/src/pf_rdma_connection.cpp:326
#4  0x0000aaaacf6dc810 in PfRdmaConnection::~PfRdmaConnection (this=0xfff424000c60, __in_chrg=<optimized out>) at /root/v2/PureFlash/common/src/pf_rdma_connection.cpp:332
#5  0x0000aaaacf6a10b4 in PfConnection::dec_ref (this=0xfff424000c60) at /root/v2/PureFlash/common/include/pf_connection.h:82
#6  0x0000aaaacf68fa50 in PfClientVolume::process_event (this=0xfffe3c000b70, event_type=6, arg_i=0, arg_p=0xfffe3c030e10) at /root/v2/PureFlash/common/src/pf_client_api.cpp:1094
#7  0x0000aaaacf68f1a0 in PfVolumeEventProc::process_event (this=0xfffe3c002e50, event_type=6, arg_i=0, arg_p=0xfffe3c030e10, arg_q=0xfffe3c000b70) at /root/v2/PureFlash/common/src/pf_client_api.cpp:964
#8  0x0000aaaacf6e0334 in thread_proc_eventq (arg=0xfffe3c002e50) at /root/v2/PureFlash/common/src/pf_event_thread.cpp:135
#9  0x0000ffff5377d5c8 in ?? () from target:/lib/aarch64-linux-gnu/libc.so.6
#10 0x0000ffff537e5d9c in ?? () from target:/lib/aarch64-linux-gnu/libc.so.6

stack_hang.txt

abuse of spinlock

Too many spinlocks,

in PfFixedSizeQueue , there's spinlock protected.
in PfEventQueue, there's also a spinlock.

each time post_event is called, the two spinlock will be both acquired.

int PfEventQueue::post_event(int type, int arg_i, void* arg_p, void* arg_q)
{
	//S5LOG_INFO("post_event %s into:%s", EventTypeToStr((S5EventType)type), name);
	{
		AutoSpinLock _l(&lock);
		int rc = current_queue->enqueue(S5Event{ type, arg_i, arg_p , arg_q});
		if(rc)
			return rc;
	}
	write(event_fd, &event_delta, sizeof(event_delta));
	return 0;
}

SPDK engine使用方式

PureFlash集成了SPDK以提升IO性能，如下为SPDK engine的使用方式（以/dev/nvme0n1为例说明）：

如果/dev/nvme0n1是新盘的话，清零磁盘头：
dd if=/dev/zero of=/dev/nvme0n1 bs=4K count=1 oflag=direct
当然如果/dev/nvme0n1不是第一次使用，则可以跳过这一步；
获取/dev/nvme0n1的PCIe Bus ID：

# ls -l /sys/class/block/*
/sys/class/block/nvme0n1 -> ../../devices/pci0000:00/0000:00:16.1/0000:0c:00.0/nvme/nvme0/nvme0n1

可以获取nvme0n1的Bus ID为0000:0c:00.0

使用SPDK接管nvme：

PCI_ALLOWED="0000:0c:00.0" ./PureFlash/thirdParty/spdk/scripts/setup.sh config

修改pfs.conf如下：

[cluster]
name=cluster1
[zookeeper]
ip=127.0.0.1:2181

[afs]
        mngt_ip= xxx
        id=1
        meta_size=10737418240
[engine]
        name=spdk
[tray.0]
   dev = trtype:PCIe traddr:0000:0c:00.0    # path of physical flash device
[port.0]
   ip= xxx
[rep_port.0]
   ip= xxx

与非SPDK模式相比，修改了engine.name为spdk，dev使用PCIe addr

[pfconductor] 同一个shard的副本放到了同一个盘上

上图中shard 0的两个副本放到了同一个盘上

默认pf event thread 和 pf_poller线程优先级都是配置的SCHED_FIFO，在spdk引擎下应避免pf_poller线程进行上下文切换

spdk测试中，pf_poller网络线程使用epoll_wait存在大量上下文切换，造成性能波动

post COW task to disk io thread

void PfFlashStore::do_cow_entry(lmt_key* key, lmt_entry *srcEntry, lmt_entry *dstEntry)
{//this function called in thread pool, not the store's event thread
	CowTask r;
	r.src_offset = srcEntry->offset;
	r.dst_offset = dstEntry->offset;
	r.size = COW_OBJ_SIZE;
	sem_init(&r.sem, 0, 0);

	r.buf = app_context.cow_buf_pool.alloc(COW_OBJ_SIZE);
	event_queue->post_event(EVT_COW_READ, 0, &r);
	sem_wait(&r.sem);
	if(unlikely(r.complete_status != PfMessageStatus::MSG_STATUS_SUCCESS))	{
		S5LOG_ERROR("COW read failed, status:%d", r.complete_status);
		goto cowfail;
	}

	event_queue->post_event(EVT_COW_WRITE, 0, &r);
	sem_wait(&r.sem);
	if(unlikely(r.complete_status != PfMessageStatus::MSG_STATUS_SUCCESS))	{
		S5LOG_ERROR("COW write failed, status:%d", r.complete_status);
		goto cowfail;
	}

I wonder is it necessary to post cow task to disk thread ?
The two blocks(src & dest) were reserved currently, and no other IO will occur until we release them. So, i think cow on current thread is feasible.

fio测试的问题-调整命令4K为4M后崩溃

#run fio test
# /opt/pureflash/fio -name=test -ioengine=pfbd -volume=test_v1 -iodepth=16  -rw=randwrite -size=128G -bs=4k -direct=1

根据readme内的提供的测试命令这个能够运行良好，可以看到磁盘也有流量

只修改上面的bs=4k修改为4M，然后再次跑测试

nvme磁盘没有流量了，docker的宿主盘有流量，运行一会后，就崩溃了

 /opt/pureflash/fio -name=test -ioengine=pfbd -volume=test_v1 -iodepth=16  -rw=randwrite -size=128G -bs=4M -direct=1

PFS2 feature list

1. 共享磁盘，将一个FC/iSCSI LUN挂载到多个计算节点上
2. 本地访问，每个节点上的VM通过本节点的存储服务访问存储，不经过额外的网络（不经过存储节点间网络，FC的话不经过IP网络）
  - client 访问时，以localhost访问，或者以真实IP访问都可以。
3. thin provision, 可以超卖
4. 高可用，可以通过任何节点访问任何volume
5. 支持快照，VM (volume) 粒度

run it as a unittest. use gtest/gmock to do, will add a submodule.
only check IO / volume meta data, without network part.
mock metadata initialize, only a few metadata is needed for testing purpose
write a memstore to store testing data. It's core part is a map<size_t offset, void *data>. It will be
very fast for testing purpose than write to a file.