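I adapted part of the factory_test.cc test to run on two real machines, and I seem to have run into a problem.
Environment: both machines run Ubuntu, with addresses 172.18.0.2 and 172.18.0.3, assigned ranks 0 and 1 respectively.
The machine with rank 0 runs the following code: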
// Mytest.cpp
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <memory>
#include <type_traits>
#include <variant>
#include <unistd.h>
#include <future>
#include <limits>
#include "fmt/format.h"
#include "gtest/gtest.h"
#include "yacl/link/context.h"
#include "yacl/link/link.h"
#include "yacl/link/factory.h"

class FactoryTest {
 public:
  FactoryTest() {
    static int desc_count = 0;
    contexts_.resize(2);
    yacl::link::ContextDesc desc;
    desc.id = fmt::format("world_{}", desc_count++);
    desc.brpc_retry_count = 20;
    desc.parties.push_back(yacl::link::ContextDesc::Party("alice", "172.18.0.2:63927"));
    desc.parties.push_back(yacl::link::ContextDesc::Party("bob", "172.18.0.3:63921"));
    auto create_brpc = [&](int self_rank) {
      contexts_[self_rank] = yacl::link::FactoryBrpc().CreateContext(desc, self_rank);
    };
    // Only the rank 0 context is created in this process;
    // rank 1 runs on the other machine.
    std::vector<std::future<void>> creates;
    creates.push_back(std::async(create_brpc, 0));
    for (auto& f : creates) {
      f.get();
    }
    std::cout << "Connect to Bob successfully\n";
  }

  void work() {
    // Send a greeting to the peer, then block until the peer's message arrives.
    auto test = [&](int self_rank) {
      int dst_rank = 1 - self_rank;
      this->contexts_[self_rank]->SendAsync(dst_rank, "Hello I am 0", "test");
      yacl::Buffer r = this->contexts_[self_rank]->Recv(dst_rank, "test");
      std::string r_str(r.data<const char>(), r.size());
      std::cout << self_rank << " Receive " << r_str << '\n';
    };
    std::vector<std::future<void>> tests;
    tests.push_back(std::async(test, 0));
    for (auto& f : tests) {
      f.get();
    }
  }

  ~FactoryTest() {
    // Wait for pending link tasks before the context is destroyed.
    auto wait = [&](int self_rank) {
      contexts_[self_rank]->WaitLinkTaskFinish();
    };
    std::vector<std::future<void>> waits;
    waits.push_back(std::async(wait, 0));
    for (auto& f : waits) {
      f.get();
    }
  }

  std::vector<std::shared_ptr<yacl::link::Context>> contexts_;
};

int main() {
  FactoryTest F;
  sleep(2);
  F.work();
  return 0;
}
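The code on the rank 1 machine differs mainly in the value of self_rank used above. Because the programs are started by hand, the two machines may start a few seconds apart; I start the program on machine 1 first, then the one on machine 0. The code above runs without problems. Machine 0 prints: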
0 Receive Hello I am 1
Machine 1 prints:
1 Receive Hello I am 0
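Since the two programs differ mainly in self_rank, a single source file could take the rank from the command line instead of hard-coding it. The following is only a minimal sketch of that idea: `ParseRank`, `PeerRank`, and `Greeting` are hypothetical helper names that do not exist in yacl, and the sketch deliberately leaves out the yacl calls themselves.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: parse the party rank from a command-line argument.
// Returns -1 if the argument is not exactly "0" or "1".
int ParseRank(const std::string& arg) {
  if (arg == "0") return 0;
  if (arg == "1") return 1;
  return -1;
}

// In a two-party setup the peer's rank is always the other one.
int PeerRank(int self_rank) {
  assert(self_rank == 0 || self_rank == 1);
  return 1 - self_rank;
}

// Builds the greeting sent in the test, e.g. "Hello I am 0".
std::string Greeting(int self_rank) {
  return "Hello I am " + std::to_string(self_rank);
}
```

In main, the result of ParseRank(argv[1]) would then replace the literal 0 passed to CreateContext(desc, self_rank) and to the three lambdas, so both machines can run the same binary with a different argument.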
However, if the sleep(2) statement is removed from the code, the following errors appear when I rerun the test. Machine 0 reports:
I0924 02:51:37.530009 1192314 /repository/brpc-1.6.0/src/brpc/server.cpp:1127] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=63927.
Connect to Bob successfully
I0924 02:51:56.632742 1192407 /repository/brpc-1.6.0/src/brpc/socket.cpp:2465] Checking Socket{id=0 addr=172.18.0.3:63921} (0x7fbacc067020)
terminate called after throwing an instance of 'yacl::IoError'
what(): [/repository/yacl/yacl/link/transport/channel.cc:351] Get data timeout, key=world_0:P2P-1:1->0
Stacktrace:
#0 yacl::link::transport::Channel::Recv()+0x4d68b8
Aborted (core dumped)
Machine 1 reports:
…
[2023-09-24 02:51:55.515] [info] [default_brpc_retry_policy.cc:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=172.18.0.2:63927} (0x0x7f8a34067000): Connection refused [R1][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R2][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R3][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R4][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R5][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R6][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R7][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R8][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R9][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R10][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R11][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R12][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R13][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R14][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R15][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R16][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R17][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R18][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R19][E112]Not connected to 172.18.0.2:63927 yet, server_id=0'
[2023-09-24 02:51:55.515] [info] [default_brpc_retry_policy.cc:75] aggressive retry, sleep=1000000us and retry
I0924 02:51:56.516082 769 /repository/brpc-1.6.0/src/brpc/socket.cpp:2465] Checking Socket{id=0 addr=172.18.0.2:63927} (0x7f8a34067000)
1 Receive Hello I am 0
I0924 02:51:56.516975 695 /repository/brpc-1.6.0/src/brpc/socket.cpp:2525] Revived Socket{id=0 addr=172.18.0.2:63927} (0x7f8a34067000) (Connectable)
[2023-09-24 02:51:56.522] [error] [channel.cc:98] SendImpl error [/repository/yacl/yacl/link/transport/brpc_link.cc:187] send, rpc failed=112, message=[E111]Fail to connect Socket{id=0 addr=172.18.0.2:63927}
(0x0x7f8a34067000): Connection refused [R1][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R2][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R3][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R4][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R5][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R6][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R7][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R8][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R9][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R10][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R11][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R12][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R13][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R14][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R15][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R16][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R17][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R18][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R19][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R20][E112]Not connected to 172.18.0.2:63927 yet, server_id=0
Stacktrace:
#0 yacl::link::transport::BrpcLink::SendRequest()+0x4cb5cf
#1 (unknown)+0x7f8a34002da0
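Some [info] sections are omitted above. Machine 1 did print "1 Receive Hello I am 0", but machine 0 apparently never received a message. I am certain that the program on machine 0 was started within 5 seconds after the program on machine 1.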
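One possible reading of the machine 1 log, not confirmed against the yacl or brpc source: the retries [R1] to [R20] (which match brpc_retry_count = 20) all fail with E112 because the socket to machine 0 is still marked as not connected, and that socket only becomes usable again once the health check prints "Revived Socket". The toy model below sketches that assumption with logical ticks in place of real time. `ToyLink` and `TrySend` are hypothetical names and none of this is yacl or brpc code.

```cpp
#include <cassert>

// Toy model, NOT yacl or brpc code: a link that stays marked as down until
// a periodic health check revives it at logical time `revive_at`.
struct ToyLink {
  int revive_at;
  bool Healthy(int now) const { return now >= revive_at; }
};

// Makes one send attempt at tick `start`, then up to `max_retries` retries,
// with `ticks_per_retry` ticks between consecutive attempts.
// Returns true if any attempt finds the link healthy.
bool TrySend(const ToyLink& link, int start, int max_retries,
             int ticks_per_retry) {
  assert(max_retries >= 0 && ticks_per_retry >= 0);
  for (int attempt = 0; attempt <= max_retries; ++attempt) {
    int now = start + attempt * ticks_per_retry;
    if (link.Healthy(now)) return true;
  }
  return false;
}
```

With a link revived at tick 10, TrySend(link, 0, 20, 0) fails because all retries happen before the revival, while TrySend(link, 10, 20, 0) succeeds. If the assumption holds, the second case is what sleep(2) achieves: it delays the first send until after the socket has been revived.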