Comments (3)
The INFO log shows that the IP address graph-learn needs is empty:
I20200405 11:27:22.089519 21023 naming_engine.cc:100] Update endpoint id: 0, address: , filepath: /tmp/graphlearn/endpoints/0
This indicates that GetLocalEndpoint returns "". You can check whether this function returns the correct result in your environment.
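As a rough environment check for the empty address above, the sketch below asks the OS which IPv4 address the machine's hostname resolves to and whether that address would be usable as an endpoint. It does not call GetLocalEndpoint and makes no claim about how graph-learn resolves the address internally; the helper names `hostname_ip` and `is_usable_endpoint_ip` are made up for illustration.

```python
import ipaddress
import socket


def hostname_ip():
    """Return the IPv4 address the local hostname resolves to, or "" on failure."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        # socket.gaierror is a subclass of OSError, so failed lookups land here.
        return ""


def is_usable_endpoint_ip(ip):
    """True if ip parses as an address that other machines could connect to."""
    if not ip:
        return False
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    # Loopback (127.x.x.x) and unspecified (0.0.0.0) addresses are not reachable remotely.
    return not (addr.is_loopback or addr.is_unspecified)


if __name__ == "__main__":
    ip = hostname_ip()
    print("hostname resolves to:", repr(ip))
    print("usable as endpoint address:", is_usable_endpoint_ip(ip))
```

If this prints an empty string or a loopback address, the hostname mapping on that machine (for example in /etc/hosts) is worth checking before looking further into graph-learn itself.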
from graph-learn.
@baoleai When I try to set this up in a distributed setting (for example, two machines with different IPs), it always prints:
I0720 12:13:39.899207 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:40.899502 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:41.899794 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:42.900085 17530 naming_engine.cc:159] Refresh endpoints count: 2
2020-07-20 12:13:43.369141: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-07-20 12:13:43.369198: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
I have checked that my task_idx values match ps_hosts and worker_hosts, and I have also turned off the firewall on both computers.
I have also tried running two ps and two workers on the same physical machine with different ports. It runs, but prints messages like:
Epoch 39, Iteration 0, Time(s) 0.0530, Loss 0.88920
Epoch 39, Iteration 1, Time(s) 0.0529, Loss 0.59737
Epoch 39, Iteration 2, Time(s) 0.0523, Loss 0.77411
Epoch 39, Iteration 3, Time(s) 0.0497, Loss 0.79809
Epoch 39, Iteration 4, Time(s) 0.0540, Loss 0.45329
Epoch 39, Iteration 5, Time(s) 0.0521, Loss 0.98397
Epoch 39, Iteration 6, Time(s) 0.0504, Loss 0.71765
E0720 11:56:27.316502 13822 notification.cc:194] RpcNotification:Failed req_type:GetNodes status:Out of range:No more nodes exist.
E0720 11:56:27.316629 13822 distribute_runner.h:125] Rpc failed:Out of range:No more nodes exist.name:GetNodes
Could you please help me figure this out?
Thanks
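One way to narrow down the "CreateSession still waiting for response from worker" messages above is to test, from each machine, whether every ps and worker address accepts a TCP connection while the processes are running. The sketch below is only a connectivity probe under the assumption that ps_hosts and worker_hosts are comma-separated `host:port` strings; `parse_hosts`, `is_reachable`, and `check_cluster` are hypothetical helpers, not part of graph-learn or TensorFlow.

```python
import socket


def parse_hosts(hosts):
    """Split "host:port,host:port" into a list of (host, port) tuples."""
    result = []
    for item in hosts.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError("expected host:port, got %r" % item)
        result.append((host, int(port)))
    return result


def is_reachable(host, port, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_cluster(ps_hosts, worker_hosts, timeout=3.0):
    """Map each task name (e.g. /job:ps/task:0) to whether its address is reachable."""
    report = {}
    for job, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for idx, (host, port) in enumerate(parse_hosts(hosts)):
            report["/job:%s/task:%d" % (job, idx)] = is_reachable(host, port, timeout)
    return report
```

Calling `check_cluster` with the same ps_hosts and worker_hosts strings passed to the training script, once from each machine, shows which task addresses cannot be reached from that side. An unreachable entry points to a network, binding, or address problem rather than to the task_idx configuration.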
I have the same problem. Do you have a solution?