Comments (3)

baoleai commented on May 5, 2024

The INFO log shows that the IP address needed for graph-learn is empty:
I20200405 11:27:22.089519 21023 naming_engine.cc:100] Update endpoint id: 0, address: , filepath: /tmp/graphlearn/endpoints/0.
This indicates that GetLocalEndpoint returns an empty string (""). Check whether this function returns the correct address in your environment.
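
As a rough way to run that check outside graph-learn, the sketch below asks the operating system which local IPv4 address it would use. It is not the implementation of GetLocalEndpoint; `guess_local_ip` and its probe address are assumptions made for illustration. If it also returns an empty string or a loopback address, the problem is likely in the machine's hostname or network configuration rather than in graph-learn itself.

```python
import socket


def guess_local_ip(probe_host="8.8.8.8", probe_port=80):
    """Return a best-effort local IPv4 address, or "" if none can be found."""
    # Connecting a UDP socket sends no packets. It only asks the OS which
    # local interface it would route through to reach probe_host.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect((probe_host, probe_port))
            return sock.getsockname()[0]
    except OSError:
        pass
    # Fall back to resolving the machine's own hostname.
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return ""


if __name__ == "__main__":
    ip = guess_local_ip()
    print("local IP:", repr(ip))
    if not ip or ip.startswith("127."):
        print("warning: no routable address found; check hostname and interface settings")
```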

from graph-learn.

YukeWang96 commented on May 5, 2024

@baoleai When I try to set this up in a distributed setting, for example on two machines with different IPs, it always prints:

I0720 12:13:39.899207 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:40.899502 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:41.899794 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:42.900085 17530 naming_engine.cc:159] Refresh endpoints count: 2
2020-07-20 12:13:43.369141: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-07-20 12:13:43.369198: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1

I have checked that my task_idx values match the ps_hosts and worker_hosts entries, and I have turned off the firewall on both computers.
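
One more thing that can be checked in this situation is whether every address in ps_hosts and worker_hosts actually accepts TCP connections from the other machine. The sketch below is a generic connectivity probe, not part of graph-learn or TensorFlow. It assumes the host lists are comma-separated "host:port" strings, and `parse_hosts`, `is_reachable`, and `check_cluster` are hypothetical helper names.

```python
import socket


def parse_hosts(hosts):
    """Split a comma-separated "host:port" string into (host, port) pairs."""
    pairs = []
    for item in hosts.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_cluster(ps_hosts, worker_hosts, timeout=3.0):
    """Map each "host:port" entry to whether it accepted a TCP connection."""
    result = {}
    for host, port in parse_hosts(ps_hosts) + parse_hosts(worker_hosts):
        result["%s:%d" % (host, port)] = is_reachable(host, port, timeout)
    return result
```

Run it on each machine after the ps and worker processes have started, for example `check_cluster("10.0.0.1:2222", "10.0.0.2:2223")`, where the addresses are placeholders. An entry that maps to False from one machine but True from the other points to a routing, binding, or firewall issue rather than a task_idx mismatch.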

I have also tried running two ps and two workers on the same physical machine with different ports. It runs, but returns messages like:

Epoch 39, Iteration 0, Time(s) 0.0530, Loss 0.88920
Epoch 39, Iteration 1, Time(s) 0.0529, Loss 0.59737
Epoch 39, Iteration 2, Time(s) 0.0523, Loss 0.77411
Epoch 39, Iteration 3, Time(s) 0.0497, Loss 0.79809
Epoch 39, Iteration 4, Time(s) 0.0540, Loss 0.45329
Epoch 39, Iteration 5, Time(s) 0.0521, Loss 0.98397
Epoch 39, Iteration 6, Time(s) 0.0504, Loss 0.71765
E0720 11:56:27.316502 13822 notification.cc:194] RpcNotification:Failed	req_type:GetNodes	status:Out of range:No more nodes exist.
E0720 11:56:27.316629 13822 distribute_runner.h:125] Rpc failed:Out of range:No more nodes exist.name:GetNodes

Could you please help me to figure it out?
Thanks
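
One possible reading, not confirmed anywhere in this thread, is that "Out of range: No more nodes exist" marks the end of a pass over the node set rather than a fault, since it is logged right after iteration 6 of epoch 39. If that is the case, the training loop would catch the signal and move on to the next epoch. The sketch below illustrates the pattern with a toy sampler; `OutOfRange`, `OneEpochSampler`, and `run_epochs` are hypothetical names, not graph-learn APIs.

```python
class OutOfRange(Exception):
    """Hypothetical stand-in for a sampler's "no more nodes" signal."""


class OneEpochSampler:
    """Toy sampler that returns fixed-size batches, then raises OutOfRange."""

    def __init__(self, node_ids, batch_size):
        self._node_ids = list(node_ids)
        self._batch_size = batch_size
        self._cursor = 0

    def get(self):
        if self._cursor >= len(self._node_ids):
            self._cursor = 0  # rewind so the next epoch starts from the beginning
            raise OutOfRange("No more nodes exist.")
        batch = self._node_ids[self._cursor:self._cursor + self._batch_size]
        self._cursor += self._batch_size
        return batch


def run_epochs(sampler, epochs):
    """Count iterations per epoch, treating OutOfRange as an epoch boundary."""
    iterations = []
    for _ in range(epochs):
        count = 0
        while True:
            try:
                sampler.get()
            except OutOfRange:
                break  # the epoch is exhausted; this is expected, not an error
            count += 1
        iterations.append(count)
    return iterations
```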

zhxchnl commented on May 5, 2024

I have the same problem. Do you have a solution?
