Comments (3)

baoleai commented on May 5, 2024

This may be caused by the old tracker directory not being cleaned up. I fixed this in #5, so please try again. Checking python2.7.log should also help pinpoint the specific cause.

from graph-learn.

skyssj commented on May 5, 2024

I applied the patch, but the problem persists. It looks like the --tracker directory cannot simply be removed; it has to be recreated, as with mkdir -p.

WARNING: Logging before InitGoogleLogging() is written to STDERR
E0402 16:29:28.791065  3999 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.791113  3999 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
############# Server init start #############
E0402 16:29:28.791867  3996 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.792384  3996 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
    @     0x7f08d89f619a  google::LogMessage::Fail()
    @     0x7f08d89f60de  google::LogMessage::SendToLog()
    @     0x7f08d89f59fc  google::LogMessage::Flush()
    @     0x7f08d89f9549  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f08d89cfeaf  graphlearn::NamingEngine::NamingEngine()
    @     0x7f08d89d02c4  graphlearn::NamingEngine::GetInstance()
    @     0x7f08d89d3917  graphlearn::DistributeService::DistributeService()
    @     0x7f08d89b4a55  graphlearn::ServerImpl::RegisterDistributeService()
    @     0x7f08d89b4b5e  graphlearn::ServerImpl::Start()
    @     0x7f08d8ed3075  _ZZN8pybind1112cpp_function10initializeIZNS0_C4IvN10graphlearn6ServerEIEINS_4nameENS_9is_methodENS_7siblingEEEEMT0_FT_DpT1_EDpRKT2_EUlPS4_E_vISI_EIS5_S6_S7_EEEvOS9_PFS8_SB_ESH_ENUlRNS_6detail13function_callEE1_4_FUNESP_
    @     0x7f08d8ece039  pybind11::cpp_function::dispatcher()
    @     0x7f08e04e0577  PyEval_EvalFrameEx
    @     0x7f08e04e2a99  PyEval_EvalCodeEx
    @     0x7f08e04dff68  PyEval_EvalFrameEx
    @     0x7f08e04e2a99  PyEval_EvalCodeEx
    @     0x7f08e04dff68  PyEval_EvalFrameEx
    @     0x7f08e04e2a99  PyEval_EvalCodeEx
main
    @     0x7f08e04dff68  PyEval_EvalFrameEx
    @     0x7f08e04e2a99  PyEval_EvalCodeEx
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0402 16:29:28.841421  3997 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.841472  3997 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
    @     0x7f08e04e2cba  PyEval_EvalCode
    @     0x7f08e04fc01d  run_mod
    @     0x7f08e04fd1c8  PyRun_FileExFlags
    @     0x7f08e04fe3e8  PyRun_SimpleFileExFlags
    @     0x7f08e051067c  Py_Main
    @     0x7f08df733c05  __libc_start_main
    @           0x40071e  (unknown)
main
############# Server init start #############
E0402 16:29:28.859108  3998 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.859401  3998 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
    @     0x7f368b96619a  google::LogMessage::Fail()
    @     0x7f368b9660de  google::LogMessage::SendToLog()
./run.sh: line 34:  3996 Aborted                 python dist_train.py --tracker=./distributed --ps_hosts=${PS_HOSTS} --worker_hosts=${WK_HOSTS} --job_name=ps --task_index=0
./run.sh: line 34:  3997 Aborted                 python dist_train.py --tracker=./distributed --ps_hosts=${PS_HOSTS} --worker_hosts=${WK_HOSTS} --job_name=worker --task_index=0
./run.sh: line 34:  3999 Aborted                 python dist_train.py --tracker=./distributed --ps_hosts=${PS_HOSTS} --worker_hosts=${WK_HOSTS} --job_name=worker --task_index=1
    @     0x7f368b9659fc  google::LogMessage::Flush()
    @     0x7f368b969549  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f368b93feaf  graphlearn::NamingEngine::NamingEngine()
    @     0x7f368b9402c4  graphlearn::NamingEngine::GetInstance()
    @     0x7f368b943917  graphlearn::DistributeService::DistributeService()
    @     0x7f368b924a55  graphlearn::ServerImpl::RegisterDistributeService()
    @     0x7f368b924b5e  graphlearn::ServerImpl::Start()
    @     0x7f368be43075  _ZZN8pybind1112cpp_function10initializeIZNS0_C4IvN10graphlearn6ServerEIEINS_4nameENS_9is_methodENS_7siblingEEEEMT0_FT_DpT1_EDpRKT2_EUlPS4_E_vISI_EIS5_S6_S7_EEEvOS9_PFS8_SB_ESH_ENUlRNS_6detail13function_callEE1_4_FUNESP_
    @     0x7f368be3e039  pybind11::cpp_function::dispatcher()
    @     0x7f3693450577  PyEval_EvalFrameEx
    @     0x7f3693452a99  PyEval_EvalCodeEx
    @     0x7f369344ff68  PyEval_EvalFrameEx
    @     0x7f3693452a99  PyEval_EvalCodeEx
    @     0x7f369344ff68  PyEval_EvalFrameEx
    @     0x7f3693452a99  PyEval_EvalCodeEx
    @     0x7f369344ff68  PyEval_EvalFrameEx
    @     0x7f3693452a99  PyEval_EvalCodeEx
    @     0x7f3693452cba  PyEval_EvalCode
    @     0x7f369346c01d  run_mod
    @     0x7f369346d1c8  PyRun_FileExFlags
    @     0x7f369346e3e8  PyRun_SimpleFileExFlags
    @     0x7f369348067c  Py_Main
    @     0x7f36926a3c05  __libc_start_main
    @           0x40071e  (unknown)
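
The "mkdir -p like creation" mentioned above can be sketched in Python as a step run before the processes start. This is only an illustration of the idea, not graph-learn's own code: the helper name ensure_tracker_dir is hypothetical, and it assumes the "Create local directory failed" error comes from the tracker root being absent. It avoids os.makedirs(..., exist_ok=True) so that it also runs under Python 2.7, which the thread appears to use.

```python
import errno
import os


def ensure_tracker_dir(tracker):
    """Create `tracker` and any missing parent directories, like `mkdir -p`.

    Hypothetical helper for illustration; not part of graph-learn.
    """
    try:
        os.makedirs(tracker)
    except OSError as err:
        # An already existing directory is fine; anything else is a real error.
        if err.errno != errno.EEXIST or not os.path.isdir(tracker):
            raise
    return os.path.abspath(tracker)
```

Calling ensure_tracker_dir("./distributed") before launching would leave an empty tracker root in place even after it has been deleted.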

BTW, I cleaned up the --tracker directory manually and got some interesting logs. Is this caused by using a local filesystem instead of NFS?

graphlearn.xxxx.INFO.20200402-163642.5092

...

W0402 16:36:42.347764  5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:42.348039  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:42.348053  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
I0402 16:36:42.348098  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:43.348107  5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:43.348150  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:43.348160  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
I0402 16:36:43.348214  5182 naming_engine.cc:159] Refresh endpoints count: 0
I0402 16:36:43.348330  5092 naming_engine.cc:100] Update endpoint id: 0, address: , filepath: ./distributed/endpoints/0
I0402 16:36:43.348413  5092 coordinator.cc:190] Coordinator sink start/
I0402 16:36:43.348430  5092 coordinator.cc:216] Sink ./distributed/start/0OK
I0402 16:36:44.348294  5181 coordinator.cc:190] Coordinator sink 
I0402 16:36:44.348353  5181 coordinator.cc:216] Sink ./distributed/startedOK
I0402 16:36:44.348357  5181 coordinator.cc:106] Master sync started.
W0402 16:36:44.348363  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:44.348378  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:44.348367  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:44.348412  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:44.348418  5182 naming_engine.cc:159] Refresh endpoints count: 0
I0402 16:36:44.363948  5312 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.363970  5319 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.364365  5315 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.364580  5320 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.364948  5313 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.364956  5314 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.366432  5316 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.366991  5318 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.367408  5317 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.369257  5328 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.369416  5322 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
W0402 16:36:44.369658  5332 channel_manager.cc:100] Waiting for all servers started: 0/2
I0402 16:36:44.369700  5330 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371842  5321 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371852  5324 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371871  5335 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371891  5323 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371901  5325 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371906  5326 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371908  5340 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371932  5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371938  5335 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371976  5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372058  5341 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372062  5338 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372076  5334 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372140  5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372221  5338 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372298  5342 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372383  5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372404  5327 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.372439  5342 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372475  5342 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
W0402 16:36:45.348466  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:45.348598  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:45.348598  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:45.350410  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:45.350420  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:46.350499  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:46.350539  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:46.350548  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:46.350564  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:46.350572  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:47.350625  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:47.350672  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:47.350684  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:47.350706  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:47.350713  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:48.350780  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:48.350838  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:48.350852  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:48.350885  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:48.350893  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:49.351042  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:49.351099  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:49.351107  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:49.351315  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:49.351348  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:50.351255  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:50.351312  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:50.352686  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:50.351442  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:50.352722  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:51.352807  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:51.352880  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:51.352929  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:51.352946  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:51.352957  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:52.352988  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:52.353049  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed

...

graphlearn.xxxx.WARNING.20200402-163642.5092

W0402 16:36:42.347764  5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed    
W0402 16:36:42.348039  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:42.348053  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:43.348107  5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:43.348150  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:43.348160  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:44.348363  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:44.348378  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:44.348367  5182 naming_engine.cc:154] Invalid endpoint file: 0                                                   
W0402 16:36:44.348412  5182 naming_engine.cc:154] Invalid endpoint file: 1                                             
W0402 16:36:44.369658  5332 channel_manager.cc:100] Waiting for all servers started: 0/2                                     
W0402 16:36:45.348466  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:45.348598  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:45.348598  5182 naming_engine.cc:154] Invalid endpoint file: 0                                             
W0402 16:36:45.350410  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:46.350499  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:46.350539  5182 naming_engine.cc:154] Invalid endpoint file: 0                                                   
W0402 16:36:46.350548  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:46.350564  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:47.350625  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:47.350672  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:47.350684  5182 naming_engine.cc:154] Invalid endpoint file: 0                                             
W0402 16:36:47.350706  5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:48.350780  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:48.350838  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:48.350852  5182 naming_engine.cc:154] Invalid endpoint file: 0                                             
W0402 16:36:48.350885  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:49.351042  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:49.351099  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:49.351315  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:49.351348  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:50.351255  5182 naming_engine.cc:154] Invalid endpoint file: 0                                             
W0402 16:36:50.351312  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:50.351442  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:50.352722  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed                     
W0402 16:36:51.352807  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:51.352880  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed                     
W0402 16:36:51.352929  5182 naming_engine.cc:154] Invalid endpoint file: 0                                                                  
W0402 16:36:51.352946  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                                  
W0402 16:36:52.352988  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:52.353049  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:52.353111  5182 naming_engine.cc:154] Invalid endpoint file: 0                                           

baoleai commented on May 5, 2024

The log shows that the GL tracker directory still has not been cleaned up. In your case, run rm -rf ./distributed/* to clean up the tracker. You may also need to add sleep 1 after python dist_train.py --index 0, so that TF and GL start and exit in the correct order.
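
The two suggestions above (empty the tracker, then stagger the launches) can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's run.sh: the names clear_tracker and launch_staggered are hypothetical, the 1-second default simply mirrors the suggested sleep 1, and unlike rm -rf ./distributed/* the clean-up here also removes hidden entries.

```python
import os
import shutil
import subprocess
import time


def clear_tracker(tracker):
    """Delete everything inside `tracker` but keep the directory itself."""
    if not os.path.isdir(tracker):
        # Nothing to clean; create the directory so later steps can use it.
        os.makedirs(tracker)
        return
    for name in os.listdir(tracker):
        path = os.path.join(tracker, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)


def launch_staggered(commands, delay=1.0):
    """Start each command in order, pausing `delay` seconds between launches."""
    procs = []
    for i, cmd in enumerate(commands):
        if i > 0:
            time.sleep(delay)
        procs.append(subprocess.Popen(cmd))
    return procs
```

A caller would run clear_tracker("./distributed") once, then pass the dist_train.py command lines (as argument lists) to launch_staggered and wait on the returned processes.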
