
Comments (8)

JohnLangford avatar JohnLangford commented on May 28, 2024

This is an interesting issue that I haven't needed to worry about
previously.

What would an ideal fix look like? The overhead of acceptance can be
made much smaller, but this doesn't seem to be a fix, because many
thousands of workers could still overflow the OS's connect queue.
Perhaps a better fix would be making workers retry?

-John

On 04/13/2014 12:46 AM, Maysam Yabandeh wrote:

When running vw at scale, we observed cases where a vw worker cannot
connect to the spanning_tree server in the all_reduce_init function. The
problem seems to be that the spanning_tree server both accepts
connections and initializes them (the initial conversation between the
vw worker and the spanning_tree process) in a single thread. The
spanning_tree server is therefore not accepting new connections while it
is busy initializing other connections. This becomes an issue when many
thousands of workers try to connect to the spanning_tree server at the
same time.
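One way to keep the accepting thread free is to hand each accepted socket to a separate handshake thread through a queue. The following is a minimal sketch of such a hand-off queue, not code from spanning_tree; the class name FdQueue and its methods are hypothetical. It only shortens the time spent between accept() calls. The OS listen backlog still bounds how many connections can wait, so it does not rule out overflow on its own.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical hand-off queue (not vw code): the accept loop pushes each
// accepted socket descriptor and goes straight back to accept(); one or more
// handshake threads pop descriptors and do the slow per-connection setup.
class FdQueue {
public:
  // Called by the accepting thread.
  void push(int fd) {
    {
      std::lock_guard<std::mutex> lock(m_);
      q_.push_back(fd);
    }
    cv_.notify_one();
  }

  // Called by handshake threads. Blocks until a descriptor is available.
  // Returns false once close() has been called and the queue is drained.
  bool pop(int& fd) {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
    if (q_.empty()) return false;
    fd = q_.front();
    q_.pop_front();
    return true;
  }

  // Called once when the server stops accepting.
  void close() {
    {
      std::lock_guard<std::mutex> lock(m_);
      closed_ = true;
    }
    cv_.notify_all();
  }

private:
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<int> q_;
  bool closed_ = false;
};
```

In such a design the accept loop would do nothing but call accept() and push(), while each handshake thread loops on pop() and runs the initial conversation with the worker.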

The proper solution would be to fix the spanning_tree connection
acceptance to be more scalable. Alternatively, the vw worker could
retry if the connection fails.
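As a rough illustration of the retry alternative, here is a minimal sketch of a worker-side retry loop with a randomized, growing delay. It is not the patch discussed in this thread: the function name connect_with_retry, its parameters, and the doubling policy are all assumptions. The connection attempt is passed in as a callable so the sketch does not depend on any vw internals.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <random>
#include <thread>

// Hypothetical helper, not part of vw: call try_connect until it succeeds or
// max_attempts is reached. After each failure, sleep a random time in
// [0, cap] and double cap, up to max_delay.
bool connect_with_retry(const std::function<bool()>& try_connect,
                        int max_attempts,
                        std::chrono::milliseconds base_delay,
                        std::chrono::milliseconds max_delay,
                        unsigned seed)
{
  std::mt19937 rng(seed);  // seed per worker so workers do not retry in lockstep
  std::chrono::milliseconds cap = base_delay;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (try_connect()) return true;      // success: no delay was paid
    if (attempt == max_attempts) break;  // out of attempts, do not sleep again
    std::uniform_int_distribution<long long> jitter(0, cap.count());
    std::this_thread::sleep_for(std::chrono::milliseconds(jitter(rng)));
    cap = std::min<std::chrono::milliseconds>(cap * 2, max_delay);
  }
  return false;
}
```

A worker that connects on the first attempt never sleeps, so only workers that actually hit a failure pay any latency.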

As a workaround, I put a random delay at the start of all_reduce_init
to spread the connection-establishment load evenly over time.
I tried a couple of delays, and the one below worked for me:

  srand(node);
  int range = total / 50 + 1; // e.g 1500 nodes -> 300s
  int stime = (rand() % range) + 1;
  cerr << "sleep for " << stime << " out of " << range << endl;
  sleep(stime);
  cerr << "endsleep " << endl;

The delay is relatively high, but it is unavoidable until we have a
proper fix for the spanning_tree scalability issue.
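To check the arithmetic of the delay formula, it can be isolated into small functions. This is a sketch only: the names delay_range and startup_delay_seconds and the divisor parameter are hypothetical. With integer division and a divisor of 50 as pasted, 1500 nodes give a range of 31, so the delay falls between 1 and 31 seconds. A range of about 300 seconds for 1500 nodes would correspond to a divisor of 5, so the divisor used in the actual runs may have differed from the pasted one; the thread does not say.

```cpp
#include <cstdlib>

// Hypothetical helpers wrapping the pasted snippet (not vw code).

// Upper bound of the delay in seconds; integer division, as in the snippet.
int delay_range(int total, int divisor)
{
  return total / divisor + 1;
}

// Delay in seconds in [1, delay_range(total, divisor)], seeded by the node
// id so that different nodes tend to draw different delays.
int startup_delay_seconds(unsigned node, int total, int divisor)
{
  std::srand(node);
  return std::rand() % delay_range(total, divisor) + 1;
}
```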


Reply to this email directly or view it on GitHub
#279.

from vowpal_wabbit.

maysamyabandeh avatar maysamyabandeh commented on May 28, 2024

Agreed. Retry seems like a more stable solution.


JohnLangford avatar JohnLangford commented on May 28, 2024

I added a retry, but testing the code properly is difficult. Can you
test it?

-John

On 04/14/2014 02:14 PM, Maysam Yabandeh wrote:

Agreed. Retry seems like a more stable solution.



maysamyabandeh avatar maysamyabandeh commented on May 28, 2024

Sure, but I need time to get resources for a large-scale run.


maysamyabandeh avatar maysamyabandeh commented on May 28, 2024

Ran a test with a moderate size of 7000 nodes. The test passes with my random delay patch but still fails with your recent patch for retrying. The logged error is:

read 1 failed!
mapper already connected
terminate called after throwing an instance of 'std::exception'
  what():  std::exception

The error seems to occur at the recv invocation:

  if (recv(master_sock, (char*)&ok, sizeof(ok), 0) < (int)sizeof(ok))
    cerr << "read 1 failed!" << endl;
  if (!ok) {
    cerr << "mapper already connected" << endl;
    throw exception();
  }


JohnLangford avatar JohnLangford commented on May 28, 2024

I'm very unclear on what's going wrong, because there is apparently a valid connection. Maybe the error message from recv is informative? It's easy to imagine that debugging on your end will be easiest at this point.

I added a 'sleep(1)' delay to the error case in all_reduce_init(). Does that help?


maysamyabandeh avatar maysamyabandeh commented on May 28, 2024

Well, there are two debugging directions at this point:

  1. Finding the root cause
  2. Finding a workaround

I have been struggling with the issue for a while already, and the random sleep solution that I pasted was the one that eventually worked for me. So as far as sleep-based solutions go, I would say that the random delay is already fine. As for the root cause analysis, however (finding exactly which line has a problem with a high volume of connections), the debugging on my side was not much help.

If it is not possible for you to run it with many instances and reproduce the problem, we can close the issue.


JohnLangford avatar JohnLangford commented on May 28, 2024

I'd like to avoid introducing a large latency on connection by default.
If there is some approach which does not cause the typical case to
suffer, that would be good.

-John

On 04/14/2014 07:31 PM, Maysam Yabandeh wrote:

Well, there are two debugging directions at this point:

  1. Finding the root cause
  2. Finding a workaround

I have been struggling with the issue for a while already, and the
random sleep solution that I pasted was the one that eventually worked
for me. So as far as sleep-based solutions go, I would say that the
random delay is already fine. As for the root cause analysis, however
(finding exactly which line has a problem with a high volume of
connections), the debugging on my side was not much help.

If it is not possible for you to run it with many instances and
reproduce the problem, we can close the issue.


