Comments (8)
This is an interesting issue that I haven't needed to worry about
previously.
What would an ideal fix look like? The overhead of acceptance can be
made much smaller, but this doesn't seem to be a fix, because many
thousands of workers could still overflow the OS's connect queue.
Perhaps a better fix would be making workers retry?
-John
On 04/13/2014 12:46 AM, Maysam Yabandeh wrote:
When running vw at scale we observed cases where a vw worker cannot
connect to the spanning_tree server in the all_reduce_init function. The
problem seems to be that the spanning_tree server both accepts
connections and performs connection initialization (the initial
conversation between the vw worker and the spanning tree process) in a
single thread. The spanning_tree server is therefore not accepting new
connections while it is busy initializing other connections.
This becomes an issue when many thousands of workers try to
connect to the spanning_tree at the same time.

The proper solution would be to make spanning_tree connection
acceptance more scalable. Alternatively, the vw worker could
retry if the connection fails.

As a workaround, I put a random delay at the start of all_reduce_init
to spread the connection establishment load evenly over time.
I tried a couple of delays and the following worked for me:

srand(node);
int range = total / 50 + 1; // e.g. 1500 nodes -> 300s
int stime = (rand() % range) + 1;
cerr << "sleep for " << stime << " out of " << range << endl;
sleep(stime);
cerr << "endsleep " << endl;

The delay is relatively high, but it is unavoidable until we have a
proper fix for the spanning_tree scalability issue.
from vowpal_wabbit.
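For anyone trying the workaround, here is a minimal sketch of the same per-node delay written with `<random>` instead of `srand`/`rand`. The helper names `delay_range_seconds` and `startup_delay_seconds` are hypothetical and not part of vw; the divisor of 50 is taken from the posted snippet. Note that with that divisor, integer division gives a window of 31 seconds for 1500 nodes and 141 seconds for 7000 nodes; a window of roughly 300 seconds for 1500 nodes would correspond to a divisor of 5.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper (not part of vw): size in seconds of the startup
// delay window for a job with `total` workers. The default divisor of 50
// mirrors the `total / 50 + 1` expression in the posted snippet.
inline int delay_range_seconds(int total, int divisor = 50) {
  return total / divisor + 1;  // integer division, always at least 1
}

// Hypothetical helper: picks a delay in [1, range] seconds. Seeding with the
// node id gives every worker its own repeatable value.
inline int startup_delay_seconds(std::uint32_t node, int total) {
  std::mt19937 gen(node);
  std::uniform_int_distribution<int> dist(1, delay_range_seconds(total));
  return dist(gen);
}
```

A worker would call `startup_delay_seconds(node, total)` once and sleep for that many seconds before connecting. Every worker still pays the delay, even when the server is idle.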
Agreed. Retry seems like a more stable solution.
from vowpal_wabbit.
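A retry can be sketched roughly as follows. This is not the patch discussed in this thread, only an illustration under stated assumptions: `connect_with_retry` is a hypothetical helper, and `try_connect` stands in for whatever code opens the socket to spanning_tree and returns whether that worked. The first attempt runs immediately, so a lightly loaded cluster sees no added latency; a randomized, doubling wait is applied only after a failure.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <random>
#include <thread>

// Hypothetical helper (not part of vw). Calls `try_connect` until it returns
// true or `max_attempts` attempts have been made. Delays are assumed to be
// non-negative. Returns the number of attempts used on success, 0 on failure.
inline int connect_with_retry(const std::function<bool()>& try_connect,
                              int max_attempts,
                              std::chrono::milliseconds base_delay,
                              std::chrono::milliseconds max_delay,
                              unsigned seed) {
  std::mt19937 gen(seed);  // seed per worker so workers do not retry in step
  std::chrono::milliseconds cap = base_delay;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (try_connect()) return attempt;   // no wait at all on the happy path
    if (attempt == max_attempts) break;  // do not sleep after the last try
    // Sleep a random time in [0, cap], then double the cap up to max_delay.
    std::uniform_int_distribution<long long> dist(0, cap.count());
    std::this_thread::sleep_for(std::chrono::milliseconds(dist(gen)));
    cap = std::min(cap * 2, max_delay);
  }
  return 0;
}
```

Seeding with the node id keeps the retry times of different workers from lining up, which is the same effect the random startup delay aims for, but paid only by workers whose connection actually failed.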
I added a retry, but testing the code properly is difficult. Can you
test it?
-John
from vowpal_wabbit.
Sure, but I need time to get resources for a large-scale run.
from vowpal_wabbit.
Ran a test with a moderate size of 7000 nodes. The test passes with my random delay patch but still fails with your recent patch for retrying. The logged error is:
read 1 failed!
mapper already connected
terminate called after throwing an instance of 'std::exception'
what(): std::exception
The error seems to occur at the recv invocation:
if (recv(master_sock, (char*)&ok, sizeof(ok), 0) < (int)sizeof(ok))
  cerr << "read 1 failed!" << endl;
if (!ok) {
  cerr << "mapper already connected" << endl;
  throw exception();
}
from vowpal_wabbit.
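One thing visible in the quoted snippet is that execution continues to the `ok` check even after the read fails, so "mapper already connected" may be printed without the server having sent that answer. That would be consistent with both messages appearing in the log. A minimal sketch of a read that separates the cases and reports the reason is below. It assumes POSIX sockets, and `recv_exact` is a hypothetical helper, not a vw function.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper (POSIX sockets assumed; not part of vw).
// Reads exactly `len` bytes from `sock` into `buf`, looping over short reads.
// Returns an empty string on success, otherwise a description of the failure.
inline std::string recv_exact(int sock, void* buf, std::size_t len) {
  char* p = static_cast<char*>(buf);
  std::size_t got = 0;
  while (got < len) {
    ssize_t n = recv(sock, p + got, len - got, 0);
    if (n > 0) {
      got += static_cast<std::size_t>(n);
      continue;
    }
    if (n == 0) {  // orderly shutdown by the peer before all bytes arrived
      return "peer closed the connection after " + std::to_string(got) +
             " of " + std::to_string(len) + " bytes";
    }
    if (errno == EINTR) continue;  // interrupted by a signal, try again
    return std::string("recv failed: ") + std::strerror(errno);
  }
  return std::string();
}
```

With this, the caller can log the returned message and treat a failed read as a connection error (for example, a candidate for retry) instead of falling through to the `ok` test.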
I'm very unclear on what's going wrong, because there is apparently a valid connection. Maybe the error message from recv is informative? It's easy to imagine that debugging on your end will be easiest at this point.
I added a 'sleep(1)' delay to the error case in all_reduce_init(). Does that help?
from vowpal_wabbit.
Well, there are two debugging directions at this point:
- Finding the root cause
- Finding a workaround
I have been struggling with the issue for a while already, and the random sleep solution that I pasted was the one that eventually worked for me. So as far as sleep-based solutions go, I would say that the random delay is already fine. As for the root cause analysis (finding exactly which line has a problem with a high volume of connections), the debugging on my side was not much help.
If it is not possible for you to run it with many instances and reproduce the problem, we can close the issue.
from vowpal_wabbit.
I'd like to avoid introducing a large latency on connection by default.
If there is some approach which does not cause the typical case to
suffer, that would be good.
-John
from vowpal_wabbit.