
Comments (21)

Eric-Zhang1990 commented on May 26, 2024

@Tomcli I ran the example pytorch-launch-dist on two servers with 3 learners (but there are only 2 processes): 1 learner is on one server and the other 2 learners are on the second server. The log only shows "node_rank=0" and "node_rank=1"; there is no "node_rank=2", which is the same as my issue above.
I wonder whether there is simply no communication between the two servers?
[screenshots]

Tomcli commented on May 26, 2024

Hi @Eric-Zhang1990, it could be that something timed out while initializing your process group due to low bandwidth. Which protocol did you use to initialize your process group: TCP, GLOO, or NCCL?
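
For reference, the backend is just the string passed to torch.distributed.init_process_group, and the rendezvous timeout can be raised if a slow link is the issue. A minimal sketch assuming the env:// init method (DIST_BACKEND here is only an illustrative variable, not something FfDL sets):

```python
import os
from datetime import timedelta

import torch.distributed as dist

# torch.distributed.launch exports MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
# for every worker, so init_method="env://" can pick them up automatically.
backend = os.environ.get("DIST_BACKEND", "nccl")  # illustrative; "nccl" or "gloo"

dist.init_process_group(
    backend=backend,
    init_method="env://",
    timeout=timedelta(minutes=60),  # default is 30 minutes; raise it if init times out
)
print("rank %d/%d initialized with %s" % (dist.get_rank(), dist.get_world_size(), backend))
```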

Eric-Zhang1990 commented on May 26, 2024

@Tomcli I have tried both NCCL and GLOO; both give the same result as above.
I used the tool "iperf" to test the bandwidth between the two servers; it reports about 94.0 Mbits/sec.
[screenshots]
Could this bandwidth be causing the problem above? Thank you.

Eric-Zhang1990 commented on May 26, 2024

@Tomcli We now use 1000 Mbit/s bandwidth, and it improves the training speed.
[screenshot]
May I ask which bandwidth you use? Thank you.

Eric-Zhang1990 commented on May 26, 2024

@Tomcli Now with 1000 Mbit/s bandwidth, I ran a test comparing the speed of FfDL against the plain system env (using conda) on the same server. With the same settings, training with the system env is faster than training through FfDL. I don't know why; do you? Thank you.
Using FfDL:
FfDL---1-server-2-learners-each-1-gpu
Using system env (using conda):
conda---1-server-2-gpus

animeshsingh commented on May 26, 2024

There is a bandwidth cost to be paid if you are spreading across two servers. What would be interesting to see is whether, with both GPUs on the same server, it is faster than the system env or not. I would also say the training has to be distributed across enough GPUs to offset the bandwidth cost and the extra communication overhead.

Eric-Zhang1990 commented on May 26, 2024

@animeshsingh I have done some tests comparing the speed of FfDL against the plain system env (using conda) on the same server; the following is the comparison:
2-GPU comparison:
(a). Using system env, 2 gpus:
conda---1-server-2-gpus
(b). Using FfDL, 2 learners, each learner has 1 gpu:
FfDL---1-server-2-learners-each-1-gpu
(c). Using FfDL, 1 learner, this learner has 2 gpus:
FfDL---1-server-1-learner-2-gpus
RESULT (> means faster): speed of (a) > speed of (b) > speed of (c).

4-GPU comparison:
(d). Using system env, 4 gpus:
conda---1-server-4-gpus
(e). Using FfDL, 4 learners, each learner has 1 gpu:
FfDL---1-server-4-learners-each-1-gpu
(f). Using FfDL, 1 learner, this learner has 4 gpus:
FfDL---1-server-1-learner-4-gpus
RESULT (> means faster): speed of (d) > speed of (e) > speed of (f).

(All of these are on the same server.) Also, with the system env, 4 GPUs are faster than 2 GPUs, which seems normal. However, with FfDL, 2 learners with 1 GPU each ((b) above) are faster than 1 learner with 2 GPUs ((c) above), and for 4 GPUs, 4 learners with 1 GPU each ((e) above) are faster than 1 learner with 4 GPUs ((f) above).
But 2 learners with 1 GPU each ((b) above) are also faster than 4 learners with 1 GPU each ((e) above), which does not seem normal.

Can you help me explain the results above? Thank you.

Eric-Zhang1990 commented on May 26, 2024

@Tomcli @animeshsingh I am running the maskrcnn-benchmark project (https://github.com/facebookresearch/maskrcnn-benchmark) on 3 nodes with 16 GPUs (two nodes have 4 GPUs each, one node has 8 GPUs). What I want to ask is: the original code uses "torch.distributed.reduce()", but since I am on multiple nodes and multiple GPUs, should I use "torch.distributed.reduce_multigpu", "torch.distributed.all_reduce_multigpu", or something else? And will they affect the training speed? My training speed is very slow when I use 3 nodes and 16 GPUs, just like the comparison above.
Thank you.

animeshsingh commented on May 26, 2024

Thanks @Eric-Zhang1990 for testing this thoroughly. With the same number of GPUs, running directly on bare metal, without the overhead of containers, is going to be faster.

Going back to the concept behind FfDL, the idea is to distribute training over multiple containers that can be spawned and killed on demand. This allows multiple users to share the same hardware backend, and lets us provide capabilities like batch scheduling, job queuing, monitoring, etc., which we are working towards by integrating with kube-batch. Users don't need to log in to individual machines and set things up; it is offered to them as a service. The user journey also stays the same whether they are using PyTorch, TensorFlow, etc.

animeshsingh commented on May 26, 2024

But 2 learners with 1 GPU each ((b) above) are also faster than 4 learners with 1 GPU each ((e) above), which does not seem normal.
Can you help me explain the results above? Thank you.

It definitely doesn't seem normal, and we would like to reproduce it at our end and test more. Are the two GPUs in the first case on the same machine, and when going to 4 GPUs did we spread across two machines?

animeshsingh commented on May 26, 2024

@Tomcli @animeshsingh I am running the maskrcnn-benchmark project (https://github.com/facebookresearch/maskrcnn-benchmark) on 3 nodes with 16 GPUs (two nodes have 4 GPUs each, one node has 8 GPUs). What I want to ask is: the original code uses "torch.distributed.reduce()", but since I am on multiple nodes and multiple GPUs, should I use "torch.distributed.reduce_multigpu", "torch.distributed.all_reduce_multigpu", or something else? And will they affect the training speed? My training speed is very slow when I use 3 nodes and 16 GPUs, just like the comparison above.
Thank you.

I would assume reduce should be faster, given that only the process with rank dst is going to receive the final result.
https://pytorch.org/docs/stable/distributed.html
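
For a concrete picture, a minimal sketch of the difference, assuming the usual one-process-per-GPU setup (the *_multigpu variants are only needed when a single process drives several GPUs and therefore passes a list of tensors):

```python
import torch
import torch.distributed as dist

# Assumes init_process_group has already been called and each process owns one GPU.
loss = torch.tensor([1.0], device="cuda")

# all_reduce: every rank ends up holding the summed value.
dist.all_reduce(loss, op=dist.ReduceOp.SUM)

# reduce: only the destination rank (dst=0 here) receives the summed value,
# which is why it can be slightly cheaper than all_reduce.
dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)
```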

Eric-Zhang1990 commented on May 26, 2024

But 2 learners with 1 GPU each ((b) above) are also faster than 4 learners with 1 GPU each ((e) above), which does not seem normal.
Can you help me explain the results above? Thank you.

It definitely doesn't seem normal, and we would like to reproduce it at our end and test more. Are the two GPUs in the first case on the same machine, and when going to 4 GPUs did we spread across two machines?

Thanks @animeshsingh for the kind reply. I tested them on the same machine (it has 4 GPUs). I also think it is not normal, but I don't know what could cause this behaviour. As I described above, when I use 3 nodes and 16 GPUs (16 learners on 3 machines), it is much slower than 4 learners on the same machine.
We expected more GPUs to be faster than fewer GPUs, but it is the opposite.

Eric-Zhang1990 commented on May 26, 2024

@animeshsingh @Tomcli Looking at the doc "FfDL/docs/gpu-guide.md" again, it says we should use "helm install --set lcm.device_plugin=false ." to deploy FfDL, but I did not use the parameter "lcm.device_plugin=false". Does that affect the training speed?
Thank you.
[screenshot]

animeshsingh commented on May 26, 2024

Thanks @animeshsingh for the kind reply. I tested them on the same machine (it has 4 GPUs). I also think it is not normal, but I don't know what could cause this behaviour. As I described above, when I use 3 nodes and 16 GPUs (16 learners on 3 machines), it is much slower than 4 learners on the same machine.
We expected more GPUs to be faster than fewer GPUs, but it is the opposite.

On the same machine, more GPUs should definitely be faster. When going across machines, it depends on having the right combination for your hardware as described here
https://pytorch.org/docs/stable/distributed.html#which-backend-to-use

Eric-Zhang1990 commented on May 26, 2024

@animeshsingh I use the 'nccl' backend and the 'reduce' call that the original maskrcnn-benchmark provides. When I run maskrcnn-benchmark on 2 machines with plain PyTorch distributed training (not using FfDL), it runs correctly with the 'gloo' backend, but an error occurs with 'nccl'. I am still trying to find a solution.
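
In case it helps while debugging, a common first step for NCCL failures across machines (NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL environment variables, nothing FfDL-specific; the interface name here is only an example) is to turn on NCCL's own logging and pin it to the interface that actually routes between the machines:

```python
import os
import torch.distributed as dist

# NCCL_DEBUG=INFO makes NCCL log why a connection or rendezvous fails;
# NCCL_SOCKET_IFNAME must name the interface that routes between the machines.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eno1")  # adjust per machine

dist.init_process_group(backend="nccl", init_method="env://")
```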

Eric-Zhang1990 commented on May 26, 2024

@animeshsingh Have you run the original maskrcnn-benchmark on FfDL? How does the speed compare between multiple GPUs on one machine and multiple GPUs spread over two or more machines? Thank you.

Tomcli commented on May 26, 2024

Hi @Eric-Zhang1990, sorry for the late reply. Can you show us the commands and specs you used for running the maskrcnn-benchmark on FfDL? Is it similar to

command: ./setup.sh; . env_file; python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS --nnodes=$NUM_LEARNERS --node_rank=$node_rank --master_addr=$master_node --master_port=1234 train_dist_launcher.py --batch_size=1024;

And did you specify 4 GPUs and 3 learners? Thanks.
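
For context, a script started this way is expected to consume the --local_rank argument that torch.distributed.launch passes to every worker and bind one GPU per process. A generic sketch (not the actual train_dist_launcher.py):

```python
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
parser.add_argument("--batch_size", type=int, default=1024)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # one GPU per launched process
dist.init_process_group(backend="nccl", init_method="env://")
# ... build the model, wrap it in torch.nn.parallel.DistributedDataParallel, train ...
```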

Eric-Zhang1990 commented on May 26, 2024

@Tomcli My manifest .yml is similar to 'FfDL/etc/examples/pytorch-launch-dist/manifest.yml',
[screenshot]
and my setup.sh is the same as 'FfDL/etc/examples/pytorch-launch-dist/setup.sh'.
[screenshot]
Can you help me find where the problem is?
Thank you.

Eric-Zhang1990 commented on May 26, 2024

@Tomcli I ran a test: I ran the same code through FfDL and with PyTorch's distributed training directly. PyTorch's distributed training directly is almost 2 times faster than FfDL, with both using the same machines and GPUs.
FfDL:
[screenshot]
PyTorch's distributed training directly:
[screenshot]

Is it because of some microservices running on FfDL?
Thank you.

animeshsingh commented on May 26, 2024

@Eric-Zhang1990 When running directly, are you on bare metal?

Eric-Zhang1990 commented on May 26, 2024

@animeshsingh Yes, I run maskrcnn-benchmark in a conda env; the code and parameters are all the same. The commands I use are the following (2 machines, each with 4 GPUs):
Master:
NCCL_SOCKET_IFNAME=eno1 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="192.168.110.25" --master_port=1234 train_net.py
Node 1:
NCCL_SOCKET_IFNAME=enp129s0f0 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="192.168.110.25" --master_port=1234 train_net.py
