Comments (14)

epeterson12 commented on August 31, 2024

When a model is trained using DataParallel, classification must also be done with DataParallel. If not, the model expects keys such as conv1.encoding_block.1.weight, while the checkpoint saved from the DataParallel-wrapped model contains module.conv1.encoding_block.1.weight (see image for the error message).
[screenshot: state_dict key mismatch error message]
The issue was reported here and the aforementioned solution was proposed.
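
As a minimal sketch of that workaround (using a toy module here rather than the actual unetsmall): nn.DataParallel prefixes every parameter name with "module.", so either the prefix is stripped from the saved state dict, or the model is wrapped in nn.DataParallel again before load_state_dict is called.

        import torch.nn as nn
        from collections import OrderedDict

        # Toy stand-in for the real network; the same idea applies to unetsmall.
        net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1))

        # A DataParallel wrapper saves its weights under a "module." prefix.
        saved_state = nn.DataParallel(net).state_dict()   # keys like "module.0.weight"

        # Strip the prefix so the weights load into a plain, unwrapped model.
        cleaned = OrderedDict((k.replace("module.", "", 1), v) for k, v in saved_state.items())
        net.load_state_dict(cleaned)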

epeterson12 commented on August 31, 2024

When using DataParallel with a single GPU, the program runs as expected. When trying to use 2 or more GPUs, the program hangs while training the first batch of the first epoch. This appears to be a common issue when using multiple K80s. The proposed solutions for the peer-to-peer hanging all require admin access.

We ran the CUDA p2pBandwidthLatencyTest in our environment with 2 GPUs and it shows very low bandwidth and high latency. Performance is worse when P2P is enabled, which is a requirement for DataParallel.
[screenshot: p2pBandwidthLatencyTest results on 2 K80s]

epeterson12 commented on August 31, 2024

Using DataParallel in PyTorch is straightforward, since task distribution and memory management are handled by PyTorch in the background. After moving the network to CUDA, the model must be wrapped in nn.DataParallel to make it use multiple GPUs.

        model = model.cuda()            # move the model's parameters to the default GPU
        model = nn.DataParallel(model)  # replicate the model across all visible GPUs

When we use DataParallel with a single GPU available, the program executes as expected. However, when using 2 or more GPUs, the program hangs indefinitely during the first epoch of training (presumed indefinite, since tests left running overnight with 121 training samples of 256x256 pixels never completed the first epoch's forward pass). As mentioned in my previous comment, this appears to be a common issue with K80 GPUs (see the linked GitHub issue).

A functional version of DataParallel also exists: torch.nn.parallel.data_parallel(module, inputs). It showed the same behavior when tested.
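
As a sketch of the functional form (toy module and batch; assumes at least one GPU is available):

        import torch
        import torch.nn as nn
        from torch.nn.parallel import data_parallel

        # The module is replicated across the listed devices and the batch is
        # scattered along dimension 0 for this single forward call.
        model = nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
        inputs = torch.randn(16, 3, 64, 64, device="cuda")
        outputs = data_parallel(model, inputs, device_ids=list(range(torch.cuda.device_count())))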

For now, I think we should set this task aside since we can't change which GPUs we have access to and we don't have the admin permissions required for the suggested workarounds. Once we deploy on AWS, we should revisit the issue since we will be able to choose what type of GPU is used.

mpelchat04 commented on August 31, 2024

Thanks for the update @epeterson12. I agree with you, we can set this issue aside and revisit it once we deploy on AWS.

epeterson12 commented on August 31, 2024

I just tested DataParallel on a p2.4xlarge instance on AWS. The process ran extremely slowly and CPU usage was at 100% for one core and close to 0% for the others. I got the following warning multiple times:
[screenshot: CUDA warning from PyTorch 0.4.0]

I set up the environment with PyTorch 0.4.0. We will have to update our PyTorch version in order to take advantage of DataParallel on V100 GPUs.
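
A quick environment check along those lines (sketch): print the installed PyTorch and CUDA versions and each GPU's compute capability; the V100 has compute capability 7.0 (sm_70), which requires a CUDA 9+ build of PyTorch.

        import torch

        print(torch.__version__, torch.version.cuda)
        for i in range(torch.cuda.device_count()):
            # V100s report (7, 0) here.
            print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))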

With the newest version of PyTorch installed, the speed was greatly improved. I will run tests on Monday to see how training on multiple GPUs compares to training on a single GPU.

epeterson12 commented on August 31, 2024

I did some performance tests using DataParallel on an AWS p3.8xlarge instance and compared them to our HPC. I trained our unetsmall model for 100 epochs and kept the same settings for each run except the batch size. Here are the results:

Instance Type | GPUs Used | Batch Size | Training Time
p3.8xlarge    | 1 x V100  | 30         | 78m 46s
p3.8xlarge    | 4 x V100  | 120        | 82m 59s
HPC           | 1 x K80   | 32         | 201m 29s

When using DataParallel, all 4 GPUs are used; otherwise only one is. This was confirmed by running nvidia-smi on the instances while training.
1 GPU: [screenshot: nvidia-smi output with one V100 in use]
4 GPUs: [screenshot: nvidia-smi output with four V100s in use]
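
A programmatic alternative to watching nvidia-smi, as a rough sketch (these counters only cover tensors held by PyTorch's allocator, so they read lower than nvidia-smi):

        import torch

        # Per-GPU memory held by PyTorch tensors, in MiB.
        for i in range(torch.cuda.device_count()):
            used = torch.cuda.memory_allocated(i) / 1024 ** 2
            peak = torch.cuda.max_memory_allocated(i) / 1024 ** 2
            print(f"GPU{i} ({torch.cuda.get_device_name(i)}): {used:.0f} MiB allocated, {peak:.0f} MiB peak")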

I will look into why training on multiple GPUs takes longer than training on a single GPU of the same type.

I also can't explain why the maximum batch size per GPU is smaller on the V100s than on the K80s, despite the V100s having 16 GB of GPU memory vs. 12 GB on the K80s.

Note: because of how the checkpointed_unet model works, it can't be used in a multi-GPU context: the checkpointed activations are not saved during the forward pass but recalculated during back-propagation, which conflicts with how DataParallel gathers the forward outputs and gradients back onto a single GPU. When we try to run checkpointed training using multiple GPUs, we get the following error: RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
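
For context, a minimal sketch of checkpointing on a toy block (hypothetical, not the actual checkpointed_unet): the block's intermediate activations are discarded during the forward pass and the block is re-run during backward to rebuild the graph, and that recomputation step is what appears to collide with DataParallel's device placement.

        import torch
        import torch.nn as nn
        from torch.utils.checkpoint import checkpoint

        # Toy checkpointed block: activations computed inside `block` are not stored.
        block = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).cuda()
        x = torch.randn(4, 3, 64, 64, device="cuda", requires_grad=True)

        y = checkpoint(block, x)   # forward pass without saving intermediates
        y.sum().backward()         # `block` is re-executed here to recompute them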

ymoisan commented on August 31, 2024

@epeterson12: interesting. Do I understand correctly that the reason you have 120 for batch_size in the 4-GPU case is that you just took the value for 1 GPU and multiplied by 4? Would it make sense to keep batch_size at 30 for all cases? It seems to me the idea of having more GPUs is to be able to execute the same task, but using more resources. That would explain why 4 GPUs take more time than 1 in your case.

At any rate, have you seen the MULTI-GPU EXAMPLES?

epeterson12 commented on August 31, 2024

The way DataParallel works is that it divides the data over all available GPUs. So with a batch size of 120 over 4 GPUs, each GPU processes 30 samples at a time, while with a batch size of 30 over 4 GPUs, two GPUs should process 7 samples and two should process 8 samples each. In that case we wouldn't be using all of the available GPU memory and we would have to run more batches. There is also a slowdown due to model synchronization between the GPUs, when the back-propagation graph is brought back together onto a single GPU to be stored; this occurs after each micro-batch is processed.
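
A small sketch of that splitting behaviour (toy module with no weights; each replica only sees its slice of the batch):

        import torch
        import torch.nn as nn

        class ToyNet(nn.Module):
            def forward(self, x):
                # Each replica receives roughly batch_size / num_gpus samples.
                print(f"{x.device}: micro-batch of {x.size(0)} samples")
                return x

        if torch.cuda.device_count() > 1:
            model = nn.DataParallel(ToyNet().cuda())
            model(torch.zeros(120, 3, 256, 256, device="cuda"))  # e.g. 30 samples per GPU on 4 GPUs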

I had looked at the MULTI-GPU EXAMPLES page that you referenced, but I chose to apply DataParallel to the whole model, as they do in the 60 Minute Blitz tutorial.

epeterson12 commented on August 31, 2024

@ymoisan I may have responded too hastily to your last comment. I had assumed that our program was purely memory-bound and was simply repeating what the PyTorch devs have answered to similar questions on GitHub and the PyTorch forums. When the more complex computations are being done, GPU usage can reach 100% on multiple GPUs with a batch size of 120. I ran a test using DataParallel with the same configuration as my tests above but with a batch size of 30, as you suggested, and noted the time to completion and the GPU utilization.

The available memory per GPU is 16160 MiB.

Batch Size | Training Time
30         | 67m 27s

GPU  | Memory Use (MiB) | Utilization (%)
GPU0 | 14465            | 92
GPU1 | 14127            | 78
GPU2 | 14127            | 82
GPU3 | 13389            | 68

It took less time to complete the training with a batch size of 30 than with 120. I will do some more tests to find the ideal batch size for our use case, as well as try to find other causes of the slowdown.

ymoisan commented on August 31, 2024

@epeterson12: so with batch_size=160 the GPU utilization stats must have been much lower, right?

epeterson12 commented on August 31, 2024

@ymoisan: I get an out-of-memory error when trying to use a batch size higher than 120. At batch_size=120, GPU utilization is at 100% across multiple GPUs and GPU memory use is 16049 MiB for GPU0 and 15747 MiB for the others.

ymoisan commented on August 31, 2024

@epeterson12: sorry; I meant 120.

But still, it doesn't make sense to me that the run that drives 4 GPUs at close to 100% utilization and memory (batch_size = 120) takes more time than one where GPU utilization is 92, 78, 82 and 68% and which uses less GPU RAM overall, as shown above... What are those GPUs doing?

epeterson12 commented on August 31, 2024

@ymoisan: I figured you meant 120, but I just wanted to be sure.

I don't know why the GPU utilization is so high either. Could it be some delay due to the GPU trying to decide what to do with all of the data it is receiving? I found this post from 2017 that really bothered me. In the end, the person gave up on using DataParallel in their convolutional network. I am wondering if this type of network is ill-suited for multi-GPU use...

epeterson12 commented on August 31, 2024

I ran PyTorch's bottleneck profiler (python -m torch.utils.bottleneck) on train_model.py, on a single GPU and on 4 GPUs with DataParallel, using a batch size of 30. Note: we have to set num_workers=0 in our DataLoaders in order to run bottleneck.
Here are the outputs:
bottleneck_output_4gpu.txt
bottleneck_output_1gpu.txt

Comparing the cProfile output, sorted by cumtime (the total time spent in a function and in all the functions it calls), lets us identify where the relative slowdowns are with 1 GPU and with 4 GPUs under DataParallel. Here is a summary of my results:
[screenshot: comparison of cumtime between the 1-GPU and 4-GPU bottleneck outputs]
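
For anyone reproducing the comparison: bottleneck includes a cProfile pass, and an equivalent standalone sketch (with a stand-in train function in place of the real training loop) sorted by cumtime looks like this:

        import cProfile
        import pstats

        def train():
            # Stand-in for the real training loop in train_model.py.
            sum(i * i for i in range(100000))

        profiler = cProfile.Profile()
        profiler.runcall(train)
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)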

Based on these results, using 4 GPUs seems like it should be faster than using a single GPU. However, when measuring total training time, training takes practically the same time on 1 and on 4 GPUs, and the cost of training on one graphics card is drastically lower than on four.

Summary of test results on AWS with training over 100 epochs

GPUs Used | Batch Size | Training Time | Cost per Hour (EC2 Instance) | Total Training Cost
1 x V100  | 30         | 78m 46s       | $3.366                       | $4.419
4 x V100  | 30         | 67m 27s       | $13.464                      | $15.136
4 x V100  | 120        | 82m 59s       | $13.464                      | $18.621

The synchronization delay doesn't seem to be the reason that DataParallel is no faster than a single GPU in our tests, since P2P connectivity is quite reasonable on the AWS VMs.
[screenshot: p2pBandwidthLatencyTest results on the AWS instance]

I don't think it will be worth using multiple GPUs for training on AWS, given the large price increase and the minimal training speedup with our model.
