
How to use DDP during training (lama) · 7 comments · closed

advimman commented on August 21, 2024
How to use ddp during training


Comments (7)

nishanthballal-9 commented on August 21, 2024

Hi @windj007

I'm training a big-lama model on a custom dataset. I tried both single-GPU and multi-GPU training, and here is what I observed.
Single GPU (Tesla V100 16 GB, data.batch_size=6): time per epoch ~1 hr 12 min
Multi GPU (4 x Tesla V100 16 GB, data.batch_size=6): time per epoch ~1 hr 11 min (almost the same)

Ideally one would expect the time per epoch to decrease when using DDP. Am I missing something here? Is there any other parameter that needs to be changed for multi-GPU training?

I run python3 bin/train.py -cn big-lama data.batch_size=6 to start my training.


windj007 commented on August 21, 2024

Hi!

Our pipeline uses DDP by default, so no extra configuration is needed. With DDP enabled, data.batch_size sets the number of samples per GPU, so the total batch size is data.batch_size * n_gpus. For more fine-grained tuning, please refer to the trainer.kwargs subsection of the configuration.
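This also accounts for the timing observation above: the epoch length is defined in optimizer steps rather than samples (see limit_train_batches below), so with the same per-GPU batch size an epoch takes roughly the same wall time on 4 GPUs while covering 4x the data. As a minimal sketch of a multi-GPU launch (the trainer.kwargs.gpus override is an assumption here; check the trainer section of your config for the exact PyTorch Lightning kwargs that are passed through):

    # per-GPU batch size of 6; with 4 GPUs the effective batch size is 24
    # trainer.kwargs.gpus is a hypothetical override -- verify the key in your trainer config
    python3 bin/train.py -cn big-lama data.batch_size=6 trainer.kwargs.gpus=4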

Does this answer your question?

Btw, what is 180w+?


Queenyy commented on August 21, 2024

Hi, thanks for your reply.

There is also a parameter named limit_train_batches in trainer.kwargs. Is this the number of samples per GPU, so that the number of samples per epoch is limit_train_batches * n_gpus?

PS: Please ignore the "180w" mistake; I meant 1.8 million. I have edited the question.


windj007 commented on August 21, 2024

limit_train_batches is the number of training steps within a single epoch. It is independent of the batch size. I believe you do not have to alter it unless your dataset is really small (dataset_size < limit_train_batches * n_gpus * batch_size).
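For example, with the numbers from earlier in this thread (data.batch_size=6, 4 GPUs, limit_train_batches=25000), one epoch draws 25000 * 4 * 6 = 600,000 samples, so the setting only needs lowering if the dataset is smaller than that.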

Set it to balance the amount of training against the evaluation frequency. We chose it so that validation runs approximately 4 times a day: each day brings some news, but no excessive time is spent on overly frequent evaluation. For our hardware that value was 25000.

If the training is unstable, the epoch size should be smaller (so as not to miss a good local minimum). This might be the case for purely adversarial models; it shouldn't be for LaMa.

Note that there is another parameter, val_check_interval, which should almost always be equal to limit_train_batches (it is in our configs).
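A sketch of setting both together on the command line (assuming both keys live under trainer.kwargs, as discussed above):

    # keep val_check_interval equal to limit_train_batches, as in the released configs
    python3 bin/train.py -cn big-lama data.batch_size=6 \
        trainer.kwargs.limit_train_batches=25000 \
        trainer.kwargs.val_check_interval=25000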


Queenyy commented on August 21, 2024

I am using the 512 Places2 dataset with data.batch_size=10, n_gpus=8, the DDP accelerator, and limit_train_batches=25000. The network is the same as the lama-fourier model in your release. One epoch takes approximately 6 hours; is this normal? I am wondering how long an epoch took in your experiments, and with what batch size. Are there any promising directions for reducing the training time? Thanks very much!


windj007 commented on August 21, 2024

bs=10

Do you mean that data.batch_size=10 and the training is running on 8 GPUs - so the total batch size is 80?

it takes approximately 6h an epoch, is this normal?

That sounds reasonable. Of course, it also depends on the exact GPU model and on HDD/SSD performance.
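As a back-of-the-envelope check: 25000 steps per epoch with a total batch size of 80 is 2,000,000 images per epoch; over 6 hours (~21,600 s) that is roughly 1.2 optimizer steps per second, or about 90 images per second across the 8 GPUs.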


Sanster commented on August 21, 2024

@Queenyy Hi, have you successfully fine-tuned your own LaMa model? I would be very grateful if you could share some of your experience.

