
How to use DDP during training (lama) · 7 comments · closed

advimman commented on August 21, 2024
How to use ddp during training


Comments (7)

nishanthballal-9 commented on August 21, 2024

Hi @windj007

I'm training a big-lama model on a custom dataset. I tried both single-GPU and multi-GPU training, and here is what I observed.
Single GPU (Tesla V100 16 GB, data.batch_size=6): time per epoch ~1 hr 12 min
Multi GPU (4 x Tesla V100 16 GB, data.batch_size=6): time per epoch ~1 hr 11 min (almost the same)

Ideally one would expect the time per epoch to decrease when using DDP. Am I missing something here? Is there any other parameter that needs to be changed for multi-GPU training?

I run python3 bin/train.py -cn big-lama data.batch_size=6 to start my training.


windj007 commented on August 21, 2024

Hi!

Our pipeline uses DDP by default, so no extra configuration is needed. With DDP enabled, data.batch_size sets the number of samples per GPU, so the total batch size is data.batch_size * n_gpus. For more fine-grained tuning, please refer to the trainer.kwargs subsection of the configuration.
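This also accounts for the timing observation above: the epoch length is defined in optimizer steps rather than samples (see limit_train_batches below), so with the same per-GPU batch size an epoch takes roughly the same wall time on 4 GPUs while covering 4x the data. As a minimal sketch of a multi-GPU launch (the trainer.kwargs.gpus override is an assumption here; check the trainer section of your config for the exact PyTorch Lightning kwargs that are passed through):

    # per-GPU batch size of 6; with 4 GPUs the effective batch size is 24
    # trainer.kwargs.gpus is a hypothetical override -- verify the key in your trainer config
    python3 bin/train.py -cn big-lama data.batch_size=6 trainer.kwargs.gpus=4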

Does this answer your question?

Btw, what is 180w+?


Queenyy commented on August 21, 2024

Hi, thanks for your reply.

There is also a parameter named limit_train_batches in trainer.kwargs. Is this the number of samples per GPU, so that the number of samples per epoch is limit_train_batches * n_gpus?

PS: Please ignore the "180w" mistake; I meant 1.8 million. I have edited the question.


windj007 commented on August 21, 2024

limit_train_batches is the number of training steps within a single epoch. It is independent of the batch size. I believe you do not have to alter it unless your dataset is really small (dataset_size < limit_train_batches * n_gpus * batch_size).
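For example, with the numbers from earlier in this thread (data.batch_size=6, 4 GPUs, limit_train_batches=25000), one epoch draws 25000 * 4 * 6 = 600,000 samples, so the setting only needs lowering if the dataset is smaller than that.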

Set it to balance the amount of training against the evaluation frequency. We chose it so that validation runs approximately 4 times a day: each day brings some news, but no excessive time is spent on overly frequent evaluation. For our hardware that value was 25000.

If the training is unstable, the epoch size should be smaller (so as not to miss a good local minimum). This might be the case for purely adversarial models; it shouldn't be for LaMa.

Note that there is another parameter, val_check_interval, which should almost always be equal to limit_train_batches (it is in our configs).
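A sketch of setting both together on the command line (assuming both keys live under trainer.kwargs, as discussed above):

    # keep val_check_interval equal to limit_train_batches, as in the released configs
    python3 bin/train.py -cn big-lama data.batch_size=6 \
        trainer.kwargs.limit_train_batches=25000 \
        trainer.kwargs.val_check_interval=25000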


Queenyy commented on August 21, 2024

I am using the 512 Places2 dataset with data.batch_size=10, n_gpus=8, the DDP accelerator, and limit_train_batches=25000. The network is the same as the lama-fourier model in your release. One epoch takes approximately 6 hours; is this normal? I am wondering how long an epoch took in your experiments, and with what batch size. Are there any promising directions for reducing the training time? Thanks very much!


windj007 commented on August 21, 2024

bs=10

Do you mean that data.batch_size=10 and the training is running on 8 GPUs - so the total batch size is 80?

it takes approximately 6h an epoch, is this normal?

That sounds reasonable. Of course, it also depends on the exact GPU model and on HDD/SSD performance.
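As a back-of-the-envelope check: 25000 steps per epoch with a total batch size of 80 is 2,000,000 images per epoch; over 6 hours (~21,600 s) that is roughly 1.2 optimizer steps per second, or about 90 images per second across the 8 GPUs.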


Sanster commented on August 21, 2024

@Queenyy Hi, have you successfully fine-tuned your own LaMa model? I would be very grateful if you could share some of your experience.

