Comments (7)
Hi @windj007
I'm training a big-lama model on a custom dataset. I ran both single-GPU and multi-GPU training, and here are some observations.
Single GPU (Tesla V100 16 GB, `data.batch_size=6`): time per epoch is ~1 hr 12 min.
Multi GPU (4 x Tesla V100 16 GB, `data.batch_size=6`): time per epoch is ~1 hr 11 min (almost the same).
Ideally, one would expect the time per epoch to decrease when using DDP. Am I missing something here? Is there any other parameter that needs to be changed for multi-GPU training?
I run `python3 bin/train.py -cn big-lama data.batch_size=6` to start my training.
Hi!
Our pipeline uses DDP by default, so no extra configuration is needed. With DDP enabled, `data.batch_size` sets the number of samples per GPU, so the total batch size will be `data.batch_size * n_gpus`. For more fine-grained tuning, please refer to the `trainer.kwargs` subsection of the configuration.
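As a quick illustration of that arithmetic (the numbers come from this thread; the variable names are mine, not LaMa config keys):

```python
# Effective batch size under DDP: each process draws its own batch per step.
per_gpu_batch_size = 6           # data.batch_size in the config
n_gpus = 4                       # e.g. 4 x Tesla V100
effective_batch_size = per_gpu_batch_size * n_gpus
print(effective_batch_size)      # 24 samples per optimizer step
```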
Does this answer your question?
Btw, what is `180w+`?
Hi, thanks for your reply.
There is also a parameter named `limit_train_batches` in `trainer.kwargs`. Is this the number of samples per GPU? So for one epoch, is the number of samples `limit_train_batches * n_gpus`?
PS: Please ignore the mistake `180w`; what I meant is 1.8 million. I have edited the question.
`limit_train_batches` is the number of training steps within a single epoch. It is independent of the batch size. I believe you do not have to alter it unless your dataset is really small (`dataset_size < limit_train_batches * n_gpus * batch_size`).
Set it to balance the amount of training against evaluation frequency. We set it so that validation runs approximately 4 times a day: each day brings some news, but no excessive time is spent on overly frequent evaluation. For our hardware that was 25000.
If the training is unstable, the epoch size should be smaller (so as not to miss a good local minimum). This might be the case for purely adversarial models; it shouldn't be the case for LaMa.
Note that there is another parameter, `val_check_interval`, which should almost always be equal to `limit_train_batches` (it is so in our configs).
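To make these relationships concrete, here is a minimal sketch (only `limit_train_batches`, `val_check_interval`, `batch_size`, and `n_gpus` correspond to real config values; the rest is illustrative arithmetic):

```python
# Epoch bookkeeping as described above; values are examples from this thread.
limit_train_batches = 25000      # training steps per epoch
batch_size = 6                   # data.batch_size, per GPU
n_gpus = 4
samples_per_epoch = limit_train_batches * n_gpus * batch_size  # 600,000

dataset_size = 1_800_000         # e.g. the 1.8M images mentioned above
if dataset_size < samples_per_epoch:
    print("Dataset smaller than one epoch: consider lowering limit_train_batches.")

# Keep validation in sync with the epoch length, as in the LaMa configs:
val_check_interval = limit_train_batches
```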
I use the 512x512 Places2 dataset with `bs=10`, `n_gpus=8`, the DDP accelerator, and `limit_train_batches=25000`; the network is the same as the lama-fourier in your released models. It takes approximately 6 h per epoch; is this normal? I am wondering how long an epoch took in your experiments, and with what batch size. Are there any promising directions for reducing the training time? Thank you very much!
> bs=10

Do you mean that `data.batch_size=10` and the training is running on 8 GPUs, so the total batch size is 80?
> it takes approximately 6h an epoch, is this normal?
That sounds reasonable. Of course, it depends on the exact GPU model and on the performance of the HDD/SSD.
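For what it's worth, a back-of-the-envelope check of those numbers (pure arithmetic, not LaMa code):

```python
# Throughput implied by "25000 steps in ~6 h" on 8 GPUs with batch size 10.
steps_per_epoch = 25000
epoch_seconds = 6 * 3600
steps_per_second = steps_per_epoch / epoch_seconds   # ~1.16 steps/s
images_per_second = steps_per_second * 10 * 8        # ~93 images/s in total
print(f"{steps_per_second:.2f} steps/s, ~{images_per_second:.0f} images/s")
```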
Hi @Queenyy, have you successfully fine-tuned your own LaMa model? I would be very grateful if you could share some of your experience.