Comments (9)
Thanks @vfdev-5 ! Here are my results:
Config | Total time | Training epoch time, validation train/test times |
---|---|---|
PyTorch 1.7.1+cu110, RTX 3070, 1 GPU | 00:02:30 | 00:00:05, 00:00:01.8, 00:00:00.9 |
PyTorch 1.7.1+cu110, RTX 3070, 2 GPU | 00:01:29 | 00:00:03, 00:00:01.3, 00:00:00.5 |
PyTorch 1.9.1+cu111, RTX 3070, 1 GPU | 00:02:19 | 00:00:05, 00:00:01.8, 00:00:01 |
PyTorch 1.9.1+cu111, RTX 3070, 2 GPU | 00:01:26 | 00:00:03, 00:00:01.3, 00:00:01 |
PyTorch 1.7.1+cu110, g4dn.12xlarge, 1 GPU (7Gb of GPU RAM used) | 00:14:55 | 00:00:35, 00:00:02.4, 00:00:03 |
PyTorch 1.7.1+cu110, g4dn.12xlarge, 2 GPU | 00:02:28 | 00:00:05, 00:00:01.7, 00:00:00.7 |
PyTorch 1.7.1+cu110, g4dn.12xlarge, 4 GPU | 00:01:53 | 00:00:04, 00:00:01.7, 00:00:00.5 |
It looks like there is something wrong with 1 GPU on g4dn.12xlarge.
Btw, is it fair to compare the speed this way? i.e. in a multi-gpu context, each GPU gets a smaller batch_size.
Thanks for the results @H4dr1en !
Definitely, there is something unclear with the 1 GPU case for g4dn.12xlarge.
> Btw, is it fair to compare the speed this way? i.e. in a multi-gpu context, each GPU gets a smaller batch_size.
Well, I'd say we are interested in how quickly the task gets done, where the task is measured by the number of processed images. If in a multi-gpu context we load each GPU as in the single-GPU case, we have to reduce the number of iterations to run, otherwise they won't accomplish the same task, I think.
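One way to make the comparison independent of how the global batch is split across GPUs is to compare throughput in images per second rather than raw wall-clock time. A minimal sketch, using the CIFAR10 train-split size and epoch times taken from the tables above:

```python
# Sketch: compare runs by training throughput (images/second), which is
# independent of the per-GPU batch size split in a multi-gpu setup.
# CIFAR10 has 50,000 training images; epoch times are from the tables above.
CIFAR10_TRAIN_IMAGES = 50_000

def throughput(epoch_seconds: float, n_images: int = CIFAR10_TRAIN_IMAGES) -> float:
    """Images processed per second for one training epoch."""
    return n_images / epoch_seconds

# RTX 3070, PyTorch 1.7.1: ~5 s/epoch on 1 GPU vs ~3 s/epoch on 2 GPUs
t1 = throughput(5.0)  # 1 GPU
t2 = throughput(3.0)  # 2 GPUs
print(f"1 GPU: {t1:.0f} img/s, 2 GPUs: {t2:.0f} img/s, speedup: {t2 / t1:.2f}x")  # speedup ≈ 1.67x
```

With whole-second epoch-time rounding the speedup estimate is coarse, but it lets runs with different GPU counts be compared on the same scale.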
EDIT: in the logs for the "PyTorch 1.7.1+cu110, g4dn.12xlarge, 1 GPU (7Gb of GPU RAM used)" case, do you see something like "Apply torch DataParallel on model"?
Thanks for reporting @H4dr1en ! I'd like to reproduce your results to see what happens once I have some time for that.
Today, we have some benchmarks for Pascal VOC on 1, 2 and 4 GPUs (GeForce RTX 2080 Ti)
- 1 GPU - 3:55h : https://app.clear.ml/projects/1f574b9622104ab0bcdef10c39ff5e2f/experiments/a971430efa76456895724ad7758bf44b/output/execution
- 2 GPUs - 1:53h : https://app.clear.ml/projects/1f574b9622104ab0bcdef10c39ff5e2f/experiments/5899d7044a9b4c3587f3b12ea498ba72/output/execution
- 4 GPUs - 1:02h : https://app.clear.ml/projects/1f574b9622104ab0bcdef10c39ff5e2f/experiments/f96e3ebd97104ab99ff94e582336afaf/output/execution
Btw, thanks for pointing out the clearml-task feature, very cool!
@H4dr1en Thanks for that report. That sounds weird; I ran similar experiments when I worked at a research center with a GPU cluster, and scalability was fine.
Did you try disabling clearml? Transferring results to the server can cause disruptions and interruptions during training.
Thanks for your answers!
From what you reported, scaling training speed linearly with the number of GPUs should be achievable, so there is probably something wrong that can be fixed.
For context, I observe similarly poor scalability for my own use case on g4dn.12xlarge instances, so I hope that finding the bottleneck in the cifar10 example will also unblock my other project.
@H4dr1en can you please try the original cifar10 example on your infrastructure using 1, 2 and 4 GPUs and report the runtimes here?
```sh
# 1 GPU
CUDA_VISIBLE_DEVICES=0 python main.py run

# 2 GPUs, older PyTorch (<1.9)
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl"
# 2 GPUs, PyTorch >= 1.9
torchrun --nproc_per_node=2 main.py run --backend="nccl"

# 4 GPUs, older PyTorch (<1.9)
python -u -m torch.distributed.launch --nproc_per_node=4 --use_env main.py run --backend="nccl"
# 4 GPUs, PyTorch >= 1.9
torchrun --nproc_per_node=4 main.py run --backend="nccl"
```
My times on 1 and 2 GTX 1080 Ti GPUs for comparison:
Config | Total time | Training epoch time, validation train/test times |
---|---|---|
1 GPU | 00:04:22 | 00:00:08, 00:00:05, 00:00:02 |
2 GPUs (DDP) | 00:02:57 | 00:00:06, 00:00:03, 00:00:01 |
> EDIT: in the logs for the "PyTorch 1.7.1+cu110, g4dn.12xlarge, 1 GPU (7Gb of GPU RAM used)" case, do you see something like "Apply torch DataParallel on model"?
No, here are the logs for this run:
```
ec2-user@ip-10-100-0-002:~/ignite/examples/contrib/cifar10# CUDA_VISIBLE_DEVICES=0 python main.py run
2022-02-04 12:22:46,263 ignite.distributed.launcher.Parallel INFO: - Run '<function training at 0x7f717e4a56a8>' in 1 processes
2022-02-04 12:22:49,529 CIFAR10-Training INFO: Train resnet18 on CIFAR10
2022-02-04 12:22:49,529 CIFAR10-Training INFO: - PyTorch version: 1.7.1+cu110
2022-02-04 12:22:49,529 CIFAR10-Training INFO: - Ignite version: 0.4.8
2022-02-04 12:22:49,536 CIFAR10-Training INFO: - GPU Device: Tesla T4
2022-02-04 12:22:49,536 CIFAR10-Training INFO: - CUDA version: 11.0
2022-02-04 12:22:49,536 CIFAR10-Training INFO: - CUDNN version: 8005
2022-02-04 12:22:49,536 CIFAR10-Training INFO:
2022-02-04 12:22:49,536 CIFAR10-Training INFO: Configuration:
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  with_amp: False
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  with_clearml: False
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  stop_iteration: None
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  nproc_per_node: None
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  log_every_iters: 15
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  resume_from: None
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  backend: None
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  checkpoint_every: 1000
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  validate_every: 3
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  num_warmup_epochs: 4
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  learning_rate: 0.4
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  num_epochs: 24
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  num_workers: 12
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  weight_decay: 0.0001
2022-02-04 12:22:49,536 CIFAR10-Training INFO:  momentum: 0.9
2022-02-04 12:22:49,537 CIFAR10-Training INFO:  batch_size: 512
2022-02-04 12:22:49,537 CIFAR10-Training INFO:  model: resnet18
2022-02-04 12:22:49,537 CIFAR10-Training INFO:  output_path: /tmp/output-cifar10/
2022-02-04 12:22:49,537 CIFAR10-Training INFO:  data_path: /tmp/cifar10
2022-02-04 12:22:49,537 CIFAR10-Training INFO:  seed: 543
2022-02-04 12:22:49,537 CIFAR10-Training INFO:
2022-02-04 12:22:49,537 CIFAR10-Training INFO: Output path: /tmp/output-cifar10/resnet18_backend-None-1_20220204-122249
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /tmp/cifar10/cifar-10-python.tar.gz
99%|█████████▉| 169295872/170498071 [00:11<00:00, 15776620.49it/s]
Extracting /tmp/cifar10/cifar-10-python.tar.gz to /tmp/cifar10
2022-02-04 12:23:04,002 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10':
{'batch_size': 512, 'num_workers': 12, 'shuffle': True, 'drop_last': True, 'pin_memory': True}
2022-02-04 12:23:04,003 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10':
{'batch_size': 1024, 'num_workers': 12, 'shuffle': False, 'pin_memory': True}
2022-02-04 12:23:10,650 CIFAR10-Training INFO: Engine run starting with max_epochs=24.
```
But the behaviour above for 1 GPU on g4dn.12xlarge is probably a separate issue. Sorry, I was not very explicit in the issue description; my main concern is the following:
If we define the factor of improvement as `f(n_gpu) = training_time(n_gpu) / training_time(2 * n_gpu)`, I would expect it to approach 2, but it never seems to be achieved. What could be the reason for that?
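A quick sketch of this factor, plugging in the total times from the g4dn.12xlarge rows of the table above:

```python
# Compute the factor of improvement f(n) = training_time(n GPUs) / training_time(2n GPUs)
# from the total times reported in the g4dn.12xlarge table above.
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

times = {
    1: to_seconds("00:14:55"),  # 1 GPU (the anomalous run)
    2: to_seconds("00:02:28"),  # 2 GPUs
    4: to_seconds("00:01:53"),  # 4 GPUs
}
for n in (1, 2):
    f = times[n] / times[2 * n]
    print(f"f({n}) = {f:.2f}")  # prints f(1) = 6.05 and f(2) = 1.31
```

So the anomalous 1 GPU run inflates f(1) well past 2, while the 2→4 GPU step yields only about 1.31x, far from the ideal factor of 2.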
> I would expect to have a factor of improvement approaching 2, and it seems to never be achieved. What could be the reason for that?
I think in the case of cifar10 a larger model can give something like linear scaling (up to a certain limit).
EDIT: maybe dataset size and image size also play a role here.
My times on 1 and 2 GTX 1080 Ti GPUs for the resnet152 model, 10 epochs:
Config | Total time | Training epoch time, validation train/test times |
---|---|---|
1 GPU | 00:05:21 | 00:00:25, 00:00:13, 00:00:03 |
2 GPUs (DDP) | 00:03:51 | 00:00:19, 00:00:08, 00:00:02 |
See also the results for Pascal VOC: #75 (comment), where factor ~ N GPUs for N = 1, 2 and 4.