Comments (9)

rwightman commented on July 22, 2024

FYI @mitchellnw in the code from the merge, I left some DP code around, but didn't test it and was planning to remove it entirely. I don't see the point in maintaining DP, as its performance is all-around worse than DDP's, even on a single node...
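
(For anyone landing here, a minimal sketch of the DDP pattern being referred to, assuming a launch via torchrun; this is illustrative, not code from the repo — build_model is a placeholder:)

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for every process it spawns
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)  # build_model is a placeholder for your model constructor
model = DDP(model, device_ids=[local_rank])
# train as usual; DDP all-reduces gradients across processes during backward()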

wenlinyao commented on July 22, 2024

Thanks for replying!
Is it possible to provide full instructions on how to set up the required environment and how to run DDP (both single-node multi-GPU and multi-node multi-GPU)? A few example commands would also be very helpful! @rwightman

rwightman commented on July 22, 2024

@wenlinyao on current versions of PyTorch (1.10/1.11) it is really easy w/ torchrun. On older versions you can use the distributed launcher: python3 -m torch.distributed.launch --nproc_per_node=$NUM_PROC script.py args

An example command line from some recent CC12M training I was doing (4x V100 32GB GPUs), launched from the src/ folder...

torchrun --nproc_per_node 4 -m training.main --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' --dataset-type webdataset --batch-size 320 --precision amp --workers 4 --imagenet-val /raid/imagenet/validation/
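
For multi-node, an untested sketch along the same lines (node count, host, and port below are placeholders you'd substitute for your cluster):

torchrun --nnodes 2 --nproc_per_node 4 --rdzv_backend c10d --rdzv_endpoint $MASTER_HOST:29500 -m training.main --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' --dataset-type webdataset --batch-size 320 --precision amp --workers 4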

rwightman commented on July 22, 2024

To be clear, no extra components are needed aside from torch, webdataset, and the other dependencies open_clip already has. torchrun should be on the path for a standard conda install of pytorch (and likely a pip install too, but I haven't tested).

wenlinyao commented on July 22, 2024

@rwightman After adding parser.add_argument("--local_rank", type=int, default=0) to params.py, "torch.distributed.launch" is running now. ;)
But I just noticed another problem: the run hangs forever at the end of training. (It looks like a synchronization problem?)

I further noticed that this problem occurs when I pass "data_1/000{00..08}.tar" (9 files) as the training data, but not when I pass "data_1/000{00..07}.tar" (8 files). I suspect it is due to how the dataloader distributes shards.
If "{0000..2175}.tar" works in your case, maybe this problem will disappear once I pass enough .tar files?
Thanks!

rwightman commented on July 22, 2024

@wenlinyao using local_rank and torch.distributed.launch is the 'old' way, but it works, and I still use it as the default for timm
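
(Roughly the difference between the two launch styles — a sketch, not open_clip's actual code:)

# 'old' torch.distributed.launch: the rank arrives as a CLI argument
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)

# 'new' torchrun: the rank arrives via environment variables instead
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)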

Yes, you need more shards. Shards are distributed across all of the workers (each worker is randomly assigned a 1/N subset of the dataset shards each epoch). I would say at bare minimum you probably want at least 2 shards per process (GPU) in a distributed setup ... but ideally you shard the dataset into many more.
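
(To make the 9-shard hang concrete — my illustration, using a simple round-robin split rather than webdataset's actual random assignment:)

# 9 shards split across 8 ranks: rank 0 gets 2 shards, ranks 1-7 get 1 each
shards = [f"data_1/000{i:02d}.tar" for i in range(9)]
world_size = 8
for rank in range(world_size):
    mine = shards[rank::world_size]
    print(rank, len(mine))  # rank 0 -> 2, every other rank -> 1
# Rank 0 runs roughly twice as many steps per epoch; the other ranks finish
# early and block waiting in the next collective op (gradient all-reduce),
# which is why training appears to hang at the end of the epoch.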

wenlinyao commented on July 22, 2024

@rwightman I see, thanks for your answers!

rwightman commented on July 22, 2024

@wenlinyao closing this; if you have any additional concerns or issues re parallel training, let us know. DataParallel support was removed to avoid confusion, and the README now includes command-line examples for using torchrun, SLURM, etc. w/ DistributedDataParallel.

wenlinyao commented on July 22, 2024

Thanks for the update!
