Comments (9)

rwightman commented on July 22, 2024

FYI @mitchellnw in the code from the merge, I left some DP code around, but didn't test it and was planning to remove it entirely. I don't see the point in maintaining DP, as its performance is all-around worse than DDP's, even on a single node...
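
(For anyone landing here, a minimal sketch of the DDP pattern being referred to, assuming a launch via torchrun; this is illustrative, not code from the repo — build_model is a placeholder:)

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for every process it spawns
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)  # build_model is a placeholder for your model constructor
model = DDP(model, device_ids=[local_rank])
# train as usual; DDP all-reduces gradients across processes during backward()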

wenlinyao commented on July 22, 2024

Thanks for replying!
Is it possible to provide full instructions on how to set up the required environment and how to run DDP (both single-node multi-GPU and multi-node multi-GPU)? A few example commands would also be very helpful! @rwightman

rwightman commented on July 22, 2024

@wenlinyao on current versions of PyTorch (1.10/1.11) it is really easy w/ torchrun. On older versions you can use the distributed launcher: python3 -m torch.distributed.launch --nproc_per_node=$NUM_PROC script.py args

An example command line from some recent CC12M training I was doing (4x V100 32GB GPUs), launched from the src/ folder...

torchrun --nproc_per_node 4 -m training.main --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' --dataset-type webdataset --batch-size 320 --precision amp --workers 4 --imagenet-val /raid/imagenet/validation/
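
For multi-node, an untested sketch along the same lines (node count, host, and port below are placeholders you'd substitute for your cluster):

torchrun --nnodes 2 --nproc_per_node 4 --rdzv_backend c10d --rdzv_endpoint $MASTER_HOST:29500 -m training.main --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' --dataset-type webdataset --batch-size 320 --precision amp --workers 4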

rwightman commented on July 22, 2024

To be clear, no extra components are needed aside from torch, webdataset, and the other dependencies open_clip already has. torchrun should be on the path for a standard conda install of pytorch (and likely a pip install too, but I haven't tested).

wenlinyao commented on July 22, 2024

@rwightman After adding parser.add_argument("--local_rank", type=int, default=0) to params.py, "torch.distributed.launch" is running now. ;)
But I just noticed another problem: the run hangs forever at the end of training. (It looks like a synchronization problem?)

I further noticed that this problem occurs when I pass "data_1/000{00..08}.tar" (9 files) as the training data, but not when I pass "data_1/000{00..07}.tar" (8 files). I suspect it is due to how the dataloader distributes shards.
If "{0000..2175}.tar" works in your case, maybe this problem will disappear once I pass enough .tar files?
Thanks!

rwightman commented on July 22, 2024

@wenlinyao using local_rank and torch.distributed.launch is the 'old' way, but it works, and I still use it as the default for timm
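
(Roughly the difference between the two launch styles — a sketch, not open_clip's actual code:)

# 'old' torch.distributed.launch: the rank arrives as a CLI argument
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)

# 'new' torchrun: the rank arrives via environment variables instead
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)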

Yes, you need more shards. Shards are distributed across all of the workers (each worker is randomly assigned a 1/N subset of the dataset shards each epoch). I would say at bare minimum you probably want at least 2 shards per process (GPU) in a distributed setup ... but ideally you shard the dataset into many more.
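
(To make the 9-shard hang concrete — my illustration, using a simple round-robin split rather than webdataset's actual random assignment:)

# 9 shards split across 8 ranks: rank 0 gets 2 shards, ranks 1-7 get 1 each
shards = [f"data_1/000{i:02d}.tar" for i in range(9)]
world_size = 8
for rank in range(world_size):
    mine = shards[rank::world_size]
    print(rank, len(mine))  # rank 0 -> 2, every other rank -> 1
# Rank 0 runs roughly twice as many steps per epoch; the other ranks finish
# early and block waiting in the next collective op (gradient all-reduce),
# which is why training appears to hang at the end of the epoch.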

wenlinyao commented on July 22, 2024

@rwightman I see, thanks for your answers!

rwightman commented on July 22, 2024

@wenlinyao closing this; if you have any additional concerns or issues re parallel training, let us know. DataParallel support was removed to avoid confusion, and the README now includes command-line examples for using torchrun, SLURM, etc. w/ DistributedDataParallel.

wenlinyao commented on July 22, 2024

Thanks for the update!
