Comments (9)
FYI @mitchellnw, in the code from the merge I left some DP code around, but I didn't test it and was planning to remove it entirely. I don't see the point in maintaining DP as its performance is worse than DDP across the board, even on a single node...
Thanks for replying!
Is it possible to provide full instructions on how to install all the required environments and how to run DDP (both single-node multi-GPU and multi-node multi-GPU)? A few example commands would also be very helpful! @rwightman
@wenlinyao on current versions of PyTorch (1.10/1.11) it is really easy with torchrun. On older versions you can use the distributed launcher: python3 -m torch.distributed.launch --nproc_per_node=$NUM_PROC script.py args
An example command line from some recent CC12M training I was doing (4x V100 32GB GPUs), launched from the src/ folder...
torchrun --nproc_per_node 4 -m training.main --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' --dataset-type webdataset --batch-size 320 --precision amp --workers 4 --imagenet-val /raid/imagenet/validation/
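For the multi-node multi-GPU case, torchrun just needs the rendezvous flags in addition. A minimal sketch (the master address and training args are placeholders to substitute; run the same command once per node, incrementing --node_rank):
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 --master_addr=<MASTER_IP> --master_port=29500 -m training.main <training args>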
To be clear, no extra components are needed aside from torch, webdataset, and the other dependencies open_clip already has. torchrun should be on the path for a standard conda install of PyTorch (and likely pip too, but I haven't tested).
@rwightman After adding parser.add_argument("--local_rank", type=int, default=0) to params.py, "torch.distributed.launch" is running now. ;)
But I just noticed another problem: the run hangs forever at the end of training (it looks like a synchronization problem?).
I further noticed that this occurs when I pass "data_1/000{00..08}.tar" (9 files) as the training data, but not when I pass "data_1/000{00..07}.tar" (8 files). I suspect it is due to how the dataloader distributes shards across processes.
If "{0000..2175}.tar" works in your case, maybe this problem will disappear once I pass enough .tar files?
Thanks!
@wenlinyao using local_rank and torch.distributed.launch is the 'old' way, but it works and I still use it for timm defaults.
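For reference, this is roughly what the old-style launcher expects from the training script; a minimal generic PyTorch sketch, not open_clip's actual params.py (names and structure are illustrative only):
import argparse
import torch

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to every process it spawns
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # pin one GPU per process
torch.distributed.init_process_group(backend="nccl", init_method="env://")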
Yes, you need more shards. Shards are distributed across all of the nodes (each process is randomly assigned a 1/N subset of the dataset shards each epoch). I would say at a bare minimum you probably want at least 2 shards per process (GPU) in a distributed setup... but ideally you shard the dataset into many more. The toy sketch below shows why an uneven split hangs.
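A toy illustration of round-robin shard assignment (similar in spirit to webdataset's split_by_node; the 4-process world size is just an example, not necessarily your setup):
shards = [f"data_1/{i:05d}.tar" for i in range(9)]  # the failing 9-shard case
world_size = 4                                      # e.g. 4 GPU processes
for rank in range(world_size):
    mine = shards[rank::world_size]                 # round-robin assignment
    print(f"rank {rank}: {len(mine)} shard(s)")
One rank ends up with 3 shards while the others get 2, so it runs extra steps and its next collective op waits on ranks that have already finished, which is exactly the hang you see at the end of training.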
@rwightman I see, thanks for your answers!
@wenlinyao closing this; if you have any additional concerns or issues re parallel training, let us know. DataParallel support was removed to avoid confusion, and the README now includes command-line examples for using torchrun, SLURM, etc. with DistributedDataParallel.
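For SLURM, the shape of it is something like the following sketch (placeholders throughout; see the README for the actual tested script):
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
srun python -m training.main --train-data '<shards>' --dataset-type webdataset --batch-size 320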
Thanks for the update!