mlfoundations / open_clip
An open source implementation of CLIP.
License: Other
I noticed that the CLIP validation loss curve begins to slope upwards about halfway through training on Conceptual Captions (~ epoch 15) from the figure here, but validation recall continues to increase until the end of training (epoch 30).
Does this mean that, in contrastive training, early stopping should be based on validation recall rather than validation loss, since the two are not necessarily tied to one another as they are in standard supervised learning?
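If checkpoints were selected by validation recall rather than validation loss, the selection logic could look roughly like the sketch below. The helper name `best_epoch` and the numbers in `history` are made up for illustration; they only mimic the shape of the curves described above (loss bottoming out around epoch 15, recall still rising at epoch 30).

```python
def best_epoch(history, key, higher_is_better):
    """Return the epoch whose metric `key` is best."""
    pick = max if higher_is_better else min
    return pick(history, key=lambda row: row[key])["epoch"]


# Illustrative values only, not measurements from any real run.
history = [
    {"epoch": 5, "val_loss": 2.10, "recall_at_1": 0.10},
    {"epoch": 15, "val_loss": 1.80, "recall_at_1": 0.16},
    {"epoch": 30, "val_loss": 2.00, "recall_at_1": 0.21},
]

by_loss = best_epoch(history, "val_loss", higher_is_better=False)
by_recall = best_epoch(history, "recall_at_1", higher_is_better=True)
print("selected by loss:", by_loss, "| selected by recall:", by_recall)
```

With curves like these, the two criteria pick different checkpoints, which is the situation the question is about.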
Google cloud VM
Debian10
16-core CPU, 60 GB of RAM
4 NVIDIA T4 GPUs
Traceback (most recent call last):
File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/jupyter/open_clip/src/training/main.py", line 192, in main_worker
evaluate(model, data, 0, args, writer, 0)
File "/home/jupyter/open_clip/src/training/train.py", line 197, in evaluate
torch.cat(all_image_features), torch.cat(all_text_features)
File "/home/jupyter/open_clip/src/training/train.py", line 228, in get_metrics
logits_per_image = image_features @ text_features.t()
RuntimeError: CUDA out of memory. Tried to allocate 2269.88 GiB (GPU 0; 14.76 GiB total capacity; 7.11 GiB already allocated; 6.67 GiB free; 7.17 GiB reserved in total by PyTorch)
python -u src/training/main.py \
--save-frequency 1 \
--zeroshot-frequency 3 \
--train-data "src/df_openclip_train.csv" \
--val-data "src/df_openclip_val.csv" \
--openai-pretrained \
--csv-separator "," \
--csv-img-key image_path \
--csv-caption-key product_name \
--warmup 10000 \
--batch-size=128 \
--lr=1e-3 \
--wd=0.1 \
--epochs=30 \
--workers=4 \
--model ViT-B/32
2021-09-01,05:15:43 | INFO | Rank 0 | Params:
2021-09-01,05:15:43 | INFO | Rank 0 | C: 3.16
2021-09-01,05:15:43 | INFO | Rank 0 | aggregate: True
2021-09-01,05:15:43 | INFO | Rank 0 | batch_size: 128
2021-09-01,05:15:43 | INFO | Rank 0 | beta1: 0.9
2021-09-01,05:15:43 | INFO | Rank 0 | beta2: 0.98
2021-09-01,05:15:43 | INFO | Rank 0 | checkpoint_path: ./logs/lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41/checkpoints
2021-09-01,05:15:43 | INFO | Rank 0 | copy_codebase: False
2021-09-01,05:15:43 | INFO | Rank 0 | csv_caption_key: product_name
2021-09-01,05:15:43 | INFO | Rank 0 | csv_img_key: image_path
2021-09-01,05:15:43 | INFO | Rank 0 | csv_separator: ,
2021-09-01,05:15:43 | INFO | Rank 0 | dataset_type: auto
2021-09-01,05:15:43 | INFO | Rank 0 | debug: False
2021-09-01,05:15:43 | INFO | Rank 0 | dist_backend: nccl
2021-09-01,05:15:43 | INFO | Rank 0 | dist_url: tcp://127.0.0.1:6100
2021-09-01,05:15:43 | INFO | Rank 0 | distributed: True
2021-09-01,05:15:43 | INFO | Rank 0 | dp: False
2021-09-01,05:15:43 | INFO | Rank 0 | epochs: 30
2021-09-01,05:15:43 | INFO | Rank 0 | eps: 1e-06
2021-09-01,05:15:43 | INFO | Rank 0 | gpu: 0
2021-09-01,05:15:43 | INFO | Rank 0 | imagenet_v2: None
2021-09-01,05:15:43 | INFO | Rank 0 | imagenet_val: None
2021-09-01,05:15:43 | INFO | Rank 0 | log_level: 20
2021-09-01,05:15:43 | INFO | Rank 0 | log_path: ./logs/lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41/out.log
2021-09-01,05:15:43 | INFO | Rank 0 | logs: ./logs/
2021-09-01,05:15:43 | INFO | Rank 0 | lr: 0.001
2021-09-01,05:15:43 | INFO | Rank 0 | model: ViT-B/32
2021-09-01,05:15:43 | INFO | Rank 0 | multigpu: None
2021-09-01,05:15:43 | INFO | Rank 0 | name: lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41
2021-09-01,05:15:43 | INFO | Rank 0 | ngpus_per_node: 4
2021-09-01,05:15:43 | INFO | Rank 0 | openai_pretrained: True
2021-09-01,05:15:43 | INFO | Rank 0 | precision: amp
2021-09-01,05:15:43 | INFO | Rank 0 | rank: 0
2021-09-01,05:15:43 | INFO | Rank 0 | regression_frequency: 2
2021-09-01,05:15:43 | INFO | Rank 0 | report_to:
2021-09-01,05:15:43 | INFO | Rank 0 | resume: None
2021-09-01,05:15:43 | INFO | Rank 0 | save_frequency: 1
2021-09-01,05:15:43 | INFO | Rank 0 | skip_aggregate: False
2021-09-01,05:15:43 | INFO | Rank 0 | skip_scheduler: False
2021-09-01,05:15:43 | INFO | Rank 0 | tensorboard: False
2021-09-01,05:15:43 | INFO | Rank 0 | tensorboard_path:
2021-09-01,05:15:43 | INFO | Rank 0 | train_data: src/df_openclip_train.csv
2021-09-01,05:15:43 | INFO | Rank 0 | use_bn_sync: False
2021-09-01,05:15:43 | INFO | Rank 0 | val_data: src/df_openclip_val.csv
2021-09-01,05:15:43 | INFO | Rank 0 | wandb: False
2021-09-01,05:15:43 | INFO | Rank 0 | wandb_notes:
2021-09-01,05:15:43 | INFO | Rank 0 | warmup: 10000
2021-09-01,05:15:43 | INFO | Rank 0 | wd: 0.1
2021-09-01,05:15:43 | INFO | Rank 0 | workers: 4
2021-09-01,05:15:43 | INFO | Rank 0 | world_size: 4
2021-09-01,05:15:43 | INFO | Rank 0 | zeroshot_frequency: 3
2021-09-01,05:15:47 | INFO | Rank 0 | Use GPU: 0 for training
2021-09-01,05:15:47 | INFO | Rank 1 | Use GPU: 1 for training
2021-09-01,05:15:47 | INFO | Rank 2 | Use GPU: 2 for training
2021-09-01,05:15:47 | INFO | Rank 3 | Use GPU: 3 for training
The training data consists of 2.9 million text-image pairs.
The validation data consists of 780k text-image pairs.
The get_metrics function is called on the embeddings of the entire evaluation set at once, which is massive. In my case, it multiplies two 780k x 512 matrices into a 780k x 780k logits matrix, which requires more than 2,000 GiB of GPU memory.
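The size in the error message can be checked with a little arithmetic, and one possible workaround is to compute retrieval ranks one chunk of images at a time so the full N x N matrix never exists. The sketch below uses plain Python lists in place of tensors; `logits_gib` and `chunked_ranks` are hypothetical helpers, not functions from the repository.

```python
def logits_gib(n_images, n_texts, bytes_per_value=4):
    """Memory needed for a dense float32 similarity matrix, in GiB."""
    return n_images * n_texts * bytes_per_value / 2**30


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def chunked_ranks(image_features, text_features, chunk_size):
    """Rank of the paired text for every image (0 = best), chunk by chunk."""
    ranks = []
    for start in range(0, len(image_features), chunk_size):
        chunk = image_features[start:start + chunk_size]
        # Only a (chunk_size x N) block of the similarity matrix exists at a time.
        block = [[dot(img, txt) for txt in text_features] for img in chunk]
        for offset, scores in enumerate(block):
            target = scores[start + offset]
            ranks.append(sum(1 for s in scores if s > target))
    return ranks


print(f"780k x 780k float32 logits: {logits_gib(780_000, 780_000):.0f} GiB")

# Tiny toy features: image i is paired with text i.
images = [[1, 0], [0, 1], [1, 1]]
texts = [[1, 0], [1, 1], [0, 1]]
print(chunked_ranks(images, texts, chunk_size=2))
```

The chunk size trades memory for the number of passes; the result does not depend on it.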
Hello, I get the error "TypeError: __init__() takes 4 positional arguments but 11 were given" when calling the build_model function.
if args.openai_pretrained:
    model, preprocess_train, preprocess_val = load(
        args.model,
        device=args.device,
        jit=False,
        is_train=True)
if args.precision == "amp" or args.precision == "fp32":
    model = model.float()

def build_model(state_dict: dict):
    vit = "visual.proj" in state_dict
    if vit:
        vision_width = state_dict["visual.conv1.weight"].shape[0]
        vision_layers = len(
            [k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
        vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
        grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
        image_size = vision_patch_size * grid_size
    else:
        counts: list = [
            len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in [1, 2, 3, 4]]
        vision_layers = tuple(counts)
        vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
        output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
        vision_patch_size = None
        assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
        image_size = output_width * 32
    embed_dim = state_dict["text_projection"].shape[1]
    context_length = state_dict["positional_embedding"].shape[0]
    vocab_size = state_dict["token_embedding.weight"].shape[0]
    transformer_width = state_dict["ln_final.weight"].shape[0]
    transformer_heads = transformer_width // 64
    transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith(f"transformer.resblocks")))
    model = CLIP(
        embed_dim,
        image_size, vision_layers, vision_width, vision_patch_size,
        context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
    )
    for key in ["input_resolution", "context_length", "vocab_size"]:
        if key in state_dict:
            del state_dict[key]
    convert_weights_to_fp16(model)
    model.load_state_dict(state_dict)
    return model.eval()
The error is caused by the following code:
model = CLIP(
    embed_dim,
    image_size, vision_layers, vision_width, vision_patch_size,
    context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
)
I see that you implement the CLIP model's __init__ function with only four arguments. Did I get the wrong version?
class CLIP(nn.Module):
    def __init__(
            self,
            embed_dim: int,
            vision_cfg: CLIPVisionCfg,
            text_cfg: CLIPTextCfg,
    ):
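The counts in the error message line up with the two signatures: the old call passes ten values plus `self` (11), while the new constructor accepts `self` plus three (4). A minimal sketch of how the ten positional values might be regrouped into two config objects follows. `VisionCfgSketch`, `TextCfgSketch` and `pack_clip_args` are hypothetical stand-ins, and their field names are assumptions taken from the variable names in build_model; check them against the `CLIPVisionCfg` and `CLIPTextCfg` definitions in the installed version.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class VisionCfgSketch:  # stand-in for CLIPVisionCfg (field names assumed)
    image_size: int
    layers: Union[int, Tuple[int, int, int, int]]
    width: int
    patch_size: Optional[int]


@dataclass
class TextCfgSketch:  # stand-in for CLIPTextCfg (field names assumed)
    context_length: int
    vocab_size: int
    width: int
    heads: int
    layers: int


def pack_clip_args(embed_dim, image_size, vision_layers, vision_width,
                   vision_patch_size, context_length, vocab_size,
                   transformer_width, transformer_heads, transformer_layers):
    """Group the ten old positional values into (embed_dim, vision_cfg, text_cfg)."""
    vision_cfg = VisionCfgSketch(image_size, vision_layers, vision_width, vision_patch_size)
    text_cfg = TextCfgSketch(context_length, vocab_size, transformer_width,
                             transformer_heads, transformer_layers)
    return embed_dim, vision_cfg, text_cfg


# Example values in the shape build_model would derive for a ViT-B/32 state dict.
packed = pack_clip_args(512, 224, 12, 768, 32, 77, 49408, 512, 8, 12)
print(packed)
```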
Would be nice if this repo supported training on TPUs.
In the current repository, you can evaluate a pretrained model by running
python src/training/main.py \
--val-data="/path/to/validation_data.csv" \
--resume /path/to/checkpoints/epoch_K.pt
However, if you try to do the same thing and just try to get the imagenet-val (or imagenet-v2) accuracy
python src/training/main.py \
--imagenet-val="/path/to/imagenet/val" \
--resume /path/to/checkpoints/epoch_K.pt
then it crashes:
Traceback (most recent call last):
File "src/training/main.py", line 307, in <module>
main()
File "src/training/main.py", line 296, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, log_queue, args))
File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/ncarlini/open_clip/src/training/main.py", line 189, in main_worker
evaluate(model, data, start_epoch, args, writer, 0)
File "/home/ncarlini/open_clip/src/training/train.py", line 159, in evaluate
dataloader = data['val'].dataloader
KeyError: 'val'
It should be possible to get ImageNet accuracy without using a validation dataset.
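One way the evaluation step could tolerate a missing validation set is to check which entries are present before using them. This is only a sketch: the dictionary keys (`"val"`, `"imagenet-val"`, `"imagenet-v2"`) are assumed from the flags and traceback above, and `run_zero_shot` / `run_retrieval` are hypothetical callables standing in for the real evaluation routines.

```python
def evaluate(data, run_zero_shot, run_retrieval):
    """Run whichever evaluations the available datasets allow."""
    metrics = {}
    if "imagenet-val" in data or "imagenet-v2" in data:
        metrics.update(run_zero_shot(data))
    if "val" in data:  # guard instead of indexing data['val'] unconditionally
        metrics.update(run_retrieval(data["val"]))
    return metrics


# Only an ImageNet validation set is supplied, as in the failing command.
only_imagenet = {"imagenet-val": object()}
result = evaluate(
    only_imagenet,
    run_zero_shot=lambda data: {"zero_shot_ran": True},
    run_retrieval=lambda val: {"retrieval_ran": True},
)
print(result)
```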
Hello,
In your training of CLIP, did you use only the description column as text input, or both the title and description columns?
The reason I am asking is because in the github folder where OpenAI provide info on their YFCC100M subset, there is a sentence that I find quite ambiguous:
[...] which have been filtered to only keep those with natural language titles and/or descriptions in English
This seems to imply that it was enough for only one of title and description to be considered natural language for an observation (image) to be kept as part of the subset. However, they do not clarify whether they also used the results of this natural language filter to choose between using only the title or only the description when one of them was not deemed to be natural language. Alternatively, they may have concatenated the columns and used both of them in training.
Anyway, what I'm interested in knowing here is what you guys decided to do in your training. Did you use both columns or just the description?
Also, did you clean the text in any manner (e.g. remove html tags present in the text)?
If you give bash a command like "foo{0..5} bar{1..6}", it will expand each of the brace expansions separately, giving you a list of length 12. Braceexpand, however, will take the cross product here and give a list of length 36. This isn't necessarily wrong in general, but given how braceexpand is used in this project I think it's not what's expected.
In particular, if you provide --train-data="/dir1/files{1..10} /dir2/files{1..10}" it ends up trying to include 100 (!!) files. It would probably make more sense to do the bash-like expansion here.
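The difference can be reproduced with a small sketch. `expand_ranges` below is a hypothetical stand-in for a brace expander (it handles only ascending numeric `{a..b}` ranges), and `expand_like_bash` shows the bash-like behaviour: split on whitespace first, then expand each pattern on its own.

```python
import re
from itertools import product

_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_ranges(pattern):
    """Expand every {a..b} in one string, taking the cross product of all ranges."""
    parts = _RANGE.split(pattern)
    # With two capture groups, split yields: text, a, b, text, a, b, ..., text
    literals = parts[0::3]
    ranges = [range(int(a), int(b) + 1) for a, b in zip(parts[1::3], parts[2::3])]
    results = []
    for combo in product(*ranges):
        pieces = [literals[0]]
        for number, literal in zip(combo, literals[1:]):
            pieces.append(str(number))
            pieces.append(literal)
        results.append("".join(pieces))
    return results


def expand_like_bash(spec):
    """Split on whitespace first, then expand each pattern separately."""
    urls = []
    for pattern in spec.split():
        urls.extend(expand_ranges(pattern))
    return urls


spec = "/dir1/files{1..10} /dir2/files{1..10}"
print(len(expand_ranges(spec)))     # whole string at once: cross product
print(len(expand_like_bash(spec)))  # per-pattern expansion
```

Expanding the whole string at once yields 100 entries for the `--train-data` value above, while the per-pattern version yields 20.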
Hi !
Just in case you missed it, there is a new 5.85B dataset from LAION.
Do you have any plan to fit a model on it?
Best.
so - I've been looking into some code for VQGAN
https://github.com/mehdidc/feed_forward_vqgan_clip
https://github.com/nerdyrodent/VQGAN-CLIP
and they let the user pass a prompt to style / generate an image.
Here's some example code from @nerdyrodent
nerdyrodent/VQGAN-CLIP#13
Must see -
https://twitter.com/e08477/status/1418440857578098691?s=21
Here there are only 4 images generated with a prompt,
e.g. mushroom, spaceship, volcano, old English house on a hill (might be wrong).
But then as you look down, these have predicate prompts that style / shape the image differently.
Mushroom + marble sculpture.
What I want is to give an image to CLIP and have it tell me what it thinks the words should be.
Is this feasible / achievable? Does this repo provide any way into this? Does it need dimensionality reduction? Is it like a t-SNE problem (showing word2vec in 2 dimensions), except that under the hood it's 512 dimensions? I have yet to look at the code, so maybe it will become clearer.
Hi all,
thank you very much for providing this repository!
Python reports an unreferenced variable in the following code snippet (from train.py, lines 226-228):
def get_metrics(image_features, text_features):
    metrics = {}
    logits_per_image = (logit_scale * image_features @ text_features.t()).detach().cpu()
And even my IDE (PyCharm) complains about a missing reference.
Am I missing something?
My training parameters are as follows:
Loading model from /home/thetaphipsi/MasterAI/src/open_clip/src/training/model_configs/RN50.json
2021-11-14,15:34:02 | INFO | Rank 0 | Params:
2021-11-14,15:34:02 | INFO | Rank 0 | C: 3.16
2021-11-14,15:34:02 | INFO | Rank 0 | aggregate: True
2021-11-14,15:34:02 | INFO | Rank 0 | batch_size: 32
2021-11-14,15:34:02 | INFO | Rank 0 | beta1: 0.9
2021-11-14,15:34:02 | INFO | Rank 0 | beta2: 0.999
2021-11-14,15:34:02 | INFO | Rank 0 | checkpoint_path: ./logs/lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01/checkpoints
2021-11-14,15:34:02 | INFO | Rank 0 | copy_codebase: False
2021-11-14,15:34:02 | INFO | Rank 0 | csv_caption_key: title
2021-11-14,15:34:02 | INFO | Rank 0 | csv_img_key: filepath
2021-11-14,15:34:02 | INFO | Rank 0 | csv_separator:
2021-11-14,15:34:02 | INFO | Rank 0 | dataset_type: auto
2021-11-14,15:34:02 | INFO | Rank 0 | debug: False
2021-11-14,15:34:02 | INFO | Rank 0 | dist_backend: nccl
2021-11-14,15:34:02 | INFO | Rank 0 | dist_url: tcp://127.0.0.1:6100
2021-11-14,15:34:02 | INFO | Rank 0 | distributed: True
2021-11-14,15:34:02 | INFO | Rank 0 | dp: False
2021-11-14,15:34:02 | INFO | Rank 0 | epochs: 30
2021-11-14,15:34:02 | INFO | Rank 0 | eps: 1e-08
2021-11-14,15:34:02 | INFO | Rank 0 | gpu: 0
2021-11-14,15:34:02 | INFO | Rank 0 | imagenet_v2: None
2021-11-14,15:34:02 | INFO | Rank 0 | imagenet_val: None
2021-11-14,15:34:02 | INFO | Rank 0 | log_level: 20
2021-11-14,15:34:02 | INFO | Rank 0 | log_path: ./logs/lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01/out.log
2021-11-14,15:34:02 | INFO | Rank 0 | logs: ./logs/
2021-11-14,15:34:02 | INFO | Rank 0 | lr: 0.001
2021-11-14,15:34:02 | INFO | Rank 0 | model: RN50
2021-11-14,15:34:02 | INFO | Rank 0 | multigpu: None
2021-11-14,15:34:02 | INFO | Rank 0 | name: lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01
2021-11-14,15:34:02 | INFO | Rank 0 | ngpus_per_node: 1
2021-11-14,15:34:02 | INFO | Rank 0 | openai_pretrained: False
2021-11-14,15:34:02 | INFO | Rank 0 | precision: amp
2021-11-14,15:34:02 | INFO | Rank 0 | rank: 0
2021-11-14,15:34:02 | INFO | Rank 0 | regression_frequency: 2
2021-11-14,15:34:02 | INFO | Rank 0 | report_to: tensorboard
2021-11-14,15:34:02 | INFO | Rank 0 | resume: None
2021-11-14,15:34:02 | INFO | Rank 0 | save_frequency: 1
2021-11-14,15:34:02 | INFO | Rank 0 | save_most_recent: False
2021-11-14,15:34:02 | INFO | Rank 0 | skip_aggregate: False
2021-11-14,15:34:02 | INFO | Rank 0 | skip_scheduler: False
2021-11-14,15:34:02 | INFO | Rank 0 | tensorboard: True
2021-11-14,15:34:02 | INFO | Rank 0 | tensorboard_path: ./logs/lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01/tensorboard
2021-11-14,15:34:02 | INFO | Rank 0 | train_data: ./data/Train_GCC-training_output.csv
2021-11-14,15:34:02 | INFO | Rank 0 | use_bn_sync: False
2021-11-14,15:34:02 | INFO | Rank 0 | val_data: ./data/Validation_GCC-1.1.0-Validation_output.csv
2021-11-14,15:34:02 | INFO | Rank 0 | wandb: False
2021-11-14,15:34:02 | INFO | Rank 0 | wandb_notes:
2021-11-14,15:34:02 | INFO | Rank 0 | warmup: 40000
2021-11-14,15:34:02 | INFO | Rank 0 | wd: 0.1
2021-11-14,15:34:02 | INFO | Rank 0 | workers: 1
2021-11-14,15:34:02 | INFO | Rank 0 | world_size: 1
2021-11-14,15:34:02 | INFO | Rank 0 | zeroshot_frequency: 1
2021-11-14,15:34:02 | INFO | Rank 0 | Added key: store_based_barrier_key:1 to store for rank: 0
2021-11-14,15:34:02 | INFO | Rank 0 | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2021-11-14,15:34:02 | INFO | Rank 0 | Use GPU: 0 for training
I fixed it temporarily by adding a logit_scale param to get_metrics().
Hi again,
would it make sense to prepend ROOT to the filepath in the CSV file? After running gather_cc.py the files are in the folder cc_data (e.g. cc_data/val/00/0123.jpg), but the path in the CSV file is only val/00/0123.jpg.
BR Andreas
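Until the script does this itself, the paths can be rewritten after the fact. The sketch below is one way to do it under a few assumptions: the column is named `filepath` and the file is tab-separated (both inferred from the logged parameters earlier on this page), and `with_root` is a hypothetical helper, not part of gather_cc.py.

```python
import csv
import io
import os


def with_root(csv_text, root, path_key="filepath", delimiter="\t"):
    """Return the CSV rows with `root` prepended to every image path."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    rows = []
    for row in reader:
        row[path_key] = os.path.join(root, row[path_key])
        rows.append(row)
    return rows


# One made-up row in the layout described above.
sample_csv = "title\tfilepath\na caption\tval/00/0123.jpg\n"
rows = with_root(sample_csv, root="cc_data")
print(rows[0]["filepath"])
```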
Thanks for preparing this repo.
I was wondering how self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07)) was decided. I mean, where does the value np.log(1 / 0.07) come from?
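For context, the CLIP paper describes a learnable temperature initialised to the equivalent of 0.07, and the parameter here appears to store the log of the inverse temperature. A quick numeric check of what that initial value means, assuming the parameter is exponentiated before it multiplies the cosine similarities:

```python
import math

temperature = 0.07                            # value appearing in the initialiser
logit_scale_init = math.log(1 / temperature)  # what the parameter stores
scale = math.exp(logit_scale_init)            # multiplier applied to similarities

print(f"stored value: {logit_scale_init:.4f}")
print(f"effective multiplier: {scale:.2f}")
```

Storing the log keeps the effective multiplier positive however the parameter moves during training.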
I only changed lines 186-193 because we need an audio-input version of the function get_wds_dataset. It always gets stuck when I set num_worker=2. https://github.com/mlfoundations/open_clip/blob/main/src/training/data.py#L150. Could you please take a look at this? @rwightman. Thank you! My modified version is as follows:
def preprocess(
    sample,
    audio_ext,
    samplerate,
    mono,
    max_len,
    dtype,
    res_type,
):
    for key, value in sample.items():
        if key == audio_ext:
            audio_data, orig_sr = sf.read(io.BytesIO(value))
            if samplerate is not None:
                audio_data = librosa.resample(
                    audio_data, orig_sr=orig_sr, target_sr=samplerate, res_type=res_type
                )
            if len(audio_data) > max_len: # random clip if too long
                overflow = len(audio_data) - max_len
                idx = np.random.randint(0, overflow + 1)
                if np.random.rand() > 0.5:
                    audio_data = audio_data[idx : idx + max_len]
                else:
                    audio_data = audio_data[
                        len(audio_data) + 1 - idx - max_len : len(audio_data) + 1 - idx
                    ]
            else: # padding if too short
                audio_data = np.pad(
                    audio_data,
                    (0, max_len - len(audio_data)),
                    mode="constant",
                    constant_values=0,
                )
            if mono: # convert to mono
                audio_data = librosa.to_mono(audio_data)
            # sample["data"] = (audio_data, sample[text_ext], sample["__key__"])
            sample[audio_ext] = audio_data
    return sample
# def get_wds_dataset(args, preprocess_img, is_train):
def get_wds_dataset(
    args,
    is_train,
    file_path_type="local",
    audio_ext="flac",
    text_ext="json",
    samplerate=32000,
    mono=True,
    max_len=1000000,
    dtype="float64",
    res_type="kaiser_best",
):
    input_shards = args.train_data if is_train else args.val_data
    assert input_shards is not None
    num_samples, num_shards = get_dataset_size(input_shards)
    if not num_samples:
        if is_train:
            num_samples = args.train_num_samples
            if not num_samples:
                raise RuntimeError(
                    'Currently, number of dataset samples must be specified for training dataset. '
                    'Please specify via `--train-num-samples` if no dataset length info present.')
        else:
            num_samples = args.val_num_samples or 0 # eval will just exhaust the iterator if not specified
    pipeline = [wds.SimpleShardList(input_shards)]
    # at this point we have an iterator over all the shards
    if is_train:
        pipeline.extend([
            wds.detshuffle(bufsize=_SHARD_SHUFFLE_SIZE, initial=_SHARD_SHUFFLE_INITIAL, seed=args.seed),
            wds.split_by_node,
            wds.split_by_worker,
            # at this point, we have an iterator over the shards assigned to each worker at each node
            wds.tarfile_to_samples(handler=log_and_continue),
            wds.shuffle(
                bufsize=_SAMPLE_SHUFFLE_SIZE,
                initial=_SAMPLE_SHUFFLE_INITIAL,
                rng=random.Random(args.seed)),
            #wds.repeatedly, # FIXME determine if this is beneficial
        ])
    else:
        pipeline.extend([
            wds.split_by_worker,
            # at this point, we have an iterator over the shards assigned to each worker
            wds.tarfile_to_samples(handler=log_and_continue),
        ])
    pipeline.extend([
        wds.map(
            partial(
                preprocess,
                audio_ext=audio_ext,
                samplerate=samplerate,
                mono=mono,
                max_len=max_len,
                dtype=dtype,
                res_type=res_type,
            )
        ),
        wds.to_tuple("flac", "json"),
        wds.batched(args.batch_size, partial=not is_train),
    ])
    dataset = wds.DataPipeline(*pipeline)
    if is_train:
        # roll over and repeat a few samples to get same number of full batches on each node
        global_batch_size = args.batch_size * args.world_size
        num_batches = math.ceil(num_samples / global_batch_size)
        num_workers = max(1, args.workers)
        num_worker_batches = math.ceil(num_batches / num_workers) # per dataloader worker
        num_batches = num_worker_batches * num_workers
        num_samples = num_batches * global_batch_size
        dataset = dataset.with_epoch(num_worker_batches) # each worker is iterating over this
    else:
        # last batches are partial, eval is done on single (master) node
        num_batches = math.ceil(num_samples / args.batch_size)
    dataloader = wds.WebLoader(dataset, batch_size=None, shuffle=False, num_workers=args.workers)
    # FIXME not clear which approach is better, with_epoch before vs after dataloader?
    # hoping to resolve via https://github.com/webdataset/webdataset/issues/169
    # if is_train:
    #     # roll over and repeat a few samples to get same number of full batches on each node
    #     global_batch_size = args.batch_size * args.world_size
    #     num_batches = math.ceil(num_samples / global_batch_size)
    #     num_workers = max(1, args.workers)
    #     num_batches = math.ceil(num_batches / num_workers) * num_workers
    #     num_samples = num_batches * global_batch_size
    #     dataloader = dataloader.with_epoch(num_batches)
    # else:
    #     # last batches are partial, eval is done on single (master) node
    #     num_batches = math.ceil(num_samples / args.batch_size)
    # add meta-data to dataloader instance for convenience
    dataloader.num_batches = num_batches
    dataloader.num_samples = num_samples
    return DataInfo(dataloader, None)
Hi, after updating to your most recent code, I got an error when training on a single machine (8 GPUs) in DataParallel mode. I simply set the flag args.dp = True and got the following error message:
miniconda3/envs/env37_amp/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:64: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
2022-03-18,06:20:51 | INFO | Start epoch 0
Traceback (most recent call last):
File "CLIP_model/training/main.py", line 304, in
main()
File "CLIP_model/training/main.py", line 243, in main
train_one_epoch(model, data, epoch, optimizer, scaler, scheduler, args, writer)
File "CLIP_model/training/train.py", line 149, in train_one_epoch
total_loss = loss(image_features, text_features, logit_scale)
File "miniconda3/envs/env37_amp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "CLIP_model/training/train.py", line 97, in forward
logits_per_image = logit_scale * image_features @ text_features.T
RuntimeError: The size of tensor a (8) must match the size of tensor b (1024) at non-singleton dimension 1
The code works fine with args.dp = False when training on a single GPU.
Thanks!
I'm using CLIP to train on my custom dataset with the following params:
Dataset size : 50k image-text pairs
Batch size : 128
Image Size : 224
Gpus : 1
Epochs : 500
It's been running for a while now. I'm on my 15th epoch and the loss hasn't changed at all. It isn't a constant number, but it's constantly at 4.8xxx. Should I be concerned? I'm not sure why this is happening.
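A useful reference point here is the loss of a model that has learned nothing. With a batch of 128 pairs on one GPU, each image is scored against 128 candidate texts, and assigning them all equal probability gives a cross-entropy of ln(128). The sketch below just computes that baseline; `chance_level_loss` is a hypothetical helper name.

```python
import math


def chance_level_loss(batch_size):
    """Cross-entropy when every candidate in the batch is equally likely."""
    return math.log(batch_size)


baseline = chance_level_loss(128)
print(f"chance-level contrastive loss for batch size 128: {baseline:.3f}")
```

A loss pinned near this value is consistent with the model not yet separating matching from non-matching pairs, though it does not by itself say why (learning rate, warmup and data loading are the usual things to check).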
Very nice code!
I'm able to reproduce the zero-shot results on imagenet using cc3m (2,862,387 images in total for me) and the provided sample code.
I'd like to ask if you have tried learning rates other than 1e-3 for batch=128? Would you be able to give more insight into how you ended up using lr=1e-3?
Also, I'd like to know if you have tried more training epochs, i.e. larger than 30. I'm curious if training with more epochs would help improve the zero-shot accuracy.
Hi, I am attempting to use open_clip on remote sensing imagery, specifically xview images. I'm finding that in the first 2-3 epochs the loss decreases from 3.5 to 2.7 and then stays around 2.7 with an lr of 8e-6 (see training below). Would anyone have ideas on how I can get the training to keep improving?
Some background on xview:
My images are derived from xview which is an object detection dataset with images like this:
To generate captions for xview, I make a single caption for each bounding box in an image. Hence the same image may have several different captions. Each caption is valid, as there may be multiple objects in the image.
I am getting a ModuleNotFoundError for training when running src/training/main.py. It points to line 19 in main.py, the import function.
Edit: Fixed it. I forgot to set PYTHONPATH.
Hi,
Thanks for the tremendous effort!
Is it possible to set up this training code, for fine-tuning CLIP on a custom dataset, on a Windows 10 machine?
How long is a reasonable time for an epoch using 8 workers? I'm seeing about 8 hours/epoch, for the resnet50. Launch command from the README:
--save-frequency 1 \
--zeroshot-frequency 1 \
--report-to tensorboard \
--train-data="/path/to/train_data.csv" \
--val-data="/path/to/validation_data.csv" \
--csv-img-key filepath \
--csv-caption-key title \
--imagenet-val=/path/to/imagenet/root/val/ \
--warmup 10000 \
--batch-size=128 \
--lr=1e-3 \
--wd=0.1 \
--epochs=30 \
--workers=8 \
--model RN50
Thank you!
CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations
https://arxiv.org/abs/2112.07133
Anyone keen to try modifying a training script for above?
Would be nice to provide an example of loading a model and performing inference on a single example.
/usr/local/lib/python3.7/dist-packages/mmc/loaders/mlfcliploader.py in load(self, device)
43 model, _, preprocess_image = open_clip.create_model_and_transforms(
44 model_name=model_name,
---> 45 pretrained=dataset)
46
47 model.requires_grad_(False)
/usr/local/lib/python3.7/dist-packages/open_clip/factory.py in create_model_and_transforms(model_name, pretrained, precision, device, jit, force_quick_gelu, pretrained_image)
134 model_name, pretrained, precision, device, jit,
135 force_quick_gelu=force_quick_gelu,
--> 136 pretrained_image=pretrained_image)
137 preprocess_train = image_transform(model.visual.image_size, is_train=True)
138 preprocess_val = image_transform(model.visual.image_size, is_train=False)
/usr/local/lib/python3.7/dist-packages/open_clip/factory.py in create_model(model_name, pretrained, precision, device, jit, force_quick_gelu, pretrained_image)
106 if checkpoint_path:
107 logging.info(f'Loading pretrained {model_name} weights ({pretrained}).')
--> 108 model.load_state_dict(load_state_dict(checkpoint_path))
109 else:
110 logging.warning(f'Pretrained weights ({pretrained}) not found for model {model_name}.')
/usr/local/lib/python3.7/dist-packages/open_clip/factory.py in load_state_dict(checkpoint_path, map_location)
48
49 def load_state_dict(checkpoint_path: str, map_location='cpu'):
---> 50 checkpoint = torch.load(checkpoint_path, map_location=map_location)
51 if isinstance(checkpoint, dict) and 'state_dict' in checkpoint:
52 state_dict = checkpoint['state_dict']
/usr/local/lib/python3.7/dist-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
711 return torch.jit.load(opened_file)
712 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713 return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
714
715
/usr/local/lib/python3.7/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
938 typed_storage._storage._set_from_file(
939 f, offset, f_should_read_directly,
--> 940 torch._utils._element_size(typed_storage.dtype))
941 if offset is not None:
942 offset = f.tell()
RuntimeError: unexpected EOF, expected 832488 more bytes. The file might be corrupted.
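An "unexpected EOF" while loading a checkpoint often points to a truncated download. Before retrying, it can help to confirm that the file on disk is complete. The sketch below compares a file against an expected size and/or SHA-256 digest; `sha256_of` and `looks_complete` are hypothetical helpers, and the expected values would have to come from wherever the checkpoint is published.

```python
import hashlib
import os


def sha256_of(path, chunk_bytes=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_bytes), b""):
            digest.update(block)
    return digest.hexdigest()


def looks_complete(path, expected_bytes=None, expected_sha256=None):
    """Return False if the file is missing or fails any check that was supplied."""
    if not os.path.isfile(path):
        return False
    if expected_bytes is not None and os.path.getsize(path) != expected_bytes:
        return False
    if expected_sha256 is not None and sha256_of(path) != expected_sha256:
        return False
    return True
```

If the check fails, deleting the cached file and downloading it again is usually the next step.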
Hi, I encountered this error during training and I'm not sure what it means:
2022-02-09,21:22:00 | INFO | Rank 0 | Train Epoch: 9 [28800/43670 (66%)] Loss: 0.493029 Data (t) 0.000 Batch (t) 0.235 LR: 0.000020 logit_scale 2.821
2022-02-09,21:22:24 | INFO | Rank 0 | Train Epoch: 9 [32000/43670 (73%)] Loss: 0.642597 Data (t) 0.008 Batch (t) 0.274 LR: 0.000012 logit_scale 2.822
2022-02-09,21:22:48 | INFO | Rank 0 | Train Epoch: 9 [35200/43670 (81%)] Loss: 0.442177 Data (t) 0.002 Batch (t) 0.243 LR: 0.000006 logit_scale 2.822
2022-02-09,21:23:13 | INFO | Rank 0 | Train Epoch: 9 [38400/43670 (88%)] Loss: 0.435208 Data (t) 0.000 Batch (t) 0.255 LR: 0.000003 logit_scale 2.823
2022-02-09,21:23:37 | INFO | Rank 0 | Train Epoch: 9 [41600/43670 (95%)] Loss: 0.295687 Data (t) 0.000 Batch (t) 0.240 LR: 0.000000 logit_scale 2.823
2022-02-09,21:24:36 | INFO | Rank 0 | Eval Epoch: 10 image_to_text_mean_rank: 40.2243 image_to_text_median_rank: 22.0000 image_to_text_R@1: 0.0628 image_to_text_R@5: 0.2063 image_to_text_R@10: 0.3273 text_to_image_mean_rank: 44.4849 text_to_image_median_rank: 25.0000 text_to_image_R@1: 0.0477 text_to_image_R@5: 0.1817 text_to_image_R@10: 0.2948 val_loss: 0.3798 epoch: 10.0000 num_elements: 6432.0000
Exception in thread Thread-5:
Traceback (most recent call last):
File "C:\Users\nuzuegbunam\Anaconda3\envs\open_clip_3_9\lib\multiprocessing\connection.py", line 317, in _recv_bytes
Does anyone have any idea what this means?
AAAaaaaaaaaahhhhhhhhhhhhhhhhhh!!!!!!!
open_clip/src/training/data.py
Line 59 in 91f6cce
I'm confused about this project. Isn't OpenAI's CLIP already open source?
The repo has an MIT license: https://github.com/openai/CLIP
We are trying to re-implement CLIP ViT-B/32 pre-trained on YFCC15M provided by OpenAI. But our result is lower than RN50 reported by the paper and your repo (still under training, but almost finished, current ImageNet zero-shot accuracy is around 27% - 28%). So we wonder if you have tried to train a ViT-B/32 on YFCC? Do you have the same finding? Thanks.
Hi!
In the "Usage" part of the readme, we use model.encode_image() and model.encode_text() before computing the dot product of the features.
However, those methods, in contrast with what is done during training (open_clip/src/open_clip/model.py, line 429 in 0d1127c), don't normalize the feature vectors.
Therefore it could bias the results. Am I wrong?
Best,
Théo
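A toy example shows how skipping normalization can change which text wins. The vectors and labels below are made up; `l2_normalize` and `dot` are small helper functions written for this sketch, with plain lists standing in for feature tensors.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


image = [1.0, 0.0]
texts = {
    "close direction, small norm": [0.9, 0.1],
    "far direction, large norm": [3.0, 4.0],
}

# Raw dot products reward large norms; cosine similarity only compares direction.
raw = {name: dot(image, vec) for name, vec in texts.items()}
cosine = {name: dot(l2_normalize(image), l2_normalize(vec)) for name, vec in texts.items()}

print("best by raw dot product:", max(raw, key=raw.get))
print("best by cosine similarity:", max(cosine, key=cosine.get))
```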
Is it possible to finetune from the existing Open AI checkpoints rather than train them from scratch with this codebase?
Hello,
I would like to run the different CLIP models on high definition non-square images (e.g. 720p or 1080p).
Is there a simple way to do so without deforming the images into a smaller square resolution (336x336 or 224x224) ?
Thank you for your work on this repository, I found it very helpful,
Simon
I get this strange error when attempting "import open_clip". I have tried reinstalling open clip, as well as various versions of pytorch. In this instance, I am using python 3.7.9 and pytorch 1.9.0.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\open_clip_torch-1.0.1-py3.7.egg\open_clip\__init__.py", line 2, in <module>
    from .loss import ClipLoss
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\open_clip_torch-1.0.1-py3.7.egg\open_clip\loss.py", line 2, in <module>
    import torch.distributed.nn
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\nn\__init__.py", line 1, in <module>
    from .api.remote_module import RemoteModule
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\nn\api\remote_module.py", line 22, in <module>
    from torch.distributed.rpc.internal import _internal_rpc_pickler
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\rpc\internal.py", line 12, in <module>
    from torch._C._distributed_rpc import _get_current_rpc_agent
ModuleNotFoundError: No module named 'torch._C._distributed_rpc'; 'torch._C' is not a package
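The traceback shows the failure originates from `import torch.distributed.nn`. A generic way to tolerate an optional module that is missing from a given build is a guarded import. This is only a sketch of that pattern, with a hypothetical helper name, and not necessarily how the project handles it.

```python
import importlib

def try_import(module_name):
    """Return the imported module, or None if it cannot be imported here."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None

# None when torch is absent or this build lacks the distributed pieces.
dist_nn = try_import("torch.distributed.nn")
has_distributed_nn = dist_nn is not None
print("torch.distributed.nn available:", has_distributed_nn)
```

Code that needs the module can then check has_distributed_nn and fall back to a single-process path when it is False.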
Is there a straightforward way to find all the args supported by the code?
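If the training script builds its options with argparse (an assumption here), running it with `--help` prints every supported flag. The sketch below shows the mechanism on a hypothetical parser; the flag names echo ones used in commands on this page, but the defaults and help strings are invented for illustration.

```python
import argparse

def build_parser():
    """Hypothetical stand-in for a training script's argument parser."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Defaults and help texts below are made up for this example.
    parser.add_argument("--batch-size", type=int, default=64, help="Batch size per GPU.")
    parser.add_argument("--lr", type=float, default=1e-3, help="Learning rate.")
    parser.add_argument("--wd", type=float, default=0.1, help="Weight decay.")
    parser.add_argument("--debug", action="store_true", help="Enable debug logging.")
    return parser

# format_help() returns the same text that `--help` prints.
help_text = build_parser().format_help()
print(help_text)
```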
Hi,
Thanks for this great work! I want to make a multilingual version of CLIP. There is existing work that uses the English CLIP indirectly (https://github.com/FreddeFrallan/Multilingual-CLIP), but do you have suggestions on making this code multilingual?
Thank you!
Hi,
A sincere request.
Since extracting them is very time-consuming, could you kindly provide the extracted Faster R-CNN features for the Conceptual Captions dataset via Drive or Dropbox?
Thanks :)
Thank you in advance for this amazing project :-)
I'm trying to run inference in fp16 mode (like in the original CLIP repo), but I can't get it to work. This is the error message I get:
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
And this is the code I'm trying:
import torch
from PIL import Image
import open_clip
import requests

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32-quickgelu',
    pretrained='laion400m_e32',
    precision="fp16",
    device=torch.device("cuda"))

url = "https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/CLIP.png"
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]
Note that I've also tried the ViT-B-32 model with the openai pretrained weights, and it doesn't work either. Am I doing something wrong?
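The error message says the input is a CPU float32 tensor while the weights are CUDA half-precision, so a likely fix is to move the input to the model's device and dtype before encoding. The helper below is a hypothetical sketch, not part of open_clip; it uses a small stand-in model on CPU, with float64 playing the role of fp16, so that it runs without a GPU.

```python
import torch

def match_model_input(tensor, model):
    """Move a floating-point input to the device and dtype of the model's parameters."""
    param = next(model.parameters())
    return tensor.to(device=param.device, dtype=param.dtype)

# Stand-in model: float64 weights take the place of fp16 weights on a GPU.
model = torch.nn.Linear(4, 2).double()
image = torch.rand(1, 4)  # float32, like the output of a typical preprocess transform

with torch.no_grad():
    output = model(match_model_input(image, model))

print(output.dtype)
```

Applied to the snippet in the post, this would correspond to converting `image` before calling `model.encode_image`. Token tensors hold integer ids, so they should presumably only be moved to the device, not cast to half precision.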
Part of the fun with CLIP is using it in conjunction with VQGAN, which allows prompts to generate images.
Something is lost in this translation, though.
They say a picture is worth a thousand words, but what if some extra data were injected into the training?
It could be, say, textures, or maybe even geometric descriptions or metadata.
A question, not an issue.
Why is the weight decay so large (the default is 0.2 or 0.1)?
It is usually 1e-4.
Hello there,
Not really an issue, but something I am interested in: how did you clean the captions of YFCC?
I did the steps explained in the closed issue, but there are still a lot of captions with URLs, camera names and settings, dates, and so on. Compared to CC (where the captions are really clean) it looks really bad. Still, you get a big jump in performance on ImageNet, so before I start training I would like to know whether you cleaned the data.
If so, I would be very happy if you could provide the code or some snippets :)
Best regards
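For readers with the same problem, here is a hypothetical sketch of rule-based caption cleaning using regular expressions. It is not the authors' procedure (the post does not say whether any cleaning was done); the patterns and the function name are assumptions chosen to match the kinds of noise described above.

```python
import re

# Illustrative patterns only; real data will need more rules.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
CAMERA_FILE_RE = re.compile(
    r"\b(?:IMG|DSCN|DSCF|DSC)_?\d{3,}\b(?:\.\w{3,4})?", re.IGNORECASE
)
WHITESPACE_RE = re.compile(r"\s+")

def clean_caption(caption):
    """Strip HTML tags, URLs, and camera-style file names, then collapse whitespace."""
    text = HTML_TAG_RE.sub(" ", caption)
    text = URL_RE.sub(" ", text)
    text = CAMERA_FILE_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()

print(clean_caption("Sunset over the bay http://example.com/a.jpg IMG_1234"))
```

Captions that come out empty after cleaning could then be dropped, although whether that helps downstream accuracy would need to be checked empirically.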
I've been chatting with some others interested in training CLIP for different domain tasks. They expressed interest in a simple way to use a pre-trained text transformer.
Some basic support for Hugging Face or generic classes of transformers shouldn't be too crazy of an extension to what is already fleshed out.
a. How are the fine-tuning results? Could you provide a set of fine-tuned parameters?
b. For fine-tuning, what suggestions do you have regarding parameter settings or training tricks?
Hi, is there an easy-to-use script for training CLIP on multiple nodes? I can set up training on one node (8 GPUs) now, but I need to test the scaling efficiency. Thanks for any insight~
Hi there,
First of all, thanks for the code, I appreciate your effort!
I think there is a bug in gather_cc.py: in line 86 there is a hardcoded 'val', which should probably be split.
Hi,
Where can we find the details behind your model naming?
Best,
Theo
Hi,
I'm running the src/training/main.py in debug mode, and I'm getting the following message in the terminal:
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'
2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00
What does it mean? I'm using 1 GPU and running with the --debug flag.
Second question: how do I delete an experiment so I can reuse its name?
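The DEBUG lines quoted above look like Pillow's own log output from parsing TIFF/EXIF tags in the image files, which becomes visible once the root logger is set to DEBUG. Assuming that is the source, one way to hide them while keeping other debug output is to raise the level of the "PIL" logger. This is a sketch using only the standard logging module.

```python
import logging

# Stands in for a script that enables debug logging globally.
logging.basicConfig(level=logging.DEBUG)

# Raise only the "PIL" logger; child loggers such as
# "PIL.TiffImagePlugin" inherit this level.
logging.getLogger("PIL").setLevel(logging.INFO)

pil_logger = logging.getLogger("PIL.TiffImagePlugin")
print(pil_logger.isEnabledFor(logging.DEBUG))
```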