mlfoundations / open_clip

An open source implementation of CLIP.

License: Other

Python 99.24% Makefile 0.11% Shell 0.65%

computer-vision contrastive-loss deep-learning language-model multi-modal-learning pretrained-models pytorch zero-shot-classification

open_clip's Issues

Overfitting on validation loss okay?

I noticed that the CLIP validation loss curve begins to slope upwards about halfway through training on Conceptual Captions (~ epoch 15) from the figure here, but validation recall continues to increase until the end of training (epoch 30).

Does this mean that, in contrastive training, early stopping should be based on validation recall rather than validation loss, since the two are not necessarily tied to one another as they are in standard supervised learning?
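
One way to act on this observation is to apply the usual patience rule to the retrieval metric instead of the loss. The following is a minimal sketch, not code from the repository: the helper name is hypothetical and the per-epoch values are made up for illustration rather than read off the figure.

```python
def best_epoch_with_patience(metric_per_epoch, patience=3):
    """Return (stop_epoch, best_epoch) for a metric where higher is better.

    Training stops once the metric has failed to improve for `patience` epochs.
    """
    best_value = float("-inf")
    best_epoch = -1
    for epoch, value in enumerate(metric_per_epoch):
        if value > best_value:
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(metric_per_epoch) - 1, best_epoch


# Illustrative values only: recall keeps rising while the loss turns upward.
val_recall = [0.10, 0.15, 0.19, 0.22, 0.24, 0.25]
val_loss = [3.0, 2.6, 2.5, 2.7, 2.9, 3.1]

stop_on_recall = best_epoch_with_patience(val_recall, patience=2)
# Negate the loss so that "higher is better" also holds for it.
stop_on_loss = best_epoch_with_patience([-x for x in val_loss], patience=2)
```

With these toy numbers, stopping on the loss halts training early and keeps an earlier checkpoint, while stopping on recall trains to the end. Which criterion is appropriate depends on which quantity the downstream task actually cares about.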

Massive GPU memory usage during evaluation

Machine setup

Google Cloud VM
Debian 10
16 CPU cores, 60 GB of RAM
4 NVIDIA T4 GPUs

Error

Traceback (most recent call last):
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/jupyter/open_clip/src/training/main.py", line 192, in main_worker
    evaluate(model, data, 0, args, writer, 0)
  File "/home/jupyter/open_clip/src/training/train.py", line 197, in evaluate
    torch.cat(all_image_features), torch.cat(all_text_features)
  File "/home/jupyter/open_clip/src/training/train.py", line 228, in get_metrics
    logits_per_image = image_features @ text_features.t()
RuntimeError: CUDA out of memory. Tried to allocate 2269.88 GiB (GPU 0; 14.76 GiB total capacity; 7.11 GiB already allocated; 6.67 GiB free; 7.17 GiB reserved in total by PyTorch)

The command I use:

python -u src/training/main.py \
    --save-frequency 1 \
    --zeroshot-frequency 3 \
    --train-data "src/df_openclip_train.csv"  \
    --val-data "src/df_openclip_val.csv"  \
    --openai-pretrained \
    --csv-separator "," \
    --csv-img-key image_path \
    --csv-caption-key product_name \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=4 \
    --model ViT-B/32

Full setting

2021-09-01,05:15:43 | INFO | Rank 0 | Params:
2021-09-01,05:15:43 | INFO | Rank 0 |   C: 3.16
2021-09-01,05:15:43 | INFO | Rank 0 |   aggregate: True
2021-09-01,05:15:43 | INFO | Rank 0 |   batch_size: 128
2021-09-01,05:15:43 | INFO | Rank 0 |   beta1: 0.9
2021-09-01,05:15:43 | INFO | Rank 0 |   beta2: 0.98
2021-09-01,05:15:43 | INFO | Rank 0 |   checkpoint_path: ./logs/lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41/checkpoints
2021-09-01,05:15:43 | INFO | Rank 0 |   copy_codebase: False
2021-09-01,05:15:43 | INFO | Rank 0 |   csv_caption_key: product_name
2021-09-01,05:15:43 | INFO | Rank 0 |   csv_img_key: image_path
2021-09-01,05:15:43 | INFO | Rank 0 |   csv_separator: ,
2021-09-01,05:15:43 | INFO | Rank 0 |   dataset_type: auto
2021-09-01,05:15:43 | INFO | Rank 0 |   debug: False
2021-09-01,05:15:43 | INFO | Rank 0 |   dist_backend: nccl
2021-09-01,05:15:43 | INFO | Rank 0 |   dist_url: tcp://127.0.0.1:6100
2021-09-01,05:15:43 | INFO | Rank 0 |   distributed: True
2021-09-01,05:15:43 | INFO | Rank 0 |   dp: False
2021-09-01,05:15:43 | INFO | Rank 0 |   epochs: 30
2021-09-01,05:15:43 | INFO | Rank 0 |   eps: 1e-06
2021-09-01,05:15:43 | INFO | Rank 0 |   gpu: 0
2021-09-01,05:15:43 | INFO | Rank 0 |   imagenet_v2: None
2021-09-01,05:15:43 | INFO | Rank 0 |   imagenet_val: None
2021-09-01,05:15:43 | INFO | Rank 0 |   log_level: 20
2021-09-01,05:15:43 | INFO | Rank 0 |   log_path: ./logs/lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41/out.log
2021-09-01,05:15:43 | INFO | Rank 0 |   logs: ./logs/
2021-09-01,05:15:43 | INFO | Rank 0 |   lr: 0.001
2021-09-01,05:15:43 | INFO | Rank 0 |   model: ViT-B/32
2021-09-01,05:15:43 | INFO | Rank 0 |   multigpu: None
2021-09-01,05:15:43 | INFO | Rank 0 |   name: lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41
2021-09-01,05:15:43 | INFO | Rank 0 |   ngpus_per_node: 4
2021-09-01,05:15:43 | INFO | Rank 0 |   openai_pretrained: True
2021-09-01,05:15:43 | INFO | Rank 0 |   precision: amp
2021-09-01,05:15:43 | INFO | Rank 0 |   rank: 0
2021-09-01,05:15:43 | INFO | Rank 0 |   regression_frequency: 2
2021-09-01,05:15:43 | INFO | Rank 0 |   report_to: 
2021-09-01,05:15:43 | INFO | Rank 0 |   resume: None
2021-09-01,05:15:43 | INFO | Rank 0 |   save_frequency: 1
2021-09-01,05:15:43 | INFO | Rank 0 |   skip_aggregate: False
2021-09-01,05:15:43 | INFO | Rank 0 |   skip_scheduler: False
2021-09-01,05:15:43 | INFO | Rank 0 |   tensorboard: False
2021-09-01,05:15:43 | INFO | Rank 0 |   tensorboard_path: 
2021-09-01,05:15:43 | INFO | Rank 0 |   train_data: src/df_openclip_train.csv
2021-09-01,05:15:43 | INFO | Rank 0 |   use_bn_sync: False
2021-09-01,05:15:43 | INFO | Rank 0 |   val_data: src/df_openclip_val.csv
2021-09-01,05:15:43 | INFO | Rank 0 |   wandb: False
2021-09-01,05:15:43 | INFO | Rank 0 |   wandb_notes: 
2021-09-01,05:15:43 | INFO | Rank 0 |   warmup: 10000
2021-09-01,05:15:43 | INFO | Rank 0 |   wd: 0.1
2021-09-01,05:15:43 | INFO | Rank 0 |   workers: 4
2021-09-01,05:15:43 | INFO | Rank 0 |   world_size: 4
2021-09-01,05:15:43 | INFO | Rank 0 |   zeroshot_frequency: 3
2021-09-01,05:15:47 | INFO | Rank 0 | Use GPU: 0 for training
2021-09-01,05:15:47 | INFO | Rank 1 | Use GPU: 1 for training
2021-09-01,05:15:47 | INFO | Rank 2 | Use GPU: 2 for training
2021-09-01,05:15:47 | INFO | Rank 3 | Use GPU: 3 for training

Info about the data:

The training data consists of 2.9 million text-image pairs
The validation data consists of 780k text-image pairs

Potential cause of the error

The get_metrics function is called on the embeddings of the whole evaluation set at once, which is massive. In my case, multiplying two 780k x 512 matrices produces a 780k x 780k logits matrix, which needs roughly 2,270 GiB of GPU memory, as the error above reports.
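
The size of the allocation follows from the shape of the logits matrix, and the retrieval ranks can be computed without ever materializing it. The sketch below is an illustration under stated assumptions, not the repository's get_metrics: it uses NumPy arrays in place of torch tensors, the function names are hypothetical, and float32 (4 bytes per value) is assumed for the size estimate.

```python
import numpy as np


def logits_matrix_gib(num_pairs, bytes_per_value=4):
    """Size in GiB of the full num_pairs x num_pairs similarity matrix."""
    return num_pairs * num_pairs * bytes_per_value / 2**30


def image_to_text_ranks(image_features, text_features, chunk_size=1024):
    """Rank of the matching text for every image, computed chunk by chunk.

    Only a (chunk_size x N) block of logits exists at any one time.
    """
    num_pairs = image_features.shape[0]
    ranks = np.empty(num_pairs, dtype=np.int64)
    for start in range(0, num_pairs, chunk_size):
        stop = min(start + chunk_size, num_pairs)
        block = image_features[start:stop] @ text_features.T  # (chunk, N)
        # The score of each ground-truth pair sits on the global diagonal.
        correct = block[np.arange(stop - start), np.arange(start, stop)]
        # Rank 0 means the matching text scored highest.
        ranks[start:stop] = (block > correct[:, None]).sum(axis=1)
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose matching item is within the top k."""
    return float(np.mean(ranks < k))


print(f"{logits_matrix_gib(780_000):.0f} GiB for 780k pairs")
```

Peak memory for the logits then scales with chunk_size x N instead of N x N, at the cost of a Python-level loop. Text-to-image ranks can be obtained the same way by swapping the two arguments.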

error with --openai_pretrained

Hello, I get the error "TypeError: __init__() takes 4 positional arguments but 11 were given" when calling the "build_model" function.

if args.openai_pretrained:
  model, preprocess_train, preprocess_val = load(
      args.model,
      device=args.device,
      jit=False,
      is_train=True)
  if args.precision == "amp" or args.precision == "fp32":
      model = model.float()

def build_model(state_dict: dict):
    vit = "visual.proj" in state_dict
    if vit:
        vision_width = state_dict["visual.conv1.weight"].shape[0]
        vision_layers = len(
            [k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
        vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
        grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
        image_size = vision_patch_size * grid_size
    else:
        counts: list = [
            len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in [1, 2, 3, 4]]
        vision_layers = tuple(counts)
        vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
        output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
        vision_patch_size = None
        assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
        image_size = output_width * 32

    embed_dim = state_dict["text_projection"].shape[1]
    context_length = state_dict["positional_embedding"].shape[0]
    vocab_size = state_dict["token_embedding.weight"].shape[0]
    transformer_width = state_dict["ln_final.weight"].shape[0]
    transformer_heads = transformer_width // 64
    transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith(f"transformer.resblocks")))

    model = CLIP(
        embed_dim,
        image_size, vision_layers, vision_width, vision_patch_size,
        context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
    )

    for key in ["input_resolution", "context_length", "vocab_size"]:
        if key in state_dict:
            del state_dict[key]

    convert_weights_to_fp16(model)
    model.load_state_dict(state_dict)
    return model.eval()

The error is caused by the following code:

  model = CLIP(
      embed_dim,
      image_size, vision_layers, vision_width, vision_patch_size,
      context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
  )

I see that you implement the CLIP model init function with only four arguments. Did I get the wrong version?

class CLIP(nn.Module):
    def __init__(
            self,
            embed_dim: int,
            vision_cfg: CLIPVisionCfg,
            text_cfg: CLIPTextCfg,
    ):
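
The two snippets disagree on how arguments are packed: build_model passes ten flat positional values (eleven with self), while the constructor above expects an embedding size plus two config objects (four with self). The sketch below shows one way the flat values could be regrouped. The two dataclasses are hypothetical stand-ins, and their field names are assumptions for illustration, not the repository's CLIPVisionCfg and CLIPTextCfg definitions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class VisionCfgSketch:  # hypothetical stand-in for CLIPVisionCfg
    image_size: int
    layers: Union[int, Tuple[int, int, int, int]]
    width: int
    patch_size: Optional[int]


@dataclass
class TextCfgSketch:  # hypothetical stand-in for CLIPTextCfg
    context_length: int
    vocab_size: int
    width: int
    heads: int
    layers: int


def pack_clip_args(embed_dim, image_size, vision_layers, vision_width,
                   vision_patch_size, context_length, vocab_size,
                   transformer_width, transformer_heads, transformer_layers):
    """Group build_model's ten flat values into (embed_dim, vision_cfg, text_cfg)."""
    vision_cfg = VisionCfgSketch(image_size, vision_layers, vision_width,
                                 vision_patch_size)
    text_cfg = TextCfgSketch(context_length, vocab_size, transformer_width,
                             transformer_heads, transformer_layers)
    return embed_dim, vision_cfg, text_cfg


# Illustrative ViT-B/32-like values.
packed = pack_clip_args(512, 224, 12, 768, 32, 77, 49408, 512, 8, 12)
```

The returned tuple has the three-argument shape of the newer constructor, which is why mixing an older build_model with a newer CLIP class fails at the call site.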

TPU support.

Would be nice if this repo supported training on TPUs.

Passing --imagenet-val (or --imagenet-v2) without --val crashes unnecessarily

In the current repository, you can evaluate a pretrained model by running

python src/training/main.py \
    --val-data="/path/to/validation_data.csv"  \
    --resume /path/to/checkpoints/epoch_K.pt

However, if you try to do the same thing and just try to get the imagenet-val (or imagenet-v2) accuracy

python src/training/main.py \
    --imagenet-val="/path/to/imagenet/val"  \
    --resume /path/to/checkpoints/epoch_K.pt

then it crashes:

Traceback (most recent call last):
  File "src/training/main.py", line 307, in <module>
    main()
  File "src/training/main.py", line 296, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, log_queue, args))
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ncarlini/open_clip/src/training/main.py", line 189, in main_worker
    evaluate(model, data, start_epoch, args, writer, 0)
  File "/home/ncarlini/open_clip/src/training/train.py", line 159, in evaluate
    dataloader = data['val'].dataloader
KeyError: 'val'

It should be possible to get ImageNet accuracy without using a validation dataset.
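
One possible fix is to make each evaluation step conditional on its dataset being present, instead of indexing data['val'] unconditionally. The sketch below uses a plain dictionary in place of the data object from the traceback. The function is hypothetical, and the 'imagenet-val' and 'imagenet-v2' keys are assumed names, so this is not the repository's evaluate.

```python
def planned_evaluations(data):
    """List which evaluations can run, given the datasets that were loaded."""
    steps = []
    if "val" in data:
        steps.append("retrieval metrics on val")
    # Assumed key names for the two ImageNet variants.
    for name in ("imagenet-val", "imagenet-v2"):
        if name in data:
            steps.append(f"zero-shot accuracy on {name}")
    return steps


# Only --imagenet-val was passed, so there is no 'val' entry to index.
print(planned_evaluations({"imagenet-val": object()}))
```

With a guard like this, the retrieval metrics are simply skipped when no validation CSV is given, and the zero-shot evaluation still runs.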

Usage of title and/or description column in YFCC100M

Hello,

In your training of CLIP, did you use only the description column as text input, or both the title and description columns?

The reason I am asking is because in the github folder where OpenAI provide info on their YFCC100M subset, there is a sentence that I find quite ambiguous:

[...] which have been filtered to only keep those with natural language titles and/or descriptions in English

This seems to imply that it sufficed that only one of title and description was considered natural language for an observation (image) to be kept as part of the subset. However, they do not clarify whether they also proceeded to use the results of this natural language filter to choose whether to use only the title or only the description in the case that one of them was not deemed to be natural language. Alternatively, they may have concatenated the columns and used both of them in training.

Anyway, what I'm interested in knowing here is what you guys decided to do in your training. Did you use both columns or just the description?

Also, did you clean the text in any manner (e.g. remove html tags present in the text)?

braceexpand has unexpected (non-bash-like) behavior with multiple expansions

If you give bash a command like "foo{0..5} bar{1..6}", it expands each brace expression separately and gives you a list of length 12. braceexpand, however, takes the cross product and gives a list of length 36. This isn't necessarily wrong in general, but given how braceexpand is used in this project, I think it's not what's expected.

In particular, if you provide --train-data="/dir1/files{1..10} /dir2/files{1..10}" it ends up trying to include 100 (!!) files. It would probably make more sense to do the bash-like expansion here.
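
The difference comes down to whether the string is split on whitespace before or after expansion. The sketch below illustrates this with a tiny stand-in expander that handles only numeric {a..b} ranges. It is not the braceexpand package and both function names are hypothetical. Since the ranges are inclusive, {0..5} and {1..6} each contain six values.

```python
import itertools
import re

_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_pattern(pattern):
    """Expand every {a..b} range in one pattern (cross product within it)."""
    parts = _RANGE.split(pattern)  # [text, a, b, text, a, b, ..., text]
    literals = parts[0::3]
    ranges = [range(int(a), int(b) + 1)
              for a, b in zip(parts[1::3], parts[2::3])]
    results = []
    for combo in itertools.product(*ranges):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(str(value))
            pieces.append(literal)
        results.append("".join(pieces))
    return results


def bash_like_expand(spec):
    """Split on whitespace first, then expand each word independently."""
    return [path for word in spec.split() for path in expand_pattern(word)]


spec = "foo{0..5} bar{1..6}"
print(len(expand_pattern(spec)))    # whole string as one pattern
print(len(bash_like_expand(spec)))  # each word expanded separately
```

Treating the whole string as one pattern yields 6 x 6 = 36 entries, each containing a space, whereas splitting first yields 6 + 6 = 12 paths. For "/dir1/files{1..10} /dir2/files{1..10}" the same split-first approach gives 20 paths instead of 100.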

_transform() got an unexpected keyword argument 'is_train'

Hi, I was trying to train this model when I got this error: "_transform() got an unexpected keyword argument 'is_train'"

Any insight of what might be wrong? Thanks a lot!

LAION 5B ?

Hi !

Just in case you missed it, there is a new dataset of 5.85B image-text pairs from LAION.
Do you have any plans to train a model on it?

Best.

Generating prompts from an image

I've been looking into some VQGAN code:
https://github.com/mehdidc/feed_forward_vqgan_clip
https://github.com/nerdyrodent/VQGAN-CLIP

Both let the user pass a prompt to style or generate an image.
Here is some usage code from @nerdyrodent:
nerdyrodent/VQGAN-CLIP#13

A must-see:
https://twitter.com/e08477/status/1418440857578098691?s=21
Here there are only 4 images, each generated from a prompt,
e.g. mushroom, spaceship, volcano, old English house on a hill (I might be wrong).

Further down, the prompts gain extra predicates that style or shape the image differently,

e.g. mushroom + marble sculpture.

What I want is to give an image to CLIP and have it tell me what it thinks the words should be.
Is this feasible? Does this repo provide any way to do it? Does it need dimensionality reduction? It seems like a t-SNE problem (like showing word2vec in 2 dimensions), except that under the hood it's 512 dimensions. I have yet to look at the code, so maybe it will become clearer.
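
One common way to frame this is as retrieval rather than generation: embed a list of candidate words or phrases with the text encoder, embed the image with the image encoder, and rank the candidates by cosine similarity in the shared space. No dimensionality reduction is needed for that. The sketch below uses tiny made-up 3-dimensional vectors in place of real 512-dimensional CLIP embeddings, and the function name is hypothetical.

```python
import numpy as np


def rank_candidates(image_embedding, text_embeddings, candidates, top_k=3):
    """Return the top_k (candidate, score) pairs by cosine similarity."""
    image = image_embedding / np.linalg.norm(image_embedding)
    texts = text_embeddings / np.linalg.norm(text_embeddings, axis=1,
                                             keepdims=True)
    scores = texts @ image
    order = np.argsort(-scores)[:top_k]
    return [(candidates[i], float(scores[i])) for i in order]


candidates = ["mushroom", "spaceship", "volcano", "marble sculpture"]
# Made-up embeddings, one row per candidate phrase.
text_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.0, 0.7],
])
image_embedding = np.array([0.9, 0.1, 0.4])

top = rank_candidates(image_embedding, text_embeddings, candidates, top_k=2)
```

The result is limited to the candidate list supplied, so the approach can only pick words, not invent them. Producing free-form text would need a separate captioning or search component on top.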

logit_scale not referenced in get_metrics()/train.py?

Hi all,

thank you very much for providing this repository!

Python reports an unreferenced variable in the following code snippet (from train.py, lines 226-228):

def get_metrics(image_features, text_features):
    metrics = {}
    logits_per_image = (logit_scale * image_features @ text_features.t()).detach().cpu()

And even my IDE (PyCharm) complains about a missing reference.
Am I missing something?

My training parameters are as follows:

Loading model from /home/thetaphipsi/MasterAI/src/open_clip/src/training/model_configs/RN50.json
2021-11-14,15:34:02 | INFO | Rank 0 | Params:
2021-11-14,15:34:02 | INFO | Rank 0 |   C: 3.16
2021-11-14,15:34:02 | INFO | Rank 0 |   aggregate: True
2021-11-14,15:34:02 | INFO | Rank 0 |   batch_size: 32
2021-11-14,15:34:02 | INFO | Rank 0 |   beta1: 0.9
2021-11-14,15:34:02 | INFO | Rank 0 |   beta2: 0.999
2021-11-14,15:34:02 | INFO | Rank 0 |   checkpoint_path: ./logs/lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01/checkpoints
2021-11-14,15:34:02 | INFO | Rank 0 |   copy_codebase: False
2021-11-14,15:34:02 | INFO | Rank 0 |   csv_caption_key: title
2021-11-14,15:34:02 | INFO | Rank 0 |   csv_img_key: filepath
2021-11-14,15:34:02 | INFO | Rank 0 |   csv_separator: 	
2021-11-14,15:34:02 | INFO | Rank 0 |   dataset_type: auto
2021-11-14,15:34:02 | INFO | Rank 0 |   debug: False
2021-11-14,15:34:02 | INFO | Rank 0 |   dist_backend: nccl
2021-11-14,15:34:02 | INFO | Rank 0 |   dist_url: tcp://127.0.0.1:6100
2021-11-14,15:34:02 | INFO | Rank 0 |   distributed: True
2021-11-14,15:34:02 | INFO | Rank 0 |   dp: False
2021-11-14,15:34:02 | INFO | Rank 0 |   epochs: 30
2021-11-14,15:34:02 | INFO | Rank 0 |   eps: 1e-08
2021-11-14,15:34:02 | INFO | Rank 0 |   gpu: 0
2021-11-14,15:34:02 | INFO | Rank 0 |   imagenet_v2: None
2021-11-14,15:34:02 | INFO | Rank 0 |   imagenet_val: None
2021-11-14,15:34:02 | INFO | Rank 0 |   log_level: 20
2021-11-14,15:34:02 | INFO | Rank 0 |   log_path: ./logs/lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01/out.log
2021-11-14,15:34:02 | INFO | Rank 0 |   logs: ./logs/
2021-11-14,15:34:02 | INFO | Rank 0 |   lr: 0.001
2021-11-14,15:34:02 | INFO | Rank 0 |   model: RN50
2021-11-14,15:34:02 | INFO | Rank 0 |   multigpu: None
2021-11-14,15:34:02 | INFO | Rank 0 |   name: lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01
2021-11-14,15:34:02 | INFO | Rank 0 |   ngpus_per_node: 1
2021-11-14,15:34:02 | INFO | Rank 0 |   openai_pretrained: False
2021-11-14,15:34:02 | INFO | Rank 0 |   precision: amp
2021-11-14,15:34:02 | INFO | Rank 0 |   rank: 0
2021-11-14,15:34:02 | INFO | Rank 0 |   regression_frequency: 2
2021-11-14,15:34:02 | INFO | Rank 0 |   report_to: tensorboard
2021-11-14,15:34:02 | INFO | Rank 0 |   resume: None
2021-11-14,15:34:02 | INFO | Rank 0 |   save_frequency: 1
2021-11-14,15:34:02 | INFO | Rank 0 |   save_most_recent: False
2021-11-14,15:34:02 | INFO | Rank 0 |   skip_aggregate: False
2021-11-14,15:34:02 | INFO | Rank 0 |   skip_scheduler: False
2021-11-14,15:34:02 | INFO | Rank 0 |   tensorboard: True
2021-11-14,15:34:02 | INFO | Rank 0 |   tensorboard_path: ./logs/lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01/tensorboard
2021-11-14,15:34:02 | INFO | Rank 0 |   train_data: ./data/Train_GCC-training_output.csv
2021-11-14,15:34:02 | INFO | Rank 0 |   use_bn_sync: False
2021-11-14,15:34:02 | INFO | Rank 0 |   val_data: ./data/Validation_GCC-1.1.0-Validation_output.csv
2021-11-14,15:34:02 | INFO | Rank 0 |   wandb: False
2021-11-14,15:34:02 | INFO | Rank 0 |   wandb_notes: 
2021-11-14,15:34:02 | INFO | Rank 0 |   warmup: 40000
2021-11-14,15:34:02 | INFO | Rank 0 |   wd: 0.1
2021-11-14,15:34:02 | INFO | Rank 0 |   workers: 1
2021-11-14,15:34:02 | INFO | Rank 0 |   world_size: 1
2021-11-14,15:34:02 | INFO | Rank 0 |   zeroshot_frequency: 1
2021-11-14,15:34:02 | INFO | Rank 0 | Added key: store_based_barrier_key:1 to store for rank: 0
2021-11-14,15:34:02 | INFO | Rank 0 | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2021-11-14,15:34:02 | INFO | Rank 0 | Use GPU: 0 for training

I fixed it temporarily by adding a logit_scale param to get_metrics().
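
The temporary fix might look like the sketch below, which passes the scale in explicitly instead of relying on a name that is undefined inside the function. It is written with NumPy arrays in place of torch tensors, computes only image-to-text R@1, and uses a hypothetical function name, so it is an illustration rather than the repository's get_metrics.

```python
import numpy as np


def get_metrics_sketch(image_features, text_features, logit_scale):
    """Image-to-text R@1, with logit_scale passed in as a parameter."""
    logits_per_image = logit_scale * image_features @ text_features.T
    ground_truth = np.arange(len(image_features))
    predicted = logits_per_image.argmax(axis=1)
    return {"image_to_text_R@1": float(np.mean(predicted == ground_truth))}


features = np.eye(4)
metrics_unit_scale = get_metrics_sketch(features, features, 1.0)
metrics_large_scale = get_metrics_sketch(features, features, 100.0)
```

Multiplying all logits by the same positive constant does not change their ordering, so rank-based metrics such as recall come out the same for any positive scale. The value only matters if a loss is computed from the same logits.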

Add ROOT to files written in gather_cc

Hi again,

Would it make sense to prepend ROOT to the filepath in the CSV file? After running gather_cc.py, the files are in the folder cc_data (e.g. cc_data/val/00/0123.jpg), but the path in the CSV file is only val/00/0123.jpg.
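
Until the script writes full paths, the CSV can be patched after the fact. The sketch below assumes a tab-separated file with title and filepath columns, matching the csv_caption_key, csv_img_key and separator shown in the training log of the previous issue. The function name is hypothetical and the sample row is made up.

```python
import csv
import io
import os


def prefix_paths(csv_text, root, path_column="filepath", delimiter="\t"):
    """Return the CSV text with `root` joined in front of every path."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[path_column] = os.path.join(root, row[path_column])
        writer.writerow(row)
    return out.getvalue()


original = "title\tfilepath\na photo\tval/00/0123.jpg\n"
patched = prefix_paths(original, "cc_data")
print(patched)
```

The alternative is to leave the CSV untouched and launch training from inside cc_data, so that the relative paths resolve.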

BR Andreas

`logit_scale` in `CLIP`

Thanks for preparing this repo.
I was wondering how self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07)) was decided. Where does the value np.log(1 / 0.07) come from?
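
For context, the CLIP paper describes the softmax temperature as a learnable parameter initialized to the equivalent of 0.07. The parameter is stored in log space and exponentiated before it multiplies the cosine similarities, so np.log(1 / 0.07) is the log of the inverse temperature. A small worked calculation, using only the standard library:

```python
import math

init_logit_scale = math.log(1 / 0.07)    # value stored in the parameter
multiplier = math.exp(init_logit_scale)  # factor applied to the similarities
temperature = 1 / multiplier             # recovers the 0.07 temperature

print(round(init_logit_scale, 4), round(multiplier, 4), round(temperature, 4))
```

Storing the logarithm keeps the effective multiplier positive no matter how the optimizer updates the parameter.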

[bug] Dataloader does not work when num_worker=2

I only changed lines 186-193 because we need an audio-input version of get_wds_dataset. It always gets stuck when I set num_worker=2. https://github.com/mlfoundations/open_clip/blob/main/src/training/data.py#L150. Could you please take a look? @rwightman. Thank you! My modified version is as follows:

def preprocess(
    sample,
    audio_ext,
    samplerate,
    mono,
    max_len,
    dtype,
    res_type,
):
    for key, value in sample.items():
        if key == audio_ext:
            audio_data, orig_sr = sf.read(io.BytesIO(value))
            if samplerate is not None:
                audio_data = librosa.resample(
                    audio_data, orig_sr=orig_sr, target_sr=samplerate, res_type=res_type
                )
            if len(audio_data) > max_len:  # random clip if too long
                overflow = len(audio_data) - max_len
                idx = np.random.randint(0, overflow + 1)
                if np.random.rand() > 0.5:
                    audio_data = audio_data[idx : idx + max_len]
                else:
                    audio_data = audio_data[
                        len(audio_data) + 1 - idx - max_len : len(audio_data) + 1 - idx
                    ]
            else:  # padding if too short
                audio_data = np.pad(
                    audio_data,
                    (0, max_len - len(audio_data)),
                    mode="constant",
                    constant_values=0,
                )
            if mono:  # convert to mono
                audio_data = librosa.to_mono(audio_data)
            # sample["data"] = (audio_data, sample[text_ext], sample["__key__"])
            sample[audio_ext] = audio_data
    return sample


# def get_wds_dataset(args, preprocess_img, is_train):
def get_wds_dataset(
    args,
    is_train,
    file_path_type="local",
    audio_ext="flac",
    text_ext="json",
    samplerate=32000,
    mono=True,
    max_len=1000000,
    dtype="float64",
    res_type="kaiser_best",
):
    input_shards = args.train_data if is_train else args.val_data
    assert input_shards is not None

    num_samples, num_shards = get_dataset_size(input_shards)
    if not num_samples:
        if is_train:
            num_samples = args.train_num_samples
            if not num_samples:
                raise RuntimeError(
                    'Currently, number of dataset samples must be specified for training dataset. '
                    'Please specify via `--train-num-samples` if no dataset length info present.')
        else:
            num_samples = args.val_num_samples or 0  # eval will just exhaust the iterator if not specified

    pipeline = [wds.SimpleShardList(input_shards)]
    # at this point we have an iterator over all the shards
    if is_train:
        pipeline.extend([
            wds.detshuffle(bufsize=_SHARD_SHUFFLE_SIZE, initial=_SHARD_SHUFFLE_INITIAL, seed=args.seed),
            wds.split_by_node,
            wds.split_by_worker,
            # at this point, we have an iterator over the shards assigned to each worker at each node
            wds.tarfile_to_samples(handler=log_and_continue),
            wds.shuffle(
                bufsize=_SAMPLE_SHUFFLE_SIZE,
                initial=_SAMPLE_SHUFFLE_INITIAL,
                rng=random.Random(args.seed)),
            #wds.repeatedly,  # FIXME determine if this is beneficial
        ])
    else:
        pipeline.extend([
            wds.split_by_worker,
            # at this point, we have an iterator over the shards assigned to each worker
            wds.tarfile_to_samples(handler=log_and_continue),
        ])
    pipeline.extend([
        wds.map(
            partial(
                preprocess,
                audio_ext=audio_ext,
                samplerate=samplerate,
                mono=mono,
                max_len=max_len,
                dtype=dtype,
                res_type=res_type,
            )
        ),
        wds.to_tuple("flac", "json"),
        wds.batched(args.batch_size, partial=not is_train),
    ])

    dataset = wds.DataPipeline(*pipeline)
    if is_train:
        # roll over and repeat a few samples to get same number of full batches on each node
        global_batch_size = args.batch_size * args.world_size
        num_batches = math.ceil(num_samples / global_batch_size)
        num_workers = max(1, args.workers)
        num_worker_batches = math.ceil(num_batches / num_workers)  # per dataloader worker
        num_batches = num_worker_batches * num_workers
        num_samples = num_batches * global_batch_size
        dataset = dataset.with_epoch(num_worker_batches)  # each worker is iterating over this
    else:
        # last batches are partial, eval is done on single (master) node
        num_batches = math.ceil(num_samples / args.batch_size)

    dataloader = wds.WebLoader(dataset, batch_size=None, shuffle=False, num_workers=args.workers)

    # FIXME not clear which approach is better, with_epoch before vs after dataloader?
    # hoping to resolve via https://github.com/webdataset/webdataset/issues/169
    # if is_train:
    #     # roll over and repeat a few samples to get same number of full batches on each node
    #     global_batch_size = args.batch_size * args.world_size
    #     num_batches = math.ceil(num_samples / global_batch_size)
    #     num_workers = max(1, args.workers)
    #     num_batches = math.ceil(num_batches / num_workers) * num_workers
    #     num_samples = num_batches * global_batch_size
    #     dataloader = dataloader.with_epoch(num_batches)
    # else:
    #     # last batches are partial, eval is done on single (master) node
    #     num_batches = math.ceil(num_samples / args.batch_size)

    # add meta-data to dataloader instance for convenience
    dataloader.num_batches = num_batches
    dataloader.num_samples = num_samples

    return DataInfo(dataloader, None)
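
To make the rounding in the training branch easier to follow, the sketch below reproduces just that arithmetic as a standalone function. The function name and the sample numbers are hypothetical. It does not diagnose the hang.

```python
import math


def epoch_sizes(num_samples, batch_size, world_size, workers):
    """Mirror the rounding done in the training branch above."""
    global_batch_size = batch_size * world_size
    num_batches = math.ceil(num_samples / global_batch_size)
    num_workers = max(1, workers)
    # Batches each dataloader worker must deliver per epoch.
    num_worker_batches = math.ceil(num_batches / num_workers)
    num_batches = num_worker_batches * num_workers
    num_samples = num_batches * global_batch_size
    return num_worker_batches, num_batches, num_samples


# Hypothetical numbers: 10,000 samples, batch size 32, one GPU, two workers.
print(epoch_sizes(10_000, 32, 1, 2))
```

Every worker is expected to deliver the same number of batches per epoch, so one thing that may be worth checking is whether the number of shards is at least the number of workers. That is a suggestion for debugging, not a confirmed cause.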

Error when training in DataParallel mode.

Hi, after updating to your most recent code, I got an error when training on a single machine (8 GPUs) in DataParallel mode. I simply set the flag args.dp = True and got the following error message:

miniconda3/envs/env37_amp/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:64: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '

2022-03-18,06:20:51 | INFO | Start epoch 0
Traceback (most recent call last):
  File "CLIP_model/training/main.py", line 304, in <module>
    main()
  File "CLIP_model/training/main.py", line 243, in main
    train_one_epoch(model, data, epoch, optimizer, scaler, scheduler, args, writer)
  File "CLIP_model/training/train.py", line 149, in train_one_epoch
    total_loss = loss(image_features, text_features, logit_scale)
  File "miniconda3/envs/env37_amp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "CLIP_model/training/train.py", line 97, in forward
    logits_per_image = logit_scale * image_features @ text_features.T
RuntimeError: The size of tensor a (8) must match the size of tensor b (1024) at non-singleton dimension 1

The code works fine with args.dp = False when training on a single GPU.
Thanks!
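
The UserWarning and the shapes in the error are consistent with the scalar logit_scale being gathered from 8 replicas into a vector of length 8, which then cannot broadcast against features of width 1024. The sketch below reproduces that shape problem with NumPy arrays (NumPy raises ValueError where torch raises RuntimeError) and shows one possible workaround, averaging the gathered copies back to a scalar. The variable names and sizes are assumptions for illustration.

```python
import numpy as np

num_replicas, batch, dim = 8, 16, 1024
image_features = np.ones((batch, dim))
text_features = np.ones((batch, dim))

# Each replica returns a scalar; gathering them yields a vector of length 8.
gathered_logit_scale = np.full(num_replicas, 14.29)

try:
    gathered_logit_scale * image_features @ text_features.T
    broadcast_ok = True
except ValueError:
    broadcast_ok = False  # shapes (8,) and (16, 1024) cannot broadcast

# Collapsing the gathered copies to a scalar restores the intended product.
logit_scale = gathered_logit_scale.mean()
logits_per_image = logit_scale * image_features @ text_features.T
```

Since all replicas hold the same parameter value, the mean equals each individual copy, so the workaround does not change the result.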

Loss is constant

I'm using CLIP to train on my custom dataset with the following params:

Dataset size : 50k image-text pairs
Batch size : 128
Image Size : 224
Gpus : 1
Epochs : 500

It's been running for a while now. I'm on my 15th epoch, and the loss has barely changed. It isn't a constant number, but it stays at 4.8xxx. Should I be concerned? I'm not sure why this is happening.
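
A value of 4.8xxx is worth comparing against the chance level of the loss. Assuming the usual contrastive setup, where each image is classified against all texts in the batch with cross-entropy, a model that has learned nothing assigns probability 1/128 to the correct pair, giving a loss of ln(128):

```python
import math

batch_size = 128
# Cross-entropy of a uniform guess over batch_size candidates.
chance_loss = -math.log(1 / batch_size)
print(f"{chance_loss:.4f}")
```

The result is about 4.852, so a loss that hovers at 4.8xxx suggests the model is still close to uniform predictions. Common things to check in that situation are the learning rate, the warmup length relative to the number of steps, and whether images and captions are correctly paired.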

Results of using different learning rates and more training epochs

Very nice code!

I'm able to reproduce the zero-shot results on imagenet using cc3m (2,862,387 images in total for me) and the provided sample code.

I'd like to ask if you have tried different learning rates other than 1e-3 for batch=128? Would you be able to give more insights on how you ended up using lr=1e-3?

Also, I'd like to know if you have tried more training epochs, i.e. larger than 30. I'm curious if training with more epochs would help improve the zero-shot accuracy.

Loss slowly decreases

Hi, I am attempting to use open_clip on remote sensing imagery from xView. I'm finding that in the first 2-3 epochs the loss decreases from 3.5 to 2.7 and then stays around 2.7 with a learning rate of 8e-6 (see training below). Would anyone have ideas on how I can get the loss to keep decreasing?

Some background on xview:
My images are derived from xview which is an object detection dataset with images like this:

To generate captions for xView, I make one caption per bounding box, so the same image may have several different captions. Each caption is valid, since an image may contain multiple objects.

error when running training/main.py

I am getting a ModuleNotFoundError for training when running src/training/main.py. It points to line 19 in main.py, the import function.

Edit: Fixed it. I forgot to set PYTHONPATH.

training perf for single GPU is not good

Hi, I was training CLIP on a single GPU. After profiling, I noticed that training performance was poor, as the picture below shows. GPU idle time is almost twice the GPU active time because the CPU is blocked in sem_timedwait. Any idea how to remove this unnecessary blocking? Thanks!

Can CLIP be trained on Windows?

Hi,

Thanks for the tremendous effort!

Is it possible to set up this training code, for fine-tuning CLIP on a custom dataset, on a Windows 10 machine?

Expected time/epoch for conceptual captions (R50)

How long is a reasonable time for an epoch using 8 workers? I'm seeing about 8 hours/epoch, for the resnet50. Launch command from the README:

    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="/path/to/train_data.csv"  \
    --val-data="/path/to/validation_data.csv"  \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val=/path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --model RN50

Thank you!

CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations

CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations
https://arxiv.org/abs/2112.07133

Anyone keen to try modifying a training script for above?

Example of inference with open_clip

Would be nice to provide an example of loading a model and performing inference on a single example.

corrupted weights for `RN50--yfcc15m` and `RN50-quickgelu--yfcc15m`

/usr/local/lib/python3.7/dist-packages/mmc/loaders/mlfcliploader.py in load(self, device)
     43         model, _, preprocess_image = open_clip.create_model_and_transforms(
     44             model_name=model_name,
---> 45             pretrained=dataset)
     46 
     47         model.requires_grad_(False)

/usr/local/lib/python3.7/dist-packages/open_clip/factory.py in create_model_and_transforms(model_name, pretrained, precision, device, jit, force_quick_gelu, pretrained_image)
    134         model_name, pretrained, precision, device, jit,
    135         force_quick_gelu=force_quick_gelu,
--> 136         pretrained_image=pretrained_image)
    137     preprocess_train = image_transform(model.visual.image_size, is_train=True)
    138     preprocess_val = image_transform(model.visual.image_size, is_train=False)

/usr/local/lib/python3.7/dist-packages/open_clip/factory.py in create_model(model_name, pretrained, precision, device, jit, force_quick_gelu, pretrained_image)
    106             if checkpoint_path:
    107                 logging.info(f'Loading pretrained {model_name} weights ({pretrained}).')
--> 108                 model.load_state_dict(load_state_dict(checkpoint_path))
    109             else:
    110                 logging.warning(f'Pretrained weights ({pretrained}) not found for model {model_name}.')

/usr/local/lib/python3.7/dist-packages/open_clip/factory.py in load_state_dict(checkpoint_path, map_location)
     48 
     49 def load_state_dict(checkpoint_path: str, map_location='cpu'):
---> 50     checkpoint = torch.load(checkpoint_path, map_location=map_location)
     51     if isinstance(checkpoint, dict) and 'state_dict' in checkpoint:
     52         state_dict = checkpoint['state_dict']

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    711                     return torch.jit.load(opened_file)
    712                 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
    714 
    715 

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    938         typed_storage._storage._set_from_file(
    939             f, offset, f_should_read_directly,
--> 940             torch._utils._element_size(typed_storage.dtype))
    941         if offset is not None:
    942             offset = f.tell()

RuntimeError: unexpected EOF, expected 832488 more bytes. The file might be corrupted.
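
An `unexpected EOF` from `torch.load` is consistent with a truncated or partially downloaded checkpoint file. A minimal sketch for sanity-checking a cached file before loading it is shown here. The helper names are hypothetical, and it assumes a reference size or SHA-256 digest is available from somewhere you trust; nothing in this thread confirms that open_clip publishes one.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_complete(path, expected_size=None, expected_sha256=None):
    """Hypothetical helper: cheap checks to run before calling torch.load."""
    if not os.path.isfile(path):
        return False
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    if expected_sha256 is not None and file_sha256(path) != expected_sha256:
        return False
    return True
```

If the check fails, deleting the cached file and downloading it again is the usual first thing to try.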

Performance of VIT-B/32 is worse than RN50 on CC3M

Here are my curves. RN50 roughly matches the one shown in the repo, but ViT-B/32 is worse. I am using the hyperparameters from the README. Could you also share the performance curves of ViT-B/32 on CC?

error in training

Hi, I encountered this error during training and I'm not sure what it means:

2022-02-09,21:22:00 | INFO | Rank 0 | Train Epoch: 9 [28800/43670 (66%)]        Loss: 0.493029  Data (t) 0.000  Batch (t) 0.235 LR: 0.000020    logit_scale 2.821
2022-02-09,21:22:24 | INFO | Rank 0 | Train Epoch: 9 [32000/43670 (73%)]        Loss: 0.642597  Data (t) 0.008  Batch (t) 0.274 LR: 0.000012    logit_scale 2.822
2022-02-09,21:22:48 | INFO | Rank 0 | Train Epoch: 9 [35200/43670 (81%)]        Loss: 0.442177  Data (t) 0.002  Batch (t) 0.243 LR: 0.000006    logit_scale 2.822
2022-02-09,21:23:13 | INFO | Rank 0 | Train Epoch: 9 [38400/43670 (88%)]        Loss: 0.435208  Data (t) 0.000  Batch (t) 0.255 LR: 0.000003    logit_scale 2.823
2022-02-09,21:23:37 | INFO | Rank 0 | Train Epoch: 9 [41600/43670 (95%)]        Loss: 0.295687  Data (t) 0.000  Batch (t) 0.240 LR: 0.000000    logit_scale 2.823
2022-02-09,21:24:36 | INFO | Rank 0 | Eval Epoch: 10 image_to_text_mean_rank: 40.2243   image_to_text_median_rank: 22.0000      image_to_text_R@1: 0.0628       image_to_text_R@5: 0.2063       image_to_text_R@10: 0.3273      text_to_image_mean_rank: 44.4849     text_to_image_median_rank: 25.0000      text_to_image_R@1: 0.0477       text_to_image_R@5: 0.1817       text_to_image_R@10: 0.2948      val_loss: 0.3798        epoch: 10.0000  num_elements: 6432.0000
Exception in thread Thread-5:
Traceback (most recent call last):
  File "C:\Users\nuzuegbunam\Anaconda3\envs\open_clip_3_9\lib\multiprocessing\connection.py", line 317, in _recv_bytes

Does anyone have any idea what this means?

Replace eval() with json.load()

AAAaaaaaaaaahhhhhhhhhhhhhhhhhh!!!!!!!

open_clip/src/training/data.py

Line 59 in 91f6cce

sizes = eval(open(os.path.join(dir_path, 'sizes.json'), 'r').read())
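
The concern with the line above is that `eval()` executes whatever Python expression the file contains, whereas a JSON parser only parses data. A minimal sketch of the replacement, assuming `sizes.json` holds valid JSON (the function name `load_sizes` is hypothetical):

```python
import json
import os


def load_sizes(dir_path):
    """Read sizes.json with a JSON parser instead of eval()."""
    with open(os.path.join(dir_path, "sizes.json"), "r") as f:
        return json.load(f)
```

One behavioural difference to be aware of: `eval()` accepts Python literals such as single-quoted strings, which `json.load()` rejects, so files written in that style would need to be regenerated.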

OpenAI's CLIP not open source?

I'm confused about this project. Isn't OpenAI's CLIP already open source?

The repo has an MIT license: https://github.com/openai/CLIP

Performance of VIT-B/32 is worse than RN50 on YFCC15M

We are trying to re-implement the CLIP ViT-B/32 pre-trained on YFCC15M provided by OpenAI. However, our result is lower than the RN50 result reported in the paper and in your repo (training is almost finished, and the current ImageNet zero-shot accuracy is around 27% - 28%). Have you tried training a ViT-B/32 on YFCC? Did you see the same thing? Thanks.

Error in demo of README.md?

Hi!

In the "Usage" part of the readme, we use model.encode_image() and model.encode_text() before computing the dot product of the features.

However, those methods, in contrast to what is done during training,

open_clip/src/open_clip/model.py

Line 429 in 0d1127c

image_features = F.normalize(image_features, dim=-1)

don't normalize the feature vectors.

Therefore it could bias the results. Am I wrong?

Best,
Théo
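
To illustrate the concern, here is a toy sketch in plain Python showing how a raw dot product and a cosine similarity can rank the same candidates differently when the vectors have different norms. The vectors are made up for illustration and are not real CLIP features.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Toy vectors (made up for illustration, not real CLIP features).
image = [1.0, 0.0]
text_a = [1.0, 0.0]   # same direction as the image, small norm
text_b = [3.0, 3.0]   # 45 degrees away, but a much larger norm

raw = [dot(image, t) for t in (text_a, text_b)]
cosine = [dot(l2_normalize(image), l2_normalize(t)) for t in (text_a, text_b)]

print(raw)     # the raw dot product favours text_b because of its norm
print(cosine)  # the cosine similarity favours text_a
```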

Possible to finetune?

Is it possible to fine-tune from the existing OpenAI checkpoints rather than train from scratch with this codebase?

Inference for non-square images

Hello,

I would like to run the different CLIP models on high definition non-square images (e.g. 720p or 1080p).
Is there a simple way to do so without deforming the images into a smaller square resolution (336x336 or 224x224)?

Thank you for your work on this repository, I found it very helpful,
Simon
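
One common way to avoid distortion is to letterbox: scale the image so its longer side fits the square, then pad the shorter side. The sketch below only computes the geometry, and the function name is hypothetical. Whether padded inputs work well with a given checkpoint is not something this thread confirms, since the models may never have seen padding during training.

```python
def letterbox_geometry(width, height, target=224):
    """Scale (width, height) to fit inside a target x target square.

    Returns (new_w, new_h, pad_left, pad_top): the resized size that
    preserves the aspect ratio, and the offsets that centre it in the square.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


print(letterbox_geometry(1280, 720))  # a 720p frame
```

The alternative, a centre crop, keeps the pixel scale but discards the left and right edges of a wide frame.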

Update to newest version of webdataset

ModuleNotFoundError: No module named 'torch._C._distributed_rpc'; 'torch._C' is not a package

I get this strange error when attempting "import open_clip". I have tried reinstalling open_clip, as well as various versions of PyTorch. In this instance, I am using Python 3.7.9 and PyTorch 1.9.0.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\open_clip_torch-1.0.1-py3.7.egg\open_clip\__init__.py", line 2, in <module>
    from .loss import ClipLoss
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\open_clip_torch-1.0.1-py3.7.egg\open_clip\loss.py", line 2, in <module>
    import torch.distributed.nn
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\nn\__init__.py", line 1, in <module>
    from .api.remote_module import RemoteModule
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\nn\api\remote_module.py", line 22, in <module>
    from torch.distributed.rpc.internal import _internal_rpc_pickler
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\rpc\internal.py", line 12, in <module>
    from torch._C._distributed_rpc import _get_current_rpc_agent
ModuleNotFoundError: No module named 'torch._C._distributed_rpc'; 'torch._C' is not a package

Possible args

Is there a straightforward way to find all the args supported by the code?
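
If the training script builds its options with `argparse` (an assumption; this thread does not show the parser), then running the script with `--help` prints every supported argument. The toy parser below is hypothetical; only the flag names and values are borrowed from the command quoted earlier on this page.

```python
import argparse


def build_parser():
    # Hypothetical parser for illustration; the flags and defaults mirror
    # the command quoted earlier on this page, the help strings are made up.
    parser = argparse.ArgumentParser(description="toy training script")
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    parser.add_argument("--wd", type=float, default=0.1, help="weight decay")
    parser.add_argument("--epochs", type=int, default=30, help="number of epochs")
    parser.add_argument("--workers", type=int, default=8, help="dataloader workers")
    parser.add_argument("--model", type=str, default="RN50", help="model name")
    return parser


parser = build_parser()

# format_help() returns the same text that `--help` prints.
help_text = parser.format_help()
print(help_text)

# vars() on the parsed namespace lists every argument with its default.
defaults = vars(parser.parse_args([]))
print(sorted(defaults))
```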

suggestions for a multilingual version CLIP

Hi,

Thanks for this great work! I want to make a multilingual version of CLIP. There is existing work that uses English CLIP indirectly (https://github.com/FreddeFrallan/Multilingual-CLIP), but do you have suggestions for making this code multilingual?

Thank you!

Conceptual Captions Faster R-CNN features

Hi,
A sincere request.
Since extracting them is very time-consuming, could you kindly provide the extracted Faster R-CNN features for the Conceptual Captions dataset via Drive or Dropbox?
Thanks :)

Not able to run inference in fp16 mode

Thank you in advance for this amazing project :-)

I'm trying to run inference in fp16 mode (like in the original CLIP repo), but I'm failing to achieve it. This is the error message I get:

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

And this is the code I'm trying:

import torch
from PIL import Image
import open_clip
import requests

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-quickgelu',
                                                             pretrained='laion400m_e32',
                                                             precision="fp16",
                                                             device=torch.device("cuda"))

url = "https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/CLIP.png"
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]

Note that I've also tried the ViT-B-32 model with the openai pretrained weights, and that doesn't work either. Am I doing something wrong?

Avenue for exploration - augmenting training set with colour palettes / texture names /more meta data

Part of the fun with CLIP is using it in conjunction with VQGAN.
This allows prompts to generate images.

Something is lost in this translation, though.
They say a picture is worth a thousand words, but what if some extra data were injected into the training?

It could be, say, textures, maybe even geometric descriptions or metadata.

Not Issues. Why is "Weight decay" parameter so large?

This is a question, not an issue.

Why is the weight decay so large (the default is 0.2 or 0.1)?
It is usually 1e-4.
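
One possible explanation, assuming a decoupled weight decay optimizer such as AdamW is in use (this thread does not say which optimizer the code uses): there the decay applied per step is `lr * wd` rather than `wd` alone, so the effective shrink is much smaller than the raw value suggests. The sketch below isolates only that decay term and ignores the gradient update.

```python
def decoupled_decay_step(weight, lr, wd):
    """Decay part of one AdamW-style step: w <- w * (1 - lr * wd)."""
    return weight * (1.0 - lr * wd)


# lr and wd are the values from the command quoted earlier on this page.
lr, wd = 1e-3, 0.1
w = 1.0
for _ in range(1000):
    w = decoupled_decay_step(w, lr, wd)

print(w)  # close to exp(-0.1): roughly a 10% shrink after 1000 steps
```

With a learning rate schedule that decays towards zero, the per-step shrink becomes smaller still late in training.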

Captions of YFCC dirty

Hello there,

Not really an issue, but something I am interested in: how did you clean the captions of YFCC?
I followed the steps explained in the closed issue, but there are still a lot of captions with URLs, camera names and settings, dates, and so on. Compared to CC (where the captions are really clean), it looks really bad. Still, you get a big jump in performance on ImageNet, so before I start training I would like to know whether you cleaned the data.

If so, I would be very happy if you could provide the code or some snippets :)

Best regards

Generalizable Text Transformer Usage

I've been chatting with some others interested in training CLIP for different domain tasks. They expressed interest in a simple way to use a pre-trained text transformer.

Some basic support for Hugging Face or generic classes of transformers shouldn't be too crazy of an extension to what is already fleshed out.

Have you tried to fine-tune the clip model (as official Vit-B-32) in your datasets?

a. How are the fine-tuning results? Could you provide a set of fine-tuned parameters?
b. For fine-tuning, what suggestions do you have on parameter settings or training tricks?

Add option for zero-shot on ImageNetR, Sketch, etc...

scripts of training on multiple nodes

Hi, is there an easy-to-use script for training CLIP on multiple nodes? I can set up training on one node (8 GPUs) now, but I need to test the scaling efficiency. Thanks for any insight~

Bug in gather_cc

Hi there,

First of all, thanks for the code, I appreciate your effort!

I think there is a bug in gather_cc.py:
In line 86 there is a hardcoded 'val', which should probably be split.

Model name details

Hi,

Where can we find the details behind your model naming?

Best,
Theo

interpretation of debug output

Hi,

I'm running the src/training/main.py in debug mode, and I'm getting the following message in the terminal:


2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00

What does it mean? I'm using 1 GPU and running with the --debug flag.
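
The format of these lines resembles the tag dump that Pillow's TIFF/EXIF reader emits at DEBUG level while decoding image metadata (here, a "Software" tag). If that is the source, which is an assumption based on the message format alone, raising the level of the `PIL` logger should silence them while keeping the rest of the debug output:

```python
import logging

# Keep DEBUG output for the application itself...
logging.basicConfig(level=logging.DEBUG)

# ...but raise the threshold for the "PIL" logger and its children
# (e.g. "PIL.TiffImagePlugin"), assuming that is where the lines come from.
logging.getLogger("PIL").setLevel(logging.WARNING)

noisy = logging.getLogger("PIL.TiffImagePlugin")
print(noisy.isEnabledFor(logging.DEBUG))    # False
print(noisy.isEnabledFor(logging.WARNING))  # True
```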

Second question: how do I delete an experiment so I can reuse its name?

CLIP training in Jax.

Would be nice if we could add a jax_src folder which supported training CLIP models in Jax.

This would also help with #20.