
frozen-in-time's Introduction

Frozen in Time ❄️⏳

A Joint Video and Image Encoder for End-to-End Retrieval

project page | paper | dataset | demo

Repository containing the code, models, and data for end-to-end retrieval. WebVid data can be found at https://github.com/m-bain/webvid


📝 Preparation

  1. Create the conda environment: conda env create

  2. Create data / experiment folders: mkdir data; mkdir exps. Note these can just be symlinks to wherever you want to store large data.

🔧 Finetuning (benchmarks: MSR-VTT)

  1. wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip -P data; unzip data/MSRVTT.zip -d data

  2. Change num_gpus in the config file accordingly.

  3. Train: python train.py --config configs/msrvtt_4f_i21k.json

  4. Test: python test.py --resume exps/models/{EXP_NAME}/{EXP_TIMESTAMP}/model_best.pth

For finetuning a pretrained model, set "load_checkpoint": "PATH_TO_MODEL" in the config file.
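For example, a minimal sketch of doing this programmatically (paths are placeholders; the load_checkpoint key sits under arch.args, as in the example config shown further down this page):

import json

# Load an existing finetuning config, point it at a pretrained checkpoint,
# and write it back out under a new (hypothetical) name.
with open("configs/msrvtt_4f_i21k.json") as f:
    config = json.load(f)

config["arch"]["args"]["load_checkpoint"] = "PATH_TO_MODEL"  # pretrained .pth file

with open("configs/msrvtt_4f_i21k_ft.json", "w") as f:
    json.dump(config, f, indent=4)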

🏋️‍️ Pretraining

  1. Download WebVid-2M (see https://github.com/m-bain/webvid)

  2. Download CC-3M (see https://ai.google.com/research/ConceptualCaptions/download)

  3. Train: python train.py --config CONFIG_PATH. The different options are:

    a. Dataset combinations

     i. CC-3M + WebVid2M: configs/cc-webvid2m-pt-i2k.json
     ii. WebVid2M : configs/webvid2m-pt-i2k.json
    

    You can add an arbitrary number of image/video datasets for pre-training by adding as many dataloaders to the config file's dataloader list as your heart desires. Adding more datasets will likely lead to higher downstream performance.

    b. Number of frames

    For image datasets, this should always be set to "video_params": {"num_frames": 1, ...}.

    For video datasets, set this to whatever you want. N.B. more frames = more GPU memory.

    If, like us, you are not a big company and have limited compute, then you will benefit from training via a curriculum on the number of frames. A lot of the knowledge can be learned in the 1-frame setting, as we show in the paper. You can then finetune with more frames. See the curriculum learning section below.

    c. Finetuning

    Set "load_checkpoint": "FULL_MODEL_PATH" in the config file. You can now use different experiment params, such as num_frames, to do curriculum learning for example.

🗄 Pretrained Weights

📚 Curriculum Learning on #frames

Curriculum learning on the number of frames during pretraining achieves similar performance with a significant reduction in compute (both memory and training time). This is because the model has higher throughput with fewer frames, and a larger batch size fits in the same GPU memory.

Our best model was trained on 1 frame and then finetuned on 4 frames on CC+WebVid2M.

Train on 1 frame until the training loss converges, then finetune on 4 frames with the same config, from the 1-frame checkpoint, by setting load_checkpoint in the config file. The 4-frame finetuning needs far fewer iterations (~10% of the 1-frame setting is sufficient) since most of the knowledge is learned in the 1-frame setting.
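As a concrete sketch of the second stage (paths and config layout are assumptions; this script is not part of the repo), derive the 4-frame config from a finished 1-frame run:

import copy
import json

# Stage 1 config: 1-frame pretraining on CC + WebVid2M.
with open("configs/cc-webvid2m-pt-i2k.json") as f:
    stage1 = json.load(f)

# Stage 2: same config, but 4 frames, warm-started from the 1-frame checkpoint.
stage2 = copy.deepcopy(stage1)
stage2["arch"]["args"]["video_params"]["num_frames"] = 4
stage2["arch"]["args"]["load_checkpoint"] = "exps/models/EXP_NAME/EXP_TIMESTAMP/model_best.pth"
# The video dataloader(s) should also sample 4 frames; image dataloaders stay at num_frames = 1.

with open("configs/cc-webvid2m-pt-i2k_4f.json", "w") as f:
    json.dump(stage2, f, indent=4)

# Then train the second stage (for far fewer iterations):
#   python train.py --config configs/cc-webvid2m-pt-i2k_4f.json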

📈 Experiment Logging and Visualising

This repository uses Sacred for logging and tracking experiments, with a Neptune front end. It makes life a lot easier. If you want to activate this:

  1. Create a neptune.ai account.
  2. Create a project, copy your credentials into train.py, and remove the ValueError.
  3. Set "neptune": true in your config files.

🔍 Creating a semantic visual search engine

This repository can be used to extract visual features with the Frozen in Time model in order to create a text-to-visual semantic search engine, such as our demo. This uses the FAISS library for rapid indexing of millions of feature vectors. Follow the instructions in index_search.md.
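For illustration, a minimal FAISS sketch (file names and shapes are assumptions, not the repo's index_search.py): L2-normalise the extracted video features and the query text embedding so that inner-product search equals cosine similarity.

import faiss
import numpy as np

# Hypothetical files produced by a feature-extraction run.
video_feats = np.load("saved_features/video_embeds.npy").astype("float32")  # shape (N, d)
faiss.normalize_L2(video_feats)

index = faiss.IndexFlatIP(video_feats.shape[1])  # exact inner-product index
index.add(video_feats)

query = np.load("saved_features/text_embed.npy").astype("float32").reshape(1, -1)  # hypothetical query
faiss.normalize_L2(query)

scores, ids = index.search(query, 10)  # ids of the 10 most similar videos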

🎓 Cite

If you use this code in your research, please cite:

@InProceedings{Bain21,
  author       = "Max Bain and Arsha Nagrani and G{\"u}l Varol and Andrew Zisserman",
  title        = "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval",
  booktitle    = "IEEE International Conference on Computer Vision",
  year         = "2021",
}

LICENSE

This project is licensed under the MIT License. See LICENSE for more details.

🙏 Acknowledgements

This code is based on the pytorch-template: https://github.com/victoresque/pytorch-template

It also adopts many good practices from Samuel Albanie's collaborative-experts: https://github.com/albanie/collaborative-experts

frozen-in-time's People

Contributors

bryant1410, m-bain


frozen-in-time's Issues

MSRVTT Data

Hi, just wondering when the zip for the MSRVTT data will be uploaded? Thanks!

Can you share some recordings of your experiments

Can you share some records of your experiments, like graphs in neptune.ai or other logs tracking the performance/loss changes over training steps?

I would like to compare the effects of some configurations (e.g. batch size) on training convergence in depth. Since this uses a contrastive loss that depends on a similarity matrix, it may be affected by batch size and converge more slowly with smaller batches. Your experiments did not use very large batch sizes and may not have reached the best possible performance yet. I would like to try a few things.

Watermarks influence

Hello! A large fraction of the videos (or even all of them) in the WebVid dataset have a Shutterstock watermark. Did you clean it somehow during training? If not, in your opinion, can these watermarks cause overfitting and hurt the final performance on downstream tasks?

The pre-training budget on WebVid-2M

Thanks for your work.
I am interested in the pre-training budget on WebVid-2M.
Could you tell us how many GPUs and how much time pre-training on WebVid-2M took?

Test set of MSR-VTT for downstream evaluation

Hi,

In the paper, it is described that 'Following other works [35], we train on 9K train+val videos and report results on the 1K-A test set'.

However, in your provided code for text-to-video retrieval on MSR-VTT, it seems that the validation set and the test set are the same, named 'val_list_jsfusion.txt' with 1K data.

The results of your released model on MSR-VTT test set (val_list_jsfusion.txt) are higher than that reported in the paper.

Is 'val_list_jsfusion.txt' the test set for MSR-VTT evaluation?

Looking forward to your reply.

About the effects of sliding_window_stride

Hi Bain,
I see you mentioned in other issues that setting sliding_window_stride=12 when evaluating retrieval on MSR-VTT (finetuned) helps improve performance. I tried this but didn't get the improvement.

After finetuning with msrvtt_4f_i21k.json, I tested the model with the command presented in the README. Results are:

[t2v_metrics] epoch 0, R@1: 28.9, R@5: 55.6, R@10 66.2, R@50 86.8MedR: 4, MeanR: 29.9                                                                                                                                                                  
[v2t_metrics] epoch 0, R@1: 28.4, R@5: 56.5, R@10 66.2, R@50 88.1MedR: 4, MeanR: 25.6  

After setting --sliding_window_stride=12 for test.py, the results are:

[t2v_metrics] epoch 0, R@1: 28.8, R@5: 57.7, R@10 68.5, R@50 88.0MedR: 4, MeanR: 27.3
[v2t_metrics] epoch 0, R@1: 30.0, R@5: 58.8, R@10 68.8, R@50 89.7MedR: 4, MeanR: 22.5

It shows no obvious improvement in my test.

In #41, sliding_window_stride indeed helps improve the evaluation performance. I don't know why it doesn't work here. I kept the code in test.py unchanged and only modified some code in base_dataset.py to fit my environment (i.e., lower versions of PyTorch and TorchVision due to limitations of the computing cluster). Besides, the version of ffmpeg on my cluster is old and hard to update. Could the difference in environments be the reason for the poor results?

I just want to use one trained model with sound performance for some experiments in the test phase (e.g., adversarial attacks), so I would like to get a finetuned Frozen in Time with R@1 over 30%, as the results in your paper show. However, I failed to get such a model. :(

The phenomenon seems weird and I will check further to try to reproduce higher results. Besides, if possible, would you mind sharing a finetuned model?

What does zero-pad mean?

This question may have been missed in issue #42.
When increasing the input number of frames from 4 to 8, how does zero-padding work?
Do the center 4 frames use the original temporal embedding, and the other frames use zero padding?

Why not evaluate your model on the HowTo100M dataset?

Hello, wonderful project! Here is a small question: why not evaluate your model on the HowTo100M dataset? As we all know, HowTo100M also contains video-text pairs which can be used for video-text retrieval. I suppose it is a suitable dataset to finetune on. Thanks.

Config for the paper results

Run the test script:
test.py --resume PATH_TO_FINETUNED_CHECKPOINT --sliding_window_stride 12

the sliding window stride argument adds temporal averaging over multiple frame samples :)

Originally posted by @m-bain in #41 (comment)

Finetuning the pretrained model on MSR-VTT

Hi,

Thanks for your excellent work!

When I finetune the pretrained model that you provide on MSR-VTT, there is a warning shown below:

"Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']"

Is it expected?

Thanks!
Yuying

Curriculum Learning and Video-Image Joint Training

Hi,

I have a question about the curriculum learning. For the 1-frame pretraining, both the CC3M and WebVid-2M datasets are used. But in the 4-frame finetuning stage, did you use both video and image data for joint pretraining (4 frames for WebVid-2M and 1 frame for CC3M)? I cannot find any experimental details for "joint image-video training" in the paper.

Thanks in advance.

Fine tune on a custom dataset

Hi,

Congrats on the amazing work!! I want to fine-tune this model on a custom video dataset. It has video and text as inputs, but no images. How can I fine-tune without images in the input?

Thank you.

Which results in the paper correspond to the finetune command?

I ran the finetuning procedure with the command python train.py --config configs/msrvtt_4f_i21k.json.

I got:

[v2t_metrics]MSRVTT epoch 27, R@1: 16.1, R@5: 40.5, R@10 55.0, R@50 81.9MedR: 8, MeanR: 40.6
    epoch          : 27
    loss_0         : 0.7913076955540566
    val_loss_0     : 1.5775871678950295
    val_0_t2v_metrics_R1: 17.8
    val_0_t2v_metrics_R5: 40.6
    val_0_t2v_metrics_R10: 55.1
    val_0_t2v_metrics_R50: 81.5
    val_0_t2v_metrics_MedR: 8.0
    val_0_t2v_metrics_MeanR: 39.94
    val_0_t2v_metrics_geometric_mean_R1-R5-R10: 34.14804760940716
    val_0_v2t_metrics_R1: 16.1
    val_0_v2t_metrics_R5: 40.5
    val_0_v2t_metrics_R10: 55.0
    val_0_v2t_metrics_R50: 81.9
    val_0_v2t_metrics_MedR: 8.0
    val_0_v2t_metrics_MeanR: 40.5555
    val_0_v2t_metrics_geometric_mean_R1-R5-R10: 32.9772570568898
Validation performance didn't improve for 10 epochs. Training stops.

There are two R@1 results. Which corresponds to the results in the paper?
I found the R@1 in Table 5 is 31.0, which seems far from this run.

Result about MSVD

Hi Bain,
I found the experimental setting for the MSVD result unclear, and MCQ reports a different version, so I want to ask about the MSVD result and setting.

What to run to reproduce the best results?

Given that you do curriculum learning, I was wondering if you could provide the series of config files you use to train the best model in your paper. My understanding is that you first use the CC3M+WebVid config in this repo, which uses 1 frame per video. But then I think you fine-tune on 4 frames, then on 8, right? Is it the same config (iterations, etc.) but changing the frame count and batch size?

Config for the method trained on CC3M

I was wondering if you could provide the config used in the paper to train only on CC3M. Is it exactly like the CC3M+WebVid one but with the WebVid part removed?

Cuda Error

When I run the training script you provided, I get the error:

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

It's coming from video_embeddings = self.compute_video(video_data) in self._valid_epoch(-1)

Are the videos placed on GPU too?

RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.

When I ran python test.py --config configs/video_feat_extract.json --save_feats saved_features --save_type video by following https://github.com/m-bain/frozen-in-time/blob/main/index_search.md, it showed the errors below:

(frozen) ubuntuuser@ubuntugpu:~/frozen-in-time$ python test.py --config configs/video_feat_extract.json  --save_feats saved_features  --save_type video
WARNING - test - No observers have been added to this run
INFO - test - Running command 'run'
INFO - test - Started
TextVideoDataLoader
FrozenInTime
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
######USING ATTENTION STYLE:  frozen-in-time
Using random weights
##### WARNING SAVE_PART STARTING AT 0, MAKE SURE THIS IS THE NEWEST
0
0it [00:00, ?it/s]
0it [00:00, ?it/s]
ERROR - test - Failed after 0:00:15!
Traceback (most recent call last):
  File "test.py", line 284, in <module>
    ex.run()
  File "/mnt/data/ubuntuuser/.conda/envs/frozen/lib/python3.7/site-packages/sacred/experiment.py", line 276, in run
    run()
  File "/mnt/data/ubuntuuser/.conda/envs/frozen/lib/python3.7/site-packages/sacred/run.py", line 238, in __call__
    self.result = self.main_function(*args)
  File "/mnt/data/ubuntuuser/.conda/envs/frozen/lib/python3.7/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "test.py", line 135, in run
    vid_embeds = torch.cat(vid_embed_arr)
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.  This usually means that this function requires a non-empty list of Tensors.  Available functions are [CPU, CUDA, QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

CPU: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/build/aten/src/ATen/RegisterCPU.cpp:5925 [kernel]
CUDA: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/build/aten/src/ATen/RegisterCUDA.cpp:7100 [kernel]
QuantizedCPU: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/build/aten/src/ATen/RegisterQuantizedCPU.cpp:641 [kernel]
BackendSelect: fallthrough registered at /opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Named: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
AutogradOther: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCPU: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradCUDA: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradXLA: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradNestedTensor: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse1: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse2: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
AutogradPrivateUse3: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/VariableType_2.cpp:9122 [autograd kernel]
Tracer: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/torch/csrc/autograd/generated/TraceType_2.cpp:10525 [kernel]
Autocast: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/autocast_mode.cpp:254 [kernel]
Batched: registered at /opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/BatchingRegistrations.cpp:1016 [backend fallback]
VmapMode: fallthrough registered at /opt/conda/conda-bld/pytorch_1616554800319/work/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

My configs/video_feat_extract.json file is as below:

{
    "name": "VideoDirectoryFeatureExtraction",
    "n_gpu": 4,
    "arch": {
        "type": "FrozenInTime",
        "args": {
            "video_params": {
                "model": "SpaceTimeTransformer",
                "arch_config": "base_patch16_224",
                "num_frames": 4,
                "pretrained": true,
                "time_init": "zeros"
            },
            "text_params": {
                "model": "distilbert-base-uncased",
                "pretrained": true,
                "input": "text"
            },
            "projection": "minimal",
            "load_checkpoint" : "/mnt/data/ubuntuuser/frozen-in-time/cc-webvid2m-4f_stformer_b_16_224.pth"
        }
    },
    "data_loader": {
        "type": "TextVideoDataLoader",
        "args":{
            "dataset_name": "ImageDirectory",
            "data_dir": "/mnt/data/ubuntuuser/text_video_retrieval/shakespearevideos",
            "shuffle": true,
            "num_workers": 16,
            "batch_size": 32,
            "split": "test",
            "subsample": 1,
            "text_params": {
                "input": "text"
            },
            "video_params": {
                "input_res": 224,
                "num_frames": 4
            }
        }
    },
    "optimizer": {
        "type": "AdamW",
        "args":{
            "lr": 3e-5
        }
    },
    "loss": {
        "type": "NormSoftmaxLoss",
        "args": {
        }
    },
    "metrics": [
        "t2v_metrics",
        "v2t_metrics"
     ],
    "trainer": {
        "epochs": 100,
        "max_samples_per_epoch": 9000,
        "save_dir": "/mnt/data/ubuntuuser/frozen-in-time",
        "save_period": 5,
        "verbosity": 2,
        "monitor": "min val_loss_0",
        "early_stop": 10,
        "neptune": true
    },
    "visualizer": {
        "type": "",
        "args": {
        }
    }
}

Evaluation for MSVD

Dear Sir:

We didn't find the MSVD evaluation code on GitHub. Did you treat all the provided caption-video pairs as separate instances for the MSVD evaluation?

Thank you!

"img should be PIL Image" when fine-tuning on MSR-VTT

I got the following error when trying to run python train.py --config configs/msrvtt_4f_i21k.json (as in the README):

  File "***/base/base_dataset.py", line 107, in __getitem__
    imgs = self.transforms(imgs)
  File "***/envs/frozen/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 60, in __call__
    img = t(img)
  File "***/envs/frozen/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 195, in __call__
    return F.resize(img, self.size, self.interpolation)
  File "***/envs/frozen/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 229, in resize
    raise TypeError('img should be PIL Image. Got {}'.format(type(img)))
TypeError: img should be PIL Image. Got <class 'torch.Tensor'>

(I set up the env as described in the README)

Seems like the frames are obtained as torch tensors but then the transforms need a PIL Image:

frames = torch.stack([frames[idx] for idx in frame_idxs]).float() / 255
frames = frames.permute(0, 3, 1, 2)
return frames, frame_idxs

frames = torch.stack(frames).float() / 255
cap.release()
return frames, success_idxs

If I add a transforms.ToPILImage() before (and a transforms.ToTensor() after) in here

'val': transforms.Compose([
    transforms.Resize(center_crop),
    transforms.CenterCrop(center_crop),
    transforms.Resize(input_res),
    normalize,
]),

it still doesn't work because it needs an image, not multiple images. It also makes me think that the transforms actually won't work when having multiple PIL images.

Seems like the transforms are the incorrect ones? Or am I missing something?
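For what it's worth, one possible workaround sketch (assuming torchvision >= 0.8, whose Resize/CenterCrop/Normalize accept tensors directly; sizes are placeholders and this is not the repo's official fix) that transforms a (T, C, H, W) clip without going through PIL:

import torch
from torchvision import transforms

val_transform = transforms.Compose([
    transforms.Resize(256),        # placeholder for center_crop
    transforms.CenterCrop(256),
    transforms.Resize(224),        # placeholder for input_res
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

clip = torch.rand(4, 3, 360, 640)                     # stand-in for decoded frames in [0, 1]
clip = torch.stack([val_transform(f) for f in clip])  # apply the transform per frame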

About Curriculum Learning

Thanks for your great work! Here are some questions about curriculum learning.

When fine-tuning from 1 frame to 4 frames,

  • do we need to interpolate the temporal position embedding ([1, dim] => [4, dim])? Since an image is treated as a 1-frame video, if the temporal position embedding is interpolated, how can we add it to the image?
  • should we use the same hyperparameters (e.g., learning rate, epochs, warmup)?

Question about pretrained models?

Hello, I have downloaded the .tar file with the pretrained models but I cannot understand how they are organised.

From the .tar archive I extracted something like

data
data.pkl
version

It seems that

  • data is a directory with many binary files in it
  • data.pkl - not understood
  • version - text file with "3" written there

Can you please help me understand:

  • What should the content of that .tar file be?
  • Which pretrained model's weights are actually provided? (i.e., pretrained / finetuned on which dataset?)

Thanks for your effort and congrats on the brilliant work!

pre-process mp4

Hi, I would like to know the best way you recommend to pre-process an mp4 video so it works best with your project.

Providing dataset

Hello! Thank you for the very interesting work. Could you please provide the WebVid-2M dataset?

Code/template for the demo?

Awesome project and great work! I was wondering if the code for the video search demo is available or could be made available? Would be very nice to have even just for debugging the process of fine tuning your model on a different dataset.

Off-by-one issues with the frame sampling

I think there may be two off-by-one issues with the frame sampling. I'm not so sure about them and would prefer to discuss first, which is why I'm not sending a patch.

For the first one, this is the part of the code:

intervals = np.linspace(start=0, stop=vlen, num=acc_samples + 1).astype(int)
ranges = []
for idx, interv in enumerate(intervals[:-1]):
    ranges.append((interv, intervals[idx + 1] - 1))

I think it should be:

np.linspace(start=0, stop=vlen - 1, ...)

(with a - 1)

and:

ranges.append((interv, intervals[idx + 1]))

(without the - 1).

Otherwise, the right part of each bucket is ignored. For the uniform case, instead of taking the interval centroid ((a+b)/2), it takes (a+b-1)/2. This isn't a big deal though.

For the second one, I think the random choice interval end should be + 1. When it does random.choice(range(...)) (which btw could be a random.randrange), the range excludes the stop value, so there's another - 1 hidden there.

For example, in the training video "1013731484", which has only one frame according to Decord, for random it'd be:

intervals == [0, 1]
ranges = [(0, 0)]
random.choice(range(0, 0)) == random.choice([]) <- exception

And it fails silently, assigning all frames to black. Note this one also isn't a big deal, as it only fails for a few videos; for the rest, all the intervals are shifted or something like that.
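To make the proposal concrete, here is a sketch of the sampling logic with both fixes applied (my reading of the proposed change, not the repo's code):

import random
import numpy as np

def sample_frame_idxs(num_frames, vlen, sample="rand"):
    acc_samples = min(num_frames, vlen)
    # Buckets now cover [0, vlen - 1] and keep their right endpoints.
    intervals = np.linspace(start=0, stop=vlen - 1, num=acc_samples + 1).astype(int)
    ranges = [(intervals[i], intervals[i + 1]) for i in range(acc_samples)]
    if sample == "rand":
        # randint is inclusive on both ends, so a 1-frame video gives
        # randint(0, 0) == 0 instead of choosing from an empty range.
        return [random.randint(lo, hi) for lo, hi in ranges]
    # uniform: take the bucket centroid
    return [(lo + hi) // 2 for lo, hi in ranges]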

How about initializing parameters with a CLIP model?

I think SpaceTimeTransformer could be used to extend the CLIP model to process videos.

For example, 'openai/clip-vit-base-patch32' is based on a text transformer and a ViT backbone.
I am trying this: the text_model part can be used directly, but the ViT part has different variable names and seems hard to align.

Bad file descriptor

More often than not, I'm getting the following exception while training (e.g., training on MSR-VTT):

Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File "***/miniconda3/envs/frozen/lib/python3.7/multiprocessing/queues.py", line 232, in _feed
    close()
  File "***/miniconda3/envs/frozen/lib/python3.7/multiprocessing/connection.py", line 177, in close
    self._close()
  File "***/miniconda3/envs/frozen/lib/python3.7/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "***/miniconda3/envs/frozen/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "***/miniconda3/envs/frozen/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "***/miniconda3/envs/frozen/lib/python3.7/multiprocessing/queues.py", line 263, in _feed
    queue_sem.release()
ValueError: semaphore or lock released too many times

Do you have any idea of what's going on?

The training still continues and finishes (with repeated errors reported); I'm not sure whether it trains correctly, though.

CC3M dataset broken

Hello, I find that a large portion of CC3M is missing due to broken URLs, and the dataset size shrinks to 1.9M. May I ask what size of CC3M you used, or do you have a full link to download it?

Cannot reproduce results

Hi,

I am trying to match the results from the original paper (for sanity) and have not been able to. When I use the checkpoint provided in the repo, and the exact environment given in the repo (conda create using the environment.yml), on 1K-A MSRVTT I get [t2v_metrics] R@1: 21.7, R@5: 43.4, R@10 53.4, R@50 78.4MedR: 8, MeanR: 52.0. So, I'm off by ~1.5% on R@1 and by 1 on the median rank.

Any guidance or clarification would be appreciated. I also downloaded MSRVTT fresh from the link in the repo.

data and code release

It was nice to read your paper, and I find this work very interesting.

I am wondering when you will release your data and code?

Reproducibility

Hello,

I was wondering how many epochs of fine-tuning it takes for the model to achieve the best performance reported in the paper for MSR-VTT.

I added the pre-trained checkpoint name to the MSR-VTT config and modified nothing else. The first validation score (zero-shot) is similar to that in the paper, while the validation scores after fine-tuning fluctuated significantly in the first ~20 epochs, and training stopped before 50 epochs (as reported in the paper) due to early stopping.

Could you shed some light on this?

Many thanks!

License ?

Hi @m-bain and @bryant1410
Many thanks for this work!

Can you please specify which license it is released under? As it currently is, without any mention, it cannot be legally used by anyone for anything...

The original pytorch-template is under MIT, so I guess this would be an acceptable choice. But of course it's up to you to decide.

Frame sampling in test phase

Hi, I am confused about the description of frame sampling at test time: 'The values for i are determined using a stride S, resulting in an array of video embeddings v = [v_0, v_S, v_2S, v_M].'
Could you please take MSR-VTT as an example to show how frames are sampled in the testing phase? Thanks a lot.

CC3M data error

Files downloaded from the given CC3M link do not match this code.
They raise an error when read with pandas:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 43 fields in line 23, saw 45

Can you provide the correct version of the CC3M data?
