merlot's People

Contributors

gloriaximinglu, rowanz

merlot's Issues

How to download the pre-trained model

Thank you for your work.
I have a question about how to download the linked model (gs://merlot/checkpoint_4segments/).
It doesn't seem to open in a browser.
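A gs:// URL points at a Google Cloud Storage bucket rather than a web page, so it will not open in a browser. A minimal sketch of one way to list and fetch the files from Python, assuming the bucket is publicly readable and using TensorFlow's gfile utilities (the copied filename is a hypothetical placeholder):

```python
import tensorflow as tf

CKPT_DIR = "gs://merlot/checkpoint_4segments"

# List everything stored under the checkpoint directory.
for path in tf.io.gfile.glob(CKPT_DIR + "/*"):
    print(path)

# Copy one file to local disk (this filename is illustrative only).
tf.io.gfile.copy(CKPT_DIR + "/checkpoint", "./checkpoint", overwrite=True)
```

The gsutil command-line tool can do the same from a shell, e.g. `gsutil -m cp -r gs://merlot/checkpoint_4segments .`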

Fine-tune on TVQA dataset

Thank you very much for your work. Could you release the code for fine-tuning on the TVQA dataset?

Is finetuned checkpoint on VCR available?

Hi, it seems that this repo has released the pretrained checkpoints.
Is the finetuned checkpoint for the VCR task also available?
I also wonder approximately how many hours, and at what cost, it took to finetune on VCR using the current TPU setup.
Thank you!

Question about the MERLOT model

Dear Rowan,
Hi, I noticed this paper recently and think it is of great value. I understand nearly all of the details except the model. I know the details are in the code, but I am not familiar with TensorFlow; an explanation would make the code much easier for me to follow. Could you answer my questions when you have time?

1. What does "chunk" mean in the code? Does it represent the maximum number of segments a video has been split into?
2. In Section 3.2, you say that MERLOT takes multiple unordered video frames as input, but in the Joint Vision-Language Encoder part you say that position embeddings are added to the vision components. Do you mean that, when fed into the model, an image and its corresponding sentence share the same position embedding?
3. In Section 3.3, the Temporal Reordering part: I understand the core idea, but I am not sure about the method. Is it correct that you randomly choose i frames and then replace the position embeddings of those frames with the same embedding, [image_unk_0]? (See the sketch below.)

Best regards,
Zihao
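
For question 3, a minimal sketch of the mechanism as the question describes it, i.e. chosen frames get an "unknown" position embedding in place of the ordered one. This illustrates that reading only and is not necessarily the authors' exact implementation; all names and shapes here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

num_frames, dim = 8, 16
frame_feats = rng.normal(size=(num_frames, dim))   # per-frame visual features
pos_emb = rng.normal(size=(num_frames, dim))       # ordered position embeddings
image_unk_0 = rng.normal(size=(dim,))              # shared "unknown" embedding (hypothetical)

# Randomly choose i frames whose temporal position will be hidden.
i = 3
scrambled = rng.choice(num_frames, size=i, replace=False)

inputs = frame_feats + pos_emb
# Swap the ordered position embedding of each chosen frame for the shared
# unknown embedding; a reordering head must then recover the true order
# from frame content alone.
inputs[scrambled] = frame_feats[scrambled] + image_unk_0
```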

Question on the definition of visually "ungrounded" categories

I agree that some categories may not provide enough aligned vision-language information for multi-modal learning. However, in the paper you mention "video game commentaries" as an example.

I wonder why this category is considered not visually grounded; the commentary is usually related to what happens in the game. In my opinion, it could be filtered out only for its unreality, meaning it may not benefit downstream tasks.

Issue on model scalability due to segment-level positional embeddings

I notice that MERLOT adopts segment-level positional embeddings. However, there are only 16 segments during pre-training.
For longer videos, e.g., movies, 16 segments are not enough to encode their information. Specifically, I have two questions:

  1. How to extract features for extremely long videos like movies?
  2. How about using fixed positional embeddings instead of learned ones?
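
As background for question 2, a minimal sketch of standard fixed sinusoidal position embeddings (as in "Attention Is All You Need"), which extend to an arbitrary number of segments without retraining. This is a generic illustration, not MERLOT's implementation:

```python
import numpy as np

def sinusoidal_position_embeddings(num_positions: int, dim: int) -> np.ndarray:
    """Fixed sin/cos embeddings; valid for any num_positions at inference."""
    positions = np.arange(num_positions)[:, None]                   # (P, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions * freqs                                      # (P, dim/2)
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

# 64 segments at inference time, even if only 16 were seen in pre-training.
print(sinusoidal_position_embeddings(64, 768).shape)  # (64, 768)
```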

Question on fair comparison with Conceptual ∪ COCO

Thanks for the great work. I have a question on fair comparison with Conceptual ∪ COCO.

In the experiments on dataset sources, you compare against a model trained on the Conceptual ∪ COCO datasets. Regarding fairness, you mention:

for a fair comparison, we train for the same number of steps as 5 epochs on our dataset.

However, 5 epochs means your model has seen all 180M segment-transcript pairs; as you mention in the paper, this leads to fewer overfitting issues.

I think the proper way would be to train your model on only 3M segment-transcript pairs / 3M videos (see the rough calculation below).
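
To make the scale mismatch concrete, a rough back-of-the-envelope calculation; the batch size and caption counts below are illustrative assumptions, not numbers from the paper:

```python
# Illustrative numbers only.
batch_size = 1024
yt_pairs = 180_000_000       # YT-Temporal-180M segment-transcript pairs
cc_coco_pairs = 3_900_000    # ~3.3M Conceptual + ~0.6M COCO captions (approx.)

steps = 5 * yt_pairs // batch_size             # steps for 5 epochs on YT-Temporal
epochs_cc = steps * batch_size / cc_coco_pairs # epochs that budget buys on CC ∪ COCO
print(f"{steps:,} steps ≈ {epochs_cc:.0f} epochs over Conceptual ∪ COCO")
# The same step budget corresponds to hundreds of passes over the smaller
# dataset, so each caption is revisited far more often.
```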

YT-Temporal-180M video dataset

Hi Rowanz,

Thanks for your great work and contribution on MERLOT and the YT-Temporal-180M dataset!
Will you release the YT-Temporal-180M video dataset? If possible, could you provide the text annotations?

Thanks

Access to Video Dataset

Hi Rowan,

Congratulations on your work and contribution. Will you release the YT-Temporal-180M video dataset? I'd like to get access to it.

Thanks,
Hongwei

Fine-tuning on VCR dataset

Thanks for your great work. Are you planning to release the code for fine-tuning on the VCR task? I would appreciate it if you could also release the code for data processing and data loading.

Access to Video Dataset

Hi Rowanz,

Thanks for your work and contribution.
Will you release the YT-Temporal-180M video dataset? I'd like to get access to it.

I have already emailed you, so please check your email!

Thanks,
Shinyeong

Access to Video Data

Thanks for your work. I was also wondering how I can access the video data. Could you kindly tell me how to access the video dataset? My email address is [email protected].

Code for preprocessing raw video data

Hi,
I can't find the code for preprocessing raw videos, or the metadata for the raw videos. Could you please help me find them?
It would also be really nice if you provided the crawler code for videos and captions. (A generic preprocessing sketch follows below.)
Thanks!
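
Since the repo's preprocessing code isn't linked from this thread, here is a minimal sketch of one common way to turn a raw video into evenly spaced frames, using ffmpeg through Python's subprocess module. This is a generic approach, not the authors' actual pipeline, and the paths are hypothetical:

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: float = 1.0) -> None:
    """Dump one frame per 1/fps seconds as JPEGs (generic, not MERLOT's code)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", f"fps={fps}",           # sample frames at the requested rate
         "-q:v", "2",                   # high JPEG quality
         f"{out_dir}/frame_%05d.jpg"],
        check=True,
    )

extract_frames("example_video.mp4", "frames/example_video", fps=0.5)
```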

How to access the video dataset

Thanks for your work. I was also wondering how I can access the video dataset. Could you kindly tell me how to get access, please?

Access to video dataset?

Hi Rowan,

Congratulations on your work; it is indeed a very interesting contribution. I was wondering how I could get access to the video dataset you used in your experiments.

Thanks,
Alessandro

Running finetuning on GPU

Thanks for releasing your great work. Is there a way to run the finetuning and zero-shot inference code on a GPU rather than a TPU? What kind of adjustments would I need to make?
Thanks
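
In generic TensorFlow 2.x code, the usual adjustment is to swap the TPU distribution strategy for a GPU one and reduce the batch size to fit GPU memory. A minimal sketch of that pattern; this is a generic recipe, not this repo's actual training script:

```python
import tensorflow as tf

def get_strategy(use_tpu: bool, tpu_name=None):
    """Pick a distribution strategy for TPU or (multi-)GPU (generic sketch)."""
    if use_tpu:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_name)
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        return tf.distribute.TPUStrategy(resolver)
    # Falls back to all visible GPUs on this machine.
    return tf.distribute.MirroredStrategy()

strategy = get_strategy(use_tpu=False)
with strategy.scope():
    # Build and compile the model inside the strategy scope as usual.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```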
