merlot's People

Contributors

gloriaximinglu, rowanz

merlot's Issues

How to download the pre-trained model

Thank you for your work.
I have a question about how to download the linked model (gs://merlot/checkpoint_4segments/).
It doesn't seem to open in a browser.
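A gs:// URL points at a Google Cloud Storage bucket rather than a web page, so it will not open in a browser. A minimal sketch of one way to list and fetch the files from Python, assuming the bucket is publicly readable and using TensorFlow's gfile utilities (the copied filename is a hypothetical placeholder):

```python
import tensorflow as tf

CKPT_DIR = "gs://merlot/checkpoint_4segments"

# List everything stored under the checkpoint directory.
for path in tf.io.gfile.glob(CKPT_DIR + "/*"):
    print(path)

# Copy one file to local disk (this filename is illustrative only).
tf.io.gfile.copy(CKPT_DIR + "/checkpoint", "./checkpoint", overwrite=True)
```

The gsutil command-line tool can do the same from a shell, e.g. `gsutil -m cp -r gs://merlot/checkpoint_4segments .`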

Fine-tune on TVQA dataset

Thank you very much for your work. Could you release the code for fine-tuning on the TVQA dataset?

Is finetuned checkpoint on VCR available?

Hi, it seems that this repo has released the pretrained checkpoints.
Is the finetuned checkpoint for the VCR task also available?
I also wonder approximately how many hours, and at what cost, it took to finetune on VCR using the current TPU setup.
Thank you!

Question about the MERLOT model

Dear Rowan,
Hi, I noticed this paper recently and think it is of great value. I understand nearly all of the details except the model. I know the details are in the code, but I am not familiar with TensorFlow; an explanation would make the code much easier for me to follow. Could you answer my questions when you have time?

1. What does "chunk" mean in the code? Does it represent the maximum number of segments a video has been split into?
2. In Section 3.2, you say that MERLOT takes multiple unordered video frames as input, but in the Joint Vision-Language Encoder part you say that position embeddings are added to the vision components. Do you mean that, when fed into the model, an image and its corresponding sentence share the same position embedding?
3. In Section 3.3, the Temporal Reordering part: I understand the core idea, but I am not sure about the method. Is it correct that you randomly choose i frames and then replace the position embeddings of those frames with the same embedding, [image_unk_0]? (See the sketch below.)

Best regards,
Zihao
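
For question 3, a minimal sketch of the mechanism as the question describes it, i.e. chosen frames get an "unknown" position embedding in place of the ordered one. This illustrates that reading only and is not necessarily the authors' exact implementation; all names and shapes here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

num_frames, dim = 8, 16
frame_feats = rng.normal(size=(num_frames, dim))   # per-frame visual features
pos_emb = rng.normal(size=(num_frames, dim))       # ordered position embeddings
image_unk_0 = rng.normal(size=(dim,))              # shared "unknown" embedding (hypothetical)

# Randomly choose i frames whose temporal position will be hidden.
i = 3
scrambled = rng.choice(num_frames, size=i, replace=False)

inputs = frame_feats + pos_emb
# Swap the ordered position embedding of each chosen frame for the shared
# unknown embedding; a reordering head must then recover the true order
# from frame content alone.
inputs[scrambled] = frame_feats[scrambled] + image_unk_0
```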

Question on the definition of visually "ungrounded" categories

I agree that some categories may not provide enough aligned vision-language information for multi-modal learning. However, in the paper you mention "video game commentaries" as an example.

I wonder why this category is considered not visually grounded; the commentary is usually related to what happens in the game. In my opinion, it could be filtered out only for its unreality, meaning it may not benefit downstream tasks.

Issue on model scalability due to segment-level positional embeddings

I notice that MERLOT adopts segment-level positional embeddings. However, there are only 16 segments during pre-training.
For longer videos, e.g., movies, 16 segments are not enough to encode their information. Specifically, I have two questions:

  1. How to extract features for extremely long videos like movies?
  2. How about using fixed positional embeddings instead of learned ones?
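
As background for question 2, a minimal sketch of standard fixed sinusoidal position embeddings (as in "Attention Is All You Need"), which extend to an arbitrary number of segments without retraining. This is a generic illustration, not MERLOT's implementation:

```python
import numpy as np

def sinusoidal_position_embeddings(num_positions: int, dim: int) -> np.ndarray:
    """Fixed sin/cos embeddings; valid for any num_positions at inference."""
    positions = np.arange(num_positions)[:, None]                   # (P, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions * freqs                                      # (P, dim/2)
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

# 64 segments at inference time, even if only 16 were seen in pre-training.
print(sinusoidal_position_embeddings(64, 768).shape)  # (64, 768)
```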

Question on fair comparison with Conceptual ∪ COCO

Thanks for the great work. I have a question on fair comparison with Conceptual ∪ COCO.

In the experiments on dataset sources, you compare against a model trained on the Conceptual ∪ COCO datasets. Regarding fairness, you mention:

for a fair comparison, we train for the same number of steps as 5 epochs on our dataset.

However, 5 epochs means your model has seen all 180M segment-transcript pairs; as you mention in the paper, this leads to fewer overfitting issues.

I think the proper way would be to train your model on only 3M segment-transcript pairs / 3M videos (see the rough calculation below).
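
To make the scale mismatch concrete, a rough back-of-the-envelope calculation; the batch size and caption counts below are illustrative assumptions, not numbers from the paper:

```python
# Illustrative numbers only.
batch_size = 1024
yt_pairs = 180_000_000       # YT-Temporal-180M segment-transcript pairs
cc_coco_pairs = 3_900_000    # ~3.3M Conceptual + ~0.6M COCO captions (approx.)

steps = 5 * yt_pairs // batch_size             # steps for 5 epochs on YT-Temporal
epochs_cc = steps * batch_size / cc_coco_pairs # epochs that budget buys on CC ∪ COCO
print(f"{steps:,} steps ≈ {epochs_cc:.0f} epochs over Conceptual ∪ COCO")
# The same step budget corresponds to hundreds of passes over the smaller
# dataset, so each caption is revisited far more often.
```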

YT-Temporal-180M video dataset

Hi Rowanz,

Thanks for your great work and contribution on MERLOT and the YT-Temporal-180M dataset!
Will you release the YT-Temporal-180M video dataset? If possible, could you provide the text annotations?

Thanks

Access to Video Dataset

Hi Rowan,

Congratulations on your work and contribution. Will you release the YT-Temporal-180M video dataset? I'd like to get access to it.

Thanks,
Hongwei

Fine-tuning on VCR dataset

Thanks for your great work. Are you planning to release the code for fine-tuning on the VCR task? I would appreciate it if you could also release the code for data processing and data loading.

Access to Video Dataset

Hi Rowanz,

Thanks for your work and contribution.
Will you release the YT-Temporal-180M video dataset? I'd like to get access to it.

I have already emailed you, so please check your email!

Thanks,
Shinyeong

Access to Video Data

Thanks for your work. I was also wondering how I can access the video data. Could you kindly tell me how to access the video dataset? My email address is [email protected].

Code for preprocessing raw video data

Hi,
I can't find the code for preprocessing raw videos, or the metadata for the raw videos. Could you please help me find them?
It would also be really nice if you provided the crawler code for videos and captions. (A generic preprocessing sketch follows below.)
Thanks!
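
Since the repo's preprocessing code isn't linked from this thread, here is a minimal sketch of one common way to turn a raw video into evenly spaced frames, using ffmpeg through Python's subprocess module. This is a generic approach, not the authors' actual pipeline, and the paths are hypothetical:

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: float = 1.0) -> None:
    """Dump one frame per 1/fps seconds as JPEGs (generic, not MERLOT's code)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", f"fps={fps}",           # sample frames at the requested rate
         "-q:v", "2",                   # high JPEG quality
         f"{out_dir}/frame_%05d.jpg"],
        check=True,
    )

extract_frames("example_video.mp4", "frames/example_video", fps=0.5)
```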

How to access the video dataset

Thanks for your work. I was also wondering how I can access the video dataset. Could you kindly tell me how to get access, please?

Access to video dataset?

Hi Rowan,

Congratulations on your work; it is indeed a very interesting contribution. I was wondering how I could get access to the video dataset you used in your experiments.

Thanks,
Alessandro

Running finetuning on GPU

Thanks for releasing your great work. Is there a way to run the finetuning and zero-shot inference code on a GPU rather than a TPU? What kind of adjustments would I need to make?
Thanks
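
In generic TensorFlow 2.x code, the usual adjustment is to swap the TPU distribution strategy for a GPU one and reduce the batch size to fit GPU memory. A minimal sketch of that pattern; this is a generic recipe, not this repo's actual training script:

```python
import tensorflow as tf

def get_strategy(use_tpu: bool, tpu_name=None):
    """Pick a distribution strategy for TPU or (multi-)GPU (generic sketch)."""
    if use_tpu:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_name)
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        return tf.distribute.TPUStrategy(resolver)
    # Falls back to all visible GPUs on this machine.
    return tf.distribute.MirroredStrategy()

strategy = get_strategy(use_tpu=False)
with strategy.scope():
    # Build and compile the model inside the strategy scope as usual.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```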
