rowanz / merlot
MERLOT: Multimodal Neural Script Knowledge Models
License: MIT License
Thank you for your work.
I have a question about how to download the linked model (gs://merlot/checkpoint_4segments/).
It doesn't seem to open in a browser.
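A `gs://` address is a Google Cloud Storage URL rather than a web link, which is why a browser cannot open it directly. Below is a minimal sketch of one way to fetch such a prefix with the `gsutil` command-line tool. It assumes the path is `gs://merlot/checkpoint_4segments/` (without the space), that `gsutil` is installed, and that the bucket is readable from your account; the helper names `split_gs_url` and `gsutil_copy_command` are hypothetical, not part of the repo.

```python
from urllib.parse import urlparse


def split_gs_url(gs_url: str) -> tuple[str, str]:
    """Split a gs:// URL into (bucket, object prefix)."""
    parsed = urlparse(gs_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {gs_url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def gsutil_copy_command(gs_url: str, local_dir: str) -> list[str]:
    """Build a recursive, parallel gsutil copy command for a bucket prefix."""
    split_gs_url(gs_url)  # validation only; raises on malformed input
    return ["gsutil", "-m", "cp", "-r", gs_url, local_dir]


if __name__ == "__main__":
    # Assumed path; adjust to the checkpoint you actually need.
    url = "gs://merlot/checkpoint_4segments/"
    print(split_gs_url(url))
    print(" ".join(gsutil_copy_command(url, "./checkpoints")))
```

The returned list can be passed to `subprocess.run(..., check=True)` to perform the download, or the printed command can be pasted into a shell.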
Thank you very much for your work. May I ask if you can release the code for fine-tuning on the TVQA dataset?
Hi, it seems that this repo has released the pretrained checkpoints.
Is the finetuned checkpoint on the VCR task also available?
I also wonder approximately how many hours it took, and how much it cost, to finetune for VCR using the current TPU setup.
Thank you!
Dear Rowan,
Hi, I recently noticed this paper and think it is of great value. I understand nearly all of its details except the model. I know the details are in the code, but I am not familiar with TensorFlow; if you could explain them to me, the code would be much easier to follow. Could you answer my questions when you have time?
1. What does chunk mean in the code? Does it represent the maximum number of segments a video is split into?
2. In 3.2, you said that MERLOT takes multiple unordered video frames as input, but in the Joint Vision-Language Encoder part, you say that position embeddings are added to the vision components. Do you mean that, when fed into the model, an image and its corresponding sentence have the same position embedding?
3. In 3.3, in the Temporal Reordering part, I understand the core idea, but I am not sure about the method. Is it correct that you randomly choose i frames and then replace the position embeddings of these frames with the same embedding [image_unk_0]?
Best regards,
Zihao
I agree that some categories may not provide enough aligned vision-language information for multi-modal learning. However, in the paper, you mentioned "video game commentaries" as an example.
I wonder why it is not considered visually grounded. People's comments are usually related to the games. In my opinion, the only reason to filter this category is that it is not realistic, which means it may not benefit downstream tasks.
I notice that MERLOT adopts segment-level positional embeddings. However, there are only 16 segments during pre-training.
For longer videos, e.g., movies, 16 segments are not enough to encode their information. Specifically, I have two questions:
Hi,
Congrats on the impressive work. I was just wondering whether you have a rough estimate of the disk space required to host the YT-Temporal-180M dataset. Sorry if I missed this information in the manuscript.
Thanks.
Thanks for the great work. I have a question about the fair comparison with Conceptual ∪ COCO.
In the experiments on dataset source, you compared against a model trained on the Conceptual ∪ COCO datasets, and you mentioned:
"for a fair comparison, we train for the same number of steps as 5 epochs on our dataset."
However, 5 epochs means the model has seen all 180M segment-transcript pairs. As you mentioned in the paper, there will be fewer overfitting issues.
I think the proper way would be to train your model on 3M segment-transcript pairs / 3M videos.
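To make the concern concrete, here is a rough back-of-the-envelope sketch of how many passes a smaller dataset receives under an equal step budget. It assumes the same batch size in both runs and takes the 3M figure from this post as the size of the smaller dataset; both are assumptions for illustration, not numbers confirmed by the paper. The function name `equivalent_epochs` is hypothetical.

```python
def equivalent_epochs(reference_examples: int, reference_epochs: int,
                      target_examples: int) -> float:
    """Passes over the target dataset when it is trained for as many examples
    as `reference_epochs` passes over the reference dataset (equal batch size)."""
    total_examples_seen = reference_examples * reference_epochs
    return total_examples_seen / target_examples


if __name__ == "__main__":
    yt_temporal_pairs = 180_000_000   # segment-transcript pairs, from the post
    smaller_dataset = 3_000_000       # assumed size, taken from the post's "3M"
    print(equivalent_epochs(yt_temporal_pairs, 5, smaller_dataset))  # 300.0
```

Under these assumptions the smaller dataset would be repeated about 300 times, while each YT-Temporal pair is seen only 5 times, which is the asymmetry the question is pointing at.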
Hi Rowanz,
Thanks for your great work and contribution on MERLOT and the YT-Temporal-180M dataset!
Will you release the YT-Temporal-180M video dataset? If possible, can you provide us with the text annotation?
Thanks
Files 050, 073, and 098 are corrupted. When I decode them using jsonline, the following error occurred:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 6778: invalid continuation byte
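One way to narrow this down is to read the file as raw bytes and decode it line by line, so a single bad byte does not abort the whole scan. The sketch below assumes the files are plain-text JSON Lines (one JSON object per line); `find_bad_lines` is a hypothetical helper, not something shipped with the dataset.

```python
import json
from typing import Iterator, Tuple


def find_bad_lines(path: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, reason) for lines that fail UTF-8 or JSON decoding."""
    with open(path, "rb") as f:  # bytes mode: decoding errors stay per-line
        for line_number, raw in enumerate(f, start=1):
            if not raw.strip():
                continue  # ignore blank lines
            try:
                json.loads(raw.decode("utf-8"))
            except UnicodeDecodeError as err:
                yield line_number, f"invalid UTF-8 at byte {err.start}"
            except json.JSONDecodeError as err:
                yield line_number, f"invalid JSON: {err.msg}"


if __name__ == "__main__":
    import sys

    for number, reason in find_bad_lines(sys.argv[1]):
        print(f"line {number}: {reason}")
```

If only a handful of lines are reported, they can be skipped or re-decoded with `errors="replace"`; if nearly every line fails, the file is more likely truncated or not plain UTF-8 text at all.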
Hi Rowan,
Congrats on your work and contribution. Will you release the YT-Temporal-180M video dataset? I'd like to get access to it.
Thanks,
Hongwei
Thanks for your great work. Are you planning to release the code to fine-tune on the VCR task? I would appreciate it if you could release the code for data processing and data loading.
Hi Rowanz,
Thanks for your work and contribution.
Will you release the YT-Temporal-180M video dataset? I'd like to get access to it.
I already emailed you, so please check your email!
Thanks,
Shinyeong
Thanks for your work. I was also wondering how I can access the video data. Could you kindly send me the way to access the video dataset? My email address is [email protected].
Hi,
I can't find the code for preprocessing raw videos or the metadata for the raw videos. Could you please help me find them?
By the way, it would be really nice if you provided the crawler code for videos and captions.
Thanks!
Thanks for your work. I was also wondering how I can access the video dataset. Could you kindly send me the way to access it, please?
Hi Rowan,
Congrats on your work; it is indeed a very interesting contribution. I was wondering how I could get access to the video dataset that you used in your experiments.
Thanks,
Alessandro
Thanks for releasing your great work. I was wondering if there is a way to run the finetuning and zero-shot inference code on GPU rather than TPU. What kind of adjustments would I need to make?
Thanks