sail-sg / ptp
[CVPR2023] The code for "Position-guided Text Prompt for Vision-Language Pre-training"
Home Page: https://arxiv.org/abs/2212.09737
License: Apache License 2.0
Hello, thanks for sharing this great work!
As we can see in Eq. (1), the object tag is produced by an argmax operation, while the paper says in Sec. 3.1.2 that "we select one O at random for each time".
So I have a doubt: once the object tag is determined by argmax, how is the following situation handled? ("For a certain P, we may have various options for O because the block may contain multiple objects.")
Looking forward to your reply!
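The two selection rules being compared can be sketched as follows. This is a minimal, hypothetical example (the object tags and confidence scores are illustrative, not taken from the repo): `argmax` deterministically picks the highest-confidence detection in a block, while the paper's description samples one object uniformly at random.

```python
import random

# Hypothetical detections inside one position block P:
# each entry is (object_tag, confidence). Values are illustrative only.
block_objects = [("dog", 0.92), ("frisbee", 0.81), ("grass", 0.40)]

# Eq. (1) behaviour: deterministic argmax over detection confidence.
tag_argmax = max(block_objects, key=lambda o: o[1])[0]

# Sec. 3.1.2 behaviour: "select one O at random for each time".
tag_random = random.choice(block_objects)[0]

print(tag_argmax)  # dog
print(tag_random)  # any of the three tags, varying across calls
```

Under the argmax rule a block with multiple objects always yields the same tag; under random selection the tag varies across epochs, which is exactly the ambiguity the question raises.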
Thanks😁!
I can only find the BLIP-4M checkpoint. I think the 14M one would work better and would like to use it. Could you please provide it?
Hello,
I'm currently in the process of re-implementing ptp-blip for my research in the field. However, I encountered an issue during the second step mentioned in ptp/DATASET.md, "2. Download/Prepare Corpus (image-text pair)." Following the instructions provided, I was able to download object features for COCO2014 train/val/test, COCO2015 test, CC3M, and SBU using the provided download_cc3m_predictions.sh script. However, I couldn't find a download link for Visual Genome (VG) and am still searching for it.
If you happen to know the download link for VG, or if image features for VG are not required separately, it would be immensely helpful if you could let me know. Once again, I want to express my gratitude for the excellent research work done with ptp.
Thank you!
Hello, I saw the results of your paper and they were truly outstanding.
I have a few questions.
Thank you😁!
Hi @FingerRec, @panzhous
Thank you for your great work. This work can create a significant impact in the VLP field.
I want to ask these questions regarding this work:
Hello, could you please provide PTP-CLIP checkpoint?
Hello, first of all, thank you for sharing great work!
I'm trying to use your work, but I have some uncertainties about the dataset.
In your Dataset.md, you note that the 4M dataset is cleaned following BLIP.
Does this mean your 4M dataset is filtered and captions are synthetically generated, as BLIP did?
Moreover, in Tables 2 and 6, the PTP-BLIP scores seem to differ.
What is the difference between these two scores?
Thank you
Hi,
Congratulations on the great success of your wonderful work! I have several questions about ptp regarding the pretraining/finetuning settings described in the paper. The questions are as follows:
Looking forward to your reply!
Hi, I am interested in your method and have found that the source code is incomplete. Could you upload a new version of the code?
I am looking forward to your reply. Thank you very much.
Hello,
I've achieved good results with this work and am planning further research based on it. However, I'm currently facing challenges with the pre-training process using the provided code.
Firstly, I would like to know whether the "4M_corpus.tsv" file provided on GitHub is the same dataset used in the paper. This file appears to contain a total of 5 million image-text pairs, which differs from the pre-training log you provided.
[Count of image-text pairs in "4M_corpus.tsv"]
On the other hand, the pre-training log for ptp-blip shows the following:
(ptp-blip pre training log: https://huggingface.co/sail/PTP/blob/main/4M_pretrain.txt)
In reality, when I trained using the "4M_corpus.tsv" your team provided, the total count of image-text pairs exceeded 5 million. We conducted pre-training with the same experimental setup (as specified in pretrain_concated_pred_4M.yaml). However, we encountered gradient explosion, as shown in the image.
Our setup uses four A6000 GPUs with a batch size of 75 per GPU, for a total batch size of 300 per step (versus 600 in the paper). This configuration led to gradient explosion, halting the progress of training.
We attempted to address this by using gradient accumulation to match the paper's setup, keeping the effective batch size at 600 per step. However, the gradients still exploded.
The main cause of the explosion seems to be the "ita" loss: it was unstable and did not decrease consistently. While the language modeling (LM) loss decreased steadily, the unstable behavior of the "ita" loss suggests potential issues with the image data.
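The batch-size arithmetic behind the accumulation attempt can be sketched in a framework-agnostic way. This is a minimal illustration (the function name is hypothetical, not from the repo): with 4 GPUs at 75 samples each, two micro-batches must be accumulated per optimizer step to reach the paper's effective batch size of 600.

```python
def accumulation_steps(target_batch: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Number of micro-batches to accumulate before each optimizer step
    so that the effective batch size matches target_batch."""
    per_step = per_gpu_batch * num_gpus
    if target_batch % per_step != 0:
        raise ValueError("target batch must be a multiple of the per-step batch")
    return target_batch // per_step

# Setup from this issue: 4x A6000, 75 samples per GPU, paper uses 600.
steps = accumulation_steps(target_batch=600, per_gpu_batch=75, num_gpus=4)
print(steps)  # 2
```

Note that accumulation matches the effective batch size but not the batch statistics of a true 600-sample step: a contrastive loss like "ita" sees only the in-micro-batch negatives (300 here) unless negatives are gathered across accumulation steps, which may explain why matching the batch size alone did not stabilize training.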
If you have any insights or advice regarding the potential causes of the loss explosion during my pre-training, I would greatly appreciate your guidance.