sail-sg / ptp

[CVPR2023] The code for "Position-guided Text Prompt for Vision-Language Pre-training"

Home Page: https://arxiv.org/abs/2212.09737

License: Apache License 2.0

Python 97.73% Shell 2.27%
cross-modality vision-language-pretraining vlp

ptp's Introduction

PTP

This repository contains the implementation of the following method:

Introduction

The goal of Position-guided Text Prompt (PTP) is to bring position information into conventional Vision-Language Pre-training (VLP) models, since current mainstream end-to-end VLP models ignore this important cue.

We observe that position information is missing in well-trained ViLT models.

Our method provides a good alternative to existing object-feature-based methods (BUTD and follow-up works).

An example of a PTP prompt is shown below:
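To make this concrete, here is a minimal sketch of how a position-guided prompt is assembled from a block index P and an object tag O. The template "The block [P] has a [O]" follows the paper; the helper below and the example caption are illustrations only, not code from this repository.

```python
def ptp_prompt(block_index: int, obj: str) -> str:
    """Fill the PTP template "The block [P] has a [O]" for one block/object pair."""
    return f"The block {block_index} has a {obj}."

# One way to use it during pre-training: append the prompt to an ordinary caption.
# (The caption and the block/object values are made up for illustration.)
caption = "a dog catches a frisbee in the park"
caption_with_ptp = caption + ". " + ptp_prompt(3, "dog")
# -> "a dog catches a frisbee in the park. The block 3 has a dog."
```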

Updates

  • 2023.5 Modified the pre-training corpus to avoid confusion.
  • 2023.3 The pre-training code is released.
  • 2023.1 We have put the pretrained and fine-tuned weights on Hugging Face for fast download.
  • 2022.12 The first version of the downstream evaluation code based on BLIP, together with the pretrained/downstream weights, is released! The pre-training code is being cleaned up.

Installation

Please find installation instructions for PyTorch in INSTALL.md.

Dataset Preparation

You may follow the instructions in DATASET.md to prepare the datasets. Since dataset preparation is very time consuming, we provide detailed guidance and release our generated corpus.

Pretrained & Fine-tuned Models

1. Pre-trained Model

| Method | Vision Encoder | #Images | Dataset | Pretrained Weights | Training Logs |
| --- | --- | --- | --- | --- | --- |
| PTP-BLIP | ViT-B (DeiT) | 4M | CC3M+COCO+VG+SBU | link | link |

2. Zero-shot & Fine-tuning Downstream Model

2.1 Captioning

| Method | B@4 | CIDEr | Config |
| --- | --- | --- | --- |
| PTP-BLIP | 40.1 | 135.0 | configs/caption_coco.yaml |

2.2 Zero-shot Retrieval

2.2.1 Flickr30K

| Method | I2T@1 | T2I@1 | Model Weight | Training Logs | Config |
| --- | --- | --- | --- | --- | --- |
| PTP-BLIP | 86.4 | 67.0 | link | link | configs/retrieval_flickr.yaml |

2.3 Retrieval (Fine-tune)

Tip: Use as large a batch size as possible; we find experimentally that a larger batch size leads to better results on this task. Due to memory limitations, we use a batch size of 24 rather than the 28 used in the original implementation.

2.3.1 COCO

| Method | I2T@1 | T2I@1 | Config |
| --- | --- | --- | --- |
| PTP-BLIP | 77.6 | 59.4 | configs/retrieval_coco.yaml |

2.3.2 Flickr30K

| Method | I2T@1 | T2I@1 | Model Weight | Training Logs | Config |
| --- | --- | --- | --- | --- | --- |
| PTP-BLIP | 96.1 | 84.2 | link | link | configs/retrieval_flickr.yaml |

2.4 VQA V2

| Method | Test-dev | Test-std | Model Weight | Training Logs | Config |
| --- | --- | --- | --- | --- | --- |
| PTP-BLIP | 76.02 | 76.18 | link | link | configs/vqa.yaml |

2.5 NLVR

| Method | Dev | Test-P | Model Weight | Training Logs | Config |
| --- | --- | --- | --- | --- | --- |
| PTP-BLIP | 80.45 | 80.70 | link | link | configs/nlvr.yaml |

Quick Start

Follow the example in GETTING_STARTED.md to start playing with VLP models using PTP.

Transfer To Other Architectures

PTP can be transferred to other architectures with little effort. Specifically, modify your base code in the following two steps:

  • Download or generate a corpus in the same format as ours.
  • Modify dataset.py (see the sketch below).

Then train the model with the original objectives.
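Below is a minimal sketch of what the dataset.py change could look like. The class name, entry format, and field names ("image", "caption", "ptp_prompt") are assumptions for illustration, not this repository's exact code; the only PTP-specific step is concatenating the position prompt to the caption before tokenization.

```python
from PIL import Image
from torch.utils.data import Dataset

class PTPCaptionDataset(Dataset):
    """Illustrative image-text dataset that appends a PTP prompt to each caption."""

    def __init__(self, entries, transform):
        # entries: list of dicts with "image", "caption", "ptp_prompt" keys (assumed format)
        self.entries = entries
        self.transform = transform

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        item = self.entries[idx]
        image = self.transform(Image.open(item["image"]).convert("RGB"))
        # PTP-specific change: concatenate the position prompt to the original caption.
        text = item["caption"].rstrip(". ") + ". " + item["ptp_prompt"]
        return image, text
```

The rest of the training loop and objectives stay unchanged, which is why the transfer requires little effort.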

Acknowledgement

This work is mainly based on BLIP and ViLT; thanks to the authors of these strong baselines. We also refer to OSCAR for the ablation study and dataset preparation.

License

PTP is released under the Apache 2.0 license.

Contact

Email: awinyimgprocess at gmail dot com

If you have any questions, please email me or open a new issue.

Citation

If you find our work helpful, please use the following BibTeX entry for citation.

@article{wang2022ptp,
  title={Position-guided Text Prompt for Vision-Language Pre-training},
  author={Wang, Alex Jinpeng and Zhou, Pan and Shou, Mike Zheng and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2212.09737},
  year={2022}
}

ptp's People

Contributors

fingerrec, panzhous

ptp's Issues

the source code is incomplete.

Hi, I am interested in your method and have found that the source code is incomplete. Could you upload a complete version of the code?

I am looking forward to your reply. Thank you very much.

Challenges in Pre-training with ptp-blip Code: Seeking Advice on Loss Explosion

Hello,

I was glad to see that you achieved good results, and I am planning further research based on this work. However, I am currently facing challenges with the pre-training process using the provided code.

Firstly, I would like to know if the "4M_corpus.tsv" file provided on GitHub is the same dataset used in the paper. This file seems to contain a total of 5 million image-text pairs, which differs from the pre-training log you provided.

[Screenshot: count of image-text pairs in "4M_corpus.tsv"]
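For reference, a minimal way to sanity-check the pair count, assuming one image-text pair per line of the TSV and no header row (the exact column layout of 4M_corpus.tsv is not specified here):

```python
# Count image-text pairs in the corpus, assuming one pair per line and no header.
with open("4M_corpus.tsv") as f:
    n_pairs = sum(1 for _ in f)
print(f"{n_pairs} image-text pairs in 4M_corpus.tsv")
```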

On the other hand, the pre-training log for ptp-blip shows the following:

[Screenshot: ptp-blip pre-training log, https://huggingface.co/sail/PTP/blob/main/4M_pretrain.txt]

In reality, when I trained using the "4M_corpus.tsv" provided by your team, the total count of image-text pairs exceeded 5 million. We conducted the pre-training with the same experimental setup (as mentioned in the pretrain_concated_pred_4M.yaml file). However, we encountered the phenomenon of gradient explosion, as shown in the image.
[Screenshot: gradient/loss explosion during pre-training]

Our setup includes four A6000 GPUs, with a batch size of 75 per GPU, resulting in a total batch size of 300 per step (compared to 600 in the paper). However, this configuration led to gradient explosion, hindering the progress of training.

We attempted to address this issue by using gradient accumulation to match the paper's setup, where the batch size remained at 600 per step. However, the gradient still exploded.

The main cause of the explosion seems to be the "ita" loss, as it exhibited instability without a consistent decrease. While the language modeling (LM) loss consistently decreased, the unstable behavior of the "ita" loss indicates potential issues with the image data.

If you have any insights or advice regarding the potential causes of the loss explosion during my pre-training, I would greatly appreciate your guidance.

Pre-extracted Image Features for Visual Genome (VG)

Hello,

I'm currently in the process of re-implementing ptp-blip for my research in the field. However, I encountered an issue during the second step mentioned in ptp/DATASET.md, which is "2. Download/Prepare Corpus (image-text pair)." Following the instructions provided, I was able to download object features for COCO2014 train/val/test, 2015 test, CC3M, and SBU using the provided download_cc3m_predictions.sh script. However, I couldn't find a download link for VG (Visual Genome) and I'm currently searching for it.

If you happen to know the download link for VG or if image features for VG are not required separately, it would be immensely helpful if you could let me know. Once again, I want to express my gratitude for the excellent research work done with ptp.

Thank you!

  • Here is a screenshot from the OSCAR website showing that there is no download link available for VG's image features.

Training time and grid pseudo-label extraction time

Hello, I saw the results of your paper and they were truly outstanding.
I have a few questions.

  1. Could you tell me how long pre-training and fine-tuning take for COCO image-to-text retrieval?
  2. Also, from what I read in your paper, obtaining the grid pseudo labels using CLIP takes around 8 hours. Am I right in understanding that the grid pseudo labels form a corpus extracted to provide positional information through prompts (see the sketch below)?

Thank you😁!
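On the second question above, here is a minimal sketch of how grid pseudo labels could be extracted with CLIP. Everything in it is an assumption made for illustration (a 4x4 grid, a tiny hand-picked object vocabulary, and OpenAI's open-source clip package), not the repository's actual extraction script.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

candidate_objects = ["dog", "cat", "person", "car", "tree"]  # illustrative vocabulary
text_tokens = clip.tokenize([f"a photo of a {o}" for o in candidate_objects]).to(device)

image = Image.open("example.jpg").convert("RGB")
W, H = image.size
grid = 4  # assumed number of blocks per side

pseudo_labels = []
with torch.no_grad():
    text_feat = model.encode_text(text_tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    for row in range(grid):
        for col in range(grid):
            # Crop one block and score it against the candidate objects (argmax over CLIP scores).
            box = (col * W // grid, row * H // grid,
                   (col + 1) * W // grid, (row + 1) * H // grid)
            block = preprocess(image.crop(box)).unsqueeze(0).to(device)
            img_feat = model.encode_image(block)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            obj = candidate_objects[(img_feat @ text_feat.T).argmax().item()]
            pseudo_labels.append((row, col, obj))
```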

Utilize the positional information from PTP

Hi @FingerRec, @panzhous

Thank you for your great work. This work can have a significant impact on the VLP field.
I want to ask a few questions regarding this work:

  1. Given this motivation image and a caption (e.g. "There is a dog on the left"), can we localize the dog's position or predict the dog's mask with your model?
  2. Given this dog's mask (the top-right image) and a caption (e.g. "There is a dog on the left"), can we calculate the cosine similarity between that mask and the caption (the score should be high if the caption refers to that mask, and low otherwise)?
  3. I tried to use the model to calculate the similarity between an image and a text, but the results are not as good as I expected. I do not know whether I did anything wrong. You can check the code here. Here is the result when I run the model with an image of an elephant (a generic scoring sketch follows below).

[Screenshot: similarity results for the elephant image]
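On the third question, scores from a contrastive (ITA/ITC-style) head are usually temperature-scaled cosine similarities between L2-normalized features, so the raw numbers can look less separated than one might expect. Below is a generic scoring sketch; image_embeds, text_embeds, and the temperature value are placeholders, assuming pooled features already produced by the model's vision and text encoders. It is not this repository's actual API.

```python
import torch
import torch.nn.functional as F

def itc_similarity(image_embeds: torch.Tensor, text_embeds: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Cosine similarity between L2-normalized image and text features.

    image_embeds: (N, D) pooled image features; text_embeds: (M, D) pooled text
    features. Returns an (N, M) score matrix; a softmax over the text axis gives
    a per-image distribution over candidate captions.
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return image_embeds @ text_embeds.T / temperature
```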

About the obj tag and text prompt

Hello, thanks for sharing this great work!

As we can see in Eq. (1), the object tag is produced by an argmax operation, while the paper states "we select one O at random for each time" in Sec. 3.1.2.
So I have a doubt: once the object tag is determined, how is the following situation handled? ("For a certain P, we may have various options for O because the block may contain multiple objects.")

Looking forward to your reply!
Thanks😁!
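Regarding the argmax-vs-random question above, here is a tiny illustration of the two readings: one object per block chosen either by Eq. (1)-style argmax or by Sec. 3.1.2-style random sampling. The function name and the score dictionary are assumptions for illustration, not the repository's code.

```python
import random

def pick_object(block_scores: dict[str, float], mode: str = "random") -> str:
    """Choose the object tag O for one block P.

    block_scores maps each candidate object found in the block to its score.
    """
    if mode == "argmax":
        return max(block_scores, key=block_scores.get)  # Eq. (1): highest-scoring object
    return random.choice(list(block_scores))            # Sec. 3.1.2: pick one O at random

# e.g. pick_object({"dog": 0.92, "frisbee": 0.31}, mode="argmax") -> "dog"
```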

PTP-CLIP

Hello, could you please provide the PTP-CLIP checkpoint?

Dataset information

Hello, first of all, thank you for sharing great work!

I am trying to use your work, but I have some uncertainties about the dataset.
In your DATASET.md, you point out that the 4M dataset is cleaned following BLIP.

Does this mean that your 4M dataset is filtered and synthetically captioned as BLIP did?
Moreover, in Tables 2 and 6, it seems that the PTP-BLIP scores are different.
What is the difference between these two scores?

Thank you

Questions about ptp

Hi,
Congratulations on the great success of your wonderful work! I have several questions about PTP regarding the pre-training/fine-tuning settings described in the paper:

  1. I noticed that you perform zero-shot retrieval experiments on MS COCO, but in Section 4.1 of the paper I see that COCO is also used in pre-training. Did you exclude COCO from the pre-training dataset before zero-shot testing on COCO?
  2. You mentioned in the paper that the text prompt is used only in the pre-training stage. That sounds fair because it doesn't change the inference setting. In my view, using PTP changes the distribution of image captions and creates a distribution gap between the training corpus and the testing corpus, which might harm retrieval results. But the opposite seems to be true: it helps downstream retrieval rather than harming it. Why?
     For example, in the zero-shot retrieval setting, captions in the training stage look like "...The block x has a x", but the prompts are no longer used during inference; why doesn't this harm performance?
     Does the scale of the training dataset matter here? I'm also curious whether it would help to use PTP text prompts in the fine-tuning stage instead of pre-training.
     I tried to extend PTP to video retrieval and ran some experiments on video datasets, adding PTP in the fine-tuning stage when fine-tuning on MSRVTT, but the performance dropped a little.

Looking forward to your reply!
