allenai / x-lxmert

PyTorch code for EMNLP 2020 paper "X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers"

Home Page: https://prior.allenai.org/projects/x-lxmert

Python 99.38% Shell 0.62%
vision-and-language pretrained-models image-generation text-to-image x-lxmert emnlp2020 ai2

x-lxmert's Introduction

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers (EMNLP 2020)

Summary

Recent multi-modal transformers have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and generative tasks like image captioning. This raises an interesting question: can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family, LXMERT, finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT.

Demo

Try out the AI2 Computer Vision Explorer demo!

Install

  • Python packages
conda create -n xlxmert python=3.7
conda activate xlxmert
cd  x-lxmert
pip install -r ./requirements.txt

Code structure

# Store images, features, and annotations
./datasets
    COCO/
        images/
        features/
    VG/
        images/
        features/
    GQA/
        images/
        features/
    nlvr2/
        images/
        features/
    data/               <= Store text annotations (*.json) for each split
        lxmert/
        vqa/
        gqa/
        nlvr2/

# Run feature extraction and k-means clustering
./feature_extraction

# Train image generator
./image_generator
    snap/       <= Store image generator checkpoints
    scripts/    <= Bash scripts for training image generator

# Train X-LXMERT
./x-lxmert
    src/
        lxrt/           <= X-LXMERT model class implementation (inherits huggingface transformers' LXMERT class)
        pretrain/       <= X-LXMERT Pretraining
        tasks/          <= Fine-tuning on downstream tasks (VQA, GQA, NLVR2, Image generation)
    snap/       <= Store X-LXMERT checkpoints
    scripts/    <= Bash scripts for pretraining, fine-tuning, and image generation
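
The dataset layout above can be created up front so that the extraction and training scripts find the folders they expect. The following is a minimal sketch, not part of the repository: the function name `make_dataset_skeleton` is hypothetical, and the directory names are copied from the tree above on the assumption that every dataset folder holds both `images/` and `features/`.

```python
from pathlib import Path

# Directory names taken from the layout above (assumed, not read from the code).
DATASETS = ["COCO", "VG", "GQA", "nlvr2"]
ANNOTATION_DIRS = ["lxmert", "vqa", "gqa", "nlvr2"]


def make_dataset_skeleton(root):
    """Create the empty ./datasets tree and return the created paths."""
    root = Path(root)
    paths = []
    for name in DATASETS:
        for sub in ("images", "features"):
            paths.append(root / name / sub)
    for name in ANNOTATION_DIRS:
        # Text annotations (*.json) for each split live under data/.
        paths.append(root / "data" / name)
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `make_dataset_skeleton("./datasets")` from the repository root is safe to repeat, since existing directories are left untouched.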

Feature extraction

Please check out ./feature_extraction to download pre-extracted features and for more details.

cd ./feature_extraction

# For Pretraining / VQA
python coco_extract_grid_feature.py --split train
python coco_extract_grid_feature.py --split valid
python coco_extract_grid_feature.py --split test

# For Pretraining
python VG_extract_grid_feature.py

# For GQA
python GQA_extract_grid_feature.py

# For NLVR2
python nlvr2_extract_grid_feature.py --split train
python nlvr2_extract_grid_feature.py --split valid
python nlvr2_extract_grid_feature.py --split test

# K-Means clustering
python run_kmeans.py --src mscoco_train --tgt mscoco_train mscoco_valid vg
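
To illustrate what the clustering step does conceptually, here is a minimal sketch of plain Lloyd's k-means in NumPy. It is not the code in `run_kmeans.py`, whose internals are not shown in this README. The function name, the `(N, D)` feature shape, and all parameter values are assumptions chosen for illustration, and the dense distance matrix only suits small inputs.

```python
import numpy as np


def kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    features = np.asarray(features, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct feature vectors.
    start = rng.choice(len(features), size=k, replace=False)
    centroids = features[start].copy()

    def nearest(cents):
        # Squared Euclidean distance from every feature to every centroid.
        diffs = features[:, None, :] - cents[None, :, :]
        return (diffs ** 2).sum(axis=-1).argmin(axis=1)

    for _ in range(n_iter):
        assignments = nearest(centroids)
        for c in range(k):
            members = features[assignments == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    return centroids, nearest(centroids)


# Toy stand-in for flattened grid features: 256 vectors of dimension 16.
toy_features = np.random.default_rng(1).normal(size=(256, 16))
toy_centroids, toy_codes = kmeans(toy_features, k=8)
```

Each entry of `toy_codes` is the index of the nearest centroid, which is one way a continuous grid feature can be mapped to a discrete cluster id.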

Pretraining

Pretrain on LXMERT Pretraining data

cd ./x-lxmert/
bash scripts/pretrain.bash

or download pretrained checkpoint

wget -O x-lxmert/snap/pretrained/x_lxmert/Epoch20_LXRT.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/x-lxmert/Epoch20_LXRT.pth
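
Note that `wget -O` does not create missing parent directories, so the command above fails if `x-lxmert/snap/pretrained/x_lxmert/` does not exist yet. As a hedged alternative, the sketch below downloads the same file with Python's standard library and creates the directory first. The helper name `fetch` is hypothetical; the URL and destination path are copied from the command above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL and destination copied from the wget command above.
CHECKPOINT_URL = (
    "https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/"
    "x-lxmert/Epoch20_LXRT.pth"
)
CHECKPOINT_PATH = "x-lxmert/snap/pretrained/x_lxmert/Epoch20_LXRT.pth"


def fetch(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path."""
    dest = Path(dest)
    # Create snap/pretrained/... if it is missing.
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(url, str(dest))
    return dest


# Usage (run from the repository root; requires network access):
# fetch(CHECKPOINT_URL, CHECKPOINT_PATH)
```

The existence check makes repeated calls cheap, at the cost of not detecting a partially downloaded file.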

Finetuning

VQA

cd ./x-lxmert/
bash scripts/finetune_vqa.bash
bash scripts/test_vqa.bash

GQA

cd ./x-lxmert/
bash scripts/finetune_gqa.bash
bash scripts/test_gqa.bash

NLVR2

cd ./x-lxmert/
bash scripts/finetune_nlvr2.bash
bash scripts/test_nlvr2.bash
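
The fine-tuning commands above assume the listed bash scripts are present under x-lxmert/scripts/. A quick pre-flight check can confirm that before starting a long run. This is a minimal sketch: the helper name `missing_scripts` is hypothetical, and the script names are copied from the commands above.

```python
from pathlib import Path

# Script paths as they appear in the fine-tuning commands above.
FINETUNE_SCRIPTS = [
    "scripts/finetune_vqa.bash",
    "scripts/test_vqa.bash",
    "scripts/finetune_gqa.bash",
    "scripts/test_gqa.bash",
    "scripts/finetune_nlvr2.bash",
    "scripts/test_nlvr2.bash",
]


def missing_scripts(repo_dir="x-lxmert", scripts=FINETUNE_SCRIPTS):
    """Return the scripts from the list that are not found under repo_dir."""
    repo_dir = Path(repo_dir)
    return [s for s in scripts if not (repo_dir / s).is_file()]


if __name__ == "__main__":
    for name in missing_scripts():
        print(f"missing: {name}")
```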

Image generation

Train image generator on MS COCO

cd ./image_generator/
bash scripts/train_generator.bash

or download pretrained checkpoints

wget -O image_generator/snap/pretrained/G_60.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/image_generator/G_60.pth

Sample images

cd ./x-lxmert/
bash scripts/sample_image.bash

Reference

@inproceedings{Cho2020XLXMERT,
  title={X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers},
  author={Cho, Jaemin and Lu, Jiasen and Schwenk, Dustin and Hajishirzi, Hannaneh and Kembhavi, Aniruddha},
  booktitle={EMNLP},
  year={2020}
}

x-lxmert's People

Contributors

j-min, jiasenlu

x-lxmert's Issues

Is this the original code for the paper?

It seems this might not be the original code used for the paper: I see quite a few bugs, ranging from typos and syntax errors to file structures that differ from the instructions (specifically for image generation). It would be nice if the authors (@j-min) could verify whether this is the original code and whether it works for them.

API for generating images from captions

This is a cool tool, and I really enjoy the images I've gotten from the Demo.

I was hoping one of two things was possible, and I'm wondering if I'm just missing something basic. First, is there a web API? It'd be amazing to be able to do something similar to:

result = requests.get('https://vision-explorer.allenai.org/text_to_image_generation_api', data={'caption': 'Diamond rose horse'})
save_image(result.json()['image_str'])

Or, if there's no web API (and it looks like this is something that I might actually be able to do):

pip install -r requirements.txt
wget -O image_generator/snap/pretrained/G_60.pth https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/image_generator/G_60.pth
./image_generator/scripts/make_image.py --caption "Diamond rose horse" --outpath my_weird_pic.png

I guess my question is: if all I care about is having a caption and getting an image programmatically, what's the easiest way of doing that?

Scripts missing?

Hi!
In the README there are references to scripts for finetuning and testing the different downstream tasks. They should be located in the x-lxmert/scripts/ folder, but don't seem to be there. Is it possible to add them? It would also be nice if the fine-tuned models could be shared to be able to reproduce the results.

Thanks :)

403 Forbidden when trying to download grid features

Thanks for sharing the code! Could you check the download links for grid features?

$ wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/NLVR2/maskrcnn_train_grid8.h5
--2021-01-08 14:57:08--  https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/NLVR2/maskrcnn_train_grid8.h5
Resolving ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com (ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com)... 52.218.153.17
Connecting to ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com (ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com)|52.218.153.17|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-01-08 14:57:08 ERROR 403: Forbidden.

image_generator/src/trainer.py missing

Hi, thanks for this wonderful work! When I try to run the image generation training, it appears that trainer.py is missing from the src. Could you add this missing file to the repo? Thanks a lot for your kind help!
