MMT-Retrieval: Image Retrieval and more using Multimodal Transformers (OSCAR, UNITER, M3P & Co)

This project provides an easy way to use recent pre-trained multimodal Transformers like OSCAR, UNITER/VILLA, or M3P (multilingual!) for image search and more.

The code is primarily written for image-text retrieval. Still, many other Vision+Language tasks should work out of the box with our code or require only small changes.

There is currently no unified approach to handling the visual input; each model uses its own slightly different approach. We provide a common interface for all models and support multiple feature file formats, which greatly simplifies running the models.

Our project allows you to run a model in a few lines of code and offers easy fine-tuning of your own custom models.

We also provide our fine-tuned image-text retrieval models for download, so you can get started right away. Check out our example for Image Search on MSCOCO using our fine-tuned models here.

Citing & Authors

If you find this repository helpful, feel free to cite our publication [Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval](ARXIV URL):

@article{geigle:2021:arxiv,
  author    = {Gregor Geigle and 
                Jonas Pfeiffer and 
                Nils Reimers and 
                Ivan Vuli\'{c} and 
                Iryna Gurevych},
  title     = {Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval},
  journal   = {arXiv preprint},
  volume    = {abs/2103.TODO},
  year      = {2021},
  url       = {http://arxiv.org/abs/2103.TODO},
  archivePrefix = {arXiv},
  eprint    = {2103.TODO}
}

Abstract: State-of-the-Art Vision and Language models jointly process images and text pairs to learn a shared representation-space. These Transformer-based models attend over all words and objects in an image, allowing for nuanced reasoning over the respective modalities. Whilst being a powerful mechanism, this comes at cumbersome latency costs for image and text retrieval, making these cross-encoding approaches impractical for realistic application scenarios. To mitigate this we propose a cooperative retrieve and rerank approach which utilizes pre-trained multimodal models. Fine-tuned within a twin-network we separately encode all items of a corpus, enabling efficient retrieval of images/text. Our cross-encoder component provides a more nuanced comparison of the input pairs allowing smart reranking of the top retrieved items. We experiment on monolingual and multilingual benchmarks, leveraging recent multimodal models, and demonstrate that our approach achieves state-of-the-art results while being magnitudes faster in retrieval. We further propose a larger and harder benchmark, on which we show that results on smaller test sets are inflated and misleading.

Don't hesitate to send me an e-mail or report an issue if something is broken or if you have further questions or feedback.

Contact person: Gregor Geigle, [email protected]

https://www.ukp.tu-darmstadt.de/

https://www.tu-darmstadt.de/

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Installation

We recommend Python 3.6 or higher, PyTorch 1.6.0 or higher, transformers v4.1.1 or higher, and sentence-transformers 0.4.1 or higher.
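To see whether an environment meets these recommendations, a small check along the following lines may help. This is only a sketch: the pip distribution names in `MINIMUMS` are assumptions, `parse_version` and `check_requirements` are hypothetical helpers, and `importlib.metadata` needs Python 3.8 or newer.

```python
import itertools
from importlib.metadata import PackageNotFoundError, version

# Minimum versions from the recommendation above.
# The pip distribution names are assumptions.
MINIMUMS = {
    "torch": "1.6.0",
    "transformers": "4.1.1",
    "sentence-transformers": "0.4.1",
}


def parse_version(text):
    """Turn '1.6.0+cu101' into (1, 6, 0); suffixes such as 'rc1' are ignored."""
    numbers = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def check_requirements(minimums):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[name] = "ok" if ok else "too old"
    return report


for name, status in check_requirements(MINIMUMS).items():
    print(f"{name}: {status}")
```

Versions are compared as integer tuples rather than strings, so that, for example, 1.10.0 is correctly treated as newer than 1.6.0.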

Install with pip (COMING SOON)

Install mmt-retrieval with pip:

pip install mmt-retrieval

Install from sources

Alternatively, you can also clone the latest version from the repository and install it directly from the source code:

pip install -e .

PyTorch with CUDA: If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.

Getting Started

With our repository, you can get started using the multimodal Transformers in a few lines of code. Check out our example for Image Search on MSCOCO using our fine-tuned models here. Or follow the steps below to get started with your own project.

Select the Model

We provide our fine-tuned Image-Text Retrieval models for download. We also provide links to where to download the pre-trained models and models that are fine-tuned for other tasks.

Alternatively, you can fine-tune your own model, too. See here for more.

Our Fine-Tuned Image-Text Retrieval Models

We publish our jointly trained fine-tuned models. They can be used both to encode images and text in a multimodal embedding space and to cross-encode pairs for a pairwise similarity.

Model	URL
OSCAR (Flickr30k)	https://public.ukp.informatik.tu-darmstadt.de/reimers/mmt-retrieval/models/v1/oscar_join_flickr30k.zip
OSCAR (MSCOCO)	https://public.ukp.informatik.tu-darmstadt.de/reimers/mmt-retrieval/models/v1/oscar_join_mscoco.zip
M3P (Multi30k - en, de, fr, cs)	https://public.ukp.informatik.tu-darmstadt.de/reimers/mmt-retrieval/models/v1/m3p_join_multi30k.zip

Other Pre-Trained or Fine-Tuned Transformer

We currently do not support downloading the different pre-trained Transformer models directly. Please download them manually using the links in the respective repositories: OSCAR, UNITER/VILLA, M3P. Here we show examples of how to initialize your own models with the pre-trained Transformers.

OSCAR provides many already fine-tuned models for different tasks for download (see their MODEL_ZOO.md). We provide the ability to convert those models to our framework so you can quickly start using them.

from mmt_retrieval.util import convert_finetuned_oscar

downloaded_folder_path = ".../oscar-base-ir-finetune/checkpoint-29-132780"
converted_model = convert_finetuned_oscar(downloaded_folder_path)
converted_model.save("new_save_location_for_converted_model")

Step 0: Image Feature Pre-Processing

All currently supported models require a pre-processing step in which we extract the regions of interest (which serve as the image input, analogous to tokens for the language input) from the images using a Faster R-CNN object detection model.

Which detection model is needed depends on the model that you are using. Check out our guide, where we have gathered all the information needed to get started.

If available, we also point to already pre-processed image features that can be downloaded for a quicker start.

Loading Features and Image Input

We load image features into a dictionary-like object (model.image_dict) at the start. We support various storage formats for the features (see the guide above). Each image is uniquely identified by its image id in this dictionary.

The advantage of the dictionary approach is that we can designate the image input by its id which is then internally resolved to the features.
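To illustrate the idea, the following is a minimal sketch of a dictionary-like feature store. `ToyImageDict` and its methods are hypothetical names chosen for illustration, not the actual class behind model.image_dict, and the feature values are made-up numbers.

```python
class ToyImageDict:
    """Hypothetical dictionary-like store mapping image ids to region features."""

    def __init__(self):
        self._features = {}

    def add(self, image_id, features):
        # Ids are normalized to strings so that 5 and "5" name the same image.
        self._features[str(image_id)] = features

    def __getitem__(self, image_id):
        return self._features[str(image_id)]

    def resolve(self, image_ids):
        # Callers pass ids only; the lookup to features happens here.
        return [self[image_id] for image_id in image_ids]


image_dict = ToyImageDict()
image_dict.add("0", [[0.1, 0.2], [0.3, 0.4]])  # two regions, 2-d toy features
image_dict.add(5, [[0.5, 0.6]])                # one region
batch = image_dict.resolve(["0", "5"])
```

The calling code only ever handles short id strings, while the bulky feature arrays stay in one place.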

Loading Features Just-In-Time (RAM Constraints)

The image features require a lot of additional memory. For this reason, we support just-in-time loading of the features from disk. This requires one feature file for each image. Many of the downloadable features are saved in a single file. We provide code to split those big files into separate files, one for each image.

from mmt_retrieval.util import split_oscar_image_feature_file_to_npz, split_tsv_features_to_npz
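The following sketch shows the general pattern of splitting one combined feature file into per-image files and then reading each file only when it is needed. It does not use the utility functions imported above, whose arguments are not documented here. `split_features` and `LazyFeatures` are hypothetical names, and JSON is used only to keep the sketch dependency-free; with per-image .npz files, numpy.load could take the place of json.load.

```python
import json
import os
import tempfile


def split_features(combined, out_folder):
    """Write one JSON file per image id (stand-in for per-image .npz files)."""
    os.makedirs(out_folder, exist_ok=True)
    for image_id, features in combined.items():
        with open(os.path.join(out_folder, f"{image_id}.json"), "w") as f:
            json.dump(features, f)


class LazyFeatures:
    """Hypothetical just-in-time store: keeps only file paths in memory."""

    def __init__(self, folder):
        self.paths = {
            os.path.splitext(name)[0]: os.path.join(folder, name)
            for name in os.listdir(folder)
            if name.endswith(".json")
        }

    def __getitem__(self, image_id):
        # The file is read only now, when the features are actually needed.
        with open(self.paths[image_id]) as f:
            return json.load(f)


# Toy data: two images with two region vectors each (made-up numbers).
combined = {"0": [[0.1, 0.2], [0.3, 0.4]], "1": [[0.5, 0.6], [0.7, 0.8]]}
with tempfile.TemporaryDirectory() as tmp:
    split_features(combined, tmp)
    store = LazyFeatures(tmp)
    first = store["0"]
```

The trade-off is one disk read per access in exchange for holding only the file paths in memory.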

Step 1: Getting Started

The following is an example showcasing all steps needed to get started encoding multimodal inputs with our code.

from mmt_retrieval import MultimodalTransformer

# Loading a jointly trained model that can both embed and cross-encode multimodal input
model_path = "https://public.ukp.informatik.tu-darmstadt.de/reimers/mmt-retrieval/models/v1/oscar_join_flickr30k.zip"
model = MultimodalTransformer(model_name_or_path=model_path)

# Image ids are the unique identifier number (as string) of each image. If you save the image features separately for each image, this would be the file name
image_ids = ["0", "1", "5"]
# We must load the image features in some way before we can use the model
# Refer to Step 0 for more details on how to generate the features
feature_folder = "path/to/processed/features"
# Directly load the features from disk. Requires more memory.
# Increase max_workers for more concurrent threads for faster loading with many features
# Remove select to load the entire folder
model.image_dict.load_features_folder(feature_folder, max_workers=1, select=image_ids)
## OR
# Only load the file paths so that features are loaded later just-in-time when they are required.
# Recommended with restricted memory and/or a lot of images
# Remove select to load the entire folder
model.image_dict.load_file_names(feature_folder, select=image_ids)

sentences = ["The red brown fox jumped over the fence", "A dog being good"]

# Get Embeddings (as a list of numpy arrays)
sentence_embeddings = model.encode(sentences=sentences, convert_to_numpy=True) # convert_to_numpy=True is default
image_embeddings = model.encode(images=image_ids, convert_to_numpy=True)

# Get Pairwise Similarity Matrix (as a tensor)
similarities = model.encode(sentences=sentences, images=image_ids, output_value="logits", convert_to_tensor=True, cross_product_input=True)
similarities = similarities[:,-1].reshape(len(image_ids), len(sentences))
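Once embeddings and pairwise scores are available, retrieval and reranking amount to two sorting steps. The sketch below illustrates this with plain Python lists so that it runs without a model. It assumes cosine similarity as the comparison function for embeddings; `retrieve`, `rerank`, and all numbers are hypothetical stand-ins, not part of the library.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, image_embeddings, image_ids, k):
    """Return the ids of the k images most similar to the query embedding."""
    scored = [(cosine(query_embedding, emb), image_id)
              for emb, image_id in zip(image_embeddings, image_ids)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:k]]


def rerank(candidate_ids, pair_score):
    """Reorder candidates by a (hypothetical) cross-encoder score, best first."""
    return sorted(candidate_ids, key=pair_score, reverse=True)


# Toy 2-d embeddings standing in for the output of model.encode(...).
image_ids = ["0", "1", "5"]
image_embeddings = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
query_embedding = [1.0, 0.0]

candidates = retrieve(query_embedding, image_embeddings, image_ids, k=2)
# Made-up pairwise scores standing in for cross-encoder similarities.
toy_scores = {"0": 0.2, "5": 0.7}
final = rerank(candidates, pair_score=toy_scores.get)
```

The cheap embedding comparison narrows the corpus to a few candidates, so the more expensive pairwise scoring only has to run on those.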

Experiments and Training

See our examples to learn how to fine-tune and evaluate the multimodal Transformers. We provide instructions for fine-tuning your own models with our image-text retrieval setup, show how to replicate our experiments, and give pointers on how to train your own models, potentially beyond image-text retrieval.

Expected Results with our Fine-Tuned Models

We report the JOIN+CO (i.e., retrieve & re-rank with a jointly trained model) results of our published models. Refer to our publication for more detailed results.

Image Retrieval for MSCOCO/ Flickr30k:

Model	Dataset	R@1	R@5	R@10
oscar-join-mscoco	MSCOCO (5k images)	54.7	81.3	88.9
oscar-join-flickr30k	Flickr30k (1k images)	76.4	93.6	96.2

Multilingual Image Retrieval for Multi30k (in mR):

Model	en	de	fr	cs
m3p-join-multi30k	83.0	79.2	75.9	74
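For readers unfamiliar with the metrics, the sketch below shows how Recall@K (R@K) can be computed for image retrieval with one gold image per query. It assumes that mR denotes the mean over the recall values at several cutoffs; the exact averaging used for the numbers above is described in the publication. `recall_at_k`, the rankings, and the gold labels are hypothetical toy examples.

```python
def recall_at_k(rankings, gold, k):
    """Percentage of queries whose gold item appears among the top-k ranked ids."""
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return 100.0 * hits / len(rankings)


# Toy rankings: for each query, image ids ordered from best to worst match.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["b", "a", "c"],
    "q3": ["c", "a", "b"],
    "q4": ["a", "c", "b"],
}
# The single correct image for each query (made-up labels).
gold = {"q1": "a", "q2": "a", "q3": "b", "q4": "b"}

recalls = [recall_at_k(rankings, gold, k) for k in (1, 2, 3)]
mean_recall = sum(recalls) / len(recalls)  # assumed definition of mR
```

With these toy rankings, only q1 has its gold image at rank 1, q2 adds a hit at rank 2, and all four gold images appear within the top 3.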
