Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

PyTorch code for the ECCV 2022 paper:

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation,
Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan

Introduction

Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interactions with the environment as it moves.
Our work explicitly models the history of interactions between the observations and the instruction, which is critical for tracking progress along the navigation trajectory.
Our model, MTVM, achieves a new state of the art on the R2R dataset, with a 65% Success Rate and 59% SPL on the unseen test set.
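
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how Success Rate and SPL (Success weighted by Path Length) are commonly computed. It is not the evaluation code of this repo. The function name, the dictionary keys, and the 3-meter success radius (the usual R2R convention) are assumptions made for illustration.

```python
def success_rate_and_spl(episodes, success_radius=3.0):
    """Return (success rate, SPL) over a list of episodes.

    Each episode is a dict with hypothetical keys:
      final_distance  - distance from the stop position to the goal (m)
      path_length     - length of the path the agent actually took (m)
      shortest_length - length of the shortest path to the goal (m)
    """
    if not episodes:
        raise ValueError("need at least one episode")

    successes = 0
    spl_sum = 0.0
    for ep in episodes:
        # Assumed convention: success means stopping within the radius.
        if ep["final_distance"] < success_radius:
            successes += 1
            longest = max(ep["path_length"], ep["shortest_length"])
            # Weight each success by shortest / max(taken, shortest).
            # A zero-length episode is treated as perfectly efficient.
            spl_sum += ep["shortest_length"] / longest if longest > 0 else 1.0

    n = len(episodes)
    return successes / n, spl_sum / n


if __name__ == "__main__":
    demo = [
        {"final_distance": 1.0, "path_length": 10.0, "shortest_length": 10.0},
        {"final_distance": 5.0, "path_length": 8.0, "shortest_length": 8.0},
        {"final_distance": 2.0, "path_length": 20.0, "shortest_length": 10.0},
    ]
    print(success_rate_and_spl(demo))
```

SPL is never larger than the Success Rate, because each successful episode contributes at most 1 and is discounted when the agent's path is longer than the shortest one.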

Results

Requirements

Linux or macOS with Python ≥ 3.6
PyTorch ≥ 1.6

pip install -r requirements.txt
sudo apt-get install libjsoncpp-dev libepoxy-dev libglm-dev libosmesa6 libosmesa6-dev libglew-dev
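
The version requirements listed above can be checked with a short script before building anything. This is only a sketch: the helper names are hypothetical and not part of this repo, and PyTorch is imported lazily so the script still runs when it is not installed.

```python
import re
import sys


def parse_version(text):
    """Parse '1.6.0+cu101' into (1, 6, 0), dropping local or dev suffixes."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(found, minimum):
    """True if the version string `found` is at least the `minimum` tuple."""
    return parse_version(found) >= tuple(minimum)


def check_environment():
    """Report whether Python >= 3.6 and PyTorch >= 1.6 are available."""
    report = {"python_ok": sys.version_info[:2] >= (3, 6), "torch_ok": False}
    try:
        import torch  # imported lazily so the check runs without PyTorch
    except ImportError:
        return report
    report["torch_ok"] = meets_minimum(torch.__version__, (1, 6))
    return report


if __name__ == "__main__":
    print(check_environment())
```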

Installation

Build the simulator with the following instructions. The simulator is version v0.1 of the Matterport3D Simulator.

mkdir build && cd build
cmake -DOSMESA_RENDERING=ON ..
make

Prepare datasets

Please follow the data preparation steps of Recurrent VLN-BERT.

R2R Navigation benchmark evaluation and training

The MTVM models are initialized from PREVALENT (indicated by --vlnbert in the train_agent.bash file). Please download the pretrained model and place it under Prevalent/pretrained_model/ before training the MTVM models.
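
A quick way to confirm that the pretrained weights are in place before launching training is sketched below. The helper is hypothetical and not part of this repo. It only assumes the Prevalent/pretrained_model/ directory named above, and it makes no assumption about the file names inside it.

```python
from pathlib import Path


def list_pretrained_files(model_dir="Prevalent/pretrained_model"):
    """Return the files found under model_dir, or [] if it is missing."""
    root = Path(model_dir)
    if not root.is_dir():
        return []
    # Walk recursively, since the download may unpack into subfolders.
    return sorted(p for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    files = list_pretrained_files()
    if files:
        for path in files:
            print(path)
    else:
        print("No pretrained files found under Prevalent/pretrained_model/")
```

Run it from the repository root so the relative path resolves as intended.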

To train a model, run

bash run/train_agent.bash

To evaluate a trained or pretrained model, run

bash run/test_agent.bash

Download the trained network weights here.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{lin2021multimodal,
  title={Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation},
  author={Lin, Chuang and Jiang, Yi and Cai, Jianfei and Qu, Lizhen and Haffari, Gholamreza and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2111.05759},
  year={2021}
}

Acknowledgments

This repo is based on Recurrent VLN-BERT. Thanks to the authors for their wonderful work.
