clin1223 / mtvm

[ECCV 2022] Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

PyTorch code for the ECCV 2022 paper:

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation,
Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan

Introduction

  • Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interactions with the environment as it moves.

  • Our work explicitly models the history of interactions between observations and the instruction, which is critical for tracking progress along the navigation trajectory.

  • Our model, MTVM, achieves a new state of the art on the R2R dataset, reaching 65% Success Rate and 59% SPL on the unseen test set.
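
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how Success Rate and SPL (Success weighted by Path Length) are commonly computed for R2R. It assumes the usual conventions: an episode counts as a success when the agent stops within 3 m of the goal, and SPL averages success times shortest-path length over the larger of the shortest-path and actual path lengths. The function and argument names are illustrative and are not taken from this repository's evaluation code.

```python
from typing import Sequence, Tuple

# Assumption: the usual R2R success threshold of 3 metres.
SUCCESS_RADIUS_M = 3.0


def success_rate_and_spl(
    final_dists: Sequence[float],    # distance from stop position to goal, per episode
    shortest_lens: Sequence[float],  # shortest-path length from start to goal
    path_lens: Sequence[float],      # length of the path the agent actually took
    radius: float = SUCCESS_RADIUS_M,
) -> Tuple[float, float]:
    """Return (success rate, SPL) averaged over all episodes."""
    if not (len(final_dists) == len(shortest_lens) == len(path_lens)):
        raise ValueError("all inputs must have the same length")
    n = len(final_dists)
    if n == 0:
        raise ValueError("at least one episode is required")

    successes = 0
    spl_sum = 0.0
    for dist, shortest, taken in zip(final_dists, shortest_lens, path_lens):
        if dist < radius:
            successes += 1
            denom = max(taken, shortest)
            # Guard against a degenerate zero-length episode.
            spl_sum += shortest / denom if denom > 0 else 1.0
        # Failed episodes contribute 0 to both sums.
    return successes / n, spl_sum / n
```

For example, with two episodes where the first stops 1 m from the goal after a 12.5 m path (shortest path 10 m) and the second stops 5 m away, the function returns a Success Rate of 0.5 and an SPL of 0.4.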

Results

Requirements

  • Linux or macOS with Python ≥ 3.6
  • PyTorch ≥ 1.6.
pip install -r requirements.txt
sudo apt-get install libjsoncpp-dev libepoxy-dev libglm-dev libosmesa6 libosmesa6-dev libglew-dev
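
The version requirements above can be checked before installing anything else. The snippet below is a small sketch, not part of the repository: the helper names `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical. It compares versions numerically, so that a string such as `1.10.0` is correctly treated as newer than `1.6`.

```python
import re
import sys
from typing import List, Tuple


def version_tuple(version: str, width: int = 2) -> Tuple[int, ...]:
    """Parse the leading numeric components, e.g. '1.6.0+cu101' -> (1, 6)."""
    public = version.split("+")[0]  # drop a local build tag such as '+cu101'
    nums = [int(p) for p in re.findall(r"\d+", public)[:width]]
    nums += [0] * (width - len(nums))  # pad short versions such as '2'
    return tuple(nums)


def meets_minimum(version: str, minimum: str) -> bool:
    """True if `version` is at least `minimum` (major.minor comparison)."""
    return version_tuple(version) >= version_tuple(minimum)


def check_environment() -> List[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    try:
        import torch  # imported lazily so the check works without PyTorch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(str(torch.__version__), "1.6"):
            problems.append("PyTorch >= 1.6 is required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script prints one line per unmet requirement and prints nothing when the environment satisfies both.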

Installation

Build the simulator with the following commands. The simulator is version v0.1 of the Matterport3D Simulator.

mkdir build && cd build
cmake -DOSMESA_RENDERING=ON ..
make
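
After the build finishes, it can be useful to confirm that Python can locate the simulator bindings. The sketch below makes two assumptions that are not stated in this README: that the bindings are exposed as a module named `MatterSim` (the name used by the Matterport3D Simulator project) and that they are placed in the `build` directory created above. The helper `module_available` is hypothetical and only looks the module up; it does not import or run it.

```python
import importlib
import importlib.util
import sys
from pathlib import Path
from typing import Optional


def module_available(name: str, extra_dir: Optional[str] = None) -> bool:
    """Return True if `name` can be found, optionally searching `extra_dir` too."""
    added = None
    if extra_dir is not None:
        added = str(Path(extra_dir).resolve())
        sys.path.insert(0, added)
        importlib.invalidate_caches()  # make sure a fresh directory is scanned
    try:
        return importlib.util.find_spec(name) is not None
    finally:
        # Leave sys.path exactly as we found it.
        if added is not None:
            sys.path.remove(added)


if __name__ == "__main__":
    # Assumption: module name 'MatterSim', built into ./build
    if module_available("MatterSim", "build"):
        print("MatterSim bindings found")
    else:
        print("MatterSim bindings not found; check the build output")
```

If the module is reported as missing, the `build` directory most likely needs to be added to `PYTHONPATH` before running the training or evaluation scripts.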

Prepare datasets

Please follow the data preparation steps of Recurrent VLN-BERT.

R2R Navigation benchmark evaluation and training

The MTVM models are initialized from PREVALENT (indicated by --vlnbert in the train_agent.bash file). Please download the pretrained model and place it under Prevalent/pretrained_model/ before training the MTVM models.
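
Because training depends on the pretrained PREVALENT weights being in place, a quick check of the directory before launching a run can save time. The following is a minimal sketch: the directory path comes from the paragraph above, while the helper name `list_checkpoint_files` is hypothetical and no particular checkpoint file name is assumed.

```python
from pathlib import Path
from typing import List, Union

# Directory named in this README; run the check from the repository root.
PRETRAINED_DIR = Path("Prevalent/pretrained_model")


def list_checkpoint_files(directory: Union[str, Path] = PRETRAINED_DIR) -> List[Path]:
    """Return all files under `directory`, or an empty list if it is missing."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.rglob("*") if p.is_file())


if __name__ == "__main__":
    files = list_checkpoint_files()
    if not files:
        print(f"No files found under {PRETRAINED_DIR}; download the pretrained model first")
    else:
        for path in files:
            print(path)
```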

To train a model, run

bash run/train_agent.bash

To evaluate a trained or pretrained model, run

bash run/test_agent.bash

Download the trained network weights here.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@article{lin2021multimodal,
  title={Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation},
  author={Lin, Chuang and Jiang, Yi and Cai, Jianfei and Qu, Lizhen and Haffari, Gholamreza and Yuan, Zehuan},
  journal={arXiv preprint arXiv:2111.05759},
  year={2021}
}

Acknowledgments

This repo is based on Recurrent VLN-BERT. Thanks to its authors for their wonderful work.
