License: MIT License


Discrete-Continuous-VLN

Code and Data of the CVPR 2022 paper:
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
Yicong Hong, Zun Wang, Qi Wu, Stephen Gould

[Paper & Appendices] [CVPR2022 Video] [GitHub]

The method presented in this paper is also the base method for winning the:
1st Place in the Room-Across-Room (RxR) Habitat Challenge in CVPR 2022
Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, Jing Shao

[Habitat-RxR Challenge Report] [Habitat-RxR Challenge Certificate]

"Interlinked. Interlinked. What's it like to hold the hand of someone you love? Interlinked. Interlinked. Did they teach you how to feel finger to finger? Interlinked. Interlinked. Do you long for having your heart interlinked? Interlinked. Do you dream about being interlinked? Interlinked." --- Blade Runner 2049 (2017).

TODOs

Update (18 Sep 2023): We sincerely apologize that our code had a bug in Policy_ViewSelection_CMA.py (lines 198-199) and Policy_ViewSelection_VLNBERT.py (lines 141-142): two important lines that keep the visual encoders in eval mode during training, self.rgb_encoder.cnn.eval() and self.depth_encoder.eval(), were accidentally deleted in the published version. This causes an absolute drop of around 3% in results compared to the numbers reported in our paper. We have fixed the issue, and again, we deeply apologize for the inconvenience caused to all researchers.
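
For reference, the restored logic looks roughly like the sketch below. Only the two eval() calls are taken from the fix itself; the surrounding method and variable names are placeholders for illustration, so please check Policy_ViewSelection_CMA.py and Policy_ViewSelection_VLNBERT.py for the exact context.

    # Sketch of the restored lines inside the policy's observation-encoding step.
    # Only the two eval() calls come from the actual fix; the surrounding method
    # and variable names are illustrative placeholders.
    def encode_observations(self, observations):
        # Keep the pre-trained visual encoders frozen in eval mode during training,
        # so BatchNorm statistics are not updated and dropout is not applied.
        self.rgb_encoder.cnn.eval()   # restored line
        self.depth_encoder.eval()     # restored line

        rgb_features = self.rgb_encoder(observations)
        depth_features = self.depth_encoder(observations)
        return rgb_features, depth_features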

Update: Thanks to ZunWang for releasing the code for collecting the data and training the Candidate Waypoint Predictor.

Update: Thanks to ZunWang for contributing the depth-only Candidate Waypoint Prediction models for FoV 90 (R2R-CE) and FoV 79 (RxR-CE). The architecture remains the same, but the input is reduced to the DD-PPO depth encoder features. These models produce more accurate waypoint predictions than the one used in our paper. Weights are uploaded in the section below.

  • [x] VLN-CE Installation Guide
  • [x] Submitted version R2R-CE code of CMA and Recurrent-VLN-BERT with the CWP
  • [x] Running guide
  • [x] Pre-trained weights of the navigator networks and the CWP
  • [ ] RxR-CE code
  • [x] Graph construction code
  • [x] Candidate Waypoint Predictor training code
  • [x] Connectivity graphs in continuous environments
  • [ ] Graph-walk in continuous environments code
  • [x] Test all code for single-node multi-GPU-processing

Prerequisites

Installation

Follow the Habitat Installation Guide to install habitat-lab and habitat-sim. We use version v0.1.7 in our experiments, the same as VLN-CE; please refer to the VLN-CE page for more details. In brief:

  1. Create a virtual environment. We developed this project with Python 3.6.

    conda create -n dcvln python=3.6
    conda activate dcvln
  2. Install habitat-sim for a machine with multiple GPUs or without an attached display (e.g., a cluster):

    conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless
  3. Clone this repository and install all requirements for habitat-lab, VLN-CE, and our experiments. Note that we pin gym==0.21.0 because newer versions are not compatible with habitat-lab v0.1.7.

    git clone git@github.com:YicongHong/Discrete-Continuous-VLN.git
    cd Discrete-Continuous-VLN
    python -m pip install -r requirements.txt
    pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
  4. Clone the stable habitat-lab version from the GitHub repository and install it. The command below installs the core of Habitat Lab as well as habitat_baselines. A quick post-installation sanity check is sketched after these steps.

    git clone --branch v0.1.7 git@github.com:facebookresearch/habitat-lab.git
    cd habitat-lab
    python setup.py develop --all # install habitat and habitat_baselines
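
With the four steps above done, a quick sanity check (a minimal sketch, assuming the dcvln environment is active) is to confirm that the core packages import correctly:

    # Quick import check for the dcvln environment (run inside `conda activate dcvln`).
    import torch
    import habitat
    import habitat_sim

    print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    print("habitat-lab imported from:", habitat.__file__)
    print("habitat-sim imported from:", habitat_sim.__file__)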

Scenes: Matterport3D

Instructions copied from VLN-CE:

Matterport3D (MP3D) scene reconstructions are used. The official Matterport3D download script (download_mp.py) can be accessed by following the instructions on their project webpage. The scene data can then be downloaded:

# requires running with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/

Extract such that it has the form scene_datasets/mp3d/{scene}/{scene}.glb. There should be 90 scenes. Place the scene_datasets folder in data/.
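
As a quick check of the extraction (a minimal sketch, assuming the directory layout above), you can count the scene files:

    # Verify that MP3D was extracted to data/scene_datasets/mp3d/{scene}/{scene}.glb.
    from pathlib import Path

    mp3d_root = Path("data/scene_datasets/mp3d")
    glb_files = sorted(mp3d_root.glob("*/*.glb"))
    print(f"Found {len(glb_files)} .glb scene files (expected 90).")

    # List any scene folders that are missing their .glb file.
    missing = [d.name for d in mp3d_root.iterdir()
               if d.is_dir() and not (d / f"{d.name}.glb").exists()]
    if missing:
        print("Scenes missing a .glb file:", missing)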

Adapted MP3D Connectivity Graphs in Continuous Environments

We adapt the MP3D connectivity graphs defined for the discrete environments to the continuous Habitat-MP3D environments, such that all nodes are positioned in open space and all edges on the graph are fully traversable by an agent (with the VLN-CE configurations). Please refer to Section 4.2 and Appendices A.1 in our paper for more details.

Link to download the adapted connectivity graphs.

Each file corresponds to a specific MP3D scene and contains the positions of a set of nodes and the edges connecting adjacent nodes. From the node IDs, you can identify nodes inherited from the original graph as well as new nodes added by us to complete the graph.
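
As a starting point for working with these files, here is a minimal inspection sketch. The file name and the "nodes"/"edges" keys are assumptions made for illustration only; please check the downloaded files for the actual schema.

    # Hypothetical sketch for inspecting one adapted connectivity-graph file.
    # The file name and the "nodes"/"edges" keys are assumptions -- adapt them
    # to the actual schema of the downloaded files.
    import json

    with open("connectivity/2t7WUuJeko7.json") as f:  # hypothetical per-scene file
        graph = json.load(f)

    print("Top-level keys:", list(graph.keys()))
    nodes = graph.get("nodes", {})   # assumed key: node id -> 3D position
    edges = graph.get("edges", [])   # assumed key: list of (node id, node id) pairs
    print(f"{len(nodes)} nodes, {len(edges)} edges")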

Trained Network Weights

Running

Please refer to Peter Anderson's VLN paper for the R2R Navigation task, and Jacob Krantz's VLN-CE for R2R in continuous environments (R2R-CE).

Training and Evaluation

We apply two popular navigator models, CMA and Recurrent VLN-BERT, in our experiments.

Use run_CMA.bash and run_VLNBERT.bash for Training with a single GPU, Training on a single node with multiple GPUs, Evaluation, or Inference. Simply uncomment the corresponding lines in the files and run

bash run_CMA.bash

or

bash run_VLNBERT.bash

By running Evaluation, you should obtain results very similar to those in logs/eval_results/. Running Inference generates the trajectories for submission to the R2R-CE Test Server.

Hardware

The networks are trained on a single NVIDIA RTX 3090 GPU, which takes about 3.5 days to complete.

Related Works

If you are interested in this research direction for VLN, below are some closely related works.

Waypoint Models for Instruction-guided Navigation in Continuous Environments (ICCV2021) by Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee and Oleksandr Maksymets.

Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments (2022) by Jacob Krantz and Stefan Lee.

Sim-to-Real Transfer for Vision-and-Language Navigation (CoRL2021) by Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra and Stefan Lee.

Citation

Please cite our paper:

@InProceedings{Hong_2022_CVPR,
    author    = {Hong, Yicong and Wang, Zun and Wu, Qi and Gould, Stephen},
    title     = {Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022}
}


discrete-continuous-vln's Issues

Hi, I have a question about the action space in the Habitat setting

I have been reading your paper lately and looking at the code to try to reproduce the experiments. In ss_trainer_VLNBERT.py, a 0 action (stop) or a 4 action (HIGHTOLOW) is performed through env.step at the end of each step. Can you explain why? And can you tell me what the HIGHTOLOW action is? Is it the act of lowering the perspective from top-down?

Question about the training strategy

Hi Yicong, thanks for releasing the code of the Discrete-to-Continuous work. I have been reading papers in the VLN field, including the recent VLN-CE ones. There is one detail that has confused me for a long time. I found that in discrete VLN, almost all recent works adopt a mixed IL + RL training strategy for better performance. However, most later works in VLN-CE instead turn to a simpler IL-only training scheme without any RL, including your Discrete-to-Continuous work. I wonder why researchers gave up the effective IL + RL strategy. Is it just a conventional choice following the first VLN-CE work, or are there other reasons? I would really appreciate it if you could share your thoughts.

Debug Problem

Thanks for releasing the code of dcvln; it is a very interesting work. I followed the instructions to run the source code and it works perfectly. But when I try to debug the code with PyCharm 2018 Pro, it gets stuck at
self.envs = construct_envs(self.config, get_env_class(self.config.ENV_NAME), episodes_allowed=episode_ids, auto_reset_done=False) in ss_trainer_CMA.py line 361, which is very weird.
Specifically, when I use Debug in the PyCharm IDE and step through the code, it always gets stuck at this line with no response at all.
I guess the problem is due to multiprocessing; maybe a certain subprocess is waiting for interaction or something. I tried to step inside and found that habitat.VectorEnv() does not respond. Could it be a problem with our computer or IDE? Does debug mode work on your side?
I'm looking forward to your reply!

Question about the imitation learning strategy in the paper

Hi Yicong,

I realized that the imitation learning loss you use in the code base is essentially the cross-entropy loss between the predicted action and the oracle action, which is obtained by selecting the waypoint closest to the goal. However, this oracle action might not be optimal, because sometimes the closest waypoint may not be on the ground-truth path (the reference path in the dataset), as in the following picture:

[screenshot omitted]

This can cause the agent to loop around the area.

As the waypoint predictor shows very good results, I wonder if you could comment on how it manages to avoid the above issue.

Many thanks!
Andy
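
For readers following this discussion, a minimal sketch of the imitation objective described above is shown below. This is an illustration of the scheme as described in the issue, not the repository's actual implementation; the tensor shapes and helper names are assumptions.

    # Illustrative sketch (not the repository code): cross-entropy imitation loss
    # against an "oracle" action defined as the candidate waypoint closest to the goal.
    import torch
    import torch.nn.functional as F

    def oracle_action(candidate_positions, goal_position):
        # candidate_positions: [num_candidates, 3], goal_position: [3]
        dists = torch.norm(candidate_positions - goal_position, dim=-1)
        return torch.argmin(dists)

    def imitation_loss(action_logits, candidate_positions, goal_position):
        # action_logits: [num_candidates] scores over candidate waypoints
        target = oracle_action(candidate_positions, goal_position).unsqueeze(0)  # [1]
        return F.cross_entropy(action_logits.unsqueeze(0), target)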
