
Comments (5)

YicongHong commented on August 28, 2024

Hey Jackie,
I am not an expert in RL, but I would like to share some of my experience training VLN agents. Please correct me if I am wrong.

In discrete VLN, where the agent has a panoramic view and travels with high-level actions, it is easy to define and quantify rewards (e.g., progress-based or path-fidelity-based). I think RL in this case serves the learning of view selection. With IL as a stabilizer, the agent can explore and learn from mistakes in the early training stage while maintaining a very stable learning curve.

However, in the continuous setting there are two main concerns. (1) It is hard to define rewards for low-level actions: think about how to shape a reward when the agent turns 15 degrees, or about how we can only assign a very small reward when the agent moves forward a very short distance -- which is a very weak learning signal. (2) Computational cost: running in continuous environments requires significantly more compute than in discrete ones -- for the simulator to render the scenes and for the agent to learn a much larger state space. Sadly, this is the main reason I can't afford to run RL in my experiments.
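To make concern (1) concrete, here is a rough sketch of a progress-based reward with made-up constants (this is not code from our repo): a discrete hop between viewpoints cuts the geodesic distance to the goal by meters, while a 0.25m forward step cuts it by at most 0.25 and a 15-degree turn cuts it by zero.

```python
def progress_reward(prev_geo_dist, curr_geo_dist, done, success,
                    success_bonus=2.0, slack=-0.01):
    """Illustrative progress-based reward: reduction in geodesic distance
    to the goal, a small per-step slack penalty, and a terminal bonus.
    All constants here are made up for this sketch."""
    r = (prev_geo_dist - curr_geo_dist) + slack
    if done:
        r += success_bonus if success else -success_bonus
    return r

# Discrete hop between panoramic viewpoints: a strong signal.
print(progress_reward(8.0, 6.0, done=False, success=False))   # 1.99
# Continuous 0.25m forward step: a much weaker signal.
print(progress_reward(8.0, 7.75, done=False, success=False))  # 0.24
# 15-degree turn: no distance change, only the slack penalty.
print(progress_reward(8.0, 8.0, done=False, success=False))   # -0.01
```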
My colleague Zun Wang @wz0919 tried RL + IL + low-level actions for continuous VLN (without much compute or tuning) and got very bad results. But for RL + IL + waypoint predictor, compared to the scheduled sampling in our paper, RL + IL performs slightly better while the training time is similar.
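For reference, the RL + IL mix is conceptually just a weighted sum of a policy-gradient term and a cross-entropy term against the teacher's actions. A minimal sketch (the 0.2 weight is a common choice in VLN work, not necessarily what Zun used):

```python
import torch
import torch.nn.functional as F

def mixed_rl_il_loss(logits, teacher_actions, log_probs, advantages,
                     il_weight=0.2):
    # IL term: cross-entropy against the teacher (shortest-path) actions.
    il_loss = F.cross_entropy(logits, teacher_actions)
    # RL term: A2C-style policy gradient with precomputed advantages.
    rl_loss = -(log_probs * advantages.detach()).mean()
    return rl_loss + il_weight * il_loss

# Toy shapes: a batch of 8 decision steps with 4 candidate actions each.
logits = torch.randn(8, 4, requires_grad=True)
teacher = torch.randint(0, 4, (8,))
# In practice the log-probs come from actions *sampled* by the policy;
# we reuse the teacher indices here only to keep the demo short.
logp = torch.log_softmax(logits, -1).gather(1, teacher[:, None]).squeeze(1)
mixed_rl_il_loss(logits, teacher, logp, torch.randn(8)).backward()
```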

But what if we have sufficient resources? I believe that with sufficient compute, great ideas can be pushed to another level. Take a look at the Waypoint Models (64 GPUs), as well as many of the papers in object-goal navigation, such as THDA (256 GPUs), which only uses DD-PPO. Having said that, I also want to mention our Recurrent VLN-BERT: increasing the batch size and re-attending to the instruction at each step can achieve much higher performance ;).
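The re-attend part simply means letting the recurrent agent state query the instruction tokens again at every decision step; a toy sketch of the idea (not the actual Recurrent VLN-BERT code):

```python
import torch
import torch.nn as nn

class StepwiseInstructionAttention(nn.Module):
    """Toy version of re-attending to the instruction at each step."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, state, instr_tokens):
        # state: (B, 1, D) recurrent agent state; instr_tokens: (B, L, D)
        ctx, _ = self.attn(state, instr_tokens, instr_tokens)
        return state + ctx  # language-conditioned state for the next action

layer = StepwiseInstructionAttention()
out = layer(torch.randn(2, 1, 768), torch.randn(2, 40, 768))  # (2, 1, 768)
```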
On the other hand, you might be interested in the recent Habitat-Web -- with a sufficient amount of IL data, an agent can learn exploration well via imitation.

Happy to learn more about your thoughts. Cheers.


YESAndy commented on August 28, 2024


Hi Yicong,

Thank you for your reply! Yes, I tried larger batch sizes (8, 16, 32, ...), but they ate up a large amount of CPU memory (my computer has 24 GB of RAM), so a larger batch size caused the training to crash. I also checked the GPU utilization, which was low (~40%), so I re-created a new conda env and re-installed habitat-sim using conda (I had built it from source before because I didn't realize my Python version was not compatible with habitat-sim=0.1.7 ;))). I also revised some of the code to reduce CPU memory consumption, along the lines of the sketch below. And now the training time is finally down to 1.5 hr/epoch 😫.
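For anyone hitting the same wall, the knobs I touched were of this kind; a generic PyTorch sketch with dummy data, not the repo's actual loader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in data; the real code streams observations from habitat-sim.
data = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 4, (256,)))

# Fewer workers and a small prefetch_factor shrink the CPU-side buffers
# (each worker keeps prefetch_factor batches in RAM); pinned memory makes
# host-to-GPU copies cheaper.
loader = DataLoader(data, batch_size=16, num_workers=2,
                    prefetch_factor=2, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for obs, act in loader:
    obs = obs.to(device, non_blocking=True)
    break  # one iteration, just to show the transfer pattern
```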


Jackie-Chou commented on August 28, 2024

Thanks Yicong! It's so generous of you to share all these ideas and materials with me; I really appreciate it, sincerely! It is sad to know that RL in the continuous environment is unaffordable for now, but I agree with you that great ideas like your VLN-BERT could be pushed to another level given sufficient resources. Thanks again, and I'm looking forward to seeing more great ideas from you!


YESAndy commented on August 28, 2024


Hi Yicong,

Very impressive discussion. Just a quick follow-up question: you mentioned in the paper that you used an Nvidia 3090 and a batch size of 64; I wonder how long your training time for VLN-BERT was?

We have a 4090 GPU and a batch size of 4, but the training time is extremely long (3 hr/epoch) TAT. I already checked that the Habitat-sim version is the same as yours and that it is installed with CUDA (see the check below). Do you have any suggestions regarding this issue?
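For completeness, this is the kind of check I ran; I am assuming habitat_sim.cuda_enabled is the right flag for a CUDA-enabled build:

```python
import torch
import habitat_sim

# Quick environment sanity check before profiling the training loop.
print("habitat-sim:", habitat_sim.__version__)      # expecting 0.1.7
print("sim CUDA build:", habitat_sim.cuda_enabled)  # assumed flag name
print("torch CUDA:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```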

Many thanks!!!!!!!


YicongHong commented on August 28, 2024

Hi Andy,

Thanks! The "single Nvidia 3090 and batch size of 64" is for training the waypoint predictor. For the navigation model, please see the Appendix for more details. For VLNBERT, the training takes about 3.5 days to complete 50 epochs using batch size 16 on a single 3090 GPU.
I guess the speed difference is primarily due to the hardware; can you try to fit more samples (a larger batch size)?
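If CPU memory is what blocks a larger batch, gradient accumulation gives a larger effective batch without holding more samples at once; a generic sketch with a toy model, not code from our repo:

```python
import torch
import torch.nn as nn

# Toy model/data standing in for the navigation model and episode batches.
model = nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
batches = [(torch.randn(4, 16), torch.randint(0, 4, (4,))) for _ in range(8)]

accum = 4  # effective batch = 4 micro-batches x 4 samples = 16
opt.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = nn.functional.cross_entropy(model(x), y) / accum
    loss.backward()            # gradients accumulate across micro-batches
    if (i + 1) % accum == 0:
        opt.step()
        opt.zero_grad()
```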

Cheers,
Yicong

