
pg_travel's Introduction

Policy Gradient (PG) Algorithms


This repository contains PyTorch (v0.4.0) implementations of typical policy gradient (PG) algorithms.

  • Vanilla Policy Gradient [1]
  • Truncated Natural Policy Gradient [4]
  • Trust Region Policy Optimization [5]
  • Proximal Policy Optimization [7]

We have implemented and trained agents with these PG algorithms on the benchmarks below. Trained agents and the Unity ml-agents environment source files will soon be available in our repo!

For reference, in-depth reviews (in Korean) of the PG papers below are available at https://reinforcement-learning-kr.github.io/2018/06/29/0_pg-travel-guide/. Enjoy!

  • [1] R. Sutton, et al., "Policy Gradient Methods for Reinforcement Learning with Function Approximation", NIPS 2000.
  • [2] D. Silver, et al., "Deterministic Policy Gradient Algorithms", ICML 2014.
  • [3] T. Lillicrap, et al., "Continuous Control with Deep Reinforcement Learning", ICLR 2016.
  • [4] S. Kakade, "A Natural Policy Gradient", NIPS 2002.
  • [5] J. Schulman, et al., "Trust Region Policy Optimization", ICML 2015.
  • [6] J. Schulman, et al., "High-Dimensional Continuous Control using Generalized Advantage Estimation", ICLR 2016.
  • [7] J. Schulman, et al., "Proximal Policy Optimization Algorithms", arXiv:1707.06347, https://arxiv.org/pdf/1707.06347.pdf.

Table of Contents

  • Mujoco-py: Installation, Train, Tensorboard, Trained Agent
  • Unity ml-agents: Installation, Environments, Train, Tensorboard, Trained Agent
  • Reference

Mujoco-py

1. Installation

2. Train

Navigate to the pg_travel/mujoco folder

Basic Usage

Train the agent with PPO using Hopper-v2 without rendering.

python main.py
  • Note that models are automatically saved in the save_model folder every 100 iterations.

Train the agent with TRPO using HalfCheetah-v2 with rendering.

python main.py --algorithm TRPO --env HalfCheetah-v2 --render
  • algorithm: PG, TNPG, TRPO, PPO (default)
  • env: Ant-v2, HalfCheetah-v2, Hopper-v2 (default), Humanoid-v2, HumanoidStandup-v2, InvertedPendulum-v2, Reacher-v2, Swimmer-v2, Walker2d-v2

Continue training from the saved checkpoint

python main.py --load_model ckpt_736.pth.tar
  • Note that the ckpt_736.pth.tar file should be in the pg_travel/mujoco/save_model folder.
  • Pass the algorithm and/or env arguments if you are not using PPO and/or Hopper-v2.

Test the pretrained model

Play 5 episodes with the saved model ckpt_736.pth.tar

python test_algo.py --load_model ckpt_736.pth.tar --iter 5
  • Note that the ckpt_736.pth.tar file should be in the pg_travel/mujoco/save_model folder.
  • Pass the env argument if you are not using Hopper-v2.

Modify the hyperparameters

Hyperparameters are listed in hparams.py. Change the hyperparameters according to your preference.
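
For reference, a minimal sketch of the kind of settings such a file typically holds is shown below; the field names and values here are illustrative, so check hparams.py itself for the real ones.

# Illustrative only -- see hparams.py for the actual names and defaults.
class HyperParams:
    gamma = 0.99        # discount factor
    lamda = 0.98        # GAE lambda
    hidden = 64         # hidden layer size of the actor/critic networks
    actor_lr = 0.0003   # actor learning rate
    critic_lr = 0.0003  # critic learning rate
    batch_size = 64     # minibatch size
    clip_param = 0.2    # PPO clipping range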

3. Tensorboard

We have integrated TensorboardX to monitor training progress; see the snippet below for the logging pattern it uses.

  • Note that training results are automatically saved in the logs folder.
  • TensorboardX is a TensorBoard-like visualization tool for PyTorch.
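
Logging with TensorboardX follows the usual SummaryWriter pattern; the snippet below is a generic sketch of that pattern, not the repository's exact logging code.

from tensorboardX import SummaryWriter

writer = SummaryWriter('logs')    # event files are written to the logs folder
for iteration in range(3):        # stand-in for the training loop
    score = float(iteration)      # stand-in for the average episode score
    writer.add_scalar('log/score', score, iteration)
writer.close()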

Navigate to the pg_travel/mujoco folder

tensorboard --logdir logs

4. Trained Agent

We have trained the agents with four different PG algorithms using the Hopper-v2 environment.

(The score curves and GIFs of the Vanilla PG, NPG, TRPO, and PPO agents are shown in the original repository.)

Unity ml-agents

1. Installation

2. Environments

We have modified the Walker environment provided by Unity ml-agents.

(Overview images of the Walker agent, the Plane environment, and the Curved environment are shown in the original repository.)

Description

  • 212-dimensional continuous observation space
  • 39-dimensional continuous action space
  • 16 walker agents in both the Plane and Curved environments
  • Reward (combined as sketched after this list)
    • +0.03 times body velocity in the goal direction
    • +0.01 times head y position
    • +0.01 times body direction alignment with the goal direction
    • -0.01 times head velocity difference from body velocity
    • +1000 for reaching the target
  • Done
    • When any body part other than the right and left feet of the walker agent touches the ground or walls
    • When the walker agent reaches the target
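
For clarity, the reward terms above combine roughly as in the sketch below; the function and variable names are ours, not the Unity environment's source.

def walker_reward(body_vel_goal, head_y, body_goal_alignment,
                  head_vel_diff, reached_target):
    # Illustrative combination of the reward terms listed above.
    reward = (0.03 * body_vel_goal          # body velocity in the goal direction
              + 0.01 * head_y               # head y position
              + 0.01 * body_goal_alignment  # body direction alignment with the goal
              - 0.01 * head_vel_diff)       # head velocity difference from body velocity
    if reached_target:
        reward += 1000.0                    # bonus for reaching the target
    return reward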

Prebuilt Unity environments

  • Contains Plane and Curved walker environments for Linux / Mac / Windows!
  • Linux headless envs are also provided for faster training and server-side training.
  • Download the corresponding environments, unzip, and put them in the pg_travel/unity/env folder.

3. Train

Navigate to the pg_travel/unity folder

Basic Usage

Train the walker agents with PPO using the Plane environment without rendering.

python main.py --train
  • The PPO implementation supports multi-agent training: experiences are collected from multiple agents and used to train the global policy and value networks (the brain), as sketched below. Refer to pg_travel/mujoco/agent/ppo_gae.py for single-agent training.
  • See the arguments in main.py. You can change the hyperparameters for the PPO algorithm, the network architecture, etc.
  • Note that models are automatically saved in the save_model folder every 100 iterations.
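
A rough outline of that multi-agent collection scheme is sketched below; the environment interface and helper names are assumptions, not the repository's actual classes.

import collections

def collect_and_update(env, actor, critic, num_agents, steps_per_iter, ppo_update):
    # Hypothetical outline: pool transitions from all agents, then update the
    # global policy/value networks (the brain) once per iteration.
    memory = collections.deque()
    states = env.reset()                              # one state per agent
    for _ in range(steps_per_iter):
        actions = [actor(s) for s in states]          # global policy acts for every agent
        next_states, rewards, dones = env.step(actions)
        for i in range(num_agents):                   # every agent feeds the same memory
            memory.append((states[i], actions[i], rewards[i], dones[i]))
        states = next_states
    ppo_update(actor, critic, memory)                 # single update from the pooled batch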

Continue training from the saved checkpoint

python main.py --load_model ckpt_736.pth.tar --train
  • Note that the ckpt_736.pth.tar file should be in the pg_travel/unity/save_model folder.

Test the pretrained model

python main.py --render --load_model ckpt_736.pth.tar
  • Note that the ckpt_736.pth.tar file should be in the pg_travel/unity/save_model folder.

Modify the hyperparameters

See main.py for default hyperparameter settings. Pass the hyperparameter arguments according to your preference.

4. Tensorboard

We have integrated TensorboardX to monitor training progress.

Navigate to the pg_travel/unity folder

tensorboard --logdir logs

5. Trained Agent

We have trained the agents with PPO using the Plane and Curved environments.

(GIFs of the trained agents in the Plane and Curved environments are shown in the original repository.)

Reference

We referenced the code from the repositories below.

pg_travel's People

Contributors

dnddnjs, dongminlee94, hyeokreal, pz1004, rrbb014


pg_travel's Issues

Build an A2C-style PPO agent to improve training speed and performance

Collecting samples with a single actor-runner seems to make training slow. Also, the quality of the resulting policy is considerably lower than that of an agent trained with multiple actor-runners, so we should train with multiple actor-runners. The following steps should work:

  1. Build an environment with multiple actor-runners
  2. Store the samples from each actor-runner in its own memory
  3. Compute GAE separately from each memory
  4. Compute gradients from each memory, then average them to update the actor and critic (see the sketch below)

The other tasks depend on this one, so it would be great if you could put it together as quickly as possible.
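
A minimal PyTorch sketch of step 4 is given below; the losses and optimizer are placeholders, not this repository's code.

def averaged_gradient_step(losses_per_memory, optimizer):
    # losses_per_memory: one scalar loss tensor per actor-runner memory.
    # Accumulating each loss scaled by 1/N averages the gradients before the step.
    optimizer.zero_grad()
    n = len(losses_per_memory)
    for loss in losses_per_memory:
        (loss / n).backward()
    optimizer.step()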

Quick question about environment normalization

Hello,

I'm planning to use your PPO implementations, which seem well-written, clear and easy to understand. But first, I'd like to have the answer to the following question:

In OpenAI baselines, environments are passed to various classes, such as VecNormalize, Observation/Reward wrappers, or even Monitor. In these cases, observations and rewards are transformed in order to ease learning. However, there is a lot of encapsulation, and it makes it somewhat difficult to follow the chain. After a quick glance at your implementations, I'm under the impression that you transform the observations in unity/utils/running_state.py. Is that so? Are there other transformations? Or were you just careful while designing the environment, making sure rewards were appropriately scaled?

Thanks a lot for your answers.
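
(For context, running_state.py-style observation normalization is usually a running mean/std filter applied per observation dimension; the sketch below is a generic version of that idea, not the repository's exact code.)

import numpy as np

class RunningObsFilter:
    # Generic running mean/std observation filter (Welford-style updates).
    def __init__(self, obs_dim, clip=5.0):
        self.n = 0
        self.mean = np.zeros(obs_dim)
        self.m2 = np.zeros(obs_dim)   # running sum of squared deviations
        self.clip = clip

    def __call__(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        self.n += 1
        delta = obs - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (obs - self.mean)
        std = np.maximum(np.sqrt(self.m2 / max(self.n - 1, 1)), 1e-8)
        return np.clip((obs - self.mean) / std, -self.clip, self.clip)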

PPO Model RuntimeError

When I run main.py, I encounter this problem:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [64, 1]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

How can I solve this? Thanks.
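
(For context: this class of error means a tensor that autograd saved for the backward pass was later modified in place. A minimal reproduction, unrelated to this repository's specific code, is shown below; the usual fix is to replace the in-place op, e.g. y = y + 1 instead of y += 1, or to clone the tensor before modifying it.)

import torch

x = torch.ones(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid saves its output for the backward pass
y += 1                 # in-place edit bumps the saved tensor's version counter
y.sum().backward()     # raises the "modified by an inplace operation" RuntimeError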

Train an agent with PPO in the Pyramid environment

I think the following steps should work.
Ask for help anytime you need it!

  • Compile the Pyramid environment and test it (inspect the states, rewards, etc.)
  • Train it with the existing PPO algorithm
  • Add curiosity and train it again

When I install mujoco and import mujoco_py, I get a problem

I successfully installed mujoco, but when I import it, I get this error:

--
PermissionError Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/lockfile/linklockfile.py in acquire(self, timeout)
18 try:
---> 19 open(self.unique_name, "wb").close()
20 except IOError:

PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/mujoco_py-1.50.1.59-py3.5.egg/mujoco_py/generated/wonchul-60572700.8099-2917094554558463988'

I followed everything you mentioned.

Could you help me?

Train the agent in a sloped environment

Use the PPO agent trained on the flat environment as the baseline for this training.
For the environment, get help from 민규식 if possible.
The split below is rough, so please discuss it between the two of you as you go.
Please leave updates on your progress in this issue along the way!

Add a README and code comments

The README should include the following:

  1. The project goals
  2. A brief installation guide for each environment (explaining it for Linux would probably be best)
  3. A description of each algorithm
  4. The training results for each environment
  5. A guide for training and testing
  6. The referenced repositories

Add code comments to the parts of the algorithms that are hard to understand without them.

Add various settings so the code can run on a server

The current Unity PPO code runs on a local laptop (CPU only), but unlike mujoco the state and action spaces are large, so it needs to run on a server with a GPU. Moreover, PPO is an algorithm that can make better use of a GPU than TRPO.

Therefore we need to:

  1. Make the hyper-params configurable with argparse (sketched below)
  2. Use tensorboardX to monitor training progress
  3. Periodically save training results as videos
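
A minimal sketch of item 1 is shown below; the argument names are illustrative, not necessarily the flags main.py ended up with.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--gamma', type=float, default=0.99, help='discount factor')
parser.add_argument('--actor_lr', type=float, default=3e-4, help='actor learning rate')
parser.add_argument('--critic_lr', type=float, default=3e-4, help='critic learning rate')
parser.add_argument('--batch_size', type=int, default=64, help='minibatch size')
args = parser.parse_args()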

Why is the log standard deviation fixed to 0?

I see that the actor-critic model (model.py) outputs mu and logstd. In the code, logstd is fixed to 0 by defining it as logstd = torch.zeros_like(mu), which fixes the standard deviation to 1. But as far as I know, logstd should also be learned by the network (in that case logstd would be the output of some layer). Is there any reason for this behavior?
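
For comparison, a learnable log standard deviation is usually implemented either as a state-independent nn.Parameter or as an extra output head; a minimal sketch of the first option (not the repository's model.py) looks like this:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    # Actor that outputs mu and a learnable, state-independent log std.
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # Learned log std, shared across states
        # (contrast with logstd = torch.zeros_like(mu) in the question above).
        self.logstd = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mu = self.net(obs)
        return mu, self.logstd.expand_as(mu)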
