
pg_travel's Introduction

Policy Gradient (PG) Algorithms


This repository contains PyTorch (v0.4.0) implementations of typical policy gradient (PG) algorithms.

  • Vanilla Policy Gradient [1]
  • Truncated Natural Policy Gradient [4]
  • Trust Region Policy Optimization [5]
  • Proximal Policy Optimization [7]

We have implemented and trained agents with these PG algorithms on the benchmarks below. Trained agents and the Unity ml-agents environment source files will soon be available in our repo!

For reference, in-depth reviews (in Korean) of the PG papers below are available at https://reinforcement-learning-kr.github.io/2018/06/29/0_pg-travel-guide/. Enjoy!

  • [1] R. Sutton, et al., "Policy Gradient Methods for Reinforcement Learning with Function Approximation", NIPS 2000.
  • [2] D. Silver, et al., "Deterministic Policy Gradient Algorithms", ICML 2014.
  • [3] T. Lillicrap, et al., "Continuous Control with Deep Reinforcement Learning", ICLR 2016.
  • [4] S. Kakade, "A Natural Policy Gradient", NIPS 2002.
  • [5] J. Schulman, et al., "Trust Region Policy Optimization", ICML 2015.
  • [6] J. Schulman, et al., "High-Dimensional Continuous Control using Generalized Advantage Estimation", ICLR 2016.
  • [7] J. Schulman, et al., "Proximal Policy Optimization Algorithms", arXiv:1707.06347, https://arxiv.org/pdf/1707.06347.pdf.

Table of Contents

  • Mujoco-py: Installation, Train, Tensorboard, Trained Agent
  • Unity ml-agents: Installation, Environments, Train, Tensorboard, Trained Agent
  • Reference

Mujoco-py

1. Installation

2. Train

Navigate to the pg_travel/mujoco folder

Basic Usage

Train the agent with PPO using Hopper-v2 without rendering.

python main.py
  • Note that models are automatically saved in the save_model folder every 100 iterations.

Train the agent with TRPO using HalfCheetah-v2 with rendering.

python main.py --algorithm TRPO --env HalfCheetah-v2 --render
  • algorithm: PG, TNPG, TRPO, PPO (default)
  • env: Ant-v2, HalfCheetah-v2, Hopper-v2 (default), Humanoid-v2, HumanoidStandup-v2, InvertedPendulum-v2, Reacher-v2, Swimmer-v2, Walker2d-v2

Continue training from the saved checkpoint

python main.py --load_model ckpt_736.pth.tar
  • Note that the ckpt_736.pth.tar file should be in the pg_travel/mujoco/save_model folder.
  • Pass the algorithm and/or env arguments if you are not using PPO and/or Hopper-v2.

Test the pretrained model

Play 5 episodes with the saved model ckpt_736.pth.tar

python test_algo.py --load_model ckpt_736.pth.tar --iter 5
  • Note that the ckpt_736.pth.tar file should be in the pg_travel/mujoco/save_model folder.
  • Pass the env argument if you are not using Hopper-v2.

Modify the hyperparameters

Hyperparameters are listed in hparams.py. Change the hyperparameters according to your preference.
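
For reference, a minimal sketch of the kind of settings such a file typically holds is shown below; the field names and values here are illustrative, so check hparams.py itself for the real ones.

# Illustrative only -- see hparams.py for the actual names and defaults.
class HyperParams:
    gamma = 0.99        # discount factor
    lamda = 0.98        # GAE lambda
    hidden = 64         # hidden layer size of the actor/critic networks
    actor_lr = 0.0003   # actor learning rate
    critic_lr = 0.0003  # critic learning rate
    batch_size = 64     # minibatch size
    clip_param = 0.2    # PPO clipping range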

3. Tensorboard

We have integrated TensorboardX to monitor training progress; see the snippet below for the logging pattern it uses.

  • Note that training results are automatically saved in the logs folder.
  • TensorboardX is a TensorBoard-like visualization tool for PyTorch.
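
Logging with TensorboardX follows the usual SummaryWriter pattern; the snippet below is a generic sketch of that pattern, not the repository's exact logging code.

from tensorboardX import SummaryWriter

writer = SummaryWriter('logs')    # event files are written to the logs folder
for iteration in range(3):        # stand-in for the training loop
    score = float(iteration)      # stand-in for the average episode score
    writer.add_scalar('log/score', score, iteration)
writer.close()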

Navigate to the pg_travel/mujoco folder

tensorboard --logdir logs

4. Trained Agent

We have trained the agents with four different PG algorithms using the Hopper-v2 environment.

(The score curves and GIFs of the Vanilla PG, NPG, TRPO, and PPO agents are shown in the original repository.)

Unity ml-agents

1. Installation

2. Environments

We have modified the Walker environment provided by Unity ml-agents.

(Overview images of the Walker agent, the Plane environment, and the Curved environment are shown in the original repository.)

Description

  • 212-dimensional continuous observation space
  • 39-dimensional continuous action space
  • 16 walker agents in both the Plane and Curved environments
  • Reward (combined as sketched after this list)
    • +0.03 times body velocity in the goal direction
    • +0.01 times head y position
    • +0.01 times body direction alignment with the goal direction
    • -0.01 times head velocity difference from body velocity
    • +1000 for reaching the target
  • Done
    • When any body part other than the right and left feet of the walker agent touches the ground or walls
    • When the walker agent reaches the target
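
For clarity, the reward terms above combine roughly as in the sketch below; the function and variable names are ours, not the Unity environment's source.

def walker_reward(body_vel_goal, head_y, body_goal_alignment,
                  head_vel_diff, reached_target):
    # Illustrative combination of the reward terms listed above.
    reward = (0.03 * body_vel_goal          # body velocity in the goal direction
              + 0.01 * head_y               # head y position
              + 0.01 * body_goal_alignment  # body direction alignment with the goal
              - 0.01 * head_vel_diff)       # head velocity difference from body velocity
    if reached_target:
        reward += 1000.0                    # bonus for reaching the target
    return reward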

Prebuilt Unity environments

  • Contains Plane and Curved walker environments for Linux / Mac / Windows!
  • Linux headless envs are also provided for faster training and server-side training.
  • Download the corresponding environments, unzip, and put them in the pg_travel/unity/env folder.

3. Train

Navigate to the pg_travel/unity folder

Basic Usage

Train the walker agents with PPO using the Plane environment without rendering.

python main.py --train
  • The PPO implementation supports multi-agent training: experiences are collected from multiple agents and used to train the global policy and value networks (the brain), as sketched below. Refer to pg_travel/mujoco/agent/ppo_gae.py for single-agent training.
  • See the arguments in main.py. You can change the hyperparameters for the PPO algorithm, the network architecture, etc.
  • Note that models are automatically saved in the save_model folder every 100 iterations.
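
A rough outline of that multi-agent collection scheme is sketched below; the environment interface and helper names are assumptions, not the repository's actual classes.

import collections

def collect_and_update(env, actor, critic, num_agents, steps_per_iter, ppo_update):
    # Hypothetical outline: pool transitions from all agents, then update the
    # global policy/value networks (the brain) once per iteration.
    memory = collections.deque()
    states = env.reset()                              # one state per agent
    for _ in range(steps_per_iter):
        actions = [actor(s) for s in states]          # global policy acts for every agent
        next_states, rewards, dones = env.step(actions)
        for i in range(num_agents):                   # every agent feeds the same memory
            memory.append((states[i], actions[i], rewards[i], dones[i]))
        states = next_states
    ppo_update(actor, critic, memory)                 # single update from the pooled batch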

Continue training from the saved checkpoint

python main.py --load_model ckpt_736.pth.tar --train
  • Note that the ckpt_736.pth.tar file should be in the pg_travel/unity/save_model folder.

Test the pretrained model

python main.py --render --load_model ckpt_736.pth.tar
  • Note that the ckpt_736.pth.tar file should be in the pg_travel/unity/save_model folder.

Modify the hyperparameters

See main.py for default hyperparameter settings. Pass the hyperparameter arguments according to your preference.

4. Tensorboard

We have integrated TensorboardX to monitor training progress.

Navigate to the pg_travel/unity folder

tensorboard --logdir logs

5. Trained Agent

We have trained the agents with PPO using the Plane and Curved environments.

(GIFs of the trained agents in the Plane and Curved environments are shown in the original repository.)

Reference

We referenced the code from the repositories below.

pg_travel's People

Contributors

dnddnjs, dongminlee94, hyeokreal, pz1004, rrbb014


pg_travel's Issues

Build an A2C-style PPO agent to improve training speed and performance

Collecting samples with a single actor-runner seems to make training slow. Also, the quality of the resulting policy is considerably lower than that of an agent trained with multiple actor-runners, so we should train with multiple actor-runners. The following steps should work:

  1. Build an environment with multiple actor-runners
  2. Store the samples from each actor-runner in its own memory
  3. Compute GAE separately from each memory
  4. Compute gradients from each memory, then average them to update the actor and critic (see the sketch below)

The other tasks depend on this one, so it would be great if you could put it together as quickly as possible.
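
A minimal PyTorch sketch of step 4 is given below; the losses and optimizer are placeholders, not this repository's code.

def averaged_gradient_step(losses_per_memory, optimizer):
    # losses_per_memory: one scalar loss tensor per actor-runner memory.
    # Accumulating each loss scaled by 1/N averages the gradients before the step.
    optimizer.zero_grad()
    n = len(losses_per_memory)
    for loss in losses_per_memory:
        (loss / n).backward()
    optimizer.step()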

Quick question about environment normalization

Hello,

I'm planning to use your PPO implementations, which seem well-written, clear and easy to understand. But first, I'd like to have the answer to the following question:

In OpenAI baselines, environments are passed to various classes, such as VecNormalize, Observation/Reward wrappers, or even Monitor. In these cases, observations and rewards are transformed in order to ease learning. However, there is a lot of encapsulation, and it makes it somewhat difficult to follow the chain. After a quick glance at your implementations, I'm under the impression that you transform the observations in unity/utils/running_state.py. Is that so? Are there other transformations? Or were you just careful while designing the environment, making sure rewards were appropriately scaled?

Thanks a lot for your answers.
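
(For context, running_state.py-style observation normalization is usually a running mean/std filter applied per observation dimension; the sketch below is a generic version of that idea, not the repository's exact code.)

import numpy as np

class RunningObsFilter:
    # Generic running mean/std observation filter (Welford-style updates).
    def __init__(self, obs_dim, clip=5.0):
        self.n = 0
        self.mean = np.zeros(obs_dim)
        self.m2 = np.zeros(obs_dim)   # running sum of squared deviations
        self.clip = clip

    def __call__(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        self.n += 1
        delta = obs - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (obs - self.mean)
        std = np.maximum(np.sqrt(self.m2 / max(self.n - 1, 1)), 1e-8)
        return np.clip((obs - self.mean) / std, -self.clip, self.clip)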

PPO Model RuntimeError

When I run main.py, I encounter this problem:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [64, 1]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

How can I solve this? Thanks.
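
(For context: this class of error means a tensor that autograd saved for the backward pass was later modified in place. A minimal reproduction, unrelated to this repository's specific code, is shown below; the usual fix is to replace the in-place op, e.g. y = y + 1 instead of y += 1, or to clone the tensor before modifying it.)

import torch

x = torch.ones(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid saves its output for the backward pass
y += 1                 # in-place edit bumps the saved tensor's version counter
y.sum().backward()     # raises the "modified by an inplace operation" RuntimeError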

Train an agent with PPO in the Pyramid environment

I think the following steps should work.
Ask for help anytime you need it!

  • Compile the Pyramid environment and test it (inspect the states, rewards, etc.)
  • Train it with the existing PPO algorithm
  • Add curiosity and train it again

When I install mujoco and import mujoco_py, I get a problem

I successfully installed mujoco, but when I import it, I get this error:

--
PermissionError Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/lockfile/linklockfile.py in acquire(self, timeout)
18 try:
---> 19 open(self.unique_name, "wb").close()
20 except IOError:

PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/mujoco_py-1.50.1.59-py3.5.egg/mujoco_py/generated/wonchul-60572700.8099-2917094554558463988'

I followed everything you mentioned.

Could you help me?

Train the agent in a sloped environment

Use the PPO agent trained on the flat environment as the baseline for this training.
For the environment, get help from 민규식 if possible.
The split below is rough, so please discuss it between the two of you as you go.
Please leave updates on your progress in this issue along the way!

Add a README and code comments

The README should include the following:

  1. The project goals
  2. A brief installation guide for each environment (explaining it for Linux would probably be best)
  3. A description of each algorithm
  4. The training results for each environment
  5. A guide for training and testing
  6. The referenced repositories

Add code comments to the parts of the algorithms that are hard to understand without them.

Add various settings so the code can run on a server

The current Unity PPO code runs on a local laptop (CPU only), but unlike mujoco the state and action spaces are large, so it needs to run on a server with a GPU. Moreover, PPO is an algorithm that can make better use of a GPU than TRPO.

Therefore we need to:

  1. Make the hyper-params configurable with argparse (sketched below)
  2. Use tensorboardX to monitor training progress
  3. Periodically save training results as videos
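
A minimal sketch of item 1 is shown below; the argument names are illustrative, not necessarily the flags main.py ended up with.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--gamma', type=float, default=0.99, help='discount factor')
parser.add_argument('--actor_lr', type=float, default=3e-4, help='actor learning rate')
parser.add_argument('--critic_lr', type=float, default=3e-4, help='critic learning rate')
parser.add_argument('--batch_size', type=int, default=64, help='minibatch size')
args = parser.parse_args()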

Why is the log standard deviation fixed to 0?

I see that the actor-critic model (model.py) outputs mu and logstd. In the code, logstd is fixed to 0 by defining it as logstd = torch.zeros_like(mu), which fixes the standard deviation to 1. But as far as I know, logstd should also be learned by the network (in that case logstd would be the output of some layer). Is there any reason for this behavior?
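
For comparison, a learnable log standard deviation is usually implemented either as a state-independent nn.Parameter or as an extra output head; a minimal sketch of the first option (not the repository's model.py) looks like this:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    # Actor that outputs mu and a learnable, state-independent log std.
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # Learned log std, shared across states
        # (contrast with logstd = torch.zeros_like(mu) in the question above).
        self.logstd = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mu = self.net(obs)
        return mu, self.logstd.expand_as(mu)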
