needle

Introduction

This is my personal implementation of several reinforcement learning algorithms, some of which are cutting edge, including:

  1. Deep Q-Network (DQN)
  2. Deep Deterministic Policy Gradient (DDPG)
  3. Asynchronous Advantage Actor-Critic (A3C)
  4. REINFORCE
  5. Truncated Natural Policy Gradient (TNPG) (I may have cited the wrong paper, since it doesn't use Conjugate Gradient to solve the equations)
  6. Trust Region Policy Optimization (TRPO) with Generalized Advantage Estimation (GAE); a sketch of GAE follows this list
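
For reference, here is a minimal NumPy sketch of Generalized Advantage Estimation on a single finished trajectory; the function name, signature, and the default gamma/lambda values are illustrative and may not match the library's actual implementation:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    `rewards` has length T; `values` has length T + 1 (the last entry is the
    bootstrap value of the final state, 0 if the episode terminated)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # discounted sum of residuals
        advantages[t] = running
    return advantages
```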

Some optimizations are included: Double DQN is implemented in place of the vanilla DQN update, and prioritized sampling is currently under development.
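
As a rough illustration of the difference (not the library's actual code; the array names below are made up), the Double DQN target uses the online network to choose the next action and the target network to evaluate it:

```python
import numpy as np

def q_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Compute one-step targets for a batch of transitions.

    q_online_next / q_target_next: Q-values of the next states under the
    online and target networks, shape (batch, num_actions)."""
    not_done = 1.0 - dones
    # Vanilla DQN: the target network both selects and evaluates the action.
    vanilla = rewards + gamma * not_done * q_target_next.max(axis=1)
    # Double DQN: the online network selects, the target network evaluates.
    best = q_online_next.argmax(axis=1)
    double = rewards + gamma * not_done * q_target_next[np.arange(len(rewards)), best]
    return vanilla, double
```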

The library is inspired by the paper Benchmarking Deep Reinforcement Learning for Continuous Control, whose home page is here. If you find duplicated code, that's my fault; I promise, however, that I wrote every line of code myself.

Much of the code is ad hoc and needs refactoring. Issues and discussions are always appreciated.

Tests

I developed the library on an ancient MacBook Air (Mid 2013, i5 with 4 GB RAM) without a GPU, so you should have no problem running any of these toy experiments.

Only a few examples are available right now because of remaining bugs, but DDPG should succeed. All code depends on OpenAI Gym and TensorFlow, so please install both before running any experiments.
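
For orientation, here is a generic rollout loop using the classic Gym API; it is not this repo's code (the agents launched from main.py presumably wrap a similar loop), and the random policy below stands in for a trained agent:

```python
import gym

# Classic Gym API: reset() -> observation, step(a) -> (observation, reward, done, info).
env = gym.make("CartPole-v0")
observation = env.reset()
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()  # placeholder for a trained agent's action
    observation, reward, done, info = env.step(action)
    episode_return += reward
print("episode return:", episode_return)
```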

Example commands:

python main.py --mode train --agent DDPG      --env MountainCarContinuous-v0
python main.py --mode train --agent REINFORCE --env CartPole-v0               --batch_size 10 --iterations 8000 --learning_rate 0.1
python main.py --mode train --agent A2C       --env CartPole-v0               --replay_buffer_size 200 --batch_size 200
python main.py --mode train --agent A2C       --env Copy-v0                   --replay_buffer_size 200 --batch_size 200 --iterations 6000
python main.py --mode train --agent TNPG      --env Copy-v0                   --batch_size 10 --iterations 8000
python main.py --mode train --agent TRPO      --env Copy-v0                   --batch_size 10 --iterations 8000

Notes

An experimental A2C (synchronous advantage actor-critic) agent runs on CartPole-v0. Note that A2C uses an LSTM by default.

A2C on `Copy-v0` succeeds with probability about 0.7 after 4k-6k steps; otherwise it gets stuck in a local minimum where, for some specific characters, the agent always moves left. I find that using a small learning rate for the actor helps it reach the global minimum.

TNPG sometimes solves `Copy-v0` in ~1k steps. More experiments are needed.

Question: can we combine TNPG and A3C with an LSTM? The actor and critic networks share many weights, so how should we apply a suitable gradient to them? (A sketch of the usual shared-loss approach follows.)
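
For the non-trust-region part of that question, the usual A3C/A2C answer is to minimize a single weighted sum of the actor and critic losses (plus an entropy bonus) over the shared parameters. Below is a minimal NumPy sketch of those loss terms for a discrete softmax policy; the function name and coefficients are illustrative and not taken from this repo:

```python
import numpy as np

def actor_critic_loss(logits, values, actions, returns,
                      value_coef=0.5, entropy_coef=0.01):
    """Combined loss whose gradient is applied to the shared weights."""
    # Softmax over action logits, shape (batch, num_actions).
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    log_pi = np.log(probs[np.arange(len(actions)), actions])
    advantages = returns - values
    policy_loss = -(log_pi * advantages).mean()          # actor term
    value_loss = 0.5 * ((returns - values) ** 2).mean()  # critic term
    entropy = -(probs * np.log(probs + 1e-8)).sum(axis=1).mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```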

5 independent runs of TNPG (batch size 10, delta_KL = 0.001):

./doc/TNPG.png
