wisnunugroho21 / reinforcement_learning_ppo_rnd

Deep Reinforcement Learning using Proximal Policy Optimization and Random Network Distillation in TensorFlow 2 and PyTorch, with some explanation

License: GNU General Public License v3.0

Python 100.00%
reinforcement-learning gym pytorch ppo-rnd proximal-policy-optimization random-network-distillation cartpole-v0 frozenlake-v0 frozenlake-not-slippery deep-reinforcement-learning

reinforcement_learning_ppo_rnd's Introduction

PPO-RND

Simple code demonstrating Deep Reinforcement Learning using Proximal Policy Optimization and Random Network Distillation in TensorFlow 2 and PyTorch

Version 2 and Other Progress

Version 2 brings improvements in code quality and performance. I refactored the code so that it follows the PPO implementation in OpenAI's Baselines. I also use a newer variant of PPO called Truly PPO, which has better sample efficiency and performance than OpenAI's PPO. Currently, I am focused on applying this project to more difficult environments (Atari games, MuJoCo, etc.).

  • Use PyTorch and TensorFlow 2
  • Clean up the code
  • Use Truly PPO
  • Add more complex environment
  • Add more explanation

Getting Started

This project uses PyTorch and TensorFlow 2 as the deep learning frameworks and Gym for the reinforcement learning environments.
Although it's not required, I recommend running this project on a PC with a GPU and 8 GB of RAM.

Prerequisites

Make sure you have installed Gym and a deep learning framework (PyTorch or TensorFlow 2).

  • Click here to install gym

You can use either PyTorch or TensorFlow 2:

  • Click here to install pytorch
  • Click here to install tensorflow 2

Installing

Just clone this project into your working folder:

git clone https://github.com/wisnunugroho21/reinforcement_learning_ppo_rnd.git

Running the project

After you clone the project, run the following commands in cmd/terminal:

PyTorch version

cd reinforcement_learning_ppo_rnd/PPO_RND/pytorch
python ppo_rnd_frozen_notslippery_pytorch.py

TensorFlow 2 version

cd reinforcement_learning_ppo_rnd/PPO_RND/'tensorflow 2'
python ppo_frozenlake_notslippery_tensorflow.py

Proximal Policy Optimization

PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO.

There are two primary variants of PPO: PPO-Penalty and PPO-Clip.

  • PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it’s scaled appropriately.

  • PPO-Clip doesn’t have a KL-divergence term in the objective and doesn’t have a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to move far from the old policy.

OpenAI uses PPO-Clip.
You can read the full details of PPO here.
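
To make the clipped objective concrete, below is a minimal sketch of the PPO-Clip surrogate loss in PyTorch. The function name and tensor arguments (logprobs, old_logprobs, advantages) are illustrative assumptions, not the exact code used in this repository.

import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space
    ratios = torch.exp(logprobs - old_logprobs.detach())
    # Unclipped and clipped surrogate objectives
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (minimum) bound, negated so it can be minimized by gradient descent
    return -torch.min(surr1, surr2).mean()

The clipping removes the incentive to push the ratio outside [1 - eps, 1 + eps], which is what keeps the new policy close to the old one without a KL term.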

Random Network Distillation

Random Network Distillation (RND) is a prediction-based method for encouraging reinforcement learning agents to explore their environments through curiosity, and it is the first such method to exceed average human performance on Montezuma’s Revenge. RND achieves state-of-the-art performance, periodically finds all 24 rooms, and solves the first level without using demonstrations or having access to the underlying state of the game.

RND incentivizes visiting unfamiliar states by measuring how hard it is to predict the output of a fixed random neural network on visited states. In unfamiliar states it’s hard to guess the output, and hence the reward is high. It can be applied to any reinforcement learning algorithm, is simple to implement and efficient to scale.

You can read the full details of RND here.
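
As an illustration of the mechanism described above, here is a minimal PyTorch sketch of an RND module, assuming small fully connected networks; the layer sizes and names are hypothetical.

import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, state_dim, feature_dim=64):
        super().__init__()
        # Fixed, randomly initialized target network (never trained)
        self.target = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                    nn.Linear(64, feature_dim))
        # Predictor network trained to match the target's output
        self.predictor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                       nn.Linear(64, feature_dim))
        for p in self.target.parameters():
            p.requires_grad = False

    def intrinsic_reward(self, state):
        # Prediction error is large on unfamiliar states, so it acts as a
        # curiosity bonus; the same error is used as the predictor's loss.
        with torch.no_grad():
            target_features = self.target(state)
        return (self.predictor(state) - target_features).pow(2).mean(dim=-1)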

Truly Proximal Policy Optimization

Proximal policy optimization (PPO) is one of the most successful deep reinforcement learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from fully understood. In this paper, we show that PPO can neither strictly restrict the likelihood ratio as it attempts to do nor enforce a well-defined trust region constraint, which means that it may still suffer from the risk of performance instability. To address this issue, we present an enhanced PPO method, named Truly PPO. Two critical improvements are made in our method: 1) it adopts a new clipping function that supports a rollback behavior to restrict the difference between the new policy and the old one; 2) the triggering condition for clipping is replaced with a trust-region-based one, so that optimizing the resulting surrogate objective provides guaranteed monotonic improvement of the ultimate policy performance. By adhering more truly to making the algorithm proximal, that is, confining the policy within the trust region, the new algorithm improves on the original PPO in both sample efficiency and performance.

You can read the full details of Truly PPO here.
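
The following is a minimal sketch of a Truly-PPO-style policy objective in PyTorch, based on my reading of the paper: the plain surrogate is replaced with a KL rollback term only when the per-sample KL divergence exceeds a threshold and the update keeps pushing past the old policy. The hyperparameter names (kl_range, rollback_alpha) and the function itself are illustrative, not necessarily the exact formulation used in this repository.

import torch

def truly_ppo_loss(ratios, kl, advantages, kl_range=0.03, rollback_alpha=5.0):
    # Ordinary surrogate objective r * A
    surrogate = ratios * advantages
    # Rollback term penalizes the KL divergence from the old policy
    rollback = surrogate - rollback_alpha * kl
    # Trigger the rollback only when the trust region is violated and the
    # step still moves the objective above the old policy's value
    use_rollback = (kl >= kl_range) & (surrogate > advantages)
    return -torch.where(use_rollback, rollback, surrogate).mean()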

Result

LunarLander using PPO (Non RND)

[Result GIF and reward progress graph]

Bipedal using PPO (Non RND)

[Result GIF]

Pendulum using PPO (Non RND)

[Result GIF and reward progress graph]

Pong using PPO (Non RND)

[Result GIF]

Contributing

This project is far from finished and will keep being improved. Any fixes, contributions, or ideas would be very much appreciated.

reinforcement_learning_ppo_rnd's People

Contributors

wisnunugroho21


reinforcement_learning_ppo_rnd's Issues

When will RND_predictor_Model().lets_init_weights be called?

I noticed that you have written the initialization code for RND_predictor_Model, but I failed to find where and when it is called.

class RND_predictor_Model(nn.Module):
    def init_state_predict_weights(self, m):
        for name, param in m.named_parameters():
            if 'bias' in name:
                nn.init.constant_(param, 0.01)
            elif 'weight' in name:
                nn.init.constant_(param, 1)
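
For reference, an initializer like this is usually invoked explicitly, for example via nn.Module.apply inside __init__. The snippet below is only a hypothetical illustration of such a call, not the repository's actual code.

import torch.nn as nn

class RND_predictor_Model(nn.Module):
    def __init__(self, state_dim, feature_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feature_dim))
        # apply() walks every submodule, so the initializer actually runs;
        # without a call like this, the method is defined but never used.
        self.apply(self.init_state_predict_weights)

    def init_state_predict_weights(self, m):
        for name, param in m.named_parameters():
            if 'bias' in name:
                nn.init.constant_(param, 0.01)
            elif 'weight' in name:
                nn.init.constant_(param, 1)

    def forward(self, states):
        return self.net(states)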

Entropy calculation not useful

Describe the bug
In ppo_continous_tensorflow.py, when you calculate entropy with:
dist_entropy = tf.math.reduce_mean(self.distributions.entropy(action_mean, self.std))
since the entropy depends only on std, and std is a static parameter, dist_entropy always has the same value.
Thus, the entropy loss has no effect on learning.

To Reproduce
Launch any env and stop your debugger on dist_entropy. Check that it has the same value for every batch at any given point during learning.

Expected behavior
The std should not be static; it should somehow represent the real prediction confidence of the network.
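
One possible remedy, sketched below in TensorFlow 2 under the assumption of a simple Gaussian actor head (the class and layer names are hypothetical), is to make the log standard deviation a trainable variable so the entropy can actually change during training.

import tensorflow as tf

class GaussianActor(tf.keras.Model):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(64, activation='relu')
        self.mean = tf.keras.layers.Dense(action_dim, activation='tanh')
        # Trainable log-std instead of a static constant, so the entropy
        # term contributes a meaningful gradient to the loss
        self.log_std = tf.Variable(tf.zeros(action_dim), trainable=True)

    def call(self, states):
        x = self.hidden(states)
        return self.mean(x), tf.exp(self.log_std)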

Testing error

Hi, thanks for the repo!
I get an error while running test episodes:

Traceback (most recent call last):
  File "ppo_frozenlake_notslippery_tensorflow.py", line 640, in <module>
    main()
  File "ppo_frozenlake_notslippery_tensorflow.py", line 588, in main
    agent.update_ppo()          
  File "ppo_frozenlake_notslippery_tensorflow.py", line 407, in update_ppo
    for states, actions, rewards, dones, next_states in self.memory.get_all_tensor().batch(batch_size):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 1642, in batch
    return BatchDataset(self, batch_size, drop_remainder)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 4115, in __init__
    **self._flat_structure)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 596, in batch_dataset_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 6897, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Batch size must be greater than zero. [Op:BatchDatasetV2]

How can I use "ppo_rnd_tensorflow.py" to train BipedalWalker and LunarLander?

Is your feature request related to a problem? Please describe.
I really appreciate your coding because it helped me a lot.

I am just a beginner in RL, and I wonder if I can use ppo_rnd_tensorflow.py to train BipedalWalker and LunarLander by filling in some gaps about the environment.

But I am in China now, and it is really slow to download your code because of some knock-on effects of COVID-19. So I wonder if you have tried it before? I have noticed your Results folder, where you note that those results are non-RND.

Describe the solution you'd like
I am really looking forward to your reply on whether you think this is reasonable. If so, I will try it once my community network recovers from COVID-19.

Additional context
Thank you a lot!

RND_epochs

Hi
Can I use more than 5 rnd_epoch for training, or is it a fixed constant?
Thanks!
