keiohta / tf2rl

TensorFlow2 Reinforcement Learning

License: MIT License

Python 99.26% Shell 0.60% Dockerfile 0.13%
reinforcement-learning tensorflow2 inverse-reinforcement-learning tensorflow imitation-learning deep-reinforcement-learning

tf2rl's Introduction


TF2RL

TF2RL is a deep reinforcement learning library that implements various algorithms using TensorFlow 2.x.

1. Algorithms

The following algorithms are supported:

Algorithm | Discrete action | Continuous action | Support | Category
--- | --- | --- | --- | ---
VPG, PPO | ✓ | ✓ | GAE | Model-free On-policy RL
DQN (including DDQN, Prior. DQN, Duel. DQN, Distrib. DQN, Noisy DQN) | ✓ | - | ApeX | Model-free Off-policy RL
DDPG (including TD3, BiResDDPG) | - | ✓ | ApeX | Model-free Off-policy RL
SAC | ✓ | ✓ | ApeX | Model-free Off-policy RL
CURL, SAC-AE | - | ✓ | - | Model-free Off-policy RL
MPC, ME-TRPO | - | ✓ |  | Model-based RL
GAIL, GAIfO, VAIL (including Spectral Normalization) | ✓ | ✓ | - | Imitation Learning

The following papers have been implemented in tf2rl:

Also, some useful techniques are implemented:

2. Installation

There are several ways to install tf2rl. The recommended way is "2.1 Install from PyPI".

If TensorFlow is already installed, the installer tries to identify the best-matching version of TensorFlow Probability.

2.1 Install from PyPI

You can install tf2rl from PyPI:

$ pip install tf2rl

2.2 Install from Source Code

You can also install from source:

$ git clone https://github.com/keiohta/tf2rl.git tf2rl
$ cd tf2rl
$ pip install .

2.3 Preinstalled Docker Container

Instead of installing tf2rl on your (virtual) system, you can use preinstalled Docker containers.

Only the first execution takes extra time, to download the container image.

In the following commands, replace <version> with the version tag you want to use.

2.3.1 CPU Only

The following command starts the preinstalled container:

$ docker run -it ghcr.io/keiohta/tf2rl/cpu:<version> bash

If you also want to mount your local directory /local/dir/path at /mount/point inside the container:

$ docker run -it -v /local/dir/path:/mount/point ghcr.io/keiohta/tf2rl/cpu:<version> bash

2.3.2 GPU Support (Linux Only, Experimental)

WARNING: We encountered unsolved errors when running ApeX multiprocess learning.

Requirements

  • Linux
  • NVIDIA GPU
    • TF2.2 compatible driver
  • Docker 19.03 or later

The following command starts the preinstalled container:

$ docker run --gpus all -it ghcr.io/keiohta/tf2rl/nvidia:<version> bash

If you also want to mount your local directory /local/dir/path at /mount/point inside the container:

$ docker run --gpus all -it -v /local/dir/path:/mount/point ghcr.io/keiohta/tf2rl/nvidia:<version> bash

To check that the container can see the GPU correctly, run the following command inside the container:

$ nvidia-smi

3. Getting started

Here is a quick example of how to train a DDPG agent on the Pendulum environment:

import gym
from tf2rl.algos.ddpg import DDPG
from tf2rl.experiments.trainer import Trainer


parser = Trainer.get_argument()
parser = DDPG.get_argument(parser)
args = parser.parse_args()

env = gym.make("Pendulum-v1")
test_env = gym.make("Pendulum-v1")
policy = DDPG(
    state_shape=env.observation_space.shape,
    action_dim=env.action_space.high.size,
    gpu=-1,  # Run on CPU. If you want to run on GPU, specify GPU number
    memory_capacity=10000,
    max_action=env.action_space.high[0],
    batch_size=32,
    n_warmup=500)
trainer = Trainer(policy, env, args, test_env=test_env)
trainer()
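
Once training finishes, the learned policy can be rolled out directly. A minimal sketch, assuming the policy object from the snippet above, the classic gym reset/step API used throughout this README, and that policy.get_action accepts a test flag for deterministic evaluation:

import gym

env = gym.make("Pendulum-v1")
obs = env.reset()
done = False
episode_return = 0.0
while not done:
    # Query the trained actor for an action (deterministic at test time).
    action = policy.get_action(obs, test=True)
    obs, reward, done, _ = env.step(action)
    episode_return += reward
print("Episode return:", episode_return)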

You can find the implemented algorithms in examples. For example, to train a DDPG agent:

# You must change directory to avoid importing local files
$ cd examples
# For available options, pass --help or read the code
$ python run_ddpg.py [options]

You can see the training progress/results from TensorBoard as follows:

# When executing `run_**.py`, its logs are automatically generated under `./results`
$ tensorboard --logdir results

4. Usage

For basic usage, all you need to do is initialize one of the policy classes and the Trainer class.

As an option, tf2rl also supports a command-line program style, so you can pass configuration parameters as command-line arguments.

4.1 Command Line Program Style

The Trainer class and the policy classes provide the class method get_argument, which creates or updates an ArgumentParser object.

You can parse the command-line arguments with the parser's parse_args method, which returns a Namespace object.

The policy's constructor options can be extracted from the Namespace object explicitly, while the Trainer constructor accepts the Namespace object directly.

from tf2rl.algos.dqn import DQN
from tf2rl.experiments.trainer import Trainer

env = ... # Create gym.env like environment.

parser = DQN.get_argument(Trainer.get_argument())
args = parser.parse_args()

policy = DQN(enable_double_dqn=args.enable_double_dqn,
             enable_dueling_dqn=args.enable_dueling_dqn,
             enable_noisy_dqn=args.enable_noisy_dqn)
trainer = Trainer(policy, env, args)
trainer()

4.2 Non Command Line Program Style (e.g. on Jupyter Notebook)

An ArgumentParser does not fit well in a Jupyter Notebook-like environment. Instead of a Namespace object, the Trainer constructor can accept a dict as its args argument.

from tf2rl.algos.dqn import DQN
from tf2rl.experiments.trainer import Trainer

env = ... # Create gym.env like environment.

policy = DQN( ... )
trainer = Trainer(policy, env, {"max_steps": int(1e+6), ... })
trainer()
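
For a concrete version of the sketch above, here is what it might look like with a CartPole environment; this is only an illustration, and the DQN constructor arguments mirror the DDPG example from Section 3 rather than an exhaustive list:

import gym
from tf2rl.algos.dqn import DQN
from tf2rl.experiments.trainer import Trainer

env = gym.make("CartPole-v0")
test_env = gym.make("CartPole-v0")

policy = DQN(
    state_shape=env.observation_space.shape,
    action_dim=env.action_space.n,
    gpu=-1,  # Run on CPU
    memory_capacity=10000,
    batch_size=32,
    n_warmup=500)
# A plain dict replaces the argparse Namespace.
trainer = Trainer(policy, env, {"max_steps": int(1e+6)}, test_env=test_env)
trainer()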

4.3 Results

The Trainer class saves logs and models under <logdir>/%Y%m%dT%H%M%S.%f. The default logdir is "results"; it can be changed with the --logdir command-line argument or the "logdir" key in the constructor args, as sketched below.
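
For example, in the dict-args style described in Section 4.2, the same key can be set directly (sketch; the directory name is arbitrary):

trainer = Trainer(policy, env, {"max_steps": int(1e+6), "logdir": "results/my_experiment"})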

5. Citation

@misc{ota2020tf2rl,
  author = {Kei Ota},
  title = {TF2RL},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keiohta/tf2rl/}}
}

tf2rl's People

Contributors

estshorter, keiohta, sff1019, ymd-h


tf2rl's Issues

which version is this Discrete SAC based upon

Hi, may I know which paper(s)/method the discrete SAC is based upon?

From my understanding, there are 3 main implementations of discrete SAC:

  • Gumbel Softmax
  • KL Divergence
  • Petros Christodoulou's

but they all include the automatic entropy tuning/temperature term. I may have missed it, but I don't see it in your version of the code. Thanks for your time!
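
For reference, the automatic temperature tuning mentioned above typically optimizes the SAC temperature objective; the following is a generic discrete-action sketch, not tf2rl's code, and all names and values are illustrative:

import tensorflow as tf

log_alpha = tf.Variable(0.0, dtype=tf.float32)
# Common heuristic for discrete SAC: a fraction of the maximum entropy log|A|.
target_entropy = 0.98 * tf.math.log(4.0)  # assuming 4 discrete actions
alpha_optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)

def update_temperature(action_probs, log_action_probs):
    # action_probs/log_action_probs: [batch, n_actions] from the categorical policy.
    neg_entropy = tf.reduce_sum(action_probs * log_action_probs, axis=-1)  # = -H(pi(.|s))
    with tf.GradientTape() as tape:
        # Increase alpha when entropy falls below the target, decrease it otherwise.
        alpha_loss = -tf.reduce_mean(
            log_alpha * tf.stop_gradient(neg_entropy + target_entropy))
    grads = tape.gradient(alpha_loss, [log_alpha])
    alpha_optimizer.apply_gradients(zip(grads, [log_alpha]))
    return tf.exp(log_alpha)  # current temperature alpha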

ApeX: Maximize GPU usage by parallelizing environments

Generally, RL collects experience by interacting with an environment, which requires generating an action from the current policy network for every transition.
However, computing one action at a time is not computationally efficient, because the input to the neural network is a batch of size one, so there is room to improve efficiency by increasing the batch size.
This issue proposes preparing multiple environments and stepping all of them forward with a single batched action computation, as sketched below.
Note that this only helps for environments that do not hold Python's GIL, because a single Python process cannot otherwise run them in parallel.
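
A rough sketch of the idea (illustrative only, not tf2rl's ApeX code; it assumes the policy's get_action accepts a batch of observations, and make_env is a placeholder factory):

import numpy as np

num_envs, num_steps = 8, 1000  # illustrative values
envs = [make_env() for _ in range(num_envs)]
observations = np.stack([env.reset() for env in envs])

for _ in range(num_steps):
    # One forward pass produces actions for all environments at once.
    actions = policy.get_action(observations)
    results = [env.step(a) for env, a in zip(envs, actions)]
    next_obs, rewards, dones, _ = map(np.array, zip(*results))
    # ... store the transitions in the replay buffer here ...
    observations = np.stack([env.reset() if d else o
                             for env, o, d in zip(envs, next_obs, dones)])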

Adding noise to action in DDPG implementation

Hi,

I noticed another thing. In the DDPG implementation, the method get_action() seems to accidentally not add noise to the action during training, while adding it during testing. Here's the exact line that I think is problematic:

tf.constant(state), self.sigma * test, tf.constant(self.actor.max_action, dtype=tf.float32))

As per the pseudocode in the original DDPG paper (page 5), noise is explicitly added during training; I'm assuming the actor's action is then used directly during testing.
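
For comparison, the behaviour described in the paper corresponds to something like the following; this is a generic sketch rather than the repository's code, with sigma and max_action following the names quoted above and actor assumed to be a Keras model:

import numpy as np

def get_action_with_noise(actor, state, sigma, max_action, test=False):
    # Deterministic actor output for a single state.
    action = actor(np.expand_dims(state, axis=0).astype(np.float32)).numpy()[0]
    if not test:
        # Exploration noise is added during training only, as in the DDPG pseudocode.
        action += np.random.normal(0.0, sigma, size=action.shape)
    return np.clip(action, -max_action, max_action)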

cpprb will break its api in version 8

The dependency cpprb is scheduled to break its API in version 8.
(ReplayBuffers in the cpprb namespace will finally be replaced with those in cpprb.experimental.)

To prepare for the migration, the current version of tf2rl should pin cpprb to version 7 in setup.py, for example as sketched below.
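
A pin of that form might look like the following in setup.py (sketch; the exact version bound is up to the maintainers):

from setuptools import setup, find_packages

setup(
    name="tf2rl",
    packages=find_packages(),
    install_requires=[
        "cpprb>=7,<8",  # stay on the version-7 API until the migration to cpprb 8
        # ... other dependencies ...
    ],
)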

Replace all float64 operations with float32 operations

So far, all operations use float64 (np.float64 and tf.float64) for compatibility with cpprb, but cpprb now experimentally supports arbitrary data types, so all float64 operations should be replaced with float32 to accelerate computation.
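
With the per-key dtype support in recent cpprb, this would look roughly as follows (a sketch assuming the env_dict-style constructor and a small continuous-control task; shapes are illustrative):

import numpy as np
from cpprb import ReplayBuffer

# Store everything as float32 instead of the float64 default.
rb = ReplayBuffer(
    int(1e+6),
    env_dict={
        "obs": {"shape": (3,), "dtype": np.float32},
        "act": {"shape": (1,), "dtype": np.float32},
        "rew": {"dtype": np.float32},
        "next_obs": {"shape": (3,), "dtype": np.float32},
        "done": {"dtype": np.float32},
    })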

Visualize game learning process on TensorBoard

Visualize the final frame of an episode so that users can follow the game progress.
Call the following function to write the capture to TensorBoard.

tf.contrib.summary.image('train/input_img', tf.cast(image * 255.0, tf.uint8))
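
In TF2 without tf.contrib, the equivalent call would be roughly the following (sketch; image is assumed to be a batch of HxWxC frames scaled to [0, 1], and global_step a plain integer):

import tensorflow as tf

writer = tf.summary.create_file_writer("results/images")
with writer.as_default():
    # The data must be a 4-D tensor shaped [batch, height, width, channels].
    tf.summary.image("train/input_img",
                     tf.cast(image * 255.0, tf.uint8),
                     step=global_step)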

Support gin-config

Currently users need to specify hyperparameters by passing command-line arguments or via set_defaults in examples/run_*.py, which is tedious to do for each algorithm/environment. So, use gin to specify initial values, especially for reproducing the results of papers.
https://github.com/google/gin-config
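
Usage would look roughly like the following (sketch based on gin-config's documented API; the config file name, bindings, and constructor values are hypothetical):

import gin
from tf2rl.algos.dqn import DQN

# Register an externally-defined class so its kwargs can be bound from a .gin file.
DQN = gin.external_configurable(DQN)

# A hypothetical dqn_cartpole.gin could then contain lines such as:
#   DQN.lr = 0.001
#   DQN.enable_double_dqn = True

gin.parse_config_file("dqn_cartpole.gin")
policy = DQN(state_shape=(4,), action_dim=2)  # gin fills in the bound parameters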

[Feature] Support new feature of cpprb

tf2rl has not kept up with recent cpprb development.

cpprb >= 7.14 added an N-step feature in the new experimental package.
cpprb >= 8.0 will finally replace the stable code with the experimental one.

Write detailed agent types to README

Agent types can be classified as discrete or continuous.
Some further details should also be documented, such as recurrent outputs.

Using TensorFlow global time step makes main loop slower

  • Measured the time spent in the main loop using line_profiler
  • About 16% is spent on tf.train.create_global_step() related operations:
    • 4.0% while total_steps < self._max_steps:
    • 4.8% if total_steps >= self._policy.n_warmup:
    • 4.8% if total_steps >= self._policy.n_warmup:
    • 2.6% if int(total_steps) % self._model_save_interval == 0:
$ git checkout cff2d42ae73b7ddaa050853b3359a78ada06929a
$ python examples/run_dqn_line_profiler.py --max-steps=10000
...
File: /Users/keiohta/workspace/rl/tf2rl/tf2rl/experiments/trainer.py
Function: call at line 50

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    50                                               def call(self):
    51         1       3666.0   3666.0      0.0          total_steps = tf.train.create_global_step()
    52         1          2.0      2.0      0.0          episode_steps = 0
    53         1          0.0      0.0      0.0          episode_return = 0
    54         1          2.0      2.0      0.0          episode_start_time = time.time()
    55         1          1.0      1.0      0.0          n_episode = 0
    56                                           
    57         1          1.0      1.0      0.0          replay_buffer = get_replay_buffer(
    58         1          1.0      1.0      0.0              self._policy, self._env, self._use_prioritized_rb,
    59         1        762.0    762.0      0.0              self._use_nstep_rb, self._n_step)
    60                                           
    61         1         72.0     72.0      0.0          obs = self._env.reset()
    62                                           
    63         1       1049.0   1049.0      0.0          with tf.contrib.summary.record_summaries_every_n_global_steps(1000):
    64     10001     670392.0     67.0      4.0              while total_steps < self._max_steps:
    65     10000     629217.0     62.9      3.8                  if total_steps < self._policy.n_warmup:
    66       500       3332.0      6.7      0.0                      action = self._env.action_space.sample()
    67                                                           else:
    68      9500    2394183.0    252.0     14.4                      action = self._policy.get_action(obs)
    69                                           
    70     10000     285089.0     28.5      1.7                  next_obs, reward, done, _ = self._env.step(action)
    71     10000       9598.0      1.0      0.1                  if self._show_progress:
    72                                                               self._env.render()
    73     10000       7443.0      0.7      0.0                  episode_steps += 1
    74     10000       7514.0      0.8      0.0                  episode_return += reward
    75     10000     768571.0     76.9      4.6                  total_steps.assign_add(1)
    76                                           
    77     10000      10090.0      1.0      0.1                  done_flag = done
    78     10000      12436.0      1.2      0.1                  if hasattr(self._env, "_max_episode_steps") and \
    79     10000       8542.0      0.9      0.1                          episode_steps == self._env._max_episode_steps:
    80         6          5.0      0.8      0.0                      done_flag = False
    81     10000     310069.0     31.0      1.9                  replay_buffer.add(obs=obs, act=action, next_obs=next_obs, rew=reward, done=done_flag)
    82     10000       9355.0      0.9      0.1                  obs = next_obs
    83                                           
    84     10000       9105.0      0.9      0.1                  if done or episode_steps == self._episode_max_steps:
    85       179       2326.0     13.0      0.0                      obs = self._env.reset()
    86                                           
    87       179        199.0      1.1      0.0                      n_episode += 1
    88       179        268.0      1.5      0.0                      fps = episode_steps / (time.time() - episode_start_time)
    89       179        266.0      1.5      0.0                      self.logger.info("Total Epi: {0: 5} Steps: {1: 7} Episode Steps: {2: 5} Return: {3: 5.4f} FPS: {4:5.2f}".format(
    90       179      30602.0    171.0      0.2                          n_episode, int(total_steps), episode_steps, episode_return, fps))
    91                                           
    92       179        211.0      1.2      0.0                      episode_steps = 0
    93       179        122.0      0.7      0.0                      episode_return = 0
    94       179        173.0      1.0      0.0                      episode_start_time = time.time()
    95                                           
    96     10000     804974.0     80.5      4.8                  if total_steps >= self._policy.n_warmup:
    97      9501     371540.0     39.1      2.2                      samples = replay_buffer.sample(self._policy.batch_size)
    98      9501      11071.0      1.2      0.1                      td_error = self._policy.train(
    99      9501       8588.0      0.9      0.1                          samples["obs"], samples["act"], samples["next_obs"],
   100      9501      29581.0      3.1      0.2                          samples["rew"], np.array(samples["done"], dtype=np.float64),
   101      9501    8322550.0    876.0     49.9                          None if not self._use_prioritized_rb else samples["weights"])
   102      9501      14172.0      1.5      0.1                      if self._use_prioritized_rb:
   103                                                                   replay_buffer.update_priorities(samples["indexes"], np.abs(td_error) + 1e-6)
   104      9501     487314.0     51.3      2.9                      if int(total_steps) % self._test_interval == 0:
   105         5         86.0     17.2      0.0                          with tf.contrib.summary.always_record_summaries():
   106         5     989351.0 197870.2      5.9                              avg_test_return = self.evaluate_policy(int(total_steps))
   107         5         20.0      4.0      0.0                              self.logger.info("Evaluation Total Steps: {0: 7} Average Reward {1: 5.4f} over {2: 2} episodes".format(
   108         5       1098.0    219.6      0.0                                  int(total_steps), avg_test_return, self._test_episodes))
   109         5       1452.0    290.4      0.0                              tf.contrib.summary.scalar(name="AverageTestReturn", tensor=avg_test_return, family="loss")
   110         5       1288.0    257.6      0.0                              tf.contrib.summary.scalar(name="FPS", tensor=fps, family="loss")
   111                                           
   112         5        214.0     42.8      0.0                          self.writer.flush()
   113                                           
   114     10000     433524.0     43.4      2.6                  if int(total_steps) % self._model_save_interval == 0:
   115         1      16610.0  16610.0      0.1                      self.checkpoint_manager.save()
   116                                           
   117         1         44.0     44.0      0.0              tf.contrib.summary.flush()
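
A possible mitigation, sketched below, is to keep the step counter as a plain Python integer in the hot loop and only hand it to TensorFlow when writing summaries (illustrative values and a stub evaluation; not necessarily the fix that was applied):

import tensorflow as tf

max_steps, test_interval = 10000, 1000  # illustrative values
writer = tf.summary.create_file_writer("results")

def evaluate_policy():
    return 0.0  # stub standing in for the real evaluation

total_steps = 0
with writer.as_default():
    while total_steps < max_steps:
        # ... environment interaction and policy updates go here ...
        total_steps += 1  # plain int increment: no TF op in the hot loop
        if total_steps % test_interval == 0:
            tf.summary.scalar("AverageTestReturn", evaluate_policy(), step=total_steps)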
