junhyukoh / self-imitation-learning

ICML 2018 Self-Imitation Learning

License: MIT License

Python 99.91% Dockerfile 0.09%

Introduction

This repository is a TensorFlow implementation of Self-Imitation Learning (ICML 2018).

@inproceedings{Oh2018SIL,
  title={Self-Imitation Learning},
  author={Junhyuk Oh and Yijie Guo and Satinder Singh and Honglak Lee},
  booktitle={ICML},
  year={2018}
}

Our code is based on OpenAI Baselines.

Training

The following command runs A2C+SIL on Atari games:

python baselines/a2c/run_atari_sil.py --env FreewayNoFrameskip-v4

The following command runs PPO+SIL on MuJoCo tasks:

python baselines/ppo2/run_mujoco_sil.py --env Ant-v2 --num-timesteps 10000000 --lr 5e-05

self-imitation-learning's Issues

SIL Value update

In the paper, the SIL value loss is defined as 0.5 * max(0, (R-V))^2. However, in the code, the value loss is defined as follows:
self.vf_loss = tf.reduce_sum(self.W * v_estimate * tf.stop_gradient(delta)) / self.num_samples
which means that the value loss is 0.5 * V * clip((V-R), -5, 0).
What is the advantage of this implementation? Thanks.
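
One way to compare the two forms is to look at their gradients with respect to V. The sketch below is a minimal check in plain Python. It assumes delta = clip(V - R, -5, 0) is held constant (as tf.stop_gradient would do), and it ignores the weights W, constant factors, and the division by num_samples. The function names are hypothetical and do not come from the repository.

```python
def paper_loss(v, r):
    """Value loss as written in the paper: 0.5 * max(0, R - V)^2."""
    return 0.5 * max(0.0, r - v) ** 2


def paper_grad(v, r, eps=1e-6):
    """Central finite-difference estimate of d(paper_loss)/dV."""
    return (paper_loss(v + eps, r) - paper_loss(v - eps, r)) / (2 * eps)


def surrogate_grad(v, r, clip=5.0):
    """Gradient of V * stop_gradient(delta) with respect to V.

    Because delta is held constant, the gradient is delta itself,
    where delta = clip(V - R, -clip, 0).
    """
    return min(0.0, max(-clip, v - r))


if __name__ == "__main__":
    # (V, R) pairs: R above V, R below V, and R far above V (clip active).
    for v, r in [(1.0, 3.0), (3.0, 1.0), (0.0, 10.0)]:
        print(v, r, round(paper_grad(v, r), 4), surrogate_grad(v, r))
```

Under these assumptions the two gradients agree whenever R - V is at most 5, and they differ only when the clip is active, in which case the surrogate gradient is bounded.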

Policy 'lstm' doesn't work

Hello.

I first changed the default policy in run_atari_sil.py:

parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='lstm')

Then I ran A2C+SIL on an Atari game:

python baselines/a2c/run_atari_sil.py --env BreakoutNoFrameskip-v4

I got the following error:

Logging to /tmp/a2c
2018-12-25 14:46:34.107377: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
WARNING:tensorflow:From e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\common\distributions.py:148: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

Traceback (most recent call last):
  File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\ops.py", line 1628, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension size must be evenly divisible by 15 but is 8192 for 'model_2/Reshape_1' (op: 'Reshape') with input shapes: [16,512], [3] and with input tensors computed as partial shapes: input[1] = [3,5,?].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "baselines/a2c/run_atari_sil.py", line 38, in <module>
    main()
  File "baselines/a2c/run_atari_sil.py", line 35, in main
    num_env=16)
  File "baselines/a2c/run_atari_sil.py", line 20, in train
    sil_update=sil_update, sil_beta=sil_beta)
  File "e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\a2c\a2c_sil.py", line 161, in learn
    max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule, sil_update=sil_update, sil_beta=sil_beta)
  File "e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\a2c\a2c_sil.py", line 35, in __init__
    sil_model = policy(sess, ob_space, ac_space, nenvs, nsteps, reuse=True)
  File "e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\a2c\policies.py", line 66, in __init__
    xs = batch_to_seq(h, nenv, nsteps)
  File "e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\a2c\utils.py", line 74, in batch_to_seq
    h = tf.reshape(h, [nbatch, nsteps, -1])
  File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 7759, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
    op_def=op_def)
  File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\ops.py", line 1792, in __init__
    control_input_ops)
  File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\ops.py", line 1631, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimension size must be evenly divisible by 15 but is 8192 for 'model_2/Reshape_1' (op: 'Reshape') with input shapes: [16,512], [3] and with input tensors computed as partial shapes: input[1] = [3,5,?].

What can I do to fix this? Thank you very much!
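
The arithmetic behind the error message can be reproduced outside TensorFlow. The sketch below uses NumPy as a stand-in (an assumption for illustration; the failing call in the traceback is tf.reshape). A [16, 512] tensor holds 8192 elements, and a target shape of [3, 5, -1] requires the element count to be divisible by 3 * 5 = 15.

```python
import numpy as np

# Shape reported in the error message for the tensor being reshaped.
h = np.zeros((16, 512))
n_elements = h.size                 # 16 * 512
leading = 3 * 5                     # product of the known target dimensions

print(n_elements, n_elements % leading)

try:
    h.reshape(3, 5, -1)             # same target shape as in the error message
    reshape_ok = True
except ValueError as err:
    reshape_ok = False
    print("reshape failed:", err)
```

The remainder is nonzero, so no value of the last dimension can satisfy the reshape. This suggests that the batch dimension of the tensor passed to batch_to_seq is inconsistent with the leading dimensions used to build the target shape.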

np.sign(rewards)

Is there a reason that SIL uses np.sign(rewards) for all of the training, rather than the raw rewards themselves?
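
For reference, here is a small illustration of what np.sign does to a reward sequence. The reward values are made up for the example.

```python
import numpy as np

raw_rewards = np.array([0.0, 10.0, -3.0, 0.5, 200.0])  # made-up values
clipped = np.sign(raw_rewards)  # -1 for negative, 0 for zero, +1 for positive

print(clipped)
```

Every positive reward maps to 1 and every negative reward maps to -1, so the magnitude is discarded and only the sign is kept.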

How do the policy and the value function use the same parameters $\theta$?

Thanks for this paper.

In the third part (the last line on the right of the second page), you say that

$\pi_{\theta}, V_{\theta}(s)$ are the policy (i.e. actor) and the value function parameterized by $\theta$.

I want to know how the policy and the value function use the same parameters $\theta$.

Looking forward to your answers. Thanks in advance.
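
One common way for a policy and a value function to share parameters is a network with a shared trunk and two output heads. The sketch below illustrates that general pattern in NumPy. The layer sizes and names are hypothetical, and the code is not taken from the paper or the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden_dim, n_actions = 4, 8, 3   # hypothetical sizes

# All weights together play the role of theta.
theta = {
    "W_shared": rng.normal(size=(obs_dim, hidden_dim)) * 0.1,
    "W_pi": rng.normal(size=(hidden_dim, n_actions)) * 0.1,
    "W_v": rng.normal(size=(hidden_dim, 1)) * 0.1,
}


def forward(theta, obs):
    """Return (action probabilities, state value) for one observation."""
    h = np.tanh(obs @ theta["W_shared"])        # features used by both heads
    logits = h @ theta["W_pi"]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax policy head
    value = (h @ theta["W_v"]).item()           # scalar value head
    return probs, value


probs, value = forward(theta, np.ones(obs_dim))
print(probs, value)
```

In this arrangement a gradient step on either the policy loss or the value loss changes W_shared, which is the sense in which both outputs depend on one parameter set.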

Calculating returns with signed rewards

self-imitation-learning/baselines/common/self_imitation.py

Line 262 in 13eb8a7

returns = discount_with_dones(rewards, dones, self.gamma)
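
The referenced line computes returns from rewards and episode-termination flags. The following is a minimal sketch of a discounted-return computation that resets at episode boundaries, applied to rewards that have already been reduced to their sign. The function name and inputs are hypothetical. It illustrates the general technique and is not a copy of discount_with_dones.

```python
def discounted_returns(rewards, dones, gamma):
    """Discounted return at each step, resetting after a terminal step."""
    returns = []
    running = 0.0
    for reward, done in zip(reversed(rewards), reversed(dones)):
        # A terminal step must not accumulate reward from the next episode.
        running = reward + gamma * running * (1.0 - done)
        returns.append(running)
    return returns[::-1]


signed_rewards = [1.0, 0.0, -1.0, 1.0]   # made-up, already in {-1, 0, 1}
dones = [0, 0, 1, 0]                     # third step ends an episode
print(discounted_returns(signed_rewards, dones, gamma=0.5))
```

With signed rewards, the magnitude of each return is bounded by 1 / (1 - gamma) regardless of the scale of the raw game score.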

entropy in SIL policy loss

In the paper's equation, there is no entropy term in the SIL policy loss. Why does the code include one?

self.loss = self.pg_loss - entropy * self.w_entropy
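
For illustration, the sketch below evaluates a loss of the form pg_loss - entropy * w_entropy for a single categorical action distribution. The probabilities, the weight, and the entropy coefficient are made-up values. The policy-gradient term is a generic -log pi(a|s) * weight form assumed for the example, not code from the repository.

```python
import math

probs = [0.7, 0.2, 0.1]      # made-up action probabilities
action = 0                   # made-up action taken
weight = 2.0                 # made-up non-negative weight on the log-probability
w_entropy = 0.01             # made-up entropy coefficient

pg_loss = -math.log(probs[action]) * weight
entropy = -sum(p * math.log(p) for p in probs)
loss = pg_loss - entropy * w_entropy

print(pg_loss, entropy, loss)
```

Because the entropy is non-negative, subtracting it lowers the loss for more spread-out action distributions, which acts as a regularizer on the policy.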

Key-Door-Treasure

I do not see a way to replicate the grid world experiment from the paper using the code available in the repository. Is there a way to do so? If not, could you please publish the code?
