Difference Critic

Motivation

The goal is to train an RL system that learns a difference of value functions in order to perform effectively under simulation and approximation errors; in other words, when there is a mismatch between the simulated and target domains. This addresses the OpenAI Request for Research problem "Difference of Value Functions".

What's New

The main idea comes from a 1997 paper, Differential Training of Rollout Policies by Bertsekas. The paper introduces a technique called differential training and argues that, under simulation and approximation error, learning a difference of value functions can do better than learning vanilla value functions.

Instead of learning a difference of value functions as suggested by Bertsekas, in this work I introduce a variant of DDPG (Deep Deterministic Policy Gradients) which, instead of learning a Q(state, action) function, learns a difference-of-Q function Q(state1, action1, state2, action2) that approximates the difference of expected Q-values between two (state, action) pairs under the current policy. We use the gradient from this function to train the policy network in DDPG.
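As a minimal sketch of the idea (not the repository's actual code — the linear critic, dimensions, and helper names below are illustrative assumptions), the difference critic can be regressed toward a TD-style target built from the *difference* of rewards along two trajectories:

```python
import numpy as np

STATE_DIM, ACTION_DIM = 4, 2
GAMMA = 0.99
rng = np.random.default_rng(0)

# Hypothetical linear difference critic: Qdiff(s1, a1, s2, a2) ≈ w · [s1, a1, s2, a2]
w = np.zeros(2 * (STATE_DIM + ACTION_DIM))

def qdiff(s1, a1, s2, a2):
    """Approximate Q(s1, a1) - Q(s2, a2) under the current policy."""
    return w @ np.concatenate([s1, a1, s2, a2])

def td_target(r1, r2, s1_next, s2_next, policy):
    # Differential Bellman target: the reward difference plus the
    # discounted difference critic evaluated at the next state pair.
    return (r1 - r2) + GAMMA * qdiff(s1_next, policy(s1_next),
                                     s2_next, policy(s2_next))

def critic_update(batch, policy, lr=1e-2):
    """One squared-error gradient step toward the TD target for each
    pair of transitions (s, a, r, s_next) sampled from replay."""
    global w
    for (s1, a1, r1, s1n), (s2, a2, r2, s2n) in batch:
        x = np.concatenate([s1, a1, s2, a2])
        error = td_target(r1, r2, s1n, s2n, policy) - qdiff(s1, a1, s2, a2)
        w += lr * error * x
```

In the DDPG variant described above, `qdiff` would be a neural network and the actor would be updated along the gradient of `qdiff` with respect to its first action argument.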

Implementation Details

The mismatch between simulated and target domains is modeled using Mujoco agents with varying torso masses, similar to EPOpt. As in EPOpt, we train on an ensemble of robot models.
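The ensemble can be sketched as sampling a torso-mass multiplier per model, EPOpt-style. The nominal mass, multiplier range, and function names here are illustrative assumptions, not values from the paper or this repository:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_torso_mass(nominal_mass=6.25, low=0.5, high=1.5):
    """Perturb the nominal torso mass by a uniform multiplier."""
    return nominal_mass * rng.uniform(low, high)

def make_model_ensemble(n_models, nominal_mass=6.25):
    """One perturbed torso mass per model; at training time each
    episode would run in a simulator configured with one of these."""
    return [sample_torso_mass(nominal_mass) for _ in range(n_models)]
```

Training episodes are then drawn across the ensemble, so the policy never overfits to a single body configuration.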

We use the Mujoco physics simulator for training on the HalfCheetah-v1 environment.

We use a Tensorflow Eager adaptation of OpenAI Baselines for Deep Deterministic Policy Gradients (DDPG) as the baseline.

This model has been ported to Tensorflow Eager, which gives us a more Pythonic expression of the model (define-by-run as opposed to define-and-run) and makes it easier to debug in many cases.

Installation instructions

  1. Install OpenAI Gym and Mujoco (needs a software license).
  2. Install Tensorflow from the nightly build (TF Eager requires a nightly build unless you have Tensorflow >= 1.5)
  3. Install pybullet
  4. Install numpy
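Assuming a standard pip environment, the steps above correspond roughly to the following (Mujoco itself must be installed separately with its license key):

```shell
# 1. Gym (Mujoco support additionally needs mujoco-py and a license)
pip install gym
# 2. TensorFlow nightly (or any release >= 1.5 for Eager support)
pip install tf-nightly
# 3-4. Remaining dependencies
pip install pybullet numpy
```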

Future work

Apply the concept of differential training to other Deep RL methods and see if this gives us benefits in the presence of simulation error.
