
neural_exploration

Contextual bandits are single-state decision processes with the assumption that the rewards for each arm at each step are generated from a (possibly noisy) function of observable features. Similarly, contextual MDPs offer a setting for reinforcement learning where rewards and transition probabilities can be inferred from vector features.
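As a concrete illustration, a synthetic contextual bandit of this kind can be sketched in a few lines (hypothetical names, not this repo's API): each round draws one feature vector per arm and scores it with a hidden reward function `h` plus Gaussian noise.

```python
import numpy as np

class ContextualBandit:
    """Minimal synthetic contextual bandit: at each round, every arm
    presents a feature vector, and its reward is a fixed function h of
    that vector plus Gaussian noise. Illustrative sketch only."""

    def __init__(self, n_arms, n_features, h, noise_std=0.1, seed=0):
        self.n_arms = n_arms
        self.n_features = n_features
        self.h = h                      # hidden reward function of the features
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def step(self):
        # One context vector per arm, normalized to the unit sphere.
        X = self.rng.normal(size=(self.n_arms, self.n_features))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        rewards = np.array([self.h(x) for x in X])
        noisy = rewards + self.noise_std * self.rng.normal(size=self.n_arms)
        return X, noisy, int(rewards.argmax())

bandit = ContextualBandit(n_arms=4, n_features=8, h=lambda x: x.sum())
X, noisy_rewards, best_arm = bandit.step()
```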

The literature has focused on optimistic exploration bounds under the assumption of linear dependency on the features, resulting in celebrated algorithms such as LinUCB (bandits) and LinUCB-VI (fixed-horizon RL with value iteration).
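LinUCB's optimism can be sketched as a ridge-regression estimate of the reward plus an exploration bonus that shrinks as a direction of feature space is sampled more often. A minimal illustration (the repo's implementation may differ in details such as per-arm vs. shared parameters):

```python
import numpy as np

class LinUCB:
    """Sketch of LinUCB with a shared ridge estimator: theta = A^{-1} b,
    and an exploration bonus alpha * sqrt(x^T A^{-1} x) per arm."""

    def __init__(self, n_features, alpha=1.0, reg=1.0):
        self.alpha = alpha
        self.A = reg * np.eye(n_features)   # regularized Gram matrix
        self.b = np.zeros(n_features)

    def select(self, X):
        """Pick the arm maximizing the upper confidence bound; X has one
        feature vector per row."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', X, A_inv, X))
        return int((X @ theta + self.alpha * bonus).argmax())

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

After repeatedly observing a high reward along one feature direction, the estimate pulls the choice toward it while the bonus along that direction shrinks.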

Recently, https://arxiv.org/pdf/1911.04462.pdf introduced NeuralUCB, an optimistic exploration algorithm that leverages the power of deep neural networks as universal function approximators to relax the linearity constraint.
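The core idea of NeuralUCB is to replace the feature vector in the LinUCB bonus with the gradient of the network output with respect to its parameters. Below is a dependency-free sketch of that acquisition rule, assuming a tiny one-hidden-layer ReLU network with manual backprop (names are illustrative, not the repo's):

```python
import numpy as np

def network_and_grad(params, x):
    """Tiny one-hidden-layer ReLU net f(x) = w2 . relu(W1 x); returns the
    scalar output and the gradient w.r.t. all parameters, flattened."""
    W1, w2 = params
    z = W1 @ x
    a = np.maximum(z, 0.0)
    f = w2 @ a
    g_W1 = np.outer(w2 * (z > 0), x)   # d f / d W1
    g_w2 = a                           # d f / d w2
    return f, np.concatenate([g_W1.ravel(), g_w2])

def neural_ucb_scores(params, X, Z_inv, gamma, m):
    """NeuralUCB acquisition as in arXiv:1911.04462 (sketch): predicted
    reward plus a gradient-based bonus,
    UCB(x) = f(x) + gamma * sqrt(g(x)^T Z^{-1} g(x) / m),
    where m is the network width and Z the gradient Gram matrix."""
    scores = []
    for x in X:
        f, g = network_and_grad(params, x)
        scores.append(f + gamma * np.sqrt(g @ Z_inv @ g / m))
    return np.array(scores)
```

After playing the selected arm, Z is updated with the outer product of the chosen gradient (scaled by 1/m) and the network is retrained on the observed rewards.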

The goal of this repo is to implement these methods, design synthetic bandits and MDPs to test the various algorithms, and introduce NeuralUCB-VI, a value-iteration algorithm based on neural approximators of the rewards and transition kernel for efficient exploration of fixed-horizon MDPs.

Experiments

All methods are tested on 3 types of contextual rewards: linear, quadratic, and highly nonlinear (cosine).
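For instance, the three reward families could look like this (illustrative coefficients; the notebooks' exact definitions may differ):

```python
import numpy as np

# Three synthetic reward families mapping a feature vector x and a hidden
# parameter theta to a scalar reward. Coefficients here are illustrative.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
theta /= np.linalg.norm(theta)

h_linear = lambda x: theta @ x
h_quadratic = lambda x: (theta @ x) ** 2
h_cosine = lambda x: np.cos(2 * np.pi * theta @ x)   # highly nonlinear
```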

For episodic MDPs, the transition matrices are assumed to be linear in the features in all cases. While LinUCB and LinUCB-VI perform well in the linear case (sublinear or even no regret growth), they are slightly suboptimal in the quadratic case and fail completely in the presence of stronger nonlinearity. This is consistent with results from https://arxiv.org/abs/1907.05388 on approximately linear MDPs and https://arxiv.org/pdf/1911.00567.pdf on low-rank MDPs, which bound the performance of linear exploration as a function of the magnitude of the nonlinearity.

Neural exploration, on the other hand, relies on more sophisticated approximators, which are expressive enough to predict rewards or Q-functions generated by more complicated functions of the features (given a wide or deep enough architecture, neural networks are universal approximators). NeuralUCB and NeuralUCB-VI explore efficiently and quickly reach optimality (no or very slow regret growth).

neural_exploration's People

Contributors

sauxpa

neural_exploration's Issues

There is a mistake in your NeuralUCB.ipynb.

You declare the cosine function as reward_func but use h, the quadratic function declared earlier, in the Bandit initialization. In my experiment, NeuralUCB does not work on the cosine function.
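If the Bandit is indeed constructed with the quadratic h, a minimal sketch of the fix (hypothetical names and constructor, since the notebook's exact code is not shown here) is to pass the cosine function explicitly:

```python
import numpy as np

# Hypothetical sketch of the fix: define the cosine reward and make sure
# it, not the earlier quadratic h, is what the bandit receives.
h = lambda x: x.sum() ** 2                           # quadratic, declared earlier
reward_func = lambda x: np.cos(2 * np.pi * x.sum())  # intended cosine reward

def make_bandit(reward):
    """Stand-in for the notebook's Bandit initialization."""
    return {"reward": reward}

bandit = make_bandit(reward_func)   # pass reward_func, not h
```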

In LinUCBVI.ipynb, running it on Colab or locally gives extremely different results

I have tried running your exact LinUCBVI.ipynb locally and also on Colab, but I can't reproduce the same results.

To be concrete, with a linear transition probability and a linear reward function (case 1), the graph of number of episodes vs. cumulative regret comes out linear, the policy differs drastically from the optimal policy, and the regret explodes.

I have tried running it many times; the results differ from yours every time.

You can view the results at the Colab link.

NeuralUCB Confidence

First of all, thank you for your contributions. They have been very valuable in improving my understanding of the original paper.

I have a fundamental question regarding the implementation of the NeuralUCB confidence multiplier.

How exactly is it derived from the original paper?
