
neural_exploration

Contextual bandits are single-state decision processes with the assumption that the rewards for each arm at each step are generated from a (possibly noisy) function of observable features. Similarly, contextual MDPs offer a setting for reinforcement learning where rewards and transition probabilities can be inferred from vector features.
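As a concrete illustration, a synthetic contextual bandit of this kind can be sketched in a few lines (hypothetical names, not this repo's API): each round draws one feature vector per arm and scores it with a hidden reward function `h` plus Gaussian noise.

```python
import numpy as np

class ContextualBandit:
    """Minimal synthetic contextual bandit: at each round, every arm
    presents a feature vector, and its reward is a fixed function h of
    that vector plus Gaussian noise. Illustrative sketch only."""

    def __init__(self, n_arms, n_features, h, noise_std=0.1, seed=0):
        self.n_arms = n_arms
        self.n_features = n_features
        self.h = h                      # hidden reward function of the features
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def step(self):
        # One context vector per arm, normalized to the unit sphere.
        X = self.rng.normal(size=(self.n_arms, self.n_features))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        rewards = np.array([self.h(x) for x in X])
        noisy = rewards + self.noise_std * self.rng.normal(size=self.n_arms)
        return X, noisy, int(rewards.argmax())

bandit = ContextualBandit(n_arms=4, n_features=8, h=lambda x: x.sum())
X, noisy_rewards, best_arm = bandit.step()
```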

The literature has focused on optimistic exploration bounds under the assumption of linear dependency on the features, resulting in celebrated algorithms such as LinUCB (bandits) and LinUCB-VI (fixed-horizon RL with value iteration).
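LinUCB's optimism can be sketched as a ridge-regression estimate of the reward plus an exploration bonus that shrinks as a direction of feature space is sampled more often. A minimal illustration (the repo's implementation may differ in details such as per-arm vs. shared parameters):

```python
import numpy as np

class LinUCB:
    """Sketch of LinUCB with a shared ridge estimator: theta = A^{-1} b,
    and an exploration bonus alpha * sqrt(x^T A^{-1} x) per arm."""

    def __init__(self, n_features, alpha=1.0, reg=1.0):
        self.alpha = alpha
        self.A = reg * np.eye(n_features)   # regularized Gram matrix
        self.b = np.zeros(n_features)

    def select(self, X):
        """Pick the arm maximizing the upper confidence bound; X has one
        feature vector per row."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        bonus = np.sqrt(np.einsum('ij,jk,ik->i', X, A_inv, X))
        return int((X @ theta + self.alpha * bonus).argmax())

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```

After repeatedly observing a high reward along one feature direction, the estimate pulls the choice toward it while the bonus along that direction shrinks.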

Recently, https://arxiv.org/pdf/1911.04462.pdf introduced NeuralUCB, an optimistic exploration algorithm that leverages the power of deep neural networks as universal function approximators to relax the linearity constraint.
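The core idea of NeuralUCB is to replace the feature vector in the LinUCB bonus with the gradient of the network output with respect to its parameters. Below is a dependency-free sketch of that acquisition rule, assuming a tiny one-hidden-layer ReLU network with manual backprop (names are illustrative, not the repo's):

```python
import numpy as np

def network_and_grad(params, x):
    """Tiny one-hidden-layer ReLU net f(x) = w2 . relu(W1 x); returns the
    scalar output and the gradient w.r.t. all parameters, flattened."""
    W1, w2 = params
    z = W1 @ x
    a = np.maximum(z, 0.0)
    f = w2 @ a
    g_W1 = np.outer(w2 * (z > 0), x)   # d f / d W1
    g_w2 = a                           # d f / d w2
    return f, np.concatenate([g_W1.ravel(), g_w2])

def neural_ucb_scores(params, X, Z_inv, gamma, m):
    """NeuralUCB acquisition as in arXiv:1911.04462 (sketch): predicted
    reward plus a gradient-based bonus,
    UCB(x) = f(x) + gamma * sqrt(g(x)^T Z^{-1} g(x) / m),
    where m is the network width and Z the gradient Gram matrix."""
    scores = []
    for x in X:
        f, g = network_and_grad(params, x)
        scores.append(f + gamma * np.sqrt(g @ Z_inv @ g / m))
    return np.array(scores)
```

After playing the selected arm, Z is updated with the outer product of the chosen gradient (scaled by 1/m) and the network is retrained on the observed rewards.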

The goal of this repo is to implement these methods, design synthetic bandits and MDPs to test the various algorithms, and introduce NeuralUCB-VI, a value-iteration algorithm based on neural approximators of the rewards and transition kernel for efficient exploration of fixed-horizon MDPs.

Experiments

All methods are tested on 3 types of contextual rewards: linear, quadratic, and highly nonlinear (cosine).
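For instance, the three reward families could look like this (illustrative coefficients; the notebooks' exact definitions may differ):

```python
import numpy as np

# Three synthetic reward families mapping a feature vector x and a hidden
# parameter theta to a scalar reward. Coefficients here are illustrative.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
theta /= np.linalg.norm(theta)

h_linear = lambda x: theta @ x
h_quadratic = lambda x: (theta @ x) ** 2
h_cosine = lambda x: np.cos(2 * np.pi * theta @ x)   # highly nonlinear
```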

For episodic MDPs, the transition matrices are assumed to be linear in the features in all cases. While LinUCB and LinUCB-VI perform well in the linear case (sublinear or even no regret growth), they are slightly suboptimal in the quadratic case and fail completely in the presence of stronger nonlinearity. This is consistent with results from https://arxiv.org/abs/1907.05388 on approximately linear MDPs and https://arxiv.org/pdf/1911.00567.pdf on low-rank MDPs, which bound the performance of linear exploration as a function of the magnitude of the nonlinearity.

Neural exploration, on the other hand, relies on more sophisticated approximators, which are expressive enough to predict rewards or Q-functions generated by more complicated functions of the features (given a wide or deep enough architecture, neural networks are universal approximators). NeuralUCB and NeuralUCB-VI explore efficiently and quickly reach optimality (no or very slow regret growth).

neural_exploration's People

Contributors

sauxpa

neural_exploration's Issues

There is a mistake in your NeuralUCB.ipynb.

You declare the cosine function as reward_func but use h, the quadratic function declared earlier, in the Bandit initialization. In my experiment, NeuralUCB does not work on the cosine function.
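If the Bandit is indeed constructed with the quadratic h, a minimal sketch of the fix (hypothetical names and constructor, since the notebook's exact code is not shown here) is to pass the cosine function explicitly:

```python
import numpy as np

# Hypothetical sketch of the fix: define the cosine reward and make sure
# it, not the earlier quadratic h, is what the bandit receives.
h = lambda x: x.sum() ** 2                           # quadratic, declared earlier
reward_func = lambda x: np.cos(2 * np.pi * x.sum())  # intended cosine reward

def make_bandit(reward):
    """Stand-in for the notebook's Bandit initialization."""
    return {"reward": reward}

bandit = make_bandit(reward_func)   # pass reward_func, not h
```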

In LinUCBVI.ipynb, running it on Colab or locally gives extremely different results

I have tried running your exact LinUCBVI.ipynb locally and also on Colab, but I can't reproduce the same results.

To be concrete, with a linear transition probability and a linear reward function (case 1), the graph of number of episodes vs. cumulative regret comes out linear, the policy differs drastically from the optimal policy, and the regret explodes.

I have tried running it many times; the results differ from yours every time.

You can view the results at the Colab link.

NeuralUCB Confidence

First of all, thank you for your contributions. They have been very valuable in improving my understanding of the original paper.

I have a fundamental question regarding the implementation of the NeuralUCB confidence multiplier.

How exactly is it derived from the original paper?
