rl2booksolutions

Introduction

This repo contains notes for the book Reinforcement Learning: An Introduction (2nd Edition) by Sutton & Barto. It mainly serves as a guide for thinking in depth about the exercise problems posed in the book.

It aims to provide answers that are intuitively reasonable, experimentally verified and mathematically proved. Most of the time the first two are guaranteed. Hopefully the verbosity and elaboration will inspire you :D. If you ever get confused reading these notes, raise an issue, open a pull request or, if you prefer, feel free to contact me at [email protected].

These notes are being updated rapidly because I am still getting familiar with the RL research area. The code includes (or will include):

  • Experimentally verified solutions to exercise problems, with reasoned explanations
  • The code generating each figure in the book

Solutions

A PDF release will be available once I finish the notes. 🚧
Web-hosted docs are available, and I believe they serve the purpose well enough:

  • Chapter 1 exercise solutions ☑️
  • Chapter 2 exercise solutions ☑️
  • Chapter 3 exercise solutions ☑️
  • Chapter 4 exercise solutions 🏃 (in progress ...)
  • Chapter 5 exercise solutions 🏃 (in progress ...)
  • ...

Dependencies

  • I use Python 3.6 installed via an Anaconda environment on macOS 10.15.6. Other Python versions and platforms have not been tested yet, but should work in theory.
  • numpy == 1.19
  • matplotlib == 3.3.1
  • tqdm == 4.49

Specifications

Usually, random seed = 0 (as specified in the code). This lets everyone reproduce the results exactly as I obtained them in these notes.
I believe reproducibility matters a great deal when your own code behaves strangely and you are not sure whether it is a bug.
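
A minimal sketch of what fixing the seed buys, assuming the code draws its randomness from numpy's global generator (as the np.random.normal call in the examples suggests). The helper name sample_rewards is hypothetical and not part of this repo:

```python
import numpy as np

def sample_rewards(seed, n=5):
    # Hypothetical helper: reseed numpy's global generator, then draw n samples.
    np.random.seed(seed)
    return np.random.normal(loc=0, scale=1, size=(n,))

first = sample_rewards(seed=0)
second = sample_rewards(seed=0)
other = sample_rewards(seed=1)

# Same seed, same draws; a different seed gives a different sequence.
print(np.array_equal(first, second))
print(np.array_equal(first, other))
```

If your numbers differ from the ones in these notes under the same seed and library versions, the difference is more likely a bug than noise.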

Examples

  1. Exercise 2.3 In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.
    (figure: fig 2.2)

    Ans:
    The experiment was run for 10,000 steps and averaged over 2,000 runs; the epsilon=0.01 method performed best (see the code for fig 2.2).
    Reward performance: ep=0.01 > ep=0.1 > ep=0
    Best-action selection performance: ep=0.01 > ep=0.1 > ep=0
    (figure: exercise 2.2)
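
A back-of-the-envelope sketch for the quantitative part, under the assumption that the value estimates have converged so that the greedy action is the truly best one (reasonable for any epsilon > 0 on a stationary testbed, not for epsilon = 0). An epsilon-greedy method then picks the best of k arms with probability (1 - epsilon) + epsilon / k. The function name is hypothetical:

```python
def asymptotic_best_action_prob(epsilon, k=10):
    # Greedy pick with probability 1 - epsilon, plus the chance that a
    # uniformly random exploratory pick lands on the best arm anyway.
    # Assumes the greedy arm is the truly best arm (estimates converged).
    return (1 - epsilon) + epsilon / k

p_small = asymptotic_best_action_prob(0.01)
p_large = asymptotic_best_action_prob(0.1)

print("epsilon=0.01:", round(p_small, 3))
print("epsilon=0.1: ", round(p_large, 3))
```

With k = 10 this gives 0.991 for epsilon = 0.01 against 0.91 for epsilon = 0.1, consistent with the ordering above.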

  2. Exercise 2.5 (programming) Design and conduct an experiment to demonstrate the difficulties that sample-average methods have for non-stationary problems. Use a modified version of the 10-armed testbed in which all the q*(a) start out equal and then take independent random walks (say by adding a normally distributed increment with mean zero and standard deviation 0.01 to all the q*(a) on each step). Prepare plots like Figure 2.2 for an action-value method using sample averages, incrementally computed, and another action-value method using a constant step-size parameter, alpha = 0.1. Use epsilon = 0.1 and longer runs, say of 10,000 steps.

    Ans:
    The experiments are in exercise_2_5.py.
    The lines inserted into Bandit.step to make the bandit non-stationary:

    # Non-stationary bandit: each q*(a) takes an independent random-walk step
    self.q_true += np.random.normal(loc=0, scale=0.01, size=(self.k,))
    # The best action may change as the true values drift, so recompute it
    self.best_action = np.argmax(self.q_true)

    and in Bandit.reset:

    # As stated in the problem, all q*(a) start out equal (at self.true_reward).
    self.q_true = np.zeros(shape=(self.k,)) + self.true_reward

    The constant step-size method outperformed the sample average method in terms of both average reward and best action hit rate.
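
One way to see why, as a sketch: unrolling the update Q <- Q + step * (R - Q) shows how much weight each past reward carries in the current estimate. With step = 1/n every reward counts equally, so rewards from before the true values drifted never fade. With a constant step size the weight on a reward decays geometrically with its age. The helper name reward_weights is hypothetical and not part of exercise_2_5.py:

```python
import numpy as np

def reward_weights(step_sizes):
    """Weight of each reward R_1..R_n in the estimate after n updates
    Q <- Q + step * (R - Q), given the step size used at each update."""
    step_sizes = np.asarray(step_sizes, dtype=float)
    n = len(step_sizes)
    weights = np.empty(n)
    decay = 1.0  # product of (1 - step) over all later updates
    for i in range(n - 1, -1, -1):
        weights[i] = step_sizes[i] * decay
        decay *= 1.0 - step_sizes[i]
    # After the loop, decay is the weight left on the initial estimate.
    return weights, decay

n = 100
sample_avg_w, sample_avg_q1 = reward_weights(1.0 / np.arange(1, n + 1))
const_w, const_q1 = reward_weights(np.full(n, 0.1))

print("sample average, oldest vs newest:", sample_avg_w[0], sample_avg_w[-1])
print("constant 0.1,   oldest vs newest:", const_w[0], const_w[-1])
```

The sample average gives every one of the 100 rewards the same weight 1/100, while the constant step size gives the newest reward weight 0.1 and the oldest almost none, which is what lets it track a moving target.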

    (figure: exercise 2.5)

    Unsatisfied with the simulation speed, I wrote a new version, exercise_2_5_SIMR.py, for this exercise. SIMR stands for Single Iteration Multi Runs (the name is modeled on SIMD in processors). Instead of completing one run after another, this version advances all runs together at each iteration, as if they were running in parallel.
    This lets us use numpy's optimized vectorized operations, and it is around 8x faster than the first implementation.
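
The idea can be sketched as follows. This is an illustration written for these notes, not the contents of exercise_2_5_SIMR.py: the function name run_simr, its parameters, the unit-variance reward noise and the tie-breaking toward the lowest index are all assumptions.

```python
import numpy as np

def run_simr(runs=200, steps=1000, k=10, epsilon=0.1, alpha=None, seed=0):
    """Advance all runs together, one time step per loop iteration.
    alpha=None uses sample averages; otherwise a constant step size."""
    rng = np.random.default_rng(seed)
    q_true = np.zeros((runs, k))   # all q*(a) start out equal
    q_est = np.zeros((runs, k))    # one row of estimates per run
    counts = np.zeros((runs, k))
    rows = np.arange(runs)
    avg_reward = np.zeros(steps)
    best_rate = np.zeros(steps)
    for t in range(steps):
        # Epsilon-greedy choice for every run at once (ties go to the lowest index).
        greedy = np.argmax(q_est, axis=1)
        random_pick = rng.integers(k, size=runs)
        explore = rng.random(runs) < epsilon
        actions = np.where(explore, random_pick, greedy)
        # Reward = true value plus unit-variance noise (assumed, as in the testbed).
        rewards = q_true[rows, actions] + rng.normal(size=runs)
        best_rate[t] = np.mean(actions == np.argmax(q_true, axis=1))
        avg_reward[t] = rewards.mean()
        # Incremental update of the chosen arm in every run.
        counts[rows, actions] += 1
        step = 1.0 / counts[rows, actions] if alpha is None else alpha
        q_est[rows, actions] += step * (rewards - q_est[rows, actions])
        # Random walk of all true values, standard deviation 0.01.
        q_true += rng.normal(scale=0.01, size=(runs, k))
    return avg_reward, best_rate

sa_reward, sa_best = run_simr(alpha=None)
cs_reward, cs_best = run_simr(alpha=0.1)
print("sample average, final best-action rate:", sa_best[-1])
print("constant step,  final best-action rate:", cs_best[-1])
```

The only Python-level loop left is the one over time steps; everything per run is a numpy array operation, which is where the speedup comes from.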

References

The code implementation references are:

For figures, usage notes and examples are available in the Matplotlib Gallery.
