openai-gym-taxi-v2's Introduction

OpenAI-Gym-Taxi-v2

This small repo presents a reinforcement learning solution to the Taxi problem in OpenAI Gym: https://github.com/openai/gym/wiki/Leaderboard#taxi-v2

Steps to Run

Clone the repo: git clone https://github.com/mostafaelhoushi/OpenAI-Gym-Taxi-v2
cd to the workspace directory: cd OpenAI-Gym-Taxi-v2/workspace
Run the main script: python main.py (you may append one of the following arguments to the command to select the update method: SARSA, SARSA_MAX, EXPECTED_SARSA)
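The three update methods differ only in how they estimate the value of the next state. The following is a minimal sketch of that difference, not the code from agent.py; the helper name td_target and its parameters are hypothetical and chosen for illustration.

```python
import numpy as np

def td_target(method, reward, q_next, next_action=None, epsilon=0.1, gamma=1.0):
    """Return the TD target for one transition (hypothetical helper).

    q_next: array of action values Q[next_state]
    next_action: action actually chosen in next_state (used by SARSA only)
    """
    n_actions = len(q_next)
    if method == "SARSA":
        # Sarsa: bootstrap from the action actually taken next
        next_value = q_next[next_action]
    elif method == "SARSA_MAX":
        # Sarsa Max (Q-Learning): bootstrap from the greedy action
        next_value = np.max(q_next)
    elif method == "EXPECTED_SARSA":
        # Expected Sarsa: average over the epsilon-greedy policy in next_state
        probs = np.full(n_actions, epsilon / n_actions)
        probs[np.argmax(q_next)] += 1.0 - epsilon
        next_value = np.dot(probs, q_next)
    else:
        raise ValueError(f"unknown update method: {method}")
    return reward + gamma * next_value

# Example action values for some next state (made-up numbers)
q_next = np.array([1.0, 3.0, 2.0])
for method in ("SARSA", "SARSA_MAX", "EXPECTED_SARSA"):
    print(method, td_target(method, reward=-1, q_next=q_next, next_action=2))
```

In each case the Q-table update is then Q[state][action] += alpha * (target - Q[state][action]).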

Source Code:

The repo contains three files in its workspace folder:

agent.py: The reinforcement learning agent I developed is implemented here. This is the only file that I have modified.
monitor.py: The interact function tests how well the agent learns from interaction with the environment. This file has been provided by the creators of the Udacity Reinforcement Learning Nanodegree.
main.py: The main file to run in the terminal to check the performance of the agent. This file has been provided by the creators of the Udacity Reinforcement Learning Nanodegree.
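To illustrate how these files fit together, here is a rough sketch of an interaction loop in the spirit of the interact function. It is not the Udacity code: the signature, the ToyEnv environment, and the RandomAgent class are all hypothetical stand-ins, and it assumes the classic Gym API in which step returns four values.

```python
import random
from collections import deque

class ToyEnv:
    """Hypothetical stand-in for a Gym environment: a 5-cell corridor."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right; any other action stays put
        if action == 1:
            self.pos += 1
        done = self.pos == 4
        reward = 10 if done else -1
        return self.pos, reward, done, {}

class RandomAgent:
    """Hypothetical agent exposing the same two methods as agent.py."""
    def select_action(self, state):
        return random.choice([0, 1])

    def step(self, state, action, reward, next_state, done):
        pass  # a learning agent would update its Q-table here

def interact(env, agent, num_episodes=200, window=100):
    """Run episodes and return the best average reward over `window` episodes."""
    recent = deque(maxlen=window)
    best_avg = float("-inf")
    for _ in range(num_episodes):
        state = env.reset()
        total = 0
        while True:
            action = agent.select_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            total += reward
            state = next_state
            if done:
                break
        recent.append(total)
        if len(recent) == window:
            best_avg = max(best_avg, sum(recent) / window)
    return best_avg

random.seed(0)
print(interact(ToyEnv(), RandomAgent()))
```

The agent only needs select_action and step; everything else (episode bookkeeping, averaging) lives in the loop.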

Results:

The average reward over 100 episodes is 9.2926 for Sarsa Max (a.k.a. Q-Learning) and 9.2754 for Expected Sarsa.

openai-gym-taxi-v2's Issues

Agent.py cliffwalker

In your agent, the implementation of step doesn't really use epsilon, because whenever done == False it overwrites it with epsilon = 1.0 / (1 + epsilon). The following implementation does use epsilon, and it starts with self.i_episode = 1 instead of 0. You can check it with the prints.
I also hardcoded the values of epsilon and alpha in the __init__ but I like to live dangerously.

I took the idea of rewriting the code from here.

import numpy as np
from collections import defaultdict

class Agent:

    def __init__(self, nA=6, epsilon=0.05, alpha=0.1, gamma=1):
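        """ Initialize agent.

        Params
        ======
        - nA: number of actions available to the agent
        - epsilon: exploration rate (currently ignored; hardcoded below)
        - alpha: learning rate (currently ignored; hardcoded below)
        - gamma: discount factor
        """
        self.nA = nA
        self.Q = defaultdict(lambda: np.zeros(self.nA))
        self.epsilon = 1.0  # hardcoded; the epsilon argument is ignored
        self.alpha = 0.2  # hardcoded; the alpha argument is ignored
        self.gamma = gamma
        #self.update_method = update_method
        self.i_episode = 1  # was 0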

    def get_policy_probs(self, state):
        """ Given the state, return the probability of each action.
        Params
        ======
        - state: the current state of the environment
        Returns
        =======
        - probs: an array, each element corresponds to probability of corresponding action selected
        """
        probs = np.ones(self.nA) * self.epsilon /self.nA
        probs[np.argmax(self.Q[state])] += 1 - self.epsilon
        return probs        
        
    def select_action(self, state):
        """ Given the state, select an action.

        Params
        ======
        - state: the current state of the environment

        Returns
        =======
        - action: an integer, compatible with the task's action space
        """
        #return np.random.choice(self.nA)
    
        probs = self.get_policy_probs(state)
        return np.random.choice(np.arange(self.nA), p=probs)

    def step(self, state, action, reward, next_state, done):
        """ Update the agent's knowledge, using the most recently sampled tuple.

        Params
        ======
        - state: the previous state of the environment
        - action: the agent's previous choice of action
        - reward: last reward received
        - next_state: the current state of the environment
        - done: whether the episode is complete (True or False)
        """
        #self.Q[state][action] += 1
        """
        if (done == False):
            print("\n",self.i_episode, done, self.epsilon)
            self.epsilon = 1.0 / (1.0 + self.i_episode)
            probs = self.get_policy_probs(state)
            next_action = np.random.choice(np.arange(self.nA), p=probs)
            self.Q[state][action] += self.alpha * (reward + self.gamma * np.sum(self.Q[next_state] * probs)  - self.Q[state][action])        
        else: # done == True
            print("\n",self.i_episode, done, self.epsilon)
            self.Q[state][action] += self.alpha * (reward - self.Q[state][action])
            self.i_episode +=  1             
        
        """
        if done:
            #print("\n",self.i_episode, done, self.epsilon)
            self.Q[state][action] += self.alpha * (reward - self.Q[state][action])
            self.i_episode +=  1
            self.epsilon = self.epsilon / (self.i_episode)  # old 1.0 / (1.0 + self.i_episode) with self.i_episode = 0
        else:
            #print("\n",self.i_episode, done, self.epsilon)
            # Expected Sarsa: weight Q[next_state] by the policy's action probabilities in next_state
            probs = self.get_policy_probs(next_state)
            expected_q = np.sum(self.Q[next_state] * probs)
            self.Q[state][action] += self.alpha * (reward + self.gamma * expected_q - self.Q[state][action])
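
One thing worth checking in the code above: the done branch divides epsilon by the episode counter after every episode, so epsilon shrinks factorially rather than as 1 / (1 + i_episode). The sketch below compares the two rules over the first few episodes; the function names are hypothetical and only mirror the arithmetic in step.

```python
def factorial_schedule(num_episodes, epsilon=1.0, i_episode=1):
    """Epsilon after each finished episode, mirroring the `if done:` branch."""
    values = []
    for _ in range(num_episodes):
        i_episode += 1
        epsilon = epsilon / i_episode
        values.append(epsilon)
    return values

def harmonic_schedule(num_episodes, i_episode=0):
    """Epsilon under the older rule 1.0 / (1.0 + i_episode)."""
    values = []
    for _ in range(num_episodes):
        i_episode += 1
        values.append(1.0 / (1.0 + i_episode))
    return values

for new, old in zip(factorial_schedule(5), harmonic_schedule(5)):
    print(f"new: {new:.5f}   old: {old:.5f}")
```

With the factorial rule the agent is almost fully greedy after a handful of episodes, which may or may not be what is intended for this task.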