
reinforcement-learning-snake-game's People

Contributors

dependabot[bot], mcfearless999, mkduer


reinforcement-learning-snake-game's Issues

Bug: human controls

low priority
Up and down controls for the human player are reversed. The controls are also sluggish.

Extra Time aka Future Work

These could be added to the paper given more work with extra time:

  • Escape key for AI play
  • Interesting Graph: Collisions (use a ratio over a percentage of episodes; show the percent of wall collisions vs. body collisions -- stacked barplot)
  • Average number of steps to get a mouse per episode, e.g. total steps / total mice
  • Variation of the game with added reward(s): different types of food, e.g. one type is worth more than another -- can the snake be trained to go for the higher reward?
  • MENTIONED IN PAPER: Try to improve learning by somehow avoiding body collisions more effectively
  • Experiment: start with Epsilon = 0.1 and gradually increase it over training to see how shifting the exploration/exploitation balance affects the Snake's performance

Discarded:

  • mouse with reduced time in a position, regenerating at a new random position (in which case we could track missed mice) -- random seeding affected other parts of the game, and more time would be needed to figure out how to constrain this.
  • Randomized initial snake movements (RIC), and why this was not done

Epsilon Tests

Epsilon values tested:

  • 0.1 -- pretty good
  • 0.3
  • 0.6 -- already shows reduced performance
  • 0.05 -- similar
  • 0.01 -- not so good
  • 0.07 -- good, better than 0.05
  • 0.12 -- better than 0.07, but 0.1 is still better

0.1 seems to be the best Epsilon value

===
Discarded

  • 0 (the default while choosing the learning rate; performs slightly worse than 0.001 and similar to 0.1)
  • 0.0005
  • 0.9
  • 0.001
  • 0.005

Fixed values for these tests: Learning Rate 0.005, Discount Factor 0.9

Train Q Learning Algorithm

Use a quick workaround that resets the game until #8 is completed, in order to train the AI and test the Q-learning algorithm's state updates.

Extra Tests

  • given extra time, run 1,000,000+ episodes to see just how well the snake can learn (e.g. in terms of peak score and whether the game converges)

  • adjusting board size (larger, smaller, regular)

  • reward tests #73

  • test vs. training: run tests with constant hyperparameters multiple times to check for consistency and overfitting (note: append to file for this); possibly add a run with a changed board size


Discarded Tests:

  • randomized mouse approaches were more complicated than expected, so this was forgone for now
  • increased penalty (-20) for the snake going back on itself: this should be factored in with a modified state representation, and it seemed more fun to focus on other tests given limited time
  • start with Epsilon = 0.1 and gradually increase it over training to see how shifting the exploration/exploitation balance affects the Snake's performance

Test Run

Save final training data and run test

Track data and graph

Single Episode (one way to collect these is sketched after the list)

  • timing (start, finish, total length)

  • steps/frames

  • total successful eats aka score

  • body collisions

  • wall collisions
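A minimal sketch of one way to collect these per-episode measurements (the EpisodeStats name and fields are illustrative, not the project's actual data structures):

import time
from dataclasses import dataclass, field

@dataclass
class EpisodeStats:
    # timing
    start_time: float = field(default_factory=time.time)
    end_time: float = 0.0
    # counters updated as the episode runs
    steps: int = 0              # steps/frames
    score: int = 0              # total successful eats
    body_collisions: int = 0
    wall_collisions: int = 0

    def finish(self) -> None:
        self.end_time = time.time()

    @property
    def total_length(self) -> float:
        return self.end_time - self.start_time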

Implement Q-learning algorithm

Hyperparameters for producing a proof of concept (in other words, keeping things simple to start):

  • η, learning rate = 1
  • γ, discount factor = 0 (only immediate rewards are considered; this should eventually be in (0, 1), and 0.9 may be a good value to use later)
  • r, reward = 2 for eating a mouse (later on, penalties could be added, e.g. -1 for a collision, 1 for surviving)
  • ε, epsilon (randomness) is not used to start

Simple Q-learning update (with η = 1):
Q(s, a) ← r + γ max_a'(Q(s', a'))

Epsilon-greedy Q-learning uses the general update with learning rate η:
Q(s, a) ← Q(s, a) + η (r + γ max_a'(Q(s', a')) − Q(s, a))

Here ε affects only action selection: a random action is taken with probability ε, otherwise the greedy action argmax_a Q(s, a) is chosen.
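A small tabular sketch of the above, assuming a dict-based Q-table and a 4-action encoding (Q_TABLE, ACTIONS, choose_action and update are illustrative names, not the repo's actual API):

import random
from collections import defaultdict

ACTIONS = [0, 1, 2, 3]  # e.g. up, down, left, right
Q_TABLE = defaultdict(lambda: [0.0] * len(ACTIONS))

def choose_action(state, epsilon=0.1):
    # Epsilon-greedy selection: explore with probability epsilon, otherwise exploit.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    q_values = Q_TABLE[state]
    return q_values.index(max(q_values))

def update(state, action, reward, next_state, eta=0.005, gamma=0.9):
    # Q(s, a) += eta * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(Q_TABLE[next_state])
    Q_TABLE[state][action] += eta * (reward + gamma * best_next - Q_TABLE[state][action])

With η = 1 and γ = 0 (the proof-of-concept values above), the update reduces to simply writing the immediate reward into the table.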

Final Paper

  • Editing:
    • learning rate #64 - Chris -- editing
    • discount factor #74 - Chris -- editing
    • Overall - Michelle
  • Graphs: add last (this can be annoying to add early in a shared Google doc)

DONE

  • Intro - Michelle #42
  • Terminology - Chris
  • Implementation details - Michelle
  • Default hyperparameters for tests - Michelle
  • Training vs testing - Michelle
  • Experimentation and Evaluation
    • rewards #73 - Michelle
    • epsilon #34 - Michelle
  • Extra tests #76
    • given extra time, run 1,000,000+ episodes to see just how well the snake can learn (e.g. in terms of peak score and whether the game converges) - Michelle
    • test vs. training: run tests with constant hyperparameters multiple times to check for consistency and overfitting (note: append to file for this); possibly add a run with a changed board size - Michelle
    • adjusting board size (larger, smaller, regular) - Michelle
  • Conclusion -- what more we would experiment with (#33), what we would change, la di da - Michelle

soft reset

We need to come up with a way to keep the game from completely closing after each episode. My idea would be to have the game boot with a prompt (press start) and then, after each collision, return to the prompt while feeding the data to the Q-table class.
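A rough, hypothetical skeleton of that flow (the stub functions below only illustrate the control flow; the real game's window and event handling are not shown):

def wait_for_start():
    input("Press Enter to start the next episode...")  # stand-in for a "press start" screen

def run_episode():
    # placeholder: play until a collision ends the episode, then return its data
    return {"score": 0, "steps": 0, "collision": "wall"}

def main_loop(episodes=5):
    history = []                       # data handed to the Q-table class in the real game
    for _ in range(episodes):
        wait_for_start()               # show the prompt instead of relaunching the program
        history.append(run_episode())  # episode ends on collision
        # soft reset: re-initialise snake/mouse positions here; the window stays open
    return history

if __name__ == "__main__":
    main_loop()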

Discount Factor Tests

  • 0.1
  • 0.5
  • 0.75
  • 0.6

Testing closer to 0.6 and 0.75

  • 0.85
  • 0.9
  • 0.95
  • 0.8

The optimal value seems to be 0.85-0.95

Fix write_data()

Fixes needed (related to #29) -- a rough sketch follows this list:

  • Close file after writing
  • An extra episode is being written to the CSV (test run or extra values in the list?)
  • Need to write the test run to a separate CSV so test and training results can be compared (to view overfitting, effects of learning, etc.). Best if the same function can be used for both training and test data.
  • Add a function comment detailing its purpose (see other functions for reference)
  • Need an append option if the constant RESUME is set to True (since the data would be added to the "current" training session)
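A hedged sketch of how write_data() could cover these points (the column layout and RESUME usage are assumptions about the project, not its actual code):

import csv

RESUME = False  # assumed module-level constant, as described in the issue

def write_data(episodes, filename):
    # Write per-episode results to CSV; append when RESUME is True so a resumed
    # run extends the current training file instead of overwriting it.
    mode = "a" if RESUME else "w"
    with open(filename, mode, newline="") as f:  # context manager closes the file
        writer = csv.writer(f)
        if mode == "w":
            writer.writerow(["episode", "score", "steps", "wall_collisions", "body_collisions"])
        writer.writerows(episodes)

The same function could then be called once for training data and once for the test run, e.g. write_data(training_rows, "training.csv") and write_data(test_rows, "test.csv").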

Interesting Graph: Cumulative Rewards

This would likely have a broader spread and look more interesting visually. Depending on how the other graphs look, this may go from being an extra-time item to an implemented item.

Experiments: Test hyperparameters

  • learning rate #64 -- optimal value 0.005
  • epsilon #75 -- optimal value 0.1
  • discount factor #74 -- optimal values 0.85-0.95
  • rewards #73

DEFAULT VALUES for all tests (except the hyperparameter being tested), collected in the sketch below:
Learning Rate: 0.005
Epsilon: 0.1
Discount Factor: 0.9
Rewards: Wall -100, Body -100, Empty -10, Mouse 100
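For reference, one possible way to keep these defaults in a single place (the constant name and keys are illustrative, not the repo's actual config):

DEFAULTS = {
    "learning_rate": 0.005,
    "epsilon": 0.1,
    "discount_factor": 0.9,
    "rewards": {"wall": -100, "body": -100, "empty": -10, "mouse": 100},
}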

Learning Rate Tests

As the learning rate had a noticeable impact on the snake, we are increasing the number of rates tested.

  • Michelle: 0.0001, 0.01, and 0.1
  • Chris: 0.001, 0.5 and 1
  • Michelle: 0.005

For 20,000 total episodes (originally 5,000), prior to setting epsilon. Changed to 20,000 because of interesting results from the 0.01 learning rate around/after 10,000 runs, where the learning actually seems to devolve a bit.

Track collisions and graph

  • Calculate total wall collisions and body collisions per episode
  • Add as a new measurement
  • Plot the types of collisions over time (wall collisions seem to go down while body collisions go up), which demonstrates that the snake learns the board dimensions (see the plotting sketch below).
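A plotting sketch for the stacked barplot, assuming wall_collisions and body_collisions are lists with one count per episode (the binning and labels are illustrative):

import matplotlib.pyplot as plt

def plot_collisions(wall_collisions, body_collisions, bin_size=500):
    # Sum collisions over bins of episodes so the bars stay readable.
    starts = range(0, len(wall_collisions), bin_size)
    wall = [sum(wall_collisions[i:i + bin_size]) for i in starts]
    body = [sum(body_collisions[i:i + bin_size]) for i in starts]
    labels = [str(i) for i in starts]
    plt.bar(labels, wall, label="wall collisions")
    plt.bar(labels, body, bottom=wall, label="body collisions")  # stacked on top
    plt.xlabel("episode (start of bin)")
    plt.ylabel("collisions per bin")
    plt.legend()
    plt.show()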

interface

Modifications need to be made to the snake program to store the output of the Q-table as a digit to be interpreted as a movement in the game. I can do this over the break in order to start with a basic proof of concept. It wouldn't have any learning, but it could be set up to make a series of random movements.
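A minimal sketch of that interface idea (the 0-3 action encoding and direction names are assumptions, not the program's actual constants):

import random

DIRECTIONS = {0: "UP", 1: "DOWN", 2: "LEFT", 3: "RIGHT"}

def next_move(q_output=None):
    # Interpret the Q-table's output digit as a movement; fall back to a random
    # move for the initial no-learning proof of concept.
    if q_output is None:
        q_output = random.randrange(len(DIRECTIONS))
    return DIRECTIONS[q_output]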

Rewards Tests

  • change the values of rewards to see how that affects performance:
    • eliminate the penalty for an empty square
    • use more extreme (larger) values
    • use relatively closer values (e.g. Wall/Body -1, reward 1)
    • change the collision values so the wall is a lesser penalty than colliding with itself, and the inverse

Results

  • remove the empty tile penalty - okay performance, but a lot of 0 scores, so more collisions early on, and also a noticeable increase in steps
    • higher step penalty -
  • use more extreme (larger, 1000) values - worse performance (note: the empty tile penalty was also increased)
  • use relatively closer values (e.g. Wall/Body -2, reward 2) - worse performance
  • balance rewards (e.g. Wall/Body -20, Mouse 50, empty tile -1) - so-so
  • change collision values so the wall is a lesser penalty than colliding with itself - worse than the previous test
  • inverse of the previous test (note: the previous, reduced wall penalty seemed to result in better performance in the long run)
  • wall: -80, body: -100, mouse: 120, empty tile: -40 -- did poorly
