
reinforcement-learning-snake-game's People

Contributors

dependabot[bot], mcfearless999, mkduer


reinforcement-learning-snake-game's Issues

Bug: human controls

low priority
Up and down controls for the human player are reversed. The controls are also sluggish.

Extra Time aka Future Work

These could be added to the paper given more work with extra time:

  • Escape key for AI play
  • Interesting Graph: Collisions (use a ratio over a percentage of episodes; show the percent of wall collisions vs. body collisions -- stacked barplot)
  • Average number of steps to get a mouse per episode, e.g. total steps / total mice
  • Variation of the game with added reward(s): different types of food, e.g. one type is worth more than another -- can the snake be trained to go for the higher reward?
  • MENTIONED IN PAPER: Try to improve learning by somehow avoiding body collisions more effectively
  • Experiment: start with Epsilon = 0.1 and gradually increase it over training to see how shifting the exploration/exploitation balance affects the Snake's performance

Discarded:

  • mouse with reduced time in a position, regenerating at a new random position (in which case we could track missed mice) -- random seeding affected other parts of the game, and more time would be needed to figure out how to constrain this.
  • Randomized initial snake movements (RIC), and why this was not done

Epsilon Tests

Epsilon values tested:

  • 0.1 -- pretty good
  • 0.3
  • 0.6 -- already shows reduced performance
  • 0.05 -- similar
  • 0.01 -- not so good
  • 0.07 -- good, better than 0.05
  • 0.12 -- better than 0.07, but 0.1 is still better

0.1 seems to be the best Epsilon value

===
Discarded

  • 0 (the default while choosing the learning rate; performs slightly worse than 0.001 and similar to 0.1)
  • 0.0005
  • 0.9
  • 0.001
  • 0.005

Fixed values for these tests: Learning Rate 0.005, Discount Factor 0.9

Train Q Learning Algorithm

Use a quick workaround that resets the game until #8 is completed, in order to train the AI and test the Q-learning algorithm's state updates.

Extra Tests

  • given extra time, run 1,000,000+ episodes to see just how well the snake can learn (e.g. in terms of peak score and whether the game converges)

  • adjusting board size (larger, smaller, regular)

  • reward tests #73

  • test vs. training: run tests with constant hyperparameters multiple times to check for consistency and overfitting (note: append to file for this); possibly add a run with a changed board size


Discarded Tests:

  • randomized mouse approaches were more complicated than expected, so this was forgone for now
  • increased penalty (-20) for the snake going back on itself: this should be factored in with a modified state representation, and it seemed more fun to focus on other tests given limited time
  • start with Epsilon = 0.1 and gradually increase it over training to see how shifting the exploration/exploitation balance affects the Snake's performance

Test Run

Save final training data and run test

Track data and graph

Single Episode (one way to collect these is sketched after the list)

  • timing (start, finish, total length)

  • steps/frames

  • total successful eats aka score

  • body collisions

  • wall collisions
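A minimal sketch of one way to collect these per-episode measurements (the EpisodeStats name and fields are illustrative, not the project's actual data structures):

import time
from dataclasses import dataclass, field

@dataclass
class EpisodeStats:
    # timing
    start_time: float = field(default_factory=time.time)
    end_time: float = 0.0
    # counters updated as the episode runs
    steps: int = 0              # steps/frames
    score: int = 0              # total successful eats
    body_collisions: int = 0
    wall_collisions: int = 0

    def finish(self) -> None:
        self.end_time = time.time()

    @property
    def total_length(self) -> float:
        return self.end_time - self.start_time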

Implement Q-learning algorithm

Hyperparameters for producing a proof of concept (in other words, keeping things simple to start):

  • η, learning rate = 1
  • γ, discount factor = 0 (only immediate rewards are considered; this should eventually be in (0, 1), and 0.9 may be a good value to use later)
  • r, reward = 2 for eating a mouse (later on, penalties could be added, e.g. -1 for a collision, 1 for surviving)
  • ε, epsilon (randomness) is not used to start

Simple Q-learning update (with η = 1):
Q(s, a) ← r + γ max_a'(Q(s', a'))

Epsilon-greedy Q-learning uses the general update with learning rate η:
Q(s, a) ← Q(s, a) + η (r + γ max_a'(Q(s', a')) − Q(s, a))

Here ε affects only action selection: a random action is taken with probability ε, otherwise the greedy action argmax_a Q(s, a) is chosen.
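A small tabular sketch of the above, assuming a dict-based Q-table and a 4-action encoding (Q_TABLE, ACTIONS, choose_action and update are illustrative names, not the repo's actual API):

import random
from collections import defaultdict

ACTIONS = [0, 1, 2, 3]  # e.g. up, down, left, right
Q_TABLE = defaultdict(lambda: [0.0] * len(ACTIONS))

def choose_action(state, epsilon=0.1):
    # Epsilon-greedy selection: explore with probability epsilon, otherwise exploit.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    q_values = Q_TABLE[state]
    return q_values.index(max(q_values))

def update(state, action, reward, next_state, eta=0.005, gamma=0.9):
    # Q(s, a) += eta * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(Q_TABLE[next_state])
    Q_TABLE[state][action] += eta * (reward + gamma * best_next - Q_TABLE[state][action])

With η = 1 and γ = 0 (the proof-of-concept values above), the update reduces to simply writing the immediate reward into the table.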

Final Paper

  • Editing:
    • learning rate #64 - Chris -- editing
    • discount factor #74 - Chris -- editing
    • Overall - Michelle
  • Graphs: add last (this can be annoying to add early in a shared Google doc)

DONE

  • Intro - Michelle #42
  • Terminology - Chris
  • Implementation details - Michelle
  • Default hyperparameters for tests - Michelle
  • Training vs testing - Michelle
  • Experimentation and Evaluation
    • rewards #73 - Michelle
    • epsilon #34 - Michelle
  • Extra tests #76
    • given extra time, run 1,000,000+ episodes to see just how well the snake can learn (e.g. in terms of peak score and whether the game converges) - Michelle
    • test vs. training: run tests with constant hyperparameters multiple times to check for consistency and overfitting (note: append to file for this); possibly add a run with a changed board size - Michelle
    • adjusting board size (larger, smaller, regular) - Michelle
  • Conclusion -- what more we would experiment with (#33), what we would change, la di da - Michelle

soft reset

We need to come up with a way to keep the game from completely closing after each episode. My idea would be to have the game boot with a prompt (press start) and then, after each collision, return to the prompt while feeding the data to the Q-table class.
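A rough, hypothetical skeleton of that flow (the stub functions below only illustrate the control flow; the real game's window and event handling are not shown):

def wait_for_start():
    input("Press Enter to start the next episode...")  # stand-in for a "press start" screen

def run_episode():
    # placeholder: play until a collision ends the episode, then return its data
    return {"score": 0, "steps": 0, "collision": "wall"}

def main_loop(episodes=5):
    history = []                       # data handed to the Q-table class in the real game
    for _ in range(episodes):
        wait_for_start()               # show the prompt instead of relaunching the program
        history.append(run_episode())  # episode ends on collision
        # soft reset: re-initialise snake/mouse positions here; the window stays open
    return history

if __name__ == "__main__":
    main_loop()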

Discount Factor Tests

  • 0.1
  • 0.5
  • 0.75
  • 0.6

Testing closer to 0.6 and 0.75

  • 0.85
  • 0.9
  • 0.95
  • 0.8

The optimal value seems to be 0.85-0.95

Fix write_data()

Fixes needed (related to #29) -- a rough sketch follows this list:

  • Close file after writing
  • An extra episode is being written to the CSV (test run or extra values in the list?)
  • Need to write the test run to a separate CSV so test and training results can be compared (to view overfitting, effects of learning, etc.). Best if the same function can be used for both training and test data.
  • Add a function comment detailing its purpose (see other functions for reference)
  • Need an append option if the constant RESUME is set to True (since the data would be added to the "current" training session)
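A hedged sketch of how write_data() could cover these points (the column layout and RESUME usage are assumptions about the project, not its actual code):

import csv

RESUME = False  # assumed module-level constant, as described in the issue

def write_data(episodes, filename):
    # Write per-episode results to CSV; append when RESUME is True so a resumed
    # run extends the current training file instead of overwriting it.
    mode = "a" if RESUME else "w"
    with open(filename, mode, newline="") as f:  # context manager closes the file
        writer = csv.writer(f)
        if mode == "w":
            writer.writerow(["episode", "score", "steps", "wall_collisions", "body_collisions"])
        writer.writerows(episodes)

The same function could then be called once for training data and once for the test run, e.g. write_data(training_rows, "training.csv") and write_data(test_rows, "test.csv").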

Interesting Graph: Cumulative Rewards

This would likely have a broader spread and look more interesting visually. Depending on how the other graphs look, this may go from being an extra-time item to an implemented item.

Experiments: Test hyperparameters

  • learning rate #64 -- optimal value 0.005
  • epsilon #75 -- optimal value 0.1
  • discount factor #74 -- optimal values 0.85-0.95
  • rewards #73

DEFAULT VALUES for all tests (except the hyperparameter being tested), collected in the sketch below:
Learning Rate: 0.005
Epsilon: 0.1
Discount Factor: 0.9
Rewards: Wall -100, Body -100, Empty -10, Mouse 100
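For reference, one possible way to keep these defaults in a single place (the constant name and keys are illustrative, not the repo's actual config):

DEFAULTS = {
    "learning_rate": 0.005,
    "epsilon": 0.1,
    "discount_factor": 0.9,
    "rewards": {"wall": -100, "body": -100, "empty": -10, "mouse": 100},
}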

Learning Rate Tests

As the learning rate had a noticeable impact on the snake, we are increasing the number of rates tested.

  • Michelle: 0.0001, 0.01, and 0.1
  • Chris: 0.001, 0.5 and 1
  • Michelle: 0.005

For 20,000 total episodes (originally 5,000), prior to setting epsilon. Changed to 20,000 because of interesting results from the 0.01 learning rate around/after 10,000 runs, where the learning actually seems to devolve a bit.

Track collisions and graph

  • Calculate total wall collisions and body collisions per episode
  • Add as a new measurement
  • Plot the types of collisions over time (wall collisions seem to go down while body collisions go up), which demonstrates that the snake learns the board dimensions (see the plotting sketch below).
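A plotting sketch for the stacked barplot, assuming wall_collisions and body_collisions are lists with one count per episode (the binning and labels are illustrative):

import matplotlib.pyplot as plt

def plot_collisions(wall_collisions, body_collisions, bin_size=500):
    # Sum collisions over bins of episodes so the bars stay readable.
    starts = range(0, len(wall_collisions), bin_size)
    wall = [sum(wall_collisions[i:i + bin_size]) for i in starts]
    body = [sum(body_collisions[i:i + bin_size]) for i in starts]
    labels = [str(i) for i in starts]
    plt.bar(labels, wall, label="wall collisions")
    plt.bar(labels, body, bottom=wall, label="body collisions")  # stacked on top
    plt.xlabel("episode (start of bin)")
    plt.ylabel("collisions per bin")
    plt.legend()
    plt.show()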

interface

Modifications need to be made to the snake program to store the output of the Q-table as a digit to be interpreted as a movement in the game. I can do this over the break in order to start with a basic proof of concept. It wouldn't have any learning, but it could be set up to make a series of random movements.
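A minimal sketch of that interface idea (the 0-3 action encoding and direction names are assumptions, not the program's actual constants):

import random

DIRECTIONS = {0: "UP", 1: "DOWN", 2: "LEFT", 3: "RIGHT"}

def next_move(q_output=None):
    # Interpret the Q-table's output digit as a movement; fall back to a random
    # move for the initial no-learning proof of concept.
    if q_output is None:
        q_output = random.randrange(len(DIRECTIONS))
    return DIRECTIONS[q_output]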

Rewards Tests

  • change the values of rewards to see how that affects performance:
    • eliminate the penalty for an empty square
    • use more extreme (larger) values
    • use relatively closer values (e.g. Wall/Body -1, reward 1)
    • change the collision values so the wall is a lesser penalty than colliding with itself, and the inverse

Results

  • remove the empty tile penalty - okay performance, but a lot of 0 scores, so more collisions early on, and also a noticeable increase in steps
    • higher step penalty -
  • use more extreme (larger, 1000) values - worse performance (note: the empty tile penalty was also increased)
  • use relatively closer values (e.g. Wall/Body -2, reward 2) - worse performance
  • balance rewards (e.g. Wall/Body -20, Mouse 50, empty tile -1) - so-so
  • change collision values so the wall is a lesser penalty than colliding with itself - worse than the previous test
  • inverse of the previous test (note: the previous, reduced wall penalty seemed to result in better performance in the long run)
  • wall: -80, body: -100, mouse: 120, empty tile: -40 -- did poorly
