This repository was mainly created to solve a task for the JetBrains Research Lab summer 2021 internship. The task's description is: "Implement DQN, policy gradient or actor-critic RL algorithm to solve Mountain-Car gym environment"
I have implemented the (simple/vanilla) Deep Q-Network (DQN) algorithm with an experience replay buffer and periodic updates of the target network inside "DQN.py".
A gif of an agent trained with this implementation of DQN:
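For reference, below is a minimal sketch of the main ingredients mentioned above: a replay buffer, an online Q-network, and a target network that is synced periodically. It assumes PyTorch and MountainCar's 2-dimensional observation with 3 discrete actions; all names here are illustrative and do not necessarily match the actual code in "DQN.py".

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.tensor(states, dtype=torch.float32),
                torch.tensor(actions, dtype=torch.int64),
                torch.tensor(rewards, dtype=torch.float32),
                torch.tensor(next_states, dtype=torch.float32),
                torch.tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)


def make_q_net(obs_dim=2, n_actions=3, hidden=64):
    """Small MLP mapping a MountainCar observation to one Q-value per action."""
    return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))


def dqn_update(policy_net, target_net, buffer, optimizer,
               batch_size=64, gamma=0.99):
    """One gradient step on the standard DQN TD target."""
    if len(buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    # Q(s, a) for the actions that were actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped target computed with the frozen target network
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# Every N environment steps, copy the online weights into the target network:
# target_net.load_state_dict(policy_net.state_dict())
```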
After testing with the environment's original reward, training showed no improvement. So, I changed the reward function to test different behaviors and see some improvement.
Multiple reward functions were tested to encourage the desired behaviors:
- Move fast to the right and left -> correlated with the velocity [2nd observation]
- Move closer to the goal -> correlated with the position [1st observation]
During these experiments, I noticed the following:
- When only the position is in the reward (or the position term dominates), the agent tries to go up by driving only to the right, instead of swinging left and right to build momentum.
- When only the velocity is in the reward (or the velocity term dominates), the agent only moves fast right and left and does not care about the real goal (the position).
For that reason, I made a new reward function:
reward = r + abs(velocity)*10 - abs(position - 0.5)
where r is the original reward (0 or -1) from the environment and 0.5 is the desired position of the car. The weight factor (10) was tuned experimentally: if it is too large, the car only cares about gaining velocity and never reaches the desired position; if it is too small, the agent tries to get closer to the goal without first gaining the velocity it needs to swing up to the top.
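A minimal sketch of this shaped reward as a Python function (the function name and signature are illustrative, not necessarily what the training script uses):

```python
def shaped_reward(r, observation, goal_position=0.5, velocity_weight=10.0):
    """Shaped reward described above (sketch).

    r           : original environment reward (0 or -1)
    observation : MountainCar observation = [position, velocity]
    """
    position, velocity = observation
    # Encourage building momentum (|velocity|) while penalising distance
    # from the goal position (0.5). The factor 10 was tuned by hand.
    return r + abs(velocity) * velocity_weight - abs(position - goal_position)
```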
- Start the training script:
  python3 learn.py
- Start the trained agent:
  python3 run_agent.py
Training reward plot for 500 epochs:
Testing reward plot for 100 epochs:
- DQN paper
- Good online tutorial for theory
- Good course slides (CS285)
- Gym environment for MountainCar
- Implement vanilla DQN as the value-based RL algorithm
- Write a good README.md with cool gifs
- Implement REINFORCE for policy gradient (maybe in a different repo)
- Implement a simple Actor-Critic algorithm (maybe in a different repo)
- Add more plots from more experiments with different seeds
- Perform different experiments
- Add trained agents and videos directories
- Add plots of different results with different algorithms
- Use RLlib to show the difference between this implementation and the library's implementation, and provide plots
- Create a report with references and papers