Reinforcement Learning with PyTorch and OpenAI Gym. This repository contains implementations of the Deep Q-Learning (DQN) and Dueling Double Deep Q-Learning (DDDQN) algorithms.
Breakout has multiple levels: the bricks reset and the agent continues scoring in the next level with its remaining lives, which causes a sudden jump in reward.
Install the dependencies:

```
pip install -r requirements.txt
```

Create folders for checkpoints:

```
mkdir -p Checkpoints/Pong
mkdir -p Checkpoints/Breakout
```

Set the environment parameter before running.

Run DQN:

```
python deep_q_learning.py
```

Run DDDQN:

```
python duel_double_deep_q_learning.py
```

To watch a trained agent, set `MODEL_PATH` and the environment parameters in `play_model.py`, then run:

```
python play_model.py
```
- DQN is tricky: it takes numerous attempts and patience to get things working from scratch. For quicker results, one can start with OpenAI Baselines. However, implementing from scratch provides deeper insight into the parameters and their effect on the algorithm's performance.
- PyTorch specific: the target must be detached from the computation graph, or the target Q-values must be computed inside a `no_grad` scope during model optimization. This ensures the optimizer does not update the weights of the target network during backpropagation.
- Be aware of tensor data types. Replay memory should store frames as `uint8`, but they must be converted to `float32` for training. Reward and terminal tensors should also be `float32`.
- Gradient clipping and reward clipping are both extremely important for stability. The figure shown below illustrates a failed training run where the reward collapses as soon as the agent receives a very high reward (Breakout): a sudden large reward leads to unstable gradient updates during backpropagation.
- Batch normalization is not a good choice for DQN and leads to increased training time.
- Huber loss is an essential alternative to MSE loss. For simpler problems such as CartPole and Pong, where the planning horizon is short and the focus is on immediate rewards, MSE loss performs fairly well.
- Parameters such as GAMMA, BATCH_SIZE, LEARNING_RATE, EXPLORATION_FRAMES and MEMORY_BUFFER determine the training time. Small deviations in these parameters should not affect the stability drastically.
- EPSILON scheduling is important during initial training phase. In the later phase of training most of the learning happens with EPSILON_END.
- Gym Deterministic environments are the ones the DeepMind folks used for their 2013 paper.
- Model optimization happens once in POLICY_UPDATE_INTERVAL steps and target update happens once in TARGET_UPDATE_INTERVAL steps. For Deterministic environments each step skips 4 frames. So if POLICY_UPDATE_INTERVAL=4 and TARGET_UPDATE_INTERVAL=100 then backpropagation happens once in 16 game frames and target update happens once in 400 game frames.
- The pre-processing can be confusing. I would recommend this article for a clear idea of how the Atari environments work.
- There are two terminal flags. One is true every time the agent loses a life; the other is true only at the end of an episode. Replay memory should be filled with the flag indicating a lost life.
- Random FIRE actions at the start of an episode help avoid learning a suboptimal policy.
- Start training only when the memory buffer holds a reasonable number of samples. This also helps avoid learning a suboptimal policy.
- Don't give up. Deep RL is frustrating to train but it is also an engineering marvel.
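Several of the notes above (computing targets under `no_grad`, `uint8`-to-`float32` conversion, reward clipping, Huber loss, and gradient clipping) come together in the optimization step. A minimal sketch, assuming hypothetical `policy_net`/`target_net` modules and a replay batch of tensors; the names are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def optimize(policy_net, target_net, optimizer, batch, gamma=0.99):
    # Replay memory stores frames as uint8; convert to float32 for the network.
    states = batch["states"].float() / 255.0
    next_states = batch["next_states"].float() / 255.0
    actions = batch["actions"].long()
    # Rewards and terminal flags must be float32 for the arithmetic below.
    rewards = batch["rewards"].float().clamp(-1.0, 1.0)  # reward clipping
    dones = batch["dones"].float()

    # Q-values of the actions actually taken.
    q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Compute targets inside no_grad so backprop never touches target_net.
    with torch.no_grad():
        next_q = target_net(next_states).max(1).values
        target = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q, target)  # Huber loss
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping for stability.
    torch.nn.utils.clip_grad_norm_(policy_net.parameters(), 10.0)
    optimizer.step()
    return loss.item()
```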
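The EPSILON scheduling note above is typically a linear decay over the exploration phase, after which the value stays fixed. A sketch with illustrative constants (not necessarily the values used in this repo):

```python
# Linear epsilon schedule: decay from EPSILON_START to EPSILON_END over
# EXPLORATION_FRAMES steps, then hold at EPSILON_END for the rest of training.
EPSILON_START = 1.0
EPSILON_END = 0.05
EXPLORATION_FRAMES = 1_000_000

def epsilon(step: int) -> float:
    frac = min(step / EXPLORATION_FRAMES, 1.0)
    return EPSILON_START + frac * (EPSILON_END - EPSILON_START)
```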
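The update-interval arithmetic above (frame skip times update interval) is easy to get wrong when tuning; spelled out with the numbers from the note:

```python
FRAME_SKIP = 4  # Deterministic environments repeat each action for 4 game frames
POLICY_UPDATE_INTERVAL = 4
TARGET_UPDATE_INTERVAL = 100

# Game frames between backprop steps and between target-network syncs.
frames_per_backprop = POLICY_UPDATE_INTERVAL * FRAME_SKIP      # 16 game frames
frames_per_target_sync = TARGET_UPDATE_INTERVAL * FRAME_SKIP   # 400 game frames
```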
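The two-terminal-flags note above can be tracked with a small helper. This is a sketch, assuming the classic Gym step API where the remaining life count arrives in the `info` dict (the exact key varies across Gym/ALE versions):

```python
class EpisodicLifeTracker:
    """Distinguish 'life lost' (store in replay memory) from 'episode done'
    (reset the environment)."""

    def __init__(self):
        self.lives = 0

    def flags(self, done, info):
        lives = info.get("ale.lives", 0)  # key name is an assumption
        # A life was lost if the episode ended, or the count dropped
        # without reaching zero.
        life_lost = done or (0 < lives < self.lives)
        self.lives = lives
        return life_lost, done  # (flag for replay memory, true episode end)
```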