Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method
This repository is an implementation of the paper Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method (Riedmiller, 2005).
Please ⭐ this repository if you found it useful!
For implementations of other deep learning papers, check the implementations repository!
Neural Fitted Q-Iteration (NFQ) uses a multi-layer neural network as a Q-network: its input is an observation-action pair (s, a) and its output is the action value Q(s, a). Instead of online Q-learning, the paper proposes batch offline updates: experience is collected throughout the episode, and the network is updated on that batch. The paper also suggests the hint-to-goal heuristic, where the network is explicitly trained on artificial samples from the goal region so that it can correctly estimate the value of that region.
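The batch update above amounts to computing fitted Q targets over the collected transitions and regressing the network toward them. A minimal sketch of the target computation (not this repository's actual code; `q` is any callable approximating the cost-to-go, and NFQ minimizes cost, hence the min over actions):

```python
def nfq_targets(q, transitions, gamma=0.95, actions=(0, 1)):
    """Compute fitted Q targets for a batch of transitions.

    q: callable (state, action) -> estimated cost-to-go.
    transitions: list of (state, action, cost, next_state, done) tuples.
    Since NFQ minimizes cost, the Bellman backup takes a min over actions.
    """
    targets = []
    for s, a, c, s_next, done in transitions:
        if done:
            # Terminal transitions get the immediate cost only.
            targets.append(c)
        else:
            targets.append(c + gamma * min(q(s_next, a2) for a2 in actions))
    return targets
```

The Q-network is then trained (with Rprop, in the paper) on the pairs ((s, a), target) for the whole batch at once.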
First, clone this repository from GitHub. Since this repository contains submodules, you should use the --recursive flag.
git clone --recursive https://github.com/seungjaeryanlee/implementations-nfq.git
If you already cloned the repository without the flag, you can download the submodules separately with the git submodule command:
git clone https://github.com/seungjaeryanlee/implementations-nfq.git
git submodule update --init --recursive
After cloning the repository, use requirements.txt for a simple installation of the PyPI packages.
pip install -r requirements.txt
You can read more about each package in the comments of the requirements.txt file!
You can train the NFQ agent on Cartpole Regulator using the given configuration file with the below command:
python train_eval.py -c cartpole.conf
For a reproducible run, use the --RANDOM_SEED flag.
python train_eval.py -c cartpole.conf --RANDOM_SEED=1
To save a trained agent, use the --SAVE_PATH flag.
python train_eval.py -c cartpole.conf --SAVE_PATH=saves/cartpole.pth
To load a trained agent, use the --LOAD_PATH flag.
python train_eval.py -c cartpole.conf --LOAD_PATH=saves/cartpole.pth
To enable logging to TensorBoard or Weights & Biases (W&B), use the appropriate flags.
python train_eval.py -c cartpole.conf --USE_TENSORBOARD --USE_WANDB
This repository uses TensorBoard for offline logging and Weights & Biases for online logging. You can see all the metrics in my summary report on Weights & Biases!
- Of the three environments in the paper (Pole Balancing, Mountain Car, Cartpole Regulator), only the Cartpole Regulator, the most difficult one, was implemented and tested.
- For the Cartpole Regulator, the success condition is relaxed: a state is successful whenever the pole angle is at most 24 degrees away from the upright position. In the original paper, the cart must also be near the center of the track, within a 0.05 tolerance.
- Evaluation of the trained policy is done in only 1 evaluation environment, instead of 1000.
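The relaxed success condition from the list above can be expressed as a simple predicate (a sketch of this implementation's convention, assuming the angle is measured from upright in radians):

```python
import math

def is_success(pole_angle_rad):
    """Relaxed success check: pole within 24 degrees of upright.

    Unlike the original paper, no condition is placed on the cart
    position (the 0.05 center tolerance is dropped).
    """
    return abs(pole_angle_rad) <= math.radians(24)
```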
Although no open-source code was available, the paper had sufficient detail to implement NFQ. However, the results were not fully reproducible: we had to relax the definition of goal states and simplify evaluation. Still, the agent was able to learn to balance a CartPole for 3000 steps while training only on 100-step episodes.
A few nits:
- The paper does not specify the pole angles for goal and forbidden states. We require the pole to be within 0 to 24 degrees of the upright position for a goal state, and treat any state 90 or more degrees from upright as forbidden.
- The paper randomly initializes network weights within [−0.5, 0.5], but does not mention bias initialization.
- The velocities of the success states are not mentioned. We use a normal distribution to randomly generate velocities for the hint-to-goal variant.
- It is unclear whether experience should be added before or after training the agent in each epoch. We assume it is added before training.
- The learning rate for the Rprop optimizer is not specified.
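Regarding the unspecified goal-state velocities, the hint-to-goal samples can be generated along these lines (a sketch, not this repository's actual code; `angle_max_deg` and `velocity_std` are assumed values, since the paper specifies neither):

```python
import random

def hint_to_goal_samples(n, angle_max_deg=24.0, velocity_std=1.0):
    """Generate artificial goal-region samples (hint-to-goal heuristic).

    Angles are drawn uniformly from the goal region; velocities are
    drawn from a normal distribution because the paper does not
    specify them. Each sample is paired with target cost 0, so the
    network is explicitly trained to output the goal value there.
    """
    samples = []
    for _ in range(n):
        angle = random.uniform(-angle_max_deg, angle_max_deg)
        angle_vel = random.gauss(0.0, velocity_std)
        samples.append(((angle, angle_vel), 0.0))
    return samples
```

These artificial samples are simply appended to the real experience batch before each training pass.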