uoa-cares / cares_reinforcement_learning Goto Github PK

CARES Reinforcement Learning Package

Python 100.00%

reinforcement-learning deep-learning machine-learning

cares_reinforcement_learning's Introduction

The CARES reinforcement learning bed used as the foundation for RL related projects.

Motivation

Reinforcement Learning Algorithms (that is to say, how the Neural Networks are updated) stay the same no matter the application. This package is designed so that these algorithms are only programmed once and can be "plugged & played" into different environments.

Usage

Consult the repository wiki for a guide on how to use the package

Installation Instructions

If you want to utilise the GPU with Pytorch install CUDA first - https://developer.nvidia.com/cuda-toolkit

Install Pytorch following the instructions here - https://pytorch.org/get-started/locally/

git clone the repository into your desired directory on your local machine

Run pip3 install -r requirements.txt in the root directory of the package

To make the module globally accessible in your working environment run pip3 install --editable . in the project root

Running an Example

This package serves as a library of specific RL algorithms and utility functions being used by the CARES RL team. For an example of how to use this package in your own environments see the example gym packages below that use these algorithms for training agents on a variety of simulated and real-world tasks.

Gym Environments

We have created a standardised general purpose gym that wraps the most common simulated environments used in reinforcement learning into a single easy to use place: https://github.com/UoA-CARES/gymnasium_envrionments

This package contains wrappers for the following gym environments:

Deep Mind Control Suite

The standard Deep Mind Control suite: https://github.com/google-deepmind/dm_control

OpenAI Gymnasium

The standard OpenAI Gymnasium: https://github.com/Farama-Foundation/Gymnasium

Game Boy Emulator

Environment running Gameboy games utilising the pyboy wrapper: https://github.com/UoA-CARES/pyboy_environment

Gripper Gym

The gripper gym contains all the code for training our dexterous robotic manipulators: https://github.com/UoA-CARES/gripper_gym

F1Tenth Autonomous Racing

The Autonomous F1Tenth package contains all the code for training our F1Tenth platforms to autonomously race: https://github.com/UoA-CARES/autonomous_f1tenth

Package Structure

cares_reinforcement_learning/
├─ algorithm/
├─ encoders/
│  ├─ autoencoder.py
│  ├─ ...
├─ policy/
│  │  ├─ TD3.py
│  │  ├─ ...
│  ├─ value/
│  │  ├─ DQN.py
│  │  ├─ ...
├─ memory/
│  ├─ prioritised_replay_buffer.py
├─ networks/
│  ├─ DQN/
│  │  ├─ network.py
│  ├─ TD3.py/
│  │  ├─ actor.py
│  │  ├─ critic.py
│  ├─ ...
├─ util/
│  ├─ network_factory.py
│  ├─ ...

algorithm: contains update mechanisms for neural networks as defined by the algorithm.

encoders: contains the implementations for various autoencoders and variational autoencoders

memory: contains the implementation of various memory buffers - e.g. Prioritised Experience Replay

networks: contains standard neural networks that can be used with each algorithm

util: contains common utility classes

Encoders

An autoencoder consists of an encoder that compresses input data into a latent representation and a decoder that reconstructs the original data from this compressed form. Variants of autoencoders, such as Variational Autoencoders (VAEs) and Beta-VAEs, introduce probabilistic elements and regularization techniques to enhance the quality and interpretability of the latent space. While standard autoencoders focus on reconstruction accuracy, advanced variants like Beta-VAE and Squared VAE (SqVAE) aim to improve latent space disentanglement and sparsity, making them valuable for generating more meaningful and structured representations.

We have re-implemented a range of autoencoder/variational-autoencoder methodologies for use with the RL algorithms implemented within this library. For more information on the encoders available in this package, please refer to the README in the encoders folder. These algorithms can be used stand-alone beyond their use here for RL.

Utilities

CARES RL provides a number of useful utility functions and classes for generating consistent results across the team. These utilities should be utilised in the new environments we build to test our approaches.

Record.py

The Record class allows data to be saved into a consistent format during training. This allows all data to be consistently formatted for plotting against each other for fair and consistent evaluation.

All data from a training run is saved into the directory specified in the CARES_LOG_BASE_DIR environment variable. If not specified, this will default to '~/cares_rl_logs'.

You may specify a custom log directory format using the CARES_LOG_PATH_TEMPLATE environment variable. This path supports variable interpolation such as the algorithm used, seed, date etc. This defaults to "{algorithm}/{algorithm}-{domain_task}-{date}/{seed}" so that each run is saved as a new seed under the algorithm and domain-task pair for that algorithm.

The following variables are supported for log path template variable interpolation:

algorithm
domain
task
domain_task: The domain and task or just task if domain does not exist
gym
seed
date: The current date in the YY_MM_DD-HH-MM-SS format
run_name: The run name if it is provided, otherwise "unnamed"
run_name_else_date: The run name if it is provided, otherwise the date

This folder will contain the following directories and information saved during the training session:

├─ <log_path>
|  ├─ env_config.py
|  ├─ alg_config.py
|  ├─ train_config.py
|  ├─ data
|  |  ├─ train.csv
|  |  ├─ eval.csv
|  ├─ figures
|  |  ├─ eval.png
|  |  ├─ train.png
|  ├─ models
|  |  ├─ model.pht
|  |  ├─ CHECKPOINT_N.pht
|  |  ├─ ...
|  ├─ videos
|  |  ├─ STEP.mp4
├─ ...

plotting.py

The plotting utility will plot the data contained in the training data based on the format created by the Record class. An example of how to plot the data from one or multiple training sessions together is shown below.

Plot the results of a single training instance

python3 plotter.py -s ~/cares_rl_logs -d ~/cares_rl_logs/ALGORITHM/ALGORITHM-TASK-YY_MM_DD:HH:MM:SS

Plot and compare the results of two or more training instances

python3 plotter.py -s ~/cares_rl_logs -d ~/cares_rl_logs/ALGORITHM_A/ALGORITHM_A-TASK-YY_MM_DD:HH:MM:SS ~/cares_rl_logs/ALGORITHM_B/ALGORITHM_B-TASK-YY_MM_DD:HH:MM:SS

Running 'python3 plotter.py -h' will provide details on the plotting parameters and control arguments.

python3 plotter.py -h

configurations.py

Provides baseline data classes for environment, training, and algorithm configurations to allow for consistent recording of training parameters.

RLParser.py

Provides a means of loading environment, training, and algorithm configurations through command line or configuration files. Enables consistent tracking of parameters when running training on various algorithms.

NetworkFactory.py

A factory class for creating a baseline RL algorithm that has been implemented into the CARES RL package.

MemoryFactory.py

A factory class for creating a memory buffer that has been implemented into the CARES RL package.

Supported Algorithms

Algorithm	Observation Space	Action Space	Paper Reference
DQN	Vector	Discrete	DQN Paper
DoubleDQN	Vector	Discrete	DoubleDQN Paper
DuelingDQN	Vector	Discrete	DuelingDQN Paper
SACD	Vector	Discrete	SAC-Discrete Paper
-----------	--------------------------	------------	---------------
PPO	Vector	Continuous	PPO Paper
DDPG	Vector	Continuous	DDPG Paper
TD3	Vector	Continuous	TD3 Paper
SAC	Vector	Continuous	SAC Paper
PERTD3	Vector	Continuous	PERTD3 Paper
PERSAC	Vector	Continuous	PERSAC Paper
PALTD3	Vector	Continuous	PALTD3 Paper
LAPTD3	Vector	Continuous	LAPTD3 Paper
LAPSAC	Vector	Continuous	LAPSAC Paper
LA3PTD3	Vector	Continuous	LA3PTD3 Paper
LA3PSAC	Vector	Continuous	LA3PSAC Paper
MAPERTD3	Vector	Continuous	MAPERTD3 Paper
MAPERSAC	Vector	Continuous	MAPERSAC Paper
RDTD3	Vector	Continuous	WIP
RDSAC	Vector	Continuous	WIP
REDQ	Vector	Continuous	REDQ Paper
TQC	Vector	Continuous	TQC Paper
CTD4	Vector	Continuous	CTD4 Paper
-----------	--------------------------	------------	---------------
NaSATD3	Image	Continuous	In Submission
TD3AE	Image	Continuous	TD3AE Paper
SACAE	Image	Continuous	SACAE Paper

cares_reinforcement_learning's People

Contributors

Stargazers

Watchers

Forkers

ikui753 yzuo500

cares_reinforcement_learning's Issues

Refactor – Move saving/loading models into helper functions

Current Behaviour

Each algorithm/agent has save/load models function. Duplicate code. Also, the loading paths are hard coded snippet below:

def load_models(self, filename):
        self.actor_net.load_state_dict(torch.load(f'models/{filename}_actor.pht'))
        self.critic_net.load_state_dict(torch.load(f'models/{filename}_critic.pht'))
        logging.info("models has been loaded...")

Potential Behaviour

Could reduce the duplication by extracting the saving and loading to a helper functions in util.
This way, you deal with the saving and loading in your training loop.

from cares_reinforcement_learning.util import helpers as hlp

# Loading a model
actor = Actor(...)
hlp.load_model(actor, saved_actor_path)
critic = Critic(...)
hlp.load_model(critic, saved_critic_path)

# After training
hlp.save_model(actor, file_path)
hlp.save_model(critic, file_path)

Notes

In saying this, saving and loading models is so little code that it's possibly not worth making such an optimisation lmao.

OpenCV creates QT issue

The new installation version of OpenCV can create the following error message:

QObject::moveToThread: Current thread (...) is not the object's thread (...).
Cannot move to target thread (...)
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "/(...)site-packages/cv2/qt/plugins" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

This issue can be resolved manually by installing opencv headless:

pip uninstall opencv-python
pip install opencv-python-headless

Record Memory Leak

The plotting function has a memory leak.

Separate Info from Algorithm

Update the record example

Unused Dependencies - Dependency Management

Users of the package, when integrating the agents into their own environments, download dependencies they don't need. For example Open AI gym, or Deep mind control suite. These aren't necessary for applications in other environments.

Really just a nitpick, but thought I should note it

Add Dockerfile to Essential GPU

TQC CPU Crash

TQC crashes without gpu access:

Shall we change the time format when save the data

I would like to suggest a minor change that use "HH_MM_SS" rather than "HH:MM:SS" when saving the data. The current format is not very convient when using a docker image on a Windows machine.

Automate the Logging Process

Features

Dynamic Logging (Printing):

logger.log(
   S=10000,
   R=100.0,
   AL=...,
   CL=...,
   ...
)
# Prints
"| S: 10000  |   R: 100   |   AL: ...   |   CL: ...   | ... | "

Should automatically write to file (checkpoint frequency).

Potential to pass in actor and critic on class creation (if the implementation is a class), and have automated model saving too.

Look for inspiration at the gripper code stuff

Investigate the difference between Critics

Example loop for intrinsic reward methods (NaSA-TD3)

Develop an example training loop for intrinsic reward methods - or incorporate into existing if sensible and clean

Add PPO to training config

PPO doesn't use the same training config as the off policy algorithms.

Better Avoid using keywords as a variable's name. Type Error when calling create_memory().

TypeError: MemoryFactory.create_memory() got multiple values for argument 'args'.

Args is a keyword for arguments, so it is potentially overwriting a positional argument.

One solution is changing the name, e.g., changing the name 'args' to 'arg'.

Should avoid the common keywords that potentially aliases: min, max, next, break, args, kwargs et al.

Adding network/actor&critic path to the RLParser

Adding the network or actor/critic paths to RLParser so that models can be loaded and evaluated after their training

In the original MAPERTD3 algorithm, the parameter alpha is sent to memory, we do not utilize it, too.

The code for MAPER isn't provided in the GitHub repository. Instead, it's available as supplementary material in their paper, which I believe you downloaded previously from the following link:
https://openreview.net/forum?id=WuEiafqdy9H

The SegmentTree baseline they used can be found here: https://github.com/openai/baselines/blob/master/baselines/common/segment_tree.py

PPO is tightly coupled to RolloutBuffer

Current Behaviour

You pass the memory buffer directly into the PPO train_policy. The only thing PPO.py does with the memory buffer is read from it, and then clear it.

Expected Behaviour

You can read from the buffer and clear it from the outside. This means you can pass the experience to the train_policy which keeps it consistent with all the other algorithm implementations.

Implement - REDQ

Paper: https://openreview.net/forum?id=AY8zfZm0tDd
Example Code https://github.com/BY571/Randomized-Ensembled-Double-Q-learning-REDQ-

Update - checkpoint_frequency

Currently it means per number of episodes. But it would make sense to change it to how many times we want to save the model?
suggest: round((max_steps_training/episode_horizon)/checkpoint_frequency), then replace the current self.checkpoint_frequency with this.

e.g.
self.num_ep_for_save = round((max_steps_training/episode_horizon)/checkpoint_frequency)
if self.network is not None and self.log_count % self.num_ep_for_save == 0:
self.network.save_models(
f"{self.algorithm}-checkpoint-{self.log_count}", self.directory
)

Create a Release for CARES RL

Implement - Dueling DQN

Implement Dueling DQN

https://arxiv.org/abs/1511.06581

Implement - D4PG

Implement D4PG

https://arxiv.org/abs/1804.08617

Add Algorithm Tests

Devise a testing strategy for algorithms.
Potential Approaches:
Seed –> ensure that values are identical/at least of value
Last X episodes averaged –> check last X episodes and average, to make sure that reward matches that for solving the task

Add default values for Algorithm parameters

During evaluation, we don't care about GAMMA and TAU and others because they are only useful for training. But if we want to create an agent for evaluation, we have to set both those fields. It's better to have default values so that they can be omitted when not needed

unit tests + documentation - plotter and Record

Storing experience/memory to GPU

When memory.add, process the experience to tensors and send to GPU, and then adding them to the memory array will improve the time performance, but use more GPU memory.

At the moment we are storing memory in RAM, but each experience is used multiple times to train the network on GPU, so multiple transfers from RAM to GPU occurs for a single experience.

We can potentially skip the multiple transfers by storing the experience to GPU when they are added. The drawback is we are using the limited GPU memory. This method could achieve speedups in some cases.

Implement Live plotting

During the training process it is useful to visualise the data being collected so far in real time. This includes the reward over time, the loss over time, and other metrics.

We usually want to visualise more than one thing at a time, so subplots may be useful. Ideally we can set the number of things we want to visualise at the instantiation of the class (if it is a class), and then live update those sub plots. This means we only have one window that contains the plots we're interested in.

In addition, when we previously attempted to live plot (using the Plot class) every update brought the figure window to the forefront of the screen. This can become annoying, so ideally we want to update the plots without bringing the entire window to the forefront.

Implement - Action-Selection Strategies for Exploration (DQN)

Add alternative exploration strategies for DQN based approaches

https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-7-action-selection-strategies-for-exploration-d3a97b7cceaf

Implement - Hindsight experience replay

Implement hindsight experience replay into the package

https://arxiv.org/abs/1707.01495

Update Wiki

Bug - MAPERTD3/SAC not learning

There is an issue preventing the MAPER variations from learning as expected from sample code

Implement - save training state for pause and restart training

Functions for enabling pausing and restarting a training which involved saving all the neccesary information needed to restart the training exactly where it left off.

Implement - TQC

Implement: https://arxiv.org/pdf/2005.04269.pdf

Refactor- Save/Load functions are not consistent across all models

Need to make the save/load function to be consistent across the algorithms.

Algorithm classes return value

DDPG returns a dictionary – called info. While other algorithms return td_error (to support the PER).

Much prefer the info – as that means we can return the actor and critic loss for logging purposes.

What do we think?

Extract memory into own config type

Currently, every algorithm config defines a memory type. Meaning changing anything memory related must be changed in every algorithm config.
Additionally, current implementation does not

Set up a good foundation for the future when prioritised replay buffers will introduced
Allow for customising buffer size

Desired Behaviour

The current Plot class/functions doesn't meet the needs of users:

Plotting Uses:

In-training plotting – meaning it shouldn't block the thread
Single Use Plotting – passing x and y values and just plot
Multiple Plots – multiple plots should be able to be generated simultaneously
- If during training you want to plot both reward and loss

Plotting Types:

plain x and y
average

Shift common.py into networks folder?

Common.py contains commonly used network setups for various algorithms - it might make more sense inside of the networks folder structure

Will we consider switch to Gymnasium instead?

As suggested in https://github.com/openai/gym, they mainly maintain the Gymnasium project (https://gymnasium.farama.org/).

To switch to the Gymnasium:

changing "import gym" to "import gymnasium as gym".
state, _ = env.reset(seed=seed)
next_state, reward, terminated, truncated, info = env.step(action)

Use Record within the Training Examples

Use the new Record class within the training examples to save the training results

uoa-cares / cares_reinforcement_learning Goto Github PK

cares_reinforcement_learning's Introduction

Motivation

Usage

Installation Instructions

Running an Example

Gym Environments

Deep Mind Control Suite

OpenAI Gymnasium

Game Boy Emulator

Gripper Gym

F1Tenth Autonomous Racing

Package Structure

Encoders

Utilities

Record.py

plotting.py

configurations.py

RLParser.py

NetworkFactory.py

MemoryFactory.py

Supported Algorithms

cares_reinforcement_learning's People

Contributors

Stargazers

Watchers

Forkers

cares_reinforcement_learning's Issues

Current Behaviour

Potential Behaviour

Notes

Features

Current Behaviour

Expected Behaviour

Desired Behaviour

Plotting Uses:

Plotting Types:

Recommend Projects

Recommend Topics

Recommend Org