
agilerl / agilerl

492 stars · 8 watchers · 37 forks · 54.61 MB

Streamlining reinforcement learning with RLOps. State-of-the-art RL algorithms and tools.

Home Page: https://agilerl.com

License: Apache License 2.0

Python 100.00%
reinforcement-learning deep-reinforcement-learning deep-learning rlops evolutionary-algorithms machine-learning pytorch gym hpo python

agilerl's Introduction

AgileRL

Reinforcement learning streamlined.
Easier and faster reinforcement learning with RLOps. Visit our website. View documentation.
Join the Discord Server to collaborate.


NEW: AgileRL now introduces evolvable Contextual Multi-armed Bandit Algorithms!

This is a Deep Reinforcement Learning library focused on improving development by introducing RLOps - MLOps for reinforcement learning.

This library is initially focused on reducing the time taken for training models and hyperparameter optimization (HPO) by pioneering evolutionary HPO techniques for reinforcement learning.
Evolutionary HPO has been shown to drastically reduce overall training times by automatically converging on optimal hyperparameters, without requiring numerous training runs.
We are constantly adding more algorithms and features. AgileRL already includes state-of-the-art evolvable on-policy, off-policy, offline, multi-agent and contextual multi-armed bandit reinforcement learning algorithms with distributed training.

AgileRL offers 10x faster hyperparameter optimization than SOTA.
Global steps is the sum of every step taken by any agent in the environment, including across an entire population, during the entire hyperparameter optimization process.


Benchmarks

Reinforcement learning algorithms and libraries are usually benchmarked once the optimal hyperparameters for training are known, but it often takes hundreds or thousands of experiments to discover these. This is unrealistic and does not reflect the true, total time taken for training. What if we could remove the need to conduct all these prior experiments?

In the charts below, a single AgileRL run, which automatically tunes hyperparameters, is benchmarked against the multiple training runs traditionally required for hyperparameter optimization with Optuna, demonstrating the real time savings possible. Global steps is the sum of every step taken by any agent in the environment, including across an entire population.

AgileRL offers an order-of-magnitude speed-up in hyperparameter optimization versus popular reinforcement learning training frameworks combined with Optuna. Remove the need for multiple training runs and save yourself hours.

AgileRL also supports multi-agent reinforcement learning using the PettingZoo parallel API. The charts below highlight the performance of our MADDPG and MATD3 algorithms with evolutionary hyperparameter optimization (HPO), benchmarked against epymarl's MADDPG algorithm with grid-search HPO on the simple speaker listener and simple spread environments.
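The multi-agent algorithms train on PettingZoo parallel environments. As a point of reference, here is a minimal sketch of creating one of the benchmarked MPE environments, assuming a recent PettingZoo release installed with the MPE extras; it is illustrative only and not taken from the AgileRL tutorials.

from pettingzoo.mpe import simple_speaker_listener_v4

# Parallel-API version of Simple Speaker Listener, with continuous actions
# as used for MADDPG/MATD3.
env = simple_speaker_listener_v4.parallel_env(continuous_actions=True)
observations, infos = env.reset(seed=42)
print(env.agents)  # e.g. ['speaker_0', 'listener_0']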

Get Started

Install as a package with pip:

pip install agilerl

Or install in development mode:

git clone https://github.com/AgileRL/AgileRL.git && cd AgileRL
pip install -e .

Demo:

cd demos
python demo_online.py

or, from the repository root, to demo distributed training:

accelerate launch --config_file configs/accelerate/accelerate.yaml demos/demo_online_distributed.py

Tutorials

We are in the process of creating tutorials on how to use AgileRL and train agents on a variety of tasks.

Currently, we have tutorials for single-agent tasks that guide you through training both on-policy and off-policy agents to beat a variety of Gymnasium environments. We also have multi-agent tutorials that use PettingZoo environments, such as training DQN to play Connect Four with curriculum learning and self-play, as well as multi-agent tasks in MPE environments. There is also a tutorial on using hierarchical curriculum learning to teach agents skills, and files for a tutorial on training a language model with reinforcement learning using ILQL on Wordle, found in tutorials/Language. If using ILQL on Wordle, download and unzip data.zip here.

Our demo files in demos also provide examples of how to train agents using AgileRL, and more information can be found in our documentation.

Evolvable algorithms implemented (more coming soon!)

  • DQN
  • Rainbow DQN
  • DDPG
  • TD3
  • PPO
  • CQL
  • ILQL
  • MADDPG
  • MATD3
  • NeuralUCB
  • NeuralTS

Train an agent to beat a Gym environment

Before starting training, there are some meta-hyperparameters and settings that must be set. These are defined in INIT_HP, for general parameters; MUTATION_PARAMS, which defines the evolutionary probabilities; and NET_CONFIG, which defines the network architecture. For example:

INIT_HP = {
    'ENV_NAME': 'LunarLander-v2',   # Gym environment name
    'ALGO': 'DQN',                  # Algorithm
    'DOUBLE': True,                 # Use double Q-learning
    'CHANNELS_LAST': False,         # Swap image channels dimension from last to first [H, W, C] -> [C, H, W]
    'BATCH_SIZE': 256,              # Batch size
    'LR': 1e-3,                     # Learning rate
    'EPISODES': 2000,               # Max no. episodes
    'TARGET_SCORE': 200.,           # Early training stop at avg score of last 100 episodes
    'GAMMA': 0.99,                  # Discount factor
    'MEMORY_SIZE': 10000,           # Max memory buffer size
    'LEARN_STEP': 1,                # Learning frequency
    'TAU': 1e-3,                    # For soft update of target parameters
    'TOURN_SIZE': 2,                # Tournament size
    'ELITISM': True,                # Elitism in tournament selection
    'POP_SIZE': 6,                  # Population size
    'EVO_EPOCHS': 20,               # Evolution frequency
    'POLICY_FREQ': 2,               # Policy network update frequency
    'WANDB': True                   # Log with Weights and Biases
}
MUTATION_PARAMS = {
    # Relative probabilities
    'NO_MUT': 0.4,                              # No mutation
    'ARCH_MUT': 0.2,                            # Architecture mutation
    'NEW_LAYER': 0.2,                           # New layer mutation
    'PARAMS_MUT': 0.2,                          # Network parameters mutation
    'ACT_MUT': 0,                               # Activation layer mutation
    'RL_HP_MUT': 0.2,                           # Learning HP mutation
    'RL_HP_SELECTION': ['lr', 'batch_size'],    # Learning HPs to choose from
    'MUT_SD': 0.1,                              # Mutation strength
    'RAND_SEED': 1,                             # Random seed
}
NET_CONFIG = {
    'arch': 'mlp',      # Network architecture
    'hidden_size': [32, 32], # Actor hidden size
}

First, use utils.utils.initialPopulation to create a list of agents - our population that will evolve and mutate to the optimal hyperparameters.

from agilerl.utils.utils import makeVectEnvs, initialPopulation
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

env = makeVectEnvs(env_name=INIT_HP['ENV_NAME'], num_envs=16)
try:
    state_dim = env.single_observation_space.n          # Discrete observation space
    one_hot = True                                      # Requires one-hot encoding
except Exception:
    state_dim = env.single_observation_space.shape      # Continuous observation space
    one_hot = False                                     # Does not require one-hot encoding
try:
    action_dim = env.single_action_space.n             # Discrete action space
except Exception:
    action_dim = env.single_action_space.shape[0]      # Continuous action space

if INIT_HP['CHANNELS_LAST']:
    state_dim = (state_dim[2], state_dim[0], state_dim[1])

agent_pop = initialPopulation(algo=INIT_HP['ALGO'],                 # Algorithm
                              state_dim=state_dim,                  # State dimension
                              action_dim=action_dim,                # Action dimension
                              one_hot=one_hot,                      # One-hot encoding
                              net_config=NET_CONFIG,                # Network configuration
                              INIT_HP=INIT_HP,                      # Initial hyperparameters
                              population_size=INIT_HP['POP_SIZE'],  # Population size
                              device=device)

Next, create the tournament, mutations and experience replay buffer objects that allow agents to share memory and efficiently perform evolutionary HPO.

from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.hpo.tournament import TournamentSelection
from agilerl.hpo.mutation import Mutations

field_names = ["state", "action", "reward", "next_state", "done"]
memory = ReplayBuffer(action_dim=action_dim,                # Number of agent actions
                      memory_size=INIT_HP['MEMORY_SIZE'],   # Max replay buffer size
                      field_names=field_names,              # Field names to store in memory
                      device=device)

tournament = TournamentSelection(tournament_size=INIT_HP['TOURN_SIZE'], # Tournament selection size
                                 elitism=INIT_HP['ELITISM'],            # Elitism in tournament selection
                                 population_size=INIT_HP['POP_SIZE'],   # Population size
                                 evo_step=INIT_HP['EVO_EPOCHS'])        # Evaluate using last N fitness scores

mutations = Mutations(algo=INIT_HP['ALGO'],                                 # Algorithm
                      no_mutation=MUTATION_PARAMS['NO_MUT'],                # No mutation
                      architecture=MUTATION_PARAMS['ARCH_MUT'],             # Architecture mutation
                      new_layer_prob=MUTATION_PARAMS['NEW_LAYER'],          # New layer mutation
                      parameters=MUTATION_PARAMS['PARAMS_MUT'],             # Network parameters mutation
                      activation=MUTATION_PARAMS['ACT_MUT'],                # Activation layer mutation
                      rl_hp=MUTATION_PARAMS['RL_HP_MUT'],                   # Learning HP mutation
                      rl_hp_selection=MUTATION_PARAMS['RL_HP_SELECTION'],   # Learning HPs to choose from
                      mutation_sd=MUTATION_PARAMS['MUT_SD'],                # Mutation strength
                      arch=NET_CONFIG['arch'],                              # Network architecture
                      rand_seed=MUTATION_PARAMS['RAND_SEED'],               # Random seed
                      device=device)

The easiest way to implement a training loop is to use our train_off_policy() function. It requires the agent to have getAction() and learn() functions.

from agilerl.training.train_off_policy import train_off_policy

trained_pop, pop_fitnesses = train_off_policy(env=env,                                 # Gym-style environment
                                   env_name=INIT_HP['ENV_NAME'],            # Environment name
                                   algo=INIT_HP['ALGO'],                    # Algorithm
                                   pop=agent_pop,                           # Population of agents
                                   memory=memory,                           # Replay buffer
                                   swap_channels=INIT_HP['CHANNELS_LAST'],  # Swap image channel from last to first
                                   n_episodes=INIT_HP['EPISODES'],          # Max number of training episodes
                                   evo_epochs=INIT_HP['EVO_EPOCHS'],        # Evolution frequency
                                   evo_loop=1,                              # Number of evaluation episodes per agent
                                   target=INIT_HP['TARGET_SCORE'],          # Target score for early stopping
                                   tournament=tournament,                   # Tournament selection object
                                   mutation=mutations,                      # Mutations object
                                   wb=INIT_HP['WANDB'])                     # Weights and Biases tracking
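After training, the returned population can be inspected and the fittest agent kept for inference. Below is a minimal sketch, assuming each agent tracks its evaluation scores in a fitness list and exposes a saveCheckpoint() method (as referenced in the issues further down); attribute and method names should be checked against your installed version.

import numpy as np

# Pick the agent with the highest final fitness from the evolved population.
# NOTE: 'fitness' and 'saveCheckpoint' are assumptions, not verified API.
elite_idx = int(np.argmax([agent.fitness[-1] for agent in trained_pop]))
elite = trained_pop[elite_idx]
elite.saveCheckpoint('dqn_lunarlander_elite.pt')  # hypothetical output path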

Citing AgileRL

If you use AgileRL in your work, please cite the repository:

@software{Ustaran-Anderegg_AgileRL,
author = {Ustaran-Anderegg, Nicholas and Pratt, Michael},
license = {Apache-2.0},
title = {{AgileRL}},
url = {https://github.com/AgileRL/AgileRL}
}

agilerl's People

Contributors

balisujohn, dependabot[bot], erickrf, ewouth, gonultasbu, mikepratt1, mp4217, nargizsentience, nicku-a, pre-commit-ci[bot], seandasheep, shreyansjainn


agilerl's Issues

DQN: Action mask is not compatible in vectorized environments

What version of AgileRL are you using?
v0.1.19

What operating system and processor architecture are you using?
Windows, 64-bit operating system, x64-based processor

What did you do?
I attempted to add vectorization to the self-play script to train a DQN agent in a PettingZoo AEC env. However, it seems that DQN's getAction assumes a single action mask is used for all environments. This results in a mismatch between the shapes of the mask and the data fed into np.ma.array.

Steps to reproduce the behaviour:

  1. Run the following reproduction script:
import numpy as np
from agilerl.algorithms.dqn import DQN

state_dim = [4]
action_dim = 2
one_hot = True

dqn = DQN(state_dim, action_dim, one_hot)
state = np.array([[1], [1]])
action_mask = np.array([[0, 1], [1, 0]])
epsilon = 1
action = dqn.getAction(state, epsilon, action_mask)
print(action)
  2. See the error:
  File "C:\....\AgileRL\.venv\lib\site-packages\numpy\ma\core.py", line 2900, in __new__
    raise MaskError(msg % (nd, nm))
numpy.ma.core.MaskError: Mask and data not compatible: data size is 2, mask size is 4.

What did you expect to see?
A list of actions [1, 0]. Each action corresponds to a respective action mask and state.

What did you see instead? Describe the bug.
numpy.ma.core.MaskError: Mask and data not compatible: data size is 2, mask size is 4.

Additional context
The current getAction() seems to assume that action_mask is a 1D array whose size corresponds to action_dim. It then samples n actions, where n is the number of observations (state.size()[0]). However, when action_mask is not a 1D array, the mask does not have the same shape as np.arange(0, self.action_dim).
I fixed this issue locally by modifying the getAction().

  1. Expand the dimensions if action_mask.ndim == 1.
  2. Then randomly sample one action for each action mask (see the sketch below).
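A minimal sketch of the local fix described above, assuming a (num_envs, action_dim) mask with 1 marking valid actions; the function name is illustrative, not the actual getAction() implementation.

import numpy as np

def masked_random_actions(action_mask: np.ndarray, action_dim: int) -> np.ndarray:
    """Sample one valid action per environment from a per-environment action mask."""
    # Promote a single shared 1D mask to a batch containing one mask.
    if action_mask.ndim == 1:
        action_mask = np.expand_dims(action_mask, axis=0)
    # For each row, sample uniformly among the actions whose mask entry is 1.
    return np.array(
        [np.random.choice(np.arange(action_dim)[mask.astype(bool)]) for mask in action_mask]
    )

# masked_random_actions(np.array([[0, 1], [1, 0]]), 2)  ->  array([1, 0])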

Missing critic for PPO in mutation

Please kindly forgive my immature bug report if I missed some detailed implementation :D

What version of AgileRL are you using?
Since there seems to be no __version__ attribute included, I used the pip cmd-line tool instead

$ pip show agilerl
xxx
Version: 0.1.18
xxx

What operating system and processor architecture are you using?

$ uname -a
Darwin xxx 21.2.0 Darwin Kernel Version 21.2.0: Sun Nov 28 20:29:10 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T8101 arm64

What did you do?
Basically, I am following the training procedure of the PPO tutorial shown here (https://docs.agilerl.com/en/latest/online_training/index.html).

What did you see instead? Describe the bug.
I noticed that when I printed out the optimizer of the eventual best agent after HPO, I saw something like the following:

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.0012
    maximize: False
    weight_decay: 0
)

As you can see, there is only one group of parameters left. However, when a new PPO agent is initialized via loadCheckpoint() (or just load() in the latest version), the .optimizer is associated with two groups of parameters, so the loading process does not complete successfully.

After checking your codebase, I think it might be caused by the following. A PPO agent does initialize two groups of parameters to optimize when it is instantiated, i.e. in __init__(): one for the actor and one for the critic. However, when mutation happens, as implemented in AgileRL/agilerl/hpo/mutation.py:reinit_opt() in line 390, PPO is the only algorithm excluded from critic re-initialization. Consequently, offspring PPO agents only inherit the actor network parameters from their parent PPO agents.

Then the tricky story will go like:

  1. If one has trained a PPO agent WITHOUT HPO, everything will be alright. And the pre-trained agent can be reused by the loading method.
  2. If one has trained a PPO agent WITH HPO, then the resulted agent will lost her parameters for the critic network.

(For the second case, which I encountered, looks like you haven't included such cases in the unittest.)
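A quick way to reproduce the symptom described above is to inspect the optimizer's parameter groups after HPO; this is an illustrative check with a hypothetical variable name, not code from the library.

# After evolutionary HPO, the elite PPO agent's optimizer should hold two
# parameter groups (actor + critic), but per this report only one remains.
print(len(best_agent.optimizer.param_groups))  # expected: 2, observed: 1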

Add support for more complicated observation/action spaces

Is your feature request related to a problem? Please describe.
My RL simulator has an action space that is Tuple(Discrete, Discrete, Box) that isn't nicely supported by the framework.

In particular, multiple discrete choices or discrete + box spaces are hard to implement with out-of-framework flattening.

Describe the solution you'd like
Ray's RLlib has a preprocessor that flattens more complicated spaces inside the framework in some way.

Describe alternatives you've considered
I considered flattening the action space before providing it, but there's not really a good discrete + discrete or discrete + box flattening I can do in my own code.
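For reference, Gymnasium ships flattening utilities that perform this kind of out-of-framework preprocessing. The sketch below only illustrates the composite space in question and how Gymnasium flattens it; it is not a proposed AgileRL API.

from gymnasium.spaces import Tuple, Discrete, Box
from gymnasium.spaces.utils import flatten_space, flatten

# The composite action space described in this request.
action_space = Tuple((Discrete(3), Discrete(4), Box(low=-1.0, high=1.0, shape=(2,))))

# Gymnasium can flatten it to a single Box (Discrete parts become one-hot),
# but, as noted above, this is awkward to undo for sampled actions.
flat_space = flatten_space(action_space)                    # Box with 3 + 4 + 2 = 9 dims
flat_sample = flatten(action_space, action_space.sample())
print(flat_space, flat_sample.shape)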

Typo on documents resulted in ModuleNotFound error

What version of AgileRL are you using? stable version on pip
What operating system and processor architecture are you using? linux x86_64

What did you do?
Steps to reproduce the behaviour:

  1. Go to MADDPG on Readme.md
  2. Copy and run code
  3. ModuleNotFoundError: No module named 'agilerl.comp'

What did you see instead? Describe the bug.
It should be agilerl.components; the README says agilerl.comp instead.

random action generation does not respect action limits

It looks like the line I am referencing generates random actions in the [0, 1) interval; however, action limits may be custom-defined by the algorithm parameters min_action and max_action. I have not been able to locate a part where the range is normalized to [min_action, max_action), and the function has consistently returned values in [0, 1) for epsilon=1.0 (purely uniform random sampling).

np.random.rand(state.size()[0], self.action_dims[idx])

I am only referencing MADDPG, but I believe MATD3 has the same characteristics.

EDIT: The generated values are clipped, which works fine for tighter constraints, but a larger action range such as [-1.0, 1.0] does not work as expected.
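A minimal sketch of the expected behaviour, rescaling uniform samples into the configured action range rather than leaving them in [0, 1); this is illustrative and not the library's implementation.

import numpy as np

def random_actions(n_envs: int, action_dim: int, min_action: float, max_action: float) -> np.ndarray:
    """Sample uniform random actions in [min_action, max_action) instead of [0, 1)."""
    return min_action + (max_action - min_action) * np.random.rand(n_envs, action_dim)

# random_actions(4, 2, -1.0, 1.0) -> values spanning [-1.0, 1.0), unlike np.random.rand alone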

Various demo issues

What version of AgileRL are you using?
Github HEAD
What operating system and processor architecture are you using?
Ubuntu x64

I attempted to follow the instructions to clone the repo and run the demo, and ran into various stumbling blocks:

a) The dependency on gymnasium does not specify the [box2d] extra deps in requirements.txt, causing failures (see the workaround sketch below)
b) The gymnasium box2d deps require swig to be installed in order to build
c) The demo code doesn't run on CPU, since it hardcodes cuda
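A likely workaround for (a) and (b), assuming a standard pip-based setup; these exact commands are not taken from the repository's docs:

pip install swig
pip install "gymnasium[box2d]"

(For (c), selecting the device with torch.cuda.is_available(), as in the README example above, avoids the hardcoded cuda.)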

After the demo completes, it prints some exception:

Exception ignored in: <function AsyncVectorEnv.__del__ at 0x7ffb1bc3edd0>
Traceback (most recent call last):
  File "/home/alex/.local/share/virtualenvs/AgileRL-KKHwKg5J/lib/python3.10/site-packages/gymnasium/vector/async_vector_env.py", line 548, in __del__
  File "/home/alex/.local/share/virtualenvs/AgileRL-KKHwKg5J/lib/python3.10/site-packages/gymnasium/vector/vector_env.py", line 271, in close
  File "/home/alex/.local/share/virtualenvs/AgileRL-KKHwKg5J/lib/python3.10/site-packages/gymnasium/vector/async_vector_env.py", line 464, in close_extras
AttributeError: 'NoneType' object has no attribute 'TimeoutError'

I also found it odd that the demo doesn't match the instructions in the README for the standard agilerl.training.train loop.

Minari Support for loading datasets for AgileRL

I'm creating this issue to gauge interest in adding support for the offline RL dataset library Minari to AgileRL. Minari aims to solve the issues D4RL had with dataset standardization, and to provide a stable and maintained offline RL dataset library with curated first-party environments. From looking through your codebase, it seems like you use hdf5-formatted datasets. Since Minari is a successor to D4RL and uses hdf5-formatted datasets with a very similar format, we think it won't be too hard for us to integrate Minari support into AgileRL.

If this sounds interesting, please let me know, and someone from the Minari team can create a pull request adding experimental Minari support to AgileRL.

SAC Implementation

This should implement a working version of SAC that complies with the format required for the repository. Preferably, this will incorporate various unit tests of the algorithm.

Load a trained model without instantiating an agent

Is your feature request related to a problem? Please describe.
I notice that currently, if we need to load a trained model, we first need to instantiate an agent. Take PPO as an example: we need to precisely specify parameters like state_dim, action_dim, clip_coef, etc., even if we already have a trained model at hand. In fact, for someone reusing a trained model, I personally feel a good convention is to make the process hyperparameter-agnostic.

Describe the solution you'd like
A possibly good way is to follow the implementation of stable-baselines, simply {some_rl_algo}.load({some_trained_model}). Probably an easy way to do that is to implement loadCheckpoint() wrapped with @classmethod. It might also be more efficient to code a base class for all the algorithms, so that implementing loadCheckpoint() in the base class would suffice.
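A minimal sketch of the suggested classmethod-style loader, assuming a checkpoint dict that stores the constructor arguments alongside the network weights; the class, keys and attribute names here are hypothetical, not AgileRL's actual API.

import torch

class RLAlgorithm:
    """Hypothetical base class shared by all algorithms."""

    @classmethod
    def load(cls, path: str, device: str = "cpu"):
        # Rebuild the agent from hyperparameters stored inside the checkpoint,
        # so the caller never has to re-specify state_dim, action_dim, etc.
        checkpoint = torch.load(path, map_location=device)
        agent = cls(**checkpoint["init_hyperparams"])
        agent.actor.load_state_dict(checkpoint["actor_state_dict"])
        return agent

# Usage (hypothetical): agent = PPO.load("ppo_checkpoint.pt")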


Create CONTRIBUTING.md

To better support open-source contributions, it would be beneficial to have contributing guidelines.

Tests

It would be good to offer tests to enable users to verify the correctness and/or soundness of the implementations. These tests could be combined with other unit/style tests in GitHub Actions to better support community PRs.
