facebookresearch / pearl
pearl's Introduction


Pearl - A Production-ready Reinforcement Learning AI Agent Library

Proudly brought by Applied Reinforcement Learning @ Meta


More details about the library are available at our official website.

The Pearl paper is available on arXiv.

Our NeurIPS 2023 presentation slides are available here.

Overview

Pearl is a new production-ready Reinforcement Learning AI agent library open-sourced by the Applied Reinforcement Learning team at Meta. Furthering our efforts on open AI innovation, Pearl enables researchers and practitioners to develop Reinforcement Learning AI agents. These AI agents prioritize cumulative long-term feedback over immediate feedback and can adapt to environments with limited observability, sparse feedback, and high stochasticity. We hope that Pearl offers the community a means to build state-of-the-art Reinforcement Learning AI agents that can adapt to a wide range of complex production environments.

Getting Started

Installation

To install Pearl, you can simply clone this repository and run pip install -e . (you need pip version ≥ 21.3 and setuptools version ≥ 64):

git clone https://github.com/facebookresearch/Pearl.git
cd Pearl
pip install -e .

Quick Start

To kick off a Pearl agent with a classic reinforcement learning environment, here's a quick example.

from pearl.pearl_agent import PearlAgent
from pearl.action_representation_modules.one_hot_action_representation_module import (
    OneHotActionTensorRepresentationModule,
)
from pearl.policy_learners.sequential_decision_making.deep_q_learning import (
    DeepQLearning,
)
from pearl.replay_buffers.sequential_decision_making.fifo_off_policy_replay_buffer import (
    FIFOOffPolicyReplayBuffer,
)
from pearl.utils.instantiations.environments.gym_environment import GymEnvironment

env = GymEnvironment("CartPole-v1")

num_actions = env.action_space.n
agent = PearlAgent(
    policy_learner=DeepQLearning(
        state_dim=env.observation_space.shape[0],
        action_space=env.action_space,
        hidden_dims=[64, 64],
        training_rounds=20,
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=num_actions
        ),
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(10_000),
)

observation, action_space = env.reset()
agent.reset(observation, action_space)
done = False
while not done:
    action = agent.act(exploit=False)
    action_result = env.step(action)
    agent.observe(action_result)
    agent.learn()
    done = action_result.done

Users can replace the environment with any real-world problem.
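The same loop generalizes to training over multiple episodes. Below is a minimal sketch that reuses only the classes and calls from the example above; the episode count is an arbitrary choice, and it assumes that action_result exposes a numeric reward field.

# Train over several episodes, tracking the return of each one.
number_of_episodes = 100
for episode in range(number_of_episodes):
    observation, action_space = env.reset()
    agent.reset(observation, action_space)
    done = False
    episode_return = 0.0
    while not done:
        action = agent.act(exploit=False)   # explore while training
        action_result = env.step(action)
        agent.observe(action_result)
        agent.learn()
        episode_return += action_result.reward
        done = action_result.done
    print(f"episode {episode} return: {episode_return}")

After training, the same loop can be run with agent.act(exploit=True) (and without agent.learn()) to evaluate the learned policy greedily.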

Tutorials

We provide a few tutorial Jupyter notebooks (and are currently working on more!):

  1. A single item recommender system. We derived a small contrived recommender system environment using the MIND dataset (Wu et al. 2020).

  2. Contextual bandits. Demonstrates contextual bandit algorithms and their implementation in Pearl, using a contextual bandit environment that serves data from UCI datasets, and evaluates the performance of neural implementations of SquareCB, LinUCB, and LinTS.

  3. Frozen Lake. A simple example showing how to use a one-hot observation wrapper to learn the classic problem with DQN.

  4. Deep Q-Learning (DQN) and Double DQN. Demonstrates how to run DQN and Double DQN on the Cart-Pole environment.

  5. Actor-critic algorithms with safety constraints. Demonstrates how to run actor-critic methods, including a version with safety constraints.

Design and Features

Pearl was built with a modular design so that industry practitioners and academic researchers can select any subset of the features below and flexibly combine them to construct a Pearl agent customized for their specific use cases. Pearl offers a diverse set of features for production environments, including dynamic action spaces, offline learning, intelligent neural exploration, safe decision making, history summarization, and data augmentation.

You can find many Pearl agent candidates with a mix-and-match set of reinforcement learning features in utils/scripts/benchmark_config.py.
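As an illustration of this mix-and-match design, the sketch below combines the DQN policy learner from the Quick Start with the LSTM history summarization module that appears in examples further down this page. It is only a sketch: the dimensions are illustrative, the choice of action_dim equal to the number of actions (matching the one-hot representation) is our assumption, and the key detail, taken from the DDPG test shown later, is that the policy learner's state_dim should equal the summarization module's hidden_dim.

from pearl.pearl_agent import PearlAgent
from pearl.action_representation_modules.one_hot_action_representation_module import (
    OneHotActionTensorRepresentationModule,
)
from pearl.history_summarization_modules.lstm_history_summarization_module import (
    LSTMHistorySummarizationModule,
)
from pearl.policy_learners.sequential_decision_making.deep_q_learning import (
    DeepQLearning,
)
from pearl.replay_buffers.sequential_decision_making.fifo_off_policy_replay_buffer import (
    FIFOOffPolicyReplayBuffer,
)
from pearl.utils.instantiations.environments.gym_environment import GymEnvironment

env = GymEnvironment("CartPole-v1")
num_actions = env.action_space.n
hidden_dim = 128  # illustrative; the LSTM summary becomes the agent's state

agent = PearlAgent(
    policy_learner=DeepQLearning(
        state_dim=hidden_dim,  # consumes the LSTM summary, not the raw observation
        action_space=env.action_space,
        hidden_dims=[64, 64],
        training_rounds=20,
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=num_actions
        ),
    ),
    history_summarization_module=LSTMHistorySummarizationModule(
        observation_dim=env.observation_space.shape[0],
        action_dim=num_actions,  # assumption: one-hot actions of this size
        hidden_dim=hidden_dim,
        num_layers=2,
        history_length=8,
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(10_000),
)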

Adoption in Real-world Applications

Pearl is being adopted in real-world applications, including recommender systems, auction bidding systems, and creative selection. Each of these applications requires a different subset of Pearl's features, drawn from: Policy Learning, Intelligent Exploration, Safety, History Summarization, Replay Buffer, Contextual Bandit, Offline RL, Dynamic Action Space, and Large-scale Neural Networks.

Comparison to Other Libraries

Pearl is compared against ReAgent (which Pearl supersedes), RLlib, Stable-Baselines3 (SB3), Tianshou, and Dopamine along the following dimensions: agent modularity, dynamic action spaces, offline RL, intelligent exploration, contextual bandits, safe decision making, history summarization, and data-augmented replay buffers. Pearl supports all of these, while the other libraries offer at most partial support for some of them (for example, limited intelligent exploration, only linear contextual bandits, or history summarization that requires modifying the environment state).

Cite Us

@article{pearl2023paper,
    title = {Pearl: A Production-ready Reinforcement Learning Agent},
    author = {Zheqing Zhu and Rodrigo de Salvo Braz and Jalaj Bhandari and Daniel Jiang and Yi Wan and Yonathan Efroni and Ruiyang Xu and Liyuan Wang and Hongbo Guo and Alex Nikulkov and Dmytro Korenkevych and Urun Dogan and Frank Cheng and Zheng Wu and Wanqiao Xu},
    journal = {arXiv preprint arXiv:2312.03814},
    year = {2023}
}

License

Pearl is MIT licensed, as found in the LICENSE file.

pearl's People

Contributors

amyreese, billmatrix, connernilsen, cryptexis, eltociear, facebook-github-bot, frankchengmeta, jb3618, jbaron, machina-source, matthiasthomas, rodrigodesalvobraz, yiwan-rl, zpao


pearl's Issues

Pip availability and clear versioning in branches/tags/releases

Hi Pearl team,

thank you for all the work you have done.

The problem I am having is the following:
while Pearl is positioned as something to be used in production, simple procedures like versioning are not taken care of. We have developed a routine in which an agent is trained and deployed every day (in an environment such as AWS SageMaker). This means that when the SageMaker instance spins up, it must install Pearl and all other important dependencies.
The installation procedure described in the documentation is not sufficient for production work, because changes are spontaneously committed to the main branch. Stable versions are needed, where the API, functionality, and behavior are fixed for a given version.

Hope that makes sense to you as well.

PPO + LSTM issues

First of all, thank you for sharing this project in an open source form!

While testing PPO + LSTM, I've identified 2 potential improvements:

  • The LSTM history summarization module requires the next state of the trajectory to be available. OnPolicyEpisodicReplayBuffer, the buffer used in many PPO examples, doesn't compute it by default. This leads to the following exception:
    Error
    Traceback (most recent call last):
      File "/home/antoine/git/Pearl/pearl/test/integration/integration_tests.py", line 389, in test_ppo_lstm
        target_return_is_reached(
      File "/home/antoine/git/Pearl/pearl/utils/functional_utils/train_and_eval/online_learning.py", line 192, in target_return_is_reached
        episode_info, episode_total_steps = run_episode(
      File "/home/antoine/git/Pearl/pearl/utils/functional_utils/train_and_eval/online_learning.py", line 275, in run_episode
        agent.learn()
      File "/home/antoine/git/Pearl/pearl/pearl_agent.py", line 205, in learn
        report = self.policy_learner.learn(self.replay_buffer)
      File "/home/antoine/git/Pearl/pearl/policy_learners/sequential_decision_making/ppo.py", line 154, in learn
        result = super().learn(replay_buffer)
      File "/home/antoine/git/Pearl/pearl/policy_learners/policy_learner.py", line 170, in learn
        batch = self.preprocess_batch(batch)
      File "/home/antoine/git/Pearl/pearl/policy_learners/policy_learner.py", line 188, in preprocess_batch
        batch.next_state = self._history_summarization_module(batch.next_state)
      File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/antoine/git/Pearl/pearl/history_summarization_modules/lstm_history_summarization_module.py", line 100, in forward
        batch_size = x.shape[0]
    AttributeError: 'NoneType' object has no attribute 'shape'
  • The input batch.state is modified in place by the history summarization module in
    batch.state = self._history_summarization_module(batch.state)
    and is then reused in the critic and actor loss computations, leading to the following exception:
    Error
    Traceback (most recent call last):
      File "/home/antoine/git/Pearl/pearl/test/integration/integration_tests.py", line 389, in test_ppo_lstm
        target_return_is_reached(
      File "/home/antoine/git/Pearl/pearl/utils/functional_utils/train_and_eval/online_learning.py", line 192, in target_return_is_reached
        episode_info, episode_total_steps = run_episode(
      File "/home/antoine/git/Pearl/pearl/utils/functional_utils/train_and_eval/online_learning.py", line 275, in run_episode
        agent.learn()
      File "/home/antoine/git/Pearl/pearl/pearl_agent.py", line 205, in learn
        report = self.policy_learner.learn(self.replay_buffer)
      File "/home/antoine/git/Pearl/pearl/policy_learners/sequential_decision_making/ppo.py", line 154, in learn
        result = super().learn(replay_buffer)
      File "/home/antoine/git/Pearl/pearl/policy_learners/policy_learner.py", line 171, in learn
        single_report = self.learn_batch(batch)
      File "/home/antoine/git/Pearl/pearl/policy_learners/sequential_decision_making/actor_critic_base.py", line 232, in learn_batch
        self._actor_learn_batch(batch)  # update actor
      File "/home/antoine/git/Pearl/pearl/policy_learners/sequential_decision_making/ppo.py", line 139, in _actor_learn_batch
        loss.backward()
      File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
        torch.autograd.backward(
      File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

How to infer results from the tutorials?

How do I run inference with the agent trained on the MIND dataset in the provided tutorial? Is there a test set available? Once the agent is trained, agent.act() returns either 0 or 1. What does that signify?

Frozen lake

Hi!
I want to solve the Frozen Lake problem with Pearl, but how can I do that? I do not know how to use Pearl for solving problems. You said Pearl can be used for real-world problems, but you did not provide any example of that.
Please help me!

CUDA out of memory

While using Pearl, VRAM consumption keeps increasing continuously. Is there any way to delete tensors that are no longer needed, or is running out of memory inevitable given my large action space?

Error when running the Single Item Recommender System Notebook

Hi, I'm trying to run the notebook on how to use Pearl for recommender systems, but when I run the online_learning() function I keep getting the same error, which I copy below:

File Pearl/pearl/pearl_agent.py:

    161 if isinstance(safe_action_space, DiscreteActionSpace):
--> 162     self._latest_action = safe_action_space.actions_batch[int(action.item())]
    163 else:
    164     self._latest_action = action

RuntimeError: a Tensor with 100 elements cannot be converted to Scalar

On the other hand, I'm having a hard time understanding how the environment is being built. Could someone please explain further how they are creating the RecEnv object?

Use pearl with custom pytorch model

I used to use PyTorch + Gym for RL.
Is there a way to build a custom neural net in PyTorch and insert it into the Pearl agent without using the pre-made settings for neural nets?

online_learning fails with a custom Gym env that has not been registered

info = online_learning(agent=agent, env=env, number_of_steps=number_of_steps, print_every_x_steps=1000, record_period=record_period, learn_after_episode=True)

causes:

File "/workspace/Pearl/pearl/utils/instantiations/environments/gym_environment.py", line 169, in __str__
return self.env.spec.id
AttributeError: 'NoneType' object has no attribute 'id'

Reason: the gym env's spec is None before registration.

Gymnasium?

Seeing as how Gym is no longer maintained, and has some shortcomings - like the lack of terminated/truncated distinction, will there be support for Gymnasium?

[Question] Offline evals

Given that offline evaluations and model-based RL methods are planned for the next version:

could you share some of the challenges that led to those features being deferred, especially since safe learning is part of what makes Pearl unique?

Would be keen to know more.

Falcon

Hi,

TF-Agents implements Falcon exploration.
Is there any way to get Falcon in Pearl?

thanks !

Frozen-lake tutorial

Hi there!
My question is about the plot shown in the Frozen Lake example.
What does it show?
How can we tell that the problem has been solved?

Will n-D observation states be supported?

From the util function (and testing with ALE/Bowling-v5), it appears that only 1-D (vector) observation shapes are supported. Allowing n-D shapes would be interesting, as it could open up experiments with things like text embeddings while maintaining the given structure.

Is there a plan to add this in?

FIFO On-Policy replay buffers lead to error with PPO

I tried the starter code on CartPole with the provided FIFO on-policy replay buffer and PPO. There is an error regarding the cumulative reward.

With FIFOOnPolicyReplayBuffer and PPO we get:

Traceback (most recent call last):
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading\Modelling\Tests\Pearl\Basic.py", line 25, in <module>
    trainer.train(num_iterations=1000)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading\Modelling\Training\Pearl\Trainer.py", line 46, in train
    info = online_learning(
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\utils\functional_utils\train_and_eval\online_learning.py", line 107, in online_learning
    episode_info, episode_total_steps = run_episode(
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\utils\functional_utils\train_and_eval\online_learning.py", line 275, in run_episode
    agent.learn()
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\pearl_agent.py", line 206, in learn
    report = self.policy_learner.learn(self.replay_buffer)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\sequential_decision_making\ppo.py", line 154, in learn
    result = super().learn(replay_buffer)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\policy_learner.py", line 171, in learn
    single_report = self.learn_batch(batch)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\sequential_decision_making\actor_critic_base.py", line 231, in learn_batch
    self._critic_learn_batch(batch)  # update critic
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\sequential_decision_making\ppo.py", line 145, in _critic_learn_batch
    assert batch.cum_reward is not None
AssertionError

Here is the full example code:

env = GymEnvironment(gym.make("CartPole-v1"))

num_actions = env.action_space.n
agent = PearlAgent(
    policy_learner=ProximalPolicyOptimization(
        state_dim=env.observation_space.shape[0],
        action_space=env.action_space,
        actor_hidden_dims=[64, 64],
        critic_hidden_dims=[64, 64],
        training_rounds=100,
        batch_size=8,
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=num_actions,
        ),
    ),
    replay_buffer=FIFOOnPolicyReplayBuffer(10_000),
)

observation, action_space = env.reset()
agent.reset(observation, action_space)
done = False
while not done:
    action = agent.act(exploit=False)
    action_result = env.step(action)
    agent.observe(action_result)
    agent.learn()
    done = action_result.done

Unable to export loss values from TD3

First, thank you very much for this wonderful package.

Visualization of loss trends gives us an indication of where the training process is heading.

Where TD3 is concerned, I think there is a potential bug at line 113, 'return {}'.

With this, it is very difficult to monitor the loss values on tensorboard.

Creating a project roadmap

Hi there,

First and foremost, thank you very much for your excellent contribution to the RL community! I was wondering if it would be possible to create a simple roadmap or a to-do list as part of this repository? This list could enumerate bugs and features that are planned to be tackled or are already being tackled by your team. While I've noticed comments scattered throughout the GitHub discussions as well as in the paper, it would be immensely helpful to have a single place providing a general overview.

I believe such a roadmap could be invaluable for individuals considering using or relying on this project in their production use cases. Additionally, it would also be beneficial to those interested in contributing to the project itself (i.e., calling dibs on solving a particular thing). I would be happy to come up with an initial template and gather already open issues, in case that would help.

Cheers!

Suggestion: Add linters, formatters, and type checkers to ensure code quality and coding style consistency

5. Make sure your code lints.

Pearl/CONTRIBUTING.md

Lines 35 to 37 in 7cb515a

## Coding Style
* Please follow code style presented in our repo. We will strictly enforcing
code style standards for contributions.

Given the contributing guidelines above, it would be better to add linters/formatters (ruff, black, isort, flake8) and type checkers (mypy) to CI so they are checked automatically.

What is the policy on pull requests?

I am a bit confused. Yesterday I opened #61, which was small enough to be merged without disrupting any other functionality. @rodrigodesalvobraz recognised the issue, yet the PR was closed. I would also point out that the issue still persists on the main branch, meaning the contextual bandits tutorial is going to fail.

I am not taking it personally; I just want to understand whether this is a truly open-source project where people are free to contribute,
OR
whether it is governed by Meta's internal policies and we just have to report issues and wait for the development team to address them.

Link to project site in README broken?

Yay first bug!

I'm interested to find out more about pearl. I landed on the README from LinkedIn and clicked the first link to the "official website", which gets me to a page for which I don't have permissions to view (I think).

Congrats on the launch 🎉

Q-transformer

How can we integrate Q-Transformer into Pearl?
Thank you

Recommender System Example

Hi there.
My question is about the recommender system example you provided.
1. What does it actually do? I really cannot understand it.
2. What exactly are these two files: env_model_state_dict.pt and news_embedding_small.pt?
model.load_state_dict(torch.load("Pearl/pearl/tutorials/single_item_recommender_system_example/env_model_state_dict.pt"))
actions = torch.load("Pearl/pearl/tutorials/single_item_recommender_system_example/news_embedding_small.pt")
Finally, I need some more examples to learn how to use Pearl!

PolicyLearner batch_size versus episode_steps clarification

Hi all,

This library looks great and strikes a good balance between software abstractions and the underlying math. I do, however, have a question regarding batch_size and episode_steps for the policy learner.

I decided to step through the PPO integration test and noticed that the batch_size used in _actor_learn_batch is not actually the batch_size set in the test (64); instead it is 14.

This mismatch in batch_size can be traced to the run_episode function, where episode_steps == 14. I'm wondering if this is intentional, and if so, how we should interpret it.

Thanks again for the library!

Examples

Hi,
Will you be providing any examples for real-world implementations of Pearl?
Thank you for creating this amazing project!

Issues when trying to use torch.jit.script()

Hi,

I was trying to save the model with script = torch.jit.script(agent.policy_learner); however, I got an error stating:

RuntimeError:
Module 'IdentityActionRepresentationModule' has no attribute '_max_number_actions' (This attribute exists on the Python module, but we failed to convert Python type: 'numpy.int64' to a TorchScript type. Only tensors and (possibly nested) tuples of tensors, lists, or dicts are supported as inputs or outputs of traced functions, but instead got value of type int64. Its type was inferred; try adding a type annotation for the attribute.):
File "Pearl\pearl\action_representation_modules\identity_action_representation_module.py", line 31
    @property
    def max_number_actions(self) -> int:
        return self._max_number_actions
        ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

Could you look into this please? Thanks!

LSTM History Summarization Poor Performance

I've been trying to play around with LSTM history summarization and various algorithms on toy environments. I have found that in particular with fully observable environments, the LSTM history summarization module performs poorly. For example, on pendulum, it's gotten to episode 240 with DDPG and the reward is still super negative:

episode 10 return: -1397.8437638282776
episode 20 return: -1456.5716090202332
episode 30 return: -1346.7031090259552
episode 40 return: -1062.9707473516464
episode 50 return: -1553.775797367096
episode 60 return: -1183.1240811608732
episode 70 return: -1204.8374280929565
episode 80 return: -1765.7173137664795
episode 90 return: -1457.595247745514
episode 100 return: -1073.8434460163116
episode 110 return: -1259.576441526413
episode 120 return: -1023.4422654509544
episode 130 return: -1172.4052398204803
episode 140 return: -706.3417955114273
episode 150 return: -1440.3148374557495
episode 160 return: -1305.943922996521
episode 170 return: -932.2708021588624
episode 180 return: -1180.3497375249863
episode 190 return: -1183.8436343669891
episode 200 return: -1431.2645144462585
episode 210 return: -920.1698980480433
episode 220 return: -1273.0213116556406
episode 230 return: -1060.2317422628403
episode 240 return: -1481.3649444580078

For reference, without the LSTM history summarization module and DDPG, we usually get to a moving average of under -250 around episode 100. I extended the DDPG test as follows:

    def test_ddpg_lstm_summarization(self) -> None:
        """
        This test checks whether DDPG will eventually learn on Pendulum-v1.
        If it learns well, the return will converge above -250.
        Due to randomness in the environment, we check the moving average of episode returns.
        """
        env = GymEnvironment("Pendulum-v1")
        agent = PearlAgent(
            policy_learner=DeepDeterministicPolicyGradient(
                state_dim=512,
                action_space=env.action_space,
                actor_hidden_dims=[400, 300],
                critic_hidden_dims=[400, 300],
                critic_learning_rate=1e-2,
                actor_learning_rate=1e-3,
                training_rounds=5,
                actor_soft_update_tau=0.05,
                critic_soft_update_tau=0.05,
                exploration_module=NormalDistributionExploration(
                    mean=0,
                    std_dev=0.2,
                ),
            ),
            history_summarization_module=LSTMHistorySummarizationModule(
                observation_dim=env.observation_space.shape[0],
                action_dim=env.action_space.shape[0],
                hidden_dim=512,
                num_layers=5,
                history_length=200
            ),
            replay_buffer=FIFOOffPolicyReplayBuffer(50000),
        )
        self.assertTrue(
            target_return_is_reached(
                agent=agent,
                env=env,
                target_return=-250,
                max_episodes=1000,
                learn=True,
                learn_after_episode=True,
                exploit=False,
                check_moving_average=True,
            )
        )

Is this expected or did I miss something obvious?

I did modify the policy learner preprocess method to detach the tensor being assigned to batch.state (refer to line 186)

NotImplementedError: The Gym space 'Tuple' is not yet supported in Pearl.

Hi there, I am trying to use DeepQLearning to solve the discrete Blackjack Gymnasium environment (Blackjack-v1), where the Observation Space is a tuple, Tuple(Discrete(32), Discrete(11), Discrete(2)), but when I launch Pearl I obtain the following error:

NotImplementedError: The Gym space 'Tuple' is not yet supported in Pearl.

Is there a plan to address this? Lots of environments have tuple spaces. Alternatively, is there a quick workaround to make Pearl support this environment?

Thanks,
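A possible interim workaround (a sketch from outside the Pearl team): Gymnasium's FlattenObservation wrapper flattens a Tuple of Discrete spaces into a single Box vector by one-hot encoding each component, which fits Pearl's current 1-D observation handling. This assumes GymEnvironment accepts a pre-built environment instance, as it does in other examples on this page.

import gymnasium as gym
from gymnasium.wrappers import FlattenObservation
from pearl.utils.instantiations.environments.gym_environment import GymEnvironment

# Tuple(Discrete(32), Discrete(11), Discrete(2)) becomes a Box of size
# 32 + 11 + 2 = 45 after flattening.
env = GymEnvironment(FlattenObservation(gym.make("Blackjack-v1")))
state_dim = env.observation_space.shape[0]  # 45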

Is there some limitation with the dimensions of actions and observations?

Dear Developers,
I'm getting the following error when running the code below

pearl/neural_networks/common/value_networks.py", line 262, in get_q_values
x = torch.cat([state_batch, action_batch], dim=-1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Tensors must have same number of dimensions: got 4 and 2

Am I doing something stupid, or is there some limitation (for instance, that the action and observation spaces must have the same number of dimensions)?
Best regards, Markus

""" 
copy pasted from 
https://github.com/facebookresearch/Pearl?tab=readme-ov-file#quick-start

with small modifications for training,


"""


from pearl.pearl_agent import PearlAgent
from pearl.action_representation_modules.one_hot_action_representation_module import (
    OneHotActionTensorRepresentationModule,
)
from pearl.policy_learners.sequential_decision_making.deep_q_learning import (
    DeepQLearning,
)
from pearl.replay_buffers.sequential_decision_making.fifo_off_policy_replay_buffer import (
    FIFOOffPolicyReplayBuffer,
)
from pearl.utils.instantiations.environments.gym_environment import GymEnvironment
from pearl.action_representation_modules.identity_action_representation_module import (
    IdentityActionRepresentationModule,
)
from pearl.utils.functional_utils.train_and_eval.online_learning import online_learning

from time import sleep
import gym
from tqdm import tqdm
import torch
import matplotlib.pyplot as plt
import numpy as np

# env = GymEnvironment("highway-v0", render_mode="human")

# env = GymEnvironment("CartPole-v1", render_mode="human")
env = GymEnvironment("CarRacing-v2", render_mode="human", continuous=False)
observation, action_space = env.reset()
print(f"observation")
print(observation)
print(f"action_space")
attributes = dir(action_space)
print(attributes)
print(f"action dim: {action_space.action_dim}")
# print(f"actions: {action_space.actions}")

# sys.exit()

agent = PearlAgent(
    policy_learner=DeepQLearning(
        state_dim=9216,
        action_space=action_space,
        hidden_dims=[64, 64],
        training_rounds=20,
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=5
        ),
    ),
    replay_buffer=FIFOOffPolicyReplayBuffer(10_000),
)

# experiment code
number_of_steps = 10000
record_period = 1000

info = online_learning(
    agent=agent,
    env=env,
    number_of_steps=number_of_steps,
    print_every_x_steps=1000,
    record_period=record_period,
    learn_after_episode=True,
)
torch.save(info["return"], "CarRacing-DQN-return.pt")
plt.plot(record_period * np.arange(len(info["return"])), info["return"], label="DQN")
plt.legend()
plt.show()

save the trained agent for reuse

I would like to know whether there is a way to save a trained agent, as I did not see this in the tutorial, nor did I find any related methods in the agent class.
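One possible interim approach, sketched under the assumption that the policy learner is an ordinary torch.nn.Module (the torch.jit.script traceback reported in another issue on this page suggests it is); this is not an official Pearl API for saving agents.

import torch

# Option 1: save the learner's parameters and reload them into an
# identically-configured agent later.
torch.save(agent.policy_learner.state_dict(), "policy_learner.pt")
# ... after rebuilding the same PearlAgent configuration:
agent.policy_learner.load_state_dict(torch.load("policy_learner.pt"))

# Option 2: pickle the entire agent object (simple, but tied to the exact
# library version that produced it).
torch.save(agent, "pearl_agent.pt")
agent = torch.load("pearl_agent.pt")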

TorchRL Vs Pearl: What's the difference & when to use one?

I mostly use pytorch for my deep learning models, and I recently started learning reinforcement learning.

I came across TorchRL and Pearl. I couldn't find a brief answer on the difference between the two and when to use which one.

Simplify actor-critic with shared layers

The current implementation of ActorCriticBase makes it a bit tricky to have custom actor and critic networks with shared layers, because the instantiation of the networks happens inside the ActorCriticBase class. You can probably work around it, but it is awkward. I'd recommend passing in the actor/critic objects rather than the class types and letting the user initialize them.
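To make the suggestion concrete, here is what passing pre-built networks with a shared torso could look like, sketched in plain PyTorch. The names obs_dim, num_actions, and SomeActorCriticLearner are placeholders; this is not Pearl's current ActorCriticBase API.

import torch.nn as nn

obs_dim, num_actions = 4, 2  # placeholders

# A shared feature extractor ("torso") reused by both heads.
shared_torso = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
actor = nn.Sequential(shared_torso, nn.Linear(128, num_actions), nn.Softmax(dim=-1))
critic = nn.Sequential(shared_torso, nn.Linear(128, 1))

# The proposal: the policy learner would accept these instances directly,
# e.g. SomeActorCriticLearner(actor_network=actor, critic_network=critic),
# instead of instantiating networks internally from class types.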

MultiDiscrete action space not supported

Hi

I am trying to run Pearl with a DQN algorithm on my custom environment, and I get the following error. Is MultiDiscrete not supported at the moment, and is there any workaround? I will paste my training and environment-interaction code below. Please let me know if I am doing anything wrong.

NotImplementedError: The Gym space 'MultiDiscrete' is not yet supported in Pearl.

def train(self, alpha):
    for episode in tqdm(range(self.max_episodes)):
        # print(f"+-------- Episode: {episode} -----------+")
        observation, action_space = self.env.reset()
        self.agent.reset(observation, action_space)
        terminated = False

        while not terminated:
            action = self.agent.act(exploit=False)
            action_alpha_list = [*action, alpha]
            print(action_alpha_list)
            action_result = self.env.step(action_alpha_list)
            self.agent.observe(action_result)
            self.agent.learn()
            terminated = action_result.done

I can also post other parts of my code. (how I initiated the agent and model) if needed. Thanks!
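One possible workaround, sketched outside of Pearl: expose the MultiDiscrete action space as a single Discrete space by enumerating all combinations with a small Gym action wrapper. MyCustomEnv is a placeholder, and this is only practical when the product of the sub-space sizes is small.

import numpy as np
import gym

class MultiDiscreteToDiscrete(gym.ActionWrapper):
    """Present a MultiDiscrete action space as one flat Discrete space."""

    def __init__(self, env):
        super().__init__(env)
        self._nvec = env.action_space.nvec
        self.action_space = gym.spaces.Discrete(int(np.prod(self._nvec)))

    def action(self, action):
        # Map the flat Discrete index back to one sub-action per dimension.
        return np.array(np.unravel_index(int(action), self._nvec))

env = MultiDiscreteToDiscrete(MyCustomEnv())  # MyCustomEnv is a placeholder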

Hidden dim of LSTM history summarization module must be equal to observation dim

For now, it's impossible to configure LSTMHistorySummarizationModule with a hidden_dim other than the observation space dimension. If you try, it leads to the following exception (PPO example):

Error
Traceback (most recent call last):
  File "/home/antoine/git/Pearl/pearl/test/integration/integration_tests.py", line 390, in test_ppo_lstm
    target_return_is_reached(
  File "/home/antoine/git/Pearl/pearl/utils/functional_utils/train_and_eval/online_learning.py", line 192, in target_return_is_reached
    episode_info, episode_total_steps = run_episode(
  File "/home/antoine/git/Pearl/pearl/utils/functional_utils/train_and_eval/online_learning.py", line 250, in run_episode
    action = agent.act(exploit=exploit)
  File "/home/antoine/git/Pearl/pearl/pearl_agent.py", line 154, in act
    action = self.policy_learner.act(
  File "/home/antoine/git/Pearl/pearl/policy_learners/sequential_decision_making/actor_critic_base.py", line 208, in act
    action_probabilities = self._actor.get_policy_distribution(
  File "/home/antoine/git/Pearl/pearl/neural_networks/sequential_decision_making/actor_networks.py", line 135, in get_policy_distribution
    policy_distribution = self.forward(
  File "/home/antoine/git/Pearl/pearl/neural_networks/sequential_decision_making/actor_networks.py", line 119, in forward
    return self._model(x)
  File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/antoine/git/Pearl/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x16 and 4x64)

They could be decoupled by using a linear layer with output dim = observation space dim after the LSTM module.
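Roughly, the suggested decoupling could look like the standalone PyTorch sketch below (not Pearl's actual module): an LSTM with an arbitrary hidden_dim followed by a linear projection whose output dimension matches whatever the downstream actor and critic networks expect.

import torch
import torch.nn as nn

class ProjectedLSTMSummarizer(nn.Module):
    """Sketch: LSTM over the history followed by a linear projection, so that
    hidden_dim is decoupled from the summary dimension fed to the networks."""

    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, num_layers: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, output_dim)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(history)        # (batch, seq_len, hidden_dim)
        return self.proj(out[:, -1, :])    # (batch, output_dim)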
