chenglongchen / pytorch-drl


PyTorch implementations of various Deep Reinforcement Learning (DRL) algorithms for both single-agent and multi-agent settings.

License: MIT License

Python 100.00%
pytorch deep-reinforcement-learning multi-agent deep-q-network actor-critic advantage-actor-critic a2c proximal-policy-optimization ppo deep-deterministic-policy-gradient ddpg acktr rl drl madrl dqn reinforcement-learning

pytorch-drl's Introduction

pytorch-madrl

This project includes PyTorch implementations of various Deep Reinforcement Learning algorithms for both single-agent and multi-agent settings.

  • A2C
  • ACKTR
  • DQN
  • DDPG
  • PPO

It is written in a modular way to allow code sharing between different algorithms. Specifically, each algorithm is represented as a learning agent with a unified interface that includes the following components (a minimal sketch of this interface follows the list):

  • interact: interact with the environment to collect experience; both single-step and n-step rollouts are supported (see _take_one_step and _take_n_steps, respectively)
  • train: train on a sampled batch
  • exploration_action: choose an action given the current state, with exploration noise added during training
  • action: choose an action given the current state, for execution
  • value: estimate the value of a state-action pair
  • evaluation: evaluate the learned agent
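
A minimal sketch of this unified interface is shown below; the class name, constructor arguments, and docstrings are illustrative assumptions rather than the repository's exact code.

# Illustrative sketch of the unified agent interface described above; the class
# name, constructor arguments, and docstrings are assumptions, not the repo's exact code.
class Agent:
    def __init__(self, env, memory, actor, critic):
        self.env = env          # gym environment to interact with
        self.memory = memory    # rollout/replay buffer for collected experience
        self.actor = actor      # policy network
        self.critic = critic    # value network

    def interact(self):
        """Collect experience from the environment (one step or n steps)."""
        raise NotImplementedError

    def train(self):
        """Update the networks on a sampled batch of experience."""
        raise NotImplementedError

    def exploration_action(self, state):
        """Action for training: policy output with exploration noise added."""
        raise NotImplementedError

    def action(self, state):
        """Action for execution, without exploration noise."""
        raise NotImplementedError

    def value(self, state, action):
        """Estimated value of a state-action pair."""
        raise NotImplementedError

    def evaluation(self, env, n_episodes=10):
        """Run the learned agent for a few episodes and report the rewards."""
        raise NotImplementedError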

Requirements

  • gym
  • python 3.6
  • pytorch

Usage

To train a model:

$ python run_a2c.py
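
The run script wires an agent into a loop of interaction, training, and periodic evaluation. The following self-contained toy version of that loop uses a random policy on CartPole-v0 and the pre-0.26 gym API; it is not the repository's A2C agent, only an illustration of the interact/train/evaluation pattern.

# Toy version of the interact/train/evaluation loop with a random policy;
# this is an illustration of the training pattern, not the repository's agent.
import gym

class RandomAgent:
    def __init__(self, env):
        self.env = env
        self.state = env.reset()

    def interact(self):
        # Take one environment step with a random action.
        action = self.env.action_space.sample()
        self.state, reward, done, _ = self.env.step(action)
        if done:
            self.state = self.env.reset()

    def train(self):
        pass  # a real agent would update its actor/critic networks here

    def evaluation(self, env, n_episodes=5):
        # Average undiscounted return over a few episodes.
        returns = []
        for _ in range(n_episodes):
            state, done, total = env.reset(), False, 0.0
            while not done:
                state, reward, done, _ = env.step(env.action_space.sample())
                total += reward
            returns.append(total)
        return sum(returns) / len(returns)

env = gym.make("CartPole-v0")
agent = RandomAgent(env)
for step in range(1000):
    agent.interact()
    agent.train()
print("average return:", agent.evaluation(env))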

Results

Reinforcement learning results are notoriously difficult to reproduce. Depending on your settings (e.g., random seed and hyperparameters), you may get results that differ from those below.

Training reward curves (figures in the original README) are provided for the following algorithm/environment pairs:

  • A2C on CartPole-v0
  • ACKTR on CartPole-v0
  • DDPG on Pendulum-v0
  • DQN on CartPole-v0
  • PPO on CartPole-v0

TODO

  • TRPO
  • LOLA
  • Parameter noise

Acknowledgments

This project draws inspiration from several other open-source DRL projects.

License

MIT


pytorch-drl's Issues

License?

Hi,

What is the license?

Hugh

The Actor Critic Structure in MAA2C

I am a little confused by your implementation of MAA2C. I don't think the input of the actor network should simply be the "joint state" of the agents. According to [1], the critic's input should be the state of the environment (where the agents' joint state is not necessarily defined) plus the joint action of the agents, i.e., the critic should be a Q-function over joint actions. The actor, meanwhile, should be something like a per-agent policy, so I do not quite understand why the actor network is implemented this way. I would appreciate an explanation.
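
For reference, a minimal sketch of the centralized-critic / per-agent-actor structure described in this issue; layer sizes, class names, and dimensions are illustrative, not the repository's code.

# Sketch: the critic maps (joint state, joint action) to a scalar Q-value,
# while each actor maps its own observation to an action distribution.
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    def __init__(self, joint_state_dim, joint_action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_state_dim + joint_action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # Q(s, a_1, ..., a_N)
        )

    def forward(self, joint_state, joint_action):
        return self.net(torch.cat([joint_state, joint_action], dim=-1))

class Actor(nn.Module):
    def __init__(self, obs_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),  # logits over the agent's own actions
        )

    def forward(self, obs):
        return torch.softmax(self.net(obs), dim=-1)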

About the computation of Advantage and State Value in PPO

In your implementation of the critic, you feed the network both the observation and the action, and it outputs a 1-dimensional value. Can I infer that this is Q(s, a)?
But the advantage you compute is

values = self.critic_target(states_var, actions_var).detach()
advantages = rewards_var - values

which is an estimate of q_t minus Q(s_t, a_t). I think it should be Advantage = q_t - V(s_t).
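
For comparison, the state-value formulation suggested in the issue would look roughly like this; the variable names follow the snippet above, and the critic is assumed to take the state only (V(s), not Q(s, a)):

values = critic(states_var).detach()   # V(s_t), critic conditioned on state only
advantages = rewards_var - values      # A_t = q_t - V(s_t)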
