
trpo-in-marl's Introduction

Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

Described in the paper "Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning", this repository develops the Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms on the SMAC and Multi-Agent MuJoCo benchmarks. HATRPO and HAPPO are the first trust-region methods for multi-agent reinforcement learning with a theoretically justified monotonic improvement guarantee. Performance-wise, they set a new state of the art against rivals such as IPPO, MAPPO and MADDPG. HAPPO and HATRPO have since been integrated into the HARL framework; please check there for the latest changes.

Installation

Create environment

conda create -n env_name python=3.9
conda activate env_name
pip install -r requirements.txt
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

Multi-agent MuJoCo

Follow the instructions at https://github.com/openai/mujoco-py and https://github.com/schroederdewitt/multiagent_mujoco to set up a MuJoCo environment. Finally, remember to set the following environment variables:

LD_LIBRARY_PATH=${HOME}/.mujoco/mujoco200/bin;
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libGLEW.so

StarCraft II & SMAC

Run the script

bash install_sc2.sh

Alternatively, you can install them manually to any path you like by following https://github.com/oxwhirl/smac.

How to run

Once your environment is ready, you can run the provided shell scripts. For example:

cd scripts
./train_mujoco.sh  # run with HAPPO/HATRPO on Multi-agent MuJoCo
./train_smac.sh  # run with HAPPO/HATRPO on StarCraft II

If you would like to change the experiment configuration, modify the .sh files or see the config files for more details. You can switch the algorithm by changing algo=happo to algo=hatrpo.

Some experiment results

SMAC

Multi-agent MuJoCo (compared with MAPPO)

Additional Experiment Setting

For SMAC

2022/4/24 update: important correction for SMAC

The gamma parameter has been fixed; the correct configuration is:
gamma = 0.95 for 3s5z and 2c_vs_64zg
gamma = 0.99 for corridor

trpo-in-marl's People

Contributors

cyanrain7, znowu

trpo-in-marl's Issues

How do you use global information and local information in multi-agent mujoco?

I notice that in your multi-agent mujoco environment codes,

def get_obs(self):
    """ Returns all agent observat3ions in a list """
    state = self.env._get_obs()
    obs_n = []
    for a in range(self.n_agents):
        agent_id_feats = np.zeros(self.n_agents, dtype=np.float32)
        agent_id_feats[a] = 1.0
        # obs_n.append(self.get_obs_agent(a))
        # obs_n.append(np.concatenate([state, self.get_obs_agent(a), agent_id_feats]))
        # obs_n.append(np.concatenate([self.get_obs_agent(a), agent_id_feats]))
        obs_i = np.concatenate([state, agent_id_feats])
        obs_i = (obs_i - np.mean(obs_i)) / np.std(obs_i)
        obs_n.append(obs_i)
    return obs_n

def get_state(self, team=None):
    # TODO: May want global states for different teams (so cannot see what the other team is communicating e.g.)
    state = self.env._get_obs()
    share_obs = []
    for a in range(self.n_agents):
        agent_id_feats = np.zeros(self.n_agents, dtype=np.float32)
        agent_id_feats[a] = 1.0
        # share_obs.append(np.concatenate([state, self.get_obs_agent(a), agent_id_feats]))
        state_i = np.concatenate([state, agent_id_feats])
        state_i = (state_i - np.mean(state_i)) / np.std(state_i)
        share_obs.append(state_i)
    return share_obs

They both use self.env._get_obs() and return the same observation, so what is the difference between get_obs() and get_state() in your code, and how do you use global and local information in your algorithm?
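
For readers comparing the two methods: the commented-out lines suggest a per-agent alternative was considered. Below is a minimal sketch of what a strictly local observation could look like, assuming get_obs_agent(a) returns only agent a's own features; this variant is hypothetical and is not the repository's code.

    import numpy as np

    def get_obs_local(self):
        """Hypothetical variant: each agent observes only its own local features plus its one-hot id."""
        obs_n = []
        for a in range(self.n_agents):
            agent_id_feats = np.zeros(self.n_agents, dtype=np.float32)
            agent_id_feats[a] = 1.0
            # local per-agent features instead of the global self.env._get_obs()
            obs_n.append(np.concatenate([self.get_obs_agent(a), agent_id_feats]))
        return obs_n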

Questions about visualization

Hi,
I wonder whether you have tried to visualize the StarCraft game with your trained model. I tried to set the parameter '--user-render', but it didn't work. How should I visualize it?

Looking forward to your reply.

About the number of Critic Networks

This is very helpful work, but I have a question about the code: HAPPO_Policy seems to build a critic network for each agent, while the paper appears to describe a single shared critic network. Does this affect the experimental results?

    self.actor = Actor(args, self.obs_space, self.act_space, self.device)
    self.critic = Critic(args, self.share_obs_space, self.device)

Looking forward to your reply, thank you.
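
For context, here is a minimal sketch of the two designs being compared: one critic per agent versus a single critic shared by all agents. The networks and the helper below are illustrative only, not the repository's Actor/Critic classes.

    import torch.nn as nn

    def make_policies(num_agents, obs_dim, share_obs_dim, act_dim, hidden=64, share_critic=False):
        """Illustrative only: per-agent actors, with either one critic per agent or one shared critic."""
        actors = [nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
                  for _ in range(num_agents)]
        if share_critic:
            critic = nn.Sequential(nn.Linear(share_obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            critics = [critic] * num_agents         # the same module (and parameters) reused by every agent
        else:
            critics = [nn.Sequential(nn.Linear(share_obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
                       for _ in range(num_agents)]  # independent value parameters per agent
        return actors, critics

Since every critic takes the same shared observation as input, the choice mainly affects parameter sharing and sample efficiency rather than what information the value function can see.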

Question about HAPPO performance in StarCraftII

When I ran HAPPO on StarCraft II, I found that its performance on the 3s5z_vs_3s6z map is poor, far from MAPPO's. The parameters I used are:

--n_training_threads 32 --n_rollout_threads 8 --num_mini_batch 1 --episode_length 400
--num_env_steps 10000000 --ppo_epoch 5 --use_value_active_masks --use_eval --eval_episodes 32 --use_recurrent_policy

and the default parameters are used for the rest.

The results are averaged over 5 runs, and the shaded region represents the 95% confidence interval. (Result figure for 3s5z_vs_3s6z omitted.)

We look forward to your reply.

Confused about the results of IPPO and MAPPO.

I notice that in your code the Multi-Agent MuJoCo environment is an MDP setting, so the critic inputs of IPPO and MAPPO are the same. I would expect their performance to be similar, but the results in the figure are not. Are there other factors I am overlooking? I am looking forward to your reply. Thank you!

I have some questions about the adjustment of experiment parameters.

I ran the default experiment "Ant-v2, 2x4" with the default parameters and got the results in the first picture. Later, I modified the parameters (n_rollout_threads: 24, num_mini_batch: 4, ppo_epoch: 40) and got the results in the second picture. (Result figures omitted.)
I have also tried other modifications to the experiment parameters, but I have not been able to reach the performance reported in the paper on the "Ant-v2, 2x4" experiment provided by the code.
So I would like to ask whether there are any guidelines or tips for tuning the parameters of the HAPPO/HATRPO algorithms.

dependency issue

Hi,
I tried to install the dependencies as recommended in requirements.txt, but Python 3.9 is not compatible with TensorFlow 2.0.0 or with several of the other pinned packages. I am not sure the environment described in requirements.txt is correct; will the code run with newer versions of TensorFlow and the other packages? Could you please update the package dependencies?

what to do with a dead agent

Hello, I would like to ask: when an agent dies but the environment does not end, data for the dead agent continues to be collected. How is this data handled during training? How do you deal with the effect of a dead agent's decisions on the updates of subsequent agents, and does it affect the Multi-Agent Advantage Decomposition Lemma?
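
(Not an official answer.) A common pattern in PPO-style multi-agent trainers is to keep collecting padded data for dead agents but to mask it out of the loss with per-step active masks; the --use_value_active_masks flag mentioned elsewhere on this page points in that direction. A minimal sketch of such masking, under that assumption:

    import torch

    def masked_policy_loss(ratio, adv, active_masks, clip=0.2):
        """PPO-clip surrogate averaged only over steps where the agent is alive (active_masks == 1)."""
        surr1 = ratio * adv
        surr2 = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv
        per_step = -torch.min(surr1, surr2)
        return (per_step * active_masks).sum() / active_masks.sum().clamp(min=1.0)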

multi_env_error

When I run train_mujoco.sh, the following error is generated:
    NotImplementedError
    Traceback (most recent call last):
      File "/home/spaci/RL/TRPO-in-MARL-master/scripts/train/train_mujoco.py", line 163, in <module>
        main(sys.argv[1:])
      File "/home/spaci/RL/TRPO-in-MARL-master/scripts/train/train_mujoco.py", line 136, in main
        envs = make_train_env(all_args)
      File "/home/spaci/RL/TRPO-in-MARL-master/scripts/train/train_mujoco.py", line 37, in make_train_env
        return ShareSubprocVecEnv([get_env_fn(i) for i in range(all_args.n_rollout_threads)])
      File "/home/spaci/RL/TRPO-in-MARL-master/scripts/../envs/env_wrappers.py", line 360, in __init__
        self.n_agents = self.remotes[0].recv()
      File "/home/spaci/anaconda3/envs/env_name/lib/python3.9/multiprocessing/connection.py", line 255, in recv
        buf = self._recv_bytes()
      File "/home/spaci/anaconda3/envs/env_name/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
        buf = self._recv(4)
      File "/home/spaci/anaconda3/envs/env_name/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
        chunk = read(handle, remaining)
    ConnectionResetError: [Errno 104] Connection reset by peer

Can anyone help me?
Thanks.

The question about critic loss

HATRPO works well in my environment, but why does the critic loss increase? Additionally, I think kl_threshold is an important parameter. Could you please tell me how to tune it? My parameter settings and experiment results are as follows.
Looking forward to your reply. Thank you.
critic_lr: 5e-3
opti_eps: 1e-5
kl_threshold: 0.0001
gamma: 0.99
use_linear_lr_decay: True
(Result figures omitted.)

The script fails when running the HATRPO algorithm with an RNN network.

Hello, I tried to run your code with the HATRPO algorithm and an RNN network. Specifically, I added "--use_recurrent_policy" to both scripts, train_smac.sh and train_mujoco.sh, and modified algo='hatrpo'. However, both scripts fail and return the error below:

RuntimeError: the derivative for '_cudnn_rnn_backward' is not implemented. Double backwards is not supported for CuDNN RNNs due to limitations in the CuDNN API. To run double backwards, please disable the CuDNN backend temporarily while running the forward pass of your RNN. For example:

    with torch.backends.cudnn.flags(enabled=False):
        output = model(inputs)
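
The error message itself contains the workaround: temporarily disable the cuDNN backend around any RNN forward pass that must support double backwards (for HATRPO, typically the pass used inside the trust-region Hessian-vector products). A self-contained sketch of the pattern; the GRU here is just a stand-in for the recurrent policy, not the repository's network:

    import torch
    import torch.nn as nn

    rnn = nn.GRU(input_size=8, hidden_size=16).cuda()
    x = torch.randn(5, 1, 8, device="cuda")

    # With cuDNN disabled for this block, first-order grads can be built with
    # create_graph=True and differentiated a second time (at some speed cost).
    with torch.backends.cudnn.flags(enabled=False):
        out, _ = rnn(x)
        grads = torch.autograd.grad(out.sum(), list(rnn.parameters()), create_graph=True)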

Some questions about HAPPO implementation

(Screenshot of Equation (10) from the paper omitted.) The above is Equation (10) in the paper, but I can't find it in the current implementation.

In the files happo_trainer.py and separated_buffer.py (code screenshots omitted), I cannot find any preprocessing of the advantages like Equation (10) in your paper.

I would appreciate knowing how the iterative updates of Algorithm 1 are represented in the code.
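
(Not the authors, but for anyone with the same question:) as I read the paper, Equation (10) and Algorithm 1 say that agents are updated one at a time, and that each agent's advantage is multiplied by the product of probability ratios of the agents already updated in this iteration; in MAPPO-style codebases this product is often carried as a per-sample weight alongside the buffer. The sketch below is an illustration of that idea, not the repository's implementation; the agent objects and the ppo_update helper are hypothetical.

    import numpy as np

    def happo_outer_loop(agents, batch, advantages, ppo_update):
        """Illustrative HAPPO iteration: sequential agent updates with a compounding ratio factor."""
        factor = np.ones_like(advantages)                      # weight M starts at 1 for the first agent
        for i in np.random.permutation(len(agents)):           # draw a random agent order, as in Algorithm 1
            old_logp = agents[i].action_log_probs(batch)       # log pi^i(a^i | s) before this agent's update
            ppo_update(agents[i], batch, factor * advantages)  # ordinary PPO-clip step on the weighted advantage
            new_logp = agents[i].action_log_probs(batch)       # log pi_bar^i(a^i | s) after the update
            factor = factor * np.exp(new_logp - old_logp)      # fold this agent's ratio into the next agent's weight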

gym error

When using DummyVecEnv, the env class has no '_get_obs' attribute:
    Traceback (most recent call last):
      File "train/train_mujoco.py", line 163, in <module>
        main(sys.argv[1:])
      File "train/train_mujoco.py", line 136, in main
        envs = make_train_env(all_args)
      File "train/train_mujoco.py", line 35, in make_train_env
        return ShareDummyVecEnv([get_env_fn(0)])
      File "../envs/env_wrappers.py", line 712, in __init__
        self.envs = [fn() for fn in env_fns]
      File "../envs/env_wrappers.py", line 712, in <listcomp>
        self.envs = [fn() for fn in env_fns]
      File "train/train_mujoco.py", line 25, in init_env
        env = MujocoMulti(env_args=env_args)
      File "../envs/ma_mujoco/multiagent_mujoco/mujoco_multi.py", line 104, in __init__
        self.share_obs_size = self.get_state_size()
      File "../envs/ma_mujoco/multiagent_mujoco/mujoco_multi.py", line 204, in get_state_size
        return len(self.get_state()[0])
      File "../envs/ma_mujoco/multiagent_mujoco/mujoco_multi.py", line 191, in get_state
        state = self.env._get_obs()
      File "/home/spaci/anaconda3/envs/test/lib/python3.7/site-packages/gym/core.py", line 228, in __getattr__
        raise AttributeError(f"attempted to get missing private attribute '{name}'")
    AttributeError: attempted to get missing private attribute '_get_obs'
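
(A guess, not a confirmed fix.) The last frame shows gym's Wrapper.__getattr__ refusing to forward attributes that start with an underscore, which newer gym releases do deliberately; in that case the underlying MuJoCo env still exposes _get_obs through env.unwrapped. A minimal illustration of the difference, assuming a standard Gym MuJoCo task such as Ant-v2:

    import gym

    env = gym.make("Ant-v2")           # returns a wrapped env (e.g. TimeLimit around the MuJoCo env)

    # env._get_obs() can raise "attempted to get missing private attribute" on newer gym,
    # because the wrapper refuses to forward private names; the base env still has the method.
    state = env.unwrapped._get_obs()
    print(state.shape)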
