
trpo-in-marl's Introduction

Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

Described in the paper "Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning", this repository develops the Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms on the SMAC and Multi-Agent MuJoCo benchmarks. HATRPO and HAPPO are the first trust-region methods for multi-agent reinforcement learning with a theoretically justified monotonic improvement guarantee. Performance-wise, they set a new state of the art against rivals such as IPPO, MAPPO and MADDPG. HAPPO and HATRPO have since been integrated into the HARL framework; please check there for the latest changes.

Installation

Create environment

conda create -n env_name python=3.9
conda activate env_name
pip install -r requirements.txt
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

Multi-agent MuJoCo

Follow the instructions at https://github.com/openai/mujoco-py and https://github.com/schroederdewitt/multiagent_mujoco to set up a MuJoCo environment. Finally, remember to set the following environment variables:

LD_LIBRARY_PATH=${HOME}/.mujoco/mujoco200/bin;
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libGLEW.so

StarCraft II & SMAC

Run the script

bash install_sc2.sh

Alternatively, you can install them manually to any path you like by following https://github.com/oxwhirl/smac.

How to run

Once your environment is ready, you can run the provided shell scripts. For example:

cd scripts
./train_mujoco.sh  # run with HAPPO/HATRPO on Multi-agent MuJoCo
./train_smac.sh  # run with HAPPO/HATRPO on StarCraft II

If you would like to change the experiment configuration, modify the .sh files or see the config files for more details. You can switch the algorithm by changing algo=happo to algo=hatrpo.

Some experiment results

SMAC

Multi-agent MuJoCo (compared with MAPPO)

Additional Experiment Setting

For SMAC

2022/4/24 update: important correction for SMAC

The gamma parameter has been fixed; the correct configuration is:
gamma = 0.95 for 3s5z and 2c_vs_64zg
gamma = 0.99 for corridor

trpo-in-marl's People

Contributors

cyanrain7, znowu

trpo-in-marl's Issues

How do you use global information and local information in multi-agent mujoco?

I notice that in your multi-agent mujoco environment codes,

def get_obs(self):
    """ Returns all agent observat3ions in a list """
    state = self.env._get_obs()
    obs_n = []
    for a in range(self.n_agents):
        agent_id_feats = np.zeros(self.n_agents, dtype=np.float32)
        agent_id_feats[a] = 1.0
        # obs_n.append(self.get_obs_agent(a))
        # obs_n.append(np.concatenate([state, self.get_obs_agent(a), agent_id_feats]))
        # obs_n.append(np.concatenate([self.get_obs_agent(a), agent_id_feats]))
        obs_i = np.concatenate([state, agent_id_feats])
        obs_i = (obs_i - np.mean(obs_i)) / np.std(obs_i)
        obs_n.append(obs_i)
    return obs_n

def get_state(self, team=None):
    # TODO: May want global states for different teams (so cannot see what the other team is communicating e.g.)
    state = self.env._get_obs()
    share_obs = []
    for a in range(self.n_agents):
        agent_id_feats = np.zeros(self.n_agents, dtype=np.float32)
        agent_id_feats[a] = 1.0
        # share_obs.append(np.concatenate([state, self.get_obs_agent(a), agent_id_feats]))
        state_i = np.concatenate([state, agent_id_feats])
        state_i = (state_i - np.mean(state_i)) / np.std(state_i)
        share_obs.append(state_i)
    return share_obs

They both use self.env._get_obs() and return the same observation, so what is the difference between get_obs() and get_state() in your code, and how do you use global and local information in your algorithm?
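
For readers comparing the two methods: the commented-out lines suggest a per-agent alternative was considered. Below is a minimal sketch of what a strictly local observation could look like, assuming get_obs_agent(a) returns only agent a's own features; this variant is hypothetical and is not the repository's code.

    import numpy as np

    def get_obs_local(self):
        """Hypothetical variant: each agent observes only its own local features plus its one-hot id."""
        obs_n = []
        for a in range(self.n_agents):
            agent_id_feats = np.zeros(self.n_agents, dtype=np.float32)
            agent_id_feats[a] = 1.0
            # local per-agent features instead of the global self.env._get_obs()
            obs_n.append(np.concatenate([self.get_obs_agent(a), agent_id_feats]))
        return obs_n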

Questions about visualization

Hi,
I wonder whether you have tried to visualize the StarCraft game with your trained model. I tried to set the parameter '--user-render', but it didn't work. How should I visualize it?

Looking forward to your reply.

About the number of Critic Networks

This is very helpful work, but I have a question about the code: HAPPO_Policy seems to build a critic network for each agent, while the paper appears to describe a single shared critic network. Does this affect the experimental results?

    self.actor = Actor(args, self.obs_space, self.act_space, self.device)
    self.critic = Critic(args, self.share_obs_space, self.device)

Looking forward to your reply, thank you.
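
For context, here is a minimal sketch of the two designs being compared: one critic per agent versus a single critic shared by all agents. The networks and the helper below are illustrative only, not the repository's Actor/Critic classes.

    import torch.nn as nn

    def make_policies(num_agents, obs_dim, share_obs_dim, act_dim, hidden=64, share_critic=False):
        """Illustrative only: per-agent actors, with either one critic per agent or one shared critic."""
        actors = [nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim))
                  for _ in range(num_agents)]
        if share_critic:
            critic = nn.Sequential(nn.Linear(share_obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            critics = [critic] * num_agents         # the same module (and parameters) reused by every agent
        else:
            critics = [nn.Sequential(nn.Linear(share_obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
                       for _ in range(num_agents)]  # independent value parameters per agent
        return actors, critics

Since every critic takes the same shared observation as input, the choice mainly affects parameter sharing and sample efficiency rather than what information the value function can see.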

Question about HAPPO performance in StarCraftII

When I ran HAPPO on StarCraft II, I found that its performance on the 3s5z_vs_3s6z map is poor, far from MAPPO's. The parameters I used are:

--n_training_threads 32 --n_rollout_threads 8 --num_mini_batch 1 --episode_length 400
--num_env_steps 10000000 --ppo_epoch 5 --use_value_active_masks --use_eval --eval_episodes 32 --use_recurrent_policy

and the default parameters are used for the rest.

The results are averaged over 5 runs, and the shaded region represents the 95% confidence interval. (Result figure for 3s5z_vs_3s6z omitted.)

We look forward to your reply.

Confused about the results of IPPO and MAPPO.

I notice that in your code the Multi-Agent MuJoCo environment is an MDP setting, so the critic inputs of IPPO and MAPPO are the same. I would expect their performance to be similar, but the results in the figure are not. Are there other factors I am overlooking? I am looking forward to your reply. Thank you!

I have some questions about the adjustment of experiment parameters.

I ran the default experiment "Ant-v2, 2x4" with the default parameters and got the results in the first picture. Later, I modified the parameters (n_rollout_threads: 24, num_mini_batch: 4, ppo_epoch: 40) and got the results in the second picture. (Result figures omitted.)
I have also tried other modifications to the experiment parameters, but I have not been able to reach the performance reported in the paper on the "Ant-v2, 2x4" experiment provided by the code.
So I would like to ask whether there are any guidelines or tips for tuning the parameters of the HAPPO/HATRPO algorithms.

dependency issue

Hi,
I tried to install the dependencies as recommended in requirements.txt, but Python 3.9 is not compatible with TensorFlow 2.0.0 or with several of the other pinned packages. I am not sure the environment described in requirements.txt is correct; will the code run with newer versions of TensorFlow and the other packages? Could you please update the package dependencies?

what to do with a dead agent

Hello, I would like to ask: when an agent dies but the environment does not end, data for the dead agent continues to be collected. How is this data handled during training? How do you deal with the effect of a dead agent's decisions on the updates of subsequent agents, and does it affect the Multi-Agent Advantage Decomposition Lemma?
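
(Not an official answer.) A common pattern in PPO-style multi-agent trainers is to keep collecting padded data for dead agents but to mask it out of the loss with per-step active masks; the --use_value_active_masks flag mentioned elsewhere on this page points in that direction. A minimal sketch of such masking, under that assumption:

    import torch

    def masked_policy_loss(ratio, adv, active_masks, clip=0.2):
        """PPO-clip surrogate averaged only over steps where the agent is alive (active_masks == 1)."""
        surr1 = ratio * adv
        surr2 = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv
        per_step = -torch.min(surr1, surr2)
        return (per_step * active_masks).sum() / active_masks.sum().clamp(min=1.0)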

multi_env_error

When I run train_mujoco.sh, the following error is generated:
    NotImplementedError
    Traceback (most recent call last):
      File "/home/spaci/RL/TRPO-in-MARL-master/scripts/train/train_mujoco.py", line 163, in <module>
        main(sys.argv[1:])
      File "/home/spaci/RL/TRPO-in-MARL-master/scripts/train/train_mujoco.py", line 136, in main
        envs = make_train_env(all_args)
      File "/home/spaci/RL/TRPO-in-MARL-master/scripts/train/train_mujoco.py", line 37, in make_train_env
        return ShareSubprocVecEnv([get_env_fn(i) for i in range(all_args.n_rollout_threads)])
      File "/home/spaci/RL/TRPO-in-MARL-master/scripts/../envs/env_wrappers.py", line 360, in __init__
        self.n_agents = self.remotes[0].recv()
      File "/home/spaci/anaconda3/envs/env_name/lib/python3.9/multiprocessing/connection.py", line 255, in recv
        buf = self._recv_bytes()
      File "/home/spaci/anaconda3/envs/env_name/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
        buf = self._recv(4)
      File "/home/spaci/anaconda3/envs/env_name/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
        chunk = read(handle, remaining)
    ConnectionResetError: [Errno 104] Connection reset by peer

Can anyone help me?
Thanks.

The question about critic loss

HATRPO works well in my environment, but why does the critic loss increase? Additionally, I think kl_threshold is an important parameter. Could you please tell me how to tune it? My parameter settings and experiment results are as follows.
Looking forward to your reply. Thank you.
critic_lr: 5e-3
opti_eps: 1e-5
kl_threshold: 0.0001
gamma: 0.99
use_linear_lr_decay: True
(Result figures omitted.)

The script fails when running the HATRPO algorithm with an RNN network.

Hello, I tried to run your code with the HATRPO algorithm and an RNN network. Specifically, I added "--use_recurrent_policy" to both scripts, train_smac.sh and train_mujoco.sh, and modified algo='hatrpo'. However, both scripts fail and return the error below:

RuntimeError: the derivative for '_cudnn_rnn_backward' is not implemented. Double backwards is not supported for CuDNN RNNs due to limitations in the CuDNN API. To run double backwards, please disable the CuDNN backend temporarily while running the forward pass of your RNN. For example:

    with torch.backends.cudnn.flags(enabled=False):
        output = model(inputs)
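
The error message itself contains the workaround: temporarily disable the cuDNN backend around any RNN forward pass that must support double backwards (for HATRPO, typically the pass used inside the trust-region Hessian-vector products). A self-contained sketch of the pattern; the GRU here is just a stand-in for the recurrent policy, not the repository's network:

    import torch
    import torch.nn as nn

    rnn = nn.GRU(input_size=8, hidden_size=16).cuda()
    x = torch.randn(5, 1, 8, device="cuda")

    # With cuDNN disabled for this block, first-order grads can be built with
    # create_graph=True and differentiated a second time (at some speed cost).
    with torch.backends.cudnn.flags(enabled=False):
        out, _ = rnn(x)
        grads = torch.autograd.grad(out.sum(), list(rnn.parameters()), create_graph=True)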

Some questions about HAPPO implementation

(Screenshot of Equation (10) from the paper omitted.) The above is Equation (10) in the paper, but I can't find it in the current implementation.

In the files happo_trainer.py and separated_buffer.py (code screenshots omitted), I cannot find any preprocessing of the advantages like Equation (10) in your paper.

I would appreciate knowing how the iterative updates of Algorithm 1 are represented in the code.
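
(Not the authors, but for anyone with the same question:) as I read the paper, Equation (10) and Algorithm 1 say that agents are updated one at a time, and that each agent's advantage is multiplied by the product of probability ratios of the agents already updated in this iteration; in MAPPO-style codebases this product is often carried as a per-sample weight alongside the buffer. The sketch below is an illustration of that idea, not the repository's implementation; the agent objects and the ppo_update helper are hypothetical.

    import numpy as np

    def happo_outer_loop(agents, batch, advantages, ppo_update):
        """Illustrative HAPPO iteration: sequential agent updates with a compounding ratio factor."""
        factor = np.ones_like(advantages)                      # weight M starts at 1 for the first agent
        for i in np.random.permutation(len(agents)):           # draw a random agent order, as in Algorithm 1
            old_logp = agents[i].action_log_probs(batch)       # log pi^i(a^i | s) before this agent's update
            ppo_update(agents[i], batch, factor * advantages)  # ordinary PPO-clip step on the weighted advantage
            new_logp = agents[i].action_log_probs(batch)       # log pi_bar^i(a^i | s) after the update
            factor = factor * np.exp(new_logp - old_logp)      # fold this agent's ratio into the next agent's weight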

gym error

When using DummyVecEnv, the env class has no '_get_obs' attribute:
    Traceback (most recent call last):
      File "train/train_mujoco.py", line 163, in <module>
        main(sys.argv[1:])
      File "train/train_mujoco.py", line 136, in main
        envs = make_train_env(all_args)
      File "train/train_mujoco.py", line 35, in make_train_env
        return ShareDummyVecEnv([get_env_fn(0)])
      File "../envs/env_wrappers.py", line 712, in __init__
        self.envs = [fn() for fn in env_fns]
      File "../envs/env_wrappers.py", line 712, in <listcomp>
        self.envs = [fn() for fn in env_fns]
      File "train/train_mujoco.py", line 25, in init_env
        env = MujocoMulti(env_args=env_args)
      File "../envs/ma_mujoco/multiagent_mujoco/mujoco_multi.py", line 104, in __init__
        self.share_obs_size = self.get_state_size()
      File "../envs/ma_mujoco/multiagent_mujoco/mujoco_multi.py", line 204, in get_state_size
        return len(self.get_state()[0])
      File "../envs/ma_mujoco/multiagent_mujoco/mujoco_multi.py", line 191, in get_state
        state = self.env._get_obs()
      File "/home/spaci/anaconda3/envs/test/lib/python3.7/site-packages/gym/core.py", line 228, in __getattr__
        raise AttributeError(f"attempted to get missing private attribute '{name}'")
    AttributeError: attempted to get missing private attribute '_get_obs'
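
(A guess, not a confirmed fix.) The last frame shows gym's Wrapper.__getattr__ refusing to forward attributes that start with an underscore, which newer gym releases do deliberately; in that case the underlying MuJoCo env still exposes _get_obs through env.unwrapped. A minimal illustration of the difference, assuming a standard Gym MuJoCo task such as Ant-v2:

    import gym

    env = gym.make("Ant-v2")           # returns a wrapped env (e.g. TimeLimit around the MuJoCo env)

    # env._get_obs() can raise "attempted to get missing private attribute" on newer gym,
    # because the wrapper refuses to forward private names; the base env still has the method.
    state = env.unwrapped._get_obs()
    print(state.shape)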
