openai / maddpg

Code for the MADDPG algorithm from the paper "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments"

Home Page: https://arxiv.org/pdf/1706.02275.pdf

License: MIT License

Python 100.00%

maddpg's Introduction

Status: Archive (code is provided as-is, no updates expected)

Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

This is the code for implementing the MADDPG algorithm presented in the paper: Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. It is configured to be run in conjunction with environments from the Multi-Agent Particle Environments (MPE). Note: this codebase has been restructured since the original paper, and the results may vary from those reported in the paper.

Update: the original implementation for policy ensemble and policy estimation can be found here. The code is provided as-is.

Installation

  • To install, cd into the root directory and type pip install -e .

  • Known dependencies: Python (3.5.4), OpenAI gym (0.10.5), tensorflow (1.8.0), numpy (1.14.5)
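
A quick, hypothetical sanity check (not part of the official instructions) that the package is importable after installation:

python -c "import maddpg"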

Case study: Multi-Agent Particle Environments

We demonstrate here how the code can be used in conjunction with the Multi-Agent Particle Environments (MPE).

  • Download and install the MPE code here by following the README.

  • Ensure that multiagent-particle-envs has been added to your PYTHONPATH (e.g. in ~/.bashrc or ~/.bash_profile).

  • To run the code, cd into the experiments directory and run train.py:

python train.py --scenario simple

  • You can replace simple with any environment in the MPE you'd like to run.

Command-line options

Environment options

  • --scenario: defines which environment in the MPE is to be used (default: "simple")

  • --max-episode-len: maximum length of each episode for the environment (default: 25)

  • --num-episodes: total number of training episodes (default: 60000)

  • --num-adversaries: number of adversaries in the environment (default: 0)

  • --good-policy: algorithm used for the 'good' (non-adversary) policies in the environment (default: "maddpg"; options: {"maddpg", "ddpg"})

  • --adv-policy: algorithm used for the adversary policies in the environment (default: "maddpg"; options: {"maddpg", "ddpg"})

Core training parameters

  • --lr: learning rate (default: 1e-2)

  • --gamma: discount factor (default: 0.95)

  • --batch-size: batch size (default: 1024)

  • --num-units: number of units in the MLP (default: 64)

Checkpointing

  • --exp-name: name of the experiment, used as the file name to save all results (default: None)

  • --save-dir: directory where intermediate training results and model will be saved (default: "/tmp/policy/")

  • --save-rate: model is saved every time this number of episodes has been completed (default: 1000)

  • --load-dir: directory where training state and model are loaded from (default: "")

Evaluation

  • --restore: restores previous training state stored in load-dir (or in save-dir if no load-dir has been provided), and continues training (default: False)

  • --display: displays to the screen the trained policy stored in load-dir (or in save-dir if no load-dir has been provided), but does not continue training (default: False)

  • --benchmark: runs benchmarking evaluations on saved policy, saves results to benchmark-dir folder (default: False)

  • --benchmark-iters: number of iterations to run benchmarking for (default: 100000)

  • --benchmark-dir: directory where benchmarking data is saved (default: "./benchmark_files/")

  • --plots-dir: directory where training curves are saved (default: "./learning_curves/")
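
As an illustration of how these options combine (hypothetical values; any of the documented flags above can be mixed in the same way):

python train.py --scenario simple_tag --num-adversaries 3 --good-policy maddpg --adv-policy ddpg --exp-name simple_tag_run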

Code structure

  • ./experiments/train.py: contains code for training MADDPG on the MPE

  • ./maddpg/trainer/maddpg.py: core code for the MADDPG algorithm

  • ./maddpg/trainer/replay_buffer.py: replay buffer code for MADDPG

  • ./maddpg/common/distributions.py: useful distributions used in maddpg.py

  • ./maddpg/common/tf_util.py: useful tensorflow functions used in maddpg.py

Paper citation

If you used this code for your experiments or found it helpful, consider citing the following paper:

@article{lowe2017multi,
  title={Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments},
  author={Lowe, Ryan and Wu, Yi and Tamar, Aviv and Harb, Jean and Abbeel, Pieter and Mordatch, Igor},
  journal={Neural Information Processing Systems (NIPS)},
  year={2017}
}

maddpg's People

Contributors

cberner, christopherhesse, himanshub16, jxwuyi, ryan-lowe, wwxfromtju


maddpg's Issues

question about p_reg in p_train

I went through the code and found something I didn't understand.
I think of p_reg as a regularization term, and a regularization term as a constraint on the learned parameters.
But act_pd.flatparam() in the line p_reg = tf.reduce_mean(tf.square(act_pd.flatparam())) returns the network output; that is, flatparam() returns the network's output rather than the learned parameters. How should this regularization be understood? This confuses me and I look forward to your advice.
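
For readers puzzling over the same line, here is a minimal numpy sketch of what that term computes, assuming (as the issue describes) that act_pd.flatparam() returns the raw distribution parameters produced by the policy network; this is an illustration, not code from the repo:

import numpy as np

# Hypothetical batch of raw policy outputs (logits / distribution parameters).
flat_params = np.array([[ 2.0, -1.5,  0.5],
                        [ 4.0,  3.0, -2.0]])

# p_reg = tf.reduce_mean(tf.square(act_pd.flatparam())) penalizes large raw outputs,
# i.e. it regularizes the network's outputs (activations), not its trainable weights.
p_reg = np.mean(np.square(flat_params))

# It enters the policy loss with a small coefficient: loss = pg_loss + p_reg * 1e-3
print(p_reg)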

Episode in cooperative navigation env

Hi,
Thank you for releasing the code. I have some questions about the 'done' situation in the cooperative navigation environment. I don't see any done function for the env; I only see the maximum time step per episode as the terminal condition.
1. Is that the only situation in which the env is done and the world needs to be reset?
2. What about when agents cover the landmarks? Do they keep trying to cover the landmarks until the max time step is reached?
3. What is the max number of steps for the results you reported in Table 2 of the paper for the cooperative navigation env? Are the number of touches and the mean distance to landmarks computed over this number of time steps?

Thank you in advance

The code does not converge

I ran the simple_spread_listener environment with the code and it does not converge. I haven't changed any code.

Hello! I encountered some problems while running the train.py file in the maddpg repository and would like to seek your help.

Hello! When I run the train.py file in the experiments directory according to your instructions, I execute the following command:
python train.py --scenario simple
After training, I got the following error. What is the reason for this, and how should I fix it? Looking forward to your reply.
Traceback (most recent call last):
File "train.py", line 193, in
train(arglist)
File "train.py", line 182, in train
rew_file_name = arglist.plots_dir + arglist.exp_name + '_rewards.pkl'
TypeError: must be str, not NoneType
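
A common workaround, assuming the default --exp-name of None is the cause as the traceback suggests, is to pass an experiment name explicitly:

python train.py --scenario simple --exp-name simple_run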

The reward and action are nan?

Hello, when I run your code, everything seems to be fine. But when I display the result after 60000 episodes, the agent flashes and disappears quickly. I printed the reward and action; at a certain point, probably after 1500 episodes, they become nan. I didn't change anything and want to know why. Maybe you can give me some advice, thank you!

Trying to set the random seeds, any idea how?

I'm trying to set the random seeds so I can reproduce the same result across several runs. I tried setting the numpy seed in several scripts but none of them gives what I want. Any idea how to do this?
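
A minimal sketch of the usual TF 1.x seeding (an assumption about where to put it, e.g. near the top of train() in train.py before the graph, session and environment are created; the MPE environment or other sources may still introduce nondeterminism):

import random
import numpy as np
import tensorflow as tf

SEED = 0
random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # numpy RNG (used e.g. for replay buffer sampling)
tf.set_random_seed(SEED)  # TF 1.x graph-level seed for weight init and sampling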

Error when setting display to true

When I set display = True I get the error below. These are the versions I have for the known dependencies:
Python (3.6)
OpenAI gym (0.10.5)
tensorflow (1.8.0)
numpy (1.14.5)

  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
    return fn(*args)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [4,64] rhs shape= [16,64]
         [[Node: save/Assign_39 = Assign[T=DT_FLOAT, _class=["loc:@agent_0/target_p_func/fully_connected/weights"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](agent_0/target_p_func/fully_connected/weights, save/RestoreV2:39)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 193, in <module>
    train(arglist)
  File "train.py", line 96, in train
    U.load_state(arglist.load_dir)
  File "d:\red mirror\dr so\multi agent examples\maddpg-master\maddpg\common\tf_util.py", line 230, in load_state
    saver.restore(get_session(), fname)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1802, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
    run_metadata_ptr)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
    run_metadata)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [4,64] rhs shape= [16,64]
         [[Node: save/Assign_39 = Assign[T=DT_FLOAT, _class=["loc:@agent_0/target_p_func/fully_connected/weights"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](agent_0/target_p_func/fully_connected/weights, save/RestoreV2:39)]]

Caused by op 'save/Assign_39', defined at:
  File "train.py", line 193, in <module>
    train(arglist)
  File "train.py", line 96, in train
    U.load_state(arglist.load_dir)
  File "d:\red mirror\dr so\multi agent examples\maddpg-master\maddpg\common\tf_util.py", line 229, in load_state
    saver = tf.train.Saver()
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1338, in __init__
    self.build()
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1347, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 1384, in _build
    build_save=build_save, build_restore=build_restore)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 835, in _build_internal
    restore_sequentially, reshape)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 494, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\saver.py", line 185, in restore
    self.op.get_shape().is_fully_defined())
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\state_ops.py", line 283, in assign
    validate_shape=validate_shape)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 63, in assign
    use_locking=use_locking, name=name)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
    op_def=op_def)
  File "C:\Users\AAJ\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [4,64] rhs shape= [16,64]
         [[Node: save/Assign_39 = Assign[T=DT_FLOAT, _class=["loc:@agent_0/target_p_func/fully_connected/weights"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](agent_0/target_p_func/fully_connected/weights, save/RestoreV2:39)]]

Having trouble with import maddpg

I am having trouble when I run python3 train.py --scenario simple
File "train.py", line 12, in
import maddpg
ModuleNotFoundError: No module named 'maddpg'
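
This usually means the maddpg package itself is not importable; running the installation step from the repository root (as described in the Installation section above), or adding the repo root to PYTHONPATH, typically resolves it:

pip install -e .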

run code

I can't find the code for "Inferring Policies of Other Agents" and "Agents with Policy Ensembles" mentioned in the paper 'Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments'. Did I miss something?

Import Errors

Sorry for raising the same issue again. However, I have a strange issue here.

I can import the "maddpg" alone, however, when I run the script, I get the following errors copied below. The same issue is reproduced when I try to import maddpg.common.tf_util or the others in python using the terminal.

I am running OpenAI Gym > 0.10

I am pretty sure that my PYTHONPATH is set correctly, pointing to the maddpg directory that has the __init__ file in it.

Traceback (most recent call last):
File "train.py", line 9, in
import maddpg.common.tf_util as U
ImportError: No module named common.tf_util

Traceback (most recent call last):
File "train.py", line 10, in
from maddpg.trainer.maddpg import MADDPGAgentTrainer
ImportError: No module named trainer.maddpg

How can I use DDPG to train it?

"readme.md" said we can choose DDPG to train it, but it seems not useful. So if I want to use DDPG, how can I modify the code?

SoftMultiCategoricalPd

Hi, you have commented that SoftMultiCategoricalPd doesn't work yet.
File: maddpg/common/distributions.py
Line: 272

Can you tell me have you fixed it or working on it?

Can you also please explain the difference between the distributions and their soft versions, because in OpenAI/baselines there is no soft version.
And will there be any problems if I use MultiCategoricalPd for a MultiDiscrete action space?

Thanks

displaying agent behaviors on the screen

Hi, I want to display agent behaviors on the screen, but when I change the default value of "--display" to True, I get an error:

Using good policy maddpg and adv policy maddpg
Loading previous state...
Starting iterations...
agent 1 to agent 0: _ agent 2 to agent 0: _ agent 0 to agent 1: _ agent 2 to agent 1: _ agent 0 to agent 2: _ agent 1 to agent 2: _
Traceback (most recent call last):

File "", line 1, in
runfile('/home/***/Desktop/maddpg-master-new/experiments/train.py', wdir='/home/yuanweilin/Desktop/maddpg-master-new/experiments')

File "/home/***/anaconda3/envs/tensorflow/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

File "/home/***/anaconda3/envs/tensorflow/lib/python3.5/site-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "/home/***/Desktop/maddpg-master-new/experiments/train.py", line 198, in
train(arglist)

File "/home/***/Desktop/maddpg-master-new/experiments/train.py", line 155, in train
env.render()

File "/home/***/Desktop/multiagent-particle-envs-master-new/multiagent/environment.py", line 220, in render
self.viewers[i] = rendering.Viewer(700,700)

File "/home/***/Desktop/multiagent-particle-envs-master-new/multiagent/rendering.py", line 52, in init
self.window = pyglet.window.Window(width=width, height=height, display=display)

File "/home/***/anaconda3/envs/tensorflow/lib/python3.5/site-packages/pyglet/window/init.py", line 504, in init
screen = display.get_default_screen()

File "/home/***/anaconda3/envs/tensorflow/lib/python3.5/site-packages/pyglet/canvas/base.py", line 73, in get_default_screen
return self.get_screens()[0]

File "/home/***/anaconda3/envs/tensorflow/lib/python3.5/site-packages/pyglet/canvas/base.py", line 65, in get_screens
raise NotImplementedError('abstract')

NotImplementedError: abstract

add multiagent-particle-envs to PYTHONPATH

Hello, thanks for the implementation of this algorithm!
I tried to run train.py, but it complains:
Traceback (most recent call last):
File "train.py", line 9, in
import maddpg.common.tf_util as U
ImportError: No module named common.tf_util

I suspect that I did not correctly add multiagent-particle-envs to my PYTHONPATH. I git cloned multiagent-particle-envs and maddpg into the same folder, and tried "export PYTHONPATH=$PYTHONPATH:/home/user/path/multiagent-particle-envs" in the terminal, but it still gives me the same error.

How can i use it for "simple_world_comm" in MPE? ---- "AssertionError: nvec should be a 1d array (or list) of ints"

I am focusing on the implementation of "maddpg", but an error occurred.

PyCharm showed:
Traceback (most recent call last):
File "/home/zimoqingfeng/rlSource/maddpg/experiments/train.py", line 195, in
train(arglist)
File "/home/zimoqingfeng/rlSource/maddpg/experiments/train.py", line 83, in train
env = make_env(arglist.scenario, arglist, arglist.benchmark)
File "/home/zimoqingfeng/rlSource/maddpg/experiments/train.py", line 62, in make_env
env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation)
File "/home/zimoqingfeng/rlSource/multiagent-particle-envs/multiagent/environment.py", line 60, in init
act_space = spaces.MultiDiscrete([[0,act_space.n-1] for act_space in total_action_space])
File "/home/zimoqingfeng/rlSource/gym/gym/spaces/multi_discrete.py", line 10, in init
assert self.nvec.ndim == 1, 'nvec should be a 1d array (or list) of ints'
AssertionError: nvec should be a 1d array (or list) of ints

For your information, if you need.
def parse_args():
    parser = argparse.ArgumentParser("Reinforcement Learning experiments for multiagent environments")
    # Environment
    parser.add_argument("--scenario", type=str, default="simple_world_comm", help="name of the scenario script")
    parser.add_argument("--max-episode-len", type=int, default=25, help="maximum episode length")
    parser.add_argument("--num-episodes", type=int, default=60000, help="number of episodes")
    parser.add_argument("--num-adversaries", type=int, default=0, help="number of adversaries")
    parser.add_argument("--good-policy", type=str, default="maddpg", help="policy for good agents")
    parser.add_argument("--adv-policy", type=str, default="maddpg", help="policy of adversaries")
    # Core training parameters
    parser.add_argument("--lr", type=float, default=1e-2, help="learning rate for Adam optimizer")
    parser.add_argument("--gamma", type=float, default=0.95, help="discount factor")
    parser.add_argument("--batch-size", type=int, default=1024, help="number of episodes to optimize at the same time")
    parser.add_argument("--num-units", type=int, default=64, help="number of units in the mlp")
    # Checkpointing
    parser.add_argument("--exp-name", type=str, default="001", help="name of the experiment")
    parser.add_argument("--save-dir", type=str, default="./tmp/policy_simple_world_comm/", help="directory in which training state and model should be saved")
    parser.add_argument("--save-rate", type=int, default=1000, help="save model once every time this many episodes are completed")
    parser.add_argument("--load-dir", type=str, default="", help="directory in which training state and model are loaded")
    # Evaluation
    parser.add_argument("--restore", action="store_true", default=False)
    parser.add_argument("--display", action="store_true", default=False)
    parser.add_argument("--benchmark", action="store_true", default=False)
    parser.add_argument("--benchmark-iters", type=int, default=100000, help="number of iterations run for benchmarking")
    parser.add_argument("--benchmark-dir", type=str, default="./benchmark_files/", help="directory where benchmark data is saved")
    parser.add_argument("--plots-dir", type=str, default="./learning_curves/", help="directory where plot data is saved")
    return parser.parse_args()

Thank you anyway, and look forward to your reply!

How to normalize the data in table of Appendix to obtain Figure 3 in paper?

Dear Author,

I have been replicating your results recently and am a little confused about Figure 3 in the paper: how can the data in the Appendix tables (e.g. the success rates of agents and adversaries in Table 4) be normalized to obtain that 'histogram'? Is the normalized value computed with respect to the maximum value of a column? And what exactly is the score of an agent? Please give me some hints.

Thank you in advance sincerely!


The sample function in distributions is an implementation of Gumbel-softmax. I added it to my code, and it helps to speed up and stabilize the training, but my speaker still cannot tell the different landmarks apart.

How do you handle the action exploration then?

Originally posted by @djbitbyte in #9 (comment)

Calculating Success Rate for Physical Deception

Hello, I am trying to reproduce the experiments on the physical deception scenario. Do you mind if I ask about the setup for calculating the success rate?
I. What is the distance threshold?
II. In which situation does an agent succeed?
a. as long as it was close enough to the target at any time step
b. it was close enough to the target at the end of an episode
c. it should be close to the target for a while, over multiple time steps
d. any other situation?
III. How is the rate calculated if we have multiple good/adv agents?
a. calculate the average rate
b. pick the minimum distance to the target (under setup II.) of the good/adv agents at each episode
c. any other situation?

Thanks for your time.

Running train.py doesn't seem to work

After setting up the multiagent-particle-envs and adding the path to the PYTHONPATH, and after all the pip installs, running the

python train.py --scenario simple

results in:

/home/.local/lib/python3.6/site-packages/tensorflow/python/util/tf_inspect.py:45: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
if d.decorator_argspec is not None), _inspect.getargspec(target))
Using good policy maddpg and adv policy maddpg
Traceback (most recent call last):
File "train.py", line 195, in
train(arglist)
File "train.py", line 106, in train
obs_n = env.reset()
File "~/gym/gym/core.py", line 71, in reset
raise NotImplementedError
NotImplementedError

Any fixes?

NoneType flaw in "train.py", line 182

If one leaves "--exp-name" at its default, a TypeError occurs in train.py (line 182). This results from the parse_args function setting the default of "--exp-name" to None. It might be better to set it to "" as is done for "--load-dir".

There is no provision to run ddpg.

I typed the policy as 'lol' and it still gives the same result. The readme says we can run the experiments with ddpg as well as maddpg, which definitely does not seem to be working.

Cannot reproduce experiment results

Is this code vastly different from the code used to generate results for the paper?
I cannot reproduce any of the results of the experiments simple_spread, simple_reference, simple_tag even after running for over 2 million iterations. The policy doesn't even look like it's getting better. Any tips on this? Has somebody else got it to work?

Further, I don't see the policy ensemble part or the part that estimates other agents' policies (Sections 4.2 and 4.3 in the paper) in the code. Am I missing something?

When I run train.py,it shows "TypeError: Can't convert 'NoneType' object to str implicitly".

I have configured MPE and MADDPG, but an error occurred while I was running train.py (python train.py --scenario simple).
I can see the program's output up to this point:

steps: 1374975, episodes: 55000, mean episode reward: -6.977302725370943, time: 14.844
steps: 1399975, episodes: 56000, mean episode reward: -6.743868053210359, time: 14.563
steps: 1424975, episodes: 57000, mean episode reward: -6.622564807563306, time: 14.43
steps: 1449975, episodes: 58000, mean episode reward: -6.21897592491655, time: 14.479
steps: 1474975, episodes: 59000, mean episode reward: -6.874195291324205, time: 14.642
steps: 1499975, episodes: 60000, mean episode reward: -6.769229165363719, time: 14.58
Traceback (most recent call last):
File "train.py", line 195, in
train(arglist)
File "train.py", line 184, in train
rew_file_name = arglist.plots_dir + arglist.exp_name + '_rewards.pkl'
TypeError: Can't convert 'NoneType' object to str implicitly

I don't know where the problem is.
Look forward to your reply! Thanks!

TypeError: set_color() got multiple values for argument 'alpha' in Simple-Crypto

Running train.py with the simple crypto scenario gives me this error. I have tried printing out the value of alpha, but I only got one value, so how can I solve this?

Traceback (most recent call last):
File "train.py", line 193, in
train(arglist)
File "train.py", line 153, in train
env.render()
File "/Users/Maro31/bachelor/multiagent-particle-envs/multiagent/environment.py", line 235, in render
geom.set_color(*entity.color, alpha=0.5)
TypeError: set_color() got multiple values for argument 'alpha'

reward is too large

When I ran python train.py --scenario simple, the reward values were too large, but I didn't change the code.

How to turn continuous action into discrete action

In the source code, I saw some code for discrete actions. I changed act_space = MultiDiscrete() to spaces.Discrete(), but the output actions were not the expected discrete numbers but decimals, so I am unable to execute my discrete actions.
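
One common workaround (an assumption on my part, not something provided by this repo) is to keep the continuous output vector and discretize it yourself, e.g. by taking the index of its largest component:

import numpy as np

action = np.array([0.1, 0.7, 0.05, 0.15])  # hypothetical policy output over the discrete choices
discrete_action = int(np.argmax(action))   # index of the most probable action
print(discrete_action)  # -> 1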

Spark

Has anyone adapted the code to run on Spark? Would be very grateful if it could be shared, or at least some pointers given. Or maybe it runs as-is?

TypeError: must be str, not NoneType when running train.py

I ran train.py and got the following error:
Traceback (most recent call last):
File "/home/shy/桌面/maddpg-master/experiments/train.py", line 195, in
train(arglist)
File "/home/shy/桌面/maddpg-master/experiments/train.py", line 184, in train
rew_file_name = arglist.plots_dir + arglist.exp_name
TypeError: must be str, not NoneType

Error in scenario simple_reference with gym.spaces.MultiDiscrete

In gym versions later than 0.9.5, MultiDiscrete was modified and is no longer compatible with the simple_reference scenario.

Got Error:
assert self.nvec.ndim == 1, 'nvec should be a 1d array (or list) of ints'
AssertionError: nvec should be a 1d array (or list) of ints

where self.nvec = [[0,9],[0,4]]

Two problems about the update function

I have two problems with the update function after reading your code. Could anyone explain them to me? I would really appreciate it.
First, I can't understand what role the variable "num_sample" plays when training the q network:

# train q network
num_sample = 1  # number of target-Q samples to average; with 1 the loop runs only once
target_q = 0.0
for i in range(num_sample):
    # target actions of all agents for the next observations, from their target policies
    target_act_next_n = [agents[i].p_debug['target_act'](obs_next_n[i]) for i in range(self.n)]
    # centralized target critic evaluated on all next observations and target actions
    target_q_next = self.q_debug['target_q_values'](*(obs_next_n + target_act_next_n))
    target_q += rew + self.args.gamma * (1.0 - done) * target_q_next
target_q /= num_sample

Second, why should the loss of p be loss = pg_loss + p_reg * 1e-3, and what role does p_reg play in the loss?

When I run train.py, it shows "module 'tensorflow' has no attribute 'float32'"

When I run bin/interactive.py, it shows:
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Traceback (most recent call last):
File "interactive.py", line 23, in
env.render()
File "/home/***/anaconda3/lib/python3.6/site-packages/gym/core.py", line 111, in render
raise NotImplementedError
NotImplementedError

And another problem: when I run train.py, it shows:
runfile('/home/***/Desktop/maddpg-master/experiments/train.py', wdir='/home/***/Desktop/maddpg-master/experiments')
Traceback (most recent call last):
File "", line 1, in
runfile('/home/***/Desktop/maddpg-master/experiments/train.py', wdir='/home/***/Desktop/maddpg-master/experiments')
File "/home/***/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 710, in runfile
execfile(filename, namespace)
File "/home/***/anaconda3/lib/python3.6/site-packages/spyder/utils/site/sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/home/***/Desktop/maddpg-master/experiments/train.py", line 9, in
import maddpg.common.tf_util as U
File "/home/***/Desktop/maddpg-master/maddpg/common/tf_util.py", line 71, in
class BatchInput(PlacholderTfInput):
File "/home/***/Desktop/maddpg-master/maddpg/common/tf_util.py", line 72, in BatchInput
def __init__(self, shape, dtype=tf.float32, name=None):
AttributeError: module 'tensorflow' has no attribute 'float32'

Why does it show "module 'tensorflow' has no attribute 'float32'"?

How does maddpg update the actor?

Excuse me, I have a question about a detail in maddpg.py: in the function update(), when training the p network, why use the observations and actions from the replay buffer instead of the observations and the corresponding actions produced by the actor, i.e., obs_n and act(*obs_n)?

ImportError: cannot import name 'prng' when run train.py

When I run train.py I got an import error:

python train.py --scenario simple
Traceback (most recent call last):
File "train.py", line 8, in
from maddpg.trainer.maddpg import MADDPGAgentTrainer
File "e:\dpl\ma\maddpg-master\maddpg-master\maddpg\trainer\maddpg.py", line 6, in
from maddpg.common.distributions import make_pdtype
File "e:\dpl\ma\maddpg-master\maddpg-master\maddpg\common\distributions.py", line 5, in
from multiagent.multi_discrete import MultiDiscrete
File "e:\dpl\ma\multiagent-particle-envs-master\multiagent-particle-envs-master\multiagent\multi_discrete.py", line 7, in
from gym.spaces import prng
ImportError: cannot import name 'prng'

How to address it? Thanks.

How or why the gaussian distribution contributes to the training?

It's interesting that the code decomposes the output of the actor network into a mean and a standard deviation, and then constructs a new action from a Gaussian distribution. In the past, there was usually an extra noise factor, decreased gradually over training, to control the added noise.
I wonder if you can explain how or why this works :)

The result is not as good as the paper showed

I just ran maddpg on simple_speaker_listener several times, but none of the runs reaches the -20 average reward reported in the paper. Is there anything I should modify to get a better or more stable result?

It seems that the training is decentralized?

I have looked through train.py and found that you provide each agent with a trainer:

def get_trainers(env, num_adversaries, obs_shape_n, arglist):
    trainers = []
    model = mlp_model
    trainer = MADDPGAgentTrainer
    # one trainer per adversary agent
    for i in range(num_adversaries):
        trainers.append(trainer(
            "agent_%d" % i, model, obs_shape_n, env.action_space, i, arglist,
            local_q_func=(arglist.adv_policy=='ddpg')))
    # one trainer per good agent
    for i in range(num_adversaries, env.n):
        trainers.append(trainer(
            "agent_%d" % i, model, obs_shape_n, env.action_space, i, arglist,
            local_q_func=(arglist.good_policy=='ddpg')))
    return trainers

Maybe my understanding is wrong, but centralized training in my understanding means using an identical model to learn the q function, so I can't understand why each agent is assigned its own trainer (including the model inside the trainer), since you then have many models to train rather than only one.

And I found that even though you use reuse=True when setting tf.variable_scope, each agent's trainer has a model with variable names like "agent_0/fully_connected/weights". That means the weights and biases of the models are not shared: agent_0 has its own model, agent_1 has its own model, and so on.

So how can you say the training of this multi-agent system is centralized?

Look forward to your reply! Thanks!

Question regarding the replay buffers and the Critic networks. (duplicates in the state)

Hello everybody!

As far as I can see from the code, each agent maintains its own replay buffer.

In the training step, when sampling the minibatch, the observations of all agents are collected and concatenated.

for i in range(self.n):
    obs, act, rew, obs_next, done = agents[i].replay_buffer.sample_index(index)
    obs_n.append(obs)
    obs_next_n.append(obs_next)
    act_n.append(act)

As far as I can see, this would lead to duplicates in the state input to the agent's critic function. If there are components of the environment state that are part of every agent's observation, these components would be contained in the critic's input multiple times.

Is this true, or do I miss anything?

Does this (artificial) state expansion have any adverse effect on the critic, or can we safely assume that the critic will quickly learn that the input values at some input nodes are always identical and hence can be treated jointly?

Are there any memory issues due to the multiple storage of the state components in each of the agents' replay buffer? (Probably, memory is not an issue with RL guys, but I have a background in embedded systems)

I would be very grateful for some more insight on this topic.

Regards,
Felix

Q divergence

Hello! I am working to implement MADDPG in PyTorch based on the details of this implementation in TensorFlow. I have followed the implementation to a tee, but when I remove the regularization on the policy logits, my Q values diverge. When I remove the same regularization term in your implementation, this does not occur. Did you experience this divergence issue? Was it a matter of tuning, or does it indicate an issue with my implementation? Thank you.

action exploration & Gumbel-Softmax

Hello, I have questions on exploration and Gumbel-Softmax.

The pseudocode mentions initializing a random process N for action exploration, the same as in the DDPG paper. But I have difficulty understanding the exploration in your implementation. Is the Ornstein-Uhlenbeck process used for this algorithm, as in DDPG? Could you explain how you handled action exploration?

Another question, did you use softmax instead of Gumbel-Softmax?

I have tried to implement MADDPG on the simple-speaker-listener scenario, but without the Ornstein-Uhlenbeck process for action exploration, and with only a softmax for the actor network. The other parts are the same as in the paper, but my speaker converges to telling the same wrong target landmark, and the listener wanders around or between the 3 landmarks. I guess the listener ignored the speaker, as described in the paper.
I've also tried your code on simple-speaker-listener, and it converges correctly for some trainings. Are the action exploration and activation functions the reasons for the wrong convergence; do they have a big impact on the training process?

Thanks for your time!
