The di-engine's discuss from opendilab

Discussion channel for how to apply self-play to custom env?

Hi all,

Nice project. We want to start using it. After reading the doc and the config dizoo/competitive_rl/entry/cpong_dqn_default_config.py for league train, there are still something not clear to us. Do you have a channel that can discuss trivial questions frequently? Like a WeChat group or slack channel?

cc: [email protected]

PettingZoo for SMAC

As a follow up to #153, you guys don't need to separately support the SMAC API; you can just use the PettingZoo API since SMAC supports it and it's fairly heavily used.

gfootball_ppo_parallel_config.py does not work

WARNING:root:If you want to use numba to speed up segment tree, please install numba first
Traceback (most recent call last):
  File "/Users/zzhaoao/Documents/RL/New/DI-engine/dizoo/gfootball/entry/parallel/gfootball_ppo_parallel_config.py", line 102, in <module>
    parallel_pipeline(config, seed=0)
  File "/Users/zzhaoao/Documents/RL/New/DI-engine/ding/entry/parallel_entry.py", line 52, in parallel_pipeline
    launch_coordinator(config.seed, config, learner_handle=learner_handle, collector_handle=collector_handle)
  File "/Users/zzhaoao/Documents/RL/New/DI-engine/ding/entry/parallel_entry.py", line 125, in launch_coordinator
    coordinator = Coordinator(config)
  File "/Users/zzhaoao/Documents/RL/New/DI-engine/ding/worker/coordinator/coordinator.py", line 61, in __init__
    self._exp_name = cfg.main.exp_name
AttributeError: 'EasyDict' object has no attribute 'exp_name'
WARNING:root:If you want to use numba to speed up segment tree, please install numba first
WARNING:root:If you want to use numba to speed up segment tree, please install numba first
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/[email protected]/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/Cellar/[email protected]/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/local/Cellar/[email protected]/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/Cellar/[email protected]/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/Cellar/[email protected]/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/local/Cellar/[email protected]/3.9.12/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <function Coordinator.__del__ at 0x14d374790>
Traceback (most recent call last):
  File "/Users/zzhaoao/Documents/RL/New/DI-engine/ding/worker/coordinator/coordinator.py", line 289, in __del__
    self.close()
  File "/Users/zzhaoao/Documents/RL/New/DI-engine/ding/worker/coordinator/coordinator.py", line 268, in close
    if self._end_flag:
AttributeError: 'Coordinator' object has no attribute '_end_flag'

Comparison of training efficiency between asynchronous mode and distributed mode based on Gobigger Env

>>> print(ding.__version__, torch.__version__, sys.version, sys.platform)
v0.2.2 1.10.0+cu102 3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54)
[GCC 7.3.0] linux

按照DI-engine文档中关于Task与Parallel的用法，将Gogbigger训练di-baseline改为并行及异步形式。
训练测试结果显示，使用Parallel比只使用Task更慢，这可能是什么原因？

    with Task(async_mode=True) as task:
        task.use_step_wrapper(StepTimer(print_per_step=1))
        task.use(evalute(random_evaluator, rule_evaluator, model, task), filter_labels=["standalone", "node.1"])
        task.use(collect(epsilon_greedy, collector, replay_buffer), filter_labels=["standalone", "node.0"])
        task.use(training(cfg, learner, replay_buffer, task, model), filter_labels=["standalone", "node.0"])
        task.run(max_step=max_iterations)

Is it a bug about transformer mask?

DI-engine/ding/torch_utils/network/transformer.py

Line 72 in 3bf01a9

score.masked_fill(~mask, value=-1e9)

I suppose it was meant to be an in place func here, masked_fill_() to be exact, otherwise maybe it wouldn't work

The default n_sample in SAC Policy

If I use n_episode for SAC policy, it raises error. 'AssertionError: n_episode/n_sample in policy cfg can't be not None at the same time'.
I find that there is a default config value 'n_sample=1' in SAC policy, thus if I define n_episode, two config keys exist at the same time.
I suggest delete the default config for n_sample.

[Error] AttributeError: 'InteractionSerialEvaluator' object has no attribute '_end_flag'

v0.3.1 1.8.1+cpu 3.8.12 (default, Oct 12 2021, 03:01:40) [MSC v.1916 64 bit (AMD64)] win32

When running the basic example: python3 -u dizoo/classic_control/cartpole/entry/cartpole_dqn_main.py
It shows the following error.

Traceback (most recent call last):
  File "C:/ProgramData/Anaconda3/envs/PYTORCH/Lib/site-packages/dizoo/classic_control/cartpole/entry/cartpole_dqn_main.py", line 91, in <module>
    main(cartpole_dqn_config)
  File "C:/ProgramData/Anaconda3/envs/PYTORCH/Lib/site-packages/dizoo/classic_control/cartpole/entry/cartpole_dqn_main.py", line 84, in main
    evaluator = InteractionSerialEvaluator(
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\worker\collector\interaction_serial_evaluator.py", line 56, in __init__
    self.reset(policy, env)
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\worker\collector\interaction_serial_evaluator.py", line 112, in reset
    self.reset_env(_env)
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\worker\collector\interaction_serial_evaluator.py", line 76, in reset_env
    self._env.launch()
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\envs\env_manager\base_env_manager.py", line 199, in launch
[2022-05-22 16:15:02] ERROR    Env 0 reset has exceeded max retries(1)                                                                             base_env_manager.py:274
    self.reset(reset_param)
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\envs\env_manager\base_env_manager.py", line 242, in reset
    self._reset(env_id)
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\envs\env_manager\base_env_manager.py", line 281, in _reset
    raise runtime_error
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\envs\env_manager\base_env_manager.py", line 259, in _reset
    obs = reset_fn()
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\envs\env_manager\base_env_manager.py", line 251, in reset_fn
    return self._envs[env_id].reset(**self._reset_param[env_id])
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\envs\env\ding_env_wrapper.py", line 68, in reset
    obs = self._env.reset()
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\gym\wrappers\record_video.py", line 58, in reset
    self.start_video_recorder()
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\gym\wrappers\record_video.py", line 75, in start_video_recorder
    self.video_recorder.capture_frame()
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\gym\wrappers\monitoring\video_recorder.py", line 155, in capture_frame
    self._encode_image_frame(frame)
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\gym\wrappers\monitoring\video_recorder.py", line 213, in _encode_image_frame
    self.encoder = ImageEncoder(
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\gym\wrappers\monitoring\video_recorder.py", line 337, in __init__
    raise error.DependencyNotInstalled(
RuntimeError: Env 0 reset has exceeded max retries(1), and the latest exception is: DependencyNotInstalled("Found neither the ffmpeg nor avconv executables. On OS X, you can install ffmpeg via `brew install ffmpeg`. On most Ubuntu variants, `sudo apt-get install ffmpeg` should do it. On Ubuntu 14.04, however, you'll need to install avconv with `sudo apt-get install libav-tools`.")
Exception ignored in: <function InteractionSerialEvaluator.__del__ at 0x0000017CA3252280>
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\worker\collector\interaction_serial_evaluator.py", line 138, in __del__
  File "C:\ProgramData\Anaconda3\envs\PYTORCH\lib\site-packages\ding\worker\collector\interaction_serial_evaluator.py", line 125, in close
AttributeError: 'InteractionSerialEvaluator' object has no attribute '_end_flag'

Process finished with exit code 1

Test trigger

`platform_test` for m1 (arm64)

I propose for m1 (arm64) to be added to do_platform_test
so many failed wheels with default set-ups......

Originally posted by @1733-afk in #259 (comment)

Suggestion on unittest for ding/utils/plot.py

The unit test here can just check whether the file was produced successfully, but in actual, image files frequently have issues with generation (for example, some styles, element rendering failure, etc.). As a result, such a test does not function properly:

Here's an idea:

Please test matplotlib image content using unit testing (see https://github.com/matplotlib/pytest-mpl).
Use an image similarity unit test, such as https://github.com/Apkawa/pytest-image-diff.
In addition, files will be generated under the project directory, which could have unintended consequences on the Git workspace and may be added to the repository in later git add commands. For unit tests, please use the mocked path, such as isolated_directory in hbutils.

compile_config()中cfg其实是default_cfg ，在merge和compile的时候放到了usr_cfg的位置

when I use Multi-GPUs to train my model，RuntimeError: Caught RuntimeError in replica 0 on device 0

when I use Multi-GPUs to train my model for Gobigger, the error happened: RuntimeError: Caught RuntimeError in replica 0 on device 0

The detail in opendilab/Gobigger-Explore#3

How to create customized model (pointer network)

Hi there, I am new to DI-engine.
I am trying to implement the pointer network for my own environment.
The most relevant resource I can find is the docs about the RNN here. It seems that I can treat the pointer network as a kind of RNN and wrap each decoding output as hidden_state . But the encoder (also an LSTM) output is also used in every decoding step. Can I wrap it as another hidden_state ?
I noticed from slack that a similar architecture had been implemented in DI-star.
Can you give me directions on how to make it work?
Also, I am not sure which part of the codes I should modify. It will be good if you can point me to the docs/ tutorial on customizing models.

sac_discrete用random collect policy的输出没有logit

File "/root/cityflow/my_cityflow/SAC/cityflow_sac_train.py", line 177, in serial_pipeline
random_collect(cfg.policy, policy, collector, collector_env, commander, replay_buffer)
File "/root/DI-engine/ding/entry/utils.py", line 40, in random_collect
new_data = collector.collect(n_sample=policy_cfg.random_collect_size, policy_kwargs=collect_kwargs)
File "/root/DI-engine/ding/worker/collector/sample_serial_collector.py", line 251, in collect
self._obs_pool[env_id], self._policy_output_pool[env_id], timestep
File "/root/DI-engine/ding/policy/sac.py", line 453, in _process_transition
'logit': model_output['logit'],
KeyError: 'logit'

Visualize Training Progression

RL training can be unstable and fall into local optimal solution easily. Visualization and monitoring metrics are therefore extremely important. Assume there are 3 roles in league training (MA, ME, LE). It would be better to visualize metrics for each of these roles over the training time.

PettingZoo

Hey, for all your multi-agent environments have you considered using the pettingzoo API?

https://github.com/Farama-Foundation/PettingZoo

Recommended change on sac as well as how to inherit a policy

Change how we transform a distribution. For example, https://github.com/opendilab/DI-engine/blob/main/ding/policy/sac.py#L816-L822, can be changed to

dist = TransformedDistribution(Independent(Normal(mu, sigma), 1), [TanhTransform()]) 
next_action = dist.rsample()
next_log_prob = dist.log_prob()

This is much easier and more importantly, more numerically stable.
2. I also recommend the practice in mbsac when creating variants of a policy. A lot of copies in configs and __init__learn are not necessary (for example, in SQILSACPolicy)
3. There are two sac papers sac-v1 and sac-v2 from the same search group, I think we should include links to both papers since it is sac-v2 that proposes automatic entropy adjustment.
4. Maybe we can delete value_network instead of hardcoding value_network=False, which is not commonly used and not even used in sac-v2.
5. Maybe we can create a subdirectory for sac instead of squeezing every variant of sac in a single file. Since sac is a very good commonly-used baseline. If a subdirectory is not necessary, at least we should move SACPolicy to the top instead of SACDiscretePolicy, which is not commonly used.

r2d2 atari

Hello there, I'm sort of a newbie here. I am trying to reproduce some of the atari games with R2D2, and I am unable to produce them. I've been blocked on this for quite some time and it would be a great help if anyone can help me here.
Thank you.

Is the performance of re-implemented GAIL compareble with the results in original paper?

Hi, I want to use your code as a baseline, however I am not sure if your code is good enough to match the results in original paper.
Could you please provide more information, like training figures?

The default random_collect_size is not compatible with episode collector

When I used a episode collector together with SAC policy. It raised the following exception.

Traceback (most recent call last):
  File "/home/tianhan/codes/astraea/src/train/astraea_episode_ma_sac_config.py", line 96, in <module>
    serial_pipeline([main_config, create_config], seed = 9)
  File "/home/tianhan/codes/astraea/third_party/DI-engine/ding/entry/serial_entry.py", line 91, in serial_pipeline
    new_data = collector.collect(n_sample=cfg.policy.random_collect_size, policy_kwargs=collect_kwargs)
TypeError: collect() got an unexpected keyword argument 'n_sample'

I think the reason is that SAC policy has a default random_collect_size, which is not compatible with episode collector (e.g. EpisodeSerialCollector, which requires n_episode as an argument instead of n_sample).

May be you should add readme on pypi site?

What do I find?

Look at here: https://pypi.org/project/DI-engine/

It says

The author of this package has not provided a project description

Why this is not okay?

If I open the pypi page, I will be confused that what is it? 😕

How to solve this problem

These information should be configured in setup.py. So just take a look at the implement in treevalue.

After that, the content in README will be visible on pypi site, like the treevalue. (Some links are down, I'm fixing 😸 )

League Training for SlimeVolleyball Env

Overview

Add league training pipeline in slime_volleyball environment, and make better performance than self-play results (#23)
Related Discussion: #61

TODO

league pipeline
mutate from pre-trained model (vs bot)
policy behavior analysis

Entropy scheduling

In reinforcement learning there is well known explore-exploit dilemma. In league training it's crucial that we can have a better entropy coefficient scheduling because of the following reasons:
(1) If entropy of policy drops too fast to zero, it might get stuck in local optimum and failed to explore more states.
(2) If entropy of policy drops too slow, it might fail to select the right action at pivotal moments and the training is very slow.

One solution to address above problem is to have a good scheduling. Assume there are some validation measurements that we can use like win rate, we only decrease entropy coefficient when the win rate is on plateau.

It's similar to learning rate scheduler ReduceLROnPlateau in PyTorch link

May us know if there could be some documentations about how entropy scheduling can be supported?

CPU utilization problem

  # ding version `v0.2.0`, linux platform

Issue Description

CPU utilization is not 100% and very low. (below 5% on average)

Steps to Reproduce

clone the repo and git checkout main. (currently on 0fcfdf26). Run python3 dizoo/slime_volley/entry/slime_volley_selfplay_ppo_main.py. Open htop to check CPU usage. Only one core is occupied on a multi-core machine.

What Do We Need?

During training, run command mpstat 3. The column of %idle is less than 20% (Current value is 97%)

A brave new interface

We will design a brave new interactive interface in Di-engine 1.0, including program api and cli commands, which will support most reinforcement learning scenarios, and the rest will be implemented by our elastic atomic components.

Here are some design guidelines for the new interfaces:

It should be compatible with the existing configurations, and easy to convert from the old code to the new style.
It should be semantic and easy to understand, anyone with little reinforment learning experience will benifit from the policy examples.
It should be easy to extend to multi-threaded or distributed environments.

Any suggestions are welcome, please leave your comments in this channel.

Import error for TREX

Hello, I'm a beginner in IRL and I want to reproduce the results of TREX algorithm by running "dizoo/mujoco/entry/mujoco_trex_main.py" in the repo. But there is an ImportError: cannot import name 'serial_pipeline_trex_onpolicy' from 'ding.entry'. I think it is because there is no corresponding file named "serial_entry_trex.py" which contains functions 'serial_pipeline_trex_onpolicy' and 'serial_pipeline_trex'.
Can someone help me to solve that, thank you.

when runing cartpole_ppo_rnd_main.py, some bug is coming.

hi, when runing cartpole_ppo_rnd_main.py, some bug is coming. I want to know the reason and the corresponding solution. the bug is below. looking forward your answer.

Traceback (most recent call last):
File "/home/jgp/.conda/envs/jgpenv/lib/python3.8/site-packages/dizoo/classic_control/cartpole/entry/cartpole_ppo_rnd_main.py", line 70, in
main(cartpole_ppo_rnd_config)
File "/home/jgp/.conda/envs/jgpenv/lib/python3.8/site-packages/dizoo/classic_control/cartpole/entry/cartpole_ppo_rnd_main.py", line 60, in main
reward_model.train()
File "/home/jgp/.conda/envs/jgpenv/lib/python3.8/site-packages/ding/reward_model/rnd_reward_model.py", line 97, in train
self._train()
File "/home/jgp/.conda/envs/jgpenv/lib/python3.8/site-packages/ding/reward_model/rnd_reward_model.py", line 81, in _train
if self.cfg.obs_norm:
AttributeError: 'EasyDict' object has no attribute 'obs_norm'

task.parallel_ctx always stays the same in parallel mode.

I followed the code in the section "Distributed - Async and parallel" of doc to implement parallelism, but I found that task.parallel_ctx in both processes always remained as original.
And I can't find a function to change task.parallel_ctx in the files of di-engine /ding/framework/, so I can't judge what's wrong.
I want to know how the task.parallel_ctx in one process is synchronized with the ctx of another process. Thanks!

Image and dict observation spaces support by BC and GAIL implementations

In the README, could you provide the information on the observation/action spaces that are supported by each algorithm?

Specifically, do BC and GAIL implementations support image and dict observation spaces?

League Evaluation Metric

Added this issue as suggested by @PaParaZz1.

TrueSkill is a ranking metric developed by Microsoft for game matchmaking. Unlike ELO which just measures one agent's strength, TrueSkill can measure both strength and stability. Each player starts with mu=25.000 and sigma=8.333. Former one (mu) measures strength and the latter one (sigma) measures stability. After receiving payoffs of one matching, mu and sigma will be updated accordingly from the TrueSkill API. Final agent's score can be defined as mu - 3 * sigma to take both strength and stability into consideration.

Currently this metric is missing in the league demo. It would be better to add it.

Add slimevolleygym into dizoo

slimevolleygym is a pong-like physics game env from the open source community. It follows standard OpenAI gym interface. Naive PPO self-play achieves scores of -0.371 ± 1.085 in slimeVolley-v0 env against built-in AI report.
It would be good to benchmark opendilab's league training and see if it can generate higher results.

How to get more info data in reward model?

I have marked all applicable categories:
- exception-raising bug
- RL algorithm bug
- system worker bug
- system utils bug
- code design/refactor
- [+] documentation request
- [+] new feature request
[+] I have visited the readme and doc
[+] I have searched through the issue tracker and pr tracker

[+] I have mentioned version numbers, operating system and environment, where applicable:

import ding, torch, sys
print(ding.__version__, torch.__version__, sys.version, sys.platform)

Dear all,
I just try to customize a reward model by Di-engine, however I found we only can get those data(as input of collect_data funciton):

My question is how to get more data in reward model? such as the 'info' from env return.
Looking for replay and thank you.

PPO Policy Bug in Parallel Mode

The value_norm is used in _get_train_sample in PPO Policy, which is used in the _process_timestep function in collector. However, in parallel mode, the collector doesn't have value_norm which is only initialized in _init_learn. Thus, raise the exception "AttributeError: 'PPOCommandModePolicy' object has no attribute '_value_norm".

Bugs fix and new feature request for gfootball

There are many bugs in current vesion of DI-engine(V0.3.1) gfootball environment. I have tried to fix some of those, but some problems still exists which are beyond my ability. So I guess it needs systemic maintenance and updates. As far as the codes I have tested, only files in dizoo/gfootball/envs/tests works well(after some bug fix). And the fundamental features metioned in the doc(play with built-in AI & self-play) are basicly unusable.

Besides, since gfootball is an environment with great potential both in academy and practice. I strongly recommend following features being added:

Battle between customized models
Multi-agent support（5 vs 5, 11 vs 11）
League training support
Imitation learning algorithm enrich (paticularly GAIL, MAGAIL)

Thanks. I think DI-engine is an excelent potential framwork, hope it to be better.

gtrxl

How to run gtrxl with ppo policy? can someone provide an example?

How to use ObsNormEnv？

I would like to ask about the details of the use of state standardization.

State is a one-dimensional vector composed of three eigenvectors, where the first eigenvector has a value range of about 0-40, for example: [2, 1, 8, 12, 12, 4, 1, 2]. The range of the second eigenvector is approximately 0-11, for example: [2.3, 1.4, 0.2, 0.9, 8.4, 7.1, 8.3, 9.4]. The third eigenvector is the one-hot vector. Example: [0,0,1,0,0,0,0,0].

In this case, can I use ObsNormEnv directly? I don't think so.

So I would like to ask your advice, thank you very much.

Issues w.r.t the refactored logging system

I used to use 2> error.txt to redirect only error message, which leaves other information on terminal.
But after logging system refactoriation, 2> error.txt will redirect all the outputs, including evaluator and collector outputs and so on.

Can you provide some examples of multi-agent rl ?

Although, many multi-agent rl algorithms are implemented here. But there are very few examples of application shown here.

v2版本中，可以实现不同worker上的collector将收集到的信息传送到learner所在的worker上吗

Should put "log" and "ckpt_LEARNER_DATE_TIME" into same folder?

Now the checkpoints and log data are stored in two separate folders. Should we introduce a higher level folder, name EXPERIMENT_NAME or so to store all data of single experiment in it?

By the way, the name format of the checkpoint look quite weird to me, why there are two "_" between the date? I suggest to make it as "checkpoints_MODELNAME" and this is enough! It is not reasonable to write the "created time" in the folder name, since the folder contains lots of checkpoints that are created at different time.

安装DI-engine的问题

I have visited the readme and doc
I have searched through the issue tracker and pr tracker
I have mentioned version numbers, operating system and environment, where applicable:

print(torch.__version__, sys.version, sys.platform)
1.11.0+cu113 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)] win32

我已经安装了torch1.11.0+cu113 ,但是在使用pip install DI-engine时，却在下载不带cuda的torch-1.10.0-cp38-cp38-win_amd64.whl

这和手册里面不一致：
https://di-engine-docs.readthedocs.io/zh_CN/latest/01_quickstart/installation_zh.html

在安装好 CUDA 之后，当您在安装 DI-engine 的依赖项时，会自动获取和安装带有 Nvidia CUDA 加速的 PyTorch。

安装完成后，导致如下冲突：

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.12.0+cu113 requires torch==1.11.0, but you have torch 1.10.0 which is incompatible.
torchaudio 0.11.0+cu113 requires torch==1.11.0, but you have torch 1.10.0 which is incompatible.
tianshou 0.4.9 requires gym>=0.23.1, but you have gym 0.20.0 which is incompatible.
nbconvert 6.5.0 requires jinja2>=3.0, but you have jinja2 2.11.3 which is incompatible.

卸载torch，升级gym，重新安装torchtorch1.11.0+cu113,导致如下冲突：

di-engine 0.4.0 requires gym==0.20.0, but you have gym 0.25.1 which is incompatible.
di-engine 0.4.0 requires torch<=1.10.0,>=1.1.0, but you have torch 1.11.0+cu113 which is incompatible.

di-engine对版本要求太固定了。

Can you provide a complete example of environment migration?

Can you provide a complete example of environment migration? I can't move into my environment.

ImportError: cannot import name 'SampleCollector' from 'ding.worker'

From the quick start:

from ding.config import compile_config
from ding.envs import BaseEnvManager, DingEnvWrapper
from ding.model import DQN
from ding.policy import DQNPolicy
from ding.worker import BaseLearner, SampleCollector, BaseSerialEvaluator, AdvancedReplayBuffer
from dizoo.classic_control.cartpole.config.cartpole_dqn_config import cartpole_dqn_config

# compile config
cfg = compile_config(
    cartpole_dqn_config,
    BaseEnvManager,
    DQNPolicy,
    BaseLearner,
    SampleCollector,
    BaseSerialEvaluator,
    AdvancedReplayBuffer,
    save_cfg=True
)

This raises

Traceback (most recent call last):
  File "r2d2/main.py", line 7, in <module>
    from ding.worker import BaseLearner, SampleCollector, BaseSerialEvaluator, AdvancedReplayBuffer
ImportError: cannot import name 'SampleCollector' from 'ding.worker' (/Users/ethanbrooks/Library/Caches/pypoetry/virtualenvs/r2d2-3EWJbHPG-py3.8/lib/python3.8/site-packages/ding/worker/__init__.py)

It appears that BaseSerialEvaluator is also not present in the library.

Agent Demo List

This issue is a collection of various interesting agent demonstration trained by DI-engine, it will be updated continually.

Mario 1-1

mario_trained.mp4
Mario 1-2

mario_trained_1_2.mp4
rocket landing

rocket_landing.mp4
SMAC 5m VS 6m

5m6m.mp4
SMAC MMM

mmm.mp4
SMAC MMM2

mmm2.mp4
SMAC 3s5z

3s5z.mp4
lunarlander

lunarlander.mp4
gfootball
- rule-based bot vs rule-based bot
football_rvr.mp4
- trained agent vs rule-based bot
football_avr.mp4
slime_volley
- rule-base bot vs trained agent

Multi GPU training problem

My DING's version is f1bf66. My pytorch version is 1.7.1+cu101. My system is Linux 3.7.11 (default, Jul 27 2021, 14:32:16) \n[GCC 7.5.0].
I follow the docs' guidance to enable multi-gpu training. I add a config term config.policy.learn.multi_gpu=True in demo/simple_rl/ppo_train.py. But I get the following exception:

WARNING:root:If you want to use numba to speed up segment tree, please install numba first
pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
[ENV] Setting seed: 0
Traceback (most recent call last):
File "imgppo_train.py", line 189, in
main(main_config)
File "imgppo_train.py", line 158, in main
policy = PPOPolicy(cfg.policy, model=model)
File "/home/qhzhang/code/DI-engine/ding/policy/base_policy.py", line 81, in init
self._init_multi_gpu_setting(model)
File "/home/qhzhang/code/DI-engine/ding/policy/base_policy.py", line 101, in _init_multi_gpu_setting
broadcast(param.data, 0)
File "/home/qhzhang/anaconda3/envs/didrive/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 859, in broadcast
_check_default_pg()
File "/home/qhzhang/anaconda3/envs/didrive/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
"Default process group is not initialized"
AssertionError: Default process group is not initialized

Invalid output of `ding --help`

Currently when I use ding --help, it prints out the following:

usage: ding [-h] [--cfg CFG] [--seed SEED] [--device DEVICE]

optional arguments:
  -h, --help       show this help message and exit
  --cfg CFG
  --seed SEED
  --device DEVICE

which is not the expected help information.

I searched the code and found that this behaviour is caused by This line of code, please try to fix this.

Example of MAPPO

Would it be possible to get an example of training an MAPPO in a sample MA environment?

When the onpolicy of PPO processes the continuous action space, an error occurs.

The error message is as follows。

Traceback (most recent call last):
File "/root/cityflow/my_cityflow/PPO_Continuous/cityflow_ppo_continuous_train.py", line 201, in
serial_pipeline_onpolicy([main_config, create_config], seed=0)
File "/root/cityflow/my_cityflow/PPO_Continuous/cityflow_ppo_continuous_train.py", line 193, in serial_pipeline_onpolicy
learner.train(new_data, collector.envstep)
File "/root/DI-engine/ding/worker/learner/base_learner.py", line 166, in wrapper
ret = fn(*args, **kwargs)
File "/root/DI-engine/ding/worker/learner/base_learner.py", line 203, in train
log_vars = self._policy.forward(data)
File "/root/DI-engine/ding/policy/ppo.py", line 214, in _forward_learn
ppo_loss, ppo_info = ppo_error_continuous(ppo_batch, self._clip_ratio)
File "/root/DI-engine/ding/rl_utils/ppo.py", line 181, in ppo_error_continuous
dist_new = Independent(Normal(mu_sigma_new['mu'], mu_sigma_new['sigma']), 1)
File "/opt/conda/lib/python3.6/site-packages/torch/distributions/normal.py", line 50, in init
super(Normal, self).init(batch_shape, validate_args=validate_args)
File "/opt/conda/lib/python3.6/site-packages/torch/distributions/distribution.py", line 56, in init
f"Expected parameter {param} "
ValueError: Expected parameter loc (Tensor of shape (64, 1)) of distribution Normal(loc: torch.Size([64, 1]), scale: torch.Size([64, 1])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan],
[nan]], grad_fn=)

The parameter configuration is as follows。

policy=dict(
    cuda=False,
    action_space='continuous',
    recompute_adv=True,
    model=dict(
        obs_shape=20,
        action_shape=1,
        action_space='continuous',
        share_encoder = True,
        encoder_hidden_size_list=[256,64],
        actor_head_hidden_size = 64,
        actor_head_layer_num = 1,
        critic_head_hidden_size = 64,
        critic_head_layer_num  = 1, 
        activation  = nn.ReLU(),
        norm_type  = None,
        sigma_type  = 'conditioned',
        fixed_sigma_value  = 0.3,
        bound_type  = 'tanh',
    ),
    learn=dict(
        multi_gpu=False,
        epoch_per_collect=5, 
        batch_size=64,
        learning_rate=3e-4,   
        value_weight=0.5,     
        entropy_weight=0.01,  
        clip_ratio=0.2,
        adv_norm=True,
        value_norm=True,
        ignore_done=False,
        grad_clip_type='clip_norm',
        grad_clip_value=0.5,
    ),
    collect=dict(
        n_sample=int(640),
        unroll_len=1,
        discount_factor=0.99,
        gae_lambda=0.95,
    ),

Initialization bug in RegressionHead

I have run the ddpg and td3 algorithms which use RegressionHead and check their initialized weights. However, head.main.1 seemed to haven't initialized properly.

It's only for head.main.1, head.main.0 is initialized properly.