
alex-petrenko / sample-factory


High throughput synchronous and asynchronous reinforcement learning

Home Page: https://samplefactory.dev

License: MIT License

Python 97.99% Shell 0.20% Makefile 0.17% Jupyter Notebook 0.80% CMake 0.05% C++ 0.79%
reinforcement-learning

sample-factory's Introduction

(GitHub trophy and stats badges)

sample-factory's People

Contributors

alex-petrenko, alex404, amolchanov86, andrewzhang505, bartekcupial, boyuanlong, edbeeching, erikwijmans, gautams3, im-ant, klyuchnikova-ana, lanpartis, magicly, mrzhuzhe, paleziart, qazi0, sumeetbatra, tushartk, vballoli, wmfrank, zhehui-huang


sample-factory's Issues

[question] Callback and Debugging?

What is the best way to add a callback routine?
  • To collect custom metrics from all worker envs every N seconds and record them to TensorBoard
  • To evaluate custom metrics for early stopping (and end training, of course)

How to debug (using VS Code)? Just set '--train_in_background_thread' to False?
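
One possible direction for the first point (a sketch only, assuming custom per-episode scalars can be passed through the env's info dict; the key name 'episode_extra_stats' and the metric below are assumptions to verify against the Sample Factory version you run):

# Hypothetical wrapper that forwards a custom per-episode metric via the info dict.
import gym

class CustomMetricsWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self._episode_metric = 0.0

    def reset(self, **kwargs):
        self._episode_metric = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._episode_metric += float(reward)  # replace with your own metric
        if done:
            # Scalars placed here could be picked up by the stats machinery and
            # written to TensorBoard; the key name is an assumption.
            info['episode_extra_stats'] = {'my_metric': self._episode_metric}
        return obs, reward, done, info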

Learner is likely a bottleneck in your experiment (50 times)

Hello, I've been running sample-factory with much larger encoders than the defaults, and this warning keeps popping up:

Learner 0 accumulated too much experience, stop experience collection! Learner is likely a bottleneck in your experiment (50 times)

I'm wondering if there's anything you'd recommend to address this. Performance still seems okay (FPS ~25,000), although I noticed this tends to happen when my GPU's memory gets maxed out, and I'm wondering if there are e.g. some arguments I could set to improve performance.

Thanks!

[question] gpu benefit?

In another RL training setup with a GPU I noticed that GPU utilization was no more than ~10% (with both PyTorch and TF). I assume it's because Python is busy with the env and agent logic. I'm planning to buy a laptop, but I'm not sure whether it's worth paying more for a discrete Nvidia GPU if utilization is that low.

My question is: does sample-factory really use the GPU, and what is a typical utilization %?

Thanks.

quad-swarm-rl

Excuse me, I have two issues about quad-swarm:

  1. Although I have trained the model by running the commands, running the test command "python -m swarm_rl.enjoy --algo=APPO --env=quadrotor_multi --replay_buffer_sample_prob=0 --continuous_actions_sample=False --quads_use_numba=False --train_dir=PATH_TO_PROJECT/swarm_rl/train_dir --experiments_root=EXPERIMENT_ROOT --experiment=EXPERIMENT_NAME" raises an error:

Error:
File "/home/chengjiaming/anaconda3/envs/swarm-rl/lib/python3.8/site-packages/sample_factory/algorithms/utils/arguments.py", line 123, in load_from_checkpoint
raise Exception(f'Could not load saved parameters for experiment {cfg.experiment}')
Exception: Could not load saved parameters for experiment EXPERIMENT_NAME

  2. When I ran the command "./run_tests.sh", I got the following errors:

ERROR: test_quad_env (swarm_rl.env_wrappers.tests.test_quads.TestQuads)

Traceback (most recent call last):
File "/home/a409/users/cjm/quad-swarm-rl/swarm_rl/env_wrappers/tests/test_quads.py", line 41, in test_quad_env
self.assertIsNotNone(create_env(env_name, cfg=cfg))
File "/home/a409_home/anaconda3/envs/swarm-rl/lib/python3.8/site-packages/sample_factory/envs/create_env.py", line 22, in create_env
env = env_registry_entry.make_env_func(full_env_name, cfg=cfg, env_config=env_config)
File "/home/a409/users/cjm/quad-swarm-rl/swarm_rl/env_wrappers/quad_utils.py", line 129, in make_quadrotor_env
return make_quadrotor_env_single(cfg, **kwargs)
File "/home/a409/users/cjm/quad-swarm-rl/swarm_rl/env_wrappers/quad_utils.py", line 37, in make_quadrotor_env_single
env = QuadrotorEnv(
File "/home/a409/users/cjm/quad-swarm-rl/gym_art/quadrotor_single/quadrotor.py", line 828, in init
self.spec = gym_reg.EnvSpec(id='Quadrotor-v0', max_episode_steps=self.ep_len)
TypeError: init() got an unexpected keyword argument 'id'


Ran 10 tests in 40.950s

FAILED (errors=3)
Status: 1
0 means success (no errors), non-zero status indicates failed tests

[question] Feature importance?

Hi,
Let's say we have a bunch of features (> 100) and the model is trained and working (but took forever to train). Is there a way to tell which features are more important than others or not used at all by the model? So that the next training is more efficient/accurate? Can something like https://captum.ai be used with sample-factory?
The only way I know is the process of elimination. But that can take an exorbitant amount of time. I understand that this is beyond the scope of the repository. Any piece of information is appreciated.
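
Captum can at least be pointed at the policy network directly; a minimal sketch with Integrated Gradients (the stand-in model, layer sizes, and feature count below are made up for illustration and are not Sample Factory's actual actor-critic):

import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Stand-in for a trained policy network mapping a feature vector to action logits.
model = nn.Sequential(nn.Linear(120, 256), nn.ReLU(), nn.Linear(256, 4))

ig = IntegratedGradients(model)
obs = torch.randn(32, 120)                    # stand-in batch of observations
baseline = torch.zeros_like(obs)              # all-zeros reference input
attr = ig.attribute(obs, baselines=baseline, target=0)  # attributions for action 0's logit
print(attr.abs().mean(dim=0))                 # rough per-feature importance ranking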

[question] OpenAI Gym Env?

Hi,
Is it possible to use this with an OpenAI Gym env (non-image, like CartPole)? Yes, I know it's very basic, but I have a custom one that acts similarly to CartPole and can take forever to solve on a single machine. If yes, can you suggest how?

Is it possible to use it on Windows 10 (with an Nvidia GPU)?

Thanks!
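
For reference, a CartPole-like custom env is just a regular gym.Env; a minimal sketch is below (the registration uses plain gym, and hooking it up through the repo's gym example, e.g. sample_factory_examples.train_gym_env with an id like gym_MyCartpoleLike-v0, is an assumption to verify):

import gym
import numpy as np
from gym import spaces
from gym.envs.registration import register

class MyCartpoleLike(gym.Env):
    """Toy stand-in for a CartPole-like custom environment."""

    def __init__(self):
        self.observation_space = spaces.Box(low=-10.0, high=10.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self._steps = 0

    def reset(self):
        self._steps = 0
        return self.observation_space.sample()

    def step(self, action):
        self._steps += 1
        obs = self.observation_space.sample()  # replace with real dynamics
        reward = 1.0
        done = self._steps >= 200
        return obs, reward, done, {}

register(id='MyCartpoleLike-v0', entry_point=MyCartpoleLike)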

Memory leak on the actor worker?

In my experiments with LunarLanderContinuous I spotted a constant memory increase on the actor workers. We need to find the reason for it and fix it.

Please first check whether it reproduces only on LunarLander or on other envs too (e.g. Doom).

Condition event not triggering on macOS laptop

Running the following command on macOS:
python -m sample_factory.algorithms.appo.train_appo --env mujoco_halfcheetah --algo APPO --device cpu

gets permanently stuck at the message:
Waiting for policy worker 0-0 to finish initialization...

Any advice on how I could get it working?


The problem seems to be with the shared buffers:
while self.shared_buffers.stop_experience_collection[self.policy_id]:

with a debugging setup like so:

# in policy_worker.py
while self.shared_buffers.stop_experience_collection[self.policy_id]:
    log.debug('[PW]: %s', self.shared_buffers.stop_experience_collection[self.policy_id])
    ....
# ... 
# in learner.py 
log.debug('[Learner]: %s', self.stop_experience_collection[self.policy_id])

You get results like

[2022-01-31 10:23:51,941][55743] [Learner]: False
[2022-01-31 10:23:51,942][55758] [PW]: True

Add evaluation worker

Ideally, we want a worker that can render an episode every X minutes and post a GIF animation or video to TensorBoard.

The best solution (I think) is to repurpose the existing ActorWorker for this. We can just add a mode where it also saves the environment frames (rendering them if necessary).
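
For the TensorBoard side, writing a stack of rendered frames as a video summary is straightforward; a sketch with tensorboardX (frame collection inside the worker is not shown, and 'frames' below is a random placeholder):

import torch
from tensorboardX import SummaryWriter

writer = SummaryWriter('train_dir/eval_videos')
# frames: uint8 tensor [T, H, W, C], e.g. collected from env.render('rgb_array')
frames = torch.randint(0, 255, (100, 64, 64, 3), dtype=torch.uint8)
vid = frames.permute(0, 3, 1, 2).unsqueeze(0)  # add_video expects [N, T, C, H, W]
writer.add_video('eval/episode', vid, global_step=0, fps=15)
writer.close()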

Question about CTDE

Hello, I have a question about centralized training and decentralized execution. As far as I understand it, this library uses something like Independent Asynchronous PPO and does not take all observations into account for critic evaluation.

Is this assumption correct? How difficult would it be to implement something like MAPPO (given the current design)?

I have observed that it works well with self-play for the Doom environment, maybe because the agent cannot alter the environment that much.
In my case, in order to encourage faster speeds, reduce reward shaping, and introduce curriculum training, I reformulated my single-agent car racing game as a competitive RL problem where two players compete and the first agent to reach the finish gets the reward. But agents from different teams do not see each other's observations, only their own screen (observation).

Do you think Sample Factory can be used in such a setting?
Thanks in advance!

How to use all cores?

Can't utilize all 16 logical processors (8 cores, 2 threads each) on an AMD Ryzen laptop (AMD GPU :( ). I can see that at any point 8 processors are maxed out while the other 8 vary between 0 and 50%. I tried various num_workers values between 2 and 16 as well as various policy_workers_per_policy values. Any suggestions are welcome.

Command:
python -m sample_factory_examples.train_gym_env --algo=APPO --use_rnn=False --recurrence=1 --with_vtrace=False --batch_size=512 --hidden_size=256 --encoder_type=mlp --encoder_subtype=mlp_mujoco --reward_scale=0.1 --save_every_sec=10 --experiment_summaries_interval=20 --experiment=example_gym_cartpole-v1 --env=gym_CartPole-v1 --device=cpu --num_envs_per_worker=2 --num_workers=7 --train_in_background_thread=true --policy_workers_per_policy=4

PackedSequence should be unpacked first before any chunk manipulations

Sample Factory uses PackedSequence in the RNN case. When the actor and critic don't share weights, the head output as well as the RNN states need to be split into chunks:

def _core_rnn(self, head_output, rnn_states):
        """
        This is actually pretty slow due to all these split and cat operations.
        Consider using shared weights when training RNN policies.
        """

        num_cores = len(self.cores)
        head_outputs_split = head_output.chunk(num_cores, dim=1)
        rnn_states_split = rnn_states.chunk(num_cores, dim=1)

head_output and rnn_states are PackedSequence objects instead of plain tensors, which causes the exception:

Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/sample_factory/algorithms/appo/learner.py", line 1156, in _train_loop
self._process_training_data(data, timing, wait_stats)
File "/usr/local/lib/python3.7/dist-packages/sample_factory/algorithms/appo/learner.py", line 1106, in _process_training_data
train_stats = self._train(buffer, batch_size, experience_size, timing)
File "/usr/local/lib/python3.7/dist-packages/sample_factory/algorithms/appo/learner.py", line 737, in _train
core_output_seq, _ = self.actor_critic.forward_core(head_output_seq, rnn_states)
File "/usr/local/lib/python3.7/dist-packages/sample_factory/algorithms/appo/model.py", line 175, in forward_core
return self.core_func(head_output, rnn_states)
File "/usr/local/lib/python3.7/dist-packages/sample_factory/algorithms/appo/model.py", line 147, in _core_rnn
head_outputs_split = head_output.chunk(num_cores, dim=1)
AttributeError: 'PackedSequence' object has no attribute 'chunk'
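
One possible fix direction (a sketch only, not the actual patch): unpack the PackedSequence before any chunk/cat manipulations, split the padded tensor along the feature dimension, then re-pack each chunk for the individual RNN cores.

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

def chunk_packed(head_output_packed, num_cores):
    # PackedSequence -> padded tensor [T, B, D] plus the original sequence lengths
    padded, lengths = pad_packed_sequence(head_output_packed)
    chunks = padded.chunk(num_cores, dim=2)  # split along the feature dimension
    # Re-pack each chunk so downstream RNN cores still receive PackedSequence inputs
    return [pack_padded_sequence(c, lengths, enforce_sorted=False) for c in chunks]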

[question] How to run evals during training?

I am trying to figure out how I could run quick evals during training.

I saw an old issue about creating an eval worker using the current infrastructure,
but that would most likely take me quite some time to implement.

I was wondering if there was a quick and dirty way I could do this?

For instance, is there a way I could access the policy from within the environment,
and thus run evals in the reset function?

I could also run a bash loop that alternates between the training script
and the eval script, but reinitializing the training script every 1M steps,
although automated, would probably be quite a slowdown.

Thanks!

Design questions

Hi all,

As brought up in #47, I'm trying to get an async DQN with some of the Rainbow bells and whistles implemented with this framework. In familiarizing myself with the current setup, I had some questions about the intended ways to extend this.

  • should each algorithm define its own rollout/actor workers, learners, and policy workers? I imagine there is duplicated code this way, with the shared_buffer / thread management stuff. What are your recommendations on that front?
  • do you guys have a contributing guide that I should read before starting?

Mujoco environments in SF2

  • Make sure we have an example folder and we can run 6 standard Mujoco environments
  • Try to match results from OpenRLBenchmark in terms of sample efficiency
  • Try to match the best wall-time results using envpool

[question] subset of CPUs

Is there a way to restrict sample-factory to only use a subset of the CPUs on the machine instead of splitting them all amongst the number of workers? Thanks!
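
I'm not aware of a dedicated flag for this, but as a workaround sketch one can pin the launcher process (and everything it forks) to a subset of cores with Linux CPU affinity, e.g. os.sched_setaffinity or taskset, before starting training; the core ids below are arbitrary examples:

import os

# Restrict this process and the workers it forks to cores 0-7 (Linux only).
os.sched_setaffinity(0, set(range(8)))
# ...then launch the usual Sample Factory entry point from this same process.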

version_diff keeps increasing

Thanks for building this awesome library. I believe I am having some trouble getting any example to work, and it would be great if you had any suggestions for what I could try.

Using the example
python -m sample_factory.algorithms.appo.train_appo --env=doom_basic --algo=APPO --train_for_env_steps=3000000 --num_workers=20 --num_envs_per_worker=20 --experiment=doom_basic

The policy lag seems to keep increasing linearly, which I assume is not expected? It's like the model version isn't being updated.

[2022-03-02 23:05:46,460][18482] Fps is (10 sec: 20404.3, 60 sec: 20404.3, 300 sec: 20404.3). Total num frames: 241664. Throughput: 0: 3098.6. Samples: 54300. Policy #0 lag: (min: 52.0, avg: 52.0, max: 52.0)
[2022-03-02 23:05:46,460][18482] Avg episode reward: [(0, '-1.416')]
[2022-03-02 23:05:51,461][18482] Fps is (10 sec: 19999.7, 60 sec: 19883.6, 300 sec: 19883.6). Total num frames: 335872. Throughput: 0: 4499.1. Samples: 83850. Policy #0 lag: (min: 77.0, avg: 77.0, max: 77.0)
[2022-03-02 23:05:51,461][18482] Avg episode reward: [(0, '-1.509')]
[2022-03-02 23:05:56,480][18482] Fps is (10 sec: 19622.1, 60 sec: 20013.5, 300 sec: 20013.5). Total num frames: 438272. Throughput: 0: 4965.4. Samples: 113450. Policy #0 lag: (min: 104.0, avg: 104.0, max: 104.0)
[2022-03-02 23:05:56,480][18482] Avg episode reward: [(0, '-1.825')]
[2022-03-02 23:06:01,488][18482] Fps is (10 sec: 19606.6, 60 sec: 19772.8, 300 sec: 19772.8). Total num frames: 532480. Throughput: 0: 4454.0. Samples: 128060. Policy #0 lag: (min: 104.0, avg: 104.0, max: 104.0)
[2022-03-02 23:06:01,489][18482] Avg episode reward: [(0, '-1.599')]
[2022-03-02 23:06:06,514][18482] Fps is (10 sec: 19593.5, 60 sec: 19873.5, 300 sec: 19873.5). Total num frames: 634880. Throughput: 0: 5167.0. Samples: 157920. Policy #0 lag: (min: 131.0, avg: 131.0, max: 131.0)
[2022-03-02 23:06:06,514][18482] Avg episode reward: [(0, '-1.209')]
[2022-03-02 23:06:11,515][18482] Fps is (10 sec: 19609.2, 60 sec: 19726.1, 300 sec: 19726.1). Total num frames: 729088. Throughput: 0: 5162.6. Samples: 187380. Policy #0 lag: (min: 157.0, avg: 157.0, max: 157.0)


Environment:
Running Ubuntu 20.04 in WSL 2 (maybe that's the problem).
sample-factory==1.120.0
torch==1.7.1+cu110 (I have tried 1.10 as well)

quad-swarm

Excuse me, I have two issues about quad-swarm:

  1. Although I have trained the model by running the commands, running the test command "python -m swarm_rl.enjoy --algo=APPO --env=quadrotor_multi --replay_buffer_sample_prob=0 --continuous_actions_sample=False --quads_use_numba=False --train_dir=PATH_TO_PROJECT/swarm_rl/train_dir --experiments_root=paper_quads_multi_mix_baseline_8a_attn_v116/quad_mix_baseline-8_mixed_attn_ --experiment=00_quad_mix_baseline-8_mixed_attn_q.n.e.typ_attention_see_0" raises an error:

Error:
File "/home/chengjiaming/anaconda3/envs/swarm-rl/lib/python3.8/site-packages/sample_factory/algorithms/utils/arguments.py", line 123, in load_from_checkpoint
raise Exception(f'Could not load saved parameters for experiment {cfg.experiment}')
Exception: Could not load saved parameters for experiment EXPERIMENT_NAME

  2. When I ran the command "./run_tests.sh", I got the following errors:

ERROR: test_quad_env (swarm_rl.env_wrappers.tests.test_quads.TestQuads)

Traceback (most recent call last):
File "/home/a409/users/cjm/quad-swarm-rl/swarm_rl/env_wrappers/tests/test_quads.py", line 41, in test_quad_env
self.assertIsNotNone(create_env(env_name, cfg=cfg))
File "/home/a409_home/anaconda3/envs/swarm-rl/lib/python3.8/site-packages/sample_factory/envs/create_env.py", line 22, in create_env
env = env_registry_entry.make_env_func(full_env_name, cfg=cfg, env_config=env_config)
File "/home/a409/users/cjm/quad-swarm-rl/swarm_rl/env_wrappers/quad_utils.py", line 129, in make_quadrotor_env
return make_quadrotor_env_single(cfg, **kwargs)
File "/home/a409/users/cjm/quad-swarm-rl/swarm_rl/env_wrappers/quad_utils.py", line 37, in make_quadrotor_env_single
env = QuadrotorEnv(
File "/home/a409/users/cjm/quad-swarm-rl/gym_art/quadrotor_single/quadrotor.py", line 828, in init
self.spec = gym_reg.EnvSpec(id='Quadrotor-v0', max_episode_steps=self.ep_len)
TypeError: init() got an unexpected keyword argument 'id'


Ran 10 tests in 40.950s

FAILED (errors=3)
Status: 1
0 means success (no errors), non-zero status indicates failed tests

An option to use symmetric KL-divergence with a uniform prior as an exploration objective

Currently the exploration loss is based on maximizing the entropy of the probability distribution. Note that mathematically, maximizing the entropy of a categorical probability distribution is exactly the same as minimizing the (regular) KL-divergence between this distribution and a uniform prior.

Before the ICML publication we also experimented with a different exploration objective. The downside of using the entropy term (or the regular asymmetric KL-divergence) is that the penalty does not increase as the probabilities of some actions approach zero. I.e. numerically, there is almost no difference between an action distribution with probability epsilon > 0 for some action and a distribution with probability exactly zero for that action. For many tasks the first (epsilon) distribution is preferable because we keep some (albeit small) amount of exploration, while the second distribution will never explore this action again.

Unlike the entropy term, the symmetric KL-divergence between the action distribution and a uniform prior approaches infinity as the entropy of the distribution approaches zero, so it can prevent the pathological situation where the agent stops exploring.

Empirically, the symmetric KL-divergence yielded slightly better results on doom_battle problems.

The old implementation can be found up until commit 1ef8cb2; see learner.py, lines 512-514.
It got removed because we wanted to simplify action distributions and implement continuous actions, and it was just too much to support. I didn't realize back then that it actually contributed to performance on some tasks.

What is required of the new implementation:

  1. Categorical and Tuple (of Categorical) action distributions should implement a method .symmetric_kl_with_uniform_prior() (see the sketch after this list). We will only use a uniform prior, so there is no need for the mechanism for adding a custom prior that existed in the old version.
  2. There should be a CLI option to switch between the entropy loss and the symmetric-KL loss.
  3. There should be no impact on learner performance. Ideally, we should maybe even get rid of the if statement. Instead, we can write two functions that calculate the exploration loss and then bind one of them to a variable based on the CLI parameter once (before training starts).
  4. Theoretically, the KL-divergence loss can approach infinity, so we should cap it at some reasonable value to prevent numerical issues.
  5. This loss does not make sense for continuous actions, I think (what is the prior?). So there should be a reasonable fallback, i.e. an exception is thrown telling the user that the chosen configuration is incorrect.
  6. After implementing this, we should compare results on doom_battle and doom_battle2 with different values of the exploration loss coefficient. If there is a performance improvement on these tasks, we should set the new loss as the default for Doom envs in doom_params.py.
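
A minimal sketch of the proposed loss (not the removed implementation from commit 1ef8cb2): symmetric KL between a categorical action distribution and a uniform prior, clamped per item 4 to avoid the blow-up as probabilities approach zero.

import torch
import torch.nn.functional as F

def symmetric_kl_with_uniform_prior(action_logits, clip=30.0):
    log_p = F.log_softmax(action_logits, dim=-1)           # log pi(a|s)
    p = log_p.exp()
    num_actions = action_logits.shape[-1]
    log_u = -torch.log(torch.tensor(float(num_actions)))   # log(1/A), the uniform prior
    kl_p_u = (p * (log_p - log_u)).sum(dim=-1)              # KL(pi || uniform)
    kl_u_p = ((log_u - log_p) / num_actions).sum(dim=-1)    # KL(uniform || pi)
    return torch.clamp(kl_p_u + kl_u_p, max=clip).mean()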

test

I have successfully run the test with the default parameters, but the result is very poor: it shows the UAV flying along the ground.

[Q?] How to max fps?

Hey all,
I have a system with 13 GB RAM, two cores, and a GPU. While training I notice that top shows overall ~25% CPU utilization (~55% per core in the columns), and the same for the GPU. FPS is around 2700. The gym environment is similar to CartPole; args are below. If I push num_envs_per_worker above 40 it logs a lot of: "Waiting for trajectory buffer". I also tried increasing batch_size=1024.

What's the best way (methodology) to max fps?

   '--num_envs_per_worker=40',
    '--policy_workers_per_policy=2',
    '--batch_size=512',
    '--env=gym_Myrl-v0',
    '--experiment=' + timestamp_dir,
    '--recurrence=1',
    '--num_batches_per_iteration=4',
    '--hidden_size=256', #default 256
    '--encoder_type=mlp',
    '--encoder_subtype=mlp_mujoco',
    '--reward_scale=0.1',
    '--save_every_sec=30',
    '--experiment_summaries_interval=3',
    '--ppo_epochs=4',
    '--max_policy_lag=140',  # orig:100000
    '--seed=5', 
    '--use_rnn=False',
    '--with_vtrace=False',
    '--algo=APPO',


On another machine with 32 cores/64 GB RAM and a 1070 Ti GPU I'm seeing: Learner 0 accumulated too much experience, stop experience collection! Learner is likely a bottleneck in your experiment (50 times)

I tried increasing --learner_main_loop_num_cores anywhere from 2 to 10. Any ideas what is happening?

Exploration loss question

I see you have two options for the exploration loss:
symmetric_kl_with_uniform_prior and entropy, and you used symmetric KL as the loss in the ViZDoom config.
Is there any paper or comparison between the two approaches?

[question] reproduce results?

Hi,
I find that reproducing results helps a lot while tweaking the various parameters, by removing the randomness factor. Is it possible to reproduce the results exactly?
I used the argument --seed=5 in train_gym.py, yet it does not lock the results. I get different results in TensorBoard and also different completion times, so it's harder to pin down good parameters/code. (The gym env I use is reproducible with another agent.)

Note the completion time difference for the two runs below.

(TensorBoard screenshot of the two runs)

[question] Validation results not good

In a gym classic-control-like env, training yields good results and the model completes the task. Running enjoy with unseen data yields considerably lower results.

  • Added random noise to observations to prevent the model from memorizing (in both training and validation).
  • Applied the same normalization, and also no normalization, to training/validation. Same results.

Any idea what it could be or what I can try? Is there something in the settings/switches worth trying? Your insight is greatly appreciated!

Get a documentation website going

We definitely need something like ReadTheDocs for Sample Factory. Things are getting too sophisticated to keep documenting everything in the README.

@wmFrank I think this can be a great next project for you after tests and coverage are working!

@edbeeching do you think we should use ReadTheDocs or are there better alternatives now?

TypeError: cannot deepcopy this pattern object

Hello, can you help me? This sample-factory error happened when I ran the command: "python3 -m swarm_rl.train --env=quadrotor_multi --train_for_env_steps=1000000000 --algo=APPO --use_rnn=False --num_workers=36 --num_envs_per_worker=4 --learning_rate=0.0001 --ppo_clip_value=5.0 --recurrence=1 --nonlinearity=tanh --actor_critic_share_weights=False --policy_initialization=xavier_uniform --adaptive_stddev=False --with_vtrace=False --max_policy_lag=100000000 --hidden_size=256 --gae_lambda=1.00 --max_grad_norm=5.0 --exploration_loss_coeff=0.0 --rollout=128 --batch_size=1024 --quads_use_numba=True --quads_mode=mix --quads_episode_duration=15.0 --quads_formation_size=0.0 --encoder_custom=quad_multi_encoder --with_pbt=False --quads_collision_reward=5.0 --quads_neighbor_hidden_size=256 --neighbor_obs_type=pos_vel --quads_settle_reward=0.0 --quads_collision_hitbox_radius=2.0 --quads_collision_falloff_radius=4.0 --quads_local_obs=6 --quads_local_metric=dist --quads_local_coeff=1.0 --quads_num_agents=8 --quads_collision_reward=5.0 --quads_collision_smooth_max_penalty=10.0 --quads_neighbor_encoder_type=attention --replay_buffer_sample_prob=0.75 --anneal_collision_steps=300000000 --experiments_root=EXPERIMENT_ROOT --experiment=swarm_rl"

Memory leaks

Hello, I'm trying to use SF with a custom gym env with observation size 360. I use lots of envs; here is my config:

  cfg.num_workers = 16
  cfg.num_envs_per_worker = 256
  cfg.num_batches_per_iteration = 32
  cfg.traj_buffers_excess_ratio=1.
  cfg.learning_rate = 1e-4
  cfg.ppo_clip_ratio = .2
  cfg.rollout = 128
  cfg.recurrence = 16
  cfg.ppo_epochs = 1
  cfg.batch_size = cfg.num_workers * cfg.num_envs_per_worker * cfg.rollout // cfg.num_batches_per_iteration
  cfg.with_vtrace = False
  cfg.num_minibatches_to_accumulate = -1
  cfg.gamma = 0.999

With 4096 envs on Ubuntu 20.04 I see SF memory consumption growing by an additional 1 GB every 5-10 minutes, so my 32 GB is not enough for 1 hour of training. I tried to investigate a possible memory leak in SF with pympler, but so far without success, so could you please check it? I also use a custom encoder. My throughput is around 50k samples.
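
One way to chase this with pympler (a sketch; where exactly to hook it into the worker processes is left open) is to periodically summarize live objects and diff successive snapshots to see which object types keep growing:

from pympler import muppy, summary

_prev = None

def log_object_growth():
    """Print the object types whose count/size grew the most since the last call."""
    global _prev
    current = summary.summarize(muppy.get_objects())
    if _prev is not None:
        summary.print_(summary.get_diff(_prev, current))
    _prev = current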

Cannot reproduce scores on dmlab-30

Hi @alex-petrenko,

I ran the code on dmlab-30 with exactly the same arguments/configuration as in the README.
However, as shown in the figure below, the obtained scores (mean capped) are lower than the reference scores in the paper (Fig. 5) by about 10%.
Is this performance gap still within a reasonable range due to randomness?
(figure: mean capped scores compared to the paper's reference)

In addition, when I increase the rollout/recurrence length to 100 or incorporate the previous action and reward into the current LSTM inputs like the original IMPALA, the scores decrease further. Did you observe similar results, or is there any reason for these differences in your setting/architecture compared to the original IMPALA?

Recurrent policy details

Hi @alex-petrenko !

I've got a brief question on your recurrent policy implementation.
Are you feeding complete episode trajectories to the learner's RNN?

That's what it looks like to me, because I'm not finding any padding or a distinct sequence length for truncated BPTT.
All I found was PyTorch's PackedSequence object.

RTX 30xx compatibility

Hello, I was wondering if you could update the environment.yaml file to more recent versions of PyTorch and CUDA (i.e. 1.9.* and 11.*) to "officially" support RTX 30xx cards.

I have an AMD 5950x and an RTX 3090, and I've been able to get sample-factory running with vizdoom by installing vizdoom with pip in a conda environment and running sample-factory straight from the repository. The performance is great (I'm getting about 100K FPS), but I guess because I'm not using your vizdoom branch I'm locked out of the multi-agent framework. I'm also hoping to rely on sample-factory for a longer-term research project, and I worry that other issues might arise if I have to use it in this "unofficial" way.

Thanks for your hard work. This is an impressive piece of software!

Training with RLlib

Hi @alex-petrenko, I have two questions about training with RLlib:
  1. Do you know DQN training hyperparameters that work well in other ViZDoom scenarios (besides basic and health gathering)?
  2. Can this yaml file be used in all dmlab scenarios?
Any feedback or help/tips would be greatly appreciated!

Can we train maddpg-pytorch on sample-factory?

Hello, I'm looking at simulators for running a multi-agent algorithm (MADDPG: https://github.com/shariqiqbal2810/maddpg-pytorch). Can anyone help me figure out whether I can run the PyTorch MADDPG implementation on sample-factory?

CI for SF2 (fix tests)

  • We need automatic unit tests to run after we commit to the branch.
  • Most tests are broken because APIs have changed. We need to get at least a couple of tests running (prefer fixing the system tests that run an entire small experiment).

Advice for Sample Factory use?

Hi Alex, this is very impressive work!
My use case is for an environment where

  • each env requires its own process
  • on a 16-core machine, maximum sample throughput is reached with a small number (~40) of environments due to poor cache usage (comes with the env)
  • but the env is fast; sample throughput is bottlenecked by policy inference speed, not the simulation/step itself

Some requirements for training are that

  • Self-play that plays against various older versions of the agent 20% of the time
  • 1 GPU is probably fine for learning

Given these characteristics, I theorized I may be better off rolling my own setup for the sampler where rollout & policy workers are combined, so that each rollout worker has a copy of the policy. Self-play details are not too difficult, i.e., periodically save a copy of the params to a local directory and reserve a few workers to play against these older versions.

Do you have any thoughts/advice on the right approach to this case? Thank you! (Feel free to ask for any clarifications)

Multi-GPU learner

This is a very desirable feature, especially to push the throughput of single-agent training to 200K FPS and beyond.

Plan: use NCCL and/or Torch DistributedDataParallel.
We can spawn one learner process per GPU and then split the data equally (e.g. with N learners, learner #k gets all trajectories with index % N == k).
Then we average the gradients. This will also help to parallelize the batching since there will be multiple processes doing this.

An alternative is to spawn the learner process (one per policy) and then have it spawn child processes for individual GPUs. This can be easier to implement.

To take full advantage of this, we also need to support policy workers on multiple GPUs. This requires exchanging the parameter vectors between learner and policy worker through CPU memory, rather than shared GPU memory. This can be a step 1 of the implementation.
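
A minimal sketch of the gradient-averaging part with DistributedDataParallel over NCCL, one process per GPU (the toy model, addresses, and batch below are placeholders, not the actual learner wiring):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def learner_process(rank, world_size):
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(nn.Linear(64, 8).cuda(rank), device_ids=[rank])
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    # Each learner consumes its own shard of trajectories (e.g. index % world_size == rank);
    # DDP averages the gradients across GPUs during backward().
    batch = torch.randn(512, 64, device=rank)
    loss = model(batch).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    dist.destroy_process_group()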

[question] MuZero anyone?

Hey all,

Wondering if you've seen MuZero, the latest from DeepMind? Is it something that could improve SF speed/results? I've seen some implementations on GitHub but I'm not knowledgeable enough to implement them in SF.

Below is a comparison chart from https://arxiv.org/pdf/1911.08265.pdf

(comparison chart image)

Adding custom maps to doom environment

Hello,

I'm wondering what you would think is the most "idiomatic" way to run sample-factory with the vizdoom environment family but a custom wad and cfg file. I notice in doom_utils.py there's a hard-coded DOOM_ENVS variable that specifies the pre-built environments, which seems like it would be straightforward to modify, but I'd prefer not to have to modify sample-factory directly for my own project (I'd rather be able to pull it in with pip install sample-factory). On the other hand, in your train_custom_env_custom_model.py file I could specify a whole new environment, but I'd rather not lose any of the vizdoom functionality.

As a side note, for some reason when I run pip install sample-factory it doesn't copy the scenario folder into my conda directory, and so I can't run the example commands from your README.md. Not sure if this is me doing something foolish.

Thanks!

Running a trained agent raises: RuntimeError: view size is not compatible with input tensor's size and stride

I managed to get this repo running on python 3.8 (Ubuntu 18.04) without the miniconda environment by simply pip installing the Python dependencies manually:

pip install torch
pip install psutil
pip install tensorboardX
pip install git+https://github.com/alex-petrenko/ViZDoom@doom_bot_project#egg=vizdoom
pip install faster-fifo
pip install opencv-python
pip install filelock
pip install threadpoolctl

Training runs as before, but running a trained agent (for example, via python -m algorithms.appo.enjoy_appo --env=doom_battle --algo=APPO --experiment=doom_battle_w20_v20) leads to the following error:

  File "/home/da/git/ai/sample-factory/algorithms/appo/model_utils.py", line 161, in forward
    x = x.view(-1, self.conv_head_out_size)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

One can fix this by going to algorithms/appo/model_utils.py and, inside the ConvEncoder(EncoderBase) class, in the forward(self, obs_dict) method, changing

x = x.view(-1, self.conv_head_out_size)

to

x = x.contiguous().view(-1, self.conv_head_out_size)
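
An equivalent fix, suggested by the error message itself, is .reshape, which copies when the tensor is non-contiguous; a small illustration (not the repo's actual code path):

import torch

x = torch.randn(4, 3, 8, 8).permute(0, 2, 3, 1)  # permute makes x non-contiguous
# x.view(4, -1)               # would raise the same RuntimeError
flat = x.reshape(4, -1)        # works on non-contiguous tensors (copies if needed)
# equivalently: x.contiguous().view(4, -1), as in the fix above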

environment initialization

I wonder why environment initialization always fails when the number of agents exceeds 32. How can I solve it?
The warning is:
"for split_idx, env_runner in enumerate(self.env_runners):
TypeError: 'NoneType' object is not iterable"
and then it shows "Waiting for 1 trajectory buffers..." continually.
