kchua / handful-of-trials

Experiment code for "Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models"

License: MIT License

Dockerfile 1.05% Python 97.50% Jupyter Notebook 1.46%

model-based-rl reinforcement-learning

handful-of-trials's Issues

About the running time of experiment.

Thanks for the great work.

I would like to ask about the execution time for each environment.

For example, I am running the Cartpole MBExperiment. On my computer, the average action selection time is about 3 to 4 seconds. Is that normal?

Thanks.
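
For a rough sense of where the time goes, the sketch below counts the model predictions needed to select one action. The numbers are the Cartpole settings printed in the config dump of the rendering issue further down this page (popsize 400, max_iters 5, plan_hor 25, npart 20); that same log reports an average action selection time of about 1.2 s on a GTX 1050. The assumption that every CEM iteration rolls out every candidate with every particle over the full horizon is mine and is not confirmed by the authors.

```python
def model_evals_per_action(popsize, max_iters, plan_hor, npart):
    """Upper bound on single-step model predictions per selected action.

    Assumes each CEM iteration evaluates all candidates, each with all
    particles, over the whole planning horizon.
    """
    return popsize * max_iters * plan_hor * npart


cartpole_evals = model_evals_per_action(popsize=400, max_iters=5, plan_hor=25, npart=20)
print(cartpole_evals)  # 1000000
```

Under that assumption, a per-action time on the order of seconds is plausible on modest hardware, since it corresponds to about a million model predictions.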

Equations for rewards and returns

The log file has two separate parameters called rewards and returns. What are the differences between the two?
Are there mathematical equations to define the two terms?
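
As a sketch of the usual convention (an assumption about this log file, not something confirmed by the authors): rewards would hold the per-step reward r_t of each rollout, and returns the undiscounted sum over the rollout, R = sum_t r_t. The values below are made up for illustration.

```python
import numpy as np

# Hypothetical per-step rewards for two rollouts of length 4.
rewards = np.array([
    [1.0, 0.5, 0.25, 0.25],
    [0.0, 1.0, 1.0, 2.0],
])

# Return of each rollout: undiscounted sum of its per-step rewards.
returns = rewards.sum(axis=1)
print(returns)  # [2. 4.]
```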

var_loss might be inaccurate

handful-of-trials/dmbrl/modeling/models/BNN.py

Line 441 in 77fd880

var_losses = tf.reduce_mean(tf.reduce_mean(log_var, axis=-1), axis=-1)

Am I right to interpret the var_losses term as the second term in Eq. 1 (i.e. the log-determinant term, log det Σ)?
If so, should the inner reduce_mean be reduce_prod to get the determinant of the covariance matrix?
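
A small numerical check may help here. Assuming the network predicts a diagonal covariance through per-dimension log-variances (which the log_var name suggests, but which is my reading rather than a confirmed fact), log det Σ is the sum of the log-variances. Averaging log_var over the last axis then equals log det Σ divided by the output dimension. A product would apply to the variances themselves, not to their logarithms.

```python
import numpy as np

rng = np.random.default_rng(0)
log_var = rng.normal(size=4)           # hypothetical per-dimension log-variances
cov = np.diag(np.exp(log_var))         # diagonal covariance matrix

sign, logdet = np.linalg.slogdet(cov)  # numerically stable log-determinant

sum_log_var = log_var.sum()            # log det of a diagonal covariance
mean_log_var = log_var.mean()          # same quantity scaled by 1 / dimension

print(np.isclose(logdet, sum_log_var))                  # True
print(np.isclose(mean_log_var, logdet / log_var.size))  # True
```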

Thanks! This is great work! A quick question: does this support multiple GPUs? I don't think I see settings that allow me to specify the number of GPUs used for training. If not, is there a plan to add it? And if there is no such plan, where should I look to add the support myself?

-C

More information on reacher task

I'm trying to understand the cost function for the reacher task which uses the function here.
However, I can't find a reference for what each index of the observation is in the Reacher task in the OpenAI documentation.

Can someone help explain what this function is computing? Also does anyone have a better source of documentation on the state observations in the Reacher task?

Could you please provide the data corresponding to the experiment figures in the PETS paper? We would like to cite your paper as one of the state-of-the-art works in our paper, but our time is limited. Thank you!

Using with other RL env

Hi, thanks so much for the work.

I would like to ask how we can use this with other gym-like environments.

For example, I created a robot simulation env that has an OpenAI Gym-like API (step, start, etc.). How could I adapt that env to your implementation?
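
As a rough sketch of what an adaptation might involve: the config dump in the Docker issue on this page lists the hooks the controller receives (obs_preproc, obs_postproc, targ_proc, obs_cost_fn, ac_cost_fn, plus plan_hor and task_hor). The toy environment, the class names, and the hook bodies below are hypothetical placeholders written with NumPy only; the real config modules in dmbrl/config may use different signatures and operate on TensorFlow tensors.

```python
import numpy as np


class ToyRobotEnv:
    """Hypothetical gym-like env: a 2-D point that moves by the action."""

    def __init__(self):
        self.state = np.zeros(2)

    def reset(self):
        self.state = np.zeros(2)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + 0.1 * np.asarray(action)
        reward = -float(np.sum(self.state ** 2))
        return self.state.copy(), reward, False, {}


class ToyRobotConfig:
    """Hypothetical hooks mirroring the names in the printed config."""

    TASK_HORIZON = 50  # illustrative value for task_hor
    PLAN_HOR = 10      # illustrative value for plan_hor

    @staticmethod
    def obs_preproc(obs):
        return obs                    # features fed to the dynamics model

    @staticmethod
    def obs_postproc(obs, pred):
        return obs + pred             # model output treated as a state delta

    @staticmethod
    def targ_proc(obs, next_obs):
        return next_obs - obs         # training target is the state delta

    @staticmethod
    def obs_cost_fn(obs):
        return np.sum(obs ** 2, axis=-1)         # negated state reward

    @staticmethod
    def ac_cost_fn(acs):
        return 0.01 * np.sum(acs ** 2, axis=-1)  # extra penalty on actions


env = ToyRobotEnv()
obs = env.reset()
next_obs, reward, done, info = env.step([1.0, -1.0])
target = ToyRobotConfig.targ_proc(obs, next_obs)
print(np.allclose(ToyRobotConfig.obs_postproc(obs, target), next_obs))  # True
```

The point of the sketch is the pairing: targ_proc and obs_postproc must be inverses of each other, and the two cost functions must be computable from observations and actions alone, because the planner never calls the real step function.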

Running experiment on OpenAi Gym Mujoco Environment

Dear author,
In your environments, for example HalfCheetah, you access the reward function and use reward_run and reward_ctrl separately instead of using the original reward returned by step(action), and you also add reward_run to the first dimension of the state.

Is there any config option that allows us to evaluate on the original env provided by OpenAI Gym?

Thank you so much for your support!
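
To make the question concrete, here is a sketch of the decomposition being described. It assumes that the Gym HalfCheetah reward is the sum reward_run + reward_ctrl with reward_ctrl = -0.1 * ||a||^2, and that the modified environment stores the run term in the first observation entry so the cost can be computed from the observation and action alone. Both are assumptions to check against the environment source; the numbers are made up.

```python
import numpy as np


def gym_style_reward(forward_velocity, action):
    """Assumed Gym-style reward: run term plus control penalty."""
    reward_run = forward_velocity
    reward_ctrl = -0.1 * np.sum(np.square(action))
    return reward_run + reward_ctrl


def obs_cost(obs):
    """Cost read from the first observation entry (assumed to hold the run term)."""
    return -obs[0]


def ac_cost(action):
    """Control cost computed from the action alone."""
    return 0.1 * np.sum(np.square(action))


action = np.array([0.5, -0.5, 1.0])
forward_velocity = 2.0
obs = np.concatenate([[forward_velocity], np.zeros(3)])  # hypothetical observation

total_cost = obs_cost(obs) + ac_cost(action)
print(np.isclose(-total_cost, gym_style_reward(forward_velocity, action)))  # True
```

If those assumptions hold, the negated total cost equals the original reward, so the two ways of scoring a step would agree.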

CEM Variance Has No Damping

Is the following line written as intended?

handful-of-trials/dmbrl/misc/optimizers/cem.py

Line 92 in 0fb8905

var = self.alpha * new_var + (1 - self.alpha) * new_var
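
A quick check of what the quoted line computes, next to the exponentially smoothed update that the alpha parameter suggests. The smoothed form is a guess at the intent, not a confirmed fix, and the values are made up.

```python
import numpy as np

alpha = 0.1
var = np.array([0.25, 0.25])      # variance from the previous CEM iteration
new_var = np.array([0.04, 0.09])  # variance of the current elite set

# As quoted: both terms use new_var, so alpha cancels out.
as_written = alpha * new_var + (1 - alpha) * new_var

# Smoothed update (assumed intent): blend the old and the new variance.
smoothed = alpha * var + (1 - alpha) * new_var

print(np.allclose(as_written, new_var))  # True
print(smoothed)
```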

About Initial Variance of CEM

Thank you for your great code.
I have a question about the initial variance of CEM.
According to your code, you set the initial variance as follows:

self.init_var = np.tile(np.square(self.ac_ub - self.ac_lb) / 16, [self.plan_hor])

I am trying to implement CEM for MBRLHalfCheetah-v0 using the true model, as you did in your paper.
However, I got poor results. Did you use this parameter for all tasks?
I did get a good result for the CartPole task.

Also, how did you set the parameters of CEM to get stable results?

"CEM": { "popsize": 500, "num_elites": 50, "max_iters": 5, "alpha": 0.1 }

Thank you for your help.
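
For reference, the quoted initial variance corresponds to a standard deviation of one quarter of the action range, repeated for every step of the planning horizon. The sketch below plugs it into a bare-bones CEM loop with the quoted settings on a toy quadratic cost. Everything about the loop (plain Gaussian sampling with clipping, the toy cost, the bounds, and the horizon) is an illustrative assumption and not the repository's optimizer, which may sample and truncate differently.

```python
import numpy as np


def cem(cost_fn, lb, ub, popsize=500, num_elites=50, max_iters=5, alpha=0.1, seed=0):
    """Minimal CEM sketch: sample, keep the elites, smooth mean and variance."""
    rng = np.random.default_rng(seed)
    mean = (lb + ub) / 2.0
    var = np.square(ub - lb) / 16.0         # std = (ub - lb) / 4
    for _ in range(max_iters):
        samples = mean + np.sqrt(var) * rng.standard_normal((popsize, mean.size))
        samples = np.clip(samples, lb, ub)  # simple clipping, not truncation
        costs = cost_fn(samples)
        elites = samples[np.argsort(costs)[:num_elites]]
        mean = alpha * mean + (1 - alpha) * elites.mean(axis=0)
        var = alpha * var + (1 - alpha) * elites.var(axis=0)
    return mean, var


plan_hor, ac_dim = 5, 1                     # hypothetical sizes
ac_lb, ac_ub = -np.ones(ac_dim), np.ones(ac_dim)
lb, ub = np.tile(ac_lb, [plan_hor]), np.tile(ac_ub, [plan_hor])

init_var = np.tile(np.square(ac_ub - ac_lb) / 16, [plan_hor])
print(init_var)  # [0.25 0.25 0.25 0.25 0.25]

target = np.full(plan_hor, 0.3)             # toy optimum of the quadratic cost
mean, var = cem(lambda x: np.sum((x - target) ** 2, axis=1), lb, ub)
print(np.abs(mean - target).max())          # small if the search converged
```

On a task where the useful action sequences occupy a much narrower or more structured region than a quarter-range Gaussian covers, five iterations may not be enough, which is one possible reason the same settings behave differently across tasks.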

Failed to reproduce with Docker

Hi!

I ran the environment using the latest pre-built Docker image. When training on an RTX 3080 Ti, I get the following failure:

python scripts/mbexp.py -env halfcheetah
2022-08-12 09:10:09.936602: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-08-12 09:10:10.312749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-12 09:10:10.312892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: NVIDIA GeForce RTX 3080 Ti major: 8 minor: 6 memoryClockRate(GHz): 1.71
pciBusID: 0000:07:00.0                                   
totalMemory: 11.76GiB freeMemory: 11.52GiB                                                                        
2022-08-12 09:10:10.312909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2022-08-12 09:13:36.133367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-08-12 09:13:36.133405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2022-08-12 09:13:36.133414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2022-08-12 09:13:36.133525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11135 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Ti, pci bus id: 0000:07:00.0, compute capability: 8.6)
{'ctrl_cfg': {'env': <dmbrl.env.half_cheetah.HalfCheetahEnv object at 0x7f8c91ede160>,
              'opt_cfg': {'ac_cost_fn': <function HalfCheetahConfigModule.ac_cost_fn at 0x7f8c32140488>,     
                          'cfg': {'alpha': 0.1,                                                                   
                                  'max_iters': 5,                                                                 
                                  'num_elites': 50,                                                               
                                  'popsize': 500},
                          'mode': 'CEM',                                                                          
                          'obs_cost_fn': <function HalfCheetahConfigModule.obs_cost_fn at 0x7f8c32140400>,
                          'plan_hor': 30},                                                                        
              'prop_cfg': {'mode': 'TSinf',           
                           'model_init_cfg': {'model_class': <class 'dmbrl.modeling.models.BNN.BNN'>,             
                                              'model_constructor': <bound method HalfCheetahConfigModule.nn_constructor of <halfcheetah.HalfCheetahConfigModule object at 0x7f8c32138748>>,
                                              'num_nets': 5},                                                                                                                                                                        
                           'model_train_cfg': {'epochs': 5},
                           'npart': 20,                                                                           
                           'obs_postproc': <function HalfCheetahConfigModule.obs_postproc at 0x7f8c321402f0>,
                           'obs_preproc': <function HalfCheetahConfigModule.obs_preproc at 0x7f8c32140268>,
                           'targ_proc': <function HalfCheetahConfigModule.targ_proc at 0x7f8c32140378>}},
 'exp_cfg': {'exp_cfg': {'nrollouts_per_iter': 1, 'ntrain_iters': 300},
             'log_cfg': {'logdir': 'log'},                                                                                                                                                                                           
             'sim_cfg': {'env': <dmbrl.env.half_cheetah.HalfCheetahEnv object at 0x7f8c91ede160>,                                                                                                                                    
                         'task_hor': 1000}}} 
Created an ensemble of 5 neural networks with variance predictions.
Created an MPC controller, prop mode TSinf, 20 particles. 
Trajectory prediction logging is disabled.
Average action selection time:  1.0358095169067383e-05
Rollout length:  1000
Network training:   0%|                                                               | 0/5 [00:00<?, ?epoch(s)/s]2022-08-12 09:15:02.909158: E tensorflow/stream_executor/cuda/cuda_blas.cc:647] failed to run cuBLAS routine cublasSgemmBatched: CUBLAS_STATUS_EXECUTION_FAILED
2022-08-12 09:15:02.909199: E tensorflow/stream_executor/cuda/cuda_blas.cc:2505] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[5,32,24], b.shape=[5,24,200], m=32, n=200, k=24, batch_size=5
         [[Node: model_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/truediv, model/Layer0/FC_weights/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/mbexp.py", line 44, in <module>
    main(args.env, "MPC", args.ctrl_arg, args.override, args.logdir)
  File "scripts/mbexp.py", line 29, in main
    exp.run_experiment()
  File "/workspace/dmbrl/misc/MBExp.py", line 96, in run_experiment
    [sample["rewards"] for sample in samples]
  File "/workspace/dmbrl/controllers/MPC.py", line 180, in train
    self.model.train(self.train_in, self.train_targs, **self.model_train_cfg)
  File "/workspace/dmbrl/modeling/models/BNN.py", line 260, in train
    feed_dict={self.sy_train_in: inputs[batch_idxs], self.sy_train_targ: targets[batch_idxs]}
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[5,32,24], b.shape=[5,24,200], m=32, n=200, k=24, batch_size=5
         [[Node: model_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/truediv, model/Layer0/FC_weights/read)]]

Caused by op 'model_1/MatMul', defined at:
  File "scripts/mbexp.py", line 44, in <module>
    main(args.env, "MPC", args.ctrl_arg, args.override, args.logdir)
  File "scripts/mbexp.py", line 22, in main
    cfg.exp_cfg.exp_cfg.policy = MPC(cfg.ctrl_cfg)
  File "/workspace/dmbrl/controllers/MPC.py", line 90, in __init__
    )(params.prop_cfg.model_init_cfg)
  File "/workspace/dmbrl/config/halfcheetah.py", line 83, in nn_constructor
    model.finalize(tf.train.AdamOptimizer, {"learning_rate": 0.001})
  File "/workspace/dmbrl/modeling/models/BNN.py", line 180, in finalize
    train_loss = tf.reduce_sum(self._compile_losses(self.sy_train_in, self.sy_train_targ, inc_var_loss=True))
  File "/workspace/dmbrl/modeling/models/BNN.py", line 436, in _compile_losses
    mean, log_var = self._compile_outputs(inputs, ret_log_var=True)
  File "/workspace/dmbrl/modeling/models/BNN.py", line 408, in _compile_outputs
    cur_out = layer.compute_output_tensor(cur_out)
  File "/workspace/dmbrl/modeling/layers/FC.py", line 71, in compute_output_tensor
    raw_output = tf.matmul(input_tensor, self.weights) + self.biases
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 1976, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1236, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[5,32,24], b.shape=[5,24,200], m=32, n=200, k=24, batch_size=5
         [[Node: model_1/MatMul = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model_1/truediv, model/Layer0/FC_weights/read)]]

minor: requirements.txt is missing pytest

Could not find a version that satisfies the requirement gpflow==1.1.0

Cloning the repository and running pip install -r requirements.txt from scratch results in this error:

Downloading/unpacking dotmap==1.2.20 (from -r requirements.txt (line 1))
  Downloading dotmap-1.2.20.tar.gz
  Running setup.py (path:/home/jhoang/dev/handful-of-trials/env/build/dotmap/setup.py) egg_info for package dotmap

Downloading/unpacking future==0.16.0 (from -r requirements.txt (line 2))
  Downloading future-0.16.0.tar.gz (824kB): 824kB downloaded
  Running setup.py (path:/home/jhoang/dev/handful-of-trials/env/build/future/setup.py) egg_info for package future

    warning: no files found matching '*.au' under directory 'tests'
    warning: no files found matching '*.gif' under directory 'tests'
    warning: no files found matching '*.txt' under directory 'tests'
Downloading/unpacking gpflow==1.1.0 (from -r requirements.txt (line 3))
  Could not find a version that satisfies the requirement gpflow==1.1.0 (from -r requirements.txt (line 3)) (from versions: 1.0.0, 1.0.0, 1.1.1, 1.2.0)
Cleaning up...
No distributions matching the version for gpflow==1.1.0 (from -r requirements.txt (line 3))

I'm assuming 1.1.1 should also work; I just wanted to point this out.

Could you explain the implementation of modeling variance?

https://github.com/kchua/handful-of-trials/blob/master/dmbrl/modeling/models/BNN.py#L414-L415
Intuitively, we could model the Gaussian variance directly from the output of the last layer, but you introduced max_logvar and min_logvar. Could you explain this way of implementing the variance?
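
For readers unfamiliar with the pattern, a common way to use such bounds is to squash the raw network output between min_logvar and max_logvar with softplus, which keeps the predicted variance in a limited range while staying differentiable. The sketch below shows that pattern in NumPy. It is an assumption that the linked lines do exactly this, and the bound values are made up.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def bound_logvar(raw, min_logvar, max_logvar):
    """Softly clamp raw log-variance outputs to roughly [min_logvar, max_logvar].

    The lower bound is never crossed; the upper bound can be exceeded by a
    small margin because the second softplus adds a little on top.
    """
    logvar = max_logvar - softplus(max_logvar - raw)     # soft upper bound
    logvar = min_logvar + softplus(logvar - min_logvar)  # soft lower bound
    return logvar


raw = np.array([-100.0, -3.0, 0.0, 3.0, 100.0])  # hypothetical raw outputs
bounded = bound_logvar(raw, min_logvar=-5.0, max_logvar=1.0)
print(bounded)
```

Without such bounds, an unconstrained log-variance can drift to extreme values on inputs far from the training data, which makes the Gaussian likelihood either explode or vanish.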

Possible issue in normalization

handful-of-trials/dmbrl/modeling/utils/TensorStandardScaler.py

Line 45 in e1a62f2

sigma[sigma < 1e-12] = 1.0

If sigma is below 1e-12, you set it to 1, which disables scaling for that dimension. Shouldn't it be

sigma[sigma < 1e-12] = 1e-12 ?
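
A small experiment illustrates the trade-off, under the assumption that the scaler standardizes inputs as (x - mu) / sigma. For a feature that was constant in the training data, sigma is 0. A floor of 1.0 leaves later deviations at their raw magnitude, while a floor of 1e-12 multiplies them by 1e12. The data below is made up.

```python
import numpy as np

train = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])  # second feature is constant
mu, sigma = train.mean(axis=0), train.std(axis=0)


def standardize(x, mu, sigma, floor_value):
    """Standardize x, replacing near-zero sigma entries by floor_value."""
    safe_sigma = sigma.copy()
    safe_sigma[safe_sigma < 1e-12] = floor_value
    return (x - mu) / safe_sigma


test_point = np.array([2.0, 5.001])  # the constant feature drifts slightly later

with_one = standardize(test_point, mu, sigma, 1.0)
with_eps = standardize(test_point, mu, sigma, 1e-12)
print(with_one[1])  # about 1e-3
print(with_eps[1])  # about 1e9
```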

Don't overwrite logs

In MBExp.py, is there a reason you're saving logs.mat to the base log directory and not to iter_dir? It seems like this overwrites logs.mat at each iteration.

            savemat(
                os.path.join(self.logdir, "logs.mat"),  # <-- This line
                {
                    "observations": traj_obs,
                    "actions": traj_acs,
                    "returns": traj_rets,
                    "rewards": traj_rews
                }
            )

I would think you'd instead want

savemat(
    os.path.join(iter_dir, "logs.mat"),
    ...

but maybe I'm missing something.
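
Whether overwriting loses anything depends on whether traj_obs, traj_acs, traj_rets, and traj_rews accumulate across iterations. The toy sketch below assumes they do (an assumption, not checked against MBExp.py) and uses JSON in a temporary directory as a stand-in for savemat. In that case the overwritten file always holds the full history.

```python
import json
import os
import tempfile

logdir = tempfile.mkdtemp()
log_path = os.path.join(logdir, "logs.json")  # stand-in for logs.mat

traj_rets = []  # assumed to persist across training iterations
for iteration in range(3):
    traj_rets.append(float(iteration))  # hypothetical return of this iteration
    with open(log_path, "w") as f:      # overwrite the same file every time
        json.dump({"returns": traj_rets}, f)

with open(log_path) as f:
    saved = json.load(f)
print(saved["returns"])  # [0.0, 1.0, 2.0]
```

If the lists were instead reset at each iteration, writing to iter_dir as proposed would be the way to keep every iteration's data.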

Dockerfile does not support mujoco rendering

The docker image that is provided (kchua/handful-of-trials) can run the mbexp.py script, but cannot run the render.py script. The error is as follows:

-> python scripts/render.py -env cartpole -logdir log/ -model-dir log/2021-11-06--06\:49\:30/
2022-04-21 09:37:34.268473: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-04-21 09:37:34.346063: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-21 09:37:34.346510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: NVIDIA GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.63GiB
2022-04-21 09:37:34.346552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2022-04-21 09:37:34.538159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-21 09:37:34.538212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2022-04-21 09:37:34.538222: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2022-04-21 09:37:34.538359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3359 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
{'ctrl_cfg': {'env': <dmbrl.env.cartpole.CartpoleEnv object at 0x7fb8eb83fd68>,
              'opt_cfg': {'ac_cost_fn': <function CartpoleConfigModule.ac_cost_fn at 0x7fb8eb8438c8>,
                          'cfg': {'alpha': 0.1,
                                  'max_iters': 5,
                                  'num_elites': 40,
                                  'popsize': 400},
                          'mode': 'CEM',
                          'obs_cost_fn': <function CartpoleConfigModule.obs_cost_fn at 0x7fb8eb843840>,
                          'plan_hor': 25},
              'prop_cfg': {'mode': 'TSinf',
                           'model_init_cfg': {'load_model': True,
                                              'model_class': <class 'dmbrl.modeling.models.BNN.BNN'>,
                                              'model_constructor': <bound method CartpoleConfigModule.nn_constructor of <cartpole.CartpoleConfigModule object at 0x7fb8eb83f9e8>>,
                                              'model_dir': 'log/2021-11-06--06:49:30/',
                                              'num_nets': 5},
                           'model_pretrained': True,
                           'model_train_cfg': {'epochs': 5},
                           'npart': 20,
                           'obs_postproc': <function CartpoleConfigModule.obs_postproc at 0x7fb8eb843730>,
                           'obs_preproc': <function CartpoleConfigModule.obs_preproc at 0x7fb8eb8436a8>,
                           'targ_proc': <function CartpoleConfigModule.targ_proc at 0x7fb8eb8437b8>}},
 'exp_cfg': {'exp_cfg': {'ninit_rollouts': 0,
                         'nrollouts_per_iter': 1,
                         'ntrain_iters': 1},
             'log_cfg': {'logdir': 'log/', 'nrecord': 1},
             'sim_cfg': {'env': <dmbrl.env.cartpole.CartpoleEnv object at 0x7fb8eb83fd68>,
                         'task_hor': 200}}}
Model loaded from log/2021-11-06--06:49:30/.
Created an ensemble of 5 neural networks with variance predictions.
Created an MPC controller, prop mode TSinf, 20 particles. 
Trajectory prediction logging is disabled.
####################################################################
Starting training iteration 1.
ERROR:mujoco_py.mjviewer:GLFW error: 65543, desc: b'GLX: Failed to create context: BadValue (integer parameter out of range for operation)'
ERROR:mujoco_py.mjviewer:GLFW error: 65537, desc: b'The GLFW library is not initialized'
ERROR:mujoco_py.mjviewer:GLFW error: 65537, desc: b'The GLFW library is not initialized'
ERROR:mujoco_py.mjviewer:GLFW error: 65537, desc: b'The GLFW library is not initialized'
ERROR:mujoco_py.mjviewer:GLFW error: 65537, desc: b'The GLFW library is not initialized'
Average action selection time:  1.2227837240695953
Rollout length:  200
Rewards obtained: [179.53440146325102]
ERROR:mujoco_py.mjviewer:GLFW error: 65537, desc: b'The GLFW library is not initialized'
ERROR:mujoco_py.mjviewer:GLFW error: 65537, desc: b'The GLFW library is not initialized'
Exception ignored in: <bound method Env.__del__ of <dmbrl.env.cartpole.CartpoleEnv object at 0x7fb8eb83fd68>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/gym/core.py", line 203, in __del__
    self.close()
  File "/usr/local/lib/python3.5/dist-packages/gym/core.py", line 164, in close
    self.render(close=True)
  File "/usr/local/lib/python3.5/dist-packages/gym/core.py", line 150, in render
    return self._render(mode=mode, close=close)
  File "/usr/local/lib/python3.5/dist-packages/gym/envs/mujoco/mujoco_env.py", line 105, in _render
    self._get_viewer().finish()
  File "/usr/local/lib/python3.5/dist-packages/mujoco_py/mjviewer.py", line 325, in finish
    glfw.destroy_window(self.window)
  File "/usr/local/lib/python3.5/dist-packages/mujoco_py/glfw.py", line 809, in destroy_window
    window_addr = ctypes.cast(ctypes.pointer(window),
TypeError: _type_ must have storage info

No log has been saved

I ran this programme on our university's GPU server, which runs Ubuntu. In the log folder there is only one folder, whose name has the form "2GU49Z~4", and it is zero KB. I tried adding -o ctrl_cfg.log_cfg.save_all_models True -o ctrl_cfg.log_cfg.log_traj_preds True -o ctrl_cfg.log_cfg.log_particles True at the end of the command, but it made no difference.