danijar / director
Deep Hierarchical Planning from Pixels
Home Page: https://danijar.com/director/
@danijar
Could you please tell me how to run inference with a model trained via "train_with_viz"?
I would appreciate it if you could answer my question.
Hi!
I've been trying to directly visualize the goal the manager generates, but I can't figure out how to turn the one-hot skill grid into an image of the scene. I can visualize the latent properly using the WorldModel's decode head, but is the 1024-entry vector produced by Hierarchy.py::dec(skill) viewable the same way the decoded latent vector is? I'm using Pong so I can train and run locally.
Thanks for publishing this repo, it's great work!
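For what it's worth, here is a minimal numpy sketch of the pipeline the question has in mind; all module names and sizes below are assumptions for illustration, not the repo's actual API. The idea is: flatten the one-hot skill grid, map it through the goal decoder to a feature vector, then feed that vector to the same image decoder used for latents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: an 8x8 grid of 8-way one-hot skills, a 1024-dim feature
# space, and a 64x64 grayscale reconstruction. These are placeholders,
# not Director's actual dimensions.
SKILL_GRID, SKILL_CLASSES, FEAT_DIM, IMG_SIDE = 8, 8, 1024, 64

# Stand-ins for the trained goal decoder and the world model's image decoder.
W_goal = rng.normal(0, 0.02, (SKILL_GRID * SKILL_CLASSES, FEAT_DIM))
W_img = rng.normal(0, 0.02, (FEAT_DIM, IMG_SIDE * IMG_SIDE))

def decode_skill_to_image(skill_onehot):
    """skill_onehot: (SKILL_GRID, SKILL_CLASSES) one-hot array."""
    flat = skill_onehot.reshape(-1)                           # flatten the grid
    goal_feat = flat @ W_goal                                 # goal decoder -> 1024-dim feature
    image = (goal_feat @ W_img).reshape(IMG_SIDE, IMG_SIDE)   # image decoder
    return goal_feat, image

skill = np.eye(SKILL_CLASSES)[rng.integers(0, SKILL_CLASSES, SKILL_GRID)]
goal_feat, image = decode_skill_to_image(skill)
print(goal_feat.shape, image.shape)  # (1024,) (64, 64)
```

In the real model the two matrices would be the trained goal decoder and the world model's convolutional decoder, but the shape bookkeeping is the same.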
Hi,
The goal autoencoder's recreation loss in your code is the negative log probability of the world model's representation under the goal decoder's distribution:
rec = -dec.log_prob(tf.stop_gradient(goal))
But the paper lists it as the MSE of the decoded state and the original state:
((dec(feat.detach()) - feat) ** 2).mean()
Is the former a better measure of recreation loss than the latter?
Also, you only use the deterministic part of the RSSM as the representation for training the goal autoencoder (in hierarchy::train_vae_replay). Why use only the deterministic part and not include the stochastic part as well?
Apologies if you've written about this somewhere, and thank you for making your extremely interesting work public.
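For context on why the two losses can coincide: under a Gaussian decoder with fixed unit variance, the negative log probability is half the squared error plus a constant that does not depend on the decoder output, so minimizing one minimizes the other. A quick numerical check (pure numpy, not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=16)   # decoded features, playing the role of dec(feat)
x = rng.normal(size=16)    # target features, playing the role of the goal

def gauss_nll(x, mu):
    # Negative log density of x under an isotropic unit-variance Gaussian.
    return np.sum(0.5 * np.log(2 * np.pi) + 0.5 * (x - mu) ** 2)

const = 0.5 * len(x) * np.log(2 * np.pi)  # independent of mu
sse = np.sum((x - mu) ** 2)

# NLL = 0.5 * SSE + constant, so gradients w.r.t. mu are proportional.
assert np.isclose(gauss_nll(x, mu), 0.5 * sse + const)
```

If the decoder also learns its variance, the two objectives differ, which may be part of the answer here.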
Hi Danijar,
Reading the appendix of Director, I couldn't understand what you mean by providing the reward to the worker. Is there a config I can use to do that? In the caption of the figure you write: "When additionally providing task reward to the worker". Does that mean you change the context variable defined in hierarchy.py to include the reward as well? Also, if it works so well, why isn't it the default? Have you tried the same for other tasks (e.g. Ant Mazes)?
Thank you so much!
Best,
Cristian
Hi, first of all, thank you so much for sharing such amazing work & code.
I really loved the idea and the results of this paper, and am trying to apply some ideas on top of this.
However, I have run into some problems. I trained the model on the dmc_vision dmc_walker_walk task using GPUs with 16GB and 24GB of VRAM, but received an out-of-memory error. I changed the batch size to 1, but it did not fix the problem.
Also, when I ran this on a GPU with less VRAM (8GB or 12GB), the training process got stuck after 8008 steps (about 3-5 minutes after training starts). The paper says training can be done in one day on a V100 GPU, which has 32GB of VRAM, so I was wondering whether I need a GPU with more VRAM to train this model. I suspect this is the case because running dmc_proprio worked without any problem, so the CNN in the vision model seems to be the cause. Is there a way to run training on a GPU with less VRAM?
Assuming that lack of VRAM is the problem, I also tried using multiple GPUs with the "multi_gpu" and "multi_worker" configurations in tfagent.py, but now I am getting a new error:
metrics.update(self.model_opt(model_tape, model_loss, modules))
File "/vol/bitbucket/jk3417/explainable-mbhrl/embodied/agents/director/tfutils.py", line 246, in __call__ *
    self._opt.apply_gradients(
File "/vol/bitbucket/xmbhrl/lib/python3.10/site-packages/keras/optimizer_v2/optimizer_v2.py", line 671, in apply_gradients
    return tf.__internal__.distribute.interim.maybe_merge_call(
RuntimeError: `merge_call` called while defining a new graph or a tf.function. This can often happen if the function `fn` passed to `strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function` contains a synchronization point, such as aggregating gradients (e.g., optimizer.apply_gradients), or if the function `fn` uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested `tf.function`s or control flow statements that may potentially cross a synchronization boundary, for example, wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a `tf.function` or move the control flow out of `fn`. If you are subclassing a `tf.keras.Model`, please avoid decorating overridden methods `test_step` and `train_step` in `tf.function`.
There's a good chance I am using the wrong TensorFlow version, so this may be a dependency issue on my end.
I checked the Dockerfile and saw that it uses TensorFlow 2.8 or 2.9, but when using 2.9, JIT compilation failed.
It would be great if someone could share whether they are facing similar issues or know a solution to this problem. Thank you so much.
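Regarding the VRAM question above: not a fix for the root cause, but two standard TensorFlow knobs sometimes help on small-VRAM cards. These are generic tf.config calls, independent of this repo, and must run before any tensors are created:

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')

USE_MEMORY_GROWTH = True  # pick exactly one of the two strategies below

if gpus and USE_MEMORY_GROWTH:
    for gpu in gpus:
        # Allocate VRAM on demand instead of grabbing it all at start-up.
        tf.config.experimental.set_memory_growth(gpu, True)
elif gpus:
    # Or cap TensorFlow to a fixed budget (here 7 GB) so allocations fail
    # fast instead of hanging; this option is mutually exclusive with
    # memory growth on the same device.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=7168)])
```

Neither shrinks the model itself; if the CNN's activations genuinely exceed the card, reducing image resolution or model size in the config is the remaining lever.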
I am using
I see that the code includes Minecraft support; can I train on this environment? Thanks a lot!
Hi Danijar,
I know this might sound like a dumb question, but after training Director on a few tasks, I'd like to see how it performs, either by rendering the environment while running the agent or by running headless and looking at plots/a dashboard of performance metrics.
Just FYI, I used the Docker container to train (which, BTW, I had to update, as it actually requires TensorFlow 2.11.0rc1-gpu, libgles2-mesa-dev, an upgraded PyOpenGL, matplotlib, and a few other changes to run smoothly).
Thanks for your help.
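For the headless case: I don't know the exact filenames this repo writes, but if the logdir contains a JSON-lines metrics file (one dict per line; the name `metrics.jsonl` and the `episode/score` key below are assumptions, not verified against the repo), extracting a learning curve needs only the standard library:

```python
import json
import tempfile
from pathlib import Path

# Build a tiny stand-in logdir so the sketch is self-contained; in practice
# point `logdir` at the real training directory instead.
logdir = Path(tempfile.mkdtemp())
(logdir / 'metrics.jsonl').write_text('\n'.join([
    json.dumps({'step': 1000, 'episode/score': 12.0}),
    json.dumps({'step': 2000, 'episode/score': 35.5}),
    json.dumps({'step': 3000, 'episode/score': 61.2}),
]))

records = [json.loads(line) for line in
           (logdir / 'metrics.jsonl').read_text().splitlines() if line]
steps = [r['step'] for r in records if 'episode/score' in r]
scores = [r['episode/score'] for r in records if 'episode/score' in r]
print(list(zip(steps, scores)))
# From here, matplotlib's plt.plot(steps, scores) with the 'Agg' backend
# produces a headless learning-curve plot.
```

TensorBoard pointed at the logdir is the other common route if the run wrote event files.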
Hi Danijar, thank you so much for helping me run the code.
It took some time to run the different tasks in order to provide more information.
So, I think there are currently two problems in the code.
Here, I will attach the outputs of each task (each links to a gist).
dmc_vision / dmc_walker_walk: RESOURCE_EXHAUSTED Error after collecting pre-train samples
dmc_proprio / dmc_walker_walk: In line 821, you can see that fps is 3.1. It took about 15 hours to collect 200k steps. (Also, in line 809, train/duration is 3220.91, which means each train step takes approx. 50 minutes?)
loconav / loconav_ant_maze_m: RESOURCE_EXHAUSTED Error after collecting pre-train samples
I also tried changing the number of envs to 1 and the batch size to 1, but it did not make a difference. It would be amazing if you could help me figure out what causes this problem. Thank you so much.
Below is the list of Python packages installed in my virtual env (Python 3.10).
Package Version
---------------------------------- ---------
absl-py 1.4.0
astunparse 1.6.3
atari-py 0.2.9
backports.shutil-get-terminal-size 1.0.0
bcrypt 4.0.1
cachetools 5.3.0
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 3.1.0
cloudpickle 1.6.0
colorama 0.4.6
contourpy 1.0.7
crafter 1.8.0
cryptography 40.0.1
cycler 0.11.0
decorator 5.1.1
dm-control 1.0.11
dm-env 1.6
dm-sonnet 2.0.1
dm-tree 0.1.8
flatbuffers 23.3.3
fonttools 4.39.3
gast 0.4.0
glfw 2.5.9
google-auth 2.17.1
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
grpcio 1.53.0
gym 0.19.0
gym-minigrid 1.0.3
h5py 3.8.0
idna 3.4
imageio 2.27.0
keras 2.8.0
Keras-Preprocessing 1.1.2
kiwisolver 1.4.4
labmaze 1.0.6
libclang 16.0.0
llvmlite 0.39.1
lxml 4.9.2
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
matplotlib 3.7.1
mdurl 0.1.2
mujoco 2.3.3
numba 0.56.4
numpy 1.23.5
nvidia-cublas-cu12 12.1.0.26
nvidia-cuda-runtime-cu12 12.1.55
nvidia-cudnn-cu12 8.9.0.131
oauthlib 3.2.2
opencv-python 4.7.0.72
opensimplex 0.4.4
opt-einsum 3.3.0
packaging 23.0
paramiko 3.1.0
Pillow 9.5.0
pip 22.0.2
protobuf 3.19.6
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.21
Pygments 2.14.0
PyNaCl 1.5.0
PyOpenGL 3.1.6
pyparsing 3.0.9
python-dateutil 2.8.2
reprint 0.6.0
requests 2.28.2
requests-oauthlib 1.3.1
rich 13.3.3
rsa 4.9
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.7
scipy 1.10.1
setuptools 59.6.0
six 1.16.0
tabulate 0.9.0
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.8.3
tensorflow-estimator 2.8.0
tensorflow-io-gcs-filesystem 0.32.0
tensorflow-probability 0.16.0
tensorrt 8.6.0
termcolor 2.2.0
tqdm 4.65.0
typing_extensions 4.5.0
urllib3 1.26.15
Werkzeug 2.2.3
wheel 0.37.1
wrapt 1.15.0