Comments (9)

guicho271828 commented on May 18, 2024

I don't think regenerating the dataset is necessary.
Regarding setup.py, run git diff HEAD..refs/tags/4.1.3; if there is no diff, it should be good to go.

from latplan.

wn4github commented on May 18, 2024

Thank you for the suggestion. In the end, though, I couldn't get the code to run because of incompatibilities between nvidia-tensorflow 1.15 and Keras. I have to use nvidia-tensorflow because RTX 30-series GPUs do not support CUDA 10, while prebuilt TensorFlow 1.x does not support CUDA 11.

I'm wondering if you are currently porting the code to PyTorch?

guicho271828 commented on May 18, 2024

Yes, that is also something I am struggling with. My lab cluster is also transitioning to CUDA 11, so I am attempting to rewrite part of the code, but development is slow.

guicho271828 commented on May 18, 2024

You could try building TF 1.15 for CUDA 11.

guicho271828 commented on May 18, 2024

I just learned that NVIDIA (not Google) provides a backward-compatible build of TensorFlow 1.15 that works with CUDA 11.
tensorflow/tensorflow#43629
https://developer.nvidia.com/blog/accelerating-tensorflow-on-a100-gpus/
Its package name appears to be nvidia-tensorflow.

wn4github commented on May 18, 2024

Thank you so much for your help, Asai-san.

I also discovered that nvidia-tensorflow works with CUDA 11, and I managed to get a working TF 1.15 setup with:

  • nvidia-tensorflow 1.15.4+nv20.12
  • Keras 2.2.5
  • keras-adabound 0.6.0
  • Keras-Applications 1.0.8
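
For anyone reproducing this environment, a quick way to confirm the versions above is to compare them against what is installed. The following is only a minimal sketch: it assumes Python 3.8 or newer (for importlib.metadata) and that the distribution names match what pip reports on your machine; check_versions is a hypothetical helper, not part of Latplan.

```python
from importlib import metadata

# Versions listed in this comment. The distribution names are assumed to be
# the same ones pip reports in your environment.
EXPECTED = {
    "nvidia-tensorflow": "1.15.4+nv20.12",
    "Keras": "2.2.5",
    "keras-adabound": "0.6.0",
    "Keras-Applications": "1.0.8",
}


def check_versions(expected):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every listed package is installed at the listed version.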

Now I am using the 4.1.3 version from the release source code, and running ./setup-dataset.sh is successful. However, I noticed a difference in setup-dataset.sh between 4.1.3 and the latest commit: in 4.1.3, no .npz files are downloaded.

Anyway, the problem I am running into now is an error when executing ./train_all.sh. I uncommented only the first training line: task-planning learn_plot_dump_summary. The trace is as follows:

Fancy Traceback (most recent call last):
  File ./strips.py line 294 function <module> : main()
                    mode = 'learn_plot_dump_summary'
                sae_path = 'puzzle_mnist_3_3_5000_None_None_None_False_ConcreteDetNormalizedLogitAddEffectTransitionAE_planning'
      default_parameters = {'epoch': 200, 'batch_size': 500, 'optimizer': 'radam', 'max_temperature': 5.0, 'min_temperature': 0.7, 'M': 2, 'train_gumbel': True, 'train_softmax': True, 'test_gumbel': False, 'test_softmax': False, 'locality': 0.0, 'locality_delay': 0.0, 'aeclass': 'ConcreteDetNormalizedLogitAddEffectTransitionAE'}
              parameters = {'beta': [-0.3, -0.1, 0.0, 0.1, 0.3], 'lr': [0.1, 0.01, 0.001], 'N': [100, 200, 500, 1000], 'M': [2], 'layer': [1000], 'clayer': [16], 'dropout': [0.4], 'noise': [0.4], 'dropout_z': [False], 'activation': ['relu'], 'num_actions': [100, 200, 400, 800, 1600], 'aae_width': [100, 300, 600], 'aae_depth': [0, 1, 2], 'aae_activation': ['relu', 'tanh'], 'aae_delay': [0], 'direct': [0.1, 1.0, 10.0], 'direct_delay': [0.05, 0.1, 0.2, 0.3, 0.5], 'zerosuppress': [0.1, 0.2, 0.5], 'zerosuppress_delay': [0.05, 0.1, 0.2, 0.3, 0.5], 'loss': ['BCE'], 'type': ['mnist'], 'width': [3], 'height': [3], 'num_examples': [5000], 'stop_gradient': [False], 'aeclass': ['ConcreteDetNormalizedLogitAddEffectTransitionAE'], 'comment': ['planning']}

  File ./strips.py line 290 function main : globals()[task](*map(myeval,sys.argv))
                    task = 'puzzle'

  File ./strips.py line 208 function puzzle : show_summary(ae, train, test)
                    type = 'mnist'
                   width = 3
                  height = 3
            num_examples = 5000
                       N = None
             num_actions = None
                  direct = None
           stop_gradient = False
                 aeclass = 'ConcreteDetNormalizedLogitAddEffectTransitionAE'
                 comment = 'planning'
                    name = 'comment'
                   value = 'planning'
                    path = '/home/wn/workspace/latplan/latplan-4.1.3_original/latplan/puzzles/puzzle-mnist-3-3.npz'
                    data = <numpy.ndarray float32  (5000, 2, 42, 42)>
             pre_configs = <numpy.ndarray float64  (5000, 9)>
             suc_configs = <numpy.ndarray float64  (5000, 9)>
                    pres = <numpy.ndarray float32  (5000, 42, 42)>
                    sucs = <numpy.ndarray float32  (5000, 42, 42)>
             transitions = <numpy.ndarray float32  (2, 5000, 42, 42)>
                  states = <numpy.ndarray float32  (10000, 42, 42)>
                   train = <numpy.ndarray float32  (4500, 2, 42, 42)>
                     val = <numpy.ndarray float32  (250, 2, 42, 42)>
                    test = <numpy.ndarray float32  (250, 2, 42, 42)>
                      ae = None

  File ./strips.py line 180 function show_summary : ae.summary()
                      ae = None
                   train = <numpy.ndarray float32  (4500, 2, 42, 42)>
                    test = <numpy.ndarray float32  (250, 2, 42, 42)>

AttributeError: 'NoneType' object has no attribute 'summary'

I am also considering using the trained weights directly if training cannot be done, but I would need some guidance.

guicho271828 commented on May 18, 2024

> Now I am using the 4.1.3 version from the release source code, and running ./setup-dataset.sh is successful. However, I noticed a difference in setup-dataset.sh between 4.1.3 and the latest commit: in 4.1.3, no .npz files are downloaded.

setup-dataset also downloads unrelated .npz files that are not used in the IJCAI paper (but are used in other papers). Sorry for the confusion; this entire repository is a kind of "lab environment" that sets up everything I use for all of my papers. The failed ones for photorealistic-blocksworld are not used, so no worries. Instead, all datasets needed to reproduce the IJCAI paper are rendered locally using a script included in this repo.

> Anyway, the problem I am running into now is an error when executing ./train_all.sh.

Since you already have the trained weights, running this script is not necessary. All results, including the CSV dump and the PDDL domain file, are included in the archive.

> AttributeError: 'NoneType' object has no attribute 'summary'

Here is what is happening: task-planning learn_plot_dump_summary tries to run the training. However, since samples/*/grid_search.log already has more entries than the specified limit of hyperparameter configurations (300), it did not run the training. Thus the model instance (ae) is None.
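
To make the failure mode concrete, here is a small stand-alone sketch of the control flow described above, with an explicit guard in place of the bare AttributeError. It is not the code in strips.py: maybe_train, DummyModel, and the exact comparison against the limit are assumptions made for illustration.

```python
def maybe_train(num_logged, limit, train_fn):
    """Return a trained model, or None when the search budget is used up.

    num_logged stands for the number of entries already in grid_search.log.
    The >= comparison is an assumption; the real check may differ.
    """
    if num_logged >= limit:
        return None
    return train_fn()


def show_summary(ae, train, test):
    """Print the model summary, failing with a clear message if ae is None."""
    if ae is None:
        raise RuntimeError(
            "No model instance: training was skipped because the "
            "hyperparameter search limit was already reached. "
            "Use a mode that loads stored weights instead."
        )
    ae.summary()


class DummyModel:
    """Hypothetical stand-in for the autoencoder class."""

    def summary(self):
        print("dummy model")


if __name__ == "__main__":
    ae = maybe_train(num_logged=301, limit=300, train_fn=DummyModel)
    try:
        show_summary(ae, train=[], test=[])
    except RuntimeError as err:
        print(err)
```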

If you want to regenerate the reconstructions etc., task-planning plot_dump_summary loads the stored weights, produces a reconstruction plot, and dumps several files needed for generating the PDDL files. Make sure the archive is decompressed in the correct directory, so that the samples/ directory is in the root of the repository.
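
A quick sanity check for the directory layout could look like the sketch below. It only relies on what is stated here (a samples/ directory in the repository root containing */grid_search.log files); find_grid_search_logs is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_grid_search_logs(repo_root):
    """Return the sorted grid_search.log paths under <repo_root>/samples/."""
    samples = Path(repo_root) / "samples"
    if not samples.is_dir():
        raise FileNotFoundError(
            f"{samples} not found: was the archive extracted in the repository root?"
        )
    # One subdirectory per trained model, each expected to hold a grid_search.log.
    return sorted(samples.glob("*/grid_search.log"))


if __name__ == "__main__":
    try:
        for log in find_grid_search_logs("."):
            print(log)
    except FileNotFoundError as err:
        print(err)
```

Run it from the repository root; an empty listing or the error message suggests the archive was unpacked in the wrong place.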

The hyperparameter search is completely parallelized at the process level. So, if you have an 8-core, 8-GPU machine, just run 8 processes in parallel.
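
One possible way to start those processes is sketched below. Pinning each process to a single GPU through the CUDA_VISIBLE_DEVICES environment variable is an assumption on my part (the comment above only says to run the processes in parallel), and worker_env and launch_workers are hypothetical helper names.

```python
import os
import subprocess


def worker_env(gpu_index, base_env=None):
    """Copy the environment and restrict the process to one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_workers(command, num_gpus):
    """Start one process per GPU, wait for all of them, return exit codes."""
    procs = [
        subprocess.Popen(command, env=worker_env(gpu))
        for gpu in range(num_gpus)
    ]
    return [proc.wait() for proc in procs]
```

For example, launch_workers(["./train_all.sh"], 8) would start eight copies of the training script on an 8-GPU machine.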

guicho271828 commented on May 18, 2024

If you do want to train the model, you may also want to prune some hyperparameters by looking at samples/*/grid_search.log. For example, this one is the best hyperparameter configuration for the mandrill 15-puzzle. You can then edit the parameter dictionary in strips.py.
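
As a rough illustration of what pruning does to the search space, the sketch below fixes some keys of a parameter grid to a single value and counts the remaining configurations. The grid is a four-key subset of the dictionary shown in the traceback earlier in this thread; the chosen values are placeholders, not the best configuration from any log, and prune and grid_size are hypothetical helpers.

```python
def grid_size(parameters):
    """Number of configurations in a grid given as {name: [candidate values]}."""
    size = 1
    for values in parameters.values():
        size *= len(values)
    return size


def prune(parameters, chosen):
    """Return a copy of the grid with each chosen key fixed to a single value."""
    pruned = dict(parameters)
    for key, value in chosen.items():
        if key not in parameters:
            raise KeyError(f"unknown hyperparameter: {key}")
        if value not in parameters[key]:
            raise ValueError(f"{value!r} is not a candidate for {key}")
        pruned[key] = [value]
    return pruned


# Subset of the grid shown in the traceback above.
parameters = {
    "beta": [-0.3, -0.1, 0.0, 0.1, 0.3],
    "lr": [0.1, 0.01, 0.001],
    "N": [100, 200, 500, 1000],
    "num_actions": [100, 200, 400, 800, 1600],
}

# Placeholder choices; take the real ones from grid_search.log.
chosen = {"lr": 0.001, "N": 200}
pruned = prune(parameters, chosen)

print(grid_size(parameters), "->", grid_size(pruned))
```

Keeping every value as a one-element list preserves the shape of the dictionary, so the pruned grid can be pasted back in place of the original.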

wn4github commented on May 18, 2024

My immediate goal is to use the Cube-Space AE to encode some MNIST 8-puzzle images. After that, to better understand Latplan, I plan to train the network and get my hands dirty with the implementation. But this is off-topic for this issue, so maybe I should open a new one. Thank you for all your help.
