Comments (13)
Hey Yangshell, there's no official description of how to run the code yet. We're still experimenting with the architecture to find a good training setup that replicates the paper's results on the rooms_ring_camera dataset. Once we have a stable version and working model snapshots, we will merge the dev branch into master and write a detailed Readme.
In the meantime, you can check https://github.com/ogroth/tf-gqn/blob/rooms_ring_camera_training/train_gqn_draw.py
This is the script that trains the GQN (provided you have downloaded the training data). However, we haven't managed to produce great visual results yet.
from tf-gqn.
We have released a stable version of GQN which trains on the rooms_ring_camera dataset with the default parameters we provide. The training script is: https://github.com/ogroth/tf-gqn/blob/master/train_gqn_draw.py
Please see the Readme for detailed instructions on how to set up and run the code.
from tf-gqn.
I downloaded the rooms_ring_camera dataset, but after I run the script, the system kills it. My computer has one 1080Ti GPU. Is that not enough? Do I need a better one?
from tf-gqn.
Hi wlred, could you please give a more detailed version of the error you get? Which script have you run (with which parameters) and what happened after it had been launched? Has it run into an out-of-memory error? A GTX 1080Ti is definitely sufficient to train the model.
from tf-gqn.
Hi ogroth, my steps:
(1) Download the rooms_ring_camera dataset.
(2) Run the commands:
1>source venv/bin/activate
2>python3 train_gqn_draw.py --data_dir /tmp/data/gqn-dataset --dataset rooms_ring_camera --model_dir /tmp/models/gqn
The output log is:
Training a GQN.
FLAGS: Namespace(batch_size=36, chkpt_steps=10000, data_dir='/tmp/data/gqn-dataset', dataset='rooms_ring_camera', debug=False, initial_eval=False, log_steps=100, memcap=1.0, model_dir='/tmp/models/gqn', queue_buffer=64, queue_threads=4, train_epochs=40)
UNPARSED_ARGV: ['--mode_dir', '/tmp/models/gqn']
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_global_id_in_cluster': 0, '_task_id': 0, '_service': None, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
allow_growth: true
}
, '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_master': '', '_task_type': 'worker', '_tf_random_seed': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fec2a484f28>, '_model_dir': '/tmp/models/gqn', '_save_checkpoints_steps': 10000, '_keep_checkpoint_every_n_hours': 10000, '_num_worker_replicas': 1, '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_train_distribute': None}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
kill
(3) About 20 seconds after I start the train_gqn_draw.py script, the system kills the process, and my computer becomes slow.
from tf-gqn.
You seem to have a typo in your CLI parameters when calling the script:
UNPARSED_ARGV: ['--mode_dir', '/tmp/models/gqn']
That should read: --model_dir /tmp/models/gqn
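For illustration, here is a minimal sketch of why a misspelled flag ends up in UNPARSED_ARGV instead of raising an error. It assumes the script parses its flags with argparse's parse_known_args, which the FLAGS / UNPARSED_ARGV log lines suggest but which is not confirmed here. The parser below is a hypothetical stand-in, not the actual tf-gqn code, and the path /tmp/other/dir is made up.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the training script's CLI (not the real tf-gqn parser).
    parser = argparse.ArgumentParser(description="Sketch of a training CLI.")
    parser.add_argument("--model_dir", type=str, default="/tmp/models/gqn")
    return parser


def parse(argv):
    # parse_known_args does not fail on unknown flags; it returns them separately.
    flags, unparsed = build_parser().parse_known_args(argv)
    return flags, unparsed


# The misspelled flag is not recognised, so the default model_dir is used silently
# and the typo plus its value are returned in the "unparsed" list.
flags, unparsed = parse(["--mode_dir", "/tmp/other/dir"])
print("FLAGS:", flags)
print("UNPARSED_ARGV:", unparsed)
```

Under that assumption, a non-empty UNPARSED_ARGV in the log is a quick hint that one of the flags was mistyped and a default value was used in its place.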
from tf-gqn.
I ran the command: python3 train_gqn_draw.py --data_dir /tmp/data/gqn-dataset --dataset rooms_ring_camera
It is still killed.
from tf-gqn.
The output log is:
Training a GQN.
FLAGS: Namespace(batch_size=36, chkpt_steps=10000, data_dir='/tmp/data/gqn-dataset', dataset='rooms_ring_camera', debug=False, initial_eval=False, log_steps=100, memcap=1.0, model_dir='/tmp/models/gqn', queue_buffer=64, queue_threads=4, train_epochs=40)
UNPARSED_ARGV: []
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_global_id_in_cluster': 0, '_task_id': 0, '_service': None, '_session_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
allow_growth: true
}
, '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_master': '', '_task_type': 'worker', '_tf_random_seed': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fec2a484f28>, '_model_dir': '/tmp/models/gqn', '_save_checkpoints_steps': 10000, '_keep_checkpoint_every_n_hours': 10000, '_num_worker_replicas': 1, '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_train_distribute': None}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
kill
from tf-gqn.
Have you tried monitoring your system with htop and nvidia-smi to check whether there is any unusual behaviour in terms of CPU / GPU usage or memory allocation? That's the only thing I can think of off the top of my head that could cause the OS to kill the process. Which OS are you using?
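If watching htop by hand is awkward, something like the following rough sketch could log available system RAM once per second while the training script starts up. It assumes a Linux machine that exposes /proc/meminfo with a MemAvailable field. The function names are made up for this example and are not part of tf-gqn.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Fraction of total RAM that the kernel reports as still available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def watch(interval_s=1.0, samples=30, path="/proc/meminfo"):
    """Print available memory periodically; run this in a second terminal."""
    for _ in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print("available: %.1f%% of %d kB"
              % (100.0 * available_fraction(info), info["MemTotal"]))
        time.sleep(interval_s)


# Usage (Linux only): call watch() right before launching the training script.
```

If the available fraction drops towards zero just before the process dies, that would point to host RAM rather than the GPU as the limiting resource.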
from tf-gqn.
Ubuntu 16.04
from tf-gqn.
Hi ogroth, how much memory does your computer have?
from tf-gqn.
We've trained on machines with 32GB of RAM, but training never occupied more than 8GB at any time.
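To compare your own runs against that figure, one option is to record the peak resident memory of the Python process. This is only a sketch using the standard-library resource module (Unix only). The helper name peak_rss_mib is made up for this example and is not something the training script provides.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024.0 * 1024.0)
    return peak / 1024.0


print("peak RSS so far: %.1f MiB" % peak_rss_mib())
```

Calling such a helper periodically, for example every few hundred training steps, would show how close a run gets to the machine's physical RAM before it is killed.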
from tf-gqn.
Interesting. I have run your code on 3 computers, and none of them can run it; the process is killed on all of them. Maybe a lot of people have the same problem.
from tf-gqn.