I had hoped to solve this myself but couldn't, so I'm hoping someone here knows how to fix it:
TensorFlow raises a NaN error during training. The error is completely unpredictable: it occurs in a different epoch every time I restart training.
Epoch 1/200
64/10000 [..............................] - ETA: 1903s - g_loss: 4.4450 - d_loss: 5.4199 - d_loss_fake: 4.4598 - d_loss_legit: 0.9601 - time: 10.4118I tensorflow/core/common_runtime/gpu/pool_$
llocator.cc:244] PoolAllocator: After 2061 get requests, put_count=2041 evicted_count=1000 eviction_rate=0.489956 and unsatisfied allocation rate=0.543426
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
1280/10000 [==>...........................] - ETA: 343s - g_loss: 5.1531 - d_loss: 2.9220 - d_loss_fake: 1.1781 - d_loss_legit: 1.7439 - time: 2.3984I tensorflow/core/common_runtime/gpu/pool_al$
ocator.cc:244] PoolAllocator: After 5407 get requests, put_count=5279 evicted_count=1000 eviction_rate=0.18943 and unsatisfied allocation rate=0.212872
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 256 to 281
3584/10000 [=========>....................] - ETA: 221s - g_loss: 6.1948 - d_loss: 2.4484 - d_loss_fake: 1.0048 - d_loss_legit: 1.4435 - time: 2.1499W tensorflow/core/framework/op_kernel.cc:936]
Invalid argument: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
W tensorflow/core/framework/op_kernel.cc:936] Invalid argument: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
E tensorflow/core/client/tensor_c_api.cc:485] Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
Traceback (most recent call last):
  File "./train_generative_model.py", line 168, in <module>
    nb_epoch=args.epoch, verbose=1, saver=saver
  File "./train_generative_model.py", line 85, in train_model
    g_loss, samples, xs = g_train(x, z, counter)
  File "/home/kamal/Desktop/research/models/autoencoder.py", line 241, in train_g
    outs = sess.run(outputs + updates, feed_dict={Img: images, Z: z, Z2: z2, K.learning_phase(): 1})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Nan in summary histogram for: HistogramSummary
[[Node: HistogramSummary = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](HistogramSummary/tag, autoencoder/add_28/_221)]]
Caused by op u'HistogramSummary', defined at:
  File "./train_generative_model.py", line 159, in <module>
    g_train, d_train, sampler, saver, loader, extras = get_model(sess=sess, name=args.name, batch_size=args.batch, gpu=args.gpu)
  File "/home/kamal/Desktop/research/models/autoencoder.py", line 204, in get_model
    sum_e_mean = tf.histogram_summary("e_mean", E_mean)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/logging_ops.py", line 125, in histogram_summary
    tag=tag, values=values, name=scope)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 100, in _histogram_summary
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()
Again, this happens at seemingly random epochs (1, 3, 18, or 23 in different runs), so I can only get so far into training before the error appears. Any ideas? I tried setting the learning rate to 0.0001, but the error persisted.
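
In case it helps narrow things down, here is a minimal sketch of the kind of guard that could be wrapped around the values returned by each training step, to report which loss first becomes NaN or infinite before the histogram summary fails. The helper name `first_non_finite` and the sample numbers are hypothetical and are not part of my actual code; it only assumes the losses can be converted to plain Python floats.

```python
import math


def first_non_finite(named_values):
    """Return the label of the first NaN or +/-Inf value, or None if all are finite.

    `named_values` is a list of (label, value) pairs, e.g. the losses
    returned by one training step. The name is hypothetical.
    """
    for name, value in named_values:
        v = float(value)
        # math.isnan / math.isinf exist in both Python 2.7 and Python 3.
        if math.isnan(v) or math.isinf(v):
            return name
    return None


# Made-up per-batch values in the style of the progress-bar output above.
batch = [("g_loss", 6.1948), ("d_loss", float("nan")), ("d_loss_fake", 1.0048)]
bad = first_non_finite(batch)
if bad is not None:
    print("non-finite value in: %s" % bad)
```

Calling something like this after each `g_train` / `d_train` step would at least show whether the generator or discriminator loss goes bad first, and in which batch.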