
yellowfin's People

Contributors

jiangoforit, jmhessel, mfernezir

yellowfin's Issues

bug? lr command line argument is ignored for YF and instead 1.0 is used

At https://github.com/JianGoForIt/YellowFin/blob/master/char-rnn-tensorflow/model.py#L92
the learning rate for YF is set to 1.0 rather than to the command-line argument value.
Later, at https://github.com/JianGoForIt/YellowFin/blob/master/char-rnn-tensorflow/train_YF.py#L138,
the learning rate is set to the command-line argument value, but for YF this has no effect because model.lr was never connected to the YF optimizer. For Adam and SGD it works because model.lr is passed to them as the learning rate.
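To make the difference concrete, here is a minimal sketch in plain Python rather than TensorFlow. `LearningRate` and `ToyOptimizer` are hypothetical stand-ins for `model.lr` and the optimizers; they are not taken from the repository.

```python
class LearningRate:
    """Hypothetical stand-in for a mutable variable such as model.lr."""

    def __init__(self, value):
        self.value = value


class ToyOptimizer:
    """Hypothetical optimizer; `lr` is either a number or a LearningRate holder."""

    def __init__(self, lr):
        self._lr = lr

    def effective_lr(self):
        # A holder is read at use time; a plain number is frozen at construction.
        if isinstance(self._lr, LearningRate):
            return self._lr.value
        return self._lr


model_lr = LearningRate(1.0)
connected = ToyOptimizer(model_lr)  # like Adam/SGD here: receives model.lr
disconnected = ToyOptimizer(1.0)    # like the YF line: constant, no link to model.lr

model_lr.value = 0.002              # the later assignment from the command line
```

After the assignment, only `connected` sees 0.002; `disconnected` still uses 1.0, which is the behaviour described above.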

`lr` vs `learning_rate`

I gave YellowFin a try yesterday and it works nicely! Just a minor suggestion:

What do you think about renaming lr to learning_rate for consistency with other TensorFlow optimizers? I can open a small PR.
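One way such a rename could keep existing callers working is to accept both keywords for a while. This is only a sketch under assumptions: `resolve_learning_rate` is a hypothetical helper, and the default of 1.0 is chosen for illustration.

```python
import warnings


def resolve_learning_rate(learning_rate=None, lr=None, default=1.0):
    """Hypothetical helper: prefer `learning_rate`, still accept the old `lr`."""
    if lr is not None:
        if learning_rate is not None:
            raise TypeError("pass either learning_rate or lr, not both")
        warnings.warn("lr is deprecated; use learning_rate",
                      DeprecationWarning, stacklevel=2)
        return lr
    return default if learning_rate is None else learning_rate


# Record the warning instead of printing it, to show the old keyword still works.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = resolve_learning_rate(lr=0.1)

current = resolve_learning_rate(learning_rate=0.5)
```

An optimizer constructor could call a helper like this so that old `lr=...` call sites keep running while emitting a deprecation warning.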

Issue comparing to default optimizer setting in cifar10 in tensorflow tutorials

I tried replacing the optimizer in the CIFAR-10 example from the TensorFlow tutorials with YellowFin, but it performed much worse than the original SGD with learning-rate decay.

The original code is:

  with tf.control_dependencies([loss_averages_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(total_loss)

  # Apply gradients.
  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

My code is:

  with tf.control_dependencies([loss_averages_op]):
    opt = YFOptimizer(lr=1.0, mu=0.0)
    # opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    grads = opt.compute_gradients(total_loss)

  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

I copied yellowfin.py from Zehaos's version, which adds a compute_gradients function.

Did I miss something?

Cannot use optimizer on GPU device.

Running on GPU device I get the following error:

Cannot assign a device for operation 'apply_updates/exDeepFm/embedding/embedding_layer/YellowFin': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices:
SparseApplyMomentum: CPU
Shape: GPU CPU
Square: GPU CPU
Unique: GPU CPU
Cast: GPU CPU
UnsortedSegmentSum: GPU CPU
Identity: GPU CPU
Assign: GPU CPU
StridedSlice: GPU CPU
Const: GPU CPU
VariableV2: GPU CPU
TruncatedNormal: GPU CPU
Gather: GPU CPU
Fill: GPU CPU
Mul: GPU CPU
Add: GPU CPU
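
The colocation listing above gives, for each op type, the devices that have a kernel for it. A small sketch for picking out the op types with no GPU kernel from such a listing follows; `op_types_without_gpu_kernel` is a hypothetical helper, and the sample lines are copied from the log above.

```python
def op_types_without_gpu_kernel(debug_lines):
    """Return op types whose listed devices do not include GPU."""
    missing = []
    for line in debug_lines:
        op_type, sep, devices = line.partition(":")
        # Skip header lines such as "Colocation Debug Info:" that list no devices.
        if not sep or not devices.strip():
            continue
        if "GPU" not in devices.split():
            missing.append(op_type.strip())
    return missing


sample = [
    "Colocation Debug Info:",
    "Colocation group had the following types and devices:",
    "SparseApplyMomentum: CPU",
    "Shape: GPU CPU",
    "UnsortedSegmentSum: GPU CPU",
]
print(op_types_without_gpu_kernel(sample))  # ['SparseApplyMomentum']
```

For this log the only CPU-only entry is SparseApplyMomentum. TensorFlow 1.x also offers `allow_soft_placement=True` in `tf.ConfigProto`, which lets ops without a GPU kernel fall back to CPU; whether that resolves this particular graph is not confirmed here.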

LinAlgError (Array must not contain infs or NaNs) thrown in get_mu_tensor

Below is a simple piece of code to try YellowFin on my dataset.

x = tf.placeholder( tf.float32, [ None, train_x.shape[ 1 ] ] )
y = tf.placeholder( tf.float32, [ None, train_y.shape[ 1 ] ] )
m = tf.layers.dense( x, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, train_y.shape[ 1 ] )
prediction = tf.nn.softmax( m )
loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits( labels=y, logits=m ) )
optimizer = yellowfin.YFOptimizer().minimize( loss )

s = tf.Session()
s.run( tf.global_variables_initializer() )
for epoch in range( epochs ):
    _, h = s.run( [ optimizer, loss ], feed_dict={ x: train_x, y: train_y } )

Usually, it crashes and throws the following exception.

Caused by op 'update_hyper/cond/PyFuncStateless', defined at:
  File "test2.py", line 47, in <module>
    optimizer = yf.YFOptimizer( learning_rate=1., momentum=0. ).minimize( loss )
  File "/data/python-mp-test/libs/yellowfin.py", line 268, in minimize
    return self.apply_gradients(grads_and_vars)
  File "/data/python-mp-test/libs/yellowfin.py", line 223, in apply_gradients
    update_hyper_op = self.update_hyper_param()
  File "/data/python-mp-test/libs/yellowfin.py", line 191, in update_hyper_param
    lambda: self._mu_var) )
  File "/usr/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1814, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1689, in BuildCondBranch
    original_result = fn()
  File "/data/python-mp-test/libs/yellowfin.py", line 190, in <lambda>
    self._mu = tf.identity(tf.cond(self._do_tune, lambda: self.get_mu_tensor(),
  File "/data/python-mp-test/libs/yellowfin.py", line 173, in get_mu_tensor
    roots = tf.py_func(np.roots, [coef], Tout=tf.complex64, stateful=False)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 201, in py_func
    input=inp, token=token, Tout=Tout, name=name)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/gen_script_ops.py", line 56, in _py_func_stateless
    Tout=Tout, name=name)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

UnknownError (see above for traceback): LinAlgError: Array must not contain infs or NaNs
	 [[Node: update_hyper/cond/PyFuncStateless = PyFuncStateless[Tin=[DT_FLOAT], Tout=[DT_COMPLEX64], token="pyfunc_0", _device="/job:localhost/replica:0/task:0/cpu:0"](update_hyper/cond/ScatterUpdate)]]
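
The traceback shows np.roots rejecting a coefficient array that contains inf or NaN. As a rough sketch of a defensive check (not a fix for whatever produces the non-finite values), the coefficients could be tested before the root solve and the previous momentum kept when they are unusable. `all_finite`, `next_mu`, and `fake_solve` are hypothetical names, and `fake_solve` merely stands in for the real root computation.

```python
import math


def all_finite(coef):
    """True when every coefficient is a finite number (no inf, no NaN)."""
    return all(math.isfinite(float(c)) for c in coef)


def next_mu(coef, solve, previous_mu):
    """Hypothetical helper: solve for mu only when the coefficients are usable."""
    if not all_finite(coef):
        # Skip the root solve and keep the last good value.
        return previous_mu
    return solve(coef)


def fake_solve(coef):
    # Stand-in for the real root computation; always returns 0.9 here.
    return 0.9


kept = next_mu([1.0, float("nan"), 0.0, -1.0], fake_solve, previous_mu=0.5)
updated = next_mu([1.0, 2.0, 3.0, -1.0], fake_solve, previous_mu=0.5)
```

Here `kept` stays at 0.5 because of the NaN coefficient, while `updated` takes the solver's value. A guard like this only hides the symptom; the non-finite gradients or loss would still need to be tracked down.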

Global Step is not updating?

As seen in Zehaos/MobileNet#27, the global step is not updated after each training step. Is a fix for this coming soon? I have tried both the older version of yellowfin.py from that issue and the latest one available; in both cases the global step does not update.

I believe the issue comes from the global step variable existing only inside the optimizer rather than in the surrounding graph. As a quick fix, I moved the definition of the global step (at https://github.com/JianGoForIt/YellowFin/blob/master/tuner_utils/yellowfin.py#L60) out of the optimizer and into the graph, then passed that variable back into the optimizer.

Is there a cleaner solution to this?
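
The quick fix described above can be pictured with a toy in plain Python. `Counter` and `ToyOptimizer` are hypothetical stand-ins for the global step variable and the optimizer, not the repository's code; the point is only that the optimizer increments the counter the caller passes in, alongside its own private one.

```python
class Counter:
    """Hypothetical stand-in for a step variable."""

    def __init__(self):
        self.value = 0


class ToyOptimizer:
    """Hypothetical optimizer that keeps a private step and updates a shared one."""

    def __init__(self):
        self._own_step = Counter()  # visible only inside the optimizer

    def apply_gradients(self, global_step=None):
        self._own_step.value += 1
        if global_step is not None:
            # Also advance the caller's step so code outside the optimizer sees it.
            global_step.value += 1


shared_step = Counter()             # defined outside the optimizer
opt = ToyOptimizer()
for _ in range(3):
    opt.apply_gradients(global_step=shared_step)
```

After three calls `shared_step.value` is 3. If `global_step` were never passed in, only the private counter would advance, which matches the symptom reported here.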

Change license approval to integrate YF in T2T

@JianGoForIt, as I said in other issues, I have been adapting YF for use in tensor2tensor. My PR to integrate YF into T2T has raised a license problem: once the PR is accepted, it will override your MIT License, so the T2T authors need your approval to keep the PR; otherwise we cannot use your code. This is the PR.

no such file or directory: '/tmp/pip-build-jykvuD/YellowFin/README.md

(tensorflow) ➜  models git:(master) ✗ pip install YellowFin
Collecting YellowFin
  Using cached Yellowfin-1.0.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-jykvuD/YellowFin/setup.py", line 7, in <module>
        with open(path.join(here, 'README.md'), encoding='utf-8') as f:
      File "/home/canoe/Project/tensorflow/lib/python2.7/codecs.py", line 896, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 2] No such file or directory: '/tmp/pip-build-jykvuD/YellowFin/README.md'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-jykvuD/YellowFin/
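
The traceback shows setup.py opening README.md, which is absent from the downloaded archive. One possible workaround, sketched under assumptions, is to make that read tolerant of a missing file; `read_long_description` is a hypothetical helper, not part of the package. The demo part uses `tempfile.TemporaryDirectory`, which needs Python 3.

```python
import io
import os
import tempfile


def read_long_description(here, filename="README.md"):
    """Return the README text, or an empty string when the file is missing."""
    path = os.path.join(here, filename)
    if not os.path.exists(path):
        return ""
    # io.open accepts `encoding` on both Python 2 and Python 3.
    with io.open(path, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    missing = read_long_description(tmp)   # no README.md yet
    with io.open(os.path.join(tmp, "README.md"), "w", encoding="utf-8") as f:
        f.write(u"YellowFin\n")
    present = read_long_description(tmp)   # README.md now exists
```

Shipping the file in the source distribution, for example with an `include README.md` line in MANIFEST.in, would address the cause rather than the symptom.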

Add YellowFin to tensor2tensor

I am trying to adapt YellowFin for use as an optimizer in tensor2tensor (which uses tensorflow>=1.2.0rc1), but unfortunately I cannot debug this error:

Steps to reproduce

  1. Clone this repo.
  2. Launch the starter.sh script (inside a Docker container is better).
  3. (Optional Docker container command) nvidia-docker run -it -v $(pwd):/t2t -p 6006:6006 -w /t2t tensorflow/tensorflow:latest-devel-gpu.

Error

Using YellowFin
INFO:tensorflow:Computing gradients for global model_fn.
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Operation'>):
<tf.Operation 'training/update_hyper/cond/assert_equal/Assert/Assert' type=Assert>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 6, in <module>\n    exec(compile(open(__file__).read(), __file__, \'exec\'))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>\n    tf.app.run()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n    _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main\n    schedule=FLAGS.schedule)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run\n    run_locally(exp_fn(output_dir))', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally\n    exp.train_and_evaluate()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate\n    self.train(delay_secs=0)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n    hooks=self._train_monitors + extra_hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train\n    monitors=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n    loss = self._train_model(input_fn=input_fn, hooks=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 955, in _train_model\n    model_fn_ops = self._get_train_ops(features, labels)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1162, in _get_train_ops\n    return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN)', 'File 
"/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1133, in _call_model_fn\n    model_fn_results = self._model_fn(features, labels, **kwargs)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 520, in model_fn\n    colocate_gradients_with_ops=True)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 293, in optimize_loss\n    name="train")', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 1154, in apply_gradients\n    gradients, global_step=global_step, name=name)', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 222, in apply_gradients\n    update_hyper_op = self.update_hyper_param()', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 190, in update_hyper_param\n    lambda: self._mu_var) )', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1814, in cond\n    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1689, in BuildCondBranch\n    original_result = fn()', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 189, in <lambda>\n    self._mu = tf.identity(tf.cond(self._do_tune, lambda: self.get_mu_tensor(),', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 180, in get_mu_tensor\n    tf.assert_equal(tf.size(root), tf.constant(1) )', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/check_ops.py", line 318, in assert_equal\n    return control_flow_ops.Assert(condition, data, summarize=summarize)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n    return _add_should_use_warning(fn(*args, **kwargs))', 'File 
"/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n    wrapped = TFShouldUseWarningWrapper(x)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n    stack = [s.strip() for s in traceback.format_stack()]']
==================================
INFO:tensorflow:Global model_fn finished.
INFO:tensorflow:Create CheckpointSaverHook.
2017-07-06 14:31:31.807218: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.807260: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.807285: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.855132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-06 14:31:31.855471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 670MX
major: 3 minor: 0 memoryClockRate (GHz) 0.601
pciBusID 0000:01:00.0
Total memory: 2.94GiB
Free memory: 2.60GiB
2017-07-06 14:31:31.855541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-07-06 14:31:31.855567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-07-06 14:31:31.855606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 670MX, pci bus id: 0000:01:00.0)
2017-07-06 14:31:32.895272: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895276: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895446: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895327: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895466: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895573: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895625: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895675: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895693: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895545: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.897115: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.901863: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.902270: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.902804: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.903010: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.903597: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.904450: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.904735: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.907982: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:33.041912: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run
    run_locally(exp_fn(output_dir))
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally
    exp.train_and_evaluate()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model
    config=self._session_config
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 412, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]

Caused by op u'global_step/read', defined at:
  File "/usr/local/bin/t2t-trainer", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run
    run_locally(exp_fn(output_dir))
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally
    exp.train_and_evaluate()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 952, in _train_model
    global_step = contrib_framework.create_global_step(g)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 133, in create_global_step
    return training_util.create_global_step(graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/training_util.py", line 119, in create_global_step
    collections=[ops.GraphKeys.GLOBAL_VARIABLES, ops.GraphKeys.GLOBAL_STEP])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 367, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 200, in __init__
    expected_shape=expected_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 319, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1303, in identity
    result = _op_def_lib.apply_op("Identity", input=input, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]

ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables_1/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 6, in <module>\n    exec(compile(open(__file__).read(), __file__, \'exec\'))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>\n    tf.app.run()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n    _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main\n    schedule=FLAGS.schedule)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run\n    run_locally(exp_fn(output_dir))', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally\n    exp.train_and_evaluate()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate\n    self.train(delay_secs=0)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n    hooks=self._train_monitors + extra_hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train\n    monitors=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n    loss = self._train_model(input_fn=input_fn, hooks=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model\n    config=self._session_config', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession\n    stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__\n    
stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__\n    self._sess = _RecoverableSession(self._coordinated_creator)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__\n    _WrappedSession.__init__(self, self._create_session())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session\n    return self._sess_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session\n    self.tf_sess = self._session_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session\n    self._scaffold.finalize()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 192, in finalize\n    default_ready_for_local_init_op)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 254, in get_or_default\n    op = default_constructor()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 189, in default_ready_for_local_init_op\n    variables.global_variables())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n    return _add_should_use_warning(fn(*args, **kwargs))', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n    wrapped = TFShouldUseWarningWrapper(x)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n    stack = [s.strip() for s in traceback.format_stack()]']
==================================

If you do not want to help or contribute, please close the issue and forgive me.
Otherwise, I will appreciate any help :)

I've also tried to write YellowFin as a tf.train.Optimizer, but working at the C++ level seems to be beyond my skills at the moment...

Keras compatibility -- easy addition?

Hi! Thanks for posting this code. I thought that I would give YF a try as a drop-in optimizer. Currently, I am using Keras, and I was able to modify your code to run on Keras models by doing the following:

  • Adding a compute_gradients standalone method
  • Adding a few checks for gradients being None in apply_gradients and after_apply
  • Wrapping the YFOptimizer object in a Keras TFOptimizer

However, while it runs and my loss goes down -- I am not 100% sure I did everything properly. Do you think you might consider adding this support?

error in running yellowfin_test

Hello,
I want to use YF in my own code, so I first tried to run yellowfin_test.py, but it raised an AssertionError at line 88 of the code. Any help is appreciated!

Open Source License

Thanks for sharing the yellowfin code on GitHub. I just tried it out in one of my projects and got good results. I am wondering whether you are planning to add an open source license in the future, so that people can use it in their projects (with proper acknowledgement, of course) and don't have to remove the yellowfin code sections when sharing those projects, e.g., on GitHub.

Swap in replacement of AdamOptimizer causes crash

        self.opt_q =  YFOptimizer().minimize(self.vae_discriminator_loss, var_list=q_vars)
  File "xxx\src\yellowfin.py", line 215, in apply_gradients
    after_apply_op = self.after_apply()
  File "xxx\src\yellowfin.py", line 139, in after_apply
    self._grad_squared.append(tf.square(g) )
  File "C:\Miniconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 412, in square
    return gen_math_ops.square(x, name=name)
  File "C:\Miniconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2585, in square
    result = _op_def_lib.apply_op("Square", x=x, name=name)
  File "C:\Miniconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 509, in apply_op
    (input_name, err))
ValueError: Tried to convert 'x' to a tensor and failed. Error: None values not supported.
PS xxx>

If I switch back to Adam, it works fine. Not sure what is up.
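
The traceback shows tf.square being handed None. TensorFlow's gradient computation returns None for any variable in var_list that the loss does not depend on, so one possible workaround is to drop those pairs before any per-gradient work is done. The sketch below uses plain Python stand-ins so it runs without TensorFlow. The helper name drop_none_gradients and the sample values are illustrative assumptions, not part of yellowfin.py.

```python
def drop_none_gradients(grads_and_vars):
    """Keep only (gradient, variable) pairs whose gradient is not None."""
    return [(g, v) for g, v in grads_and_vars if g is not None]


# Hypothetical stand-ins for (gradient, variable) pairs; the middle
# variable is not connected to the loss, so its gradient is None.
pairs = [(0.5, "q/w"), (None, "p/w"), (-2.0, "q/b")]

kept = drop_none_gradients(pairs)

# Squaring is now safe because no None value is left in the list.
grad_squared = [g * g for g, _ in kept]
```

In the real optimizer the same filter would have to run before the gradients are squared in after_apply. Whether the variables with None gradients should be in q_vars at all is a separate question worth checking.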

Bad performance in multiple GPUs

I used YellowFin to train ResNet-50 on ImageNet using 4 K80 GPUs and got bad performance. After 50k steps, the training loss was still about 6, while SGD without momentum and learning rate decay reached about 4.7. Any idea what causes this?

PyPI release?

It would be interesting to try out YFOptimizer but it's too tricky to install the package at the moment. Is a PyPI release in the works so we can do pip install yellowfin?
