jiangoforit / yellowfin

auto-tuning momentum SGD optimizer

License: Apache License 2.0

Python 94.62% Shell 5.38%

yellowfin's Issues

bug? lr command line argument is ignored for YF and instead 1.0 is used

In https://github.com/JianGoForIt/YellowFin/blob/master/char-rnn-tensorflow/model.py#L92
the lr is hard-coded to 1 instead of the command line argument value.
Later, in https://github.com/JianGoForIt/YellowFin/blob/master/char-rnn-tensorflow/train_YF.py#L138
the learning rate is set to the command line argument value, but for YF this has no effect because model.lr is never connected to the YF optimizer. (For Adam and SGD it works because model.lr is passed as the learning rate.)

`lr` vs `learning_rate`

Just gave yellowfin a try yesterday and it works nicely! Just a minor comment/suggestion:

What do you think about renaming lr to learning_rate for consistency with other TensorFlow optimizers? I can open a small PR.

Can you update it for the new version of TF?

Issue comparing to default optimizer setting in cifar10 in tensorflow tutorials

I tried replacing the optimizer in the cifar10 example from the TensorFlow tutorials with YellowFin, but it performed much worse than the original SGD with learning rate decay.

The original code is:

  with tf.control_dependencies([loss_averages_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(total_loss)

  # Apply gradients.
  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

My code is:

  with tf.control_dependencies([loss_averages_op]):
    opt = YFOptimizer(lr=1.0, mu=0.0)
    # opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    grads = opt.compute_gradients(total_loss)

  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

I simply copied Zehaos's yellowfin.py, which adds a compute_gradients function.

Did I miss something?

Cannot use optimizer on GPU device.

Running on GPU device I get the following error:

Cannot assign a device for operation 'apply_updates/exDeepFm/embedding/embedding_layer/YellowFin': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:                                                                                                                
Colocation group had the following types and devices:                                                                                 
SparseApplyMomentum: CPU                                                                                                   
Shape: GPU CPU                                                                                                                        
Square: GPU CPU                                                                                                                       
Unique: GPU CPU                                                                                                          
Cast: GPU CPU                                                                                                              
UnsortedSegmentSum: GPU CPU                                                                                                           
Identity: GPU CPU                                                                                                                     
Assign: GPU CPU                                                                                                                       
StridedSlice: GPU CPU                                                                                                                 
Const: GPU CPU                                                                                                                        
VariableV2: GPU CPU
TruncatedNormal: GPU CPU
Gather: GPU CPU
Fill: GPU CPU
Mul: GPU CPU
Add: GPU CPU
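
Each row of the colocation debug info above appears to pair an op type with the devices that have a registered kernel for it. In this log only SparseApplyMomentum lists CPU alone, which would explain why the whole group cannot be pinned to '/device:GPU:0'. The following is a minimal sketch for picking such ops out of a pasted log; the function name cpu_only_ops and the shortened sample log are illustrative assumptions, not part of YellowFin or TensorFlow.

```python
def cpu_only_ops(debug_info):
    """Return op types whose listed devices do not include GPU."""
    blocked = []
    for line in debug_info.splitlines():
        op_type, sep, devices = line.partition(":")
        if not sep:
            continue
        device_list = devices.split()
        # Skip header or message lines: they list no devices, or contain
        # tokens other than the plain device names CPU and GPU.
        if not device_list or not set(device_list) <= {"CPU", "GPU"}:
            continue
        if "GPU" not in device_list:
            blocked.append(op_type.strip())
    return blocked


# Shortened copy of the log from this issue (assumed sample input).
log = """Colocation Debug Info:
Colocation group had the following types and devices:
SparseApplyMomentum: CPU
Shape: GPU CPU
UnsortedSegmentSum: GPU CPU
Gather: GPU CPU
"""
print(cpu_only_ops(log))
```

In TensorFlow 1.x, creating the session with tf.ConfigProto(allow_soft_placement=True) lets the runtime fall back to CPU for ops without a GPU kernel, which may be enough to get past this error, at the cost of running the sparse update on the CPU.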

LinAlgError (Array must not contain infs or NaNs) thrown in get_mu_tensor

Below is a simple piece of code to try YellowFin on my dataset.

x = tf.placeholder( tf.float32, [ None, train_x.shape[ 1 ] ] )
y = tf.placeholder( tf.float32, [ None, train_y.shape[ 1 ] ] )
m = tf.layers.dense( x, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, train_y.shape[ 1 ] )
prediction = tf.nn.softmax( m )
loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits( labels=y, logits=m ) )
optimizer = yellowfin.YFOptimizer().minimize( loss )

s = tf.Session()
s.run( tf.global_variables_initializer() )
for epoch in range( epochs ):
    _, h = s.run( [ optimizer, loss ], feed_dict={ x: train_x, y: train_y } )

Usually, it crashes and throws the following exception.

Caused by op 'update_hyper/cond/PyFuncStateless', defined at:
  File "test2.py", line 47, in <module>
    optimizer = yf.YFOptimizer( learning_rate=1., momentum=0. ).minimize( loss )
  File "/data/python-mp-test/libs/yellowfin.py", line 268, in minimize
    return self.apply_gradients(grads_and_vars)
  File "/data/python-mp-test/libs/yellowfin.py", line 223, in apply_gradients
    update_hyper_op = self.update_hyper_param()
  File "/data/python-mp-test/libs/yellowfin.py", line 191, in update_hyper_param
    lambda: self._mu_var) )
  File "/usr/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1814, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1689, in BuildCondBranch
    original_result = fn()
  File "/data/python-mp-test/libs/yellowfin.py", line 190, in <lambda>
    self._mu = tf.identity(tf.cond(self._do_tune, lambda: self.get_mu_tensor(),
  File "/data/python-mp-test/libs/yellowfin.py", line 173, in get_mu_tensor
    roots = tf.py_func(np.roots, [coef], Tout=tf.complex64, stateful=False)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 201, in py_func
    input=inp, token=token, Tout=Tout, name=name)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/gen_script_ops.py", line 56, in _py_func_stateless
    Tout=Tout, name=name)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

UnknownError (see above for traceback): LinAlgError: Array must not contain infs or NaNs
	 [[Node: update_hyper/cond/PyFuncStateless = PyFuncStateless[Tin=[DT_FLOAT], Tout=[DT_COMPLEX64], token="pyfunc_0", _device="/job:localhost/replica:0/task:0/cpu:0"](update_hyper/cond/ScatterUpdate)]]
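
The traceback shows np.roots being called through tf.py_func in get_mu_tensor, and NumPy raising LinAlgError because the coefficient array contains inf or NaN. One possible stopgap is to check the coefficients before handing them to the root finder and to fall back to a previous value otherwise. The sketch below assumes this approach; guarded_roots, linear_root and the fallback argument are hypothetical names, and linear_root is only a stand-in for np.roots so the example runs without NumPy.

```python
import math


def guarded_roots(coef, find_roots, fallback):
    """Call find_roots(coef) only when every coefficient is finite."""
    if not all(math.isfinite(c) for c in coef):
        # Non-finite input: skip the solver and return the fallback value.
        return fallback
    return find_roots(coef)


def linear_root(coef):
    """Stand-in solver (assumption): root of coef[0] * x + coef[1] = 0."""
    return [-coef[1] / coef[0]]


ok = guarded_roots([2.0, -4.0], linear_root, fallback=None)
bad = guarded_roots([float("nan"), 1.0], linear_root, fallback=None)
print(ok, bad)
```

A guard like this only hides the symptom. The non-finite values are produced somewhere upstream of get_mu_tensor, so it is still worth checking whether the loss or the gradients become NaN or inf before the crash.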

Global Step is not updating?

As seen in Zehaos/MobileNet#27 -- the global step does not update after each training step has been taken. Is there a fix to this coming up soon? I have tried both the older version of yellowfin.py in that issue and also the latest one available. In both instances, the global step doesn't update.

I believe the issue comes from the global variable existing only within the optimizer but not globally. As a quick fix, I moved the definition of the global step (at https://github.com/JianGoForIt/YellowFin/blob/master/tuner_utils/yellowfin.py#L60) out of the optimizer and directly in the graph, before feeding in this variable back to the optimizer.

Is there a cleaner solution to this?

Potentially dead tf.assert Op

It seems that there is an assert operation which might never be evaluated:
tf.assert_equal(tf.size(root), tf.constant(1) )
https://github.com/JianGoForIt/YellowFin/blob/master/tuner_utils/yellowfin.py#L180

Newer versions of TF produce an error with that, and it was probably not the intended behaviour.

Related issue: tensorflow/tensorflow#11315

Change license approval to integrate YF in T2T

@JianGoForIt As I said in other issues, I have been adapting YF for use in tensor2tensor. My PR to integrate YF into T2T raised a license problem: once the PR is accepted it will override your MIT License, so the T2T authors need your approval to keep the PR. Otherwise we cannot use your code. This is the PR.

No such file or directory: '/tmp/pip-build-jykvuD/YellowFin/README.md'

(tensorflow) ➜  models git:(master) ✗ pip install  YellowFin
Collecting YellowFin
  Using cached Yellowfin-1.0.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-jykvuD/YellowFin/setup.py", line 7, in <module>
        with open(path.join(here, 'README.md'), encoding='utf-8') as f:
      File "/home/canoe/Project/tensorflow/lib/python2.7/codecs.py", line 896, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 2] No such file or directory: '/tmp/pip-build-jykvuD/YellowFin/README.md'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-jykvuD/YellowFin/
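
The traceback shows setup.py (line 7) opening README.md, which does not seem to be included in the Yellowfin-1.0.2.tar.gz source distribution. A minimal sketch of a more tolerant setup.py helper follows, assuming the long description is optional; read_long_description and its default argument are hypothetical names, not taken from the YellowFin repository. It uses io.open so the encoding argument works on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def read_long_description(here, filename="README.md", default=""):
    """Return the README text, or `default` when the file is not shipped."""
    readme_path = os.path.join(here, filename)
    if not os.path.exists(readme_path):
        return default
    with io.open(readme_path, encoding="utf-8") as f:
        return f.read()


# Demonstration in a scratch directory: first without, then with a README.
demo_dir = tempfile.mkdtemp()
missing = read_long_description(demo_dir, default="YellowFin")
with io.open(os.path.join(demo_dir, "README.md"), "w", encoding="utf-8") as f:
    f.write(u"# YellowFin\n")
present = read_long_description(demo_dir)
print(repr(missing), repr(present))
```

On the packaging side, listing the file in MANIFEST.in (for example with the line "include README.md") is the usual way to make sure it ends up in the source distribution.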

Proof of performance on a simple logistic regression task

How do I use it for ordinary logistic regression?
Did you test it on a simple logistic regression task to show that your code is better?

Add YellowFin to tensor2tensor

I am trying to adapt YellowFin so that it can be used as an optimizer in tensor2tensor (which uses tensorflow>=1.2.0rc1), but unfortunately I cannot debug this error:

Steps to reproduce

Clone this repo.
Launch the starter.sh script (inside a Docker container is better).
(Optional Docker container command) nvidia-docker run -it -v $(pwd):/t2t -p 6006:6006 -w /t2t tensorflow/tensorflow:latest-devel-gpu.

Error

Using YellowFin
INFO:tensorflow:Computing gradients for global model_fn.
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Operation'>):
<tf.Operation 'training/update_hyper/cond/assert_equal/Assert/Assert' type=Assert>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 6, in <module>\n    exec(compile(open(__file__).read(), __file__, \'exec\'))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>\n    tf.app.run()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n    _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main\n    schedule=FLAGS.schedule)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run\n    run_locally(exp_fn(output_dir))', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally\n    exp.train_and_evaluate()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate\n    self.train(delay_secs=0)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n    hooks=self._train_monitors + extra_hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train\n    monitors=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n    loss = self._train_model(input_fn=input_fn, hooks=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 955, in _train_model\n    model_fn_ops = self._get_train_ops(features, labels)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1162, in _get_train_ops\n    return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN)', 'File 
"/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1133, in _call_model_fn\n    model_fn_results = self._model_fn(features, labels, **kwargs)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 520, in model_fn\n    colocate_gradients_with_ops=True)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 293, in optimize_loss\n    name="train")', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 1154, in apply_gradients\n    gradients, global_step=global_step, name=name)', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 222, in apply_gradients\n    update_hyper_op = self.update_hyper_param()', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 190, in update_hyper_param\n    lambda: self._mu_var) )', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1814, in cond\n    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1689, in BuildCondBranch\n    original_result = fn()', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 189, in <lambda>\n    self._mu = tf.identity(tf.cond(self._do_tune, lambda: self.get_mu_tensor(),', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 180, in get_mu_tensor\n    tf.assert_equal(tf.size(root), tf.constant(1) )', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/check_ops.py", line 318, in assert_equal\n    return control_flow_ops.Assert(condition, data, summarize=summarize)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n    return _add_should_use_warning(fn(*args, **kwargs))', 'File 
"/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n    wrapped = TFShouldUseWarningWrapper(x)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n    stack = [s.strip() for s in traceback.format_stack()]']
==================================
INFO:tensorflow:Global model_fn finished.
INFO:tensorflow:Create CheckpointSaverHook.
2017-07-06 14:31:31.807218: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.807260: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.807285: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.855132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-06 14:31:31.855471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 670MX
major: 3 minor: 0 memoryClockRate (GHz) 0.601
pciBusID 0000:01:00.0
Total memory: 2.94GiB
Free memory: 2.60GiB
2017-07-06 14:31:31.855541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-07-06 14:31:31.855567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-07-06 14:31:31.855606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 670MX, pci bus id: 0000:01:00.0)
2017-07-06 14:31:32.895272: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895276: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895446: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895327: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895466: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895573: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895625: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895675: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895693: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895545: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.897115: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.901863: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.902270: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.902804: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.903010: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.903597: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.904450: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.904735: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.907982: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:33.041912: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run
    run_locally(exp_fn(output_dir))
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally
    exp.train_and_evaluate()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model
    config=self._session_config
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 412, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]

Caused by op u'global_step/read', defined at:
  File "/usr/local/bin/t2t-trainer", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run
    run_locally(exp_fn(output_dir))
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally
    exp.train_and_evaluate()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 952, in _train_model
    global_step = contrib_framework.create_global_step(g)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 133, in create_global_step
    return training_util.create_global_step(graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/training_util.py", line 119, in create_global_step
    collections=[ops.GraphKeys.GLOBAL_VARIABLES, ops.GraphKeys.GLOBAL_STEP])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 367, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 200, in __init__
    expected_shape=expected_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 319, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1303, in identity
    result = _op_def_lib.apply_op("Identity", input=input, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value global_step
	 [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]

ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables_1/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 6, in <module>\n    exec(compile(open(__file__).read(), __file__, \'exec\'))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>\n    tf.app.run()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n    _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main\n    schedule=FLAGS.schedule)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run\n    run_locally(exp_fn(output_dir))', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally\n    exp.train_and_evaluate()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate\n    self.train(delay_secs=0)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n    hooks=self._train_monitors + extra_hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train\n    monitors=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n    loss = self._train_model(input_fn=input_fn, hooks=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model\n    config=self._session_config', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession\n    stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__\n    
stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__\n    self._sess = _RecoverableSession(self._coordinated_creator)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__\n    _WrappedSession.__init__(self, self._create_session())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session\n    return self._sess_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session\n    self.tf_sess = self._session_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session\n    self._scaffold.finalize()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 192, in finalize\n    default_ready_for_local_init_op)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 254, in get_or_default\n    op = default_constructor()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 189, in default_ready_for_local_init_op\n    variables.global_variables())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n    return _add_should_use_warning(fn(*args, **kwargs))', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n    wrapped = TFShouldUseWarningWrapper(x)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n    stack = [s.strip() for s in traceback.format_stack()]']
==================================

If you do not want to help or contribute, please close the issue and forgive me.
Otherwise, I would appreciate any help :)

I've also tried to write YellowFin as a tf.train.Optimizer, but working at the C++ level seems to be beyond my skills at the moment...

Keras compatibility -- easy addition?

Hi! Thanks for posting this code. I thought that I would give YF a try as a drop-in optimizer. Currently, I am using Keras, and I was able to modify your code to run on Keras models by doing the following:

Adding a compute_gradients standalone method
Adding a few checks for gradients being None in apply_gradients and after_apply
Wrapping the YFOptimizer object in a Keras TFOptimizer

However, while it runs and my loss goes down, I am not 100% sure I did everything properly. Do you think you might consider adding this support?

Cannot operate on graphs with gradients that are None.

The error is thrown during the after_apply operation in the yellowfin class. My suggestion is to screen for Nones during the apply_gradients operation, similar to what apply_gradients does in the official TensorFlow repo:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/optimizer.py#L426
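
A minimal sketch of the screening suggested above, written in plain Python so it runs without TensorFlow. The helper name drop_none_gradients and the string/float stand-ins for variables and gradients are assumptions for illustration only; this is not the code in yellowfin.py, just one way the (gradient, variable) pairs could be filtered before any op such as tf.square is applied to them.

```python
def drop_none_gradients(grads_and_vars):
    """Keep only the (gradient, variable) pairs whose gradient is not None.

    Hypothetical helper: a variable that does not influence the loss gets a
    None gradient, and passing None on to later ops raises a ValueError.
    """
    kept = [(g, v) for g, v in grads_and_vars if g is not None]
    if not kept:
        # Nothing left to optimize, so fail early with a clear message.
        raise ValueError("No gradients provided for any variable.")
    return kept


# Plain-Python stand-ins for tensors and variables, just to show the behavior.
pairs = [(0.5, "w1"), (None, "unused_bias"), (-1.25, "w2")]
filtered = drop_none_gradients(pairs)
grads = [g for g, _ in filtered]
```

With a filter like this, statistics such as the squared gradients would be built from grads only, so a variable that is in var_list but unused by the loss would no longer break the update.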

AttributeError: 'YFOptimizer' object has no attribute 'compute_gradients'

As title.

variable mu_update_interval is not used

The mu_update_interval parameter is not used by the optimizer. Is it declared for future work? Maybe to schedule mu during training?

error in running yellowfin_test

Hello,
I want to use YF in my own code, so I first tried to run yellowfin_test.py, but it raised an AssertionError at line 88 of the code. Any help is appreciated!

Open Source License

Thanks for sharing the yellowfin code on GitHub. I just tried it out in one of my projects and got good results. I am wondering whether you plan to add an open source license in the future, so that people can use it in their projects (with proper acknowledgement, of course) and don't have to remove the yellowfin code sections when sharing those projects, e.g., on GitHub.

Swap in replacement of AdamOptimizer causes crash

        self.opt_q =  YFOptimizer().minimize(self.vae_discriminator_loss, var_list=q_vars)

  File "xxx\src\yellowfin.py", line 215, in apply_gradients
    after_apply_op = self.after_apply()
  File "xxx\src\yellowfin.py", line 139, in after_apply
    self._grad_squared.append(tf.square(g) )
  File "C:\Miniconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 412, in square
    return gen_math_ops.square(x, name=name)
  File "C:\Miniconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 2585, in square
    result = _op_def_lib.apply_op("Square", x=x, name=name)
  File "C:\Miniconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 509, in apply_op
    (input_name, err))
ValueError: Tried to convert 'x' to a tensor and failed. Error: None values not supported.
PS xxx>

If I switch back to Adam, it works fine. Not sure what is up.

Bad performance in multiple GPUs

I used YellowFin to train ResNet50 on ImageNet using 4 K80 GPUs and got bad performance. After 50k steps, the training loss was about 6, while SGD without momentum and learning rate decay reached about 4.7. Any idea what causes this?

PyPI release?

It would be interesting to try out YFOptimizer but it's too tricky to install the package at the moment. Is a PyPI release in the works so we can do pip install yellowfin?
