
deeprec-ai / deeprec


DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.

License: Apache License 2.0

Starlark 2.43% Shell 0.49% Batchfile 0.02% Python 33.01% Dockerfile 0.05% CMake 0.14% Makefile 0.07% HTML 3.04% C++ 55.93% Cuda 0.13% Jupyter Notebook 1.89% C 0.58% MLIR 1.32% SWIG 0.11% Cython 0.01% LLVM 0.01% Java 0.57% Objective-C 0.06% Objective-C++ 0.14% Ruby 0.01%
advertising deep-learning distributed-training machine-learning python recommendation-engine scalability search-engine

deeprec's People

Contributors

aaroey, alextp, allenlavoie, andrewharp, annarev, asimshankar, benoitsteiner, caisq, ebrevdo, ezhulenev, facaiy, feihugis, gunan, hawkinsp, ilblackdragon, jdduke, jsimsa, liutongxuan, markdaoust, martinwicke, mihaimaruseac, mrry, nouiz, petewarden, rohan100jain, skye, tensorflower-gardener, terrytangyuan, yifeif, yongtang

deeprec's Issues

[Smartstaged] After enabling smartstaged feature in distributed training with modelzoo code, an error occurs.

After enabling smartstaged feature in distributed training with modelzoo code, an error occurs.

Other info / logs

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: From /job:ps/replica:0/task:0:
Output 30 of type float does not match declared output type int64 for node {{node prefetch_2/DataBufferTake}}
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 840, in <module>
    main(tf_config, server)
  File "train.py", line 610, in main
    checkpoint_dir, tf_config, server)
  File "train.py", line 480, in train
    sess.run([model.loss, model.train_op])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: From /job:ps/replica:0/task:0:
Output 30 of type float does not match declared output type int64 for node node prefetch_2/DataBufferTake (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748)

Support multiple evaluators in DeepRec

Background

At present, DeepRec cannot evaluate very large models on a single node. Multiple ps nodes are required to load a large model, and multiple workers are needed for distributed evaluation. Supporting this would let DeepRec cover more scenarios.

Implementation ideas

Unlike training models, evaluating models does not require modifying the network structure to improve model accuracy, but instead requires consideration of how to improve the throughput of model evaluation and reduce evaluation latency. DeepRec already supports distributed training, and the evaluation is actually simpler compared to the training process because no updates to ps are involved. In the code, DeepRec first decides whether to initialize the cluster and how to initialize it according to the parameters.

Two modes of distributed multi-evaluator evaluation need to be implemented.
1. Mode 1 contains ps, worker, and evaluator nodes. DeepRec already implements the single-evaluator case in this mode; we need to support multiple evaluators. One idea is to add multiple evaluators directly to the initialization list of the distributed cluster in DeepRec, or to use the tf.distribute.Strategy interface.
2. Mode 2 has only ps and evaluator nodes. Unlike mode 1, no training is needed: the already-trained offline model is loaded into the ps nodes and evaluated directly.
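
As a rough illustration of the two modes, the sketch below builds a TF_CONFIG-style JSON string that lists several evaluators next to the ps and worker nodes. This is only one possible reading of the "add evaluators to the initialization list" idea: the helper name build_tf_config and all host names are hypothetical, and nothing here is confirmed DeepRec behavior.

```python
import json


def build_tf_config(ps, workers, evaluators, task_type, task_index):
    """Return a TF_CONFIG-style JSON string for one node of the cluster.

    Hypothetical helper: listing several evaluators in the cluster dict is
    the idea proposed in this issue, not an existing DeepRec feature.
    """
    cluster = {"ps": list(ps), "worker": list(workers), "evaluator": list(evaluators)}
    # Drop empty roles so that mode 2 (no workers) yields a clean spec.
    cluster = {role: hosts for role, hosts in cluster.items() if hosts}
    if task_type not in cluster:
        raise ValueError("unknown task type: %s" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index out of range for %s" % task_type)
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": task_index}})


# Mode 1: ps, worker and two evaluators (host names are placeholders).
mode1 = build_tf_config(
    ps=["ps0:2222"], workers=["worker0:2222", "worker1:2222"],
    evaluators=["eval0:2222", "eval1:2222"],
    task_type="evaluator", task_index=1)

# Mode 2: only ps and evaluator nodes; the trained model is loaded into ps.
mode2 = build_tf_config(
    ps=["ps0:2222", "ps1:2222"], workers=[],
    evaluators=["eval0:2222", "eval1:2222"],
    task_type="evaluator", task_index=0)
```

Each evaluator process would receive the same cluster section and its own task index, so the two modes differ only in whether a worker list is present.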

Error log when running the launch script in a PMEM memkind environment

When I build a PMEM memkind environment from the latest commit and run the launch script, the following error appears.

1. The commit version I used

(image)

2. The build options I used

bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --copt="-L/usr/local/lib" --copt="-lpmem" --copt="-lmemkind" --config=opt //tensorflow/tools/pip_package:build_pip_package

3. The script I used

numactl -N 1 ./launch.sh --batch_size=1280 --dim_size=512 --max_mock_id_amplify=1800 --num_steps=2000 --ev_storage=pmem_memkind

4. Error logs

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Traceback (most recent call last):
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:ps/replica:0/task:0:
MultiLevel EV's Cache size -1 should large than IDs in batch 1280
[[{{node fm/embedding_lookup_36}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./benchmark.py", line 228, in
tf.app.run()
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "./benchmark.py", line 203, in main
sess.run(train_op)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
run_metadata=run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
run_metadata=run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
raise six.reraise(*original_exc_info)
File "/home/pai/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
return self._sess.run(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
run_metadata=run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
return self._sess.run(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:ps/replica:0/task:0:
MultiLevel EV's Cache size -1 should large than IDs in batch 1280
[[node fm/embedding_lookup_36 (defined at /home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'fm/embedding_lookup_36':
File "./benchmark.py", line 228, in
tf.app.run()
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "./benchmark.py", line 121, in main
tf.nn.embedding_lookup(fm_w, batch['col{}'.format(sidx)]))
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 418, in embedding_lookup
counts=counts)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 184, in _embedding_lookup_and_transform
counts=counts),
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 3958, in gather
counts=counts)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/kv_variable_ops.py", line 749, in sparse_read
name=name)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_kv_variable_ops.py", line 647, in kv_resource_gather
validate_indices=validate_indices, name=name)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()

[Modelzoo] DIN and DIEN perf drop based on r1.15.5-deeprec2201 tag.

Modelzoo performance test based on the commit "[Release] Update DeepRec release version to 1.15.5+deeprec2201. (#43)".
Test machines: Alibaba Cloud ECS general purpose instance family with high clock speeds - ecs.hfg7.2xlarge.

Test perf result:

| Gstep | WDL value | WDL percent | DLRM value | DLRM percent | DeepFM value | DeepFM percent | DSSM value | DSSM percent | DIEN value | DIEN percent | DIN value | DIN percent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Community TF | 31.92626 | baseline | 82.09168 | baseline | 37.20978 | baseline | 18.54726 | baseline | 14.62987 | baseline | 18.57746 | baseline |
| DeepRec FP32 | 34.69318 | 108.67% | 105.4547 | 128.46% | 43.31713 | 116.41% | 21.64175 | 116.68% | 13.27125 | 90.71% | 17.6932 | 95.24% |
| DeepRec BF16 | 49.38222 | 154.68% | 114.2221 | 139.14% | 47.34401 | 127.24% | 23.13698 | 124.75% | 13.0392 | 89.13% | 17.20525 | 92.61% |
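
For reference, the percent columns appear to be each DeepRec value divided by the Community TF value for the same model. The short sketch below recomputes the FP32 row from the numbers in the table above; the function name percent_of_baseline is only illustrative.

```python
# Gstep values copied from the table above.
community_tf = {"WDL": 31.92626, "DLRM": 82.09168, "DeepFM": 37.20978,
                "DSSM": 18.54726, "DIEN": 14.62987, "DIN": 18.57746}
deeprec_fp32 = {"WDL": 34.69318, "DLRM": 105.4547, "DeepFM": 43.31713,
                "DSSM": 21.64175, "DIEN": 13.27125, "DIN": 17.6932}


def percent_of_baseline(value, baseline):
    """Express value as a percentage of the baseline measurement."""
    return 100.0 * value / baseline


relative = {model: percent_of_baseline(deeprec_fp32[model], community_tf[model])
            for model in community_tf}

# Models whose throughput fell below the Community TF baseline.
regressions = sorted(model for model, pct in relative.items() if pct < 100.0)

for model, pct in relative.items():
    print("%-7s %.2f%%" % (model, pct))
```

With these inputs only DIEN and DIN fall below 100%, which matches the drop reported in the issue title.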

Test AUC result:

| AUC | WDL value | WDL percent | DLRM value | DLRM percent | DeepFM value | DeepFM percent | DSSM value | DSSM percent | DIEN value | DIEN percent | DIN value | DIN percent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Community TF | 0.775168 | baseline | 0.768852 | baseline | 0.744794 | baseline | 0.504404 | baseline | 0.8443 | baseline | 0.7887 | baseline |
| DeepRec FP32 | 0.775515 | 100.04% | 0.771128 | 100.30% | 0.746055 | 100.17% | 0.503653 | 99.85% | 0.8472 | 100.34% | 0.7913 | 100.33% |
| DeepRec BF16 | 0.77604 | 100.11% | 0.772185 | 100.43% | 0.741192 | 99.52% | 0.492327 | 97.61% | 0.8358 | 98.99% | 0.7883 | 99.95% |

Note: the DSSM dataset is small, so its ACC and AUC are limited.

[Auto Graph Fusion] An error occurred when Auto Graph Fusion enabled in modelzoo's DIEN.

An error occurred when Auto Graph Fusion enabled in modelzoo's DIEN.

Reproduce the issue
The code and dataset are provided in a Docker image: docker pull cesg-prc-registry.cn-beijing.cr.aliyuncs.com/cesg-ali/deeprec-modelzoo:220401-dien-issue
The DeepRec build installed in the image is based on commit f4368d6.
Run the following commands to reproduce the issue.

cd /root/modelzoo/DIEN
python train.py --steps 100 --no_eval --op_fusion True
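
The reproduction command passes the fusion switch as the string "True". The sketch below shows one way such flags could be parsed with argparse; it is not the modelzoo's actual parser, and the str2bool helper is a hypothetical name. It is included because argparse's type=bool treats any non-empty string, including "False", as True.

```python
import argparse


def str2bool(text):
    """Convert a command-line string such as 'True' or 'False' to a bool."""
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % text)


def build_parser():
    # Flag names mirror the command in the issue; defaults are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--steps", type=int, default=0)
    parser.add_argument("--no_eval", action="store_true")
    parser.add_argument("--op_fusion", type=str2bool, default=False)
    return parser


args = build_parser().parse_args(
    ["--steps", "100", "--no_eval", "--op_fusion", "True"])
```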

Other info / logs

2022-04-01 02:58:35.554337: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/head/loss/xentropy/Select_grad/zeros_like
2022-04-01 02:58:35.554414: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/Select_1
2022-04-01 02:58:35.554462: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1
2022-04-01 02:58:35.554552: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/attention_layer/Select_grad/zeros_like
2022-04-01 02:58:35.554612: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/attention_layer/Select_grad/Select_1
2022-04-01 02:58:35.554668: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/attention_layer/Select_grad/tuple/control_dependency_1
2022-04-01 02:58:35.554933: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/zeros_like
2022-04-01 02:58:35.554993: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/Select
2022-04-01 02:58:35.555030: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/tuple/control_dependency
2022-04-01 02:58:35.555062: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_4_grad/zeros_like
2022-04-01 02:58:35.555117: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_4_grad/Select
2022-04-01 02:58:35.555166: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_4_grad/tuple/control_dependency
2022-04-01 02:58:35.555187: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_5_grad/zeros_like
2022-04-01 02:58:35.555234: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_5_grad/Select
2022-04-01 02:58:35.555279: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_5_grad/tuple/control_dependency
2022-04-01 02:58:35.555318: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/rnn_1/gru1/while/Select_grad/zeros_like
2022-04-01 02:58:35.555383: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/rnn_1/gru1/while/Select_grad/Select
2022-04-01 02:58:35.555449: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/rnn_1/gru1/while/Select_grad/tuple/control_dependency
2022-04-01 02:58:35.555466: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_grad/zeros_like
2022-04-01 02:58:35.555530: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_grad/Select
2022-04-01 02:58:35.555594: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_grad/tuple/control_dependency
2022-04-01 02:58:35.555610: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_1_grad/zeros_like
2022-04-01 02:58:35.555673: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_1_grad/Select
2022-04-01 02:58:35.555737: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_1_grad/tuple/control_dependency
2022-04-01 02:58:35.555764: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_2_grad/zeros_like
2022-04-01 02:58:35.555842: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_2_grad/Select
2022-04-01 02:58:35.555920: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_2_grad/tuple/control_dependency
2022-04-01 02:58:35.555937: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_3_grad/zeros_like
2022-04-01 02:58:35.556015: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_3_grad/Select
2022-04-01 02:58:35.556092: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_3_grad/tuple/control_dependency
2022-04-01 02:58:35.556208: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/input_layer/UID_embedding/UID_embedding_weights]
2022-04-01 02:58:35.556260: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup]
2022-04-01 02:58:35.556278: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_1]
2022-04-01 02:58:35.556294: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_2]
2022-04-01 02:58:35.556312: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_3]
2022-04-01 02:58:35.556330: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_4]
2022-04-01 02:58:35.556346: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_5]
2022-04-01 02:58:35.556676: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar] match op[head/loss/xentropy/Select]
2022-04-01 02:58:35.556988: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_grad/Select]
2022-04-01 02:58:35.557014: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select]
2022-04-01 02:58:35.557041: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/rnn_2/gru2/while/Select_1_grad/Select]
2022-04-01 02:58:35.557072: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/attention_layer/Select_grad/Select]
2022-04-01 02:58:35.557095: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/rnn_1/gru1/while/Select_1_grad/Select]
2022-04-01 02:58:35.557373: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select_1]
2022-04-01 02:58:35.557415: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/Select_1]
2022-04-01 02:58:35.557431: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/rnn_2/gru2/while/Select_1_grad/Select_1]
2022-04-01 02:58:35.557453: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_4_grad/Select_1]
2022-04-01 02:58:35.557466: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_5_grad/Select_1]
2022-04-01 02:58:35.557495: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/rnn_1/gru1/while/Select_1_grad/Select_1]
2022-04-01 02:58:35.557509: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/rnn_1/gru1/while/Select_grad/Select_1]
2022-04-01 02:58:35.557523: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_grad/Select_1]
2022-04-01 02:58:35.557536: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_1_grad/Select_1]
2022-04-01 02:58:35.557560: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_2_grad/Select_1]
2022-04-01 02:58:35.557572: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_3_grad/Select_1]
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-04-01 02:58:37.395205: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: {{node head/gradients/rnn_2/gru2/while/add_1_grad/Reshape}} has inputs from different frames. The input {{node head/gradients/rnn_2/gru2/while/add_1_grad/BroadcastGradientArgs/StackPopV2}} is in frame 'head/gradients/rnn_2/gru2/while/while_context'. The input {{node head/gradients/rnn_2/gru2/while/add_1_grad/Sum}} is in frame ''.
2022-04-01 02:58:37.878245: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: {{node head/gradients/rnn_2/gru2/while/Switch_3_grad/b_switch}} has inputs from different frames. The input {{node head/gradients/rnn_2/gru2/while/Switch_3_grad_1/NextIteration}} is in frame ''. The input {{node head/gradients/rnn_2/gru2/while/Exit_3_grad/b_exit}} is in frame 'head/gradients/rnn_2/gru2/while/while_context'.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{node fused_op_3_select_else_scalar_in_grad}} has inputs from different frames. The input {{node head/gradients/rnn_2/gru2/while/Select_1_grad/Select/StackPopV2}} is in frame 'head/gradients/rnn_2/gru2/while/while_context'. The input {{node head/clip_by_norm_25/Greater/y}} is in frame ''.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 1147, in <module>
    main()
  File "train.py", line 927, in main
    checkpoint_dir, tf_config, server)
  File "train.py", line 786, in train
    sess.run([model.loss, model.train_op])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{node fused_op_3_select_else_scalar_in_grad}} has inputs from different frames. The input node head/gradients/rnn_2/gru2/while/Select_1_grad/Select/StackPopV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748)  is in frame 'head/gradients/rnn_2/gru2/while/while_context'. The input node head/clip_by_norm_25/Greater/y (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748)  is in frame ''.

[Auto Micro Batch] Enabling the Auto Micro Batch feature in modelzoo's WDL produces an error.

I tried to enable the Auto Micro Batch feature in WDL by following the steps in the DeepRec docs, but I get an error.

Code to reproduce the issue
I use the following code to enable Auto Graph Fusion. The full code is linked in the original issue as "Full code".

        if args.op_fusion and not args.tf:
            '''Auto Graph Fusion'''
            sess_config.graph_options.optimizer_options.do_op_fusion = True

Running python train.py --steps 1000 --no_eval --micro_batch 2 with the WDL dataset reproduces the error.
Setting --micro_batch (micro_batch_num) to 1 works fine.
The docs say "The AutoMicroBatch feature depends on the user enabling the graph optimization option". Does that mean Auto Graph Fusion? It can be enabled with --op_fusion True, but I get the same error. I also ran into trouble enabling Auto Graph Fusion; see issue #126.

This seems to be caused by how the dataset is initialized in MonitoredTrainingSession, so this issue is different from #86, which uses tf.Session().
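
For context, one common way to initialize an iterator under MonitoredTrainingSession in stock TensorFlow 1.x is to add iterator.initializer to the Scaffold's local_init_op. The minimal sketch below uses a toy dataset and assumes the tf.compat.v1 API is available. It only illustrates the general pattern; it may not help here, since the failing node name (IteratorGetNext_1/dup0) suggests the micro-batch rewrite creates its own copy of the GetNext op.

```python
import tensorflow.compat.v1 as tf


def first_batch():
    """Run one GetNext() under MonitoredTrainingSession with an initialized iterator."""
    with tf.Graph().as_default():
        # Toy dataset standing in for the real input pipeline.
        dataset = tf.data.Dataset.range(10).batch(2)
        iterator = tf.data.make_initializable_iterator(dataset)
        next_batch = iterator.get_next()

        # Keep the default local init ops and add the iterator initializer,
        # so it runs when the session is created or recovered.
        local_init_op = tf.group(
            tf.local_variables_initializer(),
            tf.tables_initializer(),
            iterator.initializer)
        scaffold = tf.train.Scaffold(local_init_op=local_init_op)

        with tf.train.MonitoredTrainingSession(scaffold=scaffold) as sess:
            return sess.run(next_batch)


batch = first_batch()
```

Without the iterator.initializer entry in local_init_op, the same sess.run call would raise the FailedPreconditionError shown in the logs below.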

logs

INFO:tensorflow:Parsing ./data/train.csv
INFO:tensorflow:Parsing ./data/eval.csv
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Init incremental saver , incremental_save:False, incremental_path:./result/model_WIDE_AND_DEEP_1648002155/.incremental_checkpoint/incremental_model.ckpt
INFO:tensorflow:Graph was finalized.
2022-03-23 10:22:39.913346: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2022-03-23 10:22:39.932151: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x556fea568950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-23 10:22:39.932183: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:run without loading checkpoint
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_WIDE_AND_DEEP_1648002155/model.ckpt.
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
Using TensorFlow version 1.15.5
Checking dataset...
Numbers of training dataset is 8000000
Numbers of test dataset is 2000000
The training steps is 100
The testing steps is 7813
Saving model checkpoints to ./result/model_WIDE_AND_DEEP_1648002155
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
	 [[{{node IteratorGetNext_1/dup0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_rebuild.py", line 746, in <module>
    main()
  File "train_rebuild.py", line 542, in main
    checkpoint_dir, tf_config, server)
  File "train_rebuild.py", line 414, in train
    sess.run([model.loss, model.train_op])
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
    raise six.reraise(*original_exc_info)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
    return self._sess.run(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
    return self._sess.run(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
	 [[{{node IteratorGetNext_1/dup0}}]]

ERROR: bias_op_gpu.cu.pic.d (No such file or directory) when building from source

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Centos7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: build from main branch
  • Python version: 3.6
  • Installed using virtualenv? pip? conda?: no
  • Bazel version (if compiling from source): 0.24.1
  • GCC/Compiler version (if compiling from source): 7.3.1
  • CUDA/cuDNN version: 11.6/8
  • GPU model and memory: V100 16G

Describe the problem

ERROR: /DeepRec/tensorflow/core/kernels/BUILD:4695:1: error while parsing .d file: /root/.cache/bazel/_bazel_root/de860b3f457ade81f033a15040b8fdd2/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/core/kernels/_objs/bias_op_gpu/bias_op_gpu.cu.pic.d (No such file or directory)
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/config.h:33:0,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/execution_policy.h:35,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/iterator/detail/device_system_tag.h:23,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/iterator/detail/iterator_facade_category.h:22,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/iterator/iterator_facade.h:37,
                 from bazel-out/host/bin/external/cub_archive/_virtual_includes/cub/third_party/cub/device/../iterator/arg_index_input_iterator.cuh:48,
                 from bazel-out/host/bin/external/cub_archive/_virtual_includes/cub/third_party/cub/device/device_reduce.cuh:41,
                 from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:27,
                 from tensorflow/core/kernels/bias_op_gpu.cu.cc:28:
/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_namespace.cuh:46:2: error: #error CUB requires a definition of CUB_NS_QUALIFIER when CUB_NS_PREFIX/POSTFIX are defined.
 #error CUB requires a definition of CUB_NS_QUALIFIER when CUB_NS_PREFIX/POSTFIX are defined.
  ^~~~~
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 41.305s, Critical Path: 36.17s
INFO: 928 processes: 928 local.
FAILED: Build did NOT complete successfully

Provide the exact sequence of commands / steps that you executed before running into the problem
step1:
./configure
Here is the content of .tf_configure.bazelrc:

build --action_env PYTHON_BIN_PATH="/usr/bin/python3"
build --action_env PYTHON_LIB_PATH="/usr/lib64/python3.6/site-packages"
build --python_path="/usr/bin/python3"
build:xla --define with_xla_support=true
build --config=xla
build:star --define with_star_support=true
build:pmem --define with_pmem_support=true
build --action_env TF_USE_CCACHE="0"
build --action_env CUDA_TOOLKIT_PATH="/usr/local/cuda"
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="7.0,8.0,8.6,6.1"
build --action_env LD_LIBRARY_PATH="/usr/lib64:/usr/local/lib64:/usr/local/lib64:/usr/local/cuda/lib64:/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:/opt/rh/devtoolset-7/root/usr/lib64/dyninst:/opt/rh/devtoolset-7/root/usr/lib/dyninst:/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:/usr/lib64/:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/:/usr/lib64/:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
build --action_env GCC_HOST_COMPILER_PATH="/opt/rh/devtoolset-7/root/usr/bin/gcc"
build --config=cuda
build:opt --copt=-march=native
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-march=native
build:opt --define with_default_optimizations=true
build:v2 --define=tf_api_version=2
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_tag_filters=-benchmark-test,-no_oss,-oss_serial
test --build_tag_filters=-benchmark-test,-no_oss
test --test_tag_filters=-gpu
test --build_tag_filters=-gpu
build --action_env TF_CONFIGURE_IOS="0"

step2:
bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt --copt=-march=native //tensorflow/tools/pip_package:build_pip_package --verbose_failures

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

Build from source and import error "cannot import name saver"

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
  • TensorFlow version:1.15
  • Python version:2.7
  • Installed using virtualenv? pip? conda?:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source): g++ 7.5
  • CUDA/cuDNN version:
  • GPU model and memory:

Describe the problem

ERROR: /DeepRec/tensorflow/BUILD:893:1: Executing genrule //tensorflow:tf_python_api_gen_v1 failed (Exit 1)
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 27, in <module>
    from tensorflow.python.tools.api.generator import doc_srcs
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 73, in <module>
    from tensorflow.python.ops.standard_ops import *
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/ops/standard_ops.py", line 25, in <module>
    from tensorflow.python import autograph
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/__init__.py", line 35, in <module>
    from tensorflow.python.autograph import operators
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/__init__.py", line 40, in <module>
    from tensorflow.python.autograph.operators.control_flow import for_stmt
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/control_flow.py", line 65, in <module>
    from tensorflow.python.autograph.operators import py_builtins
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/py_builtins.py", line 30, in <module>
    from tensorflow.python.data.ops import dataset_ops
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/__init__.py", line 25, in <module>
    from tensorflow.python.data import experimental
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/experimental/__init__.py", line 89, in <module>
    from tensorflow.python.data.experimental.ops.batching import dense_to_sparse_batch
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/experimental/ops/batching.py", line 20, in <module>
    from tensorflow.python.data.ops import dataset_ops
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/ops/dataset_ops.py", line 40, in <module>
    from tensorflow.python.data.ops import iterator_ops
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 35, in <module>
    from tensorflow.python.training.saver import BaseSaverBuilder
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/training/saver.py", line 57, in <module>
    from tensorflow.python.training.saving import saveable_object_util
  File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/training/saving/saveable_object_util.py", line 33, in <module>
    from tensorflow.python.training import saver
ImportError: cannot import name saver
Target //tensorflow/tools/pip_package:build_pip_package failed to build
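
The traceback above ends in a cycle: saver.py imports saveable_object_util.py, which in turn imports saver before saver has finished loading. The sketch below builds a throwaway package with that same shape so the pattern can be tried in isolation. The package name `demo_pkg`, the module name `saveable_util`, and both helper functions are hypothetical and are not part of DeepRec or TensorFlow. On Python 3.5 and later this kind of `from package import module` cycle is expected to import cleanly, because the interpreter falls back to `sys.modules`; the sketch does not reproduce the Python 2.7 failure and is not a diagnosis of the build.

```python
import importlib
import os
import sys
import tempfile


def build_demo_package(root):
    """Write a tiny package whose two modules import each other."""
    pkg_dir = os.path.join(root, "demo_pkg")
    os.makedirs(pkg_dir)
    files = {
        "__init__.py": "",
        # Mirrors saver.py -> saveable_object_util.py in the traceback.
        "saver.py": (
            "from demo_pkg import saveable_util\n"
            "\n"
            "class BaseSaverBuilder(object):\n"
            "    pass\n"
        ),
        # Mirrors saveable_object_util.py -> saver.py in the traceback.
        "saveable_util.py": "from demo_pkg import saver\n",
    }
    for filename, source in files.items():
        with open(os.path.join(pkg_dir, filename), "w") as handle:
            handle.write(source)


def import_demo_saver():
    """Import demo_pkg.saver and report whether the cycle resolved."""
    with tempfile.TemporaryDirectory() as root:
        build_demo_package(root)
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_pkg.saver")
            # The inner module should hold a reference to the outer one.
            return module.saveable_util.saver is module
        finally:
            # Leave the interpreter state as it was found.
            sys.path.remove(root)
            for name in [n for n in sys.modules
                         if n == "demo_pkg" or n.startswith("demo_pkg.")]:
                del sys.modules[name]


print(import_demo_saver())
```

Running the same two-module layout under the interpreter that Bazel actually uses for the genrule is one way to check whether the interpreter version matters here.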

Provide the exact sequence of commands / steps that you executed before running into the problem

bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=1" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=1" -c opt --config=v1 --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

[Optimizer] get_embedding_variable_internal keyword argument error when using a custom optimizer

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.15.5
  • Python version: 3.7.4
  • Bazel version (if compiling from source): 0.24.1
  • GCC/Compiler version (if compiling from source): 7.3.1
  • CUDA/cuDNN version: None
  • GPU model and memory: None

Describe the current behavior
Using get_embedding_variable to create an EmbeddingVariable for embedding lookup raises an unexpected keyword argument error while the slot variable is being created. The detailed error stack is:

  File "/usr/local/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/optimizer.py", line 1302, in _zeros_slot
    new_slot_variable = slot_creator.create_zeros_slot(var, op_name, slot_config=slot_config)
  File "/usr/local/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/slot_creator.py", line 266, in create_zeros_slot
    slot_config=slot_config)
  File "/usr/local/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/slot_creator.py", line 239, in create_slot_with_initializer
    dtype, slot_config)
  File "/usr/local/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/slot_creator.py", line 92, in _create_slot_var
    ht_partition_num=primary._ht_partition_num)
TypeError: get_embedding_variable_internal() got an unexpected keyword argument 'ht_partition_num'

Describe the expected behavior
Training without error.
Maybe _create_slot_var should use get_embedding_variable_v2_internal for all cases.
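
The failure mode in the stack above is an ordinary Python one: a caller passes a keyword that the callee's signature does not declare. A minimal sketch of it follows, using two hypothetical stand-in functions (`create_variable_v1` and `create_variable_v2`). Their names, parameters, and the default value are assumptions for illustration only and do not reflect the real signatures of get_embedding_variable_internal or get_embedding_variable_v2_internal.

```python
def create_variable_v1(name, embedding_dim):
    # Hypothetical stand-in for a factory with the narrower signature.
    return {"name": name, "embedding_dim": embedding_dim}


def create_variable_v2(name, embedding_dim, ht_partition_num=1000):
    # Hypothetical stand-in for a factory that accepts the extra keyword.
    # The default of 1000 is an arbitrary illustrative value.
    return {
        "name": name,
        "embedding_dim": embedding_dim,
        "ht_partition_num": ht_partition_num,
    }


# Passing a keyword the callee does not declare raises TypeError,
# which is the same class of error shown in the stack trace.
try:
    create_variable_v1("slot", 16, ht_partition_num=8)
except TypeError as err:
    print("v1 rejected the call:", err)

# The wider signature accepts the same call.
print(create_variable_v2("slot", 16, ht_partition_num=8))
```

Whether the slot creator should always call the wider factory, as suggested above, depends on the actual DeepRec signatures, which this sketch does not model.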

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

How to use Dynamic-dimension Embedding Variable?

According to the Dynamic-dim Embedding Variable documentation, when using a dynamic dimension embedding variable, a blocknum argument must be passed to embedding_lookup to indicate the blocknum that corresponds to each feature.

But I am confused about how to assign a blocknum to every feature during embedding lookup. Could you please provide a minimal example showing how to initialize and look up a Dynamic-dimension Embedding Variable?

How to adjust the learning rate when the number of workers increases?

For example, I set the original learning rate to 0.001 in standalone mode, and 0.001 / sqrt(10) performs well with 10 workers. But with 20 workers, 0.001 / sqrt(20) performs very badly. Is there any suggestion for adjusting the learning rate as the number of workers increases?
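
The rule described in this question, dividing the standalone learning rate by the square root of the worker count, can be written out as a small helper to make the numbers concrete. This is only a sketch of the arithmetic in the question; the function name `scaled_learning_rate` is hypothetical, and nothing here suggests the rule is the right choice at 20 workers.

```python
import math


def scaled_learning_rate(base_lr, num_workers):
    """Divide the single-worker rate by sqrt(num_workers)."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return base_lr / math.sqrt(num_workers)


# Print the rate this rule gives for the three setups in the question.
for workers in (1, 10, 20):
    print(workers, scaled_learning_rate(0.001, workers))
```

Printing the values side by side shows how small the step from 10 to 20 workers is under this rule, which may help when comparing it against other scaling choices.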

[Adaptive Embedding] After enabling Adaptive embedding, model evaluation fails with modelzoo.

After enabling Adaptive embedding, evaluating the model with modelzoo fails once training has completed.

Code to reproduce the issue
With WDL in modelzoo, run python train.py --steps 100 --adaptive_emb true

Other info / logs

Training completed.                                                                                                                                                                                                                                                     
INFO:tensorflow:Graph was finalized.                                                                                                                                                                                                                                    
INFO:tensorflow:run with loading checkpoint                                                                                                                                                                                                                             
INFO:tensorflow:Restoring parameters from ./result/model_BST_1653893703/model.ckpt-100                                                                                                                                                                                  
2022-05-30 14:56:04.871953: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/adgroup_id_embedding/adgroup_id_embedding_weights][new_name:fused_op_1_select_then_scalar]      
2022-05-30 14:56:04.872043: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/age_level_embedding/age_level_embedding_weights][new_name:fused_op_2_select_then_scalar]        
2022-05-30 14:56:04.872552: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/brand_embedding/brand_embedding_weights][new_name:fused_op_3_select_then_scalar]                
2022-05-30 14:56:04.872924: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/campaign_id_embedding/campaign_id_embedding_weights][new_name:fused_op_4_select_then_scalar]    
2022-05-30 14:56:04.873322: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/cate_id_embedding/cate_id_embedding_weights][new_name:fused_op_5_select_then_scalar]            
2022-05-30 14:56:04.873678: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/cms_group_id_embedding/cms_group_id_embedding_weights][new_name:fused_op_6_select_then_scalar]  
2022-05-30 14:56:04.874156: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/cms_segid_embedding/cms_segid_embedding_weights][new_name:fused_op_7_select_then_scalar]        
2022-05-30 14:56:04.874631: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/customer_embedding/customer_embedding_weights][new_name:fused_op_8_select_then_scalar]          
2022-05-30 14:56:04.875088: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/new_user_class_level_embedding/new_user_class_level_embedding_weights][new_name:fused_op_9_sele$
t_then_scalar]                                                                                                                                                                                                                                                          
2022-05-30 14:56:04.875571: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/occupation_embedding/occupation_embedding_weights][new_name:fused_op_10_select_then_scalar]     
2022-05-30 14:56:04.875981: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/pid_embedding/pid_embedding_weights][new_name:fused_op_11_select_then_scalar]                   
2022-05-30 14:56:04.876455: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/price_embedding/price_embedding_weights][new_name:fused_op_12_select_then_scalar]               
2022-05-30 14:56:04.876896: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/pvalue_level_embedding/pvalue_level_embedding_weights][new_name:fused_op_13_select_then_scalar] 
2022-05-30 14:56:04.877411: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/shopping_level_embedding/shopping_level_embedding_weights][new_name:fused_op_14_select_then_scal
ar]
2022-05-30 14:56:04.877944: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/user_id_embedding/user_id_embedding_weights][new_name:fused_op_15_select_then_scalar]
2022-05-30 14:56:04.880237: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/adgroup_id_embedding/adgroup_id_embedding_weights_grad/Select][new_name:f
used_op_1_select_else_scalar_in_grad]
2022-05-30 14:56:04.880278: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/age_level_embedding/age_level_embedding_weights_grad/Select][new_name:fus
ed_op_2_select_else_scalar_in_grad]
2022-05-30 14:56:04.880297: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/campaign_id_embedding/campaign_id_embedding_weights_grad/Select[101/4331$
:fused_op_3_select_else_scalar_in_grad]                                                                                                                                                                                                                                 
2022-05-30 14:56:04.880316: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cms_group_id_embedding/cms_group_id_embedding_weights_grad/Select][new_na
me:fused_op_4_select_else_scalar_in_grad]                                                                                                                                                                                                                               
2022-05-30 14:56:04.880335: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cms_segid_embedding/cms_segid_embedding_weights_grad/Select][new_name:fus
ed_op_5_select_else_scalar_in_grad]                                                                                                                                                                                                                                     
2022-05-30 14:56:04.880351: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/customer_embedding/customer_embedding_weights_grad/Select][new_name:fused
_op_6_select_else_scalar_in_grad]                                                                                                                                                                                                                                       
2022-05-30 14:56:04.880367: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/new_user_class_level_embedding/new_user_class_level_embedding_weights_gra
d/Select][new_name:fused_op_7_select_else_scalar_in_grad]                                                                                                                                                                                                               
2022-05-30 14:56:04.880383: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/occupation_embedding/occupation_embedding_weights_grad/Select][new_name:f
used_op_8_select_else_scalar_in_grad]                                                                                                                                                                                                                                   
2022-05-30 14:56:04.880398: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/pid_embedding/pid_embedding_weights_grad/Select][new_name:fused_op_9_sele
ct_else_scalar_in_grad]                                                                                                                                                                                                                                                 
2022-05-30 14:56:04.880413: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/price_embedding/price_embedding_weights_grad/Select][new_name:fused_op_10
_select_else_scalar_in_grad]                                                                                                                                                                                                                                            
2022-05-30 14:56:04.880428: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/pvalue_level_embedding/pvalue_level_embedding_weights_grad/Select][new_na
me:fused_op_11_select_else_scalar_in_grad]                                                                                                                                                                                                                              
2022-05-30 14:56:04.880443: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/shopping_level_embedding/shopping_level_embedding_weights_grad/Select][ne
w_name:fused_op_12_select_else_scalar_in_grad]                                                                                                                                                                                                                          
2022-05-30 14:56:04.880458: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/user_id_embedding/user_id_embedding_weights_grad/Select][new_name:fused_o
p_13_select_else_scalar_in_grad]                                                                                                                                                                                                                                        
2022-05-30 14:56:04.880521: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cate_id_embedding/cate_id_embedding_weights_grad/Select][new_name:fused_o
p_14_select_else_scalar_in_grad]                                                                                                                                                                                                                                        
2022-05-30 14:56:04.880537: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/brand_embedding/brand_embedding_weights_grad/Select][new_name:fused_op_15
_select_else_scalar_in_grad]                                                                                                                                                                                                                                            
2022-05-30 14:56:04.881665: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/adgroup_id_embedding/adgroup_id_embedding_weights_grad/Select_1][new_nam$
:fused_op_1_select_then_scalar_in_grad]                                                                                                                                                                                                                                
2022-05-30 14:56:04.881789: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/age_level_embedding/age_level_embedding_weights_grad/Select_1][new_name:fused_op_2_select_then_scalar_in_grad]
2022-05-30 14:56:04.882280: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/campaign_id_embedding/campaign_id_embedding_weights_grad/Select_1][new_name:fused_op_3_select_then_scalar_in_grad]
2022-05-30 14:56:04.882307: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cms_group_id_embedding/cms_group_id_embedding_weights_grad/Select_1][new_name:fused_op_4_select_then_scalar_in_grad]
2022-05-30 14:56:04.882325: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cms_segid_embedding/cms_segid_embedding_weights_grad/Select_1][new_name:fused_op_5_select_then_scalar_in_grad]
2022-05-30 14:56:04.882343: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/customer_embedding/customer_embedding_weights_grad/Select_1][new_name:fused_op_6_select_then_scalar_in_grad]
2022-05-30 14:56:04.882359: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/new_user_class_level_embedding/new_user_class_level_embedding_weights_grad/Select_1][new_name:fused_op_7_select_then_scalar_in_grad]
2022-05-30 14:56:04.882375: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/occupation_embedding/occupation_embedding_weights_grad/Select_1][new_name:fused_op_8_select_then_scalar_in_grad]
2022-05-30 14:56:04.882391: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/pid_embedding/pid_embedding_weights_grad/Select_1][new_name:fused_op_9_select_then_scalar_in_grad]
2022-05-30 14:56:04.882408: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/price_embedding/price_embedding_weights_grad/Select_1][new_name:fused_op_10_select_then_scalar_in_grad]
2022-05-30 14:56:04.882423: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/pvalue_level_embedding/pvalue_level_embedding_weights_grad/Select_1][new_name:fused_op_11_select_then_scalar_in_grad]
2022-05-30 14:56:04.882440: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/shopping_level_embedding/shopping_level_embedding_weights_grad/Select_1][new_name:fused_op_12_select_then_scalar_in_grad]
2022-05-30 14:56:04.882456: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/user_id_embedding/user_id_embedding_weights_grad/Select_1][new_name:fused_op_13_select_then_scalar_in_grad]
2022-05-30 14:56:04.882515: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cate_id_embedding/cate_id_embedding_weights_grad/Select_1][new_name:fused_op_14_select_then_scalar_in_grad]
2022-05-30 14:56:04.882533: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/brand_embedding/brand_embedding_weights_grad/Select_1][new_name:fused_op_15_select_then_scalar_in_grad]
2022-05-30 14:56:05.593580: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:05.598049: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:06.180909: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
INFO:tensorflow:Running local_init_op.                                                                                                                                                                                                                                 
2022-05-30 14:56:06.308357: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:06.309002: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:06.309340: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:06.336448: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
INFO:tensorflow:Done running local_init_op.                                                                                                                                                                                                                   
2022-05-30 14:56:06.649402: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                           
2022-05-30 14:56:06.768921: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:06.882066: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:07.804632: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:07.812211: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:08.707163: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                          
2022-05-30 14:56:10.078551: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                           
2022-05-30 14:56:10.496578: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled
INFO:tensorflow:Prefetching was closed.                                                                                                                                                                                                                      
INFO:tensorflow:Prefetching was closed.                                                                                                                                                                                                                                 
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.                                                                                                                                                                                                                                 
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, data.shape must start with partitions.shape, got data.shape = [272], partitions.shape = [512]
         [[{{node input_layer/unseq_input_layer/input_layer/price_embedding/DynamicPartition_1}}]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:  
                                                                                                                                                   
data.shape must start with partitions.shape, got data.shape = [272], partitions.shape = [512]
         [[{{node input_layer/unseq_input_layer/input_layer/price_embedding/DynamicPartition_1}}]]                                                        
Exception in thread PrefetchThread-PrefetchRunner-4:
Traceback (most recent call last):                                                                                                               
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()                                                                                                                                   
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)                                                                                            
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)                                                                                
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)                                                                          
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)                                                                                                                                    
tensorflow.python.framework.errors_impl.InvalidArgumentError: data.shape must start with partitions.shape, got data.shape = [272], partitions.shape = [512]
         [[{{node input_layer/unseq_input_layer/input_layer/price_embedding/DynamicPartition_1}}]]                                                 
                 
2022-05-30 14:56:10.811248: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:14.841705: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
Traceback (most recent call last):
  File "train.py", line 573, in eval
    [model.acc_op, model.auc_op, merged])
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run                                                                                                                         
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run                                                                                                                        
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1408, in run                                                                                                                        
    raise six.reraise(*original_exc_info)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run                                                                                                                        
    return self._sess.run(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run                                                                                                                        
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run                                                                                                                        
    return self._sess.run(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run                                                                                                                                
    run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call                                                                                                                               
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Session was closed.
         [[node prefetch_2/TensorBufferTake (defined at /home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]                                                                                                

Original stack trace for 'prefetch_2/TensorBufferTake':
  File "train.py", line 907, in <module>
    main()
  File "train.py", line 653, in main
    next_element = tf.staged(next_element, num_threads=8, capacity=40)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch.py", line 140, in staged
    shared_threads=num_clients)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_tensor_buffer_ops.py", line 535, in tensor_buffer_take                                                                                                           
    shared_threads=shared_threads, name=name)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper                                                                                                              
    op_def=op_def)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func                                                                                                                              
    return func(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op                                                                                                                               
    attrs, op_def, compute_device)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal                                                                                                                     
    op_def=op_def)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__                                                                                                                                
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 907, in <module>
    main()
  File "train.py", line 683, in main
    checkpoint_dir)
  File "train.py", line 576, in eval
    print("ACC = {}\nAUC = {}".format(eval_acc, eval_auc))
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 911, in __exit__                                                                                                                    
    self._close_internal(exception_type)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 949, in _close_internal                                                                                                             
    self._sess.close()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1216, in close                                                                                                                      
    self._sess.close()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1384, in close                                                                                                                      
    ignore_live_threads=True)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join                                                                                                                              
    six.reraise(*self._exc_info_to_raise)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 718, in reraise
    raise value.with_traceback(tb)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run                                                                                                                                
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run                                                                                                                  
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun                                                                                                                    
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: data.shape must start with partitions.shape, got data.shape = [272], partitions.shape = [512]                                                                                                            
         [[{{node input_layer/unseq_input_layer/input_layer/price_embedding/DynamicPartition_1}}]]
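
The InvalidArgumentError above comes from the shape check on the DynamicPartition op: the leading dimensions of data must match the shape of partitions, and here data has 272 rows while partitions has 512 entries. The following is a minimal pure-Python sketch of that constraint for the 1-D case. The function dynamic_partition is a hypothetical stand-in written for illustration, not TensorFlow's implementation, and the sketch does not explain why the two shapes diverged in this run.

```python
def dynamic_partition(data, partitions, num_partitions):
    """Illustrative 1-D stand-in for the DynamicPartition op (not TensorFlow code)."""
    # The op requires one partition index per data row.
    if len(data) != len(partitions):
        raise ValueError(
            "data.shape must start with partitions.shape, got data.shape = [%d], "
            "partitions.shape = [%d]" % (len(data), len(partitions)))
    outputs = [[] for _ in range(num_partitions)]
    for value, part in zip(data, partitions):
        outputs[part].append(value)
    return outputs


# Matching lengths: each value is routed to the partition named by its index.
print(dynamic_partition([10, 20, 30, 40], [0, 1, 0, 1], 2))

# Mismatched lengths, using the sizes from the log, are rejected before any routing.
try:
    dynamic_partition(list(range(272)), [0] * 512, 2)
except ValueError as err:
    print(err)
```

When debugging, comparing the batch dimension of the ids tensor with that of the partition assignments just before the embedding lookup should show where the two sizes stop agreeing.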

[Memory Optimization][DIEN] After replacing tf.Session() with tf.train.MonitoredTrainingSession(), collecting info takes a very long time.

After I replaced tf.Session() with tf.train.MonitoredTrainingSession(), a single step spends far too long collecting info for Memory Optimization.
After disabling Memory Optimization with export ENABLE_MEMORY_OPTIMIZATION=0, training runs normally.
Code
Here is the full code of the rebuilt DIEN.
Reproduce the issue
Here is the Docker image to reproduce the issue, with DeepRec built on commit e3f51a3:
docker pull cesg-prc-registry-vpc.cn-beijing.cr.aliyuncs.com/cesg-ali/deeprec-modelzoo:220328-DIEN-issue

cd /root/modelzoo/DIEN
python train.py --steps 300 --no_eval

START_STATISTIC_STEP and STOP_STATISTIC_STEP are set to 100 and 200 in train.py.
The DIEN code from the main branch is in the /root/modelzoo/DIEN-old directory.
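
For readers who want to vary these settings, here is a minimal sketch of setting them from Python. It assumes that START_STATISTIC_STEP and STOP_STATISTIC_STEP are read from the process environment in the same way as ENABLE_MEMORY_OPTIMIZATION, and that they must be set before TensorFlow is imported. Both points are assumptions, and the helper configure_memory_statistics is a hypothetical name, not part of DeepRec.

```python
import os


def configure_memory_statistics(start_step, stop_step, enable=True):
    """Set the statistic window via environment variables (hypothetical helper)."""
    settings = {
        "START_STATISTIC_STEP": str(start_step),
        "STOP_STATISTIC_STEP": str(stop_step),
    }
    if not enable:
        # Same switch as `export ENABLE_MEMORY_OPTIMIZATION=0` in the report.
        settings["ENABLE_MEMORY_OPTIMIZATION"] = "0"
    os.environ.update(settings)
    return settings


# Assumption: call this before `import tensorflow` so the runtime sees the values.
settings = configure_memory_statistics(100, 200, enable=False)
print(settings)
```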

Logs
With START_STATISTIC_STEP and STOP_STATISTIC_STEP set to 100 and 200, step 193 takes far longer than the others.
With the default settings (start at 1000, stop at 1100), one step between 1080 and 1095 takes a long time.

INFO:tensorflow:loss = 0.92963386, steps = 191 (0.206 sec)
INFO:tensorflow:loss = 0.93619776, steps = 192 (0.206 sec)
INFO:tensorflow:loss = 0.96694994, steps = 193 (370.673 sec)
INFO:tensorflow:loss = 0.93243694, steps = 194 (0.208 sec)
INFO:tensorflow:loss = 0.94905794, steps = 195 (0.210 sec)
INFO:tensorflow:loss = 0.9613142, steps = 196 (0.210 sec)
INFO:tensorflow:loss = 0.96409273, steps = 197 (0.209 sec)
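
To find such outlier steps in a longer log, a small script can help. This is a minimal sketch that assumes the "steps = N (T sec)" format shown above. The function name slow_steps and the factor of 10 over the median are illustrative choices, not part of DeepRec.

```python
import re

# Matches the "steps = 193 (370.673 sec)" part of each log line.
STEP_RE = re.compile(r"steps = (\d+) \(([\d.]+) sec\)")


def slow_steps(log_lines, factor=10.0):
    """Return (step, seconds) pairs that exceed factor times the median step time."""
    timings = []
    for line in log_lines:
        match = STEP_RE.search(line)
        if match:
            timings.append((int(match.group(1)), float(match.group(2))))
    if not timings:
        return []
    durations = sorted(seconds for _, seconds in timings)
    median = durations[len(durations) // 2]
    return [(step, seconds) for step, seconds in timings if seconds > factor * median]


log = [
    "INFO:tensorflow:loss = 0.92963386, steps = 191 (0.206 sec)",
    "INFO:tensorflow:loss = 0.93619776, steps = 192 (0.206 sec)",
    "INFO:tensorflow:loss = 0.96694994, steps = 193 (370.673 sec)",
    "INFO:tensorflow:loss = 0.93243694, steps = 194 (0.208 sec)",
    "INFO:tensorflow:loss = 0.94905794, steps = 195 (0.210 sec)",
    "INFO:tensorflow:loss = 0.9613142, steps = 196 (0.210 sec)",
    "INFO:tensorflow:loss = 0.96409273, steps = 197 (0.209 sec)",
]
print(slow_steps(log))  # [(193, 370.673)]
```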

Alibaba Summer of Code (ASOC) 2022

Welcome to the open source world! If you haven't planned how to spend this summer, come to the Alibaba Summer of Code and code with us! 💻

Alibaba Summer of Code is a global program focused on engaging students directly in open source software development. Under the guidance of mentors from Alibaba open source projects, students can experience real-world software development. Alibaba Summer of Code runs from May 30th to September 1st. Students can use the summer to take part in an open source project and work with its core members.

This is a master issue to track the progress and result of Alibaba Summer of Code 2022.

What can you get?

On this exclusive developer journey, students will have the opportunity to:

Participate in the top projects of the International Open Source Foundation;
Get a scholarship from Alibaba;
Obtain an open source contributor certificate;
Get a fast pass for an Alibaba internship;
Get your code adopted and used by the open source project!

Our Mentor

@shanshanpt [email protected]
@candyzone [email protected]
@JackMoriarty [email protected]

Timeline

image

Apply Now!

Browse open idea list here:
#230 Difficulty: Advanced
#232 Difficulty: Basic
#233 Difficulty: Basic
Upload your CV and project proposal via ASOC 2022 official website

Contact the Organizer

If you have any questions, visit the event website: https://opensource.alibaba.com/asoc2022

Email address: [email protected]

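Alibaba Summer of Code (ASOC) 2022

Welcome to the open source world! If you have not yet planned how to spend this summer, come to Alibaba Summer of Code and code with us! 💻

Alibaba Summer of Code is a global program focused on engaging students directly in open source software development. Under the guidance of mentors from Alibaba open source projects, students can experience real-world software development.

Alibaba Summer of Code runs from May 30th to September 1st. Students can use the summer to take part in an open source project and work with its core members.

What can you get by taking part?

On this exclusive developer journey, students will have the opportunity to:

Participate in top projects of international open source foundations;
Get a scholarship from Alibaba;
Obtain an open source contributor certificate;
Get a fast pass for an Alibaba internship;
Get your code adopted and used by the open source project!

Mentors

@shanshanpt [email protected]
@candyzone [email protected]
@JackMoriarty [email protected]

Milestones

image

Apply now!

Browse the following list of topics:
#230 Difficulty: Advanced
#232 Difficulty: Basic
#233 Difficulty: Basic
Upload your CV and project proposal via the ASOC 2022 official website

Contact the organizer

If you have any questions, visit the event website: https://opensource.alibaba.com/asoc2022

Email: [email protected]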

Multiple GPU-Worker Protocol Issue

We first tested the star_server protocol on CPU machines, and the training task ran normally. We now want to switch to GPU machines. The cluster consists of 2 PS nodes and 2 GPU worker nodes.
With the star_server protocol, the training task fails with the error "/job:worker/replica:0/task:0/device:GPU:0 unknown device". With the grpc++ and grpc protocols, the training task runs normally.
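
To make the device name in the error easier to relate to the cluster layout, here is a minimal sketch that lists the GPU device names each worker would be expected to expose. The host names, the port, and the helpers device_name and expected_worker_gpus are hypothetical; only the 2 PS plus 2 worker layout and the device string format come from the report. In TensorFlow 1.x the protocol is normally chosen through the protocol argument of tf.train.Server; this sketch does not start any server.

```python
def device_name(job, task, device_type="GPU", device_index=0, replica=0):
    """Format a device string in the form that appears in the error message."""
    return "/job:%s/replica:%d/task:%d/device:%s:%d" % (
        job, replica, task, device_type, device_index)


# Hypothetical addresses; the 2 PS + 2 worker layout is taken from the report.
cluster = {
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}


def expected_worker_gpus(cluster, gpus_per_worker=1):
    """List the GPU device names the workers should register (illustrative)."""
    return [device_name("worker", task, "GPU", gpu)
            for task in range(len(cluster["worker"]))
            for gpu in range(gpus_per_worker)]


for name in expected_worker_gpus(cluster):
    print(name)
```

Comparing this list with the devices each session actually reports under grpc, grpc++ and star_server would show whether the GPU devices are missing only under star_server.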

[Auto Micro Batch] auto micro batch run error

Git commit ID: 821d157; branch: master

'''
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import numpy as np
import tensorflow as tf

num_x = np.random.randint(0, 10, (500, 10)).astype(dtype=np.float32)
num_y = np.random.randint(0, 10, 500).astype(dtype=np.int64)
dataset = tf.data.Dataset.from_tensor_slices((num_x, num_y)) \
    .batch(10)
iterator = dataset.make_initializable_iterator()

x, labels = iterator.get_next()
outputs = tf.layers.dense(x, 10)

logits = tf.layers.dense(outputs, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels,
                                              logits=logits)

optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

config = tf.ConfigProto()
config.graph_options.optimizer_options.micro_batch_num = 2

with tf.Session(config=config) as sess:
    sess.run(iterator.initializer)
    sess.run(init)
    print("================================")
    train_loss, _ = sess.run([loss, train_op])
    print(' Loss: %s .' % ( train_loss))
'''

error msg

================================
Traceback (most recent call last):
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
         [[{{node IteratorGetNext/dup0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "demo1.py", line 37, in <module>
    train_loss, _ = sess.run([loss, train_op])
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
         [[{{node IteratorGetNext/dup0}}]]

DeepRec shows very low GPU utilization on machines with a particular CPU model

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04 in Docker
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): r1.15.5-deeprec2204-39-g0527d0b2ad8 1.15.5
  • Python version: Python 3.6.9
  • Bazel version (if compiling from source): Bazelisk version: v1.11.0
    Build label: 0.24.1
  • GCC/Compiler version (if compiling from source): gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
  • CUDA/cuDNN version: CUDA=11.4, V11.4.152, cuDNN 8
  • GPU model and memory: NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4, Tesla P100 * 4, 16280MiB

Describe the current behavior
On one type of GPU instance on Aliyun, I built DeepRec from source following the docs at https://github.com/alibaba/DeepRec#how-to-build and confirmed that GPU support was enabled. On this machine, however, my code runs only on the CPU: GPU-Util is always zero and GPU Memory-Usage stays low. Here is a runtime capture:
image

On other machines, the same build and run steps work normally.

Here is the CPU info of a machine that works fine:

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 2499.998
cache size      : 33792 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat
bogomips        : 4999.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Here is the CPU info of the machine that shows low GPU utilization:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
stepping        : 1
microcode       : 0x1
cpu MHz         : 2499.996
cache size      : 40960 KB
physical id     : 0
siblings        : 32
core id         : 0
cpu cores       : 16
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat spec_ctrl intel_stibp
bogomips        : 4999.99
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
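
Since the two listings are long, a set difference over the flags lines makes the gap easier to see. This is a minimal sketch: the helper flag_diff is a hypothetical name, and the two strings are the flags fields copied from the listings above. The diff only shows which instruction-set extensions each CPU reports (for example, the working machine has AVX-512 and the other does not). Whether that has anything to do with the low GPU utilization is not established in this report.

```python
def flag_diff(flags_a, flags_b):
    """Return (only_in_a, only_in_b) as sorted lists of CPU flag names."""
    set_a, set_b = set(flags_a.split()), set(flags_b.split())
    return sorted(set_a - set_b), sorted(set_b - set_a)


# flags field of the machine that works fine (Xeon Platinum 8163)
working = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc "
    "rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 "
    "fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand "
    "hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust "
    "bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx "
    "smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat"
)

# flags field of the machine with low GPU utilization (Xeon E5-2682 v4)
low_util = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc "
    "rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid "
    "sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c "
    "rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb stibp "
    "fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx "
    "smap xsaveopt arat spec_ctrl intel_stibp"
)

only_working, only_low_util = flag_diff(working, low_util)
print("only on the working machine:", " ".join(only_working))
print("only on the low-utilization machine:", " ".join(only_low_util))
```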

[Multi-Hash Variable] Error when Multi-Hash Variable is enabled in modelzoo's DIEN

An error occurs when Multi-Hash Variable is enabled in the modelzoo DIEN model.
The Multi-Hash Variable documentation also needs updating: https://deeprec.readthedocs.io/zh/latest/Multi-Hash-Variable.html
The num_of_partitions parameter of get_multihash_variable has been removed from the code but is still described in the doc.

Multi-Hash Variable appears to conflict with the variable partitioner. The error is "type object 'float' has no attribute 'base_dtype'", even though the float dtype is the default value passed down by the API.
Without the variable partitioner, a different error occurs: "'MultiHashVariable' object has no attribute '_dtype'".
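The first error message can be reproduced without TensorFlow, which may help confirm the diagnosis: the built-in Python type `float` has no `base_dtype` attribute, so any check written for a framework dtype object fails on it. The sketch below is only an illustration. `StubDType` and `is_string_dtype` are hypothetical stand-ins for the framework dtype and for the partitioner check seen in the traceback, not DeepRec code.

```python
class StubDType:
    """Hypothetical stand-in for a framework dtype object such as tf.float32."""

    def __init__(self, name):
        self.name = name

    @property
    def base_dtype(self):
        # A non-reference dtype is its own base dtype.
        return self


def is_string_dtype(dtype):
    """Mimics the shape of the check `dtype.base_dtype == dtypes.string`."""
    return dtype.base_dtype.name == "string"


def try_check(dtype):
    """Run the check and return either its result or the error text."""
    try:
        return is_string_dtype(dtype)
    except AttributeError as exc:
        return str(exc)


ok_result = try_check(StubDType("float32"))  # a dtype object works
bad_result = try_check(float)                # the built-in type does not
print(ok_result)
print(bad_result)
```

If this matches what happens inside get_multihash_variable, passing a framework dtype instead of the Python `float` type down to the partitioner would be one place to look.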

Reproduce the issue
The code and dataset are provided in a Docker image: docker pull cesg-prc-registry.cn-beijing.cr.aliyuncs.com/cesg-ali/deeprec-modelzoo:220401-dien-issue
The DeepRec installed in the image is built from commit f4368d6.
Run the following commands to reproduce the issue.

cd /root/modelzoo/DIEN
python train.py --steps 100 --no_eval --multihash True
# Disable variable partitioner 
python train.py --steps 100 --no_eval --multihash True --input_layer_partitioner 0 --dense_layer_partitioner 0

Other info / logs

Traceback (most recent call last):
  File "train.py", line 1147, in <module>
    main()
  File "train.py", line 903, in main
    dense_layer_partitioner=dense_layer_partitioner)
  File "train.py", line 157, in __init__
    self._create_model()
  File "train.py", line 464, in _create_model
    uid_emb, item_emb, his_item_emb, noclk_his_item_emb, sequence_length = self._embedding_input_layer(
  File "train.py", line 398, in _embedding_input_layer
    self._embedding_dim
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 2344, in get_multihash_variable
    aggregation=aggregation)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 1525, in get_variable
    aggregation=aggregation)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 805, in get_variable
    ht_partition_num=ht_partition_num)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 697, in _true_getter
    ht_partition_num=ht_partition_num)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 930, in _get_partitioned_variable
    partitions = _call_partitioner(partitioner, shape, dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 3237, in _call_partitioner
    slicing = partitioner(shape=shape, dtype=dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/partitioned_variables.py", line 205, in _partitioner
    if dtype.base_dtype == dtypes.string:
AttributeError: type object 'float' has no attribute 'base_dtype'
Traceback (most recent call last):
  File "train.py", line 1147, in <module>
    main()
  File "train.py", line 903, in main
    dense_layer_partitioner=dense_layer_partitioner)
  File "train.py", line 157, in __init__
    self._create_model()
  File "train.py", line 464, in _create_model
    uid_emb, item_emb, his_item_emb, noclk_his_item_emb, sequence_length = self._embedding_input_layer(
  File "train.py", line 423, in _embedding_input_layer
    item_embedding_var)
  File "train.py", line 344, in _get_embedding_input
    sparse_weights=sparse_tensors_weights)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/embedding_ops.py", line 1275, in safe_embedding_lookup_sparse
    if not (isinstance(w, resource_variable_ops.ResourceVariable) and dtype in (None, w.dtype)):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 473, in dtype
    return self._dtype
AttributeError: 'MultiHashVariable' object has no attribute '_dtype'

How to use multi level embedding?

I read https://mp.weixin.qq.com/s/aEi6ooG9wDL-GXVWcGWRCw and found that DeepRec supports multi-level embedding, which can place features in HBM or DRAM according to their hotness. It sounds like a very useful feature.

Then I read https://deeprec.readthedocs.io/zh/latest/Embedding-Variable.html but could not find how to use multi-level embedding.

My questions are:

  1. How do I use multi-level embedding?

  2. If the documentation is not available yet, could you give me a config name or variable name as a clue? Then I can find the related source code myself.

Thanks.

DeepRec support for multiple evaluators

Background

At present, DeepRec cannot evaluate very large models on a single node. Multiple ps nodes are required to load a large model, and multiple workers are needed for distributed evaluation. Supporting this would let DeepRec cover more scenarios.

Implementation ideas

Unlike training, evaluation does not modify the network structure to improve model accuracy. The concern is instead how to raise evaluation throughput and reduce evaluation latency. DeepRec already supports distributed training, and evaluation is simpler than training because no updates to the ps nodes are involved. In the code, DeepRec first decides, based on its parameters, whether to initialize the cluster and how to initialize it.

Two modes of distributed multi-evaluator evaluation need to be implemented.
1. Mode 1 contains ps, worker and evaluator nodes. DeepRec already handles a single evaluator in this mode; we need to support multiple evaluators. One idea is to add multiple evaluators directly to the initialization list of the distributed cluster in DeepRec, or to use the tf.distribute.Strategy interface.
2. Mode 2 has only ps and evaluator nodes. Unlike mode 1, no training is needed: the offline model that has already been trained is loaded into the ps nodes and evaluated directly.
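The two modes can be sketched as TF_CONFIG-style cluster descriptions. This is only an illustration of the proposal, not an existing DeepRec feature: the host names are hypothetical, `make_tf_config` is a helper invented for this sketch, and listing several evaluator addresses in the cluster is exactly the part that would still need to be implemented.

```python
import json


def make_tf_config(cluster, task_type, task_index):
    """Build a TF_CONFIG-style JSON string for one task of the cluster."""
    if task_type not in cluster:
        raise ValueError("unknown task type: %s" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index out of range for %s" % task_type)
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Mode 1: ps, worker and several evaluator nodes (hypothetical hosts).
mode1_cluster = {
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
    "evaluator": ["eval0:2222", "eval1:2222"],
}

# Mode 2: only ps and evaluator nodes; the trained model is loaded into ps.
mode2_cluster = {
    "ps": ["ps0:2222", "ps1:2222"],
    "evaluator": ["eval0:2222", "eval1:2222"],
}

# Each process would export its own string as the TF_CONFIG environment variable.
mode1_eval1 = make_tf_config(mode1_cluster, "evaluator", 1)
mode2_eval0 = make_tf_config(mode2_cluster, "evaluator", 0)
print(mode1_eval1)
print(mode2_eval0)
```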

Issue when saving a checkpoint

First we train a baseline model, then we restore its parameters and continue training. The code we use to restore the parameters is as follows.

    import tensorflow as tf
    from tensorflow.python.framework import ops  # used by restore_variables below

    vars_to_warm_start = ['^((?!Adam)(?!pos_dense).)*$']
    variables = self.restore_variables()
    restorer = tf.compat.v1.train.Saver(var_list=variables, max_to_keep=1)
    restorer.restore(session, base_checkpoint_path)
    saver = tf.compat.v1.train.Saver(max_to_keep=1)

    def restore_variables(self):
        list_of_vars = None
        if 'vars_to_warm_start' in _Hyperparams:
            vars_to_warm_start = _Hyperparams['vars_to_warm_start']
            if isinstance(vars_to_warm_start, str) or vars_to_warm_start is None:
                # Both vars_to_warm_start = '.*' and vars_to_warm_start = None will match
                # everything (in GLOBAL_VARIABLES) here.
                self.logger.info("Warm-starting variables only in GLOBAL_VARIABLES.")
                list_of_vars = ops.get_collection(
                    ops.GraphKeys.GLOBAL_VARIABLES, scope=vars_to_warm_start)
                self.logger.info('Loading base model variables: {}'.format(list_of_vars))
                saveable_objects = tf.get_collection(tf.GraphKeys.SAVEABLE_OBJECTS,
                                                               scope=vars_to_warm_start)
                self.logger.info('Loading saveable variables: {}'.format(saveable_objects))
                list_of_vars += saveable_objects
            elif isinstance(vars_to_warm_start, list):
                if all(isinstance(v, str) for v in vars_to_warm_start):
                    self.logger.info("Warm-starting partial variables in GLOBAL_VARIABLES.")
                    list_of_vars = []
                    saveable_objects = []
                    for v in vars_to_warm_start:
                        list_of_vars += ops.get_collection(
                            ops.GraphKeys.GLOBAL_VARIABLES, scope=v)
                        saveable_objects += tf.get_collection(tf.GraphKeys.SAVEABLE_OBJECTS,
                                                                        scope=v)
                    self.logger.info('Loading base model variables: {}'.format(list_of_vars))
                    self.logger.info('Loading saveable variables: {}'.format(saveable_objects))
                    list_of_vars += saveable_objects
        return list_of_vars
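To see which variables the warm-start pattern selects, the filter can be tried on plain strings. The sketch assumes that the scope argument of get_collection is applied with `re.match` against each item's name, which is how TensorFlow 1.x documents it. Apart from the imei embedding names taken from the core dump, the variable names are hypothetical.

```python
import re

# Pattern from vars_to_warm_start: reject any name containing
# "Adam" or "pos_dense" anywhere in the string.
pattern = '^((?!Adam)(?!pos_dense).)*$'

names = [
    "feature_processing/imei_embedding/embedding_weights",
    "feature_processing/imei_embedding/embedding_weights/Adam",
    "feature_processing/imei_embedding/embedding_weights/Adam_1",
    "pos_dense/kernel",  # hypothetical name
    "dense/kernel",      # hypothetical name
]

# Keep only the names the pattern matches from the start of the string.
selected = [name for name in names if re.match(pattern, name)]
for name in selected:
    print(name)
```

With this pattern the Adam slot variables are excluded from the restore list, so they would not be read from the baseline checkpoint.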

We enable GlobalStepEvict for the imei feature in both stages.
If GlobalStepEvict is enabled when restoring the baseline model, saving a checkpoint through saver fails. The core dump shows:

tensorflow::SaveV2::Compute (this=0x7f8fd20bdec0, context=<optimized out>) at 
tensorflow/core/kernels/save_restore_v2_ops.cc:177
tensor_name = "feature_processing/imei_embedding/embedding_weights/Adam"

The problem appears to be in saving the Adam parameters.
If we restore only tf.trainable_variables(), the checkpoint is saved successfully. It fails when we restore tf.global_variables(), which includes the Adam parameters.

If GlobalStepEvict is disabled when restoring the baseline model, training runs normally, but loss and AUC are poor.

[grpc++] env_->rendezvous_mgr->RecvLocalAsync failed, error msg is: [_Derived_]End of sequence

System information

  • Have I written custom code :
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • TensorFlow installed from (source or binary): DeepRec
  • TensorFlow version : tf1.15
  • Python version: python3.6

When I used grpc++ with Estimator, I got the following error, but training still continued. I don't know whether this is OK.

[image: screenshot of the error]

config = tf.estimator.RunConfig(
    save_checkpoints_secs=10 * 60,
    keep_checkpoint_max=2,
    protocol='grpc++'
)
model = tf.estimator.Estimator(
    model_fn=model_fn,
    params=model_params,
    model_dir=checkpoint,
    config=config
)
eval_spec = tf.estimator.EvalSpec(...)
train_spec = tf.estimator.TrainSpec(...)
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

The DeepRec documentation suggests there is a problem with the original Estimator, but my Bazel build failed and I don't know what the Estimator check looks like when using grpc++. In the latest DeepRec version, do we need to install Estimator separately?

Using get_dynamic_dimension_embedding_variable crashes DeepRec

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Ubuntu 20.04.3 LTS
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    None
  • TensorFlow installed from (source or binary):
    build from source, commit = "295b752898fe3ebf23e235bf25ccf3f1621373bf"
  • TensorFlow version (use command below):
    r1.15.5-deeprec2201-31-g295b752898f 1.15.5
  • Python version:
    Python 3.6.13 :: Anaconda
  • Bazel version (if compiling from source):
    Build label: 0.24.1
  • GCC/Compiler version (if compiling from source):
    gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
  • CUDA/cuDNN version:
    Cuda compilation tools, release 11.4, V11.4.120 / CUDNN_VERSION=v8.2.4.15
  • GPU model and memory:
    GPU Tesla T4 16G

Describe the current behavior
When using the get_dynamic_dimension_embedding_variable function provided by DeepRec, the process prints the lookup result and then crashes with "Segmentation fault (core dumped)". It may be hitting a kernel error.

Describe the expected behavior
Return the correct value.

Code to reproduce the issue

import tensorflow as tf

EMBEDDING_DIM = 10

var = tf.get_dynamic_dimension_embedding_variable("uid_embedding_var",
                                                  embedding_block_dimension=EMBEDDING_DIM / 2,
                                                  embedding_block_num=4)

ids = [21, 34, 78, 99, 56]
blocknums = [4, 1, 4, 3, 1]

emb = tf.nn.embedding_lookup(var, tf.cast(ids, tf.int64), blocknums=blocknums)

init = tf.global_variables_initializer()

sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
with tf.Session(config=sess_config) as sess:
  sess.run([init])
  print(sess.run([emb]))

Other info / logs
Here are the logs:

$ python dynamic_dimension_embedding_variable_test1.py
2022-02-24 15:40:16.850638: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2022-02-24 15:40:16.851615: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555e87ed6910 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-02-24 15:40:16.851639: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2022-02-24 15:40:16.854765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-02-24 15:40:17.557625: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.558416: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555e881cf050 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-02-24 15:40:17.558443: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2022-02-24 15:40:17.558719: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.559381: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1599] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:08.0
2022-02-24 15:40:17.559760: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-24 15:40:17.566657: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-02-24 15:40:17.570319: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-02-24 15:40:17.570728: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-02-24 15:40:17.571501: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2022-02-24 15:40:17.572976: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-02-24 15:40:17.573214: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-02-24 15:40:17.573357: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.574064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.574719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1727] Adding visible gpu devices: 0
2022-02-24 15:40:17.574771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-24 15:40:17.575946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1139] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-24 15:40:17.575961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1145]      0
2022-02-24 15:40:17.575971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1158] 0:   N
2022-02-24 15:40:17.576139: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.576818: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.577568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1284] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13945 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:08.0, compute capability: 7.5)
[array([[ 0.6678627 , -0.38343927, -0.65460324,  0.15888363,  0.710193  ,
         0.640184  ,  0.42595282,  0.7293787 ,  1.3838437 ,  0.27501038,
        -0.96244717, -0.5522712 , -0.46999097,  0.45904443, -0.35207814,
         0.39496022, -1.106673  ,  0.21438211, -1.1451356 ,  0.9796604 ],
       [-0.14699893,  0.07010368,  0.22612067, -1.9068893 , -0.44930258,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [-0.36353526,  0.36128962,  0.14200972,  0.07810795, -0.54961   ,
        -0.15781127, -0.64423895,  0.97612906, -0.96893233,  0.8196201 ,
        -0.7367647 , -0.94786507,  1.1452298 ,  1.0325592 ,  0.46815377,
        -0.4092801 , -0.5371794 , -1.2808001 , -1.057108  , -0.7823616 ],
       [-0.88329375, -1.5494045 , -0.4070856 , -1.8068027 , -0.8884988 ,
         0.3828017 , -1.0075641 , -1.4119419 , -0.16102602,  0.7351839 ,
         1.483396  ,  0.6105891 , -0.23226756,  1.6206956 ,  0.06422351,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.7161362 , -0.737407  , -0.8979032 ,  1.1798211 ,  0.37206918,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ]],
      dtype=float32)]
Segmentation fault (core dumped)
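The printed array suggests that the lookup itself returned plausible values before the crash: each row has 20 columns (4 blocks of 5), and only the first blocknum * 5 entries are non-zero. The sketch below reproduces that pattern in plain Python. It is inferred from the output above, not taken from the DeepRec implementation, and `block_mask` is a hypothetical helper.

```python
def block_mask(block_num, block_dim=5, max_blocks=4):
    """Return a 0/1 mask with the first block_num * block_dim entries set."""
    if not 0 <= block_num <= max_blocks:
        raise ValueError("block_num must be between 0 and max_blocks")
    active = block_num * block_dim
    return [1] * active + [0] * (max_blocks * block_dim - active)


# blocknums from the reproduction script, one entry per looked-up id.
blocknums = [4, 1, 4, 3, 1]
nonzero_counts = [sum(block_mask(b)) for b in blocknums]
print(nonzero_counts)  # [20, 5, 20, 15, 5]
```

These counts agree with the rows in the log, which points to the segmentation fault happening after the lookup, for example during session teardown, rather than in the lookup result.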

[Auto Micro Batch] AUC is unstable when auto_micro_batch is enabled in a DIN model implemented with deepctr

AUC is unstable when auto_micro_batch is enabled in a DIN model implemented with deepctr.

Deeprec Info

Built from source at commit 31f83623dde1a1d3792d7f41ba310b29e40abaa7, released as r1.15.5-deeprec2204.

Description

Everything works with the default DeepRec environment: AUC is around 0.716 across multiple experiments. However, with the Auto Micro Batch feature enabled, AUC fluctuates in the range [0.71, 0.74] and training is slower.

Below is the code skeleton.


import tensorflow as tf
import horovod.tensorflow as hvd

class DIN:
    # implemented based on [deepctr](https://github.com/shenweichen/DeepCTR)
    pass

def prepareDataSet(data_path, batch_size):
    # parsed by tf.data.Dataset with prefetch
    pass

def create_model(data_path='.', batch_size=512, learning_rate=0.01):
    parsed_dataset = prepareDataSet(data_path, batch_size)
    iterator = parsed_dataset.make_one_shot_iterator()
    input_features, label = iterator.get_next()
    label = tf.reshape(label, [-1, 1])

    output = DIN(input_features)

    optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate * hvd.size(), initial_accumulator_value=1e-30)
    optimizer = hvd.DistributedOptimizer(optimizer)

    loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)(label, output)
    global_step = tf.train.get_or_create_global_step()
    train_op = optimizer.minimize(loss, global_step=global_step)
    _, auc = tf.metrics.auc(label, output)

    return train_op, auc

def create_sess_config(deeprec_auto_micro_batch):
    sess_config = tf.ConfigProto()
    sess_config.gpu_options.allow_growth = False
    sess_config.gpu_options.visible_device_list = str(hvd.local_rank())

    if deeprec_auto_micro_batch:
        sess_config.graph_options.optimizer_options.micro_batch_num = 2

    return sess_config

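def train(deeprec_auto_micro_batch):
    # Use a batch size of 512 with Auto Micro Batch, 1024 without.
    batch_size = 512 if deeprec_auto_micro_batch else 1024
    train_op, auc = create_model(batch_size=batch_size)

    # Pass the flag through instead of hard-coding True.
    sess_config = create_sess_config(deeprec_auto_micro_batch=deeprec_auto_micro_batch)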
    hooks = [
        hvd.BroadcastGlobalVariablesHook(0),
    ]
    with tf.train.MonitoredTrainingSession(hooks=hooks,
                                           config=sess_config) as mon_sess:
        fetches = {
            "train_op": train_op,
            'auc': auc,
        }

        while not mon_sess.should_stop():
            results = mon_sess.run(fetches)
            print(results['auc'])


if __name__ == "__main__":
    hvd.init()

    deeprec_auto_micro_batch = True
    train(deeprec_auto_micro_batch)
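The skeleton uses a batch size of 512 when micro_batch_num is 2 and 1024 otherwise, which suggests the intent is to keep the number of samples per training step the same in both runs. A small sketch of that arithmetic, assuming (the issue does not confirm this) that each step consumes batch_size samples per micro batch. `samples_per_step` is a hypothetical helper.

```python
def samples_per_step(batch_size, micro_batch_num=1):
    """Samples consumed in one training step under the stated assumption."""
    if batch_size <= 0 or micro_batch_num <= 0:
        raise ValueError("batch_size and micro_batch_num must be positive")
    return batch_size * micro_batch_num


baseline = samples_per_step(1024)                        # Auto Micro Batch off
micro_batched = samples_per_step(512, micro_batch_num=2)  # Auto Micro Batch on
print(baseline, micro_batched)
```

If the two values are equal, the AUC comparison between the runs is at least made on the same effective batch size, so the fluctuation would have to come from elsewhere.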

[Auto Graph Fusion][Modelzoo] Error after enabling the Auto Graph Fusion feature in modelzoo's WDL

I want to enable the Auto Graph Fusion feature in WDL. I followed the steps in the DeepRec docs but got an error.

Code to reproduce the issue
I use the following code to enable Auto Graph Fusion. For the complete script, see Full code.

        if args.op_fusion and not args.tf:
            '''Auto Graph Fusion'''
            sess_config.graph_options.optimizer_options.do_op_fusion = True

Running python train.py --steps 1000 --no_eval --op_fusion True reproduces the error. Use the WDL dataset.

logs

INFO:tensorflow:Parsing ./data/train.csv
INFO:tensorflow:Parsing ./data/eval.csv
INFO:tensorflow:Graph was finalized.
2022-03-22 14:10:30.688360: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2022-03-22 14:10:30.707518: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5622edbbebf0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-22 14:10:30.707558: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:run without loading checkpoint
2022-03-22 14:10:30.786850: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_else_const head/gradients/head/loss/xentropy/Select_grad/zeros_like
2022-03-22 14:10:30.787074: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/head/loss/xentropy/Select_grad/Select_1
2022-03-22 14:10:30.787248: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_else_const head/gradients_1/head/loss/xentropy/Select_grad/zeros_like
2022-03-22 14:10:30.787437: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/head/loss/xentropy/Select_grad/Select_1
2022-03-22 14:10:30.787920: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C10_embedding/C10_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788067: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C10_embedding/C10_embedding_weights_grad/Select
2022-03-22 14:10:30.788089: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C11_embedding/C11_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788232: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C11_embedding/C11_embedding_weights_grad/Select
2022-03-22 14:10:30.788255: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C12_embedding/C12_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788394: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C12_embedding/C12_embedding_weights_grad/Select
2022-03-22 14:10:30.788415: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C13_embedding/C13_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788554: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C13_embedding/C13_embedding_weights_grad/Select
2022-03-22 14:10:30.788575: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C14_embedding/C14_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788714: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C14_embedding/C14_embedding_weights_grad/Select
2022-03-22 14:10:30.788735: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C15_embedding/C15_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788875: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C15_embedding/C15_embedding_weights_grad/Select
2022-03-22 14:10:30.788895: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C16_embedding/C16_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789049: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C16_embedding/C16_embedding_weights_grad/Select
2022-03-22 14:10:30.789071: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C17_embedding/C17_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789214: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C17_embedding/C17_embedding_weights_grad/Select
2022-03-22 14:10:30.789234: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C18_embedding/C18_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789374: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C18_embedding/C18_embedding_weights_grad/Select
2022-03-22 14:10:30.789395: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C19_embedding/C19_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789534: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C19_embedding/C19_embedding_weights_grad/Select
2022-03-22 14:10:30.789554: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C1_embedding/C1_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789693: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C1_embedding/C1_embedding_weights_grad/Select
2022-03-22 14:10:30.789715: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C20_embedding/C20_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789854: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C20_embedding/C20_embedding_weights_grad/Select
2022-03-22 14:10:30.789874: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C21_embedding/C21_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790014: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C21_embedding/C21_embedding_weights_grad/Select
2022-03-22 14:10:30.790035: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C22_embedding/C22_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790177: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C22_embedding/C22_embedding_weights_grad/Select
2022-03-22 14:10:30.790197: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C23_embedding/C23_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790338: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C23_embedding/C23_embedding_weights_grad/Select
2022-03-22 14:10:30.790359: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C24_embedding/C24_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790504: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C24_embedding/C24_embedding_weights_grad/Select
2022-03-22 14:10:30.790529: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C25_embedding/C25_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790687: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C25_embedding/C25_embedding_weights_grad/Select
2022-03-22 14:10:30.790708: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C26_embedding/C26_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790851: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C26_embedding/C26_embedding_weights_grad/Select
2022-03-22 14:10:30.790872: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C2_embedding/C2_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791015: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C2_embedding/C2_embedding_weights_grad/Select
2022-03-22 14:10:30.791036: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C3_embedding/C3_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791183: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C3_embedding/C3_embedding_weights_grad/Select
2022-03-22 14:10:30.791204: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C4_embedding/C4_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791348: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C4_embedding/C4_embedding_weights_grad/Select
2022-03-22 14:10:30.791369: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C5_embedding/C5_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791513: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C5_embedding/C5_embedding_weights_grad/Select
2022-03-22 14:10:30.791544: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C6_embedding/C6_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791686: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C6_embedding/C6_embedding_weights_grad/Select
2022-03-22 14:10:30.791706: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C7_embedding/C7_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791848: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C7_embedding/C7_embedding_weights_grad/Select
2022-03-22 14:10:30.791874: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C8_embedding/C8_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.792021: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C8_embedding/C8_embedding_weights_grad/Select
2022-03-22 14:10:30.792042: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C9_embedding/C9_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.792188: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C9_embedding/C9_embedding_weights_grad/Select
2022-03-22 14:10:30.792258: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C1/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.792436: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C1/weighted_sum_grad/Select
2022-03-22 14:10:30.792456: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C10/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.792632: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C10/weighted_sum_grad/Select
2022-03-22 14:10:30.792654: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C11/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.792827: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C11/weighted_sum_grad/Select
2022-03-22 14:10:30.792848: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C12/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793022: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C12/weighted_sum_grad/Select
2022-03-22 14:10:30.793043: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C13/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793221: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C13/weighted_sum_grad/Select
2022-03-22 14:10:30.793243: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C14/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793417: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C14/weighted_sum_grad/Select
2022-03-22 14:10:30.793439: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C15/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793612: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C15/weighted_sum_grad/Select
2022-03-22 14:10:30.793633: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C16/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793807: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C16/weighted_sum_grad/Select
2022-03-22 14:10:30.793829: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C17/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794006: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C17/weighted_sum_grad/Select
2022-03-22 14:10:30.794027: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C18/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794205: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C18/weighted_sum_grad/Select
2022-03-22 14:10:30.794227: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C19/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794401: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C19/weighted_sum_grad/Select
2022-03-22 14:10:30.794422: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C2/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794596: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C2/weighted_sum_grad/Select
2022-03-22 14:10:30.794618: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C20/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794793: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C20/weighted_sum_grad/Select
2022-03-22 14:10:30.794813: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C21/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794988: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C21/weighted_sum_grad/Select
2022-03-22 14:10:30.795011: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C22/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795189: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C22/weighted_sum_grad/Select
2022-03-22 14:10:30.795210: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C23/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795385: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C23/weighted_sum_grad/Select
2022-03-22 14:10:30.795407: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C24/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795582: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C24/weighted_sum_grad/Select
2022-03-22 14:10:30.795602: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C25/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795778: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C25/weighted_sum_grad/Select
2022-03-22 14:10:30.795799: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C26/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795974: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C26/weighted_sum_grad/Select
2022-03-22 14:10:30.795999: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C3/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796178: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C3/weighted_sum_grad/Select
2022-03-22 14:10:30.796200: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C4/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796375: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C4/weighted_sum_grad/Select
2022-03-22 14:10:30.796396: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C5/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796572: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C5/weighted_sum_grad/Select
2022-03-22 14:10:30.796593: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C6/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796768: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C6/weighted_sum_grad/Select
2022-03-22 14:10:30.796790: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C7/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796966: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C7/weighted_sum_grad/Select
2022-03-22 14:10:30.796991: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C8/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.797170: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C8/weighted_sum_grad/Select
2022-03-22 14:10:30.797192: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C9/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.797368: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C9/weighted_sum_grad/Select
2022-03-22 14:10:30.797492: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C10_embedding/C10_embedding_weights]
2022-03-22 14:10:30.797542: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C11_embedding/C11_embedding_weights]
2022-03-22 14:10:30.797564: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C12_embedding/C12_embedding_weights]
2022-03-22 14:10:30.797583: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C13_embedding/C13_embedding_weights]
2022-03-22 14:10:30.797602: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C14_embedding/C14_embedding_weights]
2022-03-22 14:10:30.797621: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C15_embedding/C15_embedding_weights]
2022-03-22 14:10:30.797645: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C16_embedding/C16_embedding_weights]
2022-03-22 14:10:30.797664: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C17_embedding/C17_embedding_weights]
2022-03-22 14:10:30.797682: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C18_embedding/C18_embedding_weights]
2022-03-22 14:10:30.797701: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C19_embedding/C19_embedding_weights]
2022-03-22 14:10:30.797720: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C1_embedding/C1_embedding_weights]
2022-03-22 14:10:30.797739: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C20_embedding/C20_embedding_weights]
2022-03-22 14:10:30.797757: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C21_embedding/C21_embedding_weights]
2022-03-22 14:10:30.797776: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C22_embedding/C22_embedding_weights]
2022-03-22 14:10:30.797794: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C23_embedding/C23_embedding_weights]
2022-03-22 14:10:30.797813: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C24_embedding/C24_embedding_weights]
2022-03-22 14:10:30.797832: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C25_embedding/C25_embedding_weights]
2022-03-22 14:10:30.797850: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C26_embedding/C26_embedding_weights]
2022-03-22 14:10:30.797868: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C2_embedding/C2_embedding_weights]
2022-03-22 14:10:30.797886: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C3_embedding/C3_embedding_weights]
2022-03-22 14:10:30.797905: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C4_embedding/C4_embedding_weights]
2022-03-22 14:10:30.797923: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C5_embedding/C5_embedding_weights]
2022-03-22 14:10:30.797940: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C6_embedding/C6_embedding_weights]
2022-03-22 14:10:30.797958: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C7_embedding/C7_embedding_weights]
2022-03-22 14:10:30.797976: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C8_embedding/C8_embedding_weights]
2022-03-22 14:10:30.797994: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C9_embedding/C9_embedding_weights]
2022-03-22 14:10:30.798035: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C1/weighted_sum]
2022-03-22 14:10:30.798054: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C10/weighted_sum]
2022-03-22 14:10:30.798072: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C11/weighted_sum]
2022-03-22 14:10:30.798091: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C12/weighted_sum]
2022-03-22 14:10:30.798113: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C13/weighted_sum]
2022-03-22 14:10:30.798132: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C14/weighted_sum]
2022-03-22 14:10:30.798150: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C15/weighted_sum]
2022-03-22 14:10:30.798169: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C16/weighted_sum]
2022-03-22 14:10:30.798188: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C17/weighted_sum]
2022-03-22 14:10:30.798206: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C18/weighted_sum]
2022-03-22 14:10:30.798223: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C19/weighted_sum]
2022-03-22 14:10:30.798242: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C2/weighted_sum]
2022-03-22 14:10:30.798261: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C20/weighted_sum]
2022-03-22 14:10:30.798279: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C21/weighted_sum]
2022-03-22 14:10:30.798297: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C22/weighted_sum]
2022-03-22 14:10:30.798316: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C23/weighted_sum]
2022-03-22 14:10:30.798334: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C24/weighted_sum]
2022-03-22 14:10:30.798353: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C25/weighted_sum]
2022-03-22 14:10:30.798371: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C26/weighted_sum]
2022-03-22 14:10:30.798389: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C3/weighted_sum]
2022-03-22 14:10:30.798408: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C4/weighted_sum]
2022-03-22 14:10:30.798426: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C5/weighted_sum]
2022-03-22 14:10:30.798445: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C6/weighted_sum]
2022-03-22 14:10:30.798467: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C7/weighted_sum]
2022-03-22 14:10:30.798485: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C8/weighted_sum]
2022-03-22 14:10:30.798503: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C9/weighted_sum]
2022-03-22 14:10:30.798945: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar] match op[head/loss/xentropy/Select]
2022-03-22 14:10:30.799421: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_grad/Select]
2022-03-22 14:10:30.799453: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select]
2022-03-22 14:10:30.799528: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients_1/head/loss/xentropy/Select_grad/Select]
2022-03-22 14:10:30.799546: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients_1/head/loss/xentropy/Select_1_grad/Select]
2022-03-22 14:10:30.799958: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select_1]
2022-03-22 14:10:30.799996: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C10_embedding/C10_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800013: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C11_embedding/C11_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800028: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C12_embedding/C12_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800043: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C13_embedding/C13_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800057: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C14_embedding/C14_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800072: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C15_embedding/C15_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800087: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C16_embedding/C16_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800101: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C17_embedding/C17_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800120: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C18_embedding/C18_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800136: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C19_embedding/C19_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800155: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C1_embedding/C1_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800169: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C20_embedding/C20_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800184: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C21_embedding/C21_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800199: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C22_embedding/C22_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800213: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C23_embedding/C23_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800228: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C24_embedding/C24_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800242: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C25_embedding/C25_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800256: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C26_embedding/C26_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800271: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C2_embedding/C2_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800286: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C3_embedding/C3_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800300: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C4_embedding/C4_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800314: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C5_embedding/C5_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800329: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C6_embedding/C6_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800344: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C7_embedding/C7_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800358: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C8_embedding/C8_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800373: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C9_embedding/C9_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800428: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/head/loss/xentropy/Select_1_grad/Select_1]
2022-03-22 14:10:30.800455: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C1/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800471: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C10/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800493: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C11/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800508: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C12/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800523: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C13/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800538: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C14/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800552: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C15/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800567: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C16/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800582: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C17/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800596: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C18/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800610: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C19/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800625: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C2/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800639: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C20/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800654: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C21/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800670: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C22/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800685: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C23/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800699: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C24/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800714: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C25/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800728: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C26/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800745: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C3/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800760: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C4/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800774: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C5/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800790: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C6/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800804: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C7/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800819: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C8/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800833: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C9/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.919577: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:10:30.941287: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
2022-03-22 14:11:15.303107: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:11:15.324636: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
INFO:tensorflow:Running local_init_op.
2022-03-22 14:11:23.962906: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:11:23.985236: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
INFO:tensorflow:Done running local_init_op.
2022-03-22 14:11:32.103417: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:11:32.125529: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
2022-03-22 14:11:40.871891: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:11:40.894088: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
Using TensorFlow version 1.15.5
Checking dataset...
Numbers of training dataset is 8000000
Numbers of test dataset is 2000000
The training steps is 100
The testing steps is 3907
Saving model checkpoints to ./result/model_WIDE_AND_DEEP_1647929426
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected begin[0] == 0 (got 1) and size[0] == 0 (got -1) when input.dim_size(0) == 0
	 [[{{node linear/linear_model_1/linear_model/C20/weighted_sum/Slice_2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_rebuild.py", line 737, in <module>
    main()
  File "train_rebuild.py", line 537, in main
    checkpoint_dir, tf_config, server)
  File "train_rebuild.py", line 414, in train
    sess.run([model.loss, model.train_op])
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
    raise six.reraise(*original_exc_info)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
    return self._sess.run(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
    return self._sess.run(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected begin[0] == 0 (got 1) and size[0] == 0 (got -1) when input.dim_size(0) == 0
	 [[node linear/linear_model_1/linear_model/C20/weighted_sum/Slice_2 (defined at /home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'linear/linear_model_1/linear_model/C20/weighted_sum/Slice_2':
  File "train_rebuild.py", line 737, in <module>
    main()
  File "train_rebuild.py", line 517, in main
    dense_layer_partitioner=dense_layer_partitioner)
  File "train_rebuild.py", line 116, in __init__
    self._create_model()
  File "train_rebuild.py", line 187, in _create_model
    trainable=True)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 504, in linear_model
    retval = linear_model_layer(features)  # pylint: disable=not-callable
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 871, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 696, in call
    weighted_sum = layer(builder)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 564, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 915, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 588, in call
    weight_var=self._weight_var)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 1938, in _create_weighted_sum
    weight_var=weight_var)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 2081, in _create_categorical_column_weighted_sum
    name='weighted_sum')
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 1338, in safe_embedding_lookup_sparse
    array_ops.slice(array_ops.shape(result), [1], [-1])
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 855, in slice
    return gen_array_ops._slice(input_, begin, size, name=name)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 9272, in _slice
    "Slice", input=input, begin=begin, size=size, name=name)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
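
The error message above comes from the Slice node created by array_ops.slice(array_ops.shape(result), [1], [-1]). The message suggests that the shape tensor being sliced had zero elements at run time, so a begin of 1 could not be satisfied. The sketch below is a plain-Python imitation of that validation, written only from the error text. The function name validate_slice and the order of the checks are assumptions for illustration, not TensorFlow's actual kernel code.

```python
def validate_slice(dim_sizes, begin, size):
    """Illustrative re-creation of the slice bounds check (hypothetical helper).

    Assumes that size == -1 is resolved to "everything from begin to the end"
    before the bounds are validated.
    """
    resolved = []
    for i, (dim, b, s) in enumerate(zip(dim_sizes, begin, size)):
        if s == -1:
            s = dim - b  # "to the end" relative to begin
        if dim == 0:
            # An empty dimension only admits the empty slice starting at 0.
            if not (b == 0 and s == 0):
                raise ValueError(
                    "Expected begin[%d] == 0 (got %d) and size[%d] == 0 (got %d) "
                    "when input.dim_size(%d) == 0" % (i, b, i, s, i))
        elif b < 0 or s < 0 or b + s > dim:
            raise ValueError("slice out of range in dimension %d" % i)
        resolved.append((b, s))
    return resolved


# The shape of a rank-2 tensor has 2 elements, so [1], [-1] is valid.
print(validate_slice([2], [1], [-1]))

# A shape tensor with 0 elements reproduces the reported message.
try:
    validate_slice([0], [1], [-1])
except ValueError as err:
    print(err)
```

If this reading is right, the thing to investigate is why the lookup result for C20 had no dimensions to slice, rather than the slice call itself.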

[BUILD] Building DeepRec with gcc 8.3 fails.

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: no
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: r1.15.5-deeprec2204u1
  • Python version:
  • Installed using virtualenv? pip? conda?:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source): gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)
  • CUDA/cuDNN version: CUDA 11.4
  • GPU model and memory:

Describe the problem

Building DeepRec fails with gcc 8.3.1 because the code triggers a compiler bug in gcc 8.3.1. The error is as follows:

unique_ali_op_ut.h:498:77: internal compiler error: in is_normal_capture_proxy, at cp/lambda.c:292

Provide the exact sequence of commands / steps that you executed before running into the problem

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.


[PMEM] It will abort when using PMEM allocator in EV

When the PMEM allocator is used in the WDL model, in either libpmem or memkind mode, the process aborts with:

./tensorflow/core/framework/embedding/value_ptr.h:273] Unsupport FreqCounter in subclass of ValuePtrBase
Aborted (core dumped)

Here is the call stack:
#3 0x00001464e19d0f4e in tensorflow::ValuePtr::AddFreq (this=)
at ./tensorflow/core/framework/embedding/value_ptr.h:273
#4 0x00001464e19d6566 in tensorflow::NullableFilter<long long, float, tensorflow::EmbeddingVar<long long, float> >::LookupOrCreateWithFreq (this=0x145fb0105c90, key=, val=0x14609c00cac0, default_value_ptr=)
at ./tensorflow/core/framework/embedding/embedding_filter.h:526
#5 0x00001464e19c35cc in std::function<void (long long, float*, float*)>::operator()(long long, float*, float*) const (
__args#2=, __args#1=, __args#0=, this=0x146100083cb8)
at /usr/include/c++/7/bits/std_function.h:706
#6 tensorflow::KvResourceGatherOp<long long, float>::Compute(tensorflow::OpKernelContext*)::{lambda(long long, long long)#4}::operator()(long long, long long) const (limit=4, start=, __closure=0x146100083c80)
at tensorflow/core/kernels/kv_variable_ops.cc:413
#7 std::_Function_handler<void (long long, long long), tensorflow::KvResourceGatherOp<long long, float>::Compute(tensorflow::OpKernelContext*)::{lambda(long long, long long)#4}>::_M_invoke(std::_Any_data const&, long long&&, std::_Any_data const&) (
__functor=..., __args#0=, __args#1=) at /usr/include/c++/7/bits/std_function.h:316
#8 0x00001464d9948f1e in std::_Function_handler<void (long, long), tensorflow::thread::ThreadPool::ParallelFor(long long, long long, std::function<void (long long, long long)>)::{lambda(long, long)#1}>::_M_invoke(std::_Any_data const&, long&&, std::_Any_data const&) () from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#9 0x00001464d994f48f in tensorflow::thread::ThreadPool::ParallelFor(long long, long long, std::function<void (long long, long long)>) () from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#10 0x00001464d971fb52 in tensorflow::Shard(int, tensorflow::thread::ThreadPool*, long long, long long, std::function<void (long long, long long)>) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#11 0x00001464e19dce74 in tensorflow::KvResourceGatherOp<long long, float>::Compute (this=0x560ce5050590, c=)
at tensorflow/core/kernels/kv_variable_ops.cc:427
#12 0x00001464d98766a6 in tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::BatchProcess(std::vector<tensorflow::PropagatorState::TaggedNode, std::allocatortensorflow::PropagatorState::TaggedNode >, int, long) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#13 0x00001464d9876a88 in tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::Process(tensorflow::PropagatorState::TaggedNode, long) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#14 0x00001464d9876b5f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::RunTask<tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::ScheduleReady(absl::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocatortensorflow::PropagatorState::TaggedNode >, tensorflow::PropagatorState::TaggedNodeReadyQueue)::{lambda()#1}>(tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::ScheduleReady(absl::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocatortensorflow::PropagatorState::TaggedNode >, tensorflow::PropagatorState::TaggedNodeReadyQueue)::{lambda()#1}&&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
() from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#15 0x00001464d994bb4f in std::_Function_handler<void (), Eigen::ThreadPoolTempltensorflow::thread::EigenEnvironment::ThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#16 0x00001464d9948f78 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#17 0x00001464d83a9ba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#18 0x0000146577a1a17a in start_thread () from /lib64/libpthread.so.0
#19 0x0000146576fbfdc3 in clone () from /lib64/libc.so.6

[ASoC 2022] DeepRec processor supports multiple languages.

Background

This is a basic topic of ASoC 2022, #231.

The DeepRec processor is developed in C++. Users have their own serving frameworks, which may be developed in different languages such as Java, Go, and C++. We need to provide integration examples in each of these languages so that users can connect to the DeepRec processor quickly.

Target

  1. Design and implement usage examples in multiple languages.
  2. Summarize best practices.

Difficulty

Basic

Mentor

@JackMoriarty [email protected]

Output Requirements

Proficiency in C++ and Python;
Get to know DeepRec;
Able to complete the development under the guidance of the mentor;
Have a certain understanding and interest in deep learning recommendation engines;

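Background

This is a basic topic of Alibaba Summer of Code 2022, #231.

Processor, the online serving module provided by DeepRec, is developed in C++. Users have their own serving frameworks, developed in different languages such as Java, Go, and C++, so we need to provide integration examples in the corresponding languages to help users connect to the DeepRec processor quickly.

Target

1) Implement integration examples in multiple languages.
2) Complete a best-practices document.

Difficulty

Basic

Mentor

@JackMoriarty [email protected]

Output Requirements

Proficiency in C++ and Python;
Able to become familiar with and understand the relevant code under the guidance of the mentor;
Get to know DeepRec;
Have some understanding of and interest in deep learning recommendation engines;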

Unsupport GlobalStep in subclass of ValuePtrBase

When we save a checkpoint, the following fatal error occurs: F ./tensorflow/core/framework/embedding/value_ptr.h:256] Unsupport GlobalStep in subclass of ValuePtrBase. I also noticed that the checkpoint exists only as a temporary file, best_checkpoint/best.data-00000-of-00001.tempstate11898667549733680686.

Build error on aarch64

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
  • Python version: 3.6
  • Installed using virtualenv? pip? conda?: pip
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 7.5
  • CUDA/cuDNN version: None
  • GPU model and memory: None

Describe the problem
Building from source fails on an arm64 server when running the command:

bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package


I only use the CPU version, and it builds successfully on an x86 machine, so the failure appears to be platform-related.
I could not find an ARM image either. Are there any plans to support ARM?

[ASoC 2022] DeepRec supports multiple evaluators.

Background

This is an advanced topic of ASoC 2022, #231.

At present, DeepRec cannot evaluate very large models that do not fit on a single node. Evaluating such models requires multiple parameter servers (PS) to load the model and multiple workers to run distributed evaluation.

Target

  1. Design and implement large-model evaluation, with the model loaded across multiple PS.
  2. Design and implement multiple evaluator nodes in one job.

Difficulty

Advanced

Mentor

@candyzone [email protected]

Output Requirements

Proficiency in C++ and Python;
Get to know DeepRec;
Able to complete the development under the guidance of the mentor;
Have a certain understanding and interest in deep learning recommendation engines;

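Background

This is an advanced topic of Alibaba Summer of Code 2022, #231.

DeepRec support for multi-evaluator evaluation: DeepRec currently cannot evaluate very large models (ones that cannot be loaded on a single node). Such models need to be loaded by multiple PS and evaluated in a distributed fashion using multiple workers.

Target

1) Support loading very large models across multiple PS to implement evaluation.
2) Support using multiple evaluator nodes for evaluation within a single job.

Difficulty

Advanced

Mentor

@candyzone [email protected]

Output Requirements

Proficiency in C++ and Python;
Able to become familiar with and understand the relevant code under the guidance of the mentor;
Get to know DeepRec;
Have some understanding of and interest in deep learning recommendation engines;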

[OSPP 2022] DeepRec supports exporting models to key-value NoSQL databases

Motivation
Currently, DeepRec supports exporting models to checkpoints, but when the model weight file is large, import and export performance suffers. Key-value NoSQL databases (such as LevelDB, Redis, and RocksDB) offer high performance, high scalability, and support for large data volumes. We add this feature to improve model import and export performance while meeting the storage needs of more users.

Design
To achieve better import and export performance, we add new ops, which avoid repeated reading and writing of model files to disk by directly manipulating the database, thus reducing time overhead.

The overall design can be divided into three parts.

The first part is the implementation of a generic interface for persisting key-value data in a database, which is used to support persistence in a key-value database.

The second part is to add an op implementation in the op kernel to import and export models. This op saves the Variable/EmbeddingVariable values in memory directly to the database through database calls or loads the models directly from the database.

The third part is to add the op in the process of building the graph.

In the traditional checkpoint saving method, the BundleEntryProto storage format is used to map entries to the file. In the database, we simplify this step by adding key-value mappings such as node key lists. In addition, in distributed training, the PS is responsible for parameter updates. Except for StringJoin, save/ShardedFilename/shard, and save/num_shards, the ops in the saving process are executed on the PS, so the model-saving process only needs to consider the PS side. When the data is too large, the save op can be placed on each device with the shared parameter, so the meta information from different devices needs to be merged to form a complete checkpoint, and we need to rewrite this process.
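
To make the first part of the design more concrete, here is a minimal sketch of what a generic key-value persistence interface with a node key list could look like. Everything in it is hypothetical: the names KVStore, InMemoryKVStore, NODE_KEYS, export_model, and import_model are invented for illustration, a dict stands in for LevelDB, Redis, or RocksDB, and values are stored as little-endian float32 bytes. It is not the proposed DeepRec op or its storage format.

```python
import json
import struct
from abc import ABC, abstractmethod


class KVStore(ABC):
    """Hypothetical generic interface; a real backend would wrap a database client."""

    @abstractmethod
    def put(self, key, value):
        """Store raw bytes under a string key."""

    @abstractmethod
    def get(self, key):
        """Return the raw bytes stored under a string key."""


class InMemoryKVStore(KVStore):
    """Stand-in backend that keeps everything in a dict."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


# Reserved key holding the list of saved node names (an assumption of this sketch).
NODE_KEYS = "__node_keys__"


def export_model(store, variables):
    """Write each variable as float32 bytes, then record the node key list."""
    for name, values in variables.items():
        store.put(name, struct.pack("<%df" % len(values), *values))
    store.put(NODE_KEYS, json.dumps(sorted(variables)).encode("utf-8"))


def import_model(store):
    """Read the node key list, then load every listed variable."""
    names = json.loads(store.get(NODE_KEYS).decode("utf-8"))
    restored = {}
    for name in names:
        raw = store.get(name)
        restored[name] = list(struct.unpack("<%df" % (len(raw) // 4), raw))
    return restored


store = InMemoryKVStore()
export_model(store, {"dense/kernel": [0.5, 1.0, -2.25], "dense/bias": [0.0]})
print(import_model(store))
```

Keeping the node key list under one reserved key means a reader can enumerate the saved variables without scanning the whole database, which is the role the key-value mappings play in the paragraph above.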

Additional
To make it easier for users to inspect parameters, we also plan to implement a file viewer that displays Variable/EmbeddingVariable values and supports searching them.

[SmartStaged] Prefetching was ignored since timeout.

After enabling SmartStaged in DSSM, DIN, and DIEN, something goes wrong: after a few minutes, the log shows "Prefetching was ignored since timeout." The whl package in the Docker image is built from commit 8db8689.
Code to reproduce the issue
The code and the DeepRec environment are provided in this Docker image:

docker pull cesg-prc-registry.cn-beijing.cr.aliyuncs.com/cesg-ali/deeprec-modelzoo:220412-8db8689

Run the Python training script for DSSM, DIN, or DIEN to reproduce the issue:

cd /root/modelzoo/$MODEL
python train.py --steps 1000 --emb_fusion false --smartstaged true

With --smartstaged set to false, the problem does not occur.

Other info / logs

INFO:tensorflow:Saving checkpoints for 0 into ./result/model_DSSM_1649818801/model.ckpt.
2022-04-13 11:00:07.986332: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
2022-04-13 11:00:08.391247: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:00:08.458564: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:00:10.967349: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:00:10.985142: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:00:10.995642: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:00:11.671530: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:00:11.677987: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
INFO:tensorflow:loss = 168.8434, steps = 1
2022-04-13 11:00:12.369923: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:00:14.866723: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:00:14.887854: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:00:14.897801: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200                                                                                                                     
2022-04-13 11:05:14.881883: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.882675: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.882799: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.884134: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.884259: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.884944: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.886524: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.888192: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.901689: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.901695: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.901945: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.902099: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.902229: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.902321: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.902579: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.907505: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.909300: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.913124: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.914027: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.914081: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.914123: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.914326: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.919999: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
^CKilled

Cannot get the value of embedding variables using NewCheckpointReader

from tensorflow.python import pywrap_tensorflow
reader = pywrap_tensorflow.NewCheckpointReader(latest_checkpoint)
var_to_shape_map = reader.get_variable_to_shape_map()
for key in var_to_shape_map:
    print(reader.get_tensor(key))

I want to export the values of the embedding variables. This works in nvtf, but in DeepRec the returned value is [], an empty list.
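To narrow down which variables come back empty, a small helper can record the element count of each tensor next to its reported shape. This is only a sketch: `summarize_checkpoint` and `empty_tensors` are hypothetical names, and the helper assumes nothing about the reader beyond the two methods already used above (`get_variable_to_shape_map` and `get_tensor`).

```python
def summarize_checkpoint(reader):
    """Return {name: (shape, element_count)} for every tensor the reader lists.

    `reader` is assumed to be the object returned by NewCheckpointReader,
    or anything else exposing the same two methods.
    """
    summary = {}
    for name, shape in sorted(reader.get_variable_to_shape_map().items()):
        value = reader.get_tensor(name)
        # NumPy arrays expose .size; fall back to len() for plain sequences.
        count = getattr(value, "size", None)
        if count is None:
            count = len(value)
        summary[name] = (list(shape), int(count))
    return summary


def empty_tensors(summary):
    """Names whose stored value has zero elements."""
    return [name for name, (_, count) in summary.items() if count == 0]
```

Calling `empty_tensors(summarize_checkpoint(reader))` on both the nvtf and the DeepRec checkpoint would show whether only the embedding variables are affected.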

[SmartStage] SmartStage has low performance on GPU.

Test environment
image
Performance comparison
image

[1] Invalid argument: Trying to access resource linear/linear_model/C1/weights/part_0 located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0
[2] 2022-06-07 09:49:01.768708: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at resource_variable_ops.cc:400 : Invalid argument: Trying to access resource linear/linear_model/C12/weights/part_0 located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0
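Both error lines follow the same pattern: a resource name, the device that holds it, and the device that tried to read it. A minimal sketch for extracting those three fields from such a line is shown here; the function name `parse_cross_device_access` and the field names are assumptions made for illustration, and the pattern is taken only from the two lines quoted above.

```python
import re

# Pattern copied from the wording of the two error lines in this report.
ACCESS_RE = re.compile(
    r"Trying to access resource (?P<resource>\S+) "
    r"located in device (?P<located>\S+) "
    r"from device (?P<requested>\S+)"
)


def parse_cross_device_access(line):
    """Return the resource and both device names, or None if the line does not match."""
    match = ACCESS_RE.search(line)
    return match.groupdict() if match else None
```

Running this over a full log makes it easy to list every variable that is placed on the CPU but read from the GPU.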

Incremental save fails with resource variables

System information

  • TensorFlow version (use command below): 1.12.2
  • Python version: 3.6

Describe the current behavior
Incremental save and restore fail when any resource variable is used.

Describe the expected behavior

Code to reproduce the issue

import tensorflow as tf
tf.Variable(0, use_resource=True)
saver = tf.train.Saver(
    save_relative_paths=True,
    incremental_save_restore=True,
)

Other info / logs

Traceback (most recent call last):
  File "iem_dlc/__main__.py", line 20, in <module>
    incremental_save_restore=True,
  File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1388, in __init__
    self.build()
  File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1404, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1486, in _build
    build_save=build_save, build_restore=build_restore)
  File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1053, in _build_internal
    save_tensor = self._AddSaveOps(filename_tensor, saveables)
  File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 511, in _AddSaveOps
    tensor_names.append(self._GetTensorNameAndIsSparse(spec, saveable)[0])
  File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 360, in _GetTensorNameAndIsSparse
    save_incr_sparse = saveable.op.op._is_sparse and self._incremental_include_normal_var
AttributeError: 'Operation' object has no attribute '_is_sparse'
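The traceback ends in an unguarded read of `_is_sparse` on an object that never had the attribute set. The sketch below reproduces that failure mode with stand-in classes and shows how a `getattr` default avoids the exception. `FakeOperation`, `FakeSparseOperation`, and `is_sparse_op` are hypothetical and are not TensorFlow or DeepRec code; whether defaulting to "not sparse" is the right fix inside `saver.py` is an assumption.

```python
class FakeOperation:
    """Stand-in for an op object on which `_is_sparse` was never set."""


class FakeSparseOperation:
    """Stand-in for an op object that was marked as sparse."""
    _is_sparse = True


def is_sparse_op(op):
    # Treat a missing attribute as "not sparse" instead of raising AttributeError.
    return bool(getattr(op, "_is_sparse", False))
```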

[MultiLevel EV] Core dump while using DRAM_SSDHASH storage type

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Centos7
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.15.5
  • Python version: 3.7.4
  • Bazel version (if compiling from source): 0.24.1
  • GCC/Compiler version (if compiling from source): 7.3.1
  • CUDA/cuDNN version: 11.2/8
  • GPU model and memory:

Describe the current behavior
When DRAM_SSDHASH is used as the storage_type in StorageOption, the process dumps core when SeekToFirst in SSDIterator is called.
image

Describe the expected behavior

Code to reproduce the issue

Other info / logs

'EmbeddingVariable' object has no attribute '_is_primary' when using import_meta_graph

Describe the current behavior

  File "/root/workspace/rec-rank-train/vmax/estimator/estimator_v2.py", line 122, in export_big_model
    self.estimator_core.export_big_model(server, checkpoint_path=checkpoint_path)
  File "/root/workspace/rec-rank-train/vmax/core/estimator_core_v2.py", line 415, in export_big_model
    tf.train.import_meta_graph(meta_graph_or_file='/tmp/saved_model/tmp.meta')
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1695, in import_meta_graph
    return _import_meta_graph_with_return_elements(meta_graph_or_file,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1723, in _import_meta_graph_with_return_elements
    saver = _create_saver_from_imported_meta_graph(meta_graph_def, import_scope,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1744, in _create_saver_from_imported_meta_graph
    return Saver()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1033, in __init__
    self.build()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1045, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1112, in _build
    self.saver_def = self._builder._build_internal(  # pylint: disable=protected-access
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 656, in _build_internal
    restore_op = self._AddRestoreOps(filename_tensor, saveables,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 491, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saving/saveable_object_util.py", line 185, in restore
    with ops.control_dependencies(None if self.var._is_primary else [self.var._primary.initializer]):
AttributeError: 'EmbeddingVariable' object has no attribute '_is_primary'

Code to reproduce the issue

  meta_graph_def = tf.train.export_meta_graph()
  meta_graph_def.meta_info_def.meta_graph_version = str(int(time.time()))
  self.logger.info('meta_graph_version = %s' %
                   meta_graph_def.meta_info_def.meta_graph_version)
  tf.reset_default_graph()
  tf.train.import_meta_graph(meta_graph_def)

and

  meta_graph_def = tf.train.export_meta_graph(filename='/tmp/saved_model/tmp.meta')
  meta_graph_def.meta_info_def.meta_graph_version = str(int(time.time()))
  self.logger.info('meta_graph_version = %s' %
                   meta_graph_def.meta_info_def.meta_graph_version)
  tf.reset_default_graph()
  tf.train.import_meta_graph(meta_graph_or_file='/tmp/saved_model/tmp.meta')

[SmartStaged][Modelzoo] Error after enabling the smartstaged feature in WDL of modelzoo

I want to enable the smartstaged feature in WDL. I followed the steps in the DeepRec docs, but I get an error.

Code to reproduce the issue
I use the following code to enable smartstaged. For the full code, see Full code.

        next_element = tf.staged(next_element, num_threads=8, capacity=40)
        sess_config.graph_options.optimizer_options.do_smart_stage = True
        hooks.append(tf.make_prefetch_hook())

Running python train.py --steps 1000 --smartstaged True reproduces the error. Use the WDL dataset.

logs

INFO:tensorflow:run without loading checkpoint
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_WIDE_AND_DEEP_1647592077/model.ckpt.
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
2022-03-18 16:28:14.639138: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:

Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
2022-03-18 16:28:14.783644: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-0:
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]

ERROR:tensorflow:Prefetching was cancelled unexpectedly:

Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
2022-03-18 16:28:14.871604: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-2:
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]

ERROR:tensorflow:Prefetching was cancelled unexpectedly:

Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
2022-03-18 16:28:14.975041: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-1:
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]

ERROR:tensorflow:Prefetching was cancelled unexpectedly:

Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
2022-03-18 16:28:15.079552: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-5:
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]

ERROR:tensorflow:Prefetching was cancelled unexpectedly:

Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
2022-03-18 16:28:15.183314: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-6:
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]

ERROR:tensorflow:Prefetching was cancelled unexpectedly:

Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
2022-03-18 16:28:15.288156: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-3:
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]

ERROR:tensorflow:Prefetching was cancelled unexpectedly:

Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
2022-03-18 16:28:15.391508: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-4:
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]

ERROR:tensorflow:Prefetching was cancelled unexpectedly:

Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
Exception in thread PrefetchThread-PrefetchRunner-7:
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]

INFO:tensorflow:loss = 0.6654865, steps = 1
INFO:tensorflow:Saving checkpoints for 1 into ./result/model_WIDE_AND_DEEP_1647592077/model.ckpt.
Using TensorFlow version 1.15.5
Checking dataset...
Numbers of training dataset is 8000000
Numbers of test dataset is 2000000
The training steps is 15625
The testing steps is 3907
Saving model checkpoints to ./result/model_WIDE_AND_DEEP_1647592077
Enable smart staged feature of DeepRec.
Traceback (most recent call last):
  File "train_rebuild.py", line 673, in <module>
    main()
  File "train_rebuild.py", line 495, in main
    checkpoint_dir, tf_config, server)
  File "train_rebuild.py", line 375, in train
    sess.run([model.loss, model.train_op])
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 911, in __exit__
    self._close_internal(exception_type)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 949, in _close_internal
    self._sess.close()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1216, in close
    self._sess.close()
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1384, in close
    ignore_live_threads=True)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 718, in reraise
    raise value.with_traceback(tb)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
    run_fetch(*feed)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
	; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
	 [[prefetch_2/DataBufferPut]]
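The log above repeats the same error once per prefetch thread. A short sketch for collapsing such a log into the set of failing threads and a count per failing node follows; `summarize_prefetch_errors` is a hypothetical helper, and its two patterns are taken only from the lines shown in this report.

```python
import re
from collections import Counter

# Patterns copied from the wording of the log lines in this report.
THREAD_RE = re.compile(r"Exception in thread (PrefetchThread-PrefetchRunner-\d+):")
NODE_RE = re.compile(r"NodeDef: \{\{node ([^}]+)\}\}")


def summarize_prefetch_errors(log_text):
    """Return (sorted failing thread names, Counter of node names in NodeDef lines)."""
    threads = sorted(
        set(THREAD_RE.findall(log_text)),
        key=lambda name: int(name.rsplit("-", 1)[1]),  # order by runner index
    )
    nodes = Counter(NODE_RE.findall(log_text))
    return threads, nodes
```

If every thread fails on the same node, as appears to be the case here, the summary reduces the log to a single root error.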

[ASoC 2022] Improve DeepRec ModelZoo.

Background

This is a basic subject of ASoC 2022 and #231.

DeepRec's ModelZoo contains 6 models. Currently there is only training code for them. Please add inference code for these models, optimize the inference performance, and summarize the performance results.

Target

  1. Design and implement inference code for the models in ModelZoo.
  2. Profile and optimize inference performance for the models in ModelZoo.

Difficulty

Basic

Mentor

@shanshanpt [email protected]

Output Requirements

Proficiency in C++ and Python;
Familiarity with DeepRec;
Ability to complete the development under the guidance of the mentor;
Some understanding of and interest in deep learning recommendation engines;

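Background

This is a basic topic of Alibaba Summer of Code 2022 (#231).

DeepRec's ModelZoo contains 6 models that currently cannot be exported as SavedModel, so training and inference cannot be connected directly. Please complete these models and test the full training-to-inference pipeline.

Target

1) Implement an inference use case for each of the 6 models in ModelZoo.
2) Optimize the inference performance of the ModelZoo models and document the performance results.

Difficulty

Basic

Mentor

@shanshanpt [email protected]

Output Requirements

Proficiency in C++ and Python;
Ability to become familiar with and understand the relevant code under the guidance of the mentor;
Familiarity with DeepRec;
Some understanding of and interest in deep learning recommendation engines;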
