deeprec-ai / deeprec
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted in incubation in LF AI & Data Foundation.
License: Apache License 2.0
After enabling the SmartStage feature in distributed training with the modelzoo code, the following error occurs.
Other info / logs
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: From /job:ps/replica:0/task:0:
Output 30 of type float does not match declared output type int64 for node {{node prefetch_2/DataBufferTake}}
###
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 840, in <module>
main(tf_config, server)
File "train.py", line 610, in main
checkpoint_dir, tf_config, server)
File "train.py", line 480, in train
sess.run([model.loss, model.train_op])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 719, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: From /job:ps/replica:0/task:0:
Output 30 of type float does not match declared output type int64 for node node prefetch_2/DataBufferTake (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748)
At present, DeepRec does not support evaluating very large models on a single node. Multiple ps nodes are required to load a large model, and multiple workers are needed for distributed evaluation. Supporting this would let DeepRec cover more scenarios.
Unlike training, evaluation does not involve modifying the network structure to improve model accuracy; the concern is instead how to raise evaluation throughput and reduce evaluation latency. DeepRec already supports distributed training, and evaluation is simpler than training because it involves no updates to the ps. In the code, DeepRec first decides, based on its parameters, whether to initialize the cluster and how to initialize it.
Two modes of distributed multi-evaluator evaluation need to be implemented:
1. Mode 1 contains ps, worker and evaluator nodes. DeepRec already implements the single-evaluator case in this mode; multiple evaluators still need to be implemented. One idea is to add multiple evaluators directly to the initialization list of the distributed cluster in DeepRec; another is to use the tf.distribute.Strategy interface.
2. Mode 2 has only ps and evaluator nodes. It differs from mode 1 in that no training takes place: the already trained offline model is loaded into the ps and evaluated directly.
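The two modes can be sketched as cluster layouts. The snippet below is only an illustration of the first idea (adding several evaluators to the cluster initialization list), not DeepRec's implementation: the helper names `build_eval_cluster` and `tf_config_for`, the job name `evaluator`, the host addresses, and the TF_CONFIG-style JSON layout are all assumptions made for the example.

```python
import json


def build_eval_cluster(mode, ps_hosts, evaluator_hosts, worker_hosts=()):
    """Return a TF_CONFIG-style cluster dict for the two evaluation modes.

    mode 1: ps + worker + evaluator nodes (training and evaluation together)
    mode 2: ps + evaluator nodes only (evaluate an already trained model)
    """
    if not ps_hosts or not evaluator_hosts:
        raise ValueError("both ps and evaluator hosts are required")
    cluster = {"ps": list(ps_hosts), "evaluator": list(evaluator_hosts)}
    if mode == 1:
        if not worker_hosts:
            raise ValueError("mode 1 needs at least one worker")
        cluster["worker"] = list(worker_hosts)
    elif mode != 2:
        raise ValueError("mode must be 1 or 2")
    return cluster


def tf_config_for(cluster, job, index):
    """Serialize the per-task configuration for one node of the cluster."""
    if index < 0 or index >= len(cluster[job]):
        raise IndexError("task index out of range for job " + job)
    return json.dumps({"cluster": cluster, "task": {"type": job, "index": index}})


# Hypothetical hosts: two ps nodes serving a large model to three evaluators.
mode2 = build_eval_cluster(
    2,
    ps_hosts=["ps0:2222", "ps1:2222"],
    evaluator_hosts=["ev0:2222", "ev1:2222", "ev2:2222"],
)
print(tf_config_for(mode2, "evaluator", 1))
```

In mode 2 the worker job is simply absent, which mirrors the point that evaluation performs no parameter updates on the ps.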
env: A100 8gpu + horovod
framework: DeepRec1.15.5 vs nvidia-tensorflow 1.15.4
model: transformer + mmoe
result: DeepRec 1.8 s/step, nvidia-tensorflow 0.55 s/step
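To put the gap in relative terms, the reported per-step times can be converted into a slowdown factor and into steps per second. This is plain arithmetic on the two numbers above; the variable names are illustrative only.

```python
# Per-step times reported above, in seconds.
deeprec_s_per_step = 1.8
nv_tf_s_per_step = 0.55

# How many times longer one DeepRec step takes than one nvidia-tensorflow step.
slowdown = deeprec_s_per_step / nv_tf_s_per_step

# The same comparison expressed as throughput.
deeprec_steps_per_s = 1.0 / deeprec_s_per_step
nv_tf_steps_per_s = 1.0 / nv_tf_s_per_step

print(f"slowdown: {slowdown:.2f}x")
print(f"throughput: {deeprec_steps_per_s:.2f} vs {nv_tf_steps_per_s:.2f} steps/s")
```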
When I build a PMEM memkind environment from the latest commit and run the launch script, the following error appears.
The build options I used:
bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --copt="-L/usr/local/lib" --copt="-lpmem" --copt="-lmemkind" --config=opt //tensorflow/tools/pip_package:build_pip_package
The script I used:
numactl -N 1 ./launch.sh --batch_size=1280 --dim_size=512 --max_mock_id_amplify=1800 --num_steps=2000 --ev_storage=pmem_memkind
Error logs:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Traceback (most recent call last):
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:ps/replica:0/task:0:
MultiLevel EV's Cache size -1 should large than IDs in batch 1280
[[{{node fm/embedding_lookup_36}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./benchmark.py", line 228, in <module>
tf.app.run()
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "./benchmark.py", line 203, in main
sess.run(train_op)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
run_metadata=run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
run_metadata=run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
raise six.reraise(*original_exc_info)
File "/home/pai/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
return self._sess.run(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
run_metadata=run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
return self._sess.run(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:ps/replica:0/task:0:
MultiLevel EV's Cache size -1 should large than IDs in batch 1280
[[node fm/embedding_lookup_36 (defined at /home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'fm/embedding_lookup_36':
File "./benchmark.py", line 228, in <module>
tf.app.run()
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/pai/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "./benchmark.py", line 121, in main
tf.nn.embedding_lookup(fm_w, batch['col{}'.format(sidx)]))
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 418, in embedding_lookup
counts=counts)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 184, in _embedding_lookup_and_transform
counts=counts),
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 3958, in gather
counts=counts)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/kv_variable_ops.py", line 749, in sparse_read
name=name)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_kv_variable_ops.py", line 647, in kv_resource_gather
validate_indices=validate_indices, name=name)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Modelzoo performance test based on the commit "[Release] Update DeepRec release version to 1.15.5+deeprec2201. (#43)".
Test machines: Alibaba Cloud ECS general purpose instance family with high clock speeds - ecs.hfg7.2xlarge.
Test perf result:
Gstep | WDL | WDL % | DLRM | DLRM % | DeepFM | DeepFM % | DSSM | DSSM % | DIEN | DIEN % | DIN | DIN % |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Community TF | 31.92626 | baseline | 82.09168 | baseline | 37.20978 | baseline | 18.54726 | baseline | 14.62987 | baseline | 18.57746 | baseline |
DeepRec FP32 | 34.69318 | 108.67% | 105.4547 | 128.46% | 43.31713 | 116.41% | 21.64175 | 116.68% | 13.27125 | 90.71% | 17.6932 | 95.24% |
DeepRec BF16 | 49.38222 | 154.68% | 114.2221 | 139.14% | 47.34401 | 127.24% | 23.13698 | 124.75% | 13.0392 | 89.13% | 17.20525 | 92.61% |
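The percent columns appear to be each DeepRec value divided by the stock TensorFlow baseline for the same model. The sketch below recomputes them for the WDL and DLRM columns as a sanity check; the dictionary keys and the helper name `percent_of_baseline` are made up for this example, and only the numeric values come from the table.

```python
# Gstep values copied from the WDL and DLRM columns of the table above.
GSTEP = {
    "community_tf": {"WDL": 31.92626, "DLRM": 82.09168},
    "deeprec_fp32": {"WDL": 34.69318, "DLRM": 105.4547},
    "deeprec_bf16": {"WDL": 49.38222, "DLRM": 114.2221},
}


def percent_of_baseline(table, baseline="community_tf"):
    """Express every non-baseline row as a percentage of the baseline row."""
    base = table[baseline]
    return {
        name: {model: round(100.0 * value / base[model], 2)
               for model, value in row.items()}
        for name, row in table.items()
        if name != baseline
    }


percentages = percent_of_baseline(GSTEP)
for name, row in percentages.items():
    print(name, row)
```

The same helper applies unchanged to the remaining model columns and to the AUC table.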
Test AUC result:
AUC | WDL | WDL % | DLRM | DLRM % | DeepFM | DeepFM % | DSSM | DSSM % | DIEN | DIEN % | DIN | DIN % |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Community TF | 0.775168 | baseline | 0.768852 | baseline | 0.744794 | baseline | 0.504404 | baseline | 0.8443 | baseline | 0.7887 | baseline |
DeepRec FP32 | 0.775515 | 100.04% | 0.771128 | 100.30% | 0.746055 | 100.17% | 0.503653 | 99.85% | 0.8472 | 100.34% | 0.7913 | 100.33% |
DeepRec BF16 | 0.77604 | 100.11% | 0.772185 | 100.43% | 0.741192 | 99.52% | 0.492327 | 97.61% | 0.8358 | 98.99% | 0.7883 | 99.95% |
PS: The DSSM dataset is small, so its ACC and AUC are limited.
An error occurs when Auto Graph Fusion is enabled in modelzoo's DIEN.
Reproduce the issue
The code and dataset are provided in a Docker image: docker pull cesg-prc-registry.cn-beijing.cr.aliyuncs.com/cesg-ali/deeprec-modelzoo:220401-dien-issue
The DeepRec installed in the image is built from f4368d6.
Run the following commands to reproduce the issue.
/root/modelzoo/DIEN
python train.py --steps 100 --no_eval --op_fusion True
Other info / logs
2022-04-01 02:58:35.554337: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/head/loss/xentropy/Select_grad/zeros_like
2022-04-01 02:58:35.554414: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/Select_1
2022-04-01 02:58:35.554462: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1
2022-04-01 02:58:35.554552: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_else_const head/gradients/attention_layer/Select_grad/zeros_like
2022-04-01 02:58:35.554612: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/attention_layer/Select_grad/Select_1
2022-04-01 02:58:35.554668: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/attention_layer/Select_grad/tuple/control_dependency_1
2022-04-01 02:58:35.554933: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/zeros_like
2022-04-01 02:58:35.554993: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/Select
2022-04-01 02:58:35.555030: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/tuple/control_dependency
2022-04-01 02:58:35.555062: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_4_grad/zeros_like
2022-04-01 02:58:35.555117: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_4_grad/Select
2022-04-01 02:58:35.555166: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_4_grad/tuple/control_dependency
2022-04-01 02:58:35.555187: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_5_grad/zeros_like
2022-04-01 02:58:35.555234: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_5_grad/Select
2022-04-01 02:58:35.555279: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_5_grad/tuple/control_dependency
2022-04-01 02:58:35.555318: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/rnn_1/gru1/while/Select_grad/zeros_like
2022-04-01 02:58:35.555383: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/rnn_1/gru1/while/Select_grad/Select
2022-04-01 02:58:35.555449: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/rnn_1/gru1/while/Select_grad/tuple/control_dependency
2022-04-01 02:58:35.555466: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_grad/zeros_like
2022-04-01 02:58:35.555530: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_grad/Select
2022-04-01 02:58:35.555594: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_grad/tuple/control_dependency
2022-04-01 02:58:35.555610: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_1_grad/zeros_like
2022-04-01 02:58:35.555673: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_1_grad/Select
2022-04-01 02:58:35.555737: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_1_grad/tuple/control_dependency
2022-04-01 02:58:35.555764: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_2_grad/zeros_like
2022-04-01 02:58:35.555842: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_2_grad/Select
2022-04-01 02:58:35.555920: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_2_grad/tuple/control_dependency
2022-04-01 02:58:35.555937: I ./tensorflow/core/graph/template_select_pruning_base.h:70] Found match op by select_pruning_then_const head/gradients/input_layer/embedding_lookup_3_grad/zeros_like
2022-04-01 02:58:35.556015: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_3_grad/Select
2022-04-01 02:58:35.556092: I ./tensorflow/core/graph/template_select_pruning_base.h:77] remove node: head/gradients/input_layer/embedding_lookup_3_grad/tuple/control_dependency
2022-04-01 02:58:35.556208: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/input_layer/UID_embedding/UID_embedding_weights]
2022-04-01 02:58:35.556260: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup]
2022-04-01 02:58:35.556278: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_1]
2022-04-01 02:58:35.556294: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_2]
2022-04-01 02:58:35.556312: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_3]
2022-04-01 02:58:35.556330: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_4]
2022-04-01 02:58:35.556346: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[input_layer/embedding_lookup_5]
2022-04-01 02:58:35.556676: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar] match op[head/loss/xentropy/Select]
2022-04-01 02:58:35.556988: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_grad/Select]
2022-04-01 02:58:35.557014: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select]
2022-04-01 02:58:35.557041: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/rnn_2/gru2/while/Select_1_grad/Select]
2022-04-01 02:58:35.557072: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/attention_layer/Select_grad/Select]
2022-04-01 02:58:35.557095: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/rnn_1/gru1/while/Select_1_grad/Select]
2022-04-01 02:58:35.557373: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select_1]
2022-04-01 02:58:35.557415: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/input_layer/UID_embedding/UID_embedding_weights_grad/Select_1]
2022-04-01 02:58:35.557431: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/rnn_2/gru2/while/Select_1_grad/Select_1]
2022-04-01 02:58:35.557453: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_4_grad/Select_1]
2022-04-01 02:58:35.557466: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_5_grad/Select_1]
2022-04-01 02:58:35.557495: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/rnn_1/gru1/while/Select_1_grad/Select_1]
2022-04-01 02:58:35.557509: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/rnn_1/gru1/while/Select_grad/Select_1]
2022-04-01 02:58:35.557523: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_grad/Select_1]
2022-04-01 02:58:35.557536: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_1_grad/Select_1]
2022-04-01 02:58:35.557560: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_2_grad/Select_1]
2022-04-01 02:58:35.557572: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/embedding_lookup_3_grad/Select_1]
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-04-01 02:58:37.395205: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: {{node head/gradients/rnn_2/gru2/while/add_1_grad/Reshape}} has inputs from different frames. The input {{node head/gradients/rnn_2/gru2/while/add_1_grad/BroadcastGradientArgs/StackPopV2}} is in frame 'head/gradients/rnn_2/gru2/while/while_context'. The input {{node head/gradients/rnn_2/gru2/while/add_1_grad/Sum}} is in frame ''.
2022-04-01 02:58:37.878245: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: {{node head/gradients/rnn_2/gru2/while/Switch_3_grad/b_switch}} has inputs from different frames. The input {{node head/gradients/rnn_2/gru2/while/Switch_3_grad_1/NextIteration}} is in frame ''. The input {{node head/gradients/rnn_2/gru2/while/Exit_3_grad/b_exit}} is in frame 'head/gradients/rnn_2/gru2/while/while_context'.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{node fused_op_3_select_else_scalar_in_grad}} has inputs from different frames. The input {{node head/gradients/rnn_2/gru2/while/Select_1_grad/Select/StackPopV2}} is in frame 'head/gradients/rnn_2/gru2/while/while_context'. The input {{node head/clip_by_norm_25/Greater/y}} is in frame ''.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 1147, in <module>
main()
File "train.py", line 927, in main
checkpoint_dir, tf_config, server)
File "train.py", line 786, in train
sess.run([model.loss, model.train_op])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 719, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{node fused_op_3_select_else_scalar_in_grad}} has inputs from different frames. The input node head/gradients/rnn_2/gru2/while/Select_1_grad/Select/StackPopV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) is in frame 'head/gradients/rnn_2/gru2/while/while_context'. The input node head/clip_by_norm_25/Greater/y (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) is in frame ''.
I want to enable the Auto Micro Batch feature in WDL. I followed the steps in the DeepRec docs, but I get an error.
Code to reproduce the issue
I use the following code to enable Auto Graph Fusion. For the complete script, see Full code.
if args.op_fusion and not args.tf:
    '''Auto Graph Fusion'''
    sess_config.graph_options.optimizer_options.do_op_fusion = True
Running python train.py --steps 1000 --no_eval --micro_batch 2 with the WDL dataset reproduces the error.
When --micro_batch (micro_batch_num) is set to 1, it works.
The docs say that "the AutoMicroBatch feature depends on the user enabling the graph optimization option". Does that mean Auto Graph Fusion? It can be enabled with --op_fusion True, but I get the same error. I also ran into problems when enabling Auto Graph Fusion itself; see issue #126.
This seems to be caused by how the dataset is initialized in MonitoredTrainingSession, so this issue is different from #86, which uses tf.Session().
logs
INFO:tensorflow:Parsing ./data/train.csv
INFO:tensorflow:Parsing ./data/eval.csv
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Init incremental saver , incremental_save:False, incremental_path:./result/model_WIDE_AND_DEEP_1648002155/.incremental_checkpoint/incremental_model.ckpt
INFO:tensorflow:Graph was finalized.
2022-03-23 10:22:39.913346: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2022-03-23 10:22:39.932151: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x556fea568950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-23 10:22:39.932183: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:tensorflow:run without loading checkpoint
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_WIDE_AND_DEEP_1648002155/model.ckpt.
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
Using TensorFlow version 1.15.5
Checking dataset...
Numbers of training dataset is 8000000
Numbers of test dataset is 2000000
The training steps is 100
The testing steps is 7813
Saving model checkpoints to ./result/model_WIDE_AND_DEEP_1648002155
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
[[{{node IteratorGetNext_1/dup0}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train_rebuild.py", line 746, in <module>
main()
File "train_rebuild.py", line 542, in main
checkpoint_dir, tf_config, server)
File "train_rebuild.py", line 414, in train
sess.run([model.loss, model.train_op])
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
run_metadata=run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
run_metadata=run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
raise six.reraise(*original_exc_info)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
return self._sess.run(*args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
run_metadata=run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
return self._sess.run(*args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
[[{{node IteratorGetNext_1/dup0}}]]
Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template
System information
Describe the problem
ERROR: /DeepRec/tensorflow/core/kernels/BUILD:4695:1: error while parsing .d file: /root/.cache/bazel/_bazel_root/de860b3f457ade81f033a15040b8fdd2/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/core/kernels/_objs/bias_op_gpu/bias_op_gpu.cu.pic.d (No such file or directory)
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/config.h:33:0,
from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/execution_policy.h:35,
from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/iterator/detail/device_system_tag.h:23,
from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/iterator/detail/iterator_facade_category.h:22,
from /usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/iterator/iterator_facade.h:37,
from bazel-out/host/bin/external/cub_archive/_virtual_includes/cub/third_party/cub/device/../iterator/arg_index_input_iterator.cuh:48,
from bazel-out/host/bin/external/cub_archive/_virtual_includes/cub/third_party/cub/device/device_reduce.cuh:41,
from ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:27,
from tensorflow/core/kernels/bias_op_gpu.cu.cc:28:
/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_namespace.cuh:46:2: error: #error CUB requires a definition of CUB_NS_QUALIFIER when CUB_NS_PREFIX/POSTFIX are defined.
#error CUB requires a definition of CUB_NS_QUALIFIER when CUB_NS_PREFIX/POSTFIX are defined.
^~~~~
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 41.305s, Critical Path: 36.17s
INFO: 928 processes: 928 local.
FAILED: Build did NOT complete successfully
Provide the exact sequence of commands / steps that you executed before running into the problem
step1:
./configure
Here is the content of .tf_configure.bazelrc:
build --action_env PYTHON_BIN_PATH="/usr/bin/python3"
build --action_env PYTHON_LIB_PATH="/usr/lib64/python3.6/site-packages"
build --python_path="/usr/bin/python3"
build:xla --define with_xla_support=true
build --config=xla
build:star --define with_star_support=true
build:pmem --define with_pmem_support=true
build --action_env TF_USE_CCACHE="0"
build --action_env CUDA_TOOLKIT_PATH="/usr/local/cuda"
build --action_env TF_CUDA_COMPUTE_CAPABILITIES="7.0,8.0,8.6,6.1"
build --action_env LD_LIBRARY_PATH="/usr/lib64:/usr/local/lib64:/usr/local/lib64:/usr/local/cuda/lib64:/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:/opt/rh/devtoolset-7/root/usr/lib64/dyninst:/opt/rh/devtoolset-7/root/usr/lib/dyninst:/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:/usr/lib64/:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64/:/usr/lib64/:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
build --action_env GCC_HOST_COMPILER_PATH="/opt/rh/devtoolset-7/root/usr/bin/gcc"
build --config=cuda
build:opt --copt=-march=native
build:opt --copt=-Wno-sign-compare
build:opt --host_copt=-march=native
build:opt --define with_default_optimizations=true
build:v2 --define=tf_api_version=2
test --flaky_test_attempts=3
test --test_size_filters=small,medium
test --test_tag_filters=-benchmark-test,-no_oss,-oss_serial
test --build_tag_filters=-benchmark-test,-no_oss
test --test_tag_filters=-gpu
test --build_tag_filters=-gpu
build --action_env TF_CONFIGURE_IOS="0"
step2:
bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" -c opt --config=opt --copt=-march=native //tensorflow/tools/pip_package:build_pip_package --verbose_failures
System information
Describe the problem
ERROR: /DeepRec/tensorflow/BUILD:893:1: Executing genrule //tensorflow:tf_python_api_gen_v1 failed (Exit 1)
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 27, in <module>
from tensorflow.python.tools.api.generator import doc_srcs
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 73, in <module>
from tensorflow.python.ops.standard_ops import *
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/ops/standard_ops.py", line 25, in <module>
from tensorflow.python import autograph
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/__init__.py", line 35, in <module>
from tensorflow.python.autograph import operators
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/__init__.py", line 40, in <module>
from tensorflow.python.autograph.operators.control_flow import for_stmt
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/control_flow.py", line 65, in <module>
from tensorflow.python.autograph.operators import py_builtins
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/autograph/operators/py_builtins.py", line 30, in <module>
from tensorflow.python.data.ops import dataset_ops
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/__init__.py", line 25, in <module>
from tensorflow.python.data import experimental
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/experimental/__init__.py", line 89, in <module>
from tensorflow.python.data.experimental.ops.batching import dense_to_sparse_batch
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/experimental/ops/batching.py", line 20, in <module>
from tensorflow.python.data.ops import dataset_ops
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/ops/dataset_ops.py", line 40, in <module>
from tensorflow.python.data.ops import iterator_ops
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/data/ops/iterator_ops.py", line 35, in <module>
from tensorflow.python.training.saver import BaseSaverBuilder
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/training/saver.py", line 57, in <module>
from tensorflow.python.training.saving import saveable_object_util
File "/root/.cache/bazel/_bazel_root/ac437ea991a64a55acdfc27c9ef15814/execroot/org_tensorflow/bazel-out/host/bin/tensorflow/create_tensorflow.python_api_1_tf_python_api_gen_v1.runfiles/org_tensorflow/tensorflow/python/training/saving/saveable_object_util.py", line 33, in <module>
from tensorflow.python.training import saver
ImportError: cannot import name saver
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Provide the exact sequence of commands / steps that you executed before running into the problem
bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=1" --host_cxxopt="-D_GLIBCXX_USE_CXX11_ABI=1" -c opt --config=v1 --config=opt --config=mkl_threadpool --define build_with_mkl_dnn_v1_only=true //tensorflow/tools/pip_package:build_pip_package
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template
System information
Describe the current behavior
Using get_embedding_variable to create an EmbeddingVariable for embedding lookup raises an unexpected keyword argument error while the slot variable is being created. The detailed error stack is:
File "/usr/local/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/optimizer.py", line 1302, in _zeros_slot
new_slot_variable = slot_creator.create_zeros_slot(var, op_name, slot_config=slot_config)
File "/usr/local/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/slot_creator.py", line 266, in create_zeros_slot
slot_config=slot_config)
File "/usr/local/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/slot_creator.py", line 239, in create_slot_with_initializer
dtype, slot_config)
File "/usr/local/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/slot_creator.py", line 92, in _create_slot_var
ht_partition_num=primary._ht_partition_num)
TypeError: get_embedding_variable_internal() got an unexpected keyword argument 'ht_partition_num'
Describe the expected behavior
Training without error.
Maybe _create_slot_var should use get_embedding_variable_v2_internal for all cases.
According to the documentation for Dynamic-dim Embedding Variable, when using a dynamic-dimension embedding variable, the blocknum parameter must be passed to embedding_lookup to indicate the blocknum corresponding to each feature.
However, I am confused about how to assign blocknum for every feature during embedding lookup. Could you please provide a minimal example showing how to initialize and look up a Dynamic-dimension Embedding Variable?
A convergence problem occurs at commit ccb8450, and the metrics become abnormal.
For example, I set the original learning rate to 0.001 in standalone mode, and 0.001 / sqrt(10) performs well with 10 workers. But with 20 workers, 0.001 / sqrt(20) performs very badly. Is there any suggestion for how to adjust the learning rate as the number of workers increases?
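For reference, the scaling rule described in this question can be written out as a small sketch. The helper name `scaled_lr` is hypothetical and is not part of DeepRec or TensorFlow; it only reproduces the arithmetic quoted above (base rate divided by the square root of the worker count) and makes no claim about which rule converges best.

```python
import math


def scaled_lr(base_lr, num_workers):
    """Return base_lr / sqrt(num_workers), the rule quoted in the question.

    Hypothetical helper for illustration only; not a DeepRec API.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return base_lr / math.sqrt(num_workers)


if __name__ == "__main__":
    # Standalone, 10 workers, and 20 workers, as in the report above.
    for workers in (1, 10, 20):
        print(workers, scaled_lr(0.001, workers))
```

Printing the values side by side makes it easy to compare the 10-worker and 20-worker settings when trying alternative scaling rules.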
After enabling Adaptive embedding, model evaluation with modelzoo fails after training completes.
Code to reproduce the issue
With WDL in modelzoo, run python train.py --steps 100 --adaptive_emb true
Other info / logs
Training completed.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:run with loading checkpoint
INFO:tensorflow:Restoring parameters from ./result/model_BST_1653893703/model.ckpt-100
2022-05-30 14:56:04.871953: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/adgroup_id_embedding/adgroup_id_embedding_weights][new_name:fused_op_1_select_then_scalar]
2022-05-30 14:56:04.872043: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/age_level_embedding/age_level_embedding_weights][new_name:fused_op_2_select_then_scalar]
2022-05-30 14:56:04.872552: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/brand_embedding/brand_embedding_weights][new_name:fused_op_3_select_then_scalar]
2022-05-30 14:56:04.872924: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/campaign_id_embedding/campaign_id_embedding_weights][new_name:fused_op_4_select_then_scalar]
2022-05-30 14:56:04.873322: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/cate_id_embedding/cate_id_embedding_weights][new_name:fused_op_5_select_then_scalar]
2022-05-30 14:56:04.873678: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/cms_group_id_embedding/cms_group_id_embedding_weights][new_name:fused_op_6_select_then_scalar]
2022-05-30 14:56:04.874156: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/cms_segid_embedding/cms_segid_embedding_weights][new_name:fused_op_7_select_then_scalar]
2022-05-30 14:56:04.874631: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/customer_embedding/customer_embedding_weights][new_name:fused_op_8_select_then_scalar]
2022-05-30 14:56:04.875088: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/new_user_class_level_embedding/new_user_class_level_embedding_weights][new_name:fused_op_9_select_then_scalar]
2022-05-30 14:56:04.875571: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/occupation_embedding/occupation_embedding_weights][new_name:fused_op_10_select_then_scalar]
2022-05-30 14:56:04.875981: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/pid_embedding/pid_embedding_weights][new_name:fused_op_11_select_then_scalar]
2022-05-30 14:56:04.876455: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/price_embedding/price_embedding_weights][new_name:fused_op_12_select_then_scalar]
2022-05-30 14:56:04.876896: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/pvalue_level_embedding/pvalue_level_embedding_weights][new_name:fused_op_13_select_then_scalar]
2022-05-30 14:56:04.877411: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/shopping_level_embedding/shopping_level_embedding_weights][new_name:fused_op_14_select_then_scalar]
2022-05-30 14:56:04.877944: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar] match op[input_layer/unseq_input_layer/input_layer/user_id_embedding/user_id_embedding_weights][new_name:fused_op_15_select_then_scalar]
2022-05-30 14:56:04.880237: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/adgroup_id_embedding/adgroup_id_embedding_weights_grad/Select][new_name:fused_op_1_select_else_scalar_in_grad]
2022-05-30 14:56:04.880278: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/age_level_embedding/age_level_embedding_weights_grad/Select][new_name:fused_op_2_select_else_scalar_in_grad]
2022-05-30 14:56:04.880297: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/campaign_id_embedding/campaign_id_embedding_weights_grad/Select][new_name:fused_op_3_select_else_scalar_in_grad]
2022-05-30 14:56:04.880316: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cms_group_id_embedding/cms_group_id_embedding_weights_grad/Select][new_name:fused_op_4_select_else_scalar_in_grad]
2022-05-30 14:56:04.880335: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cms_segid_embedding/cms_segid_embedding_weights_grad/Select][new_name:fused_op_5_select_else_scalar_in_grad]
2022-05-30 14:56:04.880351: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/customer_embedding/customer_embedding_weights_grad/Select][new_name:fused_op_6_select_else_scalar_in_grad]
2022-05-30 14:56:04.880367: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/new_user_class_level_embedding/new_user_class_level_embedding_weights_grad/Select][new_name:fused_op_7_select_else_scalar_in_grad]
2022-05-30 14:56:04.880383: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/occupation_embedding/occupation_embedding_weights_grad/Select][new_name:fused_op_8_select_else_scalar_in_grad]
2022-05-30 14:56:04.880398: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/pid_embedding/pid_embedding_weights_grad/Select][new_name:fused_op_9_select_else_scalar_in_grad]
2022-05-30 14:56:04.880413: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/price_embedding/price_embedding_weights_grad/Select][new_name:fused_op_10_select_else_scalar_in_grad]
2022-05-30 14:56:04.880428: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/pvalue_level_embedding/pvalue_level_embedding_weights_grad/Select][new_name:fused_op_11_select_else_scalar_in_grad]
2022-05-30 14:56:04.880443: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/shopping_level_embedding/shopping_level_embedding_weights_grad/Select][new_name:fused_op_12_select_else_scalar_in_grad]
2022-05-30 14:56:04.880458: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/user_id_embedding/user_id_embedding_weights_grad/Select][new_name:fused_op_13_select_else_scalar_in_grad]
2022-05-30 14:56:04.880521: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cate_id_embedding/cate_id_embedding_weights_grad/Select][new_name:fused_op_14_select_else_scalar_in_grad]
2022-05-30 14:56:04.880537: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_else_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/brand_embedding/brand_embedding_weights_grad/Select][new_name:fused_op_15_select_else_scalar_in_grad]
2022-05-30 14:56:04.881665: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/adgroup_id_embedding/adgroup_id_embedding_weights_grad/Select_1][new_name:fused_op_1_select_then_scalar_in_grad]
2022-05-30 14:56:04.881789: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/age_level_embedding/age_level_embedding_weights_grad/Select_1][new_name:fused_op_2_select_then_scalar_in_grad]
2022-05-30 14:56:04.882280: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/campaign_id_embedding/campaign_id_embedding_weights_grad/Select_1][new_name:fused_op_3_select_then_scalar_in_grad]
2022-05-30 14:56:04.882307: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cms_group_id_embedding/cms_group_id_embedding_weights_grad/Select_1][new_name:fused_op_4_select_then_scalar_in_grad]
2022-05-30 14:56:04.882325: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cms_segid_embedding/cms_segid_embedding_weights_grad/Select_1][new_name:fused_op_5_select_then_scalar_in_grad]
2022-05-30 14:56:04.882343: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/customer_embedding/customer_embedding_weights_grad/Select_1][new_name:fused_op_6_select_then_scalar_in_grad]
2022-05-30 14:56:04.882359: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/new_user_class_level_embedding/new_user_class_level_embedding_weights_grad/Select_1][new_name:fused_op_7_select_then_scalar_in_grad]
2022-05-30 14:56:04.882375: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/occupation_embedding/occupation_embedding_weights_grad/Select_1][new_name:fused_op_8_select_then_scalar_in_grad]
2022-05-30 14:56:04.882391: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/pid_embedding/pid_embedding_weights_grad/Select_1][new_name:fused_op_9_select_then_scalar_in_grad]
2022-05-30 14:56:04.882408: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/price_embedding/price_embedding_weights_grad/Select_1][new_name:fused_op_10_select_then_scalar_in_grad]
2022-05-30 14:56:04.882423: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/pvalue_level_embedding/pvalue_level_embedding_weights_grad/Select_1][new_name:fused_op_11_select_then_scalar_in_grad]
2022-05-30 14:56:04.882440: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/shopping_level_embedding/shopping_level_embedding_weights_grad/Select_1][new_name:fused_op_12_select_then_scalar_in_grad]
2022-05-30 14:56:04.882456: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/user_id_embedding/user_id_embedding_weights_grad/Select_1][new_name:fused_op_13_select_then_scalar_in_grad]
2022-05-30 14:56:04.882515: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/cate_id_embedding/cate_id_embedding_weights_grad/Select_1][new_name:fused_op_14_select_then_scalar_in_grad]
2022-05-30 14:56:04.882533: I ./tensorflow/core/graph/template_select_base.h:41] Fusion template[select_then_scalar_in_grad] match op[head/gradients/input_layer/unseq_input_layer/input_layer/brand_embedding/brand_embedding_weights_grad/Select_1][new_name:fused_op_15_select_then_scalar_in_grad]
2022-05-30 14:56:05.593580: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:05.598049: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:06.180909: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
INFO:tensorflow:Running local_init_op.
2022-05-30 14:56:06.308357: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:06.309002: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:06.309340: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:06.336448: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
INFO:tensorflow:Done running local_init_op.
2022-05-30 14:56:06.649402: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:06.768921: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:06.882066: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:07.804632: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:07.812211: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:08.707163: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:10.078551: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:10.496578: I tensorflow/core/common_runtime/tensorpool_allocator.cc:146] TensorPoolAllocator enabled
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Prefetching was closed.
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, data.shape must start with partitions.shape, got data.shape = [272], partitions.shape = [512]
[[{{node input_layer/unseq_input_layer/input_layer/price_embedding/DynamicPartition_1}}]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:
data.shape must start with partitions.shape, got data.shape = [272], partitions.shape = [512]
[[{{node input_layer/unseq_input_layer/input_layer/price_embedding/DynamicPartition_1}}]]
Exception in thread PrefetchThread-PrefetchRunner-4:
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: data.shape must start with partitions.shape, got data.shape = [272], partitions.shape = [512]
[[{{node input_layer/unseq_input_layer/input_layer/price_embedding/DynamicPartition_1}}]]
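As a note on the check that fails here: the message says the data tensor (272 elements) and the partitions tensor (512 elements) do not line up, and the log does not show why the two lengths differ. The sketch below is a pure-Python stand-in for the 1-D case, not TensorFlow's implementation; `dynamic_partition_1d` is a hypothetical name used only to illustrate the constraint that every data element needs exactly one partition index.

```python
def dynamic_partition_1d(data, partitions, num_partitions):
    """Split `data` into `num_partitions` lists using one index per element.

    Illustrative stand-in only; mirrors the shape requirement in the error
    message above for the 1-D case.
    """
    if len(data) != len(partitions):
        raise ValueError(
            "data.shape must start with partitions.shape, got "
            "data.shape = [%d], partitions.shape = [%d]"
            % (len(data), len(partitions)))
    buckets = [[] for _ in range(num_partitions)]
    for value, index in zip(data, partitions):
        buckets[index].append(value)
    return buckets


if __name__ == "__main__":
    # Matching lengths: each value is routed to the bucket named by its index.
    print(dynamic_partition_1d([10, 20, 30, 40], [0, 1, 0, 1], 2))
    # Mismatched lengths, as in the log above (272 vs 512), are rejected.
    try:
        dynamic_partition_1d(list(range(272)), [0] * 512, 1)
    except ValueError as err:
        print(err)
```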
2022-05-30 14:56:10.811248: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-05-30 14:56:14.841705: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
Traceback (most recent call last):
File "train.py", line 573, in eval
[model.acc_op, model.auc_op, merged])
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
run_metadata=run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
run_metadata=run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1408, in run
raise six.reraise(*original_exc_info)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
return self._sess.run(*args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
run_metadata=run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
return self._sess.run(*args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Session was closed.
[[node prefetch_2/TensorBufferTake (defined at /home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'prefetch_2/TensorBufferTake':
File "train.py", line 907, in <module>
main()
File "train.py", line 653, in main
next_element = tf.staged(next_element, num_threads=8, capacity=40)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch.py", line 140, in staged
shared_threads=num_clients)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_tensor_buffer_ops.py", line 535, in tensor_buffer_take
shared_threads=shared_threads, name=name)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 907, in <module>
main()
File "train.py", line 683, in main
checkpoint_dir)
File "train.py", line 576, in eval
print("ACC = {}\nAUC = {}".format(eval_acc, eval_auc))
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 911, in __exit__
self._close_internal(exception_type)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 949, in _close_internal
self._sess.close()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1216, in close
self._sess.close()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1384, in close
ignore_live_threads=True)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 718, in reraise
raise value.with_traceback(tb)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: data.shape must start with partitions.shape, got data.shape = [272], partitions.shape = [512]
[[{{node input_layer/unseq_input_layer/input_layer/price_embedding/DynamicPartition_1}}]]
When I replaced tf.Session() with tf.train.MonitoredTrainingSession(), a single step spent far too long collecting information for Memory Optimization.
After disabling Memory Optimization with export ENABLE_MEMORY_OPTIMIZATION=0, training runs normally.
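The workaround above can also be applied from inside the training script. This is a minimal sketch: only the variable name ENABLE_MEMORY_OPTIMIZATION comes from the report, the helper name is hypothetical, and it assumes the variable has to be set before TensorFlow is imported.

```python
import os


def disable_memory_optimization(environ=os.environ):
    """Write the setting from the report into the given environment mapping.

    Equivalent to `export ENABLE_MEMORY_OPTIMIZATION=0` in the shell.
    Assumption: call this before `import tensorflow` so the runtime sees it.
    """
    environ["ENABLE_MEMORY_OPTIMIZATION"] = "0"
    return environ


disable_memory_optimization()
print(os.environ["ENABLE_MEMORY_OPTIMIZATION"])
```

Passing a plain dict instead of os.environ makes the helper easy to check without touching the real process environment.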
Code
Here is the full code of the rebuilt DIEN.
Reproduce the issue
Here is the docker image to reproduce the issue, with DeepRec built on commit e3f51a3
docker pull cesg-prc-registry-vpc.cn-beijing.cr.aliyuncs.com/cesg-ali/deeprec-modelzoo:220328-DIEN-issue
cd /root/modelzoo/DIEN
python train.py --steps 300 --no_eval
START_STATISTIC_STEP and STOP_STATISTIC_STEP are set to 100 and 200 in train.py.
I put the DIEN code from the main branch in the /root/modelzoo/DIEN-old directory.
logs
With START_STATISTIC_STEP and STOP_STATISTIC_STEP set to 100 and 200, step 193 takes a very long time.
With the default settings (start at 1000, stop at 1100), one step between 1080 and 1095 takes a long time.
INFO:tensorflow:loss = 0.92963386, steps = 191 (0.206 sec)
INFO:tensorflow:loss = 0.93619776, steps = 192 (0.206 sec)
INFO:tensorflow:loss = 0.96694994, steps = 193 (370.673 sec)
INFO:tensorflow:loss = 0.93243694, steps = 194 (0.208 sec)
INFO:tensorflow:loss = 0.94905794, steps = 195 (0.210 sec)
INFO:tensorflow:loss = 0.9613142, steps = 196 (0.210 sec)
INFO:tensorflow:loss = 0.96409273, steps = 197 (0.209 sec)
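To locate such a step in a longer log, a small script can pick out the outliers. This is a sketch that assumes the log format shown above; the function name and the 10x-median threshold are illustrative choices, not part of DeepRec.

```python
import re

# Matches lines such as "INFO:tensorflow:loss = 0.93, steps = 193 (370.673 sec)".
STEP_LINE = re.compile(r"steps = (\d+) \(([\d.]+) sec\)")


def slow_steps(log_lines, factor=10.0):
    """Return (step, seconds) pairs that took longer than factor x the median."""
    timings = []
    for line in log_lines:
        match = STEP_LINE.search(line)
        if match:
            timings.append((int(match.group(1)), float(match.group(2))))
    if not timings:
        return []
    durations = sorted(seconds for _, seconds in timings)
    median = durations[len(durations) // 2]
    return [(step, seconds) for step, seconds in timings if seconds > factor * median]


log = [
    "INFO:tensorflow:loss = 0.92963386, steps = 191 (0.206 sec)",
    "INFO:tensorflow:loss = 0.93619776, steps = 192 (0.206 sec)",
    "INFO:tensorflow:loss = 0.96694994, steps = 193 (370.673 sec)",
    "INFO:tensorflow:loss = 0.93243694, steps = 194 (0.208 sec)",
    "INFO:tensorflow:loss = 0.94905794, steps = 195 (0.210 sec)",
    "INFO:tensorflow:loss = 0.9613142, steps = 196 (0.210 sec)",
    "INFO:tensorflow:loss = 0.96409273, steps = 197 (0.209 sec)",
]
print(slow_steps(log))
```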
Welcome to the open source world! If you haven't planned how to spend this summer, come to the Alibaba Summer of Code and code with us! 💻
Alibaba Summer of Code is a global program focused on engaging students directly in open source software development. Under the guidance of mentors from Alibaba open source projects, students can experience real-world software development. Alibaba Summer of Code runs from May 30th to September 1st. Students can use the summer to participate in an open source project and work with its core members.
This is a master issue to track the progress and result of Alibaba Summer of Code 2022.
On this exclusive developer journey, students will have the opportunity to:
Participate in the top projects of the International Open Source Foundation;
Get a scholarship from Alibaba;
Obtain an open source contributor certificate;
Get a fast pass to an Alibaba internship;
Get your code adopted and used by the open source project!
@shanshanpt [email protected]
@candyzone [email protected]
@JackMoriarty [email protected]
Browse open idea list here:
#230 Difficulty: Advanced
#232 Difficulty: Basic
#233 Difficulty: Basic
Upload your CV and project proposal via ASOC 2022 official website
If you have any questions, visit the event website: https://opensource.alibaba.com/asoc2022
Email address: [email protected]
Thanks in advance.
We first tested the star_server protocol on CPU machines, and the training task ran normally. Now we want to switch to GPU machines. The cluster has 2 PS nodes and 2 GPU worker nodes.
With the star_server protocol, the training task fails with the error: /job:worker/replica:0/task:0/device:GPU:0 unknown device. With grpc++ and grpc, the training task runs normally.
Git commit id: 821d157, branch: master
'''
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import numpy as np
import tensorflow as tf
# Synthetic data: 500 samples with 10 float features and an integer label in [0, 10).
num_x = np.random.randint(0, 10, (500, 10)).astype(dtype=np.float32)
num_y = np.random.randint(0, 10, 500).astype(dtype=np.int64)
dataset = tf.data.Dataset.from_tensor_slices((num_x, num_y)).batch(10)
iterator = dataset.make_initializable_iterator()
x, labels = iterator.get_next()
outputs = tf.layers.dense(x, 10)
logits = tf.layers.dense(outputs, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels,
                                              logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()
config = tf.ConfigProto()
# Enable Auto Micro Batch with 2 micro batches.
config.graph_options.optimizer_options.micro_batch_num = 2
with tf.Session(config=config) as sess:
    # The iterator is initialized before the first training step.
    sess.run(iterator.initializer)
    sess.run(init)
    print("================================")
    train_loss, _ = sess.run([loss, train_op])
    print(' Loss: %s .' % (train_loss))
'''
error msg
================================
Traceback (most recent call last):
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
[[{{node IteratorGetNext/dup0}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "demo1.py", line 37, in <module>
train_loss, _ = sess.run([loss, train_op])
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/pai/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
[[{{node IteratorGetNext/dup0}}]]
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template
System information
You can collect some of this information using our environment capture
script
You can also obtain the TensorFlow version with: 1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
On one type of GPU instance on Aliyun, I built DeepRec from source following the docs at https://github.com/alibaba/DeepRec#how-to-build and confirmed that GPU support was enabled. On this machine, however, my code runs only on the CPU: GPU-Util is always zero and GPU memory usage is low. Here is a runtime capture
On other machines, the same build and run steps work normally.
Here is the CPU info of a machine that works fine:
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
stepping : 4
microcode : 0x1
cpu MHz : 2499.998
cache size : 33792 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat
bogomips : 4999.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Here is the CPU info of the machine that shows low GPU utilization:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
stepping : 1
microcode : 0x1
cpu MHz : 2499.996
cache size : 40960 KB
physical id : 0
siblings : 32
core id : 0
cpu cores : 16
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat spec_ctrl intel_stibp
bogomips : 4999.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
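The two flag lists are long, so a quick set difference helps show where the machines differ. This is only a sketch for narrowing things down; the helper names are hypothetical, the sample strings are shortened excerpts of the listings above, and a flag difference by itself does not establish why the GPU is idle.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def flag_diff(working_text, failing_text):
    """Return (flags only on the working machine, flags only on the failing machine)."""
    working = parse_flags(working_text)
    failing = parse_flags(failing_text)
    return sorted(working - failing), sorted(failing - working)


# Shortened excerpts of the two listings, for illustration only.
working = (
    "model name : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz\n"
    "flags : fpu sse4_2 avx avx2 avx512f avx512dq mpx"
)
failing = (
    "model name : Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz\n"
    "flags : fpu sse4_2 avx avx2 ibrs ibpb"
)
only_working, only_failing = flag_diff(working, failing)
print("only on working machine:", only_working)
print("only on failing machine:", only_failing)
```

In practice the two arguments would be the full contents of /proc/cpuinfo from each machine.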
Describe the expected behavior
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
An error occurs when Multi-Hash Variable is enabled in modelzoo's DIEN.
The Multi-Hash Variable documentation should also be updated: https://deeprec.readthedocs.io/zh/latest/Multi-Hash-Variable.html
The num_of_partitions parameter of get_multihash_variable has been removed from the code but is still described in the doc.
Multi-Hash Variable seems to have a problem with the variable partitioner. The error is: type object 'float' has no attribute 'base_dtype', even though 'float' is the parameter passed down by default.
Without the variable partitioner, a different error occurs: 'MultiHashVariable' object has no attribute '_dtype'
Reproduce the issue
The code and dataset are provided in the docker image: docker pull cesg-prc-registry.cn-beijing.cr.aliyuncs.com/cesg-ali/deeprec-modelzoo:220401-dien-issue
The DeepRec installed in the image is built on commit f4368d6.
Run the following commands to reproduce the issue.
cd /root/modelzoo/DIEN
python train.py --steps 100 --no_eval --multihash True
# Disable variable partitioner
python train.py --steps 100 --no_eval --multihash True --input_layer_partitioner 0 --dense_layer_partitioner 0
Other info / logs
Traceback (most recent call last):
File "train.py", line 1147, in <module>
main()
File "train.py", line 903, in main
dense_layer_partitioner=dense_layer_partitioner)
File "train.py", line 157, in __init__
self._create_model()
File "train.py", line 464, in _create_model
uid_emb, item_emb, his_item_emb, noclk_his_item_emb, sequence_length = self._embedding_input_layer(
File "train.py", line 398, in _embedding_input_layer
self._embedding_dim
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 2344, in get_multihash_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 1525, in get_variable
aggregation=aggregation)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 805, in get_variable
ht_partition_num=ht_partition_num)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 697, in _true_getter
ht_partition_num=ht_partition_num)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 930, in _get_partitioned_variable
partitions = _call_partitioner(partitioner, shape, dtype)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/variable_scope.py", line 3237, in _call_partitioner
slicing = partitioner(shape=shape, dtype=dtype)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/partitioned_variables.py", line 205, in _partitioner
if dtype.base_dtype == dtypes.string:
AttributeError: type object 'float' has no attribute 'base_dtype'
Traceback (most recent call last):
File "train.py", line 1147, in <module>
main()
File "train.py", line 903, in main
dense_layer_partitioner=dense_layer_partitioner)
File "train.py", line 157, in __init__
self._create_model()
File "train.py", line 464, in _create_model
uid_emb, item_emb, his_item_emb, noclk_his_item_emb, sequence_length = self._embedding_input_layer(
File "train.py", line 423, in _embedding_input_layer
item_embedding_var)
File "train.py", line 344, in _get_embedding_input
sparse_weights=sparse_tensors_weights)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/embedding_ops.py", line 1275, in safe_embedding_lookup_sparse
if not (isinstance(w, resource_variable_ops.ResourceVariable) and dtype in (None, w.dtype)):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 473, in dtype
return self._dtype
AttributeError: 'MultiHashVariable' object has no attribute '_dtype'
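The first traceback ends at the dtype.base_dtype access inside the partitioner, and the message says the value was the Python built-in float type rather than a dtype object. The sketch below reproduces only that attribute error, without TensorFlow. DTypeStandIn and partitioner_check are hypothetical stand-ins written for illustration; they are not DeepRec or TensorFlow code.

```python
class DTypeStandIn:
    """Hypothetical stand-in for a dtype object; it only models base_dtype."""

    def __init__(self, name):
        self.name = name

    @property
    def base_dtype(self):
        return self


def partitioner_check(dtype):
    # Same kind of attribute access as the last frame of the first traceback.
    return dtype.base_dtype.name == "string"


def try_partitioner(dtype):
    """Return 'ok' if the check runs, otherwise the AttributeError message."""
    try:
        partitioner_check(dtype)
        return "ok"
    except AttributeError as error:
        return str(error)


print(try_partitioner(float))                    # the built-in type has no base_dtype
print(try_partitioner(DTypeStandIn("float32")))  # a dtype-like object passes the check
```

This suggests, as an assumption rather than a confirmed diagnosis, that the default dtype reaches the partitioner without being converted to a dtype object first.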
I read https://mp.weixin.qq.com/s/aEi6ooG9wDL-GXVWcGWRCw and found that DeepRec supports multi-level embedding, which can place features in HBM or DRAM according to their hotness. It sounds like a very good feature.
Then I read https://deeprec.readthedocs.io/zh/latest/Embedding-Variable.html but could not find how to use multi-level embedding.
My questions are:
How do I use multi-level embedding?
If the documentation is not available yet, could you give some config names or variable names as a clue? Then I can find the related source code myself.
Thanks.
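Test environment: A100, single machine with 4 GPUs, Horovod synchronous training, DeepRec vs. Nvidia-TF 1.15.4
Training samples: 3-way embedding, samples stored on HDFS in TFRecord format, batch_size 1024
Model structure: transformer + MMoE
Measured performance: DeepRec 1.8 s/step, Nvidia-TF 0.55 s/step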
At present, DeepRec cannot evaluate very large models on a single node. Multiple PS nodes are required to load a large model, and multiple workers are needed for distributed evaluation. Supporting this would let DeepRec cover more scenarios.
Unlike training, evaluation does not involve modifying the network structure to improve accuracy; the concern is how to raise evaluation throughput and reduce evaluation latency. DeepRec already supports distributed training, and evaluation is simpler than training because no updates to the PS are involved. In the code, DeepRec first decides whether to initialize the cluster, and how, according to the parameters.
Two modes of distributed multi-evaluator evaluation need to be implemented.
1. Mode 1 contains ps, worker and evaluator nodes. DeepRec already implements the single-evaluator case in this mode; multiple evaluators still need to be implemented. One idea is to add multiple evaluators directly to the initialization list of the distributed cluster in DeepRec, or to use the tf.distribute.Strategy interface.
2. Mode 2 has only ps and evaluator nodes. The difference from mode 1 is that no training is needed: the trained offline model is loaded into the ps nodes and its performance is evaluated directly.
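The two modes can be pictured as two cluster descriptions that differ only in whether a worker list is present. The sketch below builds them in the style of a TF_CONFIG value. The host names and helper names are hypothetical, and whether DeepRec accepts several evaluator entries in this form is exactly what this proposal would need to implement.

```python
import json


def build_cluster(ps_hosts, evaluator_hosts, worker_hosts=None):
    """Build a cluster description; leave worker_hosts out for mode 2."""
    cluster = {"ps": list(ps_hosts), "evaluator": list(evaluator_hosts)}
    if worker_hosts:
        cluster["worker"] = list(worker_hosts)
    return cluster


def tf_config_for(cluster, task_type, task_index):
    """Return a TF_CONFIG-style JSON string for one task in the cluster."""
    if task_type not in cluster:
        raise ValueError("unknown task type: %s" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index out of range: %d" % task_index)
    config = {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    return json.dumps(config, sort_keys=True)


# Hypothetical host names, for illustration only.
mode1 = build_cluster(
    ps_hosts=["ps0:2222", "ps1:2222"],
    evaluator_hosts=["eval0:2222", "eval1:2222"],
    worker_hosts=["worker0:2222", "worker1:2222"],
)
mode2 = build_cluster(
    ps_hosts=["ps0:2222", "ps1:2222"],
    evaluator_hosts=["eval0:2222", "eval1:2222"],
)
print(tf_config_for(mode1, "evaluator", 1))
print(tf_config_for(mode2, "ps", 0))
```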
First, we train a baseline model. Then we restore the parameters of the baseline model and continue training. The code we use to restore the parameters is as follows.
vars_to_warm_start = ['^((?!Adam)(?!pos_dense).)*$']
variables = self.restore_variables()
restorer = tf.compat.v1.train.Saver(var_list=variables, max_to_keep=1)
restorer.restore(session, base_checkpoint_path)
saver= tf.compat.v1.train.Saver(max_to_keep=1)
def restore_variables(self):
    list_of_vars = None
    if 'vars_to_warm_start' in _Hyperparams:
        vars_to_warm_start = _Hyperparams['vars_to_warm_start']
        if isinstance(vars_to_warm_start, str) or vars_to_warm_start is None:
            # Both vars_to_warm_start = '.*' and vars_to_warm_start = None will match
            # everything (in TRAINABLE_VARIABLES) here.
            self.logger.info("Warm-starting variables only in GLOBAL_VARIABLES.")
            list_of_vars = ops.get_collection(
                ops.GraphKeys.GLOBAL_VARIABLES, scope=vars_to_warm_start)
            self.logger.info('Loading base model variables: {}'.format(list_of_vars))
            saveable_objects = tf.get_collection(tf.GraphKeys.SAVEABLE_OBJECTS,
                                                 scope=vars_to_warm_start)
            self.logger.info('Loading saveable variables: {}'.format(saveable_objects))
            list_of_vars += saveable_objects
        elif isinstance(vars_to_warm_start, list):
            if all(isinstance(v, str) for v in vars_to_warm_start):
                self.logger.info("Warm-starting partial variables in GLOBAL_VARIABLES.")
                list_of_vars = []
                saveable_objects = []
                for v in vars_to_warm_start:
                    list_of_vars += ops.get_collection(
                        ops.GraphKeys.GLOBAL_VARIABLES, scope=v)
                    saveable_objects += tf.get_collection(tf.GraphKeys.SAVEABLE_OBJECTS,
                                                          scope=v)
                self.logger.info('Loading base model variables: {}'.format(list_of_vars))
                self.logger.info('Loading saveable variables: {}'.format(saveable_objects))
                list_of_vars += saveable_objects
    return list_of_vars
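The scope pattern '^((?!Adam)(?!pos_dense).)*$' decides which variables are restored, so it is worth checking what it selects. The sketch below applies it with re.match to a few names, which is how the scope argument of get_collection filters items as far as I know. Only the name ending in /Adam comes from the report; the other names are hypothetical.

```python
import re

WARM_START_PATTERN = r"^((?!Adam)(?!pos_dense).)*$"


def select_names(names, pattern=WARM_START_PATTERN):
    """Keep the names for which re.match(pattern, name) succeeds."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.match(name)]


names = [
    "feature_processing/imei_embedding/embedding_weights",
    "feature_processing/imei_embedding/embedding_weights/Adam",    # from the report
    "feature_processing/imei_embedding/embedding_weights/Adam_1",  # hypothetical
    "pos_dense/kernel",                                            # hypothetical
    "global_step",
]
for name in select_names(names):
    print(name)
```

Any name containing Adam or pos_dense anywhere is rejected, so with this pattern the Adam slot variables are not in the restore list.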
We enable GlobalStepEvict for the imei feature in the two stages.
If we enable GlobalStepEvict when restoring the baseline model, saving a checkpoint via the saver fails. The core dump info is:
tensorflow::SaveV2::Compute (this=0x7f8fd20bdec0, context=<optimized out>) at
tensorflow/core/kernels/save_restore_v2_ops.cc:177
tensor_name = "feature_processing/imei_embedding/embedding_weights/Adam"
There seems to be a problem when saving the Adam parameters.
If we only restore tf.trainable_variables(), the checkpoint is saved successfully. It fails when we restore tf.global_variables(), which include the Adam parameters.
If we disable GlobalStepEvict when restoring the baseline model, training runs normally, but the loss and AUC are poor.
https://github.com/alibaba/DeepRec/blob/083cb5fc5e895e928a7817faa736706b911861dc/tensorflow/core/graph/graph_constructor.cc#L1983
Trying to use smart stage hits an error at the line above. It can be fixed by changing dest.num_nodes() to dest.num_node_ids().
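In upstream TensorFlow's Graph class, num_nodes() counts the nodes currently in the graph, while num_node_ids() is one more than the largest id ever assigned. The toy model below shows why a table indexed by node id has to be sized with the latter once nodes have been removed. ToyGraph is a hypothetical illustration in Python, not the C++ class, and whether this is the exact failure at the linked line is an assumption.

```python
class ToyGraph:
    """Toy graph whose node ids are never reused after a node is removed."""

    def __init__(self):
        self._nodes = {}   # node id -> node name
        self._next_id = 0

    def add_node(self, name):
        node_id = self._next_id
        self._nodes[node_id] = name
        self._next_id += 1
        return node_id

    def remove_node(self, node_id):
        del self._nodes[node_id]

    def num_nodes(self):
        return len(self._nodes)

    def num_node_ids(self):
        return self._next_id

    def node_ids(self):
        return sorted(self._nodes)


graph = ToyGraph()
ids = [graph.add_node(name) for name in ("a", "b", "c", "d")]
graph.remove_node(ids[1])

too_small = [None] * graph.num_nodes()        # 3 slots
large_enough = [None] * graph.num_node_ids()  # 4 slots
out_of_range = [i for i in graph.node_ids() if i >= len(too_small)]
print(graph.num_nodes(), graph.num_node_ids(), out_of_range)
```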
System information
When I use grpc++ with Estimator, I get the following error, but training still continues. I don't know whether this is OK.
config = tf.estimator.RunConfig(
    save_checkpoints_secs=10 * 60,
    keep_checkpoint_max=2,
    protocol='grpc++'
)
model = tf.estimator.Estimator(
    model_fn=model_fn,
    params=model_params,
    model_dir=checkpoint,
    config=config
)
eval_spec = tf.estimator.EvalSpec(...)
train_spec = tf.estimator.TrainSpec(...)
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)
The DeepRec docs suggest there is a problem with the original Estimator, but my bazel build failed and I don't know what the Estimator check looks like when using grpc++. With the latest DeepRec version, do we need to install Estimator separately?
Describe the current behavior
When using the get_dynamic_dimension_embedding_variable function provided by DeepRec, the program crashes with Segmentation fault (core dumped); it may be hitting a kernel error.
Describe the expected behavior
Return the correct value.
Code to reproduce the issue
import tensorflow as tf
EMBEDDING_DIM = 10
var = tf.get_dynamic_dimension_embedding_variable("uid_embedding_var",
embedding_block_dimension=EMBEDDING_DIM / 2,
embedding_block_num=4)
ids = [21, 34, 78, 99, 56]
blocknums = [4, 1, 4, 3, 1]
emb = tf.nn.embedding_lookup(var, tf.cast(ids, tf.int64), blocknums=blocknums)
init = tf.global_variables_initializer()
sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
with tf.Session(config=sess_config) as sess:
    sess.run([init])
    print(sess.run([emb]))
Other info / logs
Here is the logs:
$ python dynamic_dimension_embedding_variable_test1.py
2022-02-24 15:40:16.850638: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2022-02-24 15:40:16.851615: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555e87ed6910 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-02-24 15:40:16.851639: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-02-24 15:40:16.854765: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-02-24 15:40:17.557625: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.558416: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555e881cf050 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-02-24 15:40:17.558443: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2022-02-24 15:40:17.558719: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.559381: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1599] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:08.0
2022-02-24 15:40:17.559760: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-24 15:40:17.566657: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-02-24 15:40:17.570319: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-02-24 15:40:17.570728: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-02-24 15:40:17.571501: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2022-02-24 15:40:17.572976: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-02-24 15:40:17.573214: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-02-24 15:40:17.573357: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.574064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.574719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1727] Adding visible gpu devices: 0
2022-02-24 15:40:17.574771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-24 15:40:17.575946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1139] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-02-24 15:40:17.575961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1145] 0
2022-02-24 15:40:17.575971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1158] 0: N
2022-02-24 15:40:17.576139: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.576818: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1084] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-02-24 15:40:17.577568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1284] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13945 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:08.0, compute capability: 7.5)
[array([[ 0.6678627 , -0.38343927, -0.65460324, 0.15888363, 0.710193 ,
0.640184 , 0.42595282, 0.7293787 , 1.3838437 , 0.27501038,
-0.96244717, -0.5522712 , -0.46999097, 0.45904443, -0.35207814,
0.39496022, -1.106673 , 0.21438211, -1.1451356 , 0.9796604 ],
[-0.14699893, 0.07010368, 0.22612067, -1.9068893 , -0.44930258,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ],
[-0.36353526, 0.36128962, 0.14200972, 0.07810795, -0.54961 ,
-0.15781127, -0.64423895, 0.97612906, -0.96893233, 0.8196201 ,
-0.7367647 , -0.94786507, 1.1452298 , 1.0325592 , 0.46815377,
-0.4092801 , -0.5371794 , -1.2808001 , -1.057108 , -0.7823616 ],
[-0.88329375, -1.5494045 , -0.4070856 , -1.8068027 , -0.8884988 ,
0.3828017 , -1.0075641 , -1.4119419 , -0.16102602, 0.7351839 ,
1.483396 , 0.6105891 , -0.23226756, 1.6206956 , 0.06422351,
0. , 0. , 0. , 0. , 0. ],
[ 0.7161362 , -0.737407 , -0.8979032 , 1.1798211 , 0.37206918,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ]],
dtype=float32)]
Segmentation fault (core dumped)
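The printed array is consistent with each row holding blocknum x block dimension values followed by zero padding up to the full width. The arithmetic below checks that reading against the reproducer's inputs. It is inferred from the output above rather than from the DeepRec documentation, and the helper names are hypothetical.

```python
def full_width(block_dimension, block_num):
    """Total embedding width when every block is in use."""
    return block_dimension * block_num


def expected_nonzero_widths(blocknums, block_dimension):
    """Number of leading non-padding values expected in each output row."""
    return [n * block_dimension for n in blocknums]


EMBEDDING_DIM = 10
# Integer division gives 5. The reproducer passes EMBEDDING_DIM / 2, which is the
# float 5.0 in Python 3; whether that relates to the crash is not known.
block_dimension = EMBEDDING_DIM // 2
blocknums = [4, 1, 4, 3, 1]

print(full_width(block_dimension, 4))
print(expected_nonzero_widths(blocknums, block_dimension))
```

The widths 20, 5, 20, 15 and 5 match the rows printed before the segmentation fault, so the lookup itself appears to return the expected values and the crash happens afterwards.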
AUC is unstable when auto_micro_batch is enabled in a DIN model implemented on top of deepctr.
I built DeepRec myself; the commit id is 31f83623dde1a1d3792d7f41ba310b29e40abaa7, released as r1.15.5-deeprec2204.
Everything is fine in the default DeepRec environment, and the AUC is around 0.716 across multiple experiments. With the Auto Micro Batch feature enabled, however, the AUC fluctuates in the range [0.71, 0.74] and training is slower.
Below is the code skeleton.
import tensorflow as tf
import horovod.tensorflow as hvd
class DIN:
    # implemented based on [deepctr](https://github.com/shenweichen/DeepCTR)
    pass
def prepareDataSet(data_path, batch_size):
    # parsed by tf.data.Dataset with prefetch
    pass
def create_model(data_path='.', batch_size=512, learning_rate=0.01):
    parsed_dataset = prepareDataSet(data_path, batch_size)
    iterator = parsed_dataset.make_one_shot_iterator()
    input_features, label = iterator.get_next()
    label = tf.reshape(label, [-1, 1])
    output = DIN(input_features)
    optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate * hvd.size(), initial_accumulator_value=1e-30)
    optimizer = hvd.DistributedOptimizer(optimizer)
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)(label, output)
    global_step = tf.train.get_or_create_global_step()
    train_op = optimizer.minimize(loss, global_step=global_step)
    _, auc = tf.metrics.auc(label, output)
    return train_op, auc
def create_sess_config(deeprec_auto_micro_batch):
    sess_config = tf.ConfigProto()
    sess_config.gpu_options.allow_growth = False
    sess_config.gpu_options.visible_device_list = str(hvd.local_rank())
    if deeprec_auto_micro_batch:
        sess_config.graph_options.optimizer_options.micro_batch_num = 2
    return sess_config
def train(deeprec_auto_micro_batch):
    batch_size = 512 if deeprec_auto_micro_batch else 1024
    train_op, auc = create_model(batch_size=batch_size)
    sess_config = create_sess_config(deeprec_auto_micro_batch=True)
    hooks = [
        hvd.BroadcastGlobalVariablesHook(0),
    ]
    with tf.train.MonitoredTrainingSession(hooks=hooks,
                                           config=sess_config) as mon_sess:
        fetches = {
            "train_op": train_op,
            'auc': auc,
        }
        while not mon_sess.should_stop():
            results = mon_sess.run(fetches)
            print(results['auc'])
if __name__ == "__main__":
    hvd.init()
    deeprec_auto_micro_batch = True
    train(deeprec_auto_micro_batch)
When I was reading
https://github.com/alibaba/DeepRec/blob/main/triton/tensorflow_backend_tf.cc#L941
https://github.com/alibaba/DeepRec/blob/main/triton/tensorflow_backend_tf.cc#L932
I wonder where the functions clear_allocator_type() (line 941) and set_allocator_type() (line 932) are defined. I could not find any file in TensorFlow that defines them.
I want to enable Auto Graph Fusion feature in WDL and follow the steps in DeepRec Docs, but I get an error.
Code to reproduce the issue
I use following codes to enable Auto Graph Fusion. The full code please see Full code
if args.op_fusion and not args.tf:
    '''Auto Graph Fusion'''
    sess_config.graph_options.optimizer_options.do_op_fusion = True
Running python train.py --steps 1000 --no_eval --op_fusion True on the WDL dataset reproduces the error.
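The log that follows is long and repetitive, so a small script can help when reading it. This is a minimal sketch that tallies fusion-template matches and collects removed nodes, assuming the log lines keep the "Fusion template[...] match op[...]" and "remove node: ..." shapes seen in this report; the function name summarize_fusion_log is hypothetical and not part of DeepRec.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... template_select_base.h:36] Fusion template[select_then_scalar] match op[...]
FUSION_RE = re.compile(
    r"Fusion template\[(?P<template>[^\]]+)\] match op\[(?P<op>[^\]]+)\]"
)
# Matches lines such as:
#   ... template_select_pruning_base.h:75] remove node: <node name>
REMOVE_RE = re.compile(r"remove node: (?P<node>\S+)")


def summarize_fusion_log(lines):
    """Count matches per fusion template and list the nodes reported as removed."""
    matches = Counter()
    removed = []
    for line in lines:
        m = FUSION_RE.search(line)
        if m:
            matches[m.group("template")] += 1
            continue
        m = REMOVE_RE.search(line)
        if m:
            removed.append(m.group("node"))
    return matches, removed


# Three lines copied from the log in this report.
sample = [
    "2022-03-22 14:10:30.787074: I ./tensorflow/core/graph/template_select_pruning_base.h:75] "
    "remove node: head/gradients/head/loss/xentropy/Select_grad/Select_1",
    "2022-03-22 14:10:30.797492: I ./tensorflow/core/graph/template_select_base.h:36] "
    "Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/"
    "input_layer/C10_embedding/C10_embedding_weights]",
    "2022-03-22 14:10:30.798945: I ./tensorflow/core/graph/template_select_base.h:36] "
    "Fusion template[select_else_scalar] match op[head/loss/xentropy/Select]",
]

matches, removed = summarize_fusion_log(sample)
print(dict(matches))
print(removed)
```

To use it on a saved log, pass an open file object, for example summarize_fusion_log(open("train.log")), where train.log is a placeholder path.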
logs
INFO:tensorflow:Parsing ./data/train.csv
INFO:tensorflow:Parsing ./data/eval.csv
INFO:tensorflow:Graph was finalized.
2022-03-22 14:10:30.688360: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2022-03-22 14:10:30.707518: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5622edbbebf0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-22 14:10:30.707558: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:tensorflow:run without loading checkpoint
2022-03-22 14:10:30.786850: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_else_const head/gradients/head/loss/xentropy/Select_grad/zeros_like
2022-03-22 14:10:30.787074: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/head/loss/xentropy/Select_grad/Select_1
2022-03-22 14:10:30.787248: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_else_const head/gradients_1/head/loss/xentropy/Select_grad/zeros_like
2022-03-22 14:10:30.787437: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/head/loss/xentropy/Select_grad/Select_1
2022-03-22 14:10:30.787920: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C10_embedding/C10_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788067: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C10_embedding/C10_embedding_weights_grad/Select
2022-03-22 14:10:30.788089: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C11_embedding/C11_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788232: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C11_embedding/C11_embedding_weights_grad/Select
2022-03-22 14:10:30.788255: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C12_embedding/C12_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788394: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C12_embedding/C12_embedding_weights_grad/Select
2022-03-22 14:10:30.788415: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C13_embedding/C13_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788554: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C13_embedding/C13_embedding_weights_grad/Select
2022-03-22 14:10:30.788575: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C14_embedding/C14_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788714: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C14_embedding/C14_embedding_weights_grad/Select
2022-03-22 14:10:30.788735: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C15_embedding/C15_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.788875: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C15_embedding/C15_embedding_weights_grad/Select
2022-03-22 14:10:30.788895: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C16_embedding/C16_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789049: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C16_embedding/C16_embedding_weights_grad/Select
2022-03-22 14:10:30.789071: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C17_embedding/C17_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789214: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C17_embedding/C17_embedding_weights_grad/Select
2022-03-22 14:10:30.789234: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C18_embedding/C18_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789374: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C18_embedding/C18_embedding_weights_grad/Select
2022-03-22 14:10:30.789395: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C19_embedding/C19_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789534: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C19_embedding/C19_embedding_weights_grad/Select
2022-03-22 14:10:30.789554: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C1_embedding/C1_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789693: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C1_embedding/C1_embedding_weights_grad/Select
2022-03-22 14:10:30.789715: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C20_embedding/C20_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.789854: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C20_embedding/C20_embedding_weights_grad/Select
2022-03-22 14:10:30.789874: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C21_embedding/C21_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790014: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C21_embedding/C21_embedding_weights_grad/Select
2022-03-22 14:10:30.790035: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C22_embedding/C22_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790177: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C22_embedding/C22_embedding_weights_grad/Select
2022-03-22 14:10:30.790197: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C23_embedding/C23_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790338: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C23_embedding/C23_embedding_weights_grad/Select
2022-03-22 14:10:30.790359: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C24_embedding/C24_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790504: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C24_embedding/C24_embedding_weights_grad/Select
2022-03-22 14:10:30.790529: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C25_embedding/C25_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790687: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C25_embedding/C25_embedding_weights_grad/Select
2022-03-22 14:10:30.790708: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C26_embedding/C26_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.790851: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C26_embedding/C26_embedding_weights_grad/Select
2022-03-22 14:10:30.790872: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C2_embedding/C2_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791015: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C2_embedding/C2_embedding_weights_grad/Select
2022-03-22 14:10:30.791036: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C3_embedding/C3_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791183: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C3_embedding/C3_embedding_weights_grad/Select
2022-03-22 14:10:30.791204: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C4_embedding/C4_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791348: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C4_embedding/C4_embedding_weights_grad/Select
2022-03-22 14:10:30.791369: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C5_embedding/C5_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791513: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C5_embedding/C5_embedding_weights_grad/Select
2022-03-22 14:10:30.791544: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C6_embedding/C6_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791686: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C6_embedding/C6_embedding_weights_grad/Select
2022-03-22 14:10:30.791706: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C7_embedding/C7_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.791848: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C7_embedding/C7_embedding_weights_grad/Select
2022-03-22 14:10:30.791874: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C8_embedding/C8_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.792021: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C8_embedding/C8_embedding_weights_grad/Select
2022-03-22 14:10:30.792042: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients/dnn/input_from_feature_columns/input_layer/C9_embedding/C9_embedding_weights_grad/zeros_like
2022-03-22 14:10:30.792188: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients/dnn/input_from_feature_columns/input_layer/C9_embedding/C9_embedding_weights_grad/Select
2022-03-22 14:10:30.792258: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C1/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.792436: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C1/weighted_sum_grad/Select
2022-03-22 14:10:30.792456: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C10/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.792632: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C10/weighted_sum_grad/Select
2022-03-22 14:10:30.792654: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C11/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.792827: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C11/weighted_sum_grad/Select
2022-03-22 14:10:30.792848: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C12/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793022: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C12/weighted_sum_grad/Select
2022-03-22 14:10:30.793043: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C13/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793221: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C13/weighted_sum_grad/Select
2022-03-22 14:10:30.793243: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C14/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793417: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C14/weighted_sum_grad/Select
2022-03-22 14:10:30.793439: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C15/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793612: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C15/weighted_sum_grad/Select
2022-03-22 14:10:30.793633: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C16/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.793807: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C16/weighted_sum_grad/Select
2022-03-22 14:10:30.793829: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C17/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794006: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C17/weighted_sum_grad/Select
2022-03-22 14:10:30.794027: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C18/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794205: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C18/weighted_sum_grad/Select
2022-03-22 14:10:30.794227: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C19/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794401: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C19/weighted_sum_grad/Select
2022-03-22 14:10:30.794422: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C2/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794596: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C2/weighted_sum_grad/Select
2022-03-22 14:10:30.794618: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C20/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794793: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C20/weighted_sum_grad/Select
2022-03-22 14:10:30.794813: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C21/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.794988: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C21/weighted_sum_grad/Select
2022-03-22 14:10:30.795011: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C22/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795189: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C22/weighted_sum_grad/Select
2022-03-22 14:10:30.795210: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C23/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795385: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C23/weighted_sum_grad/Select
2022-03-22 14:10:30.795407: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C24/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795582: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C24/weighted_sum_grad/Select
2022-03-22 14:10:30.795602: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C25/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795778: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C25/weighted_sum_grad/Select
2022-03-22 14:10:30.795799: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C26/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.795974: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C26/weighted_sum_grad/Select
2022-03-22 14:10:30.795999: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C3/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796178: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C3/weighted_sum_grad/Select
2022-03-22 14:10:30.796200: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C4/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796375: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C4/weighted_sum_grad/Select
2022-03-22 14:10:30.796396: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C5/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796572: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C5/weighted_sum_grad/Select
2022-03-22 14:10:30.796593: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C6/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796768: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C6/weighted_sum_grad/Select
2022-03-22 14:10:30.796790: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C7/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.796966: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C7/weighted_sum_grad/Select
2022-03-22 14:10:30.796991: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C8/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.797170: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C8/weighted_sum_grad/Select
2022-03-22 14:10:30.797192: I ./tensorflow/core/graph/template_select_pruning_base.h:69] Found match op by select_pruning_then_const head/gradients_1/linear/linear_model_1/linear_model/C9/weighted_sum_grad/zeros_like
2022-03-22 14:10:30.797368: I ./tensorflow/core/graph/template_select_pruning_base.h:75] remove node: head/gradients_1/linear/linear_model_1/linear_model/C9/weighted_sum_grad/Select
2022-03-22 14:10:30.797492: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C10_embedding/C10_embedding_weights]
2022-03-22 14:10:30.797542: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C11_embedding/C11_embedding_weights]
2022-03-22 14:10:30.797564: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C12_embedding/C12_embedding_weights]
2022-03-22 14:10:30.797583: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C13_embedding/C13_embedding_weights]
2022-03-22 14:10:30.797602: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C14_embedding/C14_embedding_weights]
2022-03-22 14:10:30.797621: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C15_embedding/C15_embedding_weights]
2022-03-22 14:10:30.797645: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C16_embedding/C16_embedding_weights]
2022-03-22 14:10:30.797664: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C17_embedding/C17_embedding_weights]
2022-03-22 14:10:30.797682: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C18_embedding/C18_embedding_weights]
2022-03-22 14:10:30.797701: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C19_embedding/C19_embedding_weights]
2022-03-22 14:10:30.797720: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C1_embedding/C1_embedding_weights]
2022-03-22 14:10:30.797739: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C20_embedding/C20_embedding_weights]
2022-03-22 14:10:30.797757: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C21_embedding/C21_embedding_weights]
2022-03-22 14:10:30.797776: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C22_embedding/C22_embedding_weights]
2022-03-22 14:10:30.797794: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C23_embedding/C23_embedding_weights]
2022-03-22 14:10:30.797813: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C24_embedding/C24_embedding_weights]
2022-03-22 14:10:30.797832: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C25_embedding/C25_embedding_weights]
2022-03-22 14:10:30.797850: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C26_embedding/C26_embedding_weights]
2022-03-22 14:10:30.797868: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C2_embedding/C2_embedding_weights]
2022-03-22 14:10:30.797886: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C3_embedding/C3_embedding_weights]
2022-03-22 14:10:30.797905: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C4_embedding/C4_embedding_weights]
2022-03-22 14:10:30.797923: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C5_embedding/C5_embedding_weights]
2022-03-22 14:10:30.797940: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C6_embedding/C6_embedding_weights]
2022-03-22 14:10:30.797958: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C7_embedding/C7_embedding_weights]
2022-03-22 14:10:30.797976: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C8_embedding/C8_embedding_weights]
2022-03-22 14:10:30.797994: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[dnn/input_from_feature_columns/input_layer/C9_embedding/C9_embedding_weights]
2022-03-22 14:10:30.798035: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C1/weighted_sum]
2022-03-22 14:10:30.798054: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C10/weighted_sum]
2022-03-22 14:10:30.798072: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C11/weighted_sum]
2022-03-22 14:10:30.798091: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C12/weighted_sum]
2022-03-22 14:10:30.798113: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C13/weighted_sum]
2022-03-22 14:10:30.798132: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C14/weighted_sum]
2022-03-22 14:10:30.798150: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C15/weighted_sum]
2022-03-22 14:10:30.798169: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C16/weighted_sum]
2022-03-22 14:10:30.798188: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C17/weighted_sum]
2022-03-22 14:10:30.798206: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C18/weighted_sum]
2022-03-22 14:10:30.798223: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C19/weighted_sum]
2022-03-22 14:10:30.798242: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C2/weighted_sum]
2022-03-22 14:10:30.798261: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C20/weighted_sum]
2022-03-22 14:10:30.798279: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C21/weighted_sum]
2022-03-22 14:10:30.798297: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C22/weighted_sum]
2022-03-22 14:10:30.798316: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C23/weighted_sum]
2022-03-22 14:10:30.798334: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C24/weighted_sum]
2022-03-22 14:10:30.798353: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C25/weighted_sum]
2022-03-22 14:10:30.798371: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C26/weighted_sum]
2022-03-22 14:10:30.798389: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C3/weighted_sum]
2022-03-22 14:10:30.798408: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C4/weighted_sum]
2022-03-22 14:10:30.798426: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C5/weighted_sum]
2022-03-22 14:10:30.798445: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C6/weighted_sum]
2022-03-22 14:10:30.798467: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C7/weighted_sum]
2022-03-22 14:10:30.798485: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C8/weighted_sum]
2022-03-22 14:10:30.798503: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar] match op[linear/linear_model_1/linear_model/C9/weighted_sum]
2022-03-22 14:10:30.798945: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar] match op[head/loss/xentropy/Select]
2022-03-22 14:10:30.799421: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_grad/Select]
2022-03-22 14:10:30.799453: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select]
2022-03-22 14:10:30.799528: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients_1/head/loss/xentropy/Select_grad/Select]
2022-03-22 14:10:30.799546: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_else_scalar_in_grad] match op[head/gradients_1/head/loss/xentropy/Select_1_grad/Select]
2022-03-22 14:10:30.799958: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/head/loss/xentropy/Select_1_grad/Select_1]
2022-03-22 14:10:30.799996: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C10_embedding/C10_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800013: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C11_embedding/C11_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800028: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C12_embedding/C12_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800043: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C13_embedding/C13_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800057: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C14_embedding/C14_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800072: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C15_embedding/C15_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800087: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C16_embedding/C16_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800101: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C17_embedding/C17_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800120: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C18_embedding/C18_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800136: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C19_embedding/C19_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800155: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C1_embedding/C1_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800169: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C20_embedding/C20_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800184: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C21_embedding/C21_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800199: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C22_embedding/C22_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800213: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C23_embedding/C23_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800228: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C24_embedding/C24_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800242: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C25_embedding/C25_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800256: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C26_embedding/C26_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800271: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C2_embedding/C2_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800286: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C3_embedding/C3_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800300: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C4_embedding/C4_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800314: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C5_embedding/C5_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800329: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C6_embedding/C6_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800344: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C7_embedding/C7_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800358: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C8_embedding/C8_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800373: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients/dnn/input_from_feature_columns/input_layer/C9_embedding/C9_embedding_weights_grad/Select_1]
2022-03-22 14:10:30.800428: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/head/loss/xentropy/Select_1_grad/Select_1]
2022-03-22 14:10:30.800455: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C1/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800471: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C10/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800493: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C11/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800508: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C12/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800523: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C13/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800538: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C14/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800552: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C15/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800567: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C16/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800582: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C17/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800596: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C18/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800610: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C19/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800625: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C2/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800639: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C20/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800654: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C21/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800670: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C22/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800685: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C23/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800699: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C24/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800714: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C25/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800728: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C26/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800745: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C3/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800760: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C4/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800774: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C5/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800790: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C6/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800804: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C7/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800819: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C8/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.800833: I ./tensorflow/core/graph/template_select_base.h:36] Fusion template[select_then_scalar_in_grad] match op[head/gradients_1/linear/linear_model_1/linear_model/C9/weighted_sum_grad/Select_1]
2022-03-22 14:10:30.919577: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:10:30.941287: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
2022-03-22 14:11:15.303107: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:11:15.324636: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
INFO:tensorflow:Running local_init_op.
2022-03-22 14:11:23.962906: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:11:23.985236: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
INFO:tensorflow:Done running local_init_op.
2022-03-22 14:11:32.103417: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:11:32.125529: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
2022-03-22 14:11:40.871891: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1' has missing fanin 'head/gradients/head/loss/xentropy/Select_grad/Select_1'.
2022-03-22 14:11:40.894088: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:583] function_optimizer failed: Invalid argument: Node 'head/gradients/head/loss/xentropy/Select_grad/tuple/control_dependency_1': Unknown input node 'head/gradients/head/loss/xentropy/Select_grad/Select_1'
Using TensorFlow version 1.15.5
Checking dataset...
Numbers of training dataset is 8000000
Numbers of test dataset is 2000000
The training steps is 100
The testing steps is 3907
Saving model checkpoints to ./result/model_WIDE_AND_DEEP_1647929426
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected begin[0] == 0 (got 1) and size[0] == 0 (got -1) when input.dim_size(0) == 0
[[{{node linear/linear_model_1/linear_model/C20/weighted_sum/Slice_2}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train_rebuild.py", line 737, in <module>
main()
File "train_rebuild.py", line 537, in main
checkpoint_dir, tf_config, server)
File "train_rebuild.py", line 414, in train
sess.run([model.loss, model.train_op])
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
run_metadata=run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
run_metadata=run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
raise six.reraise(*original_exc_info)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 719, in reraise
raise value
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
return self._sess.run(*args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
run_metadata=run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
return self._sess.run(*args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expected begin[0] == 0 (got 1) and size[0] == 0 (got -1) when input.dim_size(0) == 0
[[node linear/linear_model_1/linear_model/C20/weighted_sum/Slice_2 (defined at /home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'linear/linear_model_1/linear_model/C20/weighted_sum/Slice_2':
File "train_rebuild.py", line 737, in <module>
main()
File "train_rebuild.py", line 517, in main
dense_layer_partitioner=dense_layer_partitioner)
File "train_rebuild.py", line 116, in __init__
self._create_model()
File "train_rebuild.py", line 187, in _create_model
trainable=True)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 504, in linear_model
retval = linear_model_layer(features) # pylint: disable=not-callable
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 871, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
return converted_call(f, options, args, kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
return f(*args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 696, in call
weighted_sum = layer(builder)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 564, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 915, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 588, in call
weight_var=self._weight_var)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 1938, in _create_weighted_sum
weight_var=weight_var)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/feature_column/feature_column.py", line 2081, in _create_categorical_column_weighted_sum
name='weighted_sum')
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/embedding_ops.py", line 1338, in safe_embedding_lookup_sparse
array_ops.slice(array_ops.shape(result), [1], [-1])
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py", line 855, in slice
return gen_array_ops._slice(input_, begin, size, name=name)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 9272, in _slice
"Slice", input=input, begin=begin, size=size, name=name)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template
System information
Describe the problem
Building DeepRec fails with gcc 8.3.1 because it triggers a compiler bug in gcc 8.3.1. The error is as follows:
unique_ali_op_ut.h:498:77: internal compiler error: in is_normal_capture_proxy, at cp/lambda.c:292
Provide the exact sequence of commands / steps that you executed before running into the problem
Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
Using the pmem allocator with the WDL model, in either libpmem or memkind mode, causes the following abort:
"./tensorflow/core/framework/embedding/value_ptr.h:273] Unsupport FreqCounter in subclass of ValuePtrBase
Aborted (core dumped)"
Here is the call stack:
#3 0x00001464e19d0f4e in tensorflow::ValuePtr::AddFreq (this=)
at ./tensorflow/core/framework/embedding/value_ptr.h:273
#4 0x00001464e19d6566 in tensorflow::NullableFilter<long long, float, tensorflow::EmbeddingVar<long long, float> >::LookupOrCreateWithFreq (this=0x145fb0105c90, key=, val=0x14609c00cac0, default_value_ptr=)
at ./tensorflow/core/framework/embedding/embedding_filter.h:526
#5 0x00001464e19c35cc in std::function<void (long long, float*, float*)>::operator()(long long, float*, float*) const (
__args#2=, __args#1=, __args#0=, this=0x146100083cb8)
at /usr/include/c++/7/bits/std_function.h:706
#6 tensorflow::KvResourceGatherOp<long long, float>::Compute(tensorflow::OpKernelContext*)::{lambda(long long, long long)#4}::operator()(long long, long long) const (limit=4, start=, __closure=0x146100083c80)
at tensorflow/core/kernels/kv_variable_ops.cc:413
#7 std::_Function_handler<void (long long, long long), tensorflow::KvResourceGatherOp<long long, float>::Compute(tensorflow::OpKernelContext*)::{lambda(long long, long long)#4}>::_M_invoke(std::_Any_data const&, long long&&, std::_Any_data const&) (
__functor=..., __args#0=, __args#1=) at /usr/include/c++/7/bits/std_function.h:316
#8 0x00001464d9948f1e in std::_Function_handler<void (long, long), tensorflow::thread::ThreadPool::ParallelFor(long long, long long, std::function<void (long long, long long)>)::{lambda(long, long)#1}>::_M_invoke(std::_Any_data const&, long&&, std::_Any_data const&) () from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#9 0x00001464d994f48f in tensorflow::thread::ThreadPool::ParallelFor(long long, long long, std::function<void (long long, long long)>) () from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#10 0x00001464d971fb52 in tensorflow::Shard(int, tensorflow::thread::ThreadPool*, long long, long long, std::function<void (long long, long long)>) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#11 0x00001464e19dce74 in tensorflow::KvResourceGatherOp<long long, float>::Compute (this=0x560ce5050590, c=)
at tensorflow/core/kernels/kv_variable_ops.cc:427
#12 0x00001464d98766a6 in tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::BatchProcess(std::vector<tensorflow::PropagatorState::TaggedNode, std::allocatortensorflow::PropagatorState::TaggedNode >, int, long) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#13 0x00001464d9876a88 in tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::Process(tensorflow::PropagatorState::TaggedNode, long) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#14 0x00001464d9876b5f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::RunTask<tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::ScheduleReady(absl::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocatortensorflow::PropagatorState::TaggedNode >, tensorflow::PropagatorState::TaggedNodeReadyQueue)::{lambda()#1}>(tensorflow::(anonymous namespace)::ExecutorStatetensorflow::PropagatorState::ScheduleReady(absl::InlinedVector<tensorflow::PropagatorState::TaggedNode, 8ul, std::allocatortensorflow::PropagatorState::TaggedNode >, tensorflow::PropagatorState::TaggedNodeReadyQueue)::{lambda()#1}&&)::{lambda()#1}>::_M_invoke(std::_Any_data const&)
() from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#15 0x00001464d994bb4f in std::_Function_handler<void (), Eigen::ThreadPoolTempltensorflow::thread::EigenEnvironment::ThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#16 0x00001464d9948f78 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/zshan/deeprec-env/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.1
#17 0x00001464d83a9ba3 in execute_native_thread_routine () from /lib64/libstdc++.so.6
#18 0x0000146577a1a17a in start_thread () from /lib64/libpthread.so.0
#19 0x0000146576fbfdc3 in clone () from /lib64/libc.so.6
This is a basic-level project of ASoC 2022 (see #231).
The DeepRec processor is developed in C++. Users have their own serving frameworks, which may be written in different languages such as Java, Go, and C++. We need to provide access examples in each of these languages so that users can quickly connect to the DeepRec processor.
Basic
@JackMoriarty [email protected]
Proficiency in C++ and Python;
Get to know DeepRec;
Able to complete the development under the guidance of the mentor;
Have a certain understanding and interest in deep learning recommendation engines;
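This is a basic-level project of Alibaba Summer of Code 2022 (see #231).
Processor, the online serving module provided by DeepRec, is developed in C++. Users have their own serving frameworks written in different languages, such as Java, Go, and C++, so we need to provide access examples in each language to help users quickly integrate with the DeepRec processor.
1) Implement access examples in multiple languages.
2) Write a best-practices document.
Basic
@JackMoriarty [email protected]
Proficiency in C++ and Python;
Able to become familiar with and understand the relevant code under the guidance of the mentor;
Get to know DeepRec;
Have a certain understanding of and interest in deep learning recommendation engines;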
When saving a checkpoint, the following error occurs:
F ./tensorflow/core/framework/embedding/value_ptr.h:256] Unsupport GlobalStep in subclass of ValuePtrBase
I also found that the checkpoint on disk is a temporary file: best_checkpoint/best.data-00000-of-00001.tempstate11898667549733680686
https://deeprec.readthedocs.io/zh/latest/StarServer.html#estimator
I tried to run PS distributed training with SeaStar servers according to the documentation above, but encountered an error: Load endpoint map from .endpoint_map failed.
It is unclear how to generate the endpoint_map; more detailed instructions in the documentation would help.
System information
Describe the problem
Building from source fails on an arm64 server when running the command:
bazel build -c opt --config=opt //tensorflow/tools/pip_package:build_pip_package
I only use the CPU version, and it builds successfully on an x86 machine, so the failure appears to be related to the platform.
I could not find an arm image either. Is there any plan to support arm?
This is an advanced topic for ASoC 2022; see #231.
At present, DeepRec cannot evaluate very large models (models that a single node cannot load). Such models require multiple PS to load them and multiple workers to perform distributed evaluation.
Advanced
Proficiency in C++ and Python;
Get to know DeepRec;
Able to complete the development under the guidance of the mentor;
Have a certain understanding and interest in deep learning recommendation engines;
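This is an advanced topic of Alibaba Summer of Code 2022; see #231.
DeepRec support for multi-evaluator evaluation: at present, DeepRec cannot evaluate very large models (models that a single node cannot load). Such models must be loaded across multiple PS, with multiple workers performing distributed evaluation.
1) Support loading very large models across multiple PS to implement evaluation.
2) Support using multiple evaluator nodes for evaluation within one job.
Advanced
Proficiency in C++ and Python;
Able to become familiar with and understand the relevant code under the guidance of the mentor;
Familiarity with DeepRec;
Some understanding of and interest in deep learning recommendation engines;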
Motivation
Currently, DeepRec supports exporting models to checkpoints, but when the model weight file is large, model import and export performance suffers. Key-value NoSQL databases (such as LevelDB, Redis, and RocksDB) offer high performance, high scalability, and support for large data volumes. We add this feature to optimize model import and export performance while supporting the storage needs of more users.
Design
To achieve better import and export performance, we add new ops, which avoid repeated reading and writing of model files to disk by directly manipulating the database, thus reducing time overhead.
The overall design can be divided into three parts.
The first part is the implementation of a generic interface for persisting key-value data in a database, which is used to support persistence in a key-value database.
The second part is to add an op implementation in the op kernel to import and export models. This op saves the Variable/EmbeddingVariable values in memory directly to the database through database calls or loads the models directly from the database.
The third part is to add the op in the process of building the graph.
In the traditional checkpoint saving method, the BundleEntryProto storage format is used to map entries to the file. In the database, we have simplified this step by adding key-value mappings such as node key lists. In addition, in distributed training, the PS is responsible for parameter updating. Except for StringJoin, save/ShardedFilename/shard, and save/num_shards, the ops in the saving process are executed on the PS, so the model saving process only needs to consider the PS side. When the data is too large, the save op can be placed on each device with the shared parameter, so the meta information from different devices needs to be merged to form a complete checkpoint, and we need to rewrite this process.
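As a rough illustration of the first part of the design (a generic key-value persistence interface plus a node key list), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names KVStore, InMemoryKVStore, KEY_LIST, save_variables, and load_variables are hypothetical, the dict-backed store only stands in for a real backend such as LevelDB, Redis, or RocksDB, and JSON is used purely to keep the example self-contained. It is not the proposed op implementation.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class KVStore(ABC):
    """Hypothetical generic interface for a key-value backend."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryKVStore(KVStore):
    """Dict-backed stand-in for a real database (assumption for the sketch)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


# Hypothetical meta key that holds the list of saved node keys.
KEY_LIST = "__node_keys__"


def save_variables(store: KVStore, checkpoint: str,
                   variables: Dict[str, List[float]]) -> None:
    """Write each variable under its own key, then write the node key list."""
    for name, values in variables.items():
        store.put(f"{checkpoint}/{name}", json.dumps(values).encode("utf-8"))
    names = sorted(variables)
    store.put(f"{checkpoint}/{KEY_LIST}", json.dumps(names).encode("utf-8"))


def load_variables(store: KVStore, checkpoint: str) -> Dict[str, List[float]]:
    """Read the node key list first, then fetch every listed variable."""
    raw = store.get(f"{checkpoint}/{KEY_LIST}")
    if raw is None:
        raise KeyError(f"no checkpoint named {checkpoint!r}")
    restored = {}
    for name in json.loads(raw.decode("utf-8")):
        value = store.get(f"{checkpoint}/{name}")
        if value is None:
            raise KeyError(f"missing value for {name!r}")
        restored[name] = json.loads(value.decode("utf-8"))
    return restored


store = InMemoryKVStore()
save_variables(store, "ckpt-1", {"w": [1.0, 2.0], "b": [0.5]})
print(load_variables(store, "ckpt-1"))
```

Keeping the key list under a separate meta key means a reader never has to scan the whole database to find which entries belong to a checkpoint. Merging meta information from several devices would then amount to merging their key lists, though how the real ops do this is not specified here.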
Additional.
To facilitate the user to view the parameters, we also plan to implement a file viewer that can view the Variable/EmbeddingVariable values and support searching for the values.
After enabling smartstaged in DSSM, DIN, and DIEN, something goes wrong: after waiting a few minutes, the log shows Prefetching was ignored since timeout.
The whl package in the Docker image is built from commit 8db8689.
Code to reproduce the issue
The code and the DeepRec environment are provided in this Docker image.
docker pull cesg-prc-registry.cn-beijing.cr.aliyuncs.com/cesg-ali/deeprec-modelzoo:220412-8db8689
Run the Python script for DSSM, DIN, or DIEN to reproduce this issue:
cd /root/modelzoo/$MODEL
python train.py --steps 1000 --emb_fusion false --smartstaged true
If --smartstaged is set to false, training works fine.
Other info / logs
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_DSSM_1649818801/model.ckpt.
2022-04-13 11:00:07.986332: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
2022-04-13 11:00:08.391247: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:00:08.458564: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:00:10.967349: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:00:10.985142: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:00:10.995642: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:00:11.671530: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:00:11.677987: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
INFO:tensorflow:loss = 168.8434, steps = 1
2022-04-13 11:00:12.369923: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:00:14.866723: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:00:14.887854: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:00:14.897801: I ./tensorflow/core/common_runtime/kernel_stat.h:74] User collect node stats, start_step is 100, stop_step is 200
2022-04-13 11:05:14.881883: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.882675: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.882799: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.884134: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.884259: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.884944: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.886524: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:05:14.888192: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.901689: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.901695: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.901945: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.902099: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.902229: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.902321: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.902579: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:10:14.907505: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.909300: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.913124: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.914027: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.914081: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.914123: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.914326: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
2022-04-13 11:15:14.919999: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.
^CKilled
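In the log above, the timeout warnings arrive in bursts about five minutes apart (11:05:14, 11:10:14, 11:15:14). For longer logs, a small stdlib-only script can make that pattern easier to see. The sketch below is only an assumption about how one might summarize such a log: count_timeouts_per_minute is a hypothetical helper, and the regular expression is written against the line format shown above.

```python
import re
from collections import Counter

# Matches warning lines such as:
# 2022-04-13 11:05:14.881883: W ./tensorflow/...] Prefetching was ignored since timeout.
TIMEOUT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+: W "
    r".*Prefetching was ignored since timeout"
)


def count_timeouts_per_minute(lines):
    """Return {"YYYY-MM-DD HH:MM": count} for prefetch timeout warnings."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


sample = [
    "INFO:tensorflow:loss = 168.8434, steps = 1",
    "2022-04-13 11:05:14.881883: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.",
    "2022-04-13 11:05:14.882675: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.",
    "2022-04-13 11:10:14.901689: W ./tensorflow/core/kernels/data_buffer_ops.h:91] Prefetching was ignored since timeout.",
]
for minute, count in sorted(count_timeouts_per_minute(sample).items()):
    print(minute, count)
```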
Describe the current behavior
Every time a new checkpoint is restored, it triggers a memory leak in TFServing (the model uses EmbeddingVariable).
Describe the expected behavior
Memory consumption does not increase when a new checkpoint is restored.
from tensorflow.python import pywrap_tensorflow

# latest_checkpoint is the path prefix of the checkpoint to inspect.
reader = pywrap_tensorflow.NewCheckpointReader(latest_checkpoint)
var_to_shape_map = reader.get_variable_to_shape_map()
for key in var_to_shape_map:
    print(reader.get_tensor(key))
I want to export the values of embedding variables. This works when I test it in nvtf, but in DeepRec the value is [], an empty list.
[1] Invalid argument: Trying to access resource linear/linear_model/C1/weights/part_0 located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0
[2] 2022-06-07 09:49:01.768708: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at resource_variable_ops.cc:400 : Invalid argument: Trying to access resource linear/linear_model/C12/weights/part_0 located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0
System information
Describe the current behavior
Incremental save and restore fails if any resource variable is used.
Describe the expected behavior
Code to reproduce the issue
import tensorflow as tf
tf.Variable(0, use_resource=True)
saver = tf.train.Saver(
save_relative_paths=True,
incremental_save_restore=True,
)
Other info / logs
Traceback (most recent call last):
File "iem_dlc/__main__.py", line 20, in <module>
incremental_save_restore=True,
File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1388, in __init__
self.build()
File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1404, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1486, in _build
build_save=build_save, build_restore=build_restore)
File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1053, in _build_internal
save_tensor = self._AddSaveOps(filename_tensor, saveables)
File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 511, in _AddSaveOps
tensor_names.append(self._GetTensorNameAndIsSparse(spec, saveable)[0])
File "/worker/venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 360, in _GetTensorNameAndIsSparse
save_incr_sparse = saveable.op.op._is_sparse and self._incremental_include_normal_var
AttributeError: 'Operation' object has no attribute '_is_sparse'
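The traceback shows saver.py reading saveable.op.op._is_sparse on an Operation that never had that attribute set. As a generic Python illustration of how such a lookup can be made tolerant of a missing attribute, here is a minimal sketch. PlainOp, SparseOp, and is_sparse are hypothetical stand-ins, not DeepRec classes, and this is not necessarily how the project resolves the bug.

```python
class PlainOp:
    """Stand-in for an Operation that never had `_is_sparse` set."""


class SparseOp:
    """Stand-in for an op that is explicitly marked as sparse."""
    _is_sparse = True


def is_sparse(op) -> bool:
    # getattr with a default avoids the AttributeError seen in the traceback
    # and treats ops without the marker as not sparse.
    return bool(getattr(op, "_is_sparse", False))


print(is_sparse(PlainOp()), is_sparse(SparseOp()))
```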
System information
Describe the current behavior
When DRAM_SSDHASH is used as the storage_type in StorageOption, the process dumps core when SeekToFirst in SSDIterator is called.
Describe the expected behavior
Code to reproduce the issue
Other info / logs
Does DeepRec support TensorFlow Serving, or should I build a serving image to serve trained models? Is it compatible with the TensorFlow Java API? Thanks!
Describe the current behavior
File "/root/workspace/rec-rank-train/vmax/estimator/estimator_v2.py", line 122, in export_big_model
self.estimator_core.export_big_model(server, checkpoint_path=checkpoint_path)
File "/root/workspace/rec-rank-train/vmax/core/estimator_core_v2.py", line 415, in export_big_model
tf.train.import_meta_graph(meta_graph_or_file='/tmp/saved_model/tmp.meta')
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1695, in import_meta_graph
return _import_meta_graph_with_return_elements(meta_graph_or_file,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1723, in _import_meta_graph_with_return_elements
saver = _create_saver_from_imported_meta_graph(meta_graph_def, import_scope,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1744, in _create_saver_from_imported_meta_graph
return Saver()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1033, in __init__
self.build()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1045, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 1112, in _build
self.saver_def = self._builder._build_internal( # pylint: disable=protected-access
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 656, in _build_internal
restore_op = self._AddRestoreOps(filename_tensor, saveables,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saver.py", line 491, in _AddRestoreOps
assign_ops.append(saveable.restore(saveable_tensors, shapes))
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/saving/saveable_object_util.py", line 185, in restore
with ops.control_dependencies(None if self.var._is_primary else [self.var._primary.initializer]):
AttributeError: 'EmbeddingVariable' object has no attribute '_is_primary'
Code to reproduce the issue
meta_graph_def = tf.train.export_meta_graph()
meta_graph_def.meta_info_def.meta_graph_version = str(int(time.time()))
self.logger.info('meta_graph_version = %s' %
                 meta_graph_def.meta_info_def.meta_graph_version)
tf.reset_default_graph()
tf.train.import_meta_graph(meta_graph_def)
and
meta_graph_def = tf.train.export_meta_graph(filename='/tmp/saved_model/tmp.meta')
meta_graph_def.meta_info_def.meta_graph_version = str(int(time.time()))
self.logger.info('meta_graph_version = %s' %
                 meta_graph_def.meta_info_def.meta_graph_version)
tf.reset_default_graph()
tf.train.import_meta_graph(meta_graph_or_file='/tmp/saved_model/tmp.meta')
I want to enable the smartstaged feature in WDL. I followed the steps in the DeepRec docs, but I get an error.
Code to reproduce the issue
I use the following code to enable smartstaged. For the complete script, see Full code.
next_element = tf.staged(next_element, num_threads=8, capacity=40)
sess_config.graph_options.optimizer_options.do_smart_stage = True
hooks.append(tf.make_prefetch_hook())
Running python train.py --steps 1000 --smartstaged True
reproduces the error. The WDL dataset is used.
logs
INFO:tensorflow:run without loading checkpoint
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_WIDE_AND_DEEP_1647592077/model.ckpt.
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
2022-03-18 16:28:14.639138: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:
Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
2022-03-18 16:28:14.783644: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-0:
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:
Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
2022-03-18 16:28:14.871604: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-2:
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:
Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
2022-03-18 16:28:14.975041: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-1:
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:
Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
2022-03-18 16:28:15.079552: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-5:
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:
Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
2022-03-18 16:28:15.183314: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-6:
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:
Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
2022-03-18 16:28:15.288156: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-3:
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:
Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
2022-03-18 16:28:15.391508: E tensorflow/core/framework/op_segment.cc:54] Create kernel failed: Invalid argument: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
Exception in thread PrefetchThread-PrefetchRunner-4:
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
ERROR:tensorflow:Prefetching was cancelled unexpectedly:
Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
Exception in thread PrefetchThread-PrefetchRunner-7:
Traceback (most recent call last):
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
INFO:tensorflow:loss = 0.6654865, steps = 1
INFO:tensorflow:Saving checkpoints for 1 into ./result/model_WIDE_AND_DEEP_1647592077/model.ckpt.
Using TensorFlow version 1.15.5
Checking dataset...
Numbers of training dataset is 8000000
Numbers of test dataset is 2000000
The training steps is 15625
The testing steps is 3907
Saving model checkpoints to ./result/model_WIDE_AND_DEEP_1647592077
Enable smart staged feature of DeepRec.
Traceback (most recent call last):
File "train_rebuild.py", line 673, in <module>
main()
File "train_rebuild.py", line 495, in main
checkpoint_dir, tf_config, server)
File "train_rebuild.py", line 375, in train
sess.run([model.loss, model.train_op])
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 911, in __exit__
self._close_internal(exception_type)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 949, in _close_internal
self._sess.close()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1216, in close
self._sess.close()
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1384, in close
ignore_live_threads=True)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 718, in reraise
raise value.with_traceback(tb)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/ops/prefetch_runner.py", line 236, in run
run_fetch(*feed)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1287, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Length for attr 'dtypes' of 0 must be at least minimum 1
; NodeDef: {{node prefetch_2/DataBufferPut}}; Op<name=DataBufferPut; signature=record: -> ; attr=container:string,default=""; attr=dtypes:list(type),min=1; attr=shared_name:string,default=""; attr=shared_capacity:int,default=1,min=1; attr=timeout_millis:int,default=1000,min=1; is_stateful=true>
[[prefetch_2/DataBufferPut]]
This is a basic subject of ASoC 2022; see #231.
DeepRec's ModelZoo contains 6 models, but currently only training code is provided for them. Please add inference code for these models, optimize their inference performance, and summarize the performance results.
Basic
Proficiency in C++ and Python;
Familiarity with DeepRec;
Ability to complete the development under the guidance of a mentor;
Some understanding of, and interest in, deep learning recommendation engines;
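For the performance summary requested in the task above, a small timing harness may be a useful starting point. The following is only a minimal sketch using the Python standard library: `benchmark_inference` and `predict_fn` are hypothetical names, not part of DeepRec or ModelZoo, and `predict_fn` stands in for whatever callable runs one batch through an exported model.

```python
import math
import statistics
import time


def benchmark_inference(predict_fn, batches, warmup=2):
    """Time predict_fn on every batch and summarize latency and throughput.

    predict_fn is a hypothetical stand-in for a model's inference call;
    batches is a list of batches, each batch being a sized sequence of examples.
    """
    if not batches:
        raise ValueError("batches must not be empty")

    # Warm-up calls are not timed, so one-off setup cost does not skew results.
    for batch in batches[:warmup]:
        predict_fn(batch)

    latencies_ms = []
    examples = 0
    for batch in batches:
        start = time.perf_counter()
        predict_fn(batch)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        examples += len(batch)

    latencies_ms.sort()
    total_s = sum(latencies_ms) / 1000.0

    def percentile(p):
        # Nearest-rank percentile over the sorted latencies.
        rank = max(1, math.ceil(p / 100.0 * len(latencies_ms)))
        return latencies_ms[rank - 1]

    return {
        "batches": len(latencies_ms),
        "examples": examples,
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": percentile(50),
        "p99_ms": percentile(99),
        "examples_per_sec": examples / total_s if total_s > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Placeholder "model": sums each row. Replace with a real inference call.
    fake_predict = lambda batch: [sum(row) for row in batch]
    data = [[[1.0, 2.0]] * 4 for _ in range(10)]
    print(benchmark_inference(fake_predict, data))
```

Reporting p50 and p99 alongside the mean keeps tail latency visible, which usually matters more than the average for online recommendation serving.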
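This is a basic subject of Alibaba Summer of Code 2022; see #231.
The 6 models in DeepRec's ModelZoo currently cannot be exported as SavedModel, so training and inference cannot be connected directly. Please complete these models and test the full training-to-inference pipeline.
1) Implement an inference use case for each of the 6 models in ModelZoo.
2) Optimize the inference performance of the ModelZoo models and summarize the results in a performance document.
Basic
Proficiency in C++ and Python;
Ability to become familiar with and understand the relevant code under the guidance of a mentor;
Familiarity with DeepRec;
Some understanding of, and interest in, deep learning recommendation engines;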