Thanks for your work on the GNN and putting on a big competition. I tried your work on

cannot get the expected outputs about gnnetworkingchallenge HOT 7 CLOSED

Yujun1212 commented on July 18, 2024

cannot get the expected outputs

from gnnetworkingchallenge.

Comments (7)

MiquelFerriol commented on July 18, 2024

First of all, thank you for your interest in our work and our challenge!

Regarding the error, it looks like a very generic TensorFlow-related error. It could be caused by a TensorFlow internal error or by an error in the iGNNition framework implementation. To reproduce it, could you please specify the following?

Operating System
Python Version (looks like you are using 3.7)
Tensorflow Version
CUDA Version
Example you are trying to run
Does the example run correctly without the GPU flag enabled?
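
One possible way to gather most of these details in a single step is sketched below. It is not part of the challenge code: the helper name `collect_env_report` is hypothetical, and the sketch assumes `tf.sysconfig.get_build_info()` may be missing on older TensorFlow releases, in which case the CUDA version is reported as "unknown".

```python
import platform
import sys


def collect_env_report():
    """Return a dict with the environment details requested above.

    Hypothetical helper for illustration; not part of the challenge code.
    """
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "tensorflow": "not installed",
        "cuda": "unknown",
        "visible_gpus": "unknown",
    }
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not importable; report only OS and Python details.
        return report

    report["tensorflow"] = tf.__version__
    # get_build_info() is only present in newer TF 2.x releases, so guard it.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    if get_build_info is not None:
        report["cuda"] = str(get_build_info().get("cuda_version", "unknown"))
    report["visible_gpus"] = len(tf.config.list_physical_devices("GPU"))
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue covers the operating system, Python, TensorFlow and (where available) CUDA entries of the list; the example being run and the result without the GPU flag still have to be described by hand.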


Yujun1212 commented on July 18, 2024

The system I am using is Ubuntu 18.04, with Python 3.7, tensorflow-gpu==2.1.0 and CUDA 10.1. The example I'm trying to run is the TensorFlow Baseline of 2020. I tried running it without the GPU, but the same errors still appeared. I don't know whether you could take a look at my logs when you have some free time; I realize this may be a presumptuous request.

Thanks a lot!


MiquelFerriol commented on July 18, 2024

Try removing your current TensorFlow installation and installing it directly with:
pip install tensorflow==2.1.0 (this package already includes GPU support, so tensorflow-gpu is not needed)
The tensorflow-gpu package seems to have some problems with the XLA instructions (which seems to be your issue).
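
Before reinstalling, it can help to confirm which TensorFlow distributions are actually present in the environment. The following is a minimal sketch, assuming Python 3.8 or newer for `importlib.metadata` (on Python 3.7 the `importlib_metadata` backport would be needed instead); the function name `installed_tf_packages` is hypothetical.

```python
from importlib import metadata  # Python 3.8+; assumption, see the note above


def installed_tf_packages(names=("tensorflow", "tensorflow-gpu")):
    """Map each installed distribution name in `names` to its version."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            continue
    return found


if __name__ == "__main__":
    packages = installed_tf_packages()
    if "tensorflow-gpu" in packages:
        print("tensorflow-gpu is still installed:", packages["tensorflow-gpu"])
    print("Installed TensorFlow distributions:", packages)
```

If `tensorflow-gpu` still shows up after the reinstall, the old package was not removed from the environment that `main.py` runs in.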

If this does not solve the problem, please upload the log file and I will take a look at it.


Yujun1212 commented on July 18, 2024

Thank you very much for your reply! I tried tensorflow==2.1.0, but the errors persisted. Here are the logs:

(test) yj@DL:/data/yj/Documents/2020test/code$ python3 main.py
2021-06-15 16:39:39.876851: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2021-06-15 16:39:39.876954: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-10.1/lib64
2021-06-15 16:39:39.876968: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
INFO:tensorflow:Using config: {'_model_dir': '../logs/model_log', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
WARNING:tensorflow:From /data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2021-06-15 16:39:40.821831: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-06-15 16:39:40.854148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:02:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2021-06-15 16:39:40.855742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 1 with properties:
pciBusID: 0000:82:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
2021-06-15 16:39:40.856123: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-06-15 16:39:40.858860: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-06-15 16:39:40.860338: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-06-15 16:39:40.860725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-06-15 16:39:40.862982: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-06-15 16:39:40.864523: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-06-15 16:39:40.869364: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-06-15 16:39:40.872986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0, 1
INFO:tensorflow:Calling model_fn.
/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Traceback (most recent call last):
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 2326, in get_attr
c_api.TF_OperationGetAttrValueProto(self._c_op, name, buf)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Operation 'route_net_model/UnsortedSegmentSum_6' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/gradients_util.py", line 331, in _MaybeCompile
xla_compile = op.get_attr("_XlaCompile")
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 2330, in get_attr
raise ValueError(str(e))
ValueError: Operation 'route_net_model/UnsortedSegmentSum_6' has no attr named '_XlaCompile'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 184, in <module>
model_dir=config['DIRECTORIES']['logs'])
File "main.py", line 64, in train_and_evaluate
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/data/yj/Documents/2020test/code/routenet_model.py", line 256, in model_fn
grads = tf.gradients(total_loss, model.trainable_variables)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/gradients_impl.py", line 274, in gradients_v2
unconnected_gradients)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/gradients_util.py", line 336, in _MaybeCompile
return grad_fn() # Exit early
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/gradients_util.py", line 669, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py", line 476, in _UnsortedSegmentSumGrad
return _GatherDropNegatives(grad, op.inputs[1])[0], None, None
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py", line 444, in _GatherDropNegatives
dtype=is_positive_shape.dtype)],
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 2659, in ones
output = _constant_if_small(one, shape, dtype, name)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 2391, in _constant_if_small
if np.prod(shape) < 1000:
File "<__array_function__ internals>", line 6, in prod
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3031, in prod
keepdims=keepdims, initial=initial, where=where)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
File "/data/yj/yes/envs/test/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 728, in __array__
" array.".format(self.name))
NotImplementedError: Cannot convert a symbolic Tensor (gradients/route_net_model/UnsortedSegmentSum_6_grad/sub:0) to a numpy array.


MiquelFerriol commented on July 18, 2024

Everything looks fine to me...
I would now suggest two things:

Create a new virtual environment with TensorFlow 2.3 (I checked, and the code is compatible with it).
Make sure all the TF dependencies are satisfied. Judging by this line, it could be a problem with the NumPy installation:

Cannot convert a symbolic Tensor (gradients/route_net_model/UnsortedSegmentSum_6_grad/sub:0) to a numpy array.

As suggested here and here


Yujun1212 commented on July 18, 2024

Thanks a lot! Your advice solved my problem completely. I wish you every success!


MiquelFerriol commented on July 18, 2024

Closing this issue as the problem seems to be solved.
