nervanasystems / ngraph-tf

Bridge to connect nGraph with TensorFlow

License: Other

ngraph-tf's Introduction

DISCONTINUATION OF PROJECT

This project will no longer be maintained by Intel. Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project. Intel no longer accepts patches to this project.

Intel(R) nGraph(TM) Compiler and runtime for TensorFlow*

This repository has moved to the following location: https://github.com/tensorflow/ngraph-bridge.git. Please update your bookmarks.

About Intel(R) nGraph(TM)

See the full documentation here: http://ngraph.nervanasys.com/docs/latest

ngraph-tf's Issues

error while make install command

After I solved the problem with running cmake .., I ran into this error. Is there a solution to this problem?

Does ngraph optimize InceptionV3 topology?

Does ngraph optimize InceptionV3 topology?
For which topologies can we expect a performance improvement when using nGraph?

How to use resnet50 model in tf_cnn_benchmarks.py?

Hi,
I downloaded the SavedModel and checkpoints from tensorflow/models and untarred them into a single directory. But when I run this benchmark, it reports the following error:
NotFoundError (see above for traceback): Key v0/cg/affine0/biases not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

I also tried other models from other sites, but I get almost the same error.

What resnet model can work with this benchmark and how should I specify the data_dir parameter?

Thanks!
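
The NotFoundError above says that the graph asks for a variable name that the checkpoint does not contain. A minimal sketch for comparing the two name sets follows. `missing_keys` is a hypothetical helper, and every example name other than `v0/cg/affine0/biases` is made up for illustration. With TensorFlow installed, the checkpoint side could be filled from `tf.train.list_variables(checkpoint_dir)`, which returns (name, shape) pairs.

```python
def missing_keys(expected_names, checkpoint_names):
    """Return the expected variable names that are absent from the checkpoint, sorted."""
    return sorted(set(expected_names) - set(checkpoint_names))


# Names the benchmark graph tries to restore. The first one is taken from the
# error above; the second is a made-up companion for illustration.
expected = ["v0/cg/affine0/biases", "v0/cg/affine0/weights"]

# Hypothetical names as they might appear in a checkpoint from another code base.
in_checkpoint = ["resnet_model/dense/bias", "resnet_model/dense/kernel"]

for name in missing_keys(expected, in_checkpoint):
    print("not in checkpoint:", name)
```

If every expected name is reported as missing, the checkpoint was probably written by a model with a different variable naming scheme, not by a damaged download.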

Building from source code shows an error

I followed the instructions here.

(tf-test) ➜  ngraph-tf git:(master) ✗ python --version
Python 3.5.4
(tf-test) ➜  ngraph-tf git:(master) ✗ python build_ngtf.py 
ARTIFACTS location: /home/jaebaek/ngraph-tf/build_cmake/artifacts
Running virtualenv with interpreter /home/jaebaek/tf-test/bin/python3
Using base prefix '/usr'
New python executable in /home/jaebaek/ngraph-tf/build_cmake/venv-tf-py3/bin/python3
Not overwriting existing python script /home/jaebaek/ngraph-tf/build_cmake/venv-tf-py3/bin/python (you must use /home/jaebaek/ngraph-tf/build_cmake/venv-tf-py3/bin/python3)
Installing setuptools, pip, wheel...
done.
Loading virtual environment from: /home/jaebaek/ngraph-tf/build_cmake/venv-tf-py3
Traceback (most recent call last):
  File "build_ngtf.py", line 314, in <module>
    main()
  File "build_ngtf.py", line 141, in main
    load_venv(venv_dir)
  File "/home/jaebaek/ngraph-tf/tools/build_utils.py", line 110, in load_venv
    dict(__file__=activate_this_file), dict(__file__=activate_this_file))
  File "/home/jaebaek/ngraph-tf/build_cmake/venv-tf-py3/bin/activate_this.py", line 46, in <module>
    sys.path[:] = [i for i in new if i not in prev] + [i for i in new if i in prev]
  File "/home/jaebaek/ngraph-tf/build_cmake/venv-tf-py3/bin/activate_this.py", line 46, in <listcomp>
    sys.path[:] = [i for i in new if i not in prev] + [i for i in new if i in prev]
NameError: name 'prev' is not defined
(tf-test) ➜  ngraph-tf git:(master) ✗ /home/jaebaek/ngraph-tf/build_cmake/venv-tf-py3/bin/python3 --version
Python 3.5.4
(tf-test) ➜  ngraph-tf git:(master) ✗

No rule to make target 'ngraph/ngraph_dist/lib/libngraph.so'

When compiling, I got this error:

[ 17%] Built target ext_ngraph
[ 35%] Built target ext_gtest
[ 37%] Performing update step for 'ext_abseil'
[ 40%] Performing configure step for 'ext_abseil'
-- Configuring done
-- Generating done
-- Build files have been written to: /home/hzhangxyz/Laboratory/source/test_xla/ngraph-tf-0.7.0/build/third-party/abseil/build
[ 42%] Performing build step for 'ext_abseil'
[  2%] Built target absl_throw_delegate
[  4%] Built target absl_dynamic_annotations
[  6%] Built target absl_spinlock_wait
[ 13%] Built target absl_base
[ 15%] Built target absl_malloc_internal
[ 16%] Built target absl_algorithm
[ 18%] Built target absl_container
[ 20%] Built target test_instance_tracker_lib
[ 21%] Built target absl_stack_consumption
[ 26%] Built target absl_symbolize
[ 28%] Built target absl_leak_check
[ 42%] Built target absl_strings
[ 47%] Built target absl_stacktrace
[ 48%] Built target absl_int128
[ 60%] Built target absl_time
[ 62%] Built target absl_examine_stack
[ 69%] Built target absl_synchronization
[ 70%] Built target absl_failure_signal_handler
[ 71%] Built target absl_debugging
[ 73%] Built target absl_utility
[ 75%] Built target absl_numeric
[ 77%] Built target str_format_extension_internal
[ 79%] Built target absl_span
[ 85%] Built target str_format_internal
[ 86%] Built target absl_str_format
[ 89%] Built target absl_hash
[ 90%] Built target absl_memory
[ 92%] Built target absl_meta
[ 94%] Built target absl_bad_any_cast
[ 96%] Built target absl_any
[ 97%] Built target absl_bad_optional_access
[ 98%] Built target absl_optional
[100%] Built target absl_variant
[ 44%] No install step for 'ext_abseil'
[ 46%] Completed 'ext_abseil'
[ 53%] Built target ext_abseil
[ 60%] Built target ngraph_logger
make[2]: *** No rule to make target 'ngraph/ngraph_dist/lib/libngraph.so', needed by 'src/libngraph_bridge.so'.  Stop.
make[1]: *** [CMakeFiles/Makefile2:279: src/CMakeFiles/ngraph_bridge.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

I just downloaded v0.7.0.tar.gz, made a build directory, ran cmake .. and then make, and got this error.

environment: archlinux
cmake 3.12.4
gcc 8.2.1 20180831
python 3.7.1

Problems installing ngraph-tf on macOS Mojave

The documentation states that the build and installation instructions are identical for Ubuntu 16.04 and OS X.

Option 1:
pip install -U ngraph-tensorflow-bridge

produces the following error:

Could not find a version that satisfies the requirement graph-tensorflow-bridge

Option 2/3:
./bazel-0.16.0-installer-linux-x86_64.sh --user

I get the following message that sounds like an error:

Build informations

Commit
Uncompressing....../Users/xxx/bin/bazel: line 88: /Users/xxx.bazel/bin/bazel-real: cannot execute binary file
/Users/xxx/bin/bazel: line 88: /Users/xxx/.bazel/bin/bazel-real: Undefined error: 0

and I then get additional errors in the following steps.

Is the MacOS installation limited to one of the 3 options?

Are there additional requirements dependent on the MacOS version?

Any help would be greatly appreciated.

Typo in README

"build and loaded" should be "built and loaded".

Illegal instruction (core dumped) when trying to run ngraph-bridge

Hi everyone,

I'm trying to run a simple TF CNN-mnist example with ngraph but I receive the following error when tf.train is called: Illegal instruction (core dumped).
When I run the code without importing ngraph_bridge it works perfectly.

How can I debug this and get more information about what is happening? Does anyone have a clue what it could be?

Thank you.
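
One way to get more information out of a crash like this is Python's standard faulthandler module, which prints the Python stack when the process receives a fatal signal such as SIGILL. The sketch below is a generic debugging aid, not something taken from the nGraph documentation; the log file name is an arbitrary choice.

```python
import faulthandler

# Keep the file object alive for the lifetime of the process; faulthandler
# writes to its file descriptor when a fatal signal arrives.
crash_log = open("crash_trace.log", "w")

# Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL before any
# native extension is imported, so the failing Python line gets recorded.
faulthandler.enable(file=crash_log, all_threads=True)

# import tensorflow as tf   # uncomment in the real script
# import ngraph_bridge      # a crash after this point is logged with its stack
```

Running the unmodified script as `python -X faulthandler script.py` enables the same handlers and writes the trace to stderr.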

Illegal instruction (core dumped) when trying to run with nGraph-TensorFlow bridge

Hi everyone,

I'm trying to run a TensorFlow example with the nGraph-TensorFlow bridge, but I receive the following error when sess.run() is called: Illegal instruction (core dumped).
When I run the code without importing ngraph_bridge it works perfectly.
I found that someone has run into this problem before, but no system info was provided and the issue has been closed. So here is my system info:
OS platform: Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
CPU: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Python: Python 3.5.2
GCC: GCC 5.4.0

I installed ngraph_bridge with Option 1: Use a pre-built nGraph-TensorFlow bridge, and then ran the command:
python -c "import tensorflow as tf; print('TensorFlow version: ',tf.__version__);import ngraph_bridge; print(ngraph_bridge.__version__)"
The result is:
TensorFlow version: 1.12.0
nGraph bridge version: b'0.11.0'
nGraph version used for this build: b'0.14.0+56a54ca'
TensorFlow version used for this build: v1.12.0-0-ga6d8ffa

I also tried pinning the version to ngraph_bridge==0.8.0; the result is:
TensorFlow version: r 1.12.0
TensorFlow version installed: 1.12.0 (v1.12.0-0-ga6d8ffae09)
nGraph bridge built with: 1.12.0 (v1.12.0-0-ga6d8ffa)
b'0.8.0'
But the core dump occurs with both versions.

Thank you!
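
An "Illegal instruction" crash usually means that the binary uses CPU instructions the processor does not implement. The Xeon E5-2670 v2 listed above is an Ivy Bridge part, which supports AVX but not AVX2 or FMA. Which instruction sets the prebuilt wheel actually requires is not stated in this report, so the `required` tuple below is an assumption. The sketch only shows how the advertised CPU features could be checked on Linux.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_features(flags, required=("avx", "avx2", "fma")):
    """List the required instruction-set features the CPU does not advertise."""
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    print("missing:", missing_features(cpu_flags(text)))
```

If the list is not empty, building the bridge from source on the target machine is one way to rule out an instruction-set mismatch.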

Problems getting ngraph-tf to run under manjaro

I have been trying for some days to get ngraph-tf to run under Manjaro and have run into multiple problems.
The goal is to use ngraph-tf with the PlaidML backend.

I am testing with the following code:

import tensorflow as tf
import os
import sys
if os.environ.get("USE_TF_KERAS", "1") == "1":
    import tensorflow.keras as keras
    print("Using tensorflow keras version")
else:
    import keras
    print("Using keras with backend %s" % keras.backend.backend())


if len(sys.argv) < 2:
    backend = "CPU"
else:
    backend = sys.argv[1]
if backend == "NONE":
    print("NOT using ngraph")
else:
    import ngraph_bridge
    print("Supported ngraph backend:\n  %s" % "\n  ".join(ngraph_bridge.list_backends()))
    ngraph_bridge.set_backend(backend)
    print("Using ngraph backend %s" % ngraph_bridge.get_currently_set_backend_name())

mnist = keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = keras.models.Sequential([
  keras.layers.Flatten(input_shape=(28, 28)),
  keras.layers.Dense(512, activation="relu"),
  keras.layers.Dropout(0.2),
  keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print("Predict:", model.predict(x_train[:1]))

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

When I run it with tensorflow.keras and the nGraph backend set to PLAIDML (USE_TF_KERAS=1 KERAS_BACKEND="tensorflow" python test_ngrapg_tf.py PLAIDML), I get either a segfault or this stack trace (sometimes one, sometimes the other):

Traceback (most recent call last):
  File "test_ngrapg_tf.py", line 39, in <module>
    model.fit(x_train, y_train, epochs=5)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
    validation_steps=validation_steps)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 329, in model_iteration
    batch_outs = f(ins_batch)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Caught exception while compiling op_backend: get_shape() must be called on a node with exactly one output ()

	 [[{{node ngraph_cluster_44}}]]

When I run it with standalone Keras with the Keras backend set to TensorFlow (USE_TF_KERAS=0 KERAS_BACKEND="tensorflow" python test_ngrapg_tf.py PLAIDML), I reliably get invalid OpenCL kernels generated by PlaidML (see plaidml/plaidml#322).

Both versions can execute the prediction step just fine, although Keras with the TensorFlow backend seems to produce wrong values.

With only TensorFlow or PlaidML via Keras (or, in the case of TF, also tf.keras) and without ngraph-tf, it runs without a problem (USE_TF_KERAS=1/0 KERAS_BACKEND="tensorflow" python test_ngrapg_tf.py NONE).
These tests were made with a self-built version of ngraph-tf, with and without the --use_prebuilt_tensorflow parameter.

With the CPU nGraph backend, it runs both with Keras (TensorFlow as the Keras backend) and with tf.keras, although in both cases much slower than plain CPU TensorFlow without nGraph.
Additionally, when using Keras with the backend set to TensorFlow, the results seem to be wrong.

When I run it with the nGraph CPU backend using the PyPI version of ngraph-tf installed via pip, I get an Illegal instruction crash with both Keras on TensorFlow and tf.keras.

Additional info

I am using python 3.5.5 installed via pyenv.

# uname -a 
Linux seima-pc 5.0.15-1-MANJARO #1 SMP PREEMPT Fri May 10 19:51:04 UTC 2019 x86_64 GNU/Linux

GPU: Radeon RX 580

When compiling ngraph-tf, I need to create a link from lib64 to lib in the artifact directory; otherwise the ngraph-tf build fails, because it expects the lib directory but creates lib64 (not sure if this is relevant).

Sorry for the wall of text, but I really don't know where it goes wrong.
Please let me know if additional information is required.

pip install and run show an error "undefined symbol: _ZNK10tensorflow4Node11type_stringEv"

After installing tensorflow and ngraph-tf, executing python -c "import tensorflow as tf; print('TensorFlow version: ',tf.__version__);import ngraph_bridge; print(ngraph_bridge.__version__)" showed the following error:

➜  ~ source tf-test/bin/activate
(tf-test) ➜  ~  pip install -U tensorflow 
Requirement already up-to-date: tensorflow in ./tf-test/lib/python3.7/site-packages (1.13.1)
Requirement already satisfied, skipping upgrade: wheel>=0.26 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (0.33.1)
Requirement already satisfied, skipping upgrade: absl-py>=0.1.6 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (0.7.1)
Requirement already satisfied, skipping upgrade: astor>=0.6.0 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (0.7.1)
Requirement already satisfied, skipping upgrade: termcolor>=1.1.0 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (1.1.0)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in ./anaconda3/envs/plaidml/lib/python3.7/site-packages (from tensorflow) (1.16.3)
Requirement already satisfied, skipping upgrade: tensorboard<1.14.0,>=1.13.0 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (1.13.1)
Requirement already satisfied, skipping upgrade: keras-applications>=1.0.6 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (1.0.7)
Requirement already satisfied, skipping upgrade: grpcio>=1.8.6 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (1.20.0)
Requirement already satisfied, skipping upgrade: tensorflow-estimator<1.14.0rc0,>=1.13.0 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (1.13.0)
Requirement already satisfied, skipping upgrade: gast>=0.2.0 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (0.2.2)
Requirement already satisfied, skipping upgrade: keras-preprocessing>=1.0.5 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (1.0.9)
Requirement already satisfied, skipping upgrade: protobuf>=3.6.1 in ./tf-test/lib/python3.7/site-packages (from tensorflow) (3.7.1)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in ./anaconda3/envs/plaidml/lib/python3.7/site-packages (from tensorflow) (1.12.0)
Requirement already satisfied, skipping upgrade: werkzeug>=0.11.15 in ./tf-test/lib/python3.7/site-packages (from tensorboard<1.14.0,>=1.13.0->tensorflow) (0.15.2)
Requirement already satisfied, skipping upgrade: markdown>=2.6.8 in ./tf-test/lib/python3.7/site-packages (from tensorboard<1.14.0,>=1.13.0->tensorflow) (3.1)
Requirement already satisfied, skipping upgrade: h5py in ./tf-test/lib/python3.7/site-packages (from keras-applications>=1.0.6->tensorflow) (2.9.0)
Requirement already satisfied, skipping upgrade: mock>=2.0.0 in ./tf-test/lib/python3.7/site-packages (from tensorflow-estimator<1.14.0rc0,>=1.13.0->tensorflow) (2.0.0)
Requirement already satisfied, skipping upgrade: setuptools in ./tf-test/lib/python3.7/site-packages (from protobuf>=3.6.1->tensorflow) (41.0.1)
Requirement already satisfied, skipping upgrade: pbr>=0.11 in ./tf-test/lib/python3.7/site-packages (from mock>=2.0.0->tensorflow-estimator<1.14.0rc0,>=1.13.0->tensorflow) (5.1.3)
(tf-test) ➜  ~  pip install -U ngraph-tensorflow-bridge
Requirement already up-to-date: ngraph-tensorflow-bridge in ./tf-test/lib/python3.7/site-packages (0.12.0)
(tf-test) ➜  ~  python -c "import tensorflow as tf; print('TensorFlow version: ',tf.__version__);import ngraph_bridge; print(ngraph_bridge.__version__)"
TensorFlow version:  1.13.1
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/jaebaek/tf-test/lib/python3.7/site-packages/ngraph_bridge/__init__.py", line 94, in <module>
    os.path.join(libpath, 'libngraph_bridge.' + ext))
  File "/home/jaebaek/anaconda3/envs/plaidml/lib/python3.7/ctypes/__init__.py", line 434, in LoadLibrary
    return self._dlltype(name)
  File "/home/jaebaek/anaconda3/envs/plaidml/lib/python3.7/ctypes/__init__.py", line 356, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/jaebaek/tf-test/lib/python3.7/site-packages/ngraph_bridge/libngraph_bridge.so: undefined symbol: _ZNK10tensorflow4Node11type_stringEv
(tf-test) ➜  ~
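
An undefined TensorFlow symbol at load time often points to a bridge wheel that was compiled against a different TensorFlow release than the one installed. The log above does not say which release the 0.12.0 wheel was built with, so that explanation is an assumption. The sketch below only shows how the two version strings could be compared; the "built with" string is borrowed from another report on this page for illustration.

```python
import re


def major_minor(version_text):
    """Extract (major, minor) from text such as '1.12.0 (v1.12.0-0-ga6d8ffa)'."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no version number in %r" % version_text)
    return int(match.group(1)), int(match.group(2))


def same_release(installed, built_with):
    """True when both strings name the same TensorFlow major.minor release."""
    return major_minor(installed) == major_minor(built_with)


print(same_release("1.12.0", "1.12.0 (v1.12.0-0-ga6d8ffa)"))  # True
print(same_release("1.13.1", "1.12.0 (v1.12.0-0-ga6d8ffa)"))  # False
```

When the releases differ, installing the TensorFlow version the wheel was built for, or rebuilding the bridge against the installed one, are the two obvious things to try.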

Backend 'CPU' not registered. Error:dlopen: cannot load any more object with static TLS

Hi, I followed "Build nGraph bridge from source using TensorFlow source", but after installing TensorFlow and nGraph and running axpy.py, the program shows:
what(): Backend 'CPU' not registered. Error:dlopen: cannot load any more object with static TLS.
My GCC version is 7.3.0 and the system is Red Hat 4.8.2-16.

Error running make -j4 install command with ngraph-tf errors

I am compiling he-transformer, but the build fails because of an ngraph-tf error. Please take a look and help me with this. Thank you!

build_ngtf.py fails on the activation of venv-tf-py3

The build_ngtf.py script fails shortly after it starts:

> python3 ./build_ngtf.py
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in /home/pavel/ngraph/ngraph-tf/build/venv-tf-py3/bin/python3
Also creating executable in /home/pavel/ngraph/ngraph-tf/build/venv-tf-py3/bin/python
Installing setuptools, pip, wheel...
done.
Loading virtual environment from: /home/pavel/ngraph/ngraph-tf/build/venv-tf-py3
Traceback (most recent call last):
  File "./build_ngtf.py", line 534, in <module>
    main()
  File "./build_ngtf.py", line 447, in main
    load_venv(venv_dir)
  File "./build_ngtf.py", line 107, in load_venv
    dict(__file__=activate_this_file), dict(__file__=activate_this_file))
  File "/home/pavel/ngraph/ngraph-tf/build/venv-tf-py3/bin/activate_this.py", line 46, in <module>
    sys.path[:] = [i for i in new if i not in prev] + [i for i in new if i in prev]
  File "/home/pavel/ngraph/ngraph-tf/build/venv-tf-py3/bin/activate_this.py", line 46, in <listcomp>
    sys.path[:] = [i for i in new if i not in prev] + [i for i in new if i in prev]
NameError: name 'prev' is not defined

It looks like a problem in the virtualenv activation script but I can run python3 ./build/venv-tf-py3/bin/activate_this.py without any problems.

I use the latest version of virtualenv: 16.2.0.
Version of python: 3.6.5

P.S. If build_ngtf.py is started under Python 2 instead of Python 3, the activation works fine:

> python ./build_ngtf.py
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in /home/pavel/ngraph/ngraph-tf/build/venv-tf-py3/bin/python3
Also creating executable in /home/pavel/ngraph/ngraph-tf/build/venv-tf-py3/bin/python
Installing setuptools, pip, wheel...
done.
Loading virtual environment from: /home/pavel/ngraph/ngraph-tf/build/venv-tf-py3
Loading virtual environment from: /home/pavel/ngraph/ngraph-tf/build/venv-tf-py3
PIP location
/home/pavel/ngraph/ngraph-tf/build/venv-tf-py3/bin/pip
Requirement already up-to-date: pip in ./venv-tf-py3/lib/python3.6/site-packages (18.1)
Requirement already up-to-date: setuptools in ./venv-tf-py3/lib/python3.6/site-packages (40.6.3)
Collecting psutil
...

However, a mix of Python 3 and Python 2 is then used for the rest of the script. For example, TensorFlow is configured with PYTHON_LIB_PATH pointing to build/venv-tf-py3/lib/python2.7/site-packages, and the build of course fails.
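
The NameError in the traceback is consistent with how exec() treats separate globals and locals dictionaries. The traceback shows load_venv passing two dicts (`dict(__file__=...), dict(__file__=...)`), presumably to exec(). In Python 3, a list comprehension has its own scope, so a name that the executed code assigned in the locals dict is not found from inside the comprehension. In Python 2, list comprehensions share the enclosing scope, which would explain why the activation works there. The sketch below reproduces the pattern with a stand-in for activate_this.py; the helper names are hypothetical.

```python
ACTIVATE_LIKE = """
prev = ['a']
new = ['a', 'b']
result = [i for i in new if i not in prev]
"""


def run_with_two_dicts(source):
    """Mimic exec(code, globals_dict, locals_dict) with separate dicts."""
    try:
        local_ns = {}
        exec(source, {}, local_ns)
        return local_ns["result"]
    except NameError as error:
        return error


def run_with_one_dict(source):
    """Use a single namespace, so top-level names are visible everywhere."""
    namespace = {}
    exec(source, namespace)
    return namespace["result"]


print("two dicts:", run_with_two_dicts(ACTIVATE_LIKE))
print("one dict: ", run_with_one_dict(ACTIVATE_LIKE))
```

On the Python 3.5 and 3.6 interpreters in these reports, the first call is expected to fail with the same "name 'prev' is not defined" error, while the second returns ['b']. Newer interpreters may treat the first case differently. Passing one dictionary to exec() in load_venv is therefore a plausible fix, though the source does not confirm how the project resolved it.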

slow, single threaded

Hi, I've been trying to use ngraph to accelerate my tensorflow detector/testing pipeline, but, unfortunately, without any success so far. The inference process either has the same performance, or becomes painstakingly slow.

I'm not quite sure whether I'm installing and using ngraph right.

I'm not quite sure whether this is the right place to ask these questions, since it might just be something obvious that I've missed, thus not being an actual issue, but I couldn't find any other support channel. If there is a different, more appropriate one, please direct me to it.

For installation, I've used pip inside my own dockerfile to install ngraph-tensorflow-bridge, following the instructions on this repo (and also installed plaidml, since I've noticed ngraph looks for it during initialization; I didn't build the ngraph library myself, since I noticed that the bridge supplies the .so, and it doesn't complain when loading it).

Also, I've tried turning on XLA, but it has no effect:

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
tf.keras.backend.set_session(tf.Session(config=config))

Also, I've tested ngraph with intel-tensorflow. When ngraph is off, intel-tf gets about twice as fast as vanilla tf. When ngraph bridge is imported, the performance is really, really low (i.e., I've got bored of waiting for an operation to finish that takes a few seconds when ngraph is not used).

Also, I've tried both 'NCHW' and 'NHWC' under both the vanilla and the intel distributions of tensorflow.

For usage, I only added import ngraph_bridge after importing tensorflow. Is there something else I'm supposed to do?

I didn't get any stdout/stderr message to help me figure out whether ngraph is actually on or not. I've looked through the output of tensorflow.python.client.device_lib.list_local_devices(), but nothing seems to change when adding the import. The only indication that ngraph is used is when I don't disable my GPU (os.environ["CUDA_VISIBLE_DEVICES"] = ""), and I get an error message.

Here is the code I've used for testing out ngraph (it's based on the keras example in this repo). I think the longest I've waited to see some training progress was 10 minutes (without ngraph, I get a progress bar update after under half a minute).

import numpy as np
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ['KERAS_BACKEND'] = 'tensorflow'

import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
import ngraph_bridge

# A simple script to run inference and training on resnet 50

config = tf.ConfigProto()
config.intra_op_parallelism_threads = 4
config.inter_op_parallelism_threads = 4

# config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

tf.keras.backend.set_session(tf.Session(config=config))
# tf.keras.backend.set_image_data_format('channels_first')

model = ResNet50(weights=None)

batch_size = 128
img = np.random.rand(batch_size, 224, 224, 3)
# img = np.random.rand(batch_size, 3, 224, 224)

preds = model.predict(preprocess_input(img))
print('Predicted:', decode_predictions(preds, top=3)[0])
model.compile(tf.keras.optimizers.SGD(), loss='categorical_crossentropy')
preds = model.fit(
    preprocess_input(img), np.zeros((batch_size, 1000), dtype='float32'))
print('Ran a train round')

I've also tried ngraph for a different code that doesn't use keras (it uses tensorflow's object_detection API instead). Speed is at least 20% lower when using ngraph. For some models the process would have a much larger memory footprint when using ngraph vs. when not (I noticed that because my laptop started paging and crashed).

I've also noticed that while training the keras examples, only one of the 8 logical cores of my CPU is used. The same happens when running inference on the detection model, although for a fraction of the time more than one core is saturated.

Thanks

Why bypass XLA

The latest code bypasses the XLA ops translation and instead translates directly from TF operations. What is the reason?

Broken with deeplabv3+

tl;dr: the code works fine without ngraph; with ngraph enabled, it dies with the errors shown below.

Details:

I've been trying to get ngraph working with Google's deeplab v3+, without any luck. The code is being run inside a docker container (the nvcr.io/nvidia/tensorflow:18.12-py3 image) on an nvidia dgx2 (16 GPUs).

Versions:

Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
TensorFlow version installed: 1.12.0 (unknown)
nGraph bridge built with: 1.12.0 (v1.12.0-0-ga6d8ffa)

The docker container was started with the following command line:

nvidia-docker run -it \
  --rm \
  --shm-size=1g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --privileged=true \
  -v /raid/wingated:/raid/wingated \
  -v /home/wingated:/home/wingated \
  -v /mnt/pccfs:/mnt/pccfs \
  nvcr.io/nvidia/tensorflow:18.12-py3

Here are the errors. I have no idea how to diagnose this. :)

[snip]
INFO:tensorflow:Restoring parameters from /raid/wingated/cancer/deeplab_data/init_models/xception/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /raid/wingated/cancer/deeplab_data/logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Error reported to Coordinator: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 495, in run
    self.run_loop()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/supervisor.py", line 1034, in run_loop
    self._sv.global_step])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster

build fail with --use_prebuilt_tensorflow

When building the bridge with --use_prebuilt_tensorflow, the build fails like this:

...
adding 'ngraph_tensorflow_bridge-0.12.0rc4.dist-info/top_level.txt'
adding 'ngraph_tensorflow_bridge-0.12.0rc4.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel
OUTPUT WHL FILE: ngraph_tensorflow_bridge-0.12.0rc4-py2.py3-none-manylinux1_x86_64.whl
OUTPUT WHL DST: .../ngraph_20190409/build_cmake/artifacts/ngraph_tensorflow_bridge-0.12.0rc4-py2.py3-none-manylinux1_x86_64.whl
SUCCESSFULLY generated wheel: ngraph_tensorflow_bridge-0.12.0rc4-py2.py3-none-manylinux1_x86_64.whl
PWD: .../ngraph_20190409/build_cmake
cp: cannot stat '.../ngraph_20190409/build_cmake/tensorflow/tensorflow/python': No such file or directory
Traceback (most recent call last):
  File "build_ngtf.py", line 300, in <module>
    main()
  File "build_ngtf.py", line 289, in main
    os.path.join(artifacts_location, "tensorflow")
  File ".../ngraph_20190409/tools/build_utils.py", line 44, in command_executor
    raise Exception("Error running command: " + cmd)
Exception: Error running command: cp -r .../ngraph_20190409/build_cmake/tensorflow/tensorflow/python .../ngraph_20190409/build_cmake/artifacts/tensorflow

This piece of code should take --use_prebuilt_tensorflow into account:

    # Copy the TensorFlow Python code tree to artifacts directory so that they can
    # be used for running TensorFlow Python unit tests
    command_executor([
        'cp', '-r', build_dir_abs + '/tensorflow/tensorflow/python',
        os.path.join(artifacts_location, "tensorflow")
    ])
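
One possible shape for such a guard is sketched below. It is not the project's actual fix: `copy_tf_python_tree` and its parameters are hypothetical, and it uses shutil instead of the script's command_executor so that the example stays self-contained.

```python
import os
import shutil
import tempfile


def copy_tf_python_tree(build_dir, artifacts_dir, use_prebuilt_tensorflow):
    """Copy tensorflow/tensorflow/python into artifacts, unless there is no source tree.

    Returns True when the tree was copied and False when the step was skipped.
    """
    source = os.path.join(build_dir, "tensorflow", "tensorflow", "python")
    if use_prebuilt_tensorflow or not os.path.isdir(source):
        # With a prebuilt TensorFlow there is no source checkout to copy from.
        print("Skipping TensorFlow Python tree copy:", source)
        return False
    target = os.path.join(artifacts_dir, "tensorflow", "python")
    shutil.copytree(source, target)  # fails if the target already exists
    return True


# Usage with a throwaway directory: the prebuilt case skips the copy.
with tempfile.TemporaryDirectory() as root:
    copied = copy_tf_python_tree(root, os.path.join(root, "artifacts"), True)
    print("copied:", copied)
```

Skipping the copy means the TensorFlow Python unit tests mentioned in the comment would not be available from the artifacts directory in the prebuilt case.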

Conv2DCustomBackpropFilterOp only supports NHWC

Hi,

When I run tf_cnn_benchmarks.py, it reports the error "Conv2DCustomBackpropFilterOp only supports NHWC" as follows:

2018-07-20 00:56:31.727280: E tensorflow/core/common_runtime/executor.cc:696] Executor failed to create kernel. Invalid argument: Conv2DCustomBackpropFilterOp only supports NHWC.
[[Node: v0/tower_0/gradients/v0/tower_0/cg/resnet_v115/conv52/conv2d/Conv2D_grad/Conv2DBackpropFilter = Conv2DBackpropFilter[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](ngraph_cluster_0/_1691, ConstantFolding/v0/tower_0/gradients/v0/tower_0/cg/resnet_v115/conv52/conv2d/Conv2D_grad/ShapeN-matshapes-1, v0/tower_0/gradients/v0/tower_0/cg/resnet_v115/conv52/batchnorm52/FusedBatchNorm_grad/FusedBatchNormGrad)]]

Does this mean ngraph-tf only supports CPU?

libngraph_device.so undefined symbol

I successfully built a Python wheel, but when I import ngraph, an error is raised:

OSError: /opt/conda/lib/python3.6/site-packages/ngraph/libngraph_device.so: undefined symbol: _ZN6ngraph4NodeC2ERKSsRKNS_10NodeVectorE

So, how can I solve this error?

Thanks
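
The mangled name in the error carries a hint. In the Itanium C++ ABI, `Ss` abbreviates the pre-C++11 `std::string`, whereas code built with the C++11 ABI of libstdc++ spells the type with a `__cxx11` namespace component. The missing symbol therefore takes an old-ABI string, which may indicate that the wheel and the nGraph library were compiled with different `_GLIBCXX_USE_CXX11_ABI` settings. This is an inference from the symbol alone; the report does not confirm it. The helper below is a hypothetical, deliberately rough heuristic.

```python
def guess_libstdcxx_string_abi(mangled_symbol):
    """Rough heuristic: which std::string ABI does a mangled C++ symbol mention?

    A plain substring test can misfire on identifiers that happen to contain
    'Ss', so treat the answer as a hint only.
    """
    if "__cxx11" in mangled_symbol:
        return "new (C++11) ABI"
    if "Ss" in mangled_symbol:
        return "old (pre-C++11) ABI"
    return "no std::string in signature"


print(guess_libstdcxx_string_abi("_ZN6ngraph4NodeC2ERKSsRKNS_10NodeVectorE"))
print(guess_libstdcxx_string_abi("_ZNK10tensorflow4Node11type_stringEv"))
```

The first symbol is the one from this report and is classified as old ABI. The second, from the pip install report on this page, has no string parameter, so an ABI mismatch cannot be read off it.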

compile() must be called before call().

This occurs after the following PR:
NervanaSystems/ngraph#2064

Question: How can I use TF + nGraph + PlaidML?

I installed TF and nGraph using
python3 build_ngtf.py --build_plaidml_backend --use_prebuilt_tensorflow

I tested the TF MNIST hello-world example.
It uses only the CPU. How can I use PlaidML + OpenCL as its backend?
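
The test script in the Manjaro report on this page selects a backend through three bridge functions: list_backends, set_backend and get_currently_set_backend_name. A minimal sketch built only on those calls follows. `select_backend` is a hypothetical wrapper, and the backend name "PLAIDML" is taken from that same report; whether it is available depends on how the bridge was built.

```python
def select_backend(bridge, wanted):
    """Switch the bridge to `wanted` if it is listed; return the active backend name.

    `bridge` is the imported ngraph_bridge module (or any object that offers
    list_backends, set_backend and get_currently_set_backend_name).
    """
    available = list(bridge.list_backends())
    if wanted not in available:
        raise ValueError("backend %r not in %r" % (wanted, available))
    bridge.set_backend(wanted)
    return bridge.get_currently_set_backend_name()


if __name__ == "__main__":
    try:
        import ngraph_bridge
    except ImportError:
        print("ngraph_bridge is not installed; nothing to select")
    else:
        print("active backend:", select_backend(ngraph_bridge, "PLAIDML"))
```

Checking the list first turns a missing backend into an explicit error instead of a silent fallback to the CPU.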

Flags as arguments

Hi.
I think it would be a good idea to let users choose which flags are on during the build process through arguments to the build_ngtf.py script. Currently it is quite troublesome to edit the script manually just to enable IntelGPU or GPU. Especially if you want to automate the build process (using docker etc.), it is inconvenient to have to sed the file just to change one flag.
I understand that this is of low priority, but it's definitely something that would make ngraph+tf more user-friendly.
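
A sketch of what such switches could look like with argparse is shown below. The flag names `--build_gpu_backend` and `--build_intelgpu_backend` are hypothetical (modelled on the existing `--build_plaidml_backend`), and the CMake option names they map to are assumptions, not values read from build_ngtf.py.

```python
import argparse


def build_parser():
    """Parser with one switch per optional backend (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Sketch of backend switches")
    parser.add_argument("--build_gpu_backend", action="store_true")
    parser.add_argument("--build_intelgpu_backend", action="store_true")
    return parser


def cmake_flags(args):
    """Translate parsed switches into -D options (CMake names are assumptions)."""
    flags = []
    if args.build_gpu_backend:
        flags.append("-DNGRAPH_GPU_ENABLE=YES")
    if args.build_intelgpu_backend:
        flags.append("-DNGRAPH_INTELGPU_ENABLE=YES")
    return flags


args = build_parser().parse_args(["--build_intelgpu_backend"])
print(cmake_flags(args))  # ['-DNGRAPH_INTELGPU_ENABLE=YES']
```

With switches like these, a Dockerfile could pass the desired backend on the command line instead of patching the script with sed.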

Performance Degradation after source build inside Docker on Intel i7-5820K CPU

Hi there, I am using nGraph to accelerate my model.

As my CPU is not from the Xeon series, I built nGraph and TensorFlow from source inside Docker, following Option 2 in the README. The build succeeded and passed the model test. However, inference is much slower when using the nGraph backend.

CPU: 0.03387284278869629 secs
NGRAPH_CPU: 0.11669778823852539 secs

Could anyone point out a possible reason for this?

Btw, I notice there are setting recommendations for the Xeon series. (https://ngraph.nervanasys.com/docs/latest/frameworks/generic-configs.html#ngraph-enabled-intel-xeon.) I am wondering whether these environment settings would make a big difference.

Any hint is highly appreciated!!
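
The two figures above put the nGraph run at roughly 3.4 times the plain CPU time. The report does not say how they were measured. If each is a single call, the nGraph number may include one-time graph compilation, which is an assumption worth ruling out. The sketch below times a callable after a few warm-up calls and reports the median; `time_calls` is a hypothetical helper and the workload is a stand-in.

```python
import statistics
import time


def time_calls(run_once, warmup=3, repeats=10):
    """Return the median wall-clock seconds of `run_once`, ignoring warm-up calls."""
    for _ in range(warmup):
        run_once()  # early calls may include one-time setup or compilation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; in practice this would wrap sess.run(...) or model.predict(...).
median = time_calls(lambda: sum(range(10000)))
print("median seconds per call: %.6f" % median)

# Ratio of the two figures reported above.
print("slowdown: %.2fx" % (0.11669778823852539 / 0.03387284278869629))
```

Comparing medians after warm-up for both configurations gives a fairer picture of steady-state inference time than a single measurement.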