Comments (9)
@tuanavu Thanks for your feedback. We decoupled and reorganized the third-party dependencies that HPS depends on after 23.06. Since all HPS/SOK-related libraries are now pre-installed, there is no need to set LD_PRELOAD. If you must set custom library file paths, we recommend using LD_LIBRARY_PATH instead. FYI @EmmaQiaoCh
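For readers unfamiliar with the distinction: LD_LIBRARY_PATH only extends the dynamic linker's search path, while LD_PRELOAD force-loads a library into every process. A minimal sketch of the recommended approach, where /opt/custom/lib is a hypothetical placeholder directory, not a path from this thread:

```shell
# Prepend a hypothetical custom library directory to the linker search path.
# Unlike LD_PRELOAD, this does not force-load anything; libraries in this
# directory are only used when a binary actually links against them.
export LD_LIBRARY_PATH=/opt/custom/lib:${LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```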
@bashimao Please add your comment about reorganizing the third-party dependencies.
from hugectr.
Hi @yingcanw, following up on this thread: without setting LD_PRELOAD, I got this error after deploying the model. I used nvcr.io/nvidia/merlin/merlin-tensorflow:23.09. Do you know how to resolve this?
2024-02-06 17:35:54.482183: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-06 17:35:56.319034: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
2024-02-06 17:35:56.991711: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/folderEWbC2X/1/model.savedmodel
2024-02-06 17:35:58.817038: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
And here's the LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/tensorflow:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/usr/lib/jvm/default-java/lib:/usr/lib/jvm/default-java/lib/server:/opt/tritonserver/lib:/usr/local/hugectr/lib
that I saw in the container.
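As a side note, "Unable to find OpDef for Init" usually indicates that the shared library registering the custom op was never loaded into the TensorFlow process. A hedged diagnostic sketch (the module name is inferred from the library path quoted in this thread; run it inside the Merlin container):

```shell
# If the HPS plugin imports cleanly, its custom ops are registered with
# TensorFlow in that Python process; if not, lookups on those ops will fail.
if python3 -c "import hierarchical_parameter_server" 2>/dev/null; then
    echo "HPS plugin loaded"
else
    echo "HPS plugin not importable; custom ops will be missing"
fi
```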
Please provide more details on which step in the notebook outputs these error messages.
Sure. Steps to reproduce the behavior:
- Train a TF+SOK model with merlin-tensorflow:23.09 and follow the deployment steps outlined in the HPS TensorFlow Triton deployment demo notebook to export the inference graph with HPS.
- Construct a deployment.yaml and deploy on AWS EKS, without setting the LD_PRELOAD
- Check the container log and see this error:
2024-02-06 17:35:54.482183: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-06 17:35:56.319034: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
2024-02-06 17:35:56.991711: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/folderEWbC2X/1/model.savedmodel
2024-02-06 17:35:58.817038: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
- Send a serving request or run perf_analyzer; the same error still appears.
Note that the same model can be deployed and tested successfully with merlin-tensorflow:23.02 and 23.06 by setting LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so
From the brief reproduction steps you provided, I still can't tell at which specific step you hit these errors. I can only guess that you successfully completed model training and created_and_save_inference_graph, and then hit the error in the "Deploy SavedModel using HPS with Triton TensorFlow Backend" step.
Since we do not have the same AWS environment, we have not been able to reproduce the issue on a local machine (T4/V100, Intel CPU, Ubuntu 22.04 with the 23.12 container).
However, one important note: LD_PRELOAD only needs to be set when launching tritonserver (Cell 13 in the notebook), as shown below; it is the registration mechanism for the custom op required by the Triton server. There is no need to set LD_PRELOAD in any other step.
LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so tritonserver --model-repository=/hugectr/hps_tf/notebooks/model_repo --backend-config=tensorflow,version=2 --load-model=hps_tf_triton --model-control-mode=explicit
Hi @yingcanw,
The issue seems to circle back to the initial problem discussed in this thread: #440 (comment). When I set LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so when launching tritonserver exactly as you describe above, I encountered a segmentation fault. This error appears to be consistent across the merlin-tensorflow images from versions 23.08 to 23.12. However, with the 23.12 image, the LD_PRELOAD path pointing to Python 3.8 libraries should no longer be applicable, as you mentioned. Could you attempt to reproduce this error using the 23.09 container and share any findings?
Thank you for your corrections. There is a typo here: we upgraded Python to 3.10 in 23.08, and we need to update the notebook to fix the Triton server launch command.
However, I still haven't reproduced the issue you mentioned on 23.09. I do want to emphasize the difference in #440 (comment): users are asked not to set the LD_PRELOAD variable independently (please pay attention to the bold part in the log). LD_PRELOAD is used as a prefix to the tritonserver launch command, not as an environment variable set on its own.
Hope the above information makes it clearer how to solve the problem.
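To make that distinction concrete, here is a generic shell sketch of the two forms, using a harmless variable (FOO and printenv stand in for LD_PRELOAD and tritonserver; nothing here is HPS-specific):

```shell
# Form used by the notebook: VAR=value command
# The variable exists only for that single process.
FOO=hello printenv FOO

# The variable is NOT set in the shell afterwards, so other processes
# launched later are unaffected; this is why the prefix form is safer
# than exporting LD_PRELOAD globally.
printenv FOO || echo "FOO is not set in the shell itself"
```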
docker run --gpus=all --privileged=true --net=host --shm-size 16g -it -v ${PWD}:/hugectr/ nvcr.io/nvidia/merlin/merlin-tensorflow:23.09 /bin/bash
==================================
== Triton Inference Server Base ==
==================================

NVIDIA Release 23.06 (build 62878575)
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.85.12.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so tritonserver --model-repository=/hugectr/hugectr/hps_tf/notebooks/model_repo --backend-config=tensorflow,version=2 --load-model=hps_tf_triton --model-control-mode=explicit
I0207 07:40:33.150916 129 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f9316000000' with size 268435456
I0207 07:40:33.156832 129 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
W0207 07:40:33.906804 129 server.cc:248] failed to enable peer access for some device pairs
W0207 07:40:33.922110 129 model_lifecycle.cc:108] ignore version directory 'hps_tf_triton_sparse_0.model' which fails to convert to integral number
I0207 07:40:33.922153 129 model_lifecycle.cc:462] loading: hps_tf_triton:1
I0207 07:40:34.251125 129 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0207 07:40:34.251157 129 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.13
I0207 07:40:34.251161 129 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.13
2024-02-07 07:40:37.339528: I tensorflow/cc/saved_model/loader.cc:334] SavedModel load for tags { serve }; Status: success: OK. Took 89817 microseconds.
I0207 07:40:37.340078 129 model_lifecycle.cc:815] successfully loaded 'hps_tf_triton'
I0207 07:40:37.340295 129 server.cc:603]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0207 07:40:37.340372 129 server.cc:630]
+------------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+------------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| tensorflow | /opt/tritonserver/backends/tensorflow/libtriton_tensorflow.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.00000 |
| | | 0","version":"2","default-max-batch-size":"4"}} |
+------------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
I0207 07:40:37.340439 129 server.cc:673]
+---------------+---------+--------+
| Model | Version | Status |
+---------------+---------+--------+
| hps_tf_triton | 1 | READY |
+---------------+---------+--------+
...
I0207 07:40:37.469253 129 grpc_server.cc:2445] Started GRPCInferenceService at 0.0.0.0:8001
I0207 07:40:37.469536 129 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0207 07:40:37.511302 129 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002
Hi @yingcanw,
Quick update: I believe I figured out the root cause of the seg fault. It appears to be related to configuring the --model-repository flag to point to a remote S3 bucket. My suspicion is that the underlying issue is with aws-sdk-cpp. This is based on observing similar errors, such as "Error: free(): invalid pointer", when manually interacting with S3 objects inside the container.
Steps to reproduce the behavior:
- Launch the merlin-tensorflow:23.09 container.
- SSH into the container.
- Execute the following command. This results in a segmentation fault and core dump.
name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
tritonserver --model-repository=s3://model_repo \
--backend-config=tensorflow,version=2 --load-model=$MODEL_ID --model-control-mode=explicit \
--grpc-port=6000 --metrics-port=80 --allow-metrics=true --allow-gpu-metrics=true --strict-readiness=true
2024-02-08 08:25:42.570293: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
I0208 08:25:42.928109 28 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f80ba000000' with size 268435456
I0208 08:25:42.930215 28 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
Segmentation fault (core dumped)
- In contrast, when the S3 model_repo is downloaded to a local directory and used as follows, no segmentation fault occurs and the model starts normally.
name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
tritonserver --model-repository=/tmp/models \
--backend-config=tensorflow,version=2 --load-model=$MODEL_ID --model-control-mode=explicit \
--grpc-port=6000 --metrics-port=80 --allow-metrics=true --allow-gpu-metrics=true --strict-readiness=true
2024-02-08 08:27:19.845263: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
I0208 08:27:20.222846 33 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fcd94000000' with size 268435456
I0208 08:27:20.225011 33 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0208 08:27:20.230161 33 model_lifecycle.cc:462] loading: 1200429:1
I0208 08:27:20.534202 33 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0208 08:27:20.534240 33 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.13
I0208 08:27:20.534256 33 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.13
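The local-directory workaround above can be scripted as a sketch like the following. The bucket name s3://model_repo comes from the command in this thread, LOCAL_REPO is an assumed placeholder, and the sync step requires valid AWS credentials inside the container:

```shell
# Mirror the remote model repository to a local path, then point
# tritonserver's --model-repository at the local copy instead of s3://,
# sidestepping the suspected aws-sdk-cpp crash.
LOCAL_REPO=/tmp/models
mkdir -p "$LOCAL_REPO"
if command -v aws >/dev/null 2>&1; then
    aws s3 sync s3://model_repo "$LOCAL_REPO"
else
    echo "aws CLI not found; copy the model repository by other means"
fi
```

tritonserver is then launched with --model-repository="$LOCAL_REPO", exactly as in the second command above.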
Thanks a lot for your update. I think this error output can easily be misunderstood (we have verified that if the LD_PRELOAD parameter is set, circular dependencies will cause a seg fault), but we don't have the same AWS environment to reproduce this problem.
Due to lazy initialization, HPS is not initialized until the first inference request is processed (although HPS currently does not support parsing embeddings from a remote repo, that would produce "file cannot be opened" error messages rather than a seg fault). So I think you may need to submit an issue to tensorflow_backend.
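The lazy-initialization behavior described above can be illustrated with a minimal, purely hypothetical Python sketch (none of these names are the actual HPS API): the lookup backend is only constructed on the first request, so a bad or remote path would surface at first inference rather than at server startup.

```python
class LazyLookup:
    """Toy stand-in for a lazily initialized embedding lookup service."""

    def __init__(self, path):
        self.path = path
        self._table = None  # nothing is opened yet at "server start"

    def lookup(self, key):
        if self._table is None:       # first request triggers initialization
            self._table = self._load()
        return self._table.get(key, 0.0)

    def _load(self):
        # A remote or unreadable path would fail *here*, at first request,
        # not in __init__; here we just return a dummy table.
        return {1: 0.5, 2: 0.25}


svc = LazyLookup("s3://model_repo/emb")  # constructing is cheap and safe
print(svc.lookup(1))  # prints 0.5 -- initialization happens now
```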