Comments (9)
@tuanavu Thanks for your feedback. We decoupled and reorganized the third-party dependencies that HPS depends on after 23.06. Since all HPS/SOK-related libraries are now pre-installed, there is no need to set LD_PRELOAD. If you must set custom library file paths, we recommend using LD_LIBRARY_PATH instead. FYI @EmmaQiaoCh
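For readers unfamiliar with the distinction: LD_LIBRARY_PATH only extends the dynamic linker's search path, while LD_PRELOAD force-loads a library into every process. A minimal sketch of the recommended approach, where /opt/custom/lib is a hypothetical placeholder directory, not a path from this thread:

```shell
# Prepend a hypothetical custom library directory to the linker search path.
# Unlike LD_PRELOAD, this does not force-load anything; libraries in this
# directory are only used when a binary actually links against them.
export LD_LIBRARY_PATH=/opt/custom/lib:${LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```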
@bashimao Please add your comment about reorganizing the third-party dependencies.
from hugectr.
Hi @yingcanw, following up on this thread: without setting LD_PRELOAD, I got this error after deploying the model. I used nvcr.io/nvidia/merlin/merlin-tensorflow:23.09. Do you know how to resolve this?
2024-02-06 17:35:54.482183: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-06 17:35:56.319034: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
2024-02-06 17:35:56.991711: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/folderEWbC2X/1/model.savedmodel
2024-02-06 17:35:58.817038: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
And here's the LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/tensorflow:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/lib:/repos/dist/lib:/usr/lib/jvm/default-java/lib:/usr/lib/jvm/default-java/lib/server:/opt/tritonserver/lib:/usr/local/hugectr/lib
that I saw in the container.
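As a side note, "Unable to find OpDef for Init" usually indicates that the shared library registering the custom op was never loaded into the TensorFlow process. A hedged diagnostic sketch (the module name is inferred from the library path quoted in this thread; run it inside the Merlin container):

```shell
# If the HPS plugin imports cleanly, its custom ops are registered with
# TensorFlow in that Python process; if not, lookups on those ops will fail.
if python3 -c "import hierarchical_parameter_server" 2>/dev/null; then
    echo "HPS plugin loaded"
else
    echo "HPS plugin not importable; custom ops will be missing"
fi
```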
Please provide more details on which step in the notebook outputs these error messages.
Sure. Steps to reproduce the behavior:
- Train a TF+SOK model with merlin-tensorflow:23.09 and follow the deployment steps outlined in the HPS TensorFlow Triton deployment demo notebook to export the inference graph with HPS.
- Construct a deployment.yaml and deploy on AWS EKS, without setting the LD_PRELOAD
- Check the container log and see this error:
2024-02-06 17:35:54.482183: I tensorflow/cc/saved_model/loader.cc:233] Restoring SavedModel bundle.
2024-02-06 17:35:56.319034: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
2024-02-06 17:35:56.991711: I tensorflow/cc/saved_model/loader.cc:217] Running initialization op on SavedModel bundle at path: /tmp/folderEWbC2X/1/model.savedmodel
2024-02-06 17:35:58.817038: E tensorflow/core/grappler/optimizers/tfg_optimizer_hook.cc:134] tfg_optimizer{any(tfg-consolidate-attrs,tfg-toposort,tfg-shape-inference{graph-version=0},tfg-prepare-attrs-export)} failed: INVALID_ARGUMENT: Unable to find OpDef for Init
While importing function: __inference_call_7323150
when importing GraphDef to MLIR module in GrapplerHook
- Send a serving request or run perf_analyzer; the same error still appears.
Note that the same model can be deployed and tested successfully with merlin-tensorflow:23.02 and 23.06 by setting LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so
From the brief reproduction steps you provided, I still can't tell at which specific step you hit these errors. I can only guess that you successfully completed model training and created_and_save_inference_graph, and then hit the error in the "Deploy SavedModel using HPS with Triton TensorFlow Backend" step.
Since we do not have the same AWS environment, we have not been able to reproduce the issue on a local machine (T4/V100, Intel CPU, Ubuntu 22.04 with the 23.12 container).
However, one important note: LD_PRELOAD only needs to be set when launching tritonserver (Cell 13 in the notebook), as shown below; it is the registration mechanism for the custom op required by the Triton server. There is no need to set LD_PRELOAD in any other step.
LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so tritonserver --model-repository=/hugectr/hps_tf/notebooks/model_repo --backend-config=tensorflow,version=2 --load-model=hps_tf_triton --model-control-mode=explicit
Hi @yingcanw,
The issue seems to circle back to the initial problem discussed in this thread: #440 (comment). When I set LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so when launching tritonserver exactly as you describe above, I encountered a segmentation fault. This error appears to be consistent across the merlin-tensorflow images from versions 23.08 to 23.12. However, with the 23.12 image, the LD_PRELOAD path pointing to Python 3.8 libraries should no longer be applicable, as you mentioned. Could you attempt to reproduce this error using the 23.09 container and share any findings?
Thank you for your corrections. There is a typo here: we upgraded Python to 3.10 in 23.08, and we need to update the notebook to fix the Triton server launch command.
However, I still haven't reproduced the issue you mentioned on 23.09. I do want to emphasize the difference in #440 (comment): users are asked not to set the LD_PRELOAD variable independently (please pay attention to the bold part in the log). LD_PRELOAD is used as a prefix to the tritonserver launch command, not as an environment variable set on its own.
Hope the above information makes it clearer how to solve the problem.
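To make that distinction concrete, here is a generic shell sketch of the two forms, using a harmless variable (FOO and printenv stand in for LD_PRELOAD and tritonserver; nothing here is HPS-specific):

```shell
# Form used by the notebook: VAR=value command
# The variable exists only for that single process.
FOO=hello printenv FOO

# The variable is NOT set in the shell afterwards, so other processes
# launched later are unaffected; this is why the prefix form is safer
# than exporting LD_PRELOAD globally.
printenv FOO || echo "FOO is not set in the shell itself"
```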
docker run --gpus=all --privileged=true --net=host --shm-size 16g -it -v ${PWD}:/hugectr/ nvcr.io/nvidia/merlin/merlin-tensorflow:23.09 /bin/bash
==================================
== Triton Inference Server Base ==
==================================

NVIDIA Release 23.06 (build 62878575)
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.1 driver version 530.30.02 with kernel driver version 525.85.12.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so tritonserver --model-repository=/hugectr/hugectr/hps_tf/notebooks/model_repo --backend-config=tensorflow,version=2 --load-model=hps_tf_triton --model-control-mode=explicit
I0207 07:40:33.150916 129 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f9316000000' with size 268435456
I0207 07:40:33.156832 129 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
W0207 07:40:33.906804 129 server.cc:248] failed to enable peer access for some device pairs
W0207 07:40:33.922110 129 model_lifecycle.cc:108] ignore version directory 'hps_tf_triton_sparse_0.model' which fails to convert to integral number
I0207 07:40:33.922153 129 model_lifecycle.cc:462] loading: hps_tf_triton:1
I0207 07:40:34.251125 129 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0207 07:40:34.251157 129 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.13
I0207 07:40:34.251161 129 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.13
2024-02-07 07:40:37.339528: I tensorflow/cc/saved_model/loader.cc:334] SavedModel load for tags { serve }; Status: success: OK. Took 89817 microseconds.
I0207 07:40:37.340078 129 model_lifecycle.cc:815] successfully loaded 'hps_tf_triton'
I0207 07:40:37.340295 129 server.cc:603]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0207 07:40:37.340372 129 server.cc:630]
+------------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+------------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| tensorflow | /opt/tritonserver/backends/tensorflow/libtriton_tensorflow.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.00000 |
| | | 0","version":"2","default-max-batch-size":"4"}} |
+------------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
I0207 07:40:37.340439 129 server.cc:673]
+---------------+---------+--------+
| Model | Version | Status |
+---------------+---------+--------+
| hps_tf_triton | 1 | READY |
+---------------+---------+--------+
...
I0207 07:40:37.469253 129 grpc_server.cc:2445] Started GRPCInferenceService at 0.0.0.0:8001
I0207 07:40:37.469536 129 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0207 07:40:37.511302 129 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002
Hi @yingcanw,
Quick update: I believe I figured out the root cause of the seg fault. It appears to be related to configuring the --model-repository flag to point to a remote S3 bucket. My suspicion is that the underlying issue is with aws-sdk-cpp. This is based on observing similar errors, such as "Error: free(): invalid pointer", when manually interacting with S3 objects inside the container.
Steps to reproduce the behavior:
- Launch the merlin-tensorflow:23.09 container.
- SSH into the container.
- Execute the following command. This results in a segmentation fault and core dump.
name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
tritonserver --model-repository=s3://model_repo \
--backend-config=tensorflow,version=2 --load-model=$MODEL_ID --model-control-mode=explicit \
--grpc-port=6000 --metrics-port=80 --allow-metrics=true --allow-gpu-metrics=true --strict-readiness=true
2024-02-08 08:25:42.570293: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
I0208 08:25:42.928109 28 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f80ba000000' with size 268435456
I0208 08:25:42.930215 28 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
Segmentation fault (core dumped)
- In contrast, when the S3 model_repo is downloaded to a local directory and used as follows, no segmentation fault occurs and the model starts normally.
name@machine:/opt/tritonserver# LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
tritonserver --model-repository=/tmp/models \
--backend-config=tensorflow,version=2 --load-model=$MODEL_ID --model-control-mode=explicit \
--grpc-port=6000 --metrics-port=80 --allow-metrics=true --allow-gpu-metrics=true --strict-readiness=true
2024-02-08 08:27:19.845263: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
I0208 08:27:20.222846 33 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fcd94000000' with size 268435456
I0208 08:27:20.225011 33 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0208 08:27:20.230161 33 model_lifecycle.cc:462] loading: 1200429:1
I0208 08:27:20.534202 33 tensorflow.cc:2577] TRITONBACKEND_Initialize: tensorflow
I0208 08:27:20.534240 33 tensorflow.cc:2587] Triton TRITONBACKEND API version: 1.13
I0208 08:27:20.534256 33 tensorflow.cc:2593] 'tensorflow' TRITONBACKEND API version: 1.13
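The local-directory workaround above can be scripted as a sketch like the following. The bucket name s3://model_repo comes from the command in this thread, LOCAL_REPO is an assumed placeholder, and the sync step requires valid AWS credentials inside the container:

```shell
# Mirror the remote model repository to a local path, then point
# tritonserver's --model-repository at the local copy instead of s3://,
# sidestepping the suspected aws-sdk-cpp crash.
LOCAL_REPO=/tmp/models
mkdir -p "$LOCAL_REPO"
if command -v aws >/dev/null 2>&1; then
    aws s3 sync s3://model_repo "$LOCAL_REPO"
else
    echo "aws CLI not found; copy the model repository by other means"
fi
```

tritonserver is then launched with --model-repository="$LOCAL_REPO", exactly as in the second command above.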
Thanks a lot for your update. I think this error output can easily be misunderstood (we have verified that if the LD_PRELOAD parameter is set, circular dependencies will cause a seg fault), but we don't have the same AWS environment to reproduce this problem.
Due to lazy initialization, HPS is not initialized until the first inference request is processed (although HPS currently does not support parsing embeddings from a remote repo, that would produce "file cannot be opened" error messages rather than a seg fault). So I think you may need to submit an issue to tensorflow_backend.
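The lazy-initialization behavior described above can be illustrated with a minimal, purely hypothetical Python sketch (none of these names are the actual HPS API): the lookup backend is only constructed on the first request, so a bad or remote path would surface at first inference rather than at server startup.

```python
class LazyLookup:
    """Toy stand-in for a lazily initialized embedding lookup service."""

    def __init__(self, path):
        self.path = path
        self._table = None  # nothing is opened yet at "server start"

    def lookup(self, key):
        if self._table is None:       # first request triggers initialization
            self._table = self._load()
        return self._table.get(key, 0.0)

    def _load(self):
        # A remote or unreadable path would fail *here*, at first request,
        # not in __init__; here we just return a dummy table.
        return {1: 0.5, 2: 0.25}


svc = LazyLookup("s3://model_repo/emb")  # constructing is cheap and safe
print(svc.lookup(1))  # prints 0.5 -- initialization happens now
```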