els-rd / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀

Home Page: https://els-rd.github.io/transformer-deploy/

License: Apache License 2.0

Dockerfile 0.45% Python 99.11% Makefile 0.44%

deep-learning deployment inference machine-learning natural-language-processing server

transformer-deploy's Issues

How to run inference for T5 tensorrt model deployed on nvidia triton?

I have deployed a T5 TensorRT model on NVIDIA Triton server with the config.pbtxt file below, but I am having trouble running inference on it with the Triton client.

According to config.pbtxt, the TensorRT model takes 4 inputs, including the decoder input ids. But how can we send decoder inputs to the model? I thought they were supposed to be generated from the model's output.

Is there any way to run inference using the Triton client?

name: "tensorrt_model"
platform: "tensorrt_plan"
max_batch_size: 0
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "decoder_input_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "decoder_attention_mask"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  }
]
output [
  {
    name: "last_hidden_state"
    data_type: TYPE_FP32
    dims: [ -1, -1, 768 ]
  },
  {
    name: "input.151"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
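
On the question of where decoder_input_ids comes from: with an encoder-decoder model, the client usually builds the decoder input itself, one token per request, starting from a start token. The sketch below shows that greedy loop in plain Python. Everything named in it is an assumption for illustration: `next_token_logits` stands in for whatever call sends the four tensors to the server and turns the response into scores for the next token (the outputs listed in the config above may need an extra projection step that is not shown), and the start and end token ids (0 and 1, the values used by the stock T5 checkpoints) should be checked against the actual model config.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: (input_ids, attention_mask, decoder_input_ids,
# decoder_attention_mask) -> scores over the vocabulary for the next token.
StepFn = Callable[[List[int], List[int], List[int], List[int]], Sequence[float]]


def greedy_decode(
    input_ids: List[int],
    next_token_logits: StepFn,
    start_token_id: int = 0,  # assumption: T5 starts decoding from the pad token
    eos_token_id: int = 1,  # assumption: T5 end-of-sequence id
    max_new_tokens: int = 32,
) -> List[int]:
    """Grow decoder_input_ids one token at a time until EOS or the length limit."""
    attention_mask = [1] * len(input_ids)
    decoder_input_ids = [start_token_id]
    for _ in range(max_new_tokens):
        decoder_attention_mask = [1] * len(decoder_input_ids)
        scores = next_token_logits(
            input_ids, attention_mask, decoder_input_ids, decoder_attention_mask
        )
        # Greedy choice: index of the highest score.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        decoder_input_ids.append(next_id)
        if next_id == eos_token_id:
            break
    return decoder_input_ids[1:]  # drop the start token
```

Each loop iteration corresponds to one inference request, so the decoder inputs are never known up front; they are the tokens generated so far.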

T5 demo breaking

Hi, thank you for the recent updates on T5!

I have been testing the T5 demo notebook and I noticed a part breaking. The bitwise operator does not work with floating points. (torch.any(tensor < resolution & tensor > -resolution & tensor != 0))

Environment: Dockerfile master + pip install nvtx seaborn

Additional information:
- nvidia-tensorrt==8.2.4.2
- nvtx==0.2.5
- onnx==1.11.0
- nvidia-cublas-cu11==2022.4.8
- nvidia-cublas-cu117==11.10.1.25
- nvidia-cuda-runtime-cu11==2022.4.25
- nvidia-cuda-runtime-cu117==11.7.60
- nvidia-cudnn-cu11==2022.5.19
- nvidia-cudnn-cu116==8.4.0.27

Steps to reproduce: Run T5 demo notebook to cell In[5]:

def get_random_input_encoder() -> Dict[str, torch.Tensor]:
    max_seq = 512
    seq_len = random.randint(a=1, b=max_seq)
    batch = max_seq // seq_len
    random_input_ids = torch.randint(
        low=0, high=tokenizer.vocab_size, size=(batch, seq_len), dtype=torch.int32, device="cuda"
    )
    inputs = {"input_ids": random_input_ids}
    return inputs


keep_fp32_encoder = get_keep_fp32_nodes(onnx_model_path=encoder_model_path, get_input=get_random_input_encoder)
assert len(keep_fp32_encoder) > 0
enc_model_onnx = convert_fp16(onnx_model=encoder_model_path, nodes_to_exclude=keep_fp32_encoder)
save_onnx(proto=enc_model_onnx, model_path=encoder_fp16_model_path)

del enc_model_onnx
torch.cuda.empty_cache()
gc.collect()

Result: Error

RuntimeError                              Traceback (most recent call last)
Input In [12], in <cell line: 12>()
      8     inputs = {"input_ids": random_input_ids}
      9     return inputs
---> 12 keep_fp32_encoder = get_keep_fp32_nodes(onnx_model_path=encoder_model_path, get_input=get_random_input_encoder)
     13 assert len(keep_fp32_encoder) > 0
     14 enc_model_onnx = convert_fp16(onnx_model=encoder_model_path, nodes_to_exclude=keep_fp32_encoder)

File /usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/ort_utils.py:372, in get_keep_fp32_nodes(onnx_model_path, get_input, early_stop, device)
    368 inputs = get_input()
    369 outputs: Dict[str, torch.Tensor] = inference_onnx_binding(
    370     model_onnx=ort_model_fp32_all_nodes, inputs=inputs, device=device, binding=ort_binding, clone_tensor=False
    371 )
--> 372 keep_node_io = find_node_fp32(graph=output_mapping, output_nodes=outputs)
    374 nodes_to_add = [n for n in keep_node_io if n not in keep_fp32_nodes]
    375 keep_fp32_nodes += nodes_to_add

File /usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/ort_utils.py:304, in find_node_fp32(graph, output_nodes)
    299     # out of FP16 range
    300     print("Tensor is ", tensor)
    301     if (
    302         torch.any(tensor > max_float16)
    303         or torch.any(tensor < min_float16)
--> 304         or (torch.any(tensor < resolution & tensor > -resolution & tensor != 0))  # limited memory footprint
    305     ):
    306         keep_fp32.append(graph[k])
    307 return keep_fp32

RuntimeError: "bitwise_and_cuda" not implemented for 'Float'
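
The error is consistent with Python's operator precedence: `&` binds more tightly than `<`, `>` and `!=`, so the unparenthesized condition evaluates `resolution & tensor` first, which is a bitwise AND on floating-point values. The sketch below reproduces the same parse with plain Python floats (no PyTorch needed) and shows the parenthesized form. The function names are made up for illustration, and applying the same parentheses to the tensor expression is a suggestion, not the project's confirmed fix.

```python
def is_tiny_nonzero_unparenthesized(value: float, resolution: float) -> bool:
    # Parsed as: value < (resolution & value) > (-resolution & value) != 0
    # The bitwise AND on floats raises TypeError before any comparison runs.
    return value < resolution & value > -resolution & value != 0


def is_tiny_nonzero(value: float, resolution: float) -> bool:
    # Each comparison is grouped first, then the boolean results are combined.
    return (value < resolution) and (value > -resolution) and (value != 0)


# Tensor equivalent of the grouped form (shown as a comment, not executed here):
# torch.any((tensor < resolution) & (tensor > -resolution) & (tensor != 0))
```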

Support private HuggingFace Hub models?

I think in order to support private HF Hub models, invocations of .from_pretrained, e.g. here, would need to have a parameter, use_auth_token. This parameter defaults to None. Setting it to True uses a local cached auth token (from calling $ transformers-cli login). It can also be set to a string, the API Key, found at https://huggingface.co/settings/token (or https://huggingface.co/organizations/ORG_NAME/settings/token for organizations).

Would you be open to a PR which adds this? I did something similar in fastT5.
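
The three values described above (unset, True, or an API key string) could be mapped from a command-line option roughly as follows. This is only a sketch: `resolve_auth_token` and the `"cached"` keyword are hypothetical names, and the commented call at the end assumes the `use_auth_token` keyword of `from_pretrained` mentioned in this issue.

```python
from typing import Optional, Union


def resolve_auth_token(cli_value: Optional[str]) -> Union[None, bool, str]:
    """Translate a CLI string into a value suitable for `use_auth_token`."""
    if cli_value is None or cli_value == "":
        return None  # public model: no authentication
    if cli_value == "cached":  # hypothetical keyword, chosen for this sketch
        return True  # reuse the token stored by `transformers-cli login`
    return cli_value  # anything else is treated as the API key itself


# Intended use (not executed here):
# AutoModel.from_pretrained(model_name, use_auth_token=resolve_auth_token(args.auth_token))
```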

Installing transformers inside nvidia docker container

Trying to run triton inference server using

docker run --rm -p8005:8005 -p8003:8003 -p8004:8004 -v/home/test/triton-serve/server/docs/examples/model_repository/triton_models:/models nvcr.io/nvidia/tritonserver:21.12-py3 tritonserver --model-repository=/models

Gives below error:
UNAVAILABLE: Internal: ModuleNotFoundError: No module named 'transformers'

Feature extraction/dense embeddings Query inference error

Build OnnxRuntime Error

Hi, I was following the compilation steps in the t5 notebook to build OnnxRuntime. But when I run
git checkout -b fix_if e1c04eed29d48f295de1cfbd48713158537cdaa7, the output is:
fatal: reference is not a tree: e1c04eed29d48f295de1cfbd48713158537cdaa7.

Previously I ran the following, as suggested:
git clone --recursive https://github.com/Microsoft/onnxruntime
cd onnxruntime

What could be the reason?

Support other tasks/architectures?

First off: thank you! This is a great project, I'm really grateful you released it publicly.

From what I can tell, this supports encoder-only architectures, and the Sequence Classification task (ex). Am I correct? If so, are there plans to support, or interest in supporting, other architectures (encoder/decoder, decoder-only) and/or tasks (Token Classification and Masked token prediction for encoder-only architectures, or Seq2SeqLM for the other architectures)?

Out of memory error for batch size more than 1 for T5 models

Hey, first of all, thanks for creating this amazing library!

I'm following your T5 implementation with TensorRT:

transformer-deploy/t5.py

Line 222 in b52850d

    
           input_id_shape = TensorRTShape(min_shape=[5, 1], optimal_shape=[5, 500], max_shape=[5, 500], input_name="input_ids")

I'm trying to convert the ONNX version of the T5 model to a TensorRT engine using your build_engine method:

transformer-deploy/src/transformer_deploy/backends/trt_utils.py

Line 64 in 1f2d2c1

def build_engine(

It works fine for a batch size of 1, but for a batch size > 1 the build takes much longer (almost an hour just for the t5-small encoder), and even then the engine does not build successfully. I get the following error:

[03/18/2022-12:51:55] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::161] Error Code 2: OutOfMemory (no further information)
[03/18/2022-12:51:55] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::161] Error Code 2: OutOfMemory (no further information)
[03/18/2022-12:51:55] [TRT] [E] 10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[encoder.embed_tokens.weight...Mul_406]}.)
[03/18/2022-12:51:55] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Traceback (most recent call last):
  File "export_onnx_to_trt.py", line 100, in <module>
    build_t5_engine(onnx_encoder_path, trt_encoder_path, [input_id_shape])
  File "export_onnx_to_trt.py", line 86, in build_t5_engine
    engine: ICudaEngine = build_engine(
  File "/app/utils.py", line 209, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f380bbf8930>, None

Some system info, in case it helps:

trt+cuda - 8.2.1-1+cuda11.4
os - ubuntu 20.04.3
gpu - T4 with 15GB memory

The errors suggest I need more GPU memory. How much GPU memory did you use for a batch size of 5? Or am I missing something?

I would really appreciate any help, thank you!

Calibration failure occurred with no scaling factors detected

Hey,

first of all, thanks a lot for your great work. This repo was already a great help to me.

With your quantization update for INT8, however, I ran into a problem. As soon as I activate --quantization, I get the following error:

[01/14/2022-11:18:37] [TRT] [W] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[01/14/2022-11:18:37] [TRT] [E] 4: [standardEngineBuilder.cpp::initCalibrationParams::1402] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
[01/14/2022-11:18:37] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )

Traceback (most recent call last):
  File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 326, in <module>
    entrypoint()
  File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 322, in entrypoint
    main(commands=args)
  File "/data/repos/transformer-deploy/src/transformer_deploy/convert.py", line 216, in main
    engine: ICudaEngine = build_engine(
  File "/data/repos/transformer-deploy/src/transformer_deploy/backends/trt_utils.py", line 181, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7feb14128e30>, None

The problem in the traceback is then just that the trt_engine will be None. I don't get any other warnings or errors, so I'm a bit at a loss. I've tried with distilroberta-base and also with bert-base-uncased, but I get the same error each time. Did you, by any chance, run into the same problem at some point in time or do you see what the issue may be?

Thanks a lot in advance!
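
Both this traceback and the out-of-memory one in the previous issue end in the same TypeError, because the serialized engine is None by the time it reaches deserialize_cuda_engine. A guard like the following would surface the real build failure earlier. It is a sketch, not the library's code: `deserialize_or_fail` is a hypothetical helper, and `runtime` is any object exposing `deserialize_cuda_engine`, as the TensorRT runtime in the traceback does.

```python
from typing import Any, Optional


def deserialize_or_fail(runtime: Any, serialized_engine: Optional[bytes]) -> Any:
    """Deserialize an engine, failing with a readable message if the build returned None."""
    if serialized_engine is None:
        raise RuntimeError(
            "TensorRT engine build returned None; check the [TRT] [E] lines logged "
            "before this point (for example calibration or out-of-memory errors)."
        )
    return runtime.deserialize_cuda_engine(serialized_engine)
```

With this check, the exception points at the builder log instead of at an argument-type mismatch.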

add more benchmarks

more sizes
more models

Speed difference ONNX vs TensorRT with samples sorted by sequence length

I noticed something unexpected when comparing two scenarios for a model converted via ONNX and TensorRT (distilroberta with classification head):

Scenario 1: I use a dataset with varying sentence lengths (~20-60 tokens) and run it randomly sampled through both models
Scenario 2: I use the same dataset but sort the sentences by sentence length (decreasing) before running it through both models

Result: The TensorRT model does not seem to care about the sequence lengths and keeps the same speed for both scenarios. The ONNX model, however, gets almost twice as fast when I use the second scenario.

I was wondering whether TensorRT's optimization somehow requires padding to the max length internally. I searched for a parameter or a reason for this behavior but couldn't find anything useful. For conversion, I set the seq-len parameter to 1 60 60.

I was wondering if perhaps someone else has already observed this and knows the reason / a solution.
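
The two scenarios can be reproduced with a small timing harness such as the one below. It is a sketch under assumptions: `infer` is a placeholder for whichever backend call is being measured (ONNX Runtime or TensorRT), samples are token-id sequences, and the function name is made up for this example.

```python
import random
import statistics
import time
from typing import Callable, Dict, List, Sequence


def time_orderings(
    samples: Sequence[Sequence[int]],
    infer: Callable[[Sequence[int]], object],
    seed: int = 0,
) -> Dict[str, float]:
    """Return the mean per-sample latency (seconds) for shuffled and length-sorted input."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    by_length_desc = sorted(samples, key=len, reverse=True)

    def mean_latency(ordered: List[Sequence[int]]) -> float:
        timings = []
        for sample in ordered:
            start = time.perf_counter()
            infer(sample)
            timings.append(time.perf_counter() - start)
        return statistics.mean(timings)

    return {
        "random": mean_latency(shuffled),
        "sorted_desc": mean_latency(by_length_desc),
    }
```

For GPU backends the callable should block until the result is ready; otherwise the timer measures only the kernel launch and the two orderings will look identical.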

How to run convert_model in docker instead of local env?

How can I run convert_model -m roberta-large-mnli --backend tensorrt onnx pytorch --seq-len 16 128 128 --batch-size 1 32 32 in the built Docker env?

[ML] Test t5-small model in mixed precision FP16

done by @pommedeterresautee

Installing pytorch-quantization

I get the following error when pytorch-quantization is included in requirements.txt:

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting pytorch-quantization
  Downloading pytorch-quantization-0.0.1.dev5.tar.gz (7.9 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kmmchqi2/pytorch-quantization_5a45d9f46d524a108b6ec28eb5238500/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kmmchqi2/pytorch-quantization_5a45d9f46d524a108b6ec28eb5238500/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-_58u5ior
       cwd: /tmp/pip-install-kmmchqi2/pytorch-quantization_5a45d9f46d524a108b6ec28eb5238500/
  Complete output (16 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-kmmchqi2/pytorch-quantization_5a45d9f46d524a108b6ec28eb5238500/setup.py", line 150, in <module>
      raise RuntimeError(open("ERROR.txt", "r").read())
  RuntimeError:
  ###########################################################################################
  The package you are trying to install is only a placeholder project on PyPI.org repository.
  This package is hosted on NVIDIA Python Package Index.
  
  This package can be installed as:
  ```
  $ pip install nvidia-pyindex
  $ pip install pytorch-quantization
  ```
  ###########################################################################################
  
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/35/ea/c6c4ab73da4e36b9eddea7ff687b98e1bccb59bfb3bd0c24459914fb17f2/pytorch-quantization-0.0.1.dev5.tar.gz#sha256=4702207b088af5a1e58ee31d5ceee14aaa21bc3ef36b39ca996a6ee4d0ffb4dd (from https://pypi.org/simple/pytorch-quantization/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
  Downloading pytorch-quantization-0.0.1.dev4.tar.gz (4.1 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kmmchqi2/pytorch-quantization_b0afa379038d4cbc8e3d8b401f5d74f4/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kmmchqi2/pytorch-quantization_b0afa379038d4cbc8e3d8b401f5d74f4/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-s1sedxj7
       cwd: /tmp/pip-install-kmmchqi2/pytorch-quantization_b0afa379038d4cbc8e3d8b401f5d74f4/
  Complete output (15 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-kmmchqi2/pytorch-quantization_b0afa379038d4cbc8e3d8b401f5d74f4/setup.py", line 150, in <module>
      raise RuntimeError(open("ERROR.txt", "r").read())
  RuntimeError:
  ###########################################################################################
  The package you are trying to install is only a placeholder project on PyPI.org repository.
  This package is hosted on NVIDIA Python Package Index.
  
  This package can be installed as:
  ```
  $ pip install nvidia-pyindex
  $ pip install pytorch-quantization
 ```
  ###########################################################################################
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/d4/3e/e891628c040badc4d18ca48a28bf5a991654161fb32ee5f54ec2317e2664/pytorch-quantization-0.0.1.dev4.tar.gz#sha256=6fea1f1ba851353d65f08098fe19041cd045ca9239e98e5f7058cb1872b6ea57 (from https://pypi.org/simple/pytorch-quantization/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement pytorch-quantization (from versions: 0.0.1.dev4, 0.0.1.dev5)
ERROR: No matching distribution found for pytorch-quantization

I am testing this on pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel Docker container.

error in triton config files

After running convert.py on a sentence-transformer model, the Triton config is generated along with the ONNX model inside the triton_models directory.

However, when serving the model with that config, the error below is observed.

UNAVAILABLE: Invalid argument: model 'transformer_onnx_model', tensor 'output': the model expects 2 dimensions (shape [-1,-1]) but the model configuration specifies 2

"faster inference compared to vanilla Pytorch"-- Pytorch CPU or GPU?

Then, if you spend some time, you can build something over ONNX Runtime and Triton inference server. You will usually get from 2X to 4X faster inference compared to vanilla Pytorch. It's cool!

When you say "faster compared to vanilla Pytorch", are you saying faster than vanilla Pytorch in CPU or GPU?

big performance difference on tensorRT

Hi, I just tried the demo code below. In your results, [TensorRT (FP16)] is much faster than the others. However, the results I got are quite different: there is no such big difference between [TensorRT (FP16)] and the others (the output is attached). Do you know what might have happened, or how I can figure out the reason? Thank you.

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && \
    convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128"

Inference done on Tesla M60
latencies:
[Pytorch (FP32)] mean=6.31ms, sd=1.32ms, min=4.48ms, max=10.75ms, median=6.39ms, 95p=8.63ms, 99p=9.33ms
[Pytorch (FP16)] mean=8.81ms, sd=2.02ms, min=6.59ms, max=55.42ms, median=8.70ms, 95p=11.20ms, 99p=12.16ms
[TensorRT (FP16)] mean=4.59ms, sd=1.97ms, min=2.27ms, max=10.38ms, median=4.47ms, 95p=8.02ms, 99p=8.86ms
[ONNX Runtime (FP32)] mean=5.03ms, sd=2.00ms, min=2.64ms, max=10.45ms, median=5.16ms, 95p=8.37ms, 99p=9.17ms
[ONNX Runtime (optimized)] mean=5.19ms, sd=2.04ms, min=2.80ms, max=10.59ms, median=5.25ms, 95p=8.67ms, 99p=9.40ms
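
To put a number on "not such a big difference", the mean latencies can be pulled out of the report and compared. This is a small sketch using only figures posted above; the function names and the regular expression are illustrative, not part of the project.

```python
import re
from typing import Dict

# Matches lines such as "[Pytorch (FP32)] mean=6.31ms, sd=1.32ms, ..."
LINE = re.compile(r"\[(?P<name>[^\]]+)\]\s+mean=(?P<mean>[\d.]+)ms")


def mean_latencies(report: str) -> Dict[str, float]:
    """Extract the mean latency in milliseconds for each backend."""
    return {m.group("name"): float(m.group("mean")) for m in LINE.finditer(report)}


def speedup(latencies: Dict[str, float], baseline: str, candidate: str) -> float:
    """Ratio > 1 means the candidate is faster than the baseline."""
    return latencies[baseline] / latencies[candidate]


report = """
[Pytorch (FP32)] mean=6.31ms, sd=1.32ms
[TensorRT (FP16)] mean=4.59ms, sd=1.97ms
[ONNX Runtime (FP32)] mean=5.03ms, sd=2.00ms
"""
latencies = mean_latencies(report)
print(round(speedup(latencies, "Pytorch (FP32)", "TensorRT (FP16)"), 2))  # 1.37
```

On these numbers TensorRT (FP16) is about 1.37x faster than PyTorch (FP32) on the Tesla M60.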

fix model analyzer data shape

document functions

document functions
generate documentation (Sphinx?)

How does the local model generate configuration files

As in the title: files such as config.json, tokenizer.json and tokenizer_config.json.

convert_model: error: argument --task: invalid choice: 'token-classification' (choose from 'classification', 'embedding', 'text-generation')

While running the command

docker run -it --rm --gpus all   -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0   bash -c "cd /project && \
    convert_model -m \"kamalkraj/bert-base-cased-ner-conll2003\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128 \
    --task token-classification"

getting the following error:

usage: convert_model [-h] -m MODEL [-t TOKENIZER] [--task {classification,embedding,text-generation}] [--auth-token AUTH_TOKEN]
                     [-b BATCH_SIZE BATCH_SIZE BATCH_SIZE] [-s SEQ_LEN SEQ_LEN SEQ_LEN] [-q] [-w WORKSPACE_SIZE] [-o OUTPUT] [-n NAME]
                     [-v] [--backend [{onnx,tensorrt} [{onnx,tensorrt} ...]]] [-d {cpu,cuda}] [--nb-threads NB_THREADS]
                     [--nb-instances NB_INSTANCES] [--warmup WARMUP] [--nb-measures NB_MEASURES] [--seed SEED] [--atol ATOL]
convert_model: error: argument --task: invalid choice: 'token-classification' (choose from 'classification', 'embedding', 'text-generation')

Is token-classification supported yet? Thanks.
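
The message is what Python's argparse prints when a value falls outside an argument's `choices` list, so in this version the task is rejected before any conversion starts. A minimal reproduction, assuming the option is declared with the three choices shown in the usage text (the parser below is a stand-in, not the project's actual code):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser: only the --task option from the usage text is modelled.
    parser = argparse.ArgumentParser(prog="convert_model_sketch")
    parser.add_argument(
        "--task",
        default="classification",
        choices=["classification", "embedding", "text-generation"],
    )
    return parser


def is_supported_task(task: str) -> bool:
    """Return False when argparse rejects the task (it exits with an 'invalid choice' error)."""
    try:
        build_parser().parse_args(["--task", task])
    except SystemExit:
        return False
    return True
```

Supporting a new task would therefore mean extending the `choices` list and adding the matching conversion logic behind it.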

convert to onnx model takes more time on CPU

Converting the model to ONNX with the device set to CPU takes around 8 minutes:

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && \
    convert_model -m \"distilbert-base-uncased-finetuned-sst-2-english\" \
    --backend onnx \
    --seq-len 16 128 128 --device cpu"

The same model is converted to ONNX on GPU in less than 2 minutes:

docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && \
    convert_model -m \"distilbert-base-uncased-finetuned-sst-2-english\" \
    --backend onnx \
    --seq-len 16 128 128"

Using the default Hugging Face ONNX converter takes only ~20s:

python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english --feature sequence-classification    onnx/

@pommedeterresautee

Incorrect demo url in documentation

The link to demo page (https://els-rd.github.io/transformer-deploy/run/demo/infinity) in the documentation page https://els-rd.github.io/transformer-deploy/run/ is not working. I think https://github.com/ELS-RD/transformer-deploy/tree/main/demo/infinity may be the right link.

do you have any plan to support wav2vec?

hello, do you have any plan to support wav2vec?

update triton server to 21.11-py3

Support for T5 models

When will we get support for T5 models?

Support for gpt2 quantization

I tried to quantize (add QDQ layers) the gpt2 model:

batch_size = 8
with QATCalibrate(method="histogram", percentile=99.999) as qat:
    model_q = self.model.cuda()
    qat.setup_model_qat(model_q)  # prepare quantizer to any model

    with torch.no_grad():
        for start_index in range(0, 650, batch_size):
            end_index = start_index + batch_size
            data = self.data[start_index:end_index]
            data = self.tokenizer(data, return_tensors='pt', padding=True, truncation=True, max_length=512)
            input_torch = {
                k: torch.tensor(v, dtype=torch.long, device="cuda")
                for k, v in data.items()
                if k in ["input_ids", "attention_mask", "token_type_ids"]
            }
            model_q(**input_torch)

but no QDQ layers were inserted. I assume that you don't support GPT2 yet. Do you plan to add it?

Docker image timeout

Error response from daemon: Get https://ghcr.io/v2/: proxyconnect tcp: dial tcp: lookup http.docker.internal on 192.168.65.5:53: read udp 192.168.65.4:63771->192.168.65.5:53: i/o timeout

I cannot download the docker image!

nvidia-pyindex installation

Installation of packages which depend on nvidia-pyindex fails if nvidia-pyindex is not installed before installing transformer_deploy.
My initial guess was that this occurs because setuptools does not install packages in the order specified in the extras_require argument. I tried adding the package to the setup_requires and install_requires arguments of setuptools.setup in setup.py, but it did not help.

'assert num_heads > 0' error with DistilBert

I get the following error when I try to optimize distilbert:

AssertionError                            Traceback (most recent call last)
<timed eval> in <module>

/opt/conda/lib/python3.7/site-packages/transformer_deploy/convert.py in main(input_args)
    245             onnx_path=onnx_model_path,
    246             onnx_optim_fp16_path=onnx_optim_fp16_path,
--> 247             use_cuda=True,
    248         )
    249         onnx_model = create_model_for_provider(path=onnx_optim_fp16_path, provider_to_use="CUDAExecutionProvider")

/opt/conda/lib/python3.7/site-packages/transformer_deploy/backends/ort_utils.py in optimize_onnx(onnx_path, onnx_optim_fp16_path, use_cuda)
     72         num_heads=0,  # automatic detection don't work with opset 13
     73         hidden_size=0,  # automatic detection
---> 74         optimization_options=optimization_options,
     75     )
     76 

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/optimizer.py in optimize_model(input, model_type, num_heads, hidden_size, optimization_options, opt_level, use_gpu, only_onnxruntime)
    289 
    290     if not only_onnxruntime:
--> 291         optimizer.optimize(optimization_options)
    292 
    293     # Remove the temporary model.

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in optimize(self, options, add_dynamic_axes)
    317             if options is not None:
    318                 self.attention_mask.set_mask_format(options.attention_mask_format)
--> 319             self.fuse_attention()
    320 
    321         self.fuse_shape()

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/onnx_model_bert.py in fuse_attention(self)
     52 
     53     def fuse_attention(self):
---> 54         self.attention_fusion.apply()
     55 
     56     def fuse_gelu(self):

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_base.py in apply(self)
     41                     raise Exception("Can not find node in any graphs")
     42                 self.this_graph_name = graph.name
---> 43                 self.fuse(node, input_name_to_nodes, output_name_to_node)
     44 
     45         op_list = [node.op_type for node in self.nodes_to_add]

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in fuse(self, normalize_node, input_name_to_nodes, output_name_to_node)
    444             new_node = self.create_attention_node(mask_index, matmul_q, matmul_k, matmul_v, add_q, add_k, add_v,
    445                                                   q_num_heads, self.hidden_size, root_input,
--> 446                                                   attention_last_node.output[0], add_qk_str)
    447             if new_node is None:
    448                 return

/opt/conda/lib/python3.7/site-packages/onnxruntime/transformers/fusion_attention.py in create_attention_node(self, mask_index, q_matmul, k_matmul, v_matmul, q_add, k_add, v_add, num_heads, hidden_size, input, output, add_qk_str)
    161             Union[NodeProto, None]: the node created or None if failed.
    162         """
--> 163         assert num_heads > 0
    164 
    165         if hidden_size > 0 and (hidden_size % num_heads) != 0:

AssertionError:

While trying to resolve the issue, I observed that it did not occur when optimizer from onnxruntime-tools was used with opt_level 99 (instead of the one in onnxruntime.transformers). But the code then threw Exceptions due to some skip layer normalization issues.

Installation instructions

The line

pip3 install .[GPU] -f https://download.pytorch.org/whl/cu113/torch_stable.html

did not work for me as written, but it did with quotes:

pip3 install ".[GPU]" -f https://download.pytorch.org/whl/cu113/torch_stable.html

Also, the instructions say

make docker_build

but the Makefile does not contain such a rule, only build_docker so the command should probably be

make build_docker.

Output difference between ONNX and Pytorch in T5 notebook

Hi @pommedeterresautee, I just checked out the latest T5 optimization notebook. Everything seems to work well for the pretrained T5-base model from Hugging Face, but when I try to optimize the model from valhalla/t5-base-qa-qg-hl, I notice a difference in output between ONNX and PyTorch.

Below is a code snippet modified from the T5 notebook.

input_text=''' Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum
and first released in 1991, Python's design philosophy emphasizes code
readability with its notable use of significant whitespace.'''
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs.input_ids.cuda()

torch.cuda.synchronize()
with torch.inference_mode():
    print("Onnx:")
    print(
        tokenizer.decode(
            model_gen.generate(
                inputs=input_ids,
                min_length=3,
                max_length=60,
                num_beams=4,
                no_repeat_ngram_size=2,
            )[0],
            skip_special_tokens=True,
        )
    )
    print("Pytorch:")
    print(
        tokenizer.decode(
            pytorch_model.generate(
                input_ids=input_ids,
                min_length=3,
                max_length=60,
                num_beams=4,
                no_repeat_ngram_size=2,
            )[0],
            skip_special_tokens=True,
        )
    )

And below are the outputs:

Onnx:
What is an interpreted, high-level, general-purpose programming language?
Pytorch:
What language was created by Guido van Rossum?

ONNX opset mismatch error.

Full error message

+--------------------------------+---------+----------------------------------------------------------------------------------------------------------------------------------+
| Model                          | Version | Status                                                                                                                           |
+--------------------------------+---------+----------------------------------------------------------------------------------------------------------------------------------+
| transformer_onnx_model         | 1       | UNAVAILABLE: Internal: onnx runtime error 1: Load model from /models/transformer_onnx_model/1/model.bin failed:/workspace/onnxru |
|                                |         | ntime/onnxruntime/core/graph/model_load_utils.h:47 void onnxruntime::model_load_utils::ValidateOpsetForDomain(const std::unorder |
|                                |         | ed_map<std::__cxx11::basic_string<char>, int>&, const onnxruntime::logging::Logger&, bool, const string&, int) ONNX Runtime only |
|                                |         |  *guarantees* support for models stamped with official released onnx opset versions. Opset 3 is under development and support fo |
|                                |         | r this is limited. The operator schemas and or other functionality may change before next ONNX release and in this case ONNX Run |
|                                |         | time will not guarantee backward compatibility. Current official support for domain ai.onnx.ml is till opset 2.                  |
| transformer_onnx_tokenize      | 1       | READY                                                                                                                            |
| transformer_tensorrt_inference | 1       | READY                                                                                                                            |
| transformer_tensorrt_model     | 1       | READY                                                                                                                            |
| transformer_tensorrt_tokenize  | 1       | READY                                                                                                                            |
+--------------------------------+---------+----------------------------------------------------------------------------------------------------------------------------------+

cmd to reproduce

git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy
# Build
make docker_build
# Generate model
docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
  bash -c "cd /project && pip install protobuf==3.20.1 && \
    convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
    --backend tensorrt onnx \
    --seq-len 16 128 128"
# Run model
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.02-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

@pommedeterresautee

Transforming https://huggingface.co/csarron/roberta-base-squad-v1

Hi,

I wanted to double-check how we can convert question answering models like the one mentioned above (it is a question answering model, transformers.AutoModelForQuestionAnswering). For example, when using vanilla Python with this input:

{ "question": "Did the stock come down?", "context": "Some text about stocks, with reference if stocks went up or down..." }

the Q/A pipeline answers with a score, the start and end of the answer span, and the chunk of text between start and end, e.g.

{"score": 0.02739531360566616, "start": 103, "end": 173, "answer": "sent stocks sliding to their worst performance in months on Wednesday."}

After converting this model as a classification model, I send this input:

"inputs": [{"name": "TEXT", "shape": [2], "datatype": "BYTES", "data": ["Did the stock come down?", "\Some text about stocks, with reference if stocks went up or down..."]}]}

and I see the following output:

{"model_name":"roberta_onnx_inference","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"output","datatype":"FP32","shape":[1,2],"data":[-0.2091064453125,-0.06671142578125]}]}

As probably expected, this is not the output the model should produce. I am wondering how we can change or extend the library to generate correct outputs.
The model itself generates a set of tensors (the same shape as the encoded input) for the start and end of the sequence (https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/question_answering.py#L371). Here I am given just two numbers, so I cannot use the output to invoke the postprocessing logic of the QA pipeline.

Any support would be appreciated :).
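
For context, the core of that postprocessing is choosing the (start, end) token pair with the highest combined score. Below is a minimal sketch, assuming the exported model returned per-token start and end logits (the classification export above does not). The function name best_span and the max_answer_len parameter are illustrative and not part of transformer-deploy. The real pipeline additionally normalizes the scores and maps tokens back to character offsets.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 15) -> Tuple[int, int, float]:
    """Return (start, end, score) of the highest-scoring valid token span.

    A span is valid when start <= end and it covers at most max_answer_len tokens.
    The score is the sum of the start and end logits (hypothetical helper).
    """
    best = (0, 0, float("-inf"))
    for start, s_logit in enumerate(start_logits):
        # Only look at end positions that keep the span within max_answer_len.
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Toy logits for a 5-token input.
start, end, score = best_span([0.1, 2.0, 0.3, 0.0, -1.0],
                              [0.0, 0.2, 0.1, 3.0, 0.5])
print(start, end, score)
```

With two output tensors of this shape exposed by the server, the same selection could run client-side or in a Python backend step.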

query_body.bin contents

With an instance of the triton server up, running the test cURL command from the README file

# @ means no data conversion (curl feature)
curl -X POST  http://localhost:8000/v2/models/transformer_onnx_inference/versions/1/infer \
  --data-binary "@demo/query_body.bin" \
  --header "Inference-Header-Content-Length: 160"

gives me the error {"error":"unexpected inference output 'score' for model 'transformer_onnx_inference'"}. Looking at the triton_client.py script, it seems that score should perhaps be replaced by model_score in the query_body.bin file. However, after that change I still get an error: {"error":"failed to parse the request JSON buffer: Invalid value. at 160"}.

The triton_client.py script works fine for me, BTW.
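
One likely reason for the second error: the Inference-Header-Content-Length value has to equal the byte length of the JSON part of the body, so renaming a field inside query_body.bin changes the value the header needs. Here is a hedged sketch of how such a body can be assembled, assuming Triton's binary tensor data extension (a JSON header followed by raw tensor bytes, with each BYTES element prefixed by its 4-byte little-endian length). The input name TEXT and output name output come from examples in this thread and may differ for your model; build_binary_request is a hypothetical helper.

```python
import json
import struct
from typing import List, Tuple


def build_binary_request(texts: List[str], input_name: str = "TEXT",
                         output_name: str = "output") -> Tuple[bytes, int]:
    """Return (request_body, json_header_length) for a BYTES input tensor."""
    # Each string is serialized as <uint32 little-endian length><utf-8 bytes>.
    payload = b"".join(
        struct.pack("<I", len(t.encode("utf-8"))) + t.encode("utf-8") for t in texts
    )
    header = {
        "inputs": [{
            "name": input_name,
            "shape": [len(texts)],
            "datatype": "BYTES",
            "parameters": {"binary_data_size": len(payload)},
        }],
        # Ask for a plain JSON answer instead of binary output data.
        "outputs": [{"name": output_name, "parameters": {"binary_data": False}}],
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return header_bytes + payload, len(header_bytes)


body, header_len = build_binary_request(["Hello world"])
# header_len is the value to send in the Inference-Header-Content-Length header.
print(header_len, len(body))
```

Writing body to a file and passing header_len in the cURL header keeps the two in sync after any edit to the JSON part.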

GPT2 has slow inference

Hello,

Your wrapper for GPT-2 does not support 'past_key_values' the way Hugging Face transformers does. I've seen your measurements in the GPT-2 demo, and at least for PyTorch they are not really representative: instead of simply calling the model with the same input every time, you should call the generate method.

I tried to run gpt2 in pytorch both on cpu and gpu (GPU: TESLA T4) with your sample text: "Here is some text to encode Hello World"

here are my results (vanilla pytorch):
gpu no cache: 14s/sequence
gpu cache: 3.6s/sequence

cpu no cache: 114s/sequence
cpu cache: 9.8s/sequence

Each result is the average of ten runs of the generate method, with number of beams = 5.

With greedy search the difference is smaller, but still significant:
cpu no cache: 29s
cpu cache: 4.8s

CPU: Intel(R) Xeon(R) Platinum 8259CL CPU
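
The gap between the cached and uncached numbers can be illustrated by counting how many token positions pass through the decoder during generation. This is a simplified sketch for greedy decoding only; it ignores beam search and the attention cost over cached keys, and the helper name tokens_processed is made up for illustration.

```python
def tokens_processed(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count token positions pushed through the decoder during generation."""
    if new_tokens <= 0:
        return 0
    if use_cache:
        # First step encodes the whole prompt, later steps only the newest token.
        return prompt_len + (new_tokens - 1)
    # Without cache, step i re-processes the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


for use_cache in (False, True):
    print(use_cache, tokens_processed(prompt_len=8, new_tokens=64, use_cache=use_cache))
```

The uncached count grows quadratically with the number of generated tokens, while the cached count grows linearly.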

Error in docker container using pip install

What is happening?
I'm using the docker container ghcr.io/els-rd/transformer-deploy:0.4.0 to run the embeddings example in the doc using the following command:

convert_model -m sentence-transformers/msmarco-distilbert-cos-v5 --backend onnx --task embedding --seq-len 128 128 128

When I launch the container and run the above command, I get the expected output without any issues. However, if I update the transformer-deploy package using pip install . and then run the above command, I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 388, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 355, in main
    ort_output, time_buffer = launch_inference(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 113, in launch_inference
    output = infer(batch_input)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 351, in infer_ort
    results = inference_onnx_binding(model_onnx=ort_model, inputs=inputs, device=commands.device)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/ort_utils.py", line 255, in inference_onnx_binding
    binding.synchronize_inputs()
AttributeError: 'IOBinding' object has no attribute 'synchronize_inputs'

I want to try something which requires changing some lines in the convert.py file. I was hoping to install using pip after making the change and running test command as illustrated above. Is there more to installing the package? Or is this a bug?

To Reproduce

git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy
# docker image may take a few minutes
docker pull ghcr.io/els-rd/transformer-deploy:0.4.0 

# run the docker container
docker run -it --rm --gpus all  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 bash

# run the following commands from inside the container
# run test command. This should run without any issues
convert_model -m sentence-transformers/msmarco-distilbert-cos-v5 --backend onnx --task embedding --seq-len 128 128 128

# install transformer-deploy from the latest commit
pip install .

# run the test command again and it will error out
convert_model -m sentence-transformers/msmarco-distilbert-cos-v5 --backend onnx --task embedding --seq-len 128 128 128

Summary
After installing the latest commit in the docker container, the convert command errors out. Is there a different installation procedure, or is this a bug?

Quantized model inference is slower with Triton server than directly in Python code (notebook)

I followed the quantization_end_to_end.ipynb guide to quantize my custom classification model, based on a 12-layer roberta-base, and got the following latency measurements for the different backends. The results look good:
batch size = 1
seq_length = 128
device = GPU Tesla T4

[Pytorch (FP32)] mean=9.66ms, sd=0.19ms, min=9.38ms, max=10.08ms, median=9.60ms, 95p=10.07ms, 99p=10.08ms
[Pytorch (FP16)] mean=9.84ms, sd=0.25ms, min=9.77ms, max=12.27ms, median=9.80ms, 95p=9.93ms, 99p=10.01ms

[ONNX (FP32)] mean=8.41ms, sd=0.70ms, min=7.92ms, max=10.01ms, median=8.08ms, 95p=10.00ms, 99p=10.01ms
[ONNX (FP16)] mean=2.57ms, sd=0.07ms, min=2.40ms, max=3.10ms, median=2.57ms, 95p=2.62ms, 99p=2.63ms

[TensorRT (FP16)] mean=2.54ms, sd=0.46ms, min=2.32ms, max=3.94ms, median=2.39ms, 95p=3.92ms, 99p=3.93ms
[TensorRT (INT-8)] mean=1.98ms, sd=0.17ms, min=1.87ms, max=3.57ms, median=1.95ms, 95p=2.06ms, 99p=2.30ms

But after I deployed the ONNX FP16 model and the TensorRT models (both FP16 and INT-8) with Triton Server and ran a stress test with JMeter, the results showed that the TensorRT INT-8 model is not faster than the FP16 model:
batch size = 1
seq_length = 128
threads = 20
GPU utilization between 93%~95%

Triton server with TensorRT INT-8 model, throughput = 398.7/sec
Triton server with TensorRT FP16 model, throughput = 399.1/sec
Triton server with ONNX FP16 model, throughput = 363.3/sec

I wonder what is wrong with Triton Server. Does it have an "int8 inference" option that I did not turn on?
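
To relate the two sets of numbers, a mean latency can be converted into a rough upper bound on single-stream throughput. This is only a back-of-the-envelope sketch: it assumes requests run strictly one after another and ignores tokenization, HTTP overhead, and batching. The helper name single_stream_throughput is illustrative.

```python
def single_stream_throughput(mean_latency_ms: float) -> float:
    """Requests per second if requests run strictly one after another."""
    return 1000.0 / mean_latency_ms


# Mean latencies reported by the notebook (batch size 1, sequence length 128).
notebook_latency_ms = {
    "TensorRT (INT-8)": 1.98,
    "TensorRT (FP16)": 2.54,
    "ONNX (FP16)": 2.57,
}
for name, latency in notebook_latency_ms.items():
    print(f"{name}: {single_stream_throughput(latency):.1f} req/s")
```

If the model alone were the bottleneck, the INT-8 bound would be clearly above the FP16 one. Nearly identical measured throughput for both may therefore point to time spent outside the model, though that is an assumption rather than something measured here.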

TRT error on fresh install

Running the example command on a fresh install, I get:

$ convert_model -m roberta-large-mnli --backend tensorrt onnx pytorch --seq-len 16 128 128 --batch-size 1 32 32

...
    engine = build_engine(
  File "/home/sam_havens/transformer-deploy/venv/lib/python3.8/site-packages/transformer_deploy/backends/trt_utils.py", line 181, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f5df2f18bb0>, None

Full error

$ convert_model -m roberta-large-mnli --backend tensorrt onnx pytorch --seq-len 16 128 128 --batch-size 1 32 32

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1421594533
[12/08/2021-00:18:52] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[12/08/2021-00:19:08] [TRT] [W] Output type must be INT32 for shape outputs
[12/08/2021-00:19:08] [TRT] [W] Output type must be INT32 for shape outputs
[12/08/2021-00:19:08] [TRT] [W] Output type must be INT32 for shape outputs
[12/08/2021-00:19:12] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[12/08/2021-00:19:12] [TRT] [W]  (# 1 (SHAPE input_ids))
[12/08/2021-00:19:12] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[12/08/2021-00:19:39] [TRT] [W] Skipping tactic 0 due to Myelin error: No results returned from cublas heuristic search
[12/08/2021-00:19:39] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[12/08/2021-00:19:39] [TRT] [W]  (# 1 (SHAPE input_ids))
[12/08/2021-00:19:39] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[12/08/2021-00:20:10] [TRT] [W] Skipping tactic 0 due to Myelin error: No results returned from cublas heuristic search
[12/08/2021-00:20:10] [TRT] [E] 10: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[roberta.embeddings.token_type_embeddings.weight...(Unnamed Layer* 3884) [Shuffle]]}.)
[12/08/2021-00:20:10] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Traceback (most recent call last):
  File "/home/sam_havens/transformer-deploy/venv/bin/convert_model", line 11, in <module>
    load_entry_point('transformer-deploy==0.1.1', 'console_scripts', 'convert_model')()
  File "/home/sam_havens/transformer-deploy/venv/lib/python3.8/site-packages/transformer_deploy/convert.py", line 129, in main
    engine = build_engine(
  File "/home/sam_havens/transformer-deploy/venv/lib/python3.8/site-packages/transformer_deploy/backends/trt_utils.py", line 181, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f5df2f18bb0>, None

If you have any ideas, I'd appreciate it! Thank you.

EDIT:

nvidia-smi

$ nvidia-smi
Wed Dec  8 00:35:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8    10W /  70W |    105MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1030      G   /usr/lib/xorg/Xorg                 95MiB |
|    0   N/A  N/A      1213      G   /usr/bin/gnome-shell                8MiB |
+-----------------------------------------------------------------------------+

pip freeze output

$ pip freeze
anyio==3.4.0
appdirs==1.4.4
asgiref==3.4.1
attrs==21.2.0
black==21.12b0
Brotli==1.0.9
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.9
click==8.0.3
colored==1.4.3
coloredlogs==15.0.1
cryptography==36.0.0
cycler==0.11.0
distro==1.6.0
docker==5.0.3
fastapi==0.70.0
filelock==3.4.0
flake8==4.0.1
flatbuffers==2.0
fonttools==4.28.3
gevent==21.8.0
geventhttpclient==1.5.3
greenlet==1.1.2
grpcio==1.42.0
gunicorn==20.1.0
h11==0.12.0
httplib2==0.20.2
huggingface-hub==0.2.1
humanfriendly==10.0
idna==3.3
iniconfig==1.1.1
isort==5.10.1
joblib==1.1.0
kiwisolver==1.3.2
llvmlite==0.37.0
Mako==1.1.6
MarkupSafe==2.0.1
matplotlib==3.5.0
mccabe==0.6.1
mpmath==1.2.1
mypy-extensions==0.4.3
numba==0.54.1
numpy==1.21.4
nvidia-cublas-cu11==2021.10.25
nvidia-cublas-cu115==11.7.3.1
nvidia-cuda-runtime-cu11==2021.10.25
nvidia-cuda-runtime-cu115==11.5.50
nvidia-cudnn-cu11==2021.11.18
nvidia-cudnn-cu115==8.3.0.98
nvidia-pyindex==1.0.9
nvidia-tensorrt==8.2.1.8
onnx==1.10.2
onnx-graphsurgeon==0.3.14
onnxruntime-gpu==1.10.0
packaging==21.3
pathspec==0.9.0
pdfkit==1.0.0
Pillow==8.4.0
platformdirs==2.4.0
pluggy==1.0.0
polygraphy==0.33.2
prometheus-client==0.12.0
protobuf==3.19.1
psutil==5.8.0
py==1.11.0
pycodestyle==2.8.0
pycparser==2.21
pycuda==2021.1
pydantic==1.8.2
pyflakes==2.4.0
pyparsing==3.0.6
pytest==6.2.5
python-dateutil==2.8.2
python-rapidjson==1.5
pytools==2021.2.9
PyYAML==6.0
regex==2021.11.10
requests==2.26.0
sacremoses==0.0.46
sentencepiece==0.1.96
setuptools-scm==6.3.2
six==1.16.0
sniffio==1.2.0
starlette==0.16.0
sympy==1.9
tokenizers==0.10.3
toml==0.10.2
tomli==1.2.2
torch==1.10.0+cu113
tqdm==4.62.3
transformer-deploy==0.1.1
transformers==4.12.5
triton-model-analyzer==1.10.0
tritonclient==2.16.0
typing-extensions==4.0.1
urllib3==1.26.7
uvicorn==0.15.0
websocket-client==1.2.3
zope.event==4.5.0
zope.interface==5.4.0

Dynamic batching does not give better latency for Roberta running on TensorRT.

Hi, I used your build_engine API to convert the Roberta model. If I use a constant batch size for input_shapes while building, i.e. (min, optimal, max) -> (1, 1, 1) or (4, 4, 4), the model yields good results (faster than ort and torch).

But when I convert it with a dynamic batch size, i.e. (min, optimal, max) -> (1, 4, 4), the model is much slower than ort or torch.

code to understand the problem better:

# fast inference, but constrained to always use a batch size of 4 at inference time
tensor_shapes = list(zip([4, 4, 4], [1, 128, 128]))

# slow inference
tensor_shapes = list(zip([1, 4, 4], [1, 128, 128]))

engine: ICudaEngine = build_engine(
    runtime=runtime,
    onnx_file_path=onnx_model_path,
    logger=trt_logger,
    min_shape=tensor_shapes[0],
    optimal_shape=tensor_shapes[1],
    max_shape=tensor_shapes[2],
    workspace_size=workspace_size * 1024**3,
    fp16=not quantization,
    int8=quantization,
    profiling=True,
)

save_engine(engine=engine, engine_file_path=tensorrt_path)
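
To make explicit what the zip call above produces, this short standalone sketch builds the same (batch size, sequence length) pairs for the slow case. The variable names batch_sizes and seq_lens are introduced here for illustration only.

```python
batch_sizes = [1, 4, 4]      # min, optimal, max batch size
seq_lens = [1, 128, 128]     # min, optimal, max sequence length

# zip pairs the i-th batch size with the i-th sequence length.
tensor_shapes = list(zip(batch_sizes, seq_lens))
min_shape, optimal_shape, max_shape = tensor_shapes
print(min_shape, optimal_shape, max_shape)  # (1, 1) (4, 128) (4, 128)
```

So in the dynamic case both dimensions vary between the minimum and maximum shapes, whereas in the fixed case only the sequence length does.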

the complete build and inference logs for slow inference case (when converting with dynamic batch)

[06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +312, GPU +0, now: CPU 3789, GPU 2470 (MiB)
[06/02/2022-03:19:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3790, GPU 2470 (MiB)
[06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 3790 MiB, GPU 2470 MiB
[06/02/2022-03:19:09] [TRT] [I] [MemUsageSnapshot] End constructing builder kernel library: CPU 3924 MiB, GPU 2504 MiB
[06/02/2022-03:19:09] [TRT] [I] parsing TensorRT model
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1418322027
[06/02/2022-03:19:22] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:39] [TRT] [W] Output type must be INT32 for shape outputs
[06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +512, GPU +226, now: CPU 5802, GPU 2730 (MiB)
[06/02/2022-03:19:43] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +116, GPU +52, now: CPU 5918, GPU 2782 (MiB)
[06/02/2022-03:19:43] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[06/02/2022-03:19:43] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[06/02/2022-03:19:43] [TRT] [W]  (# 1 (SHAPE input_ids))
[06/02/2022-03:19:43] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[06/02/2022-03:25:32] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[06/02/2022-03:25:32] [TRT] [W]  (# 1 (SHAPE input_ids))
[06/02/2022-03:25:32] [TRT] [W]  (# 0 (SHAPE attention_mask))
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
Warning: Slice op (Unnamed Layer_ 90) [Slice]_slice cannot slice along a uniform dimension.
[06/02/2022-03:30:10] [TRT] [I] Detected 2 inputs and 1 output network tensors.
[06/02/2022-03:30:10] [TRT] [W] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[06/02/2022-03:30:10] [TRT] [W]  (# 1 (SHAPE input_ids))
[06/02/2022-03:30:10] [TRT] [W]  (# 0 (SHAPE attention_mask))
[06/02/2022-03:30:32] [TRT] [I] Total Host Persistent Memory: 208
[06/02/2022-03:30:32] [TRT] [I] Total Device Persistent Memory: 0
[06/02/2022-03:30:32] [TRT] [I] Total Scratch Memory: 442827264
[06/02/2022-03:30:32] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 774 MiB, GPU 2058 MiB
[06/02/2022-03:30:32] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.038945ms to assign 4 blocks to 4 nodes requiring 443041280 bytes.
[06/02/2022-03:30:32] [TRT] [I] Total Activation Memory: 443041280
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 5993, GPU 4298 (MiB)
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 5993, GPU 4306 (MiB)
[06/02/2022-03:30:32] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +1353, now: CPU 0, GPU 1353 (MiB)
[06/02/2022-03:30:33] [TRT] [I] Loaded engine size: 1364 MiB
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7354, GPU 4282 (MiB)
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7355, GPU 4290 (MiB)
[06/02/2022-03:30:33] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 1352 (MiB)
[06/02/2022-03:30:38] [TRT] [I] Loaded engine size: 1364 MiB
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 7366, GPU 5636 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 7367, GPU 5644 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1352, now: CPU 0, GPU 2704 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 6002, GPU 5636 (MiB)
[06/02/2022-03:30:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 6002, GPU 5644 (MiB)
[06/02/2022-03:30:43] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +423, now: CPU 0, GPU 3127 (MiB)

latencies in ms
--------------------------------------------------
Pytorch 
--------------------------------------------------
[93.5968, 94.0308, 94.8224, 93.6746, 94.5972, 94.0188, 92.3105, 93.6535, 92.4908, 91.4413]
--------------------------------------------------
Onnxruntime 
 --------------------------------------------------
[81.445, 81.3684, 80.2145, 81.5339, 82.9578, 83.6845, 83.6738, 82.6652, 81.5462, 82.8237]
--------------------------------------------------
TensorRT (FP16) 
 --------------------------------------------------
[426.353, 425.1992, 426.0317, 425.8226, 426.8828, 428.0485, 426.3119, 426.4556, 425.4863, 426.0393]
--------------------------------------------------

Is this the expected behavior?

I want to convert the model to use dynamic batches. When inferencing, the model should be able to handle a variable batch size and perform faster. How can I achieve that?

Any help would be greatly appreciated, thank you in advance.

How to convert_model -m "./mycustomodel"

I have a custom model

`
import torch
from transformers import BertModel


class BERTClass(torch.nn.Module):
    def __init__(self, num_labels=4):
        super(BERTClass, self).__init__()
        self.l1 = BertModel.from_pretrained('bert-base-multilingual-uncased')
        self.l2 = torch.mean
        self.l3 = torch.nn.Linear(768, num_labels)

    def forward(self, ids, masks, token_type_ids):
        last_hidden_state, _ = self.l1(ids, attention_mask=masks, token_type_ids=token_type_ids)
        avg_pooling = self.l2(last_hidden_state, dim=1)
        output = self.l3(avg_pooling)

        return output`

How do I go about saving my custom model so that I can run convert_model -m "./mycustomodel"?

Currently I am saving the model this way

`
model_2_save = model.module if hasattr(model, "module") else model

checkpoint = {
    'epoch': args.epochs,
    'num_labels': args.num_labels,
    'max_text_length': MAX_TEXT_LENGTH,
    'state_dict': model_2_save.state_dict()
}

torch.save(checkpoint, args.model_dir + "/pt_model.pt")`

Is it better to save the pretrained part and convert it separately from the fully connected layer, then combine them after conversion, or do I need to derive my custom class from PreTrainedModel so that I can use save_pretrained? Do you happen to have an example I can follow? Thanks for the amazing repo.

GPT-2 pipeline?

Hello,
thank you for this wonderful implementation.

Do you have any plans to add a notebook with GPT-2 support?
It seems there would be a huge speed benefit, especially with shorter sequence lengths and larger batch sizes.

manage feature extraction

GPT-J support

How can we use GPT-J for inference?

Execute T5 inference with TensorRT

Export of Large Models Fails: onnx2trt_utils.cpp:1571

This issue is a direct consequence of: onnx/onnx-tensorrt#818

/usr/local/lib/python3.8/dist-packages/transformers/models/gpt2/modeling_gpt2.py:196: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)
[03/14/2022-09:03:29] [TRT] [E] onnx2trt_utils.cpp:1571: Failed to open file: transformer.wte.weight
[03/14/2022-09:03:29] [TRT] [E] 4: [network.cpp::validate::2633] Error Code 4: Internal Error (Network must have at least one output)
[03/14/2022-09:03:29] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 357, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 262, in main
    engine: ICudaEngine = build_engine(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/trt_utils.py", line 126, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7fe431b73a30>, None

Steps to reproduce:

Works:

docker run -it --rm --gpus device=6 \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
    bash -c "cd /project && \
        convert_model -m distilgpt2 \
        --backend tensorrt \
        --seq-len 1 128 128 \
        --task text-generation"

Doesn't work:

docker run -it --rm --gpus device=6 \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.4.0 \
    bash -c "cd /project && \
        convert_model -m gpt2-large \
        --backend tensorrt \
        --seq-len 1 128 128 \
        --task text-generation"

Got an error in optimize_onnx when running the gpt2 file from demo/generative-model

I get an error when running this part of the code:
logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)
num_attention_heads, hidden_size = get_model_size(path=model_name)
optimize_onnx(
    onnx_path="test-gpt2.onnx",
    onnx_optim_model_path="test-gpt2-opt.onnx",
    fp16=True,
    use_cuda=True,
    num_attention_heads=num_attention_heads,
    hidden_size=hidden_size,
    architecture='gpt2'
)

INFO:fusion_base:Fused LayerNormalization count: 25
INFO:fusion_base:Fused FastGelu count: 12

failed in shape inference <class 'AssertionError'>
failed in shape inference <class 'AssertionError'>
failed in shape inference <class 'AssertionError'>

INFO:onnx_model:Graph pruned: 0 inputs, 0 outputs and 720 nodes are removed
INFO:onnx_model_gpt2:postprocess: remove Reshape count:72
INFO:fusion_base:Fused FastGelu(add bias) count: 12
INFO:onnx_model_bert:opset verion: 13

AssertionError Traceback (most recent call last)

in ()
9 num_attention_heads=num_attention_heads,
10 hidden_size=hidden_size,
---> 11 architecture='gpt2'
12 )

7 frames

/usr/local/lib/python3.7/dist-packages/onnxruntime/transformers/../tools/symbolic_shape_infer.py in add_suggested_merge(self, symbols, apply)
209
210 def add_suggested_merge(self, symbols, apply=False):
--> 211 assert all([(type(s) == str and s in self.symbolic_dims) or is_literal(s) for s in symbols])
212 symbols = set(symbols)
213 for k, v in self.suggested_merge.items():

AssertionError:

[ML] As Dave, I want to document how GPUs work so that I can work on accelerating inference for generative models

Generate data offline to optimize memory usage

Inference on CPU

I have converted a model as in the tutorial. And now I have a folder triton_models with
model-original.onnx
model.plan
transformer_onnx_model
transformer_tensorrt_inference
transformer_tensorrt_tokenize
model.onnx
transformer_onnx_inference
transformer_onnx_tokenize
transformer_tensorrt_model

When I run

docker run -it --rm -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:22.01-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"

I get the following log:

`WARNING: [Torch-TensorRT] - Unable to read CUDA capable devices. Return status: 35
I0215 08:42:16.993018 1 libtorch.cc:1227] TRITONBACKEND_Initialize: pytorch
I0215 08:42:16.993126 1 libtorch.cc:1237] Triton TRITONBACKEND API version: 1.7
I0215 08:42:16.993133 1 libtorch.cc:1243] 'pytorch' TRITONBACKEND API version: 1.7
2022-02-15 08:42:17.201026: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-15 08:42:17.246677: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0215 08:42:17.246838 1 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0215 08:42:17.246867 1 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.7
I0215 08:42:17.246922 1 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.7
I0215 08:42:17.246933 1 tensorflow.cc:2216] backend configuration:
{}
I0215 08:42:17.249114 1 onnxruntime.cc:2232] TRITONBACKEND_Initialize: onnxruntime
I0215 08:42:17.249175 1 onnxruntime.cc:2242] Triton TRITONBACKEND API version: 1.7
I0215 08:42:17.249183 1 onnxruntime.cc:2248] 'onnxruntime' TRITONBACKEND API version: 1.7
I0215 08:42:17.249189 1 onnxruntime.cc:2278] backend configuration:
{}
I0215 08:42:17.271894 1 openvino.cc:1234] TRITONBACKEND_Initialize: openvino
I0215 08:42:17.271952 1 openvino.cc:1244] Triton TRITONBACKEND API version: 1.7
I0215 08:42:17.271959 1 openvino.cc:1250] 'openvino' TRITONBACKEND API version: 1.7
W0215 08:42:17.272063 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0215 08:42:17.272140 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
E0215 08:42:17.299306 1 model_repository_manager.cc:1844] Poll failed for model directory 'transformer_onnx_model': instance group transformer_onnx_model_0 of model transformer_onnx_model has kind KIND_GPU but no GPUs are available
E0215 08:42:17.311377 1 model_repository_manager.cc:1844] Poll failed for model directory 'transformer_onnx_tokenize': instance group transformer_onnx_tokenize_0 of model transformer_onnx_tokenize has kind KIND_GPU but no GPUs are available
E0215 08:42:17.326131 1 model_repository_manager.cc:1844] Poll failed for model directory 'transformer_tensorrt_model': instance group transformer_tensorrt_model_0 of model transformer_tensorrt_model has kind KIND_GPU but no GPUs are available
E0215 08:42:17.336737 1 model_repository_manager.cc:1844] Poll failed for model directory 'transformer_tensorrt_tokenize': instance group transformer_tensorrt_tokenize_0 of model transformer_tensorrt_tokenize has kind KIND_GPU but no GPUs are available
E0215 08:42:17.336814 1 model_repository_manager.cc:1332] Invalid argument: ensemble transformer_tensorrt_inference contains models that are not available: transformer_tensorrt_tokenize, transformer_tensorrt_model
E0215 08:42:17.336823 1 model_repository_manager.cc:1332] Invalid argument: ensemble transformer_onnx_inference contains models that are not available: transformer_onnx_tokenize, transformer_onnx_model
I0215 08:42:17.336862 1 server.cc:519]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0215 08:42:17.336905 1 server.cc:546]
+-------------+-------------------------------------------------------------------------+--------+
| Backend | Path | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {} |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {} |
| openvino | /opt/tritonserver/backends/openvino_2021_2/libtriton_openvino_2021_2.so | {} |
+-------------+-------------------------------------------------------------------------+--------+

I0215 08:42:17.336925 1 server.cc:589]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0215 08:42:17.337084 1 server.cc:249] Waiting for in-flight requests to complete.
I0215 08:42:17.337097 1 server.cc:264] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models`
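In a startup log this long, the lines that matter are the `Poll failed for model directory` errors. As a minimal sketch (the function name `failed_models` and the `sample` string are illustrative, not part of Triton or transformer-deploy), they can be extracted like this:

```python
import re
from typing import Dict

# Matches the "Poll failed" error lines in the format shown in the log above.
POLL_FAILED = re.compile(
    r"Poll failed for model directory '(?P<model>[^']+)': (?P<reason>.+)"
)


def failed_models(log_text: str) -> Dict[str, str]:
    """Map each model directory that failed to load to the reason Triton logged."""
    failures = {}
    for line in log_text.splitlines():
        match = POLL_FAILED.search(line)
        if match:
            failures[match.group("model")] = match.group("reason").strip()
    return failures


# One error line and one unrelated info line copied from the log above.
sample = (
    "E0215 08:42:17.299306 1 model_repository_manager.cc:1844] Poll failed for "
    "model directory 'transformer_onnx_model': instance group "
    "transformer_onnx_model_0 of model transformer_onnx_model has kind KIND_GPU "
    "but no GPUs are available\n"
    "I0215 08:42:17.336862 1 server.cc:519]\n"
)

failures = failed_models(sample)
for model, reason in failures.items():
    print(f"{model}: {reason}")
```

Run against the full log, this lists the four model directories whose instance groups are declared as `KIND_GPU`.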

How can I fix that? Thanks!
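One possible workaround, offered as an untested sketch rather than a confirmed fix: the errors say every instance group is declared as `KIND_GPU`, so the `config.pbtxt` files could be rewritten to use Triton's `KIND_CPU` instead. This assumes each generated `config.pbtxt` contains the literal text `KIND_GPU`. The helper name `switch_to_cpu` is hypothetical. It can only help the ONNX models: a TensorRT `model.plan` needs an NVIDIA GPU, so the `transformer_tensorrt_*` directories would have to be left out of the repository for a CPU-only server. An ONNX file that was optimized for GPU during conversion may also need to be re-exported for CPU.

```python
import re
import tempfile
from pathlib import Path


def switch_to_cpu(model_repo: Path) -> list:
    """Rewrite KIND_GPU to KIND_CPU in every <model>/config.pbtxt under model_repo.

    Returns the names of the model directories whose config was changed.
    """
    changed = []
    for config in sorted(model_repo.glob("*/config.pbtxt")):
        text = config.read_text()
        new_text = re.sub(r"\bKIND_GPU\b", "KIND_CPU", text)
        if new_text != text:
            config.write_text(new_text)
            changed.append(config.parent.name)
    return changed


# Demo on a throwaway directory; the config content here is an assumed
# minimal example, not the exact file transformer-deploy generates.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    model_dir = repo / "transformer_onnx_model"
    model_dir.mkdir()
    (model_dir / "config.pbtxt").write_text(
        "instance_group [ { count: 1 kind: KIND_GPU } ]\n"
    )
    changed = switch_to_cpu(repo)
    result_text = (model_dir / "config.pbtxt").read_text()

print(changed)
```

On the real folder the call would be `switch_to_cpu(Path("triton_models"))`, made after backing up the original configs.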