Easy and Efficient Transformer: Scalable Inference Solution for Large NLP Models

License: Apache License 2.0


Easy and Efficient Transformer

EET (Easy and Efficient Transformer) is an efficient PyTorch inference plugin focused on Transformer-based models with large model sizes and long sequences.

Features

1. Pre-padding decoding. Pre-padding keeps the relative position embeddings consistent between the context and the generated sequence, reducing the gap between training and inference. Based on this, we achieve parallel inference for the context and incremental decoding for the generated sequence (see the sketch after this list).
2. High performance. Highly optimized CUDA kernels, designed with reference to NVIDIA FasterTransformer and extended with further optimizations.
3. Flexible. Both op-level and model-level APIs are provided, allowing users to construct their own models or flexibly replace parts of an algorithm.
4. Easy to use. EET can be integrated into Fairseq and Transformers directly by replacing specified files, without any code change.
5. Dynamic batch. EET supports dynamic batching: it reorders the batch according to reorder_state and can retire finished sequences from a batch early.
6. Extra-large dimensions and extra-long sequences. EET supports GPT hidden_units up to 16384 and sequence lengths up to 4096.
7. Multiple models supported, including GPT-2, BERT, ALBERT, RoBERTa, and ViT.
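
As a concrete illustration of pre-padding (feature 1), here is a minimal sketch of how a left-padded batch is built with a Hugging Face tokenizer; this is for intuition only and is not EET-specific code:

```python
# Pre-padding (left padding): pad tokens go before the context, so every
# row ends at the same position and the relative positions of the context
# and of each newly generated token stay aligned across the batch.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(
    ["a short prompt", "a somewhat longer prompt here"],
    return_tensors="pt",
    padding=True,
)
print(batch["input_ids"])       # pads on the left, real tokens flush right
print(batch["attention_mask"])  # 0 over pads, 1 over real tokens
```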

EET has been applied to a variety of NetEase online services, such as NiShuiHan, NetEase Cloud Music, TianYu, and Lofter. In the future, EET will target ultra-large-scale model inference with trillions of parameters.

| Frameworks | Maximum model size | Maximum sequence length | Performance | BERT | GPT-2 | Op-level | Fairseq support | Transformers support | Dynamic batch & variable inputs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EET | 16384 | 16384 | highest | Y | Y | Y | Y | Y | Y |
| Faster Transformer | multiples of specific numbers (e.g. 128, 256, 384, 512) | 1024 | high | Y | Y | N | N | N | N |
| TensorRT | 1024 | 1024 | high | Y | N | N | N | N | N |
| LightSeq | 1024 | 1024 | high | Y | Y | N | N | N | Y |
| TurboTransformer | 1024 | 1024 | medium | Y | Y | N | N | Y | Y |
| ONNX | unlimited | unlimited | slow | Y | Y | Y | N | N | Y |

Decoding mechanism

[Figure: decoding mechanism]

Quick Start

Environment

  • CUDA >= 10.1
  • Python >= 3.7
  • GCC >= 7.4.0
  • torch >= 1.5.0
  • numpy >= 1.19.1
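
To quickly verify the Torch and CUDA side of these requirements from Python:

```python
# Print the installed torch / CUDA versions to compare against the
# requirements listed above.
import torch

print(torch.__version__)          # expect >= 1.5.0
print(torch.version.cuda)         # expect >= 10.1
print(torch.cuda.is_available())  # a CUDA-capable GPU is required
```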

Installation

From Source

If you are installing from source, you will need the environment described above. Then proceed as follows:

$ git clone https://github.com/NetEase-FuXi/EET.git
# to run the demos in the examples for comparison, we need to install transformers and fairseq
$ pip install transformers==3.5.0 
$ pip install fairseq==0.10.0 
$ pip install .

Because a large number of CUDA kernels must be compiled, installation takes a relatively long time; please be patient.

From Docker

$ git clone https://github.com/NetEase-FuXi/EET.git
$ docker build -t your_docker_name:your_docker_version .
$ nvidia-docker run -it --net=host -v /your/project/directory/:/root/workspace your_docker_name:your_docker_version bash

EET comes pre-installed inside the Docker image.

Run

run BERT in Transformers

$ cd EET/example/python  
$ python bert_transformers_example.py

run GPT2 in Transformers

$ cd EET/example/python   
$ python gpt2_transformers_example.py

run GPT2 in Fairseq

$ cd EET
$ wget https://github.com/NetEase-FuXi/EET/releases/download/EET_V0.0.1_fairseq0.10.0_transformers3.5.0/resource.zip
$ cd example 
$ python gpt2_fairseq_example.py

Supported Models

We currently support GPT-2 and BERT.

GPT2

[Figure: GPT-2]

BERT

[Figure: BERT]

Usage

EET provides user-friendly Python APIs (python/eet) that integrate into Fairseq and Transformers with just a few lines of code. Note that only left (pre-) padding is supported.

1. How to run inference

[Figure: example usage of EET BERT]
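
A minimal sketch of model-level inference, reconstructed from the class names in the API tables below; the import path and the extra from_pretrained arguments (max_batch, data_type) are assumptions, so check python/eet for the actual interface:

```python
import torch
from eet import EETBertModel  # assumed import path

torch.set_grad_enabled(False)

# Left-padded token ids (EET only supports padding on the left).
input_ids = torch.randint(1, 10000, (4, 128), dtype=torch.long).cuda()

model = EETBertModel.from_pretrained(
    "bert-base-uncased",
    max_batch=4,              # assumed: maximum batch size to pre-allocate
    data_type=torch.float16,  # assumed: inference precision
)
out = model(input_ids)
```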

2. How to customize a model
You can refer to the operator APIs listed below to build your own model structure, simply by modifying the files under python/eet.

3. How to integrate EET into Fairseq
Replace the original transformer.py in Fairseq with our transformer.py and reinstall Fairseq; that's all! transformer.py in EET corresponds to the fusion of transformer.py and transformer_layer.py in Fairseq.

4. How to integrate EET into Transformers
Replace the original modeling_bert.py and modeling_gpt2.py in Transformers with our versions and reinstall Transformers; that's all! modeling_bert.py in EET corresponds to modeling_bert.py in Transformers, and modeling_gpt2.py in EET corresponds to modeling_gpt2.py in Transformers.

5. How to build a server
We choose service-streamer to serve the model, building the service directly on top of your Python project; a sketch follows below. Make sure dynamic batching is enabled if you want higher throughput.
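
A minimal serving sketch with service-streamer; `predict` below is a placeholder standing in for your EET inference call:

```python
from service_streamer import ThreadedStreamer

def predict(batch_of_texts):
    # Tokenize and run the EET model here; identity placeholder for brevity.
    return [text for text in batch_of_texts]

# Requests arriving within max_latency seconds are grouped into one batch
# (up to batch_size), which pairs well with EET's dynamic batching.
streamer = ThreadedStreamer(predict, batch_size=64, max_latency=0.1)
outputs = streamer.predict(["example input"])
```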

APIs

  1. Model APIs: we provide ready-made APIs for the GPT-2 and BERT models; a usage sketch follows the tables below.

    EET and fairseq class comparison table

    | EET | fairseq | Remarks |
    | --- | --- | --- |
    | EETTransformerDecoder | TransformerDecoder | |
    | EETTransformerDecoderLayer | TransformerDecoderLayer | |
    | EETTransformerAttention | MultiheadAttention | |
    | EETTransformerFeedforward | TransformerDecoderLayer | fusion of multiple small operators |
    | EETTransformerEmbedding | Embedding + PositionalEmbedding | |
    | EETTransformerLayerNorm | nn.LayerNorm | |

    EET and transformers class comparison table

    | EET | transformers | Remarks |
    | --- | --- | --- |
    | EETBertModel | BertModel | |
    | EETBertEncoder | BertEncoder | |
    | EETBertEncoderLayer | BertLayer | |
    | EETBertAttention | BertAttention | |
    | EETBertFeedforward | BertIntermediate + BertOutput | |
    | EETBertEmbedding | BertEmbeddings | |
    | EETGPT2Model | GPT2Model | |
    | EETGPT2Decoder | GPT2Model | transformers has no GPT2Decoder |
    | EETGPT2DecoderLayer | Block | |
    | EETGPT2Attention | Attention | |
    | EETGPT2Feedforward | MLP | |
    | EETGPT2Embedding | nn.Embedding | |
    | EETLayerNorm | nn.LayerNorm | |
  2. Operator APIs: we provide all the operators required for Transformer models; you can combine different kernels to build different model structures.

    | Operator APIs | Remarks |
    | --- | --- |
    | masked_multi_head_attention | GPT-2 self-attention |
    | cross_multi_head_attention | cross-attention |
    | multi_head_attention | BERT self-attention |
    | ffn | feed-forward network |
    | embedding | transformers & fairseq |
    | layernorm | nn.LayerNorm |
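
As a usage sketch for the model APIs above, a hypothetical GPT-2 example; the import path, the extra from_pretrained arguments (max_batch, full_seq_len, data_type), and the first_pass flag are assumptions about the interface, so consult python/eet for the real signatures:

```python
import torch
from eet import EETGPT2Model  # assumed import path

torch.set_grad_enabled(False)

model = EETGPT2Model.from_pretrained(
    "gpt2",
    max_batch=4,              # assumed: maximum batch size to pre-allocate
    full_seq_len=1024,        # assumed: max context + generated length
    data_type=torch.float16,  # assumed: inference precision
)

# Left-padded context ids; the whole context is processed in parallel on the
# first pass, then tokens are decoded incrementally one step at a time.
context_ids = torch.randint(1, 50257, (4, 512), dtype=torch.long).cuda()
hidden = model(context_ids, first_pass=True)  # assumed signature
```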

Performance

GPT-3 memory usage and performance

We measure inference time and memory occupancy in different scenarios. Note: total time is measured with the context taking up 50% of the sequence, so each 1024-token total corresponds to 512 incrementally generated tokens (e.g. 2.69 ms/token × 512 tokens ≈ 1.38 s).

  • A100 (batch_size=4, max_sequence_length=1024, context_length=512, precision=half)

    | Model Name | Params | Layers | Hidden units | Inference time per token | Total time for 1024 tokens |
    | --- | --- | --- | --- | --- | --- |
    | GPT-3 Small | 125M | 12 | 768 | 2.69 ms | 1.38 s |
    | GPT-3 Medium | 350M | 24 | 1024 | 5.58 ms | 2.86 s |
    | GPT-3 Large | 760M | 24 | 1536 | 6.64 ms | 3.41 s |
    | GPT-3 XL | 1.3B | 24 | 2048 | 7.3 ms | 3.76 s |
    | GPT-3 2.7B | 2.7B | 32 | 2560 | 46.1 ms | 23.6 s |
    | GPT-3 6.7B | 6.7B | 32 | 4096 | 17.2 ms | 8.85 s |
    | GPT-3 13B | 13B | 40 | 5120 | 29.3 ms | 15.12 s |

  • A100 (batch_size=16, max_sequence_length=1024, context_length=512, precision=half)

    | Model Name | Params | Layers | Hidden units | Inference time per token | Total time for 1024 tokens |
    | --- | --- | --- | --- | --- | --- |
    | GPT-3 Small | 125M | 12 | 768 | 2.84 ms | 1.46 s |
    | GPT-3 Medium | 350M | 24 | 1024 | 6 ms | 3.11 s |
    | GPT-3 Large | 760M | 24 | 1536 | 7.39 ms | 3.80 s |
    | GPT-3 XL | 1.3B | 24 | 2048 | 8.27 ms | 4.26 s |
    | GPT-3 2.7B | 2.7B | 32 | 2560 | 116 ms | 59.8 s |
    | GPT-3 6.7B | 6.7B | 32 | 4096 | 23.18 ms | 12.25 s |
    | GPT-3 13B | 13B | 40 | 5120 | 43.42 ms | 22.58 s |

We tested the performance of EET on two GPU hardware platforms, choosing PyTorch, NVIDIA FasterTransformer, and LightSeq implementations for comparison.

We show GPT2 inference performance here.

  • RTX 2080 Ti (batch_size=4, hidden_units=1024, sequence_length=1024, precision=fp16)
    [Figure: GPT-2 speedup vs. context ratio on RTX 2080 Ti]
  • RTX 2080 Ti (batch_size=4, context_ratio=50%, sequence_length=1024, precision=fp16)
    [Figure: GPT-2 speedup vs. hidden units on RTX 2080 Ti]
  • A100 (batch_size=4, hidden_units=1024, sequence_length=1024, precision=fp16)
    [Figure: GPT-2 speedup vs. context ratio on A100]
  • A100 (batch_size=4, context_ratio=50%, sequence_length=1024, precision=fp16)
    [Figure: GPT-2 speedup vs. hidden units on A100]

Medium-sized model (hidden_units=1024, max_seq_len=768), compared with LightSeq:

[Figure: medium model vs. LightSeq]

Small-sized model (hidden_units=768, max_seq_len=128), compared with LightSeq:

[Figure: small model vs. LightSeq]

We show BERT inference performance here.

  • RTX 2080 Ti
    [Figure: BERT speedup on RTX 2080 Ti]
  • A100
    [Figure: BERT speedup on A100]

TODO

  1. int8
  2. sparse

Contact us

You can report problems via GitHub issues.

Contributors

gongzhengli, dingjingzhen, zhisunyy, sidazh
