Easy and Efficient Transformer: Scalable Inference Solution for Large NLP Models

License: Apache License 2.0


Easy and Efficient Transformer

EET (Easy and Efficient Transformer) is an efficient PyTorch inference plugin focused on Transformer-based models with large model sizes and long sequences.

Features

1. Pre-padding decoding. Pre-padding keeps the relative position embeddings consistent between the context and the generated sequence, reducing the gap between training and inference. Based on this, we achieve parallel inference for the context and incremental decoding for the generated sequence (see the sketch after this list).
2. High performance. Highly optimized CUDA kernels, designed with reference to NVIDIA FasterTransformer and extended with further optimizations.
3. Flexible. Both op-level and model-level APIs are provided, allowing users to construct their own models or flexibly replace parts of an algorithm.
4. Easy to use. EET can be integrated into Fairseq and Transformers directly by replacing specified files, without any code change.
5. Dynamic batch. EET supports dynamic batching: it reorders the batch according to reorder_state and can retire finished sequences from a batch early.
6. Extra-large dimensions and extra-long sequences. EET supports GPT hidden_units up to 16384 and sequence lengths up to 4096.
7. Multiple models supported, including GPT-2, BERT, ALBERT, RoBERTa, and ViT.
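
As a concrete illustration of pre-padding (feature 1), here is a minimal sketch of how a left-padded batch is built with a Hugging Face tokenizer; this is for intuition only and is not EET-specific code:

```python
# Pre-padding (left padding): pad tokens go before the context, so every
# row ends at the same position and the relative positions of the context
# and of each newly generated token stay aligned across the batch.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(
    ["a short prompt", "a somewhat longer prompt here"],
    return_tensors="pt",
    padding=True,
)
print(batch["input_ids"])       # pads on the left, real tokens flush right
print(batch["attention_mask"])  # 0 over pads, 1 over real tokens
```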

EET has been applied to a variety of NetEase online services, such as NiShuiHan, NetEase Cloud Music, TianYu, and Lofter. In the future, EET will target ultra-large-scale model inference with trillions of parameters.

| Frameworks | Maximum model size | Maximum sequence length | Performance | BERT | GPT-2 | Op-level | Fairseq support | Transformers support | Dynamic batch & variable inputs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EET | 16384 | 16384 | highest | Y | Y | Y | Y | Y | Y |
| Faster Transformer | multiples of specific numbers (e.g. 128, 256, 384, 512) | 1024 | high | Y | Y | N | N | N | N |
| TensorRT | 1024 | 1024 | high | Y | N | N | N | N | N |
| LightSeq | 1024 | 1024 | high | Y | Y | N | N | N | Y |
| TurboTransformer | 1024 | 1024 | medium | Y | Y | N | N | Y | Y |
| ONNX | unlimited | unlimited | slow | Y | Y | Y | N | N | Y |

Decoding mechanism

[Figure: decoding mechanism]

Quick Start

Environment

  • CUDA >= 10.1
  • Python >= 3.7
  • GCC >= 7.4.0
  • torch >= 1.5.0
  • numpy >= 1.19.1
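
To quickly verify the Torch and CUDA side of these requirements from Python:

```python
# Print the installed torch / CUDA versions to compare against the
# requirements listed above.
import torch

print(torch.__version__)          # expect >= 1.5.0
print(torch.version.cuda)         # expect >= 10.1
print(torch.cuda.is_available())  # a CUDA-capable GPU is required
```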

Installation

From Source

If you are installing from source, you will need the environment described above. Then proceed as follows:

$ git clone https://github.com/NetEase-FuXi/EET.git
# to run the demos in the examples for comparison, we need to install transformers and fairseq
$ pip install transformers==3.5.0 
$ pip install fairseq==0.10.0 
$ pip install .

Because a large number of CUDA kernels must be compiled, installation takes a relatively long time; please be patient.

From Docker

$ git clone https://github.com/NetEase-FuXi/EET.git
$ docker build -t your_docker_name:your_docker_version .
$ nvidia-docker run -it --net=host -v /your/project/directory/:/root/workspace your_docker_name:your_docker_version bash

EET comes pre-installed inside the Docker image.

Run

run BERT in Transformers

$ cd EET/example/python  
$ python bert_transformers_example.py

run GPT2 in Transformers

$ cd EET/example/python   
$ python gpt2_transformers_example.py

run GPT2 in Fairseq

$ cd EET
$ wget https://github.com/NetEase-FuXi/EET/releases/download/EET_V0.0.1_fairseq0.10.0_transformers3.5.0/resource.zip
$ cd example 
$ python gpt2_fairseq_example.py

Supported Models

We currently support GPT-2 and BERT.

GPT2

[Figure: GPT-2]

BERT

[Figure: BERT]

Usage

EET provides user-friendly Python APIs (python/eet) that integrate into Fairseq and Transformers with just a few lines of code. Note that only left (pre-) padding is supported.

1. How to run inference

[Figure: example usage of EET BERT]
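
A minimal sketch of model-level inference, reconstructed from the class names in the API tables below; the import path and the extra from_pretrained arguments (max_batch, data_type) are assumptions, so check python/eet for the actual interface:

```python
import torch
from eet import EETBertModel  # assumed import path

torch.set_grad_enabled(False)

# Left-padded token ids (EET only supports padding on the left).
input_ids = torch.randint(1, 10000, (4, 128), dtype=torch.long).cuda()

model = EETBertModel.from_pretrained(
    "bert-base-uncased",
    max_batch=4,              # assumed: maximum batch size to pre-allocate
    data_type=torch.float16,  # assumed: inference precision
)
out = model(input_ids)
```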

2. How to customize a model
You can refer to the operator APIs listed below to build your own model structure, simply by modifying the files under python/eet.

3. How to integrate EET into Fairseq
Replace the original transformer.py in Fairseq with our transformer.py and reinstall Fairseq; that's all! transformer.py in EET corresponds to the fusion of transformer.py and transformer_layer.py in Fairseq.

4. How to integrate EET into Transformers
Replace the original modeling_bert.py and modeling_gpt2.py in Transformers with our versions and reinstall Transformers; that's all! modeling_bert.py in EET corresponds to modeling_bert.py in Transformers, and modeling_gpt2.py in EET corresponds to modeling_gpt2.py in Transformers.

5. How to build a server
We choose service-streamer to serve the model, building the service directly on top of your Python project; a sketch follows below. Make sure dynamic batching is enabled if you want higher throughput.
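
A minimal serving sketch with service-streamer; `predict` below is a placeholder standing in for your EET inference call:

```python
from service_streamer import ThreadedStreamer

def predict(batch_of_texts):
    # Tokenize and run the EET model here; identity placeholder for brevity.
    return [text for text in batch_of_texts]

# Requests arriving within max_latency seconds are grouped into one batch
# (up to batch_size), which pairs well with EET's dynamic batching.
streamer = ThreadedStreamer(predict, batch_size=64, max_latency=0.1)
outputs = streamer.predict(["example input"])
```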

APIs

  1. Model APIs: we provide ready-made APIs for the GPT-2 and BERT models; a usage sketch follows the tables below.

    EET and fairseq class comparison table

    | EET | fairseq | Remarks |
    | --- | --- | --- |
    | EETTransformerDecoder | TransformerDecoder | |
    | EETTransformerDecoderLayer | TransformerDecoderLayer | |
    | EETTransformerAttention | MultiheadAttention | |
    | EETTransformerFeedforward | TransformerDecoderLayer | fusion of multiple small operators |
    | EETTransformerEmbedding | Embedding + PositionalEmbedding | |
    | EETTransformerLayerNorm | nn.LayerNorm | |

    EET and transformers class comparison table

    | EET | transformers | Remarks |
    | --- | --- | --- |
    | EETBertModel | BertModel | |
    | EETBertEncoder | BertEncoder | |
    | EETBertEncoderLayer | BertLayer | |
    | EETBertAttention | BertAttention | |
    | EETBertFeedforward | BertIntermediate + BertOutput | |
    | EETBertEmbedding | BertEmbeddings | |
    | EETGPT2Model | GPT2Model | |
    | EETGPT2Decoder | GPT2Model | transformers has no GPT2Decoder |
    | EETGPT2DecoderLayer | Block | |
    | EETGPT2Attention | Attention | |
    | EETGPT2Feedforward | MLP | |
    | EETGPT2Embedding | nn.Embedding | |
    | EETLayerNorm | nn.LayerNorm | |
  2. Operator APIs: we provide all the operators required for Transformer models; you can combine different kernels to build different model structures.

    | Operator APIs | Remarks |
    | --- | --- |
    | masked_multi_head_attention | GPT-2 self-attention |
    | cross_multi_head_attention | cross-attention |
    | multi_head_attention | BERT self-attention |
    | ffn | feed-forward network |
    | embedding | transformers & fairseq |
    | layernorm | nn.LayerNorm |
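
As a usage sketch for the model APIs above, a hypothetical GPT-2 example; the import path, the extra from_pretrained arguments (max_batch, full_seq_len, data_type), and the first_pass flag are assumptions about the interface, so consult python/eet for the real signatures:

```python
import torch
from eet import EETGPT2Model  # assumed import path

torch.set_grad_enabled(False)

model = EETGPT2Model.from_pretrained(
    "gpt2",
    max_batch=4,              # assumed: maximum batch size to pre-allocate
    full_seq_len=1024,        # assumed: max context + generated length
    data_type=torch.float16,  # assumed: inference precision
)

# Left-padded context ids; the whole context is processed in parallel on the
# first pass, then tokens are decoded incrementally one step at a time.
context_ids = torch.randint(1, 50257, (4, 512), dtype=torch.long).cuda()
hidden = model(context_ids, first_pass=True)  # assumed signature
```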

Performance

GPT-3 memory usage and performance

We measure inference time and memory occupancy in different scenarios. Note: total time is measured with the context taking up 50% of the sequence, so each 1024-token total corresponds to 512 incrementally generated tokens (e.g. 2.69 ms/token × 512 tokens ≈ 1.38 s).

  • A100 (batch_size=4, max_sequence_length=1024, context_length=512, precision=half)

    | Model Name | Params | Layers | Hidden units | Inference time per token | Total time for 1024 tokens |
    | --- | --- | --- | --- | --- | --- |
    | GPT-3 Small | 125M | 12 | 768 | 2.69 ms | 1.38 s |
    | GPT-3 Medium | 350M | 24 | 1024 | 5.58 ms | 2.86 s |
    | GPT-3 Large | 760M | 24 | 1536 | 6.64 ms | 3.41 s |
    | GPT-3 XL | 1.3B | 24 | 2048 | 7.3 ms | 3.76 s |
    | GPT-3 2.7B | 2.7B | 32 | 2560 | 46.1 ms | 23.6 s |
    | GPT-3 6.7B | 6.7B | 32 | 4096 | 17.2 ms | 8.85 s |
    | GPT-3 13B | 13B | 40 | 5120 | 29.3 ms | 15.12 s |

  • A100 (batch_size=16, max_sequence_length=1024, context_length=512, precision=half)

    | Model Name | Params | Layers | Hidden units | Inference time per token | Total time for 1024 tokens |
    | --- | --- | --- | --- | --- | --- |
    | GPT-3 Small | 125M | 12 | 768 | 2.84 ms | 1.46 s |
    | GPT-3 Medium | 350M | 24 | 1024 | 6 ms | 3.11 s |
    | GPT-3 Large | 760M | 24 | 1536 | 7.39 ms | 3.80 s |
    | GPT-3 XL | 1.3B | 24 | 2048 | 8.27 ms | 4.26 s |
    | GPT-3 2.7B | 2.7B | 32 | 2560 | 116 ms | 59.8 s |
    | GPT-3 6.7B | 6.7B | 32 | 4096 | 23.18 ms | 12.25 s |
    | GPT-3 13B | 13B | 40 | 5120 | 43.42 ms | 22.58 s |

We tested the performance of EET on two GPU hardware platforms, choosing PyTorch, NVIDIA FasterTransformer, and LightSeq implementations for comparison.

We show GPT2 inference performance here.

  • RTX 2080 Ti (batch_size=4, hidden_units=1024, sequence_length=1024, precision=fp16)
    [Figure: GPT-2 speedup vs. context ratio on RTX 2080 Ti]
  • RTX 2080 Ti (batch_size=4, context_ratio=50%, sequence_length=1024, precision=fp16)
    [Figure: GPT-2 speedup vs. hidden units on RTX 2080 Ti]
  • A100 (batch_size=4, hidden_units=1024, sequence_length=1024, precision=fp16)
    [Figure: GPT-2 speedup vs. context ratio on A100]
  • A100 (batch_size=4, context_ratio=50%, sequence_length=1024, precision=fp16)
    [Figure: GPT-2 speedup vs. hidden units on A100]

Medium-sized model (hidden_units=1024, max_seq_len=768), compared with LightSeq:

[Figure: medium model vs. LightSeq]

Small-sized model (hidden_units=768, max_seq_len=128), compared with LightSeq:

[Figure: small model vs. LightSeq]

We show BERT inference performance here.

  • RTX 2080 Ti
    [Figure: BERT speedup on RTX 2080 Ti]
  • A100
    [Figure: BERT speedup on A100]

TODO

  1. int8
  2. sparse

Contact us

You can report problems via GitHub issues.

Contributors

gongzhengli, dingjingzhen, zhisunyy, sidazh
