
Transnormer

Official implementation of Transnormer in our EMNLP 2022 paper - The Devil in Linear Transformer. This repo does not contain the model code itself; it only provides scripts and instructions for reproducing the results of the paper. The overall directory structure is as follows:

Overall Architecture

The overall network architecture is as follows:

[Figure: overall network architecture]

Experiments

Environment Preparation

Our experiments use two conda environments: autoregressive and bidirectional language modeling use the environment from the Env1 section, while LRA uses the environment from the Env2 section.

Env1

First build the conda environment based on the yaml file:

conda env create --file env1.yaml

If you hit an error when installing torch, remove torch and torchvision from the yaml file, rerun the command above, and then run the commands below:

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements_env1.txt

Then install our version of fairseq:

git clone https://github.com/OpenNLPLab/fairseq-evo.git fairseq
cd fairseq
pip install --editable ./
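
As a quick sanity check (not part of the original instructions), you can verify that the editable install is visible to the environment:

python -c "import fairseq; print(fairseq.__version__)"
fairseq-train --help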

Env2

Build the conda environment based on the yaml file:

conda env create --file env2.yaml

Autoregressive language model

1) Preprocess the data

First download and prepare the WikiText-103 dataset:

path_to_fairseq=fairseq
cd $path_to_fairseq/examples/language_model/
bash prepare-wikitext-103.sh
cd ../..

Next preprocess/binarize the data:

TEXT=examples/language_model/wikitext-103
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20

This preprocessing step is taken from the fairseq language modeling example.

2) Train the autoregressive language model

Use the following command to train the autoregressive language model:

bash script_alm.sh

You should change data_dir in the script to point to the preprocessed data (e.g. data-bin/wikitext-103). If you are using a Slurm cluster, add --distributed-port $PORT to the fairseq-train arguments.
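
The exact contents of script_alm.sh are not reproduced here; it wraps a standard fairseq-train invocation. A minimal sketch of the edit described above, with placeholder architecture and hyperparameter flags (the real Transnormer architecture name and settings are defined in the script), might look like:

# data_dir must point to the binarized data produced in step 1)
data_dir=data-bin/wikitext-103

# Illustrative flags only; the actual architecture/hyperparameters live in script_alm.sh.
# --distributed-port $PORT is only needed on a Slurm cluster.
fairseq-train $data_dir \
    --task language_modeling \
    --arch transformer_lm \
    --max-tokens 4096 \
    --distributed-port $PORT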

Bidirectional language model

1) Preprocess the data

First download the WikiText-103 dataset:

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

Next encode it with the GPT-2 BPE:

mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done

Finally preprocess/binarize the data using the GPT-2 fairseq dictionary:

wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60

This preprocessing step is taken from the fairseq documentation.

2) Train the bidirectional language model

Use the following command to train the bidirectional language model:

bash train_blm.sh

You should change data_dir in the script to point to the preprocessed data. If you are using a Slurm cluster, add --distributed-port $PORT to the fairseq-train arguments, as in the autoregressive case above.

LRA

1) Preparation

Download the codebase:

git clone https://github.com/OpenNLPLab/lra.git

Download the data:

wget https://storage.googleapis.com/long-range-arena/lra_release.gz
mv lra_release.gz lra_release.tar.gz 
tar -xvf lra_release.tar.gz

2) Training

Use the following script to run the experiments. In the script, change PREFIX to your LRA data path, set tasks to the specific task you want to run, and set model_config to t1 or t2:

python script_lra.py
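
script_lra.py is not shown here; as described above, there are three settings to edit before launching. A hypothetical sketch of locating and running them (the variable names follow the text above):

# Settings to edit in script_lra.py (names from the text above):
#   PREFIX       - path to the extracted lra_release data
#   tasks        - the specific LRA task to run
#   model_config - t1 or t2
grep -n "PREFIX\|tasks\|model_config" script_lra.py   # locate the lines to edit
python script_lra.py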

Standalone code

Citation

@inproceedings{qin-etal-2022-devil,
    title = "The Devil in Linear Transformer",
    author = "Qin, Zhen  and
      Han, Xiaodong  and
      Sun, Weixuan  and
      Li, Dongxu  and
      Kong, Lingpeng  and
      Barnes, Nick  and
      Zhong, Yiran",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.473",
    pages = "7025--7041",
    abstract = "Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performances on various tasks and corpus. In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation adversely impact the convergence of linear transformer models; 2) attention dilution which trivially distributes attention scores over long sequences while neglecting neighbouring structures. To address these issues, we first identify that the scaling of attention matrices is the devil in unbounded gradients, which turns out unnecessary in linear attention as we show theoretically and empirically. To this end, we propose a new linear attention that replaces the scaling operation with a normalization to stabilize gradients. For the issue of attention dilution, we leverage a diagonal attention to confine attention to only neighbouring tokens in early layers. Benefiting from the stable gradients and improved attention, our new linear transformer model, transNormer, demonstrates superior performance on text classification and language modeling tasks, as well as on the challenging Long-Range Arena benchmark, surpassing vanilla transformer and existing linear variants by a clear margin while being significantly more space-time efficient. The code is available at https://github.com/OpenNLPLab/Transnormer .",
}

WIP

  • Check the training script.
  • Add standalone code.
  • Update transnormer-pytorch.
