This branch contains the updated code, which can use more of the pretrained language models provided by huggingface/transformers.
All requirements are specified in the Dockerfile:
- PyTorch version == 1.5.0
- Python version == 3.6
- huggingface/transformers version == 3.5.0
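If you want to sanity-check your environment outside Docker, a minimal version check (the expected versions are the ones pinned above) might look like this:

```python
# minimal environment check; expected versions follow the Dockerfile above
import torch
import transformers

print(torch.__version__)         # expect 1.5.0
print(transformers.__version__)  # expect 3.5.0
```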
Installing from source
To install fairseq from source and develop locally:
```bash
git clone https://github.com/bert-nmt/bert-nmt
cd bert-nmt
git checkout update-20-10
pip install --editable .
```
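As an optional sanity check (not part of the official instructions), you can confirm that Python resolves the local editable checkout:

```bash
# should print a path inside your bert-nmt clone
python -c "import fairseq; print(fairseq.__file__)"
```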
First, you should run a Fairseq `prepare-xxx.sh` script to get tokenized and BPE-processed files:

```bash
bash prepare-iwslt14.sh
```

This produces:

```
train.en train.de valid.en valid.de test.en test.de
```
Then use `makedataforbert.sh` to get the input files for the BERT model (please make sure the paths are correct):

```bash
cd iwslt14.tokenized.de-en
cp ../makedataforbert.sh .
bash makedataforbert.sh de
```

You should now have:

```
train.en train.de valid.en valid.de test.en test.de train.bert.de valid.bert.de test.bert.de
```
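For intuition, the core of `makedataforbert.sh` is removing the BPE markers so that BERT's own tokenizer sees (near-)plain text. The sketch below is my paraphrase of that idea, not the exact script (the real script may also detokenize with Moses):

```bash
# sketch: build *.bert.$lng files by stripping BPE continuation markers
lng=$1  # e.g. "de"
for sub in train valid test; do
    sed -r 's/(@@ )|(@@ ?$)//g' ${sub}.${lng} > ${sub}.bert.${lng}
done
```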
Then preprocess the data as in Fairseq:

```bash
cd ../../..
TEXT=examples/translation/iwslt14.tokenized.de-en
src=de
tgt=en
destdir=iwslt_${src}_${tgt}
python preprocess.py --source-lang $src --target-lang $tgt \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir $destdir --joined-dictionary --bert-model-name bert-base-german-dbmdz-uncased
```
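If you are unsure whether your `--bert-model-name` is valid, you can check that transformers 3.5.0 resolves it before binarizing; this snippet is only an illustrative check:

```python
# illustrative check that the BERT model name resolves in transformers
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained('bert-base-german-dbmdz-uncased')
print(tok.tokenize('Guten Morgen!'))  # should print BERT wordpieces
```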
Train a vanilla NMT model using Fairseq
Using the data above and the standard Fairseq repository, you can train a vanilla NMT model. The version should be commit a8f28ecb63ee01c33ea9f6986102136743d47ec2:

```bash
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout a8f28ecb63ee01c33ea9f6986102136743d47ec2
```
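The exact vanilla training command is up to you; one plausible run, mirroring the hyperparameters of the BERT-fused script below, is sketched here. `data-bin/iwslt14.tokenized.de-en`, the save dir, and `--max-update` are placeholders; the data must be binarized with that fairseq commit's own `preprocess.py`:

```bash
# hypothetical vanilla NMT training run; adjust paths and --max-update
python train.py data-bin/iwslt14.tokenized.de-en \
    -a transformer_iwslt_de_en --optimizer adam --lr 0.0005 -s de -t en \
    --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr 1e-09 \
    --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --max-update 50000 \
    --warmup-updates 4000 --warmup-init-lr 1e-07 --adam-betas '(0.9,0.98)' \
    --save-dir checkpoints/vanilla_nmt --share-all-embeddings
```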
The important options I have added are:

```python
parser.add_argument('--bert-model-name', default='bert-base-uncased', type=str)
parser.add_argument('--warmup-from-nmt', action='store_true')
parser.add_argument('--warmup-nmt-file', default='checkpoint_nmt.pt')
parser.add_argument('--encoder-bert-dropout', action='store_true')
parser.add_argument('--encoder-bert-dropout-ratio', default=0.25, type=float)
```
- `--bert-model-name`: specify the BERT model name (any name available in huggingface/transformers).
- `--warmup-from-nmt`: indicate that you will also use a pretrained NMT model to train your BERT-fused NMT model. If you use this option, we suggest you also use `--reset-lr-scheduler`.
- `--warmup-nmt-file`: specify the NMT checkpoint file name (in your save dir).
- `--encoder-bert-dropout`: indicate that you will use the drop-net trick (see the sketch after the training script below).
- `--encoder-bert-dropout-ratio`: specify the ratio ($\in [0, 0.5]$) used in drop-net.

This is my training script example:
```bash
#!/usr/bin/env bash
nvidia-smi
cd /yourpath/bertnmt
python3 -c "import torch; print(torch.__version__)"
src=de
tgt=en
bedropout=0.5
ARCH=transformer_s2_iwslt_de_en
DATAPATH=/yourdatapath
SAVEDIR=checkpoints/iwed_${src}_${tgt}_${bedropout}
mkdir -p $SAVEDIR
if [ ! -f $SAVEDIR/checkpoint_nmt.pt ]
then
    cp /your_pretrained_nmt_model $SAVEDIR/checkpoint_nmt.pt
fi
if [ ! -f "$SAVEDIR/checkpoint_last.pt" ]
then
    warmup="--warmup-from-nmt --reset-lr-scheduler"
else
    warmup=""
fi
export CUDA_VISIBLE_DEVICES=${1:-0}
python train.py $DATAPATH \
    -a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
    --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
    --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout \
    --bert-model-name bert-base-german-dbmdz-uncased | tee -a $SAVEDIR/training.log
```
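For intuition about the drop-net trick enabled by `--encoder-bert-dropout` and `--encoder-bert-dropout-ratio`: during training each layer randomly drops either its self-attention branch or its BERT-attention branch, and at inference time the two are averaged. The function below is my paraphrase of that behavior, not the repository's exact code:

```python
import random

def drop_net_weights(ratio: float, training: bool):
    """Sketch of drop-net: returns (w_self, w_bert) mixing weights.

    `ratio` is --encoder-bert-dropout-ratio, expected in [0, 0.5].
    """
    if training:
        u = random.uniform(0.0, 1.0)
        if u < ratio:        # drop the BERT-attention branch
            return 1.0, 0.0
        if u > 1.0 - ratio:  # drop the self-attention branch
            return 0.0, 1.0
    # inference, and the remaining training cases: average both branches
    return 0.5, 0.5

# inside a layer, the mixed output would then be:
# out = w_self * self_attn_out + w_bert * bert_attn_out
```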
Using `generate.py` to test a model is the same as in Fairseq, except that you should add `--bert-model-name` to indicate your BERT model name:

```bash
python generate.py dictpath --path model_path -s $src -t $tgt \
    --batch-size 128 --beam 5 --remove-bpe --bert-model-name bert-base-german-dbmdz-uncased
```
Using `interactive.py` to test a model is a little different from Fairseq. You should follow this procedure (`$MOSES` points to your mosesdecoder checkout):

```bash
sed -r 's/(@@ )|(@@ ?$)//g' $bpefile > $bpefile.debpe
$MOSES/scripts/tokenizer/detokenizer.perl -l $src < $bpefile.debpe > $bpefile.debpe.detok
paste -d "\n" $bpefile $bpefile.debpe.detok > $bpefile.in
cat $bpefile.in | python interactive.py dictpath --path model_path -s $src -t $tgt \
    --buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe > output.log
```
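The `paste -d "\n"` step interleaves the two files line by line, so `$bpefile.in` contains each source sentence twice: first the BPE version (for the NMT encoder), then the detokenized version (for BERT). For a hypothetical German source sentence, the file would look like:

```
ein gut@@ es Beispiel .
ein gutes Beispiel.
```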
We got a new result on the IWSLT'14 de->en task with `bert-base-german-dbmdz-uncased`, as shown below:

```
Generate test with beam=5: BLEU4 = 37.34, 69.7/45.1/31.2/21.9 (BP=0.974, ratio=0.975, syslen=127837, reflen=131156)
```