fairseq-zh-en's Introduction

Chinese-English NMT

Experiments and reproduction of pretrained models trained on WMT17 Chinese-English using fairseq

Abstract

A major pain point for RNN/LSTM model training is that it is very time consuming, so the fully convolutional architecture proposed by fairseq is very appealing. Some cursory experiments show much faster training for fconv (fully convolutional sequence-to-sequence) compared to blstm (bi-LSTM), while yielding comparable results. Although fconv measures slightly worse BLEU scores than blstm, some manual tests seem to favor fconv. A hybrid convenc model (convolutional encoder, LSTM decoder) trains for many more epochs and achieves a much worse BLEU score.

| Model      | Epochs | Training Time | BLEU4 (beam1) | BLEU4 (beam5) | BLEU4 (beam10) | BLEU4 (beam20) |
|------------|--------|---------------|---------------|---------------|----------------|----------------|
| fconv      | 25     | ~4.5 hrs      | 63.49         | 62.22         | 62.52          | 62.74          |
| fconv_enc7 | 33     | ~5 hrs        | 66.40         | 65.52         | 65.80          | 65.96          |
| fconv_dec5 | 28     | ~5 hrs        | 65.65         | 64.71         | 64.91          | 64.98          |
| blstm      | 30     | ~8 hrs        | 64.59         | 64.15         | 64.38          | 63.76          |
| convenc    | 47     | ~7 hrs        | 50.91         | 56.71         | 56.83          | 53.66          |
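The BLEU4 scores above are computed on the tokenized test set. As a rough illustration of what the metric measures, here is a minimal sketch of modified unigram precision, the core building block of BLEU (BLEU4 combines precisions for n = 1..4 with a brevity penalty); the sentences are made-up examples, not from the corpus:

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

cand = "the the cat sat".split()
ref = "the cat sat down".split()
print(modified_precision(cand, ref))  # 0.75: the extra "the" is clipped
```

The clipping is what penalizes degenerate outputs that repeat high-frequency reference words.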

Download

Pretrained models:

Install

Follow fairseq installation, then:

# Chinese tokenizer
$ pip install jieba

# English tokenizer
$ pip install nltk
$ mkdir -p ~/nltk_data/tokenizers/
$ wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip -O ~/nltk_data/tokenizers/punkt.zip
$ unzip ~/nltk_data/tokenizers/punkt.zip -d ~/nltk_data/tokenizers/

Additionally, we use scripts from Moses and subword-nmt:

git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt

Additional Setup

CUDA may require the CUPTI libraries to be symlinked onto the library path:

# Couldn't open CUDA library libcupti.so.8.0. LD_LIBRARY_PATH: /git/torch/install/lib:
$ sudo ln -s /usr/local/cuda-8.0/extras/CUPTI/lib64/* $LD_LIBRARY_PATH/

Preprocessing

Word Token

We tokenize the dataset using nltk.word_tokenize for English and jieba for Chinese word segmentation.

Casing

We lowercase the entire English corpus.

Merge blank lines

We note that the dataset often has blank lines. In some cases this is just formatting, but in others a long English sentence is translated into two Chinese sentences, which appears as a sentence followed by a blank line in the English corpus. To handle this, we merge the two Chinese sentences onto the same line and then remove the blank line from both corpora.

There are also formatting issues where the English corpus has a blank line while the Chinese corpus has a single `.`. We treat both as blank lines and remove them.
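A minimal sketch of this merging step (a hypothetical helper, not the repo's actual implementation): walk both corpora in parallel, and when the English side is blank or a lone `.`, either merge the Chinese line into the previous one or drop the pair:

```python
def merge_blank_lines(en_lines, zh_lines):
    """Merge zh lines whose en counterpart is blank (or a lone '.')
    into the previous zh line; drop pairs where both sides are blank."""
    merged_en, merged_zh = [], []
    for en, zh in zip(en_lines, zh_lines):
        en, zh = en.strip(), zh.strip()
        if en in ("", "."):
            if zh not in ("", ".") and merged_zh:
                # long English sentence translated as two Chinese
                # sentences: glue the second onto the previous line
                merged_zh[-1] += " " + zh
            # otherwise both sides are blank/'.': drop the pair
        else:
            merged_en.append(en)
            merged_zh.append(zh)
    return merged_en, merged_zh

en = ["A long sentence .", "", "Next sentence ."]
zh = ["第一句 。", "第二句 。", "下一句 。"]
print(merge_blank_lines(en, zh))
```

This sketch only handles blanks on the English side; the symmetric case would need the same treatment.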

Additional data cleaning

We note that further data-cleaning work could be added:

  • remove non-English/Chinese sentences (we spotted at least one Russian sentence).
  • remove (HTML?) markup.
  • remove non-breaking whitespace; `\xa0` was found and converted to an ordinary space.
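The last two cleanup steps could be sketched as follows (a hypothetical helper, not part of the repo's preprocessing scripts):

```python
import re

def clean_line(line):
    """Strip markup and non-breaking spaces, then normalize whitespace."""
    line = re.sub(r"<[^>]+>", " ", line)      # drop (HTML-like) markup tags
    line = line.replace("\xa0", " ")          # non-breaking space -> space
    line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
    return line

print(clean_line("a<b>bold</b>\xa0 text "))  # -> "a bold text"
```

Filtering out non-English/Chinese sentences would additionally require language identification, which this sketch does not attempt.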

Preprocessing pipeline

Preprocessing is run by wmt17_prepare.sh, which does the following:

  1. We download, unzip, tokenize, and clean the dataset in preprocess/wmt.py.

  2. We learn a subword vocabulary with learn_bpe and apply it to the corpora with apply_bpe.

  3. Then we preprocess the datasets to binary format using fairseq preprocess.

Training

Run wmt17_train.sh which does the following:

$ DATADIR=data-bin/wmt17_en_zh
$ TRAIN=trainings/wmt17_en_zh

# Standard bi-directional LSTM model
$ mkdir -p $TRAIN/blstm
$ fairseq train -sourcelang en -targetlang zh -datadir $DATADIR \
    -model blstm -nhid 512 -dropout 0.2 -dropout_hid 0 -optim adam -lr 0.0003125 \
    -savedir $TRAIN/blstm

# Fully convolutional sequence-to-sequence model
$ mkdir -p $TRAIN/fconv
$ fairseq train -sourcelang en -targetlang zh -datadir $DATADIR \
    -model fconv -nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.25 -clip 0.1 \
    -momentum 0.99 -timeavg -bptt 0 -savedir $TRAIN/fconv

# Convolutional encoder, LSTM decoder
$ mkdir -p trainings/convenc
$ fairseq train -sourcelang en -targetlang zh -datadir $DATADIR \
    -model conv -nenclayer 6 -dropout 0.2 -dropout_hid 0 -savedir trainings/convenc

Generate

Run wmt17_generate.sh, or run generate-lines as follows:

$ DATADIR=data-bin/wmt17_en_zh

# Optional: optimize for generation speed (fconv only)
$ fairseq optimize-fconv -input_model trainings/fconv/model_best.th7 -output_model trainings/fconv/model_best_opt.th7

$ fairseq generate-lines -sourcedict $DATADIR/dict.en.th7 -targetdict $DATADIR/dict.zh.th7 -path trainings/fconv/model_best_opt.th7 -beam 10 -nbest 2
# you actually have to implement the solution
# <unk> 实际上 必须 实施 解决办法 。

$ fairseq generate-lines -sourcedict $DATADIR/dict.en.th7 -targetdict $DATADIR/dict.zh.th7 -path trainings/blstm/model_best.th7 -beam 10 -nbest 2
# you actually have to implement the solution
# <unk> , 这些 方案 必须 非常 困难 

$ fairseq generate-lines -sourcedict $DATADIR/dict.en.th7 -targetdict $DATADIR/dict.zh.th7 -path trainings/convenc/model_best.th7 -beam 10 -nbest 2
# you actually have to implement the solution
# <unk> 这种 道德 又 能 实现 这些 目标 。 


References

@article{gehring2017convs2s,
  author          = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
  title           = "{Convolutional Sequence to Sequence Learning}",
  journal         = {ArXiv e-prints},
  archivePrefix   = "arXiv",
  eprinttype      = {arxiv},
  eprint          = {1705.03122},
  primaryClass    = "cs.CL",
  keywords        = {Computer Science - Computation and Language},
  year            = 2017,
  month           = May,
}
@article{gehring2016convenc,
  author          = {Gehring, Jonas and Auli, Michael and Grangier, David and Dauphin, Yann N},
  title           = "{A Convolutional Encoder Model for Neural Machine Translation}",
  journal         = {ArXiv e-prints},
  archivePrefix   = "arXiv",
  eprinttype      = {arxiv},
  eprint          = {1611.02344},
  primaryClass    = "cs.CL",
  keywords        = {Computer Science - Computation and Language},
  year            = 2016,
  month           = Nov,
}

License

fairseq is licensed under the terms of its original repo.

Pretrained models in this repo are BSD-licensed.

fairseq-zh-en's People

Contributors

twairball


fairseq-zh-en's Issues

Pretrained model cannot be loaded?

[root@localhost fairseq-zh-en]# ./wmt17_generate.sh
optimizing fconv for decoding
decoding to tmp/wmt17_en_zh/fconv_test
/root/torch/install/bin/luajit: .../install/share/lua/5.1/fairseq/models/ensemble_model.lua:134: inconsistent tensor size, expected r_ [10 x 33859], t [10 x 33859] and src [10 x 20490] to have the same number of elements, but got 338590, 338590 and 204900 elements respectively at /root/torch/pkg/torch/lib/TH/generic/THTensorMath.c:887
stack traceback:
[C]: in function 'add'
.../install/share/lua/5.1/fairseq/models/ensemble_model.lua:134: in function 'generate'
...torch/install/share/lua/5.1/fairseq/scripts/generate.lua:213: in main chunk
[C]: in function 'require'
...install/lib/luarocks/rocks/fairseq-cpu/scm-1/bin/fairseq:17: in main chunk
[C]: at 0x004064f0
| [zh] Dictionary: 33859 types
| [en] Dictionary: 29243 types
| IndexedDataset: loaded data-bin/wmt17_en_zh with 2000 examples

Training Dataset

Hi, did you get the reported result by only training on the News Commentary v12 dataset (0.2 million pairs)? Because I saw your preprocess script only download the news dataset. However, I cannot reproduce your result, not even close.

Could you please provide more description of the dataset you used for training?

fairseq command not found

I have faced numerous directory search issues besides "fairseq" command not found. I have already installed https://github.com/pytorch/fairseq

Could anyone advise ?

[phung@archlinux fairseq-zh-en]$ ls
challenger.md data-bin 'merge blanks.ipynb' nltk_data README.md tmp wmt17_generate.sh wmt17_train.sh
data 'Dataset misaligned.ipynb' mosesdecoder preprocess subword-nmt trainings wmt17_prepare.sh
[phung@archlinux fairseq-zh-en]$ sh ./wmt17_prepare.sh
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
DEBUG:jieba:Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.834 seconds.
DEBUG:jieba:Loading model cost 0.834 seconds.
Prefix dict has been built succesfully.
DEBUG:jieba:Prefix dict has been built succesfully.
INFO:prepare:tokenizing: tmp/wmt17_en_zh/training/news-commentary-v12.zh-en.en
INFO:tokenizer: [0] nltk.word_tokenize: 1929 or 1989?

Traceback (most recent call last):
File "./preprocess/wmt.py", line 58, in
prepare.prepare_dataset(DATA_DIR, TMP_DIR, ds)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/fairseq/fairseq-zh-en/preprocess/prepare.py", line 79, in prepare_dataset
tokenized = tokenizer.tokenize_file(tmp_filepath)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/fairseq/fairseq-zh-en/preprocess/tokenizer.py", line 60, in tokenize_file
_tokenized = tokenize(line, is_sgm, is_zh, lower_case, delim)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/fairseq/fairseq-zh-en/preprocess/tokenizer.py", line 40, in tokenize
_tok = jieba.cut(_line.rstrip('\r\n')) if is_zh else nltk.word_tokenize(_line)
File "/usr/lib/python3.6/site-packages/nltk/tokenize/__init__.py", line 128, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/usr/lib/python3.6/site-packages/nltk/tokenize/__init__.py", line 94, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/lib/python3.6/site-packages/nltk/data.py", line 836, in load
opened_resource = _open(resource_url)
File "/usr/lib/python3.6/site-packages/nltk/data.py", line 954, in open
return find(path, path + ['']).open()
File "/usr/lib/python3.6/site-packages/nltk/data.py", line 675, in find
raise LookupError(resource_not_found)
LookupError:


Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

Searched in:
- '/home/phung/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/nltk_data'
- '/usr/share/nltk_data'
- '/usr/lib/nltk_data'
- ''


./wmt17_prepare.sh: line 12: ../mosesdecoder/scripts/training/clean-corpus-n.perl: No such file or directory
./wmt17_prepare.sh: line 13: ../mosesdecoder/scripts/training/clean-corpus-n.perl: No such file or directory
./wmt17_prepare.sh: line 14: ../mosesdecoder/scripts/training/clean-corpus-n.perl: No such file or directory
Encoding subword with BPE using ops=32000
./wmt17_prepare.sh: line 23: data/wmt17_en_zh/train.clean.en: No such file or directory
./wmt17_prepare.sh: line 24: data/wmt17_en_zh/train.clean.zh: No such file or directory
Applying vocab to training
./wmt17_prepare.sh: line 27: data/wmt17_en_zh/train.clean.en: No such file or directory
./wmt17_prepare.sh: line 28: data/wmt17_en_zh/train.clean.zh: No such file or directory
Generating vocab: vocab.32000.bpe.en
./wmt17_prepare.sh: line 32: ../subword-nmt/get_vocab.py: No such file or directory
cat: data/wmt17_en_zh/train.32000.bpe.en: No such file or directory
Generating vocab: vocab.32000.bpe.zh
./wmt17_prepare.sh: line 35: ../subword-nmt/get_vocab.py: No such file or directory
cat: data/wmt17_en_zh/train.32000.bpe.zh: No such file or directory
Applying vocab to valid
./wmt17_prepare.sh: line 39: data/wmt17_en_zh/valid.clean.en: No such file or directory
./wmt17_prepare.sh: line 40: data/wmt17_en_zh/valid.clean.zh: No such file or directory
Applying vocab to test
./wmt17_prepare.sh: line 44: data/wmt17_en_zh/test.clean.en: No such file or directory
./wmt17_prepare.sh: line 45: data/wmt17_en_zh/test.clean.zh: No such file or directory
Preprocessing datasets...
./wmt17_prepare.sh: line 52: fairseq: command not found
[phung@archlinux fairseq-zh-en]$

Error during tokenization

In news-commentary-v12.zh-en.en, around line 98,000 there is a passage of text in another language with a different encoding, which raises: UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 21: ordinal not in range(128)
How can this be resolved?

jieba tokenizer

Hello! Appreciate your work on this.

In the preprocess/process.py, you mentioned using Jieba for tokenizing -zh words but I don't see it implemented there. Could you help clarify?
