This repository is a PyTorch implementation of the Transformer. We also implement an attentional LSTM and compare the two models on machine translation. The MT data and preprocessing code are based on the Link. The Transformer code is based on the Link.
The repository contains two translation pairs: en-fr and en-zh. We provide example files to illustrate the expected path of each file. Please follow the steps in Setup to download the files and put them in their corresponding paths.
- Please download the French-English corpus.
- Unzip the corpus and rename europarl-v7.fr-en.en and europarl-v7.fr-en.fr to en.txt and fr.txt, respectively.
- Put en.txt and fr.txt in datasets/en_fr.
- Download the spaCy models needed for preprocessing. The full list of languages supported by spaCy is here.
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
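The corpus layout steps above can be sketched as a short shell snippet. It assumes the unzipped europarl-v7.fr-en.* files already sit in the current directory; adjust the paths if your download landed elsewhere.

```shell
# Lay out the corpus files where the repo expects them (datasets/en_fr/).
# Assumes the Europarl archive has already been downloaded and unzipped here.
mkdir -p datasets/en_fr
if [ -f europarl-v7.fr-en.en ]; then mv europarl-v7.fr-en.en datasets/en_fr/en.txt; fi
if [ -f europarl-v7.fr-en.fr ]; then mv europarl-v7.fr-en.fr datasets/en_fr/fr.txt; fi
```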
Run the code with bash run.sh. It will generate the following items in the results directory:
- model.cpu.pt: The model, saved so that it can be loaded in a CPU-only environment.
- model.pt: The model, which loads in a GPU or CPU environment depending on whether it was trained with a GPU.
- src.vocab: Source language vocabulary.
- tgt.vocab: Target language vocabulary.
Set the model_type variable to Transformer or LSTM.
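As a rough illustration of how that switch might look in code (a sketch only: make_transformer and make_lstm below are hypothetical stand-ins, not the repo's actual constructors):

```python
# Hypothetical sketch of dispatching on model_type; the two builder
# functions are illustrative stand-ins for the repo's real model classes.
def make_transformer():
    return "Transformer model"  # stand-in for the real constructor

def make_lstm():
    return "LSTM model"  # stand-in for the real constructor

def build_model(model_type: str):
    if model_type == "Transformer":
        return make_transformer()
    if model_type == "LSTM":
        return make_lstm()
    raise ValueError(f"unknown model_type: {model_type!r}")
```

Rejecting unknown values early makes a typo in model_type fail loudly instead of silently training the wrong model.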
We use the Back-translated news data released by WMT20 as our English-Chinese translation dataset. The dataset can be found here.
- Python: 3.8.10
- torch: 1.9.0
- torchtext: 0.10.0
- CUDA: 11.3
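One way to pin the Python package versions listed above is a requirements.txt (a suggestion; the repo does not necessarily ship this file):

```
torch==1.9.0
torchtext==0.10.0
```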