
tardis's Introduction

Tardis

Ensemble Seq2Seq neural machine translation model running on PySpark using Elephas

An ensemble of the neural machine translation model from Sequence to Sequence Learning with Neural Networks by Sutskever et al. [1], trained over PySpark using Elephas. We assess the effectiveness of our model on the EN-FR and EN-DE datasets from WMT-14.

Prerequisites

  • Keras >= 2.2.4
  • Elephas >= 0.4
  • Pandas >= 0.23.4

Getting started

  • Download the en_de dataset and place it under data/datasets/en_de.

  • Repeat the same process for the en_vi dataset under data/datasets/en_vi.

  • Download the FastText WikiText embeddings for English, German and Vietnamese.

  • To run the single-node Seq2Seq model on a GPU, issue the following command from the project root directory:

    • python -m lib.model --gpu <gpu_no> --dataset <lang_pair> --batch-size <batch_size>
  • To run the single-node TinySeq2Seq model on a CPU, issue the following command from the project root directory:

    • python -m lib.model --cpu [--ensemble] --dataset <lang_pair> --batch-size <batch_size>
  • To run the TinySeq2Seq ensemble on multiple nodes:

    • Generate the egg file (this must be rerun after every code change): python setup.py bdist_egg
    • Issue the following command from the project root directory (WIP):
    • spark-submit --driver-memory 1G -m lib/model/__main__.py --cpu [--ensemble] --dataset <lang_pair> --batch-size <batch_size> --recurrent-unit gru

Note: Beam search is used by default during testing. Add the flag --beam-size 0 to use greedy search.
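The default decoding strategy above can be illustrated with a toy decoder: greedy search is simply beam search with a beam size of 1. The score_fn interface, tokens, and probabilities below are illustrative, not the project's implementation.

```python
# Toy beam-search decoder: keeps the beam_size lowest-cost (highest
# probability) hypotheses at each step. With beam_size=1 this degenerates
# to greedy search. score_fn and the <s>/</s> tokens are illustrative.
import heapq
import math

def beam_search(score_fn, start, beam_size, max_len, eos):
    """score_fn(seq) -> {token: prob} over next tokens; returns best sequence."""
    beams = [(0.0, [start])]                       # (neg log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for cost, seq in beams:
            if seq[-1] == eos:                     # finished hypothesis
                candidates.append((cost, seq))
                continue
            for tok, p in score_fn(seq).items():
                candidates.append((cost - math.log(p), seq + [tok]))
        beams = heapq.nsmallest(beam_size, candidates)
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams[0][1]
```

A wider beam can recover a higher-probability sequence that greedy search misses when the locally best first token leads to low-probability continuations.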

References

[1] Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).

[2] Luong, M.-T., Pham, H. and Manning, C.D., 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP).

tardis's People

Contributors

achyudh, zeynepakkalyoncu, karkaroff


tardis's Issues

Pickle inputs for faster loading times

Is your feature request related to a problem? Please describe.
Since we pre-process the inputs every time we run the script, loading and pre-processing the data takes around an hour.

Describe the solution you'd like
Pickle all inputs after pre-processing and store them to disk. Add a data/cache module to load them from disk.
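The proposed cache could be sketched as follows; the data/cache path matches the issue, while the function name and layout are illustrative.

```python
# Sketch: cache preprocessed inputs to disk so later runs skip the
# hour-long preprocessing step. Function names are illustrative.
import os
import pickle

CACHE_DIR = "data/cache"

def load_or_preprocess(name, preprocess_fn):
    """Return cached inputs if present, else preprocess and cache them."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_path = os.path.join(CACHE_DIR, name + ".pkl")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)          # fast path: load pickled inputs
    data = preprocess_fn()                 # slow path: full preprocessing
    with open(cache_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```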

Describe alternatives you've considered
We considered multi-processing to pre-process the input in parallel, but this doesn't result in considerably lower loading times due to single-threaded bottlenecks.


Fix targets used for computing test BLEU score

Describe the bug
We are currently using pre-processed targets with replaced UNKs for computing the BLEU score.

To Reproduce
Run the single-node Seq2Seq model on a GPU by issuing the following command from the project root directory: python -m lib.model --epochs 1 --dataset en_vi --devices 0,1 --batch-size 32 --num-layers 2 --vocab-size 10000

Expected behavior
Raw target sequences should be stored and passed to the BLEU score method instead.

Use swiftapply to speed up Pandas apply operations

Is your feature request related to a problem? Please describe.
The current pre-processing script is too slow for the entire dataset because it is single-threaded.

Describe the solution you'd like
Swifter (https://github.com/jmcarpenter2/swifter) provides swiftapply, a drop-in replacement for Pandas apply that automatically decides whether it is faster to use Dask parallel processing or a simple Pandas apply.
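The change could look roughly like the sketch below, which falls back to plain Pandas apply when swifter is not installed. The replace_unk function and the column name are illustrative, not the project's actual pre-processing code.

```python
# Sketch: use swifter's drop-in apply when available, otherwise fall back
# to plain pandas apply. replace_unk is an illustrative preprocessing step.
import pandas as pd

try:
    import swifter  # noqa: F401 -- registers the .swifter accessor on Series
    HAS_SWIFTER = True
except ImportError:
    HAS_SWIFTER = False

def replace_unk(sentence, vocab):
    """Replace out-of-vocabulary tokens with <unk>."""
    return " ".join(w if w in vocab else "<unk>" for w in sentence.split())

def preprocess(df, vocab):
    col = df["text"]
    # swifter decides between Dask parallelism and a plain apply
    apply_fn = col.swifter.apply if HAS_SWIFTER else col.apply
    return apply_fn(lambda s: replace_unk(s, vocab))
```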

Add gradient checkpointing

Is your feature request related to a problem? Please describe.
VRAM usage is currently too high, and running a model with a vocabulary size above 5k results in out-of-memory (OOM) errors.

Describe the solution you'd like
Gradient checkpointing allows fitting of 10x larger models onto a GPU, at only a 20% increase in computation time.
Step 1: Port code over to tf.keras from standalone keras
Step 2: Overwrite the gradients function registered in the Keras backend with the memory-saving version:

# memory_saving_gradients comes from the openai/gradient-checkpointing package
import memory_saving_gradients
from tensorflow.python.keras._impl.keras import backend as K
K.__dict__["gradients"] = memory_saving_gradients.gradients_memory

Describe alternatives you've considered
We are already scaling the model across 2 GPUs for the encoder and decoder. This helps, but we need to reduce VRAM consumption to increase the vocabulary size.

Add model checkpointing

Is your feature request related to a problem? Please describe.
We need to checkpoint models every epoch so that we can choose the best model for testing.

Describe the solution you'd like
Add a model checkpointing callback to Seq2Seq.
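In Keras this is what the ModelCheckpoint callback provides. The framework-free sketch below illustrates the logic the issue asks for: save the model after every epoch and remember the checkpoint with the best validation loss for testing. The class name, path format, and save_fn hook are illustrative.

```python
# Sketch of per-epoch checkpointing: save every epoch and track the best
# validation loss so the best checkpoint can be selected at test time.
import math

class BestCheckpointer:
    def __init__(self, save_fn):
        self.save_fn = save_fn          # e.g. lambda path: model.save(path)
        self.best_loss = math.inf
        self.best_path = None

    def on_epoch_end(self, epoch, val_loss):
        path = "checkpoints/seq2seq_epoch%02d.h5" % epoch
        self.save_fn(path)              # always keep per-epoch weights
        if val_loss < self.best_loss:   # remember the best model for testing
            self.best_loss = val_loss
            self.best_path = path
```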
