
tardis's Introduction

Tardis

Ensemble Seq2Seq neural machine translation model running on PySpark using Elephas

An ensemble of the neural machine translation model from Sequence to Sequence Learning with Neural Networks by Sutskever et al. [1], trained over PySpark using Elephas. We assess the effectiveness of our model on the EN-FR and EN-DE datasets from WMT-14.

Prerequisites

  • Keras >= 2.2.4
  • Elephas >= 0.4
  • Pandas >= 0.23.4

Getting started

  • Download the en_de dataset and place it under data/datasets/en_de.

  • Repeat the same process for the en_vi dataset under data/datasets/en_vi.

  • Download the FastText WikiText embeddings for English, German and Vietnamese.

  • To run the single-node Seq2Seq model on a GPU, issue the following command from the project root directory:

    • python -m lib.model --gpu <gpu_no> --dataset <lang_pair> --batch-size <batch_size>
  • To run the single-node TinySeq2Seq model on a CPU, issue the following command from the project root directory:

    • python -m lib.model --cpu [--ensemble] --dataset <lang_pair> --batch-size <batch_size>
  • To run the TinySeq2Seq ensemble on multiple nodes:

    • Generate the egg file (this must be rerun after every code change): python setup.py bdist_egg
    • Issue the following command from the project root directory (WIP):
    • spark-submit --driver-memory 1G -m lib/model/__main__.py --cpu [--ensemble] --dataset <lang_pair> --batch-size <batch_size> --recurrent-unit gru

Note: Beam search is used by default during testing. Add the flag --beam-size 0 to use greedy search.
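The default decoding strategy above can be illustrated with a toy decoder: greedy search is simply beam search with a beam size of 1. The score_fn interface, tokens, and probabilities below are illustrative, not the project's implementation.

```python
# Toy beam-search decoder: keeps the beam_size lowest-cost (highest
# probability) hypotheses at each step. With beam_size=1 this degenerates
# to greedy search. score_fn and the <s>/</s> tokens are illustrative.
import heapq
import math

def beam_search(score_fn, start, beam_size, max_len, eos):
    """score_fn(seq) -> {token: prob} over next tokens; returns best sequence."""
    beams = [(0.0, [start])]                       # (neg log-prob, sequence)
    for _ in range(max_len):
        candidates = []
        for cost, seq in beams:
            if seq[-1] == eos:                     # finished hypothesis
                candidates.append((cost, seq))
                continue
            for tok, p in score_fn(seq).items():
                candidates.append((cost - math.log(p), seq + [tok]))
        beams = heapq.nsmallest(beam_size, candidates)
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams[0][1]
```

A wider beam can recover a higher-probability sequence that greedy search misses when the locally best first token leads to low-probability continuations.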

References

[1] Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).

[2] Luong, M.-T., Pham, H. and Manning, C.D., 2015. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP).

tardis's People

Contributors

achyudh, zeynepakkalyoncu, karkaroff


tardis's Issues

Pickle inputs for faster loading times

Is your feature request related to a problem? Please describe.
Since we pre-process the inputs every time we run the script, loading and pre-processing the data takes around an hour.

Describe the solution you'd like
Pickle all inputs after pre-processing and store them to disk. Add a data/cache module to load them from disk.
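The proposed cache could be sketched as follows; the data/cache path matches the issue, while the function name and layout are illustrative.

```python
# Sketch: cache preprocessed inputs to disk so later runs skip the
# hour-long preprocessing step. Function names are illustrative.
import os
import pickle

CACHE_DIR = "data/cache"

def load_or_preprocess(name, preprocess_fn):
    """Return cached inputs if present, else preprocess and cache them."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_path = os.path.join(CACHE_DIR, name + ".pkl")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)          # fast path: load pickled inputs
    data = preprocess_fn()                 # slow path: full preprocessing
    with open(cache_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```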

Describe alternatives you've considered
We considered multi-processing to pre-process the input in parallel, but this doesn't result in considerably lower loading times due to single-threaded bottlenecks.


Fix targets used for computing test BLEU score

Describe the bug
We are currently using pre-processed targets with replaced UNKs for computing the BLEU score.

To Reproduce
Run the single-node Seq2Seq model on a GPU by issuing the following command from the project root directory: python -m lib.model --epochs 1 --dataset en_vi --devices 0,1 --batch-size 32 --num-layers 2 --vocab-size 10000

Expected behavior
Raw target sequences should be stored and passed to the BLEU score method instead.

Use swiftapply to speed up Pandas apply operations

Is your feature request related to a problem? Please describe.
The current pre-processing script is too slow for the entire dataset because it is single-threaded.

Describe the solution you'd like
Swifter (https://github.com/jmcarpenter2/swifter) provides swiftapply, a drop-in replacement for Pandas apply that automatically decides whether it is faster to use Dask parallel processing or a simple Pandas apply.
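The change could look roughly like the sketch below, which falls back to plain Pandas apply when swifter is not installed. The replace_unk function and the column name are illustrative, not the project's actual pre-processing code.

```python
# Sketch: use swifter's drop-in apply when available, otherwise fall back
# to plain pandas apply. replace_unk is an illustrative preprocessing step.
import pandas as pd

try:
    import swifter  # noqa: F401 -- registers the .swifter accessor on Series
    HAS_SWIFTER = True
except ImportError:
    HAS_SWIFTER = False

def replace_unk(sentence, vocab):
    """Replace out-of-vocabulary tokens with <unk>."""
    return " ".join(w if w in vocab else "<unk>" for w in sentence.split())

def preprocess(df, vocab):
    col = df["text"]
    # swifter decides between Dask parallelism and a plain apply
    apply_fn = col.swifter.apply if HAS_SWIFTER else col.apply
    return apply_fn(lambda s: replace_unk(s, vocab))
```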

Add gradient checkpointing

Is your feature request related to a problem? Please describe.
VRAM usage is currently too high, and running a model with a vocabulary size above 5k results in out-of-memory (OOM) errors.

Describe the solution you'd like
Gradient checkpointing allows fitting of 10x larger models onto a GPU, at only a 20% increase in computation time.
Step 1: Port code over to tf.keras from standalone keras
Step 2: Overwrite the gradients function registered in the Keras backend with the memory-saving version:

# memory_saving_gradients comes from the openai/gradient-checkpointing package
import memory_saving_gradients
from tensorflow.python.keras._impl.keras import backend as K
K.__dict__["gradients"] = memory_saving_gradients.gradients_memory

Describe alternatives you've considered
We are already scaling the model across 2 GPUs for the encoder and decoder. This helps, but we need to reduce VRAM consumption to increase the vocabulary size.

Add model checkpointing

Is your feature request related to a problem? Please describe.
We need to checkpoint models every epoch so that we can choose the best model for testing.

Describe the solution you'd like
Add a model checkpointing callback to Seq2Seq.
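In Keras this is what the ModelCheckpoint callback provides. The framework-free sketch below illustrates the logic the issue asks for: save the model after every epoch and remember the checkpoint with the best validation loss for testing. The class name, path format, and save_fn hook are illustrative.

```python
# Sketch of per-epoch checkpointing: save every epoch and track the best
# validation loss so the best checkpoint can be selected at test time.
import math

class BestCheckpointer:
    def __init__(self, save_fn):
        self.save_fn = save_fn          # e.g. lambda path: model.save(path)
        self.best_loss = math.inf
        self.best_path = None

    def on_epoch_end(self, epoch, val_loss):
        path = "checkpoints/seq2seq_epoch%02d.h5" % epoch
        self.save_fn(path)              # always keep per-epoch weights
        if val_loss < self.best_loss:   # remember the best model for testing
            self.best_loss = val_loss
            self.best_path = path
```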
