
Code for my MSc Dissertation titled: "Robustness of Machine Translation for Low-Resource Languages."

Home Page: https://project-archive.inf.ed.ac.uk/msc/20215190/msc_proj.pdf

Topics: nlp, nmt, neural-machine-translation, fairseq, bpe-dropout, subword-nmt, subword-regularization, defensive-distillation, pytorch, mbart, transformer, rnn, nematus


Robustness of Machine Translation for Low-Resource Languages

Report

Abstract

It is becoming increasingly common for researchers and practitioners to rely on methods within the field of Neural Machine Translation (NMT) that require an extensive amount of auxiliary data. This is especially true for low-resource NMT, where the availability of large-scale corpora is limited. As a result, low-resource NMT without the use of supplementary data has received less attention. This work challenges the idea that modern NMT systems are poorly equipped for low-resource NMT by examining a variety of systems and techniques in simulated Finnish-English low-resource conditions. It shows that under certain low-resource conditions, the performance of the Transformer can be considerably improved via simple model compression and regularization techniques. In medium-resource settings, an optimized Transformer is shown to be competitive with language model fine-tuning, in both in-domain and out-of-domain conditions. In an attempt to further improve robustness to samples distant from the training distribution, this work explores subword regularization using BPE-Dropout, and defensive distillation. An optimized Transformer is found to be superior to subword regularization, whereas defensive distillation improves domain robustness on the domains most distant from the original training distribution. A small manual evaluation is carried out to assess the robustness of each system and technique in terms of adequacy and fluency. The results show that under some low-resource conditions, translations generated by most systems are in fact grammatical, but highly inadequate.

Install required libraries

./scripts/install_libraries.sh 

Download data

./scripts/download_data.sh data

Transformer preprocessing

Learn a truecaser on the full in-domain Europarl corpus

./scripts/transformer/preprocessing/truecase.sh
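This step presumably wraps the Moses truecaser; as a minimal sketch of what learning and applying a truecaser looks like (the paths, model and file names below are illustrative assumptions, not the repository's actual layout):

# Hypothetical paths; shown only to illustrate the truecasing step.
tools/mosesdecoder/scripts/recaser/train-truecaser.perl \
    --model model/truecase-model.en --corpus data/europarl.tok.en
tools/mosesdecoder/scripts/recaser/truecase.perl \
    --model model/truecase-model.en < data/train.tok.en > data/train.tc.en

The same two steps would be repeated for the Finnish side.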

In-domain Byte Pair Encoding

./scripts/transformer/preprocessing/preprocess.sh [experiment name] [corpus size] [number of bpe merge operations]
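Internally this corresponds to learning and applying BPE with subword-nmt; a minimal sketch, where the merge count and file names are illustrative (the real values come from the script arguments above):

# Illustrative merge count and file names; repeat for the English side.
subword-nmt learn-bpe -s 10000 < data/train.tc.fi > model/bpe.codes.fi
subword-nmt apply-bpe -c model/bpe.codes.fi < data/train.tc.fi > data/train.bpe.fi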

Out-of-domain Byte Pair Encoding

./scripts/transformer/preprocessing/preprocess_ood.sh [experiment name] [corpus size] [domain]
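For the out-of-domain sets the important detail is that the in-domain BPE codes are reused rather than relearned, roughly:

# Apply the in-domain codes to out-of-domain text (file names are illustrative).
subword-nmt apply-bpe -c model/bpe.codes.fi < data/test.ood.tc.fi > data/test.ood.bpe.fi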

Binarize

In-domain

./scripts/transformer/preprocessing/binarize_transformer.sh [experiment name] [corpus size]

Out-of-domain

./scripts/transformer/preprocessing/binarize_transformer_ood.sh [experiment name] [corpus size]
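Both binarization scripts build a Fairseq data-bin directory; a minimal sketch of the kind of command involved, with illustrative paths and an illustrative destination name:

fairseq-preprocess \
    --source-lang fi --target-lang en \
    --trainpref data/train.bpe --validpref data/valid.bpe --testpref data/test.bpe \
    --destdir data-bin/europarl-100k --workers 8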

BPE-Dropout

Copy the training corpus l=64 times

./scripts/transformer/preprocessing/copy_corpus.sh [corpus size]
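Copying the corpus is what allows a different BPE-Dropout segmentation to be sampled for each copy; the script presumably does something along these lines (file names are illustrative):

# Concatenate the training corpus with itself 64 times.
for i in $(seq 1 64); do
    cat data/train.tc.fi >> data/train.x64.fi
    cat data/train.tc.en >> data/train.x64.en
done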

Apply BPE-Dropout with p = 0.1

./scripts/transformer/preprocessing/preprocess_bpe_dropout.sh [experiment name] [corpus size]
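With subword-nmt, BPE-Dropout corresponds to the --dropout flag of apply-bpe; a minimal sketch with illustrative file names:

subword-nmt apply-bpe -c model/bpe.codes.fi --dropout 0.1 \
    < data/train.x64.fi > data/train.bpe-dropout.fi

Dropout is applied to the training data only; validation and test sets keep the standard deterministic segmentation.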

Binarize BPE-Dropout

In-domain

./scripts/transformer/preprocessing/binarize_bpe_dropout.sh [experiment name] [corpus size]

Out-of-domain

./scripts/transformer/preprocessing/binarize_bpe_dropout_ood.sh [experiment name] [corpus size]

Transformer Training and Evaluation

To train an individual model, see the scripts under scripts/transformer/training

To evaluate an individual model, see the scripts under scripts/transformer/evaluation

Find example Slurm scripts for training under scripts/transformer/training/slurm

Find example Slurm scripts for evaluation under scripts/transformer/evaluation/slurm
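As a rough illustration of what the training and evaluation scripts do, a Fairseq Transformer run and BLEU evaluation look approximately like the following; every hyperparameter and path here is illustrative rather than the configuration used in the dissertation:

fairseq-train data-bin/europarl-100k \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --dropout 0.3 --max-tokens 4096 \
    --save-dir checkpoints/transformer-100k

fairseq-generate data-bin/europarl-100k \
    --path checkpoints/transformer-100k/checkpoint_best.pt \
    --beam 5 --remove-bpe > gen.out
# Recover hypotheses in corpus order, then score (detruecasing/detokenization omitted for brevity).
grep ^H gen.out | sed 's/^H-//' | sort -n | cut -f3 > hyp.en
sacrebleu data/test.en < hyp.en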

Distillation

For distillation to work, you must first have trained a Transformer on one of the Europarl subsets by following the steps above.

To generate a distilled training set, see scripts/transformer/translate
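Sequence-level distillation amounts to translating the training source side with the trained teacher and using its output as the new target side; a minimal sketch with Fairseq, assuming the teacher checkpoint lives under checkpoints/teacher:

# Decode the binarized training set with the teacher model.
fairseq-generate data-bin/europarl-100k \
    --path checkpoints/teacher/checkpoint_best.pt \
    --gen-subset train --beam 5 --remove-bpe > teacher.train.out

# Extract the hypotheses in corpus order to form the distilled target side.
grep ^H teacher.train.out | sed 's/^H-//' | sort -n | cut -f3 > data/train.distilled.en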

To prepare the distilled training set for the student network:

./scripts/transformer/preprocessing/binarize_distillation.sh [experiment name] [corpus size]

./scripts/transformer/preprocessing/binarize_distillation_ood.sh [experiment name] [corpus size]

To train the student network, see scripts under scripts/transformer/training

To evaluate the student network, see scripts under scripts/transformer/evaluation

Distillation training

For distillation training to work with Fairseq, modify the upgrade_state_dict_named method of the TransformerDecoder class in /tools/fairseq/fairseq/models/transformer.py as follows:

def upgrade_state_dict_named(self, state_dict, name):
    # Keep the current weights for the decoder embedding table
    for k in state_dict.keys():
        if 'decoder.embed_tokens' in k:
            state_dict[k] = self.embed_tokens.weight

    """Upgrade a (possibly old) state dict for new versions of fairseq."""
    if isinstance(self.embed_positions, SinusoidalPositionalEmbedding):
        weights_key = "{}.embed_positions.weights".format(name)
        if weights_key in state_dict:
            del state_dict[weights_key]
        state_dict[
            "{}.embed_positions._float_tensor".format(name)
        ] = torch.FloatTensor(1)

    if f"{name}.output_projection.weight" not in state_dict:
        if self.share_input_output_embed:
            embed_out_key = f"{name}.embed_tokens.weight"
        else:
            embed_out_key = f"{name}.embed_out"
        if embed_out_key in state_dict:
            state_dict[f"{name}.output_projection.weight"] = state_dict[
                embed_out_key
            ]
            if not self.share_input_output_embed:
                del state_dict[embed_out_key]

    for i in range(self.num_layers):
        # update layer norms
        layer_norm_map = {
            "0": "self_attn_layer_norm",
            "1": "encoder_attn_layer_norm",
            "2": "final_layer_norm",
        }
        for old, new in layer_norm_map.items():
            for m in ("weight", "bias"):
                k = "{}.layers.{}.layer_norms.{}.{}".format(name, i, old, m)
                if k in state_dict:
                    state_dict[
                        "{}.layers.{}.{}.{}".format(name, i, new, m)
                    ] = state_dict[k]
                    del state_dict[k]

    version_key = "{}.version".format(name)
    if utils.item(state_dict.get(version_key, torch.Tensor([1]))[0]) <= 2:
        # earlier checkpoints did not normalize after the stack of layers
        self.layer_norm = None
        self.normalize = False
        state_dict[version_key] = torch.Tensor([1])

    return state_dict

This allows you to initialize the parameters of the student network using the parameters of the teacher model.
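In practice that means launching the student run from the teacher checkpoint while resetting the optimizer state; a hedged sketch (flags and paths are illustrative and may differ from the training scripts):

fairseq-train data-bin/distilled-100k \
    --arch transformer --share-decoder-input-output-embed \
    --restore-file checkpoints/teacher/checkpoint_best.pt \
    --reset-optimizer --reset-lr-scheduler --reset-dataloader --reset-meters \
    --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --save-dir checkpoints/student-100k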

mBART25

Installing the pretrained model

./scripts/mbart/get_pretrained_model.sh

Tokenization

./scripts/mbart/preprocessing/spm_tokenize.sh [corpus size]

./scripts/mbart/preprocessing/spm_tokenize_ood.sh
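These scripts tokenize with the SentencePiece model that ships with the mBART checkpoint; roughly (the model path and file names, including the mBART language-code suffixes, are assumptions):

SPM_MODEL=mbart.cc25/sentence.bpe.model
spm_encode --model=$SPM_MODEL < data/train.fi > data/train.spm.fi_FI
spm_encode --model=$SPM_MODEL < data/train.en > data/train.spm.en_XX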

Build a new dictionary based on the in-domain text used for fine-tuning

./scripts/mbart/build_vocab.sh [corpus size]

Prune the pre-trained model

./scripts/mbart/trim_mbart.sh

Binarize

./scripts/mbart/preprocessing/binarize.sh [corpus size]

./scripts/mbart/preprocessing/binarize_ood.sh [corpus size]
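Binarization for mBART reuses the trimmed dictionary on both sides together with mBART's language codes; a minimal sketch with illustrative paths (the name of the pruned dictionary is an assumption):

DICT=mbart.cc25/dict.trimmed.txt
fairseq-preprocess \
    --source-lang fi_FI --target-lang en_XX \
    --trainpref data/train.spm --validpref data/valid.spm --testpref data/test.spm \
    --destdir data-bin/mbart-100k \
    --srcdict $DICT --tgtdict $DICT --workers 8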

Training and Evaluation

For fine-tuning mBART25, see /scripts/mbart/finetune.sh

For evaluating mBART25, see /scripts/mbart/eval.sh and /scripts/mbart/eval_ood.sh

Find example slurm scripts for training and evaluation in /scripts/mbart/slurm
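For reference, fine-tuning with Fairseq's translation_from_pretrained_bart task looks roughly like the following; the hyperparameters and paths are illustrative, not the values used by finetune.sh, and the language list assumes the standard 25 mBART25 codes:

langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN

fairseq-train data-bin/mbart-100k \
    --task translation_from_pretrained_bart --langs $langs \
    --arch mbart_large --encoder-normalize-before --decoder-normalize-before \
    --source-lang fi_FI --target-lang en_XX \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
    --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
    --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \
    --max-tokens 1024 --update-freq 2 \
    --restore-file mbart.cc25/model.pt \
    --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
    --save-dir checkpoints/mbart-100k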

I encountered a bug when fine-tuning mBART25, which I fixed by modifying the __init__ method of the TranslationFromPretrainedBARTTask class in /tools/fairseq/fairseq/tasks/translation_from_pretrained_bart.py as follows:

def __init__(self, args, src_dict, tgt_dict):
    super().__init__(args, src_dict, tgt_dict)
    self.args = args  # required for mBART fine-tuning; can be removed otherwise
    self.langs = args.langs.split(",")
    for d in [src_dict, tgt_dict]:
        for l in self.langs:
            d.add_symbol("[{}]".format(l))
        d.add_symbol("<mask>")

RNN

Build network dictionaries

./scripts/rnn/jsonify.sh [experiment name] [corpus size]
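The script presumably wraps Nematus' dictionary builder, which writes one JSON vocabulary per side next to the input files; roughly (the tools path is an assumption):

# Produces data/train.bpe.fi.json and data/train.bpe.en.json.
python tools/nematus/data/build_dictionary.py data/train.bpe.fi data/train.bpe.en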

Training and Evaluation

See /scripts/rnn/train.sh and /scripts/rnn/translate.sh

Find example slurm scripts for training and evaluation in /scripts/rnn/slurm

