data_selection's Introduction

Data selection for machine translation

This repo contains code for experiments on data selection for machine translation. The model, data and experiments are described in our paper (https://arxiv.org/abs/2109.07591). The main focus of this repo is to explore the tradeoffs and interactions between data selection and fine-tuning on out-of-domain and in-domain data. The code is written in Python with Flax. The model is a vanilla transformer. The data comes from TensorFlow Datasets (tfds); we use the WMT data, specifically Paracrawl and News Commentary.

Dependencies

All dependencies are listed in requirements.txt. Models are implemented using the Flax/JAX libraries or Hugging Face Transformers. Data is sourced from TensorFlow Datasets (https://www.tensorflow.org/datasets/api_docs/python/tfds).

Files

The main runner is train.py; an example invocation is given below. There are two helper runners, clf_infer.py and compute_is.py, which compute selection scores using the Discriminative Classifier (DC) and Contrastive Data Selection (CDS), respectively.
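As a rough illustration of the two scoring rules, here is a minimal sketch under stated assumptions; it is not the code in clf_infer.py or compute_is.py. It assumes that CDS scores an example by the difference between its log-likelihood under an in-domain (fine-tuned) model and under the base model, and that DC uses a classifier's in-domain probability as the score. The function names and toy numbers are hypothetical.

```python
import math


def cds_score(logprob_in_domain: float, logprob_base: float) -> float:
    """Contrastive score: higher means the in-domain model prefers the example."""
    return logprob_in_domain - logprob_base


def dc_score(logit: float) -> float:
    """Classifier score: sigmoid of a logit, read as P(in-domain | example)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sentence-level log-likelihoods (hypothetical values):
# (log-prob under in-domain model, log-prob under base model).
examples = {
    "news-like sentence": (-12.0, -15.0),
    "web-crawl sentence": (-20.0, -18.0),
}
cds = {text: cds_score(lp_in, lp_base) for text, (lp_in, lp_base) in examples.items()}
```

Under either rule, a higher score marks an example as closer to the in-domain data, so selection amounts to keeping the highest-scoring examples.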

Example

python train.py --model_dir=models/ --dataset_name='newscommentary_paracrawl' \
--aux_eval_dataset='newscomment_eval_ft' \
--batch_size=128 --num_train_steps=15000 \
--emb_dim=512 --mlp_dim=2048 --num_heads=8 --paracrawl_size=4500000 \
--vocab_path='tokenizer/sentencepiece_model' --restore_checkpoints \
--data_dir='data/' --chkpts_to_keep=1 \
--checkpoint_freq=5000 --eval_frequency=100 \
--pretrained_model_dir='pretrained_models/' --save_checkpoints=False \
--is_scores_path='scores/scores.csv' --data_selection_size=5e5 --compute_bleu=False

Note: If there is no tokenizer, one will be created. data_dir must already be populated. You can download and prepare the data using the TensorFlow Datasets builder (https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetBuilder). The scores.csv file contains the selection score for each example in the dataset.
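The relationship between a scores file and a selection size can be sketched as follows. This is a sketch under assumptions, not the repository's loader: it assumes the file holds one score per row in dataset order and that higher scores are better. select_top_k is a hypothetical helper, and the in-memory CSV stands in for scores/scores.csv.

```python
import csv
import io


def select_top_k(score_rows, k):
    """Return indices of the k highest-scoring examples, in dataset order."""
    scores = [float(row[0]) for row in score_rows if row]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# In-memory stand-in for the scores file (assumed layout: one score per row).
fake_csv = io.StringIO("0.2\n1.5\n-0.7\n0.9\n")
selected = select_top_k(csv.reader(fake_csv), k=2)
```

With the example command above, k would presumably correspond to --data_selection_size=5e5, that is, 500,000 examples.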

Citations

@article{iter2021complementarity,
  title={On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation},
  author={Iter, Dan and Grangier, David},
  journal={arXiv preprint arXiv:2109.07591},
  year={2021},
  url={https://arxiv.org/abs/2109.07591}
}

This code branches from the Flax WMT example: https://github.com/google/flax/tree/master/examples/wmt

@software{flax2020github,
  author = {Jonathan Heek and Anselm Levskaya and Avital Oliver and Marvin Ritter and Bertrand Rondepierre and Andreas Steiner and Marc van {Z}ee},
  title = {{F}lax: A neural network library and ecosystem for {JAX}},
  url = {http://github.com/google/flax},
  version = {0.3.4},
  year = {2020},
}
