noisy-text's Introduction

Noisy-Text

Add noise to your text, inspired by Edunov et al. (2018), "Understanding Back-Translation at Scale"

Made at Qwant Research during my internship

It is often a good idea to add noise to your synthetic text data, for example when using backtranslation.

Edunov et al. (2018) showed that doing so can provide a stronger training signal.

This repository contains:

  • A script to reproduce the noising approach used in the experiments of Edunov et al. (2018)
  • A simple architecture, so you can play with the noise parameters or implement your own noise functions

Installation

Libraries you'll need to run the project:

  • tqdm

Clone the repo using

git clone https://github.com/valentinmace/noisy-text.git

Usage

I've implemented the 3 noise functions described in the paper:

  1. Delete words with a given probability (default is 0.1)
  2. Replace words with a filler token with a given probability (default is 0.1)
  3. Swap words within a limited range (default range is 3)

The default parameters reproduce the Edunov et al. (2018) experiments, but you can play with them and maybe find better values.
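As a rough illustration of the three steps above, here is a minimal Python sketch. It is not the code from add_noise.py: the function names, the `rng` parameter, and the rule that deletion always keeps at least one word are assumptions made for this example. The permutation uses one common trick (add a random offset in [0, range + 1) to each position, then sort), which keeps every word within `max_distance` positions of where it started.

```python
import random


def delete_words(words, probability=0.1, rng=random):
    """Drop each word independently with the given probability."""
    kept = [w for w in words if rng.random() >= probability]
    # Assumption of this sketch: never return an empty sentence.
    return kept or words[:1]


def replace_words(words, probability=0.1, filler_token="BLANK", rng=random):
    """Replace each word by the filler token with the given probability."""
    return [filler_token if rng.random() < probability else w for w in words]


def permute_words(words, max_distance=3, rng=random):
    """Shuffle words locally so that no word moves more than max_distance positions."""
    # Each position i gets the sort key i + U[0, max_distance + 1).
    keys = [i + rng.random() * (max_distance + 1) for i in range(len(words))]
    order = sorted(range(len(words)), key=keys.__getitem__)
    return [words[i] for i in order]


def add_noise(sentence, delete_probability=0.1, replace_probability=0.1,
              filler_token="BLANK", permutation_range=3, rng=random):
    """Apply deletion, replacement, then local permutation to one sentence."""
    words = sentence.split()
    words = delete_words(words, delete_probability, rng)
    words = replace_words(words, replace_probability, filler_token, rng)
    words = permute_words(words, permutation_range, rng)
    return " ".join(words)


if __name__ == "__main__":
    print(add_noise("the quick brown fox jumps over the lazy dog"))
```

Each function takes and returns a list of words, so in this sketch any one of them can also be called on its own.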

Example of simple usage

python add_noise.py data/example --progress

Example of complete usage

python add_noise.py data/example --delete_probability 0.9 --replace_probability 0.9 --filler_token 'MASK' --permutation_range 3

Important Note

If you apply a subword tool such as SentencePiece after adding noise to your corpus, note that your replacement token ('BLANK' by default) might be segmented into something like '▁B LAN K'.

I recommend making a pass over your corpus to correct this (adapt the command to your token and segmentation):

sed -i 's/▁B LAN K/▁BLANK/g' yourtextfile
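If you would rather not hard-code the exact segmentation, the same correction can be sketched in Python. This is only an illustration, not part of the repo: `rejoin_filler` is a hypothetical helper, and it assumes pieces are separated by single spaces and that '▁' marks the start of a word, as in the example above.

```python
import re


def rejoin_filler(line, filler_token="BLANK", boundary="▁"):
    """Glue a filler token back together after subword segmentation.

    Matches the word-boundary marker followed by the token's characters,
    with an optional space between any two of them, e.g. '▁B LAN K'.
    """
    pieces = " ?".join(re.escape(ch) for ch in filler_token)
    # The lookahead requires the match to end at a piece boundary, so a
    # longer word such as '▁B LAN KET' is left untouched.
    pattern = re.compile(re.escape(boundary) + " ?" + pieces + r"(?=\s|$)")
    return pattern.sub(lambda _: boundary + filler_token, line)


if __name__ == "__main__":
    print(rejoin_filler("▁I ▁saw ▁B LAN K ▁today"))
```

Unlike the sed command, this handles any split of the token, at the cost of being slower on very large files.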

Results

I've run NMT experiments on the WMT 2019 de-en corpus, using all available parallel data and adding the monolingual news-crawl 2018 data via backtranslation.

After translating it from German to English to obtain my synthetic data, I added noise to it using this repo, with the following results. All results are BLEU scores.

The first table reports a Transformer model identical to the "base model" in Vaswani et al. (2017); the second table reports a "Transformer Big" model from the same paper.

Model                     newstest2017   newstest2018
baseline                  26.62          40.47
backtranslation only      27.06          40.06
backtranslation + noise   27.88          41.92

Transformer base model

Model                     newstest2017   newstest2018
baseline                  29.75          45.8
backtranslation + noise   31.33          47.4

Transformer Big model

Notes

Do not hesitate to contact me if you need help, want a feature, or find a bug.

Contributions are welcome.

Meta

Valentin Macé – LinkedIn, YouTube, Twitter

Distributed under the MIT license. See LICENSE for more information.

noisy-text's Issues

Different if condition in delete and replace noise function

Thank you for this repo, it really helped me. But I have one question.
In add_noise.py (lines 43-46), you apply all 3 noise functions, using the output of one function as the input of the next (line) ... does it have to be like this, or can I simply use one noise function on its own? I think it could be (delete/replace + optional permutation) ... but I want to confirm.