nlp_data_aug

Experiments with NLP Data Augmentation

Data Augmentation Techniques

(some ideas come from a thread in the fast.ai forums)

  • Noun, verb, and adjective replacement on the IMDB dataset, following the EDA paper
  • Pronoun replacement (coming soon)
  • Back translation: translating text into another language and then back to the original, so that the “noise” introduced by the round trip serves as augmented text
  • Surface pattern perturbation, which replaces the earlier character perturbation idea (more from Mike)
    • Because ULMFiT doesn't model characters, I will switch to surface patterns of word tokens instead. UNK token perturbation will be committed soon.
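
The replacement and surface-pattern ideas above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `SYNONYMS`, `replace_words`, and `perturb_surface` are hypothetical names, the synonym table is a toy stand-in for a real lexical resource such as WordNet, and the casing change is only one assumed example of a surface pattern.

```python
import random

# Toy synonym table (assumption): a real setup would look words up in a
# lexical resource such as WordNet and filter by part of speech.
SYNONYMS = {
    "movie": ["film", "picture"],
    "great": ["excellent", "superb"],
    "watch": ["see", "view"],
}


def replace_words(tokens, n, rng):
    """Replace up to n tokens that have an entry in SYNONYMS."""
    out = list(tokens)
    candidates = [i for i, tok in enumerate(out) if tok.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i].lower()])
    return out


def perturb_surface(tokens, p, rng):
    """Flip the casing of each alphabetic token with probability p."""
    out = []
    for tok in tokens:
        if tok.isalpha() and rng.random() < p:
            tok = tok.lower() if tok[0].isupper() else tok.capitalize()
        out.append(tok)
    return out


rng = random.Random(0)  # fixed seed so the augmented text is reproducible
review = "This movie was great , I would watch it again".split()
print(" ".join(replace_words(review, n=2, rng=rng)))
print(" ".join(perturb_surface(review, p=0.3, rng=rng)))
```

Passing an explicit `random.Random` instance keeps each augmentation run repeatable, which makes it easier to compare classifier results across techniques.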

Issues with the codebase

  • Is CUDA dropout non-deterministic? (more from Mike) Most cuDNN nondeterminism issues are solved almost implicitly by fastai.
  • Hyperparameters
    • from the ULMFiT paper (using pre-1.0 fastai):

      We use the AWD-LSTM language model (Merity et al., 2017a) with an embedding size of 400, 3 layers, 1150 hidden activations per layer, and a BPTT batch size of 70. We apply dropout of 0.4 to layers, 0.3 to RNN layers, 0.4 to input embedding layers, 0.05 to embedding layers, and weight dropout of 0.5 to the RNN hidden-to-hidden matrix. The classifier has a hidden layer of size 50. We use Adam with β1 = 0.7 instead of the default β1 = 0.9 and β2 = 0.99, similar to (Dozat and Manning, 2017). We use a batch size of 64, a base learning rate of 0.004 and 0.01 for fine-tuning the LM and the classifier respectively, and tune the number of epochs on the validation set of each task.

      On small datasets such as TREC-6, we fine-tune the LM only for 15 epochs without overfitting, while we can fine-tune longer on larger datasets. We found 50 epochs to be a good default for fine-tuning the classifier.

    • from fastai/course-v3/nbs/dl1/lesson3-imdb.ipynb (May 3, 2019)

      learner  bs  lr    bptt  wd    clip  drop_mult  to_fp16()
      lm       48  1e-2  70    0.01  None  0.3        No
      cf       48  2e-2  70    0.01  None  0.5        No
    • from fastai/fastai/examples/ULMFit.ipynb (June 11, 2019)

      Fine-tuning a forward and backward language model to get to 95.4% accuracy on the IMDB movie reviews dataset. This tutorial is done with fastai v1.0.53.

      The example was run on a Titan RTX (24 GB of RAM) so you will probably need to adjust the batch size accordingly. If you divide it by 2, don't forget to divide the learning rate by 2 as well in the following cells. You can also reduce the bptt a little to save a bit of memory.

      learner  bs   lr    bptt  wd    clip  drop_mult  to_fp16()
      lm       256  2e-2  80    0.1   0.1   1.0        Yes
      cf       128  1e-1  80    0.1†  None  0.5        No

      † The forward cf's wd used the default value of 0.01.

    • from fastai/course-nlp/nn-imdb-more.ipynb (June 12, 2019)

      learner  bs   lr          bptt  wd    clip  drop_mult  to_fp16()
      lm       128  1e-2*bs/48  70    0.01  None  1.0        Yes
      cf       128  2e-2*bs/48  70    0.01  None  0.5        Yes
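
The notes above scale the learning rate with the batch size: the ULMFit.ipynb text says to halve the learning rate when halving the batch size, and nn-imdb-more.ipynb writes the rate as `1e-2*bs/48`. A minimal sketch of that linear rule follows; `scaled_lr` is a hypothetical helper name, not a fastai function.

```python
def scaled_lr(base_lr, base_bs, bs):
    """Scale a learning rate linearly with the batch size."""
    if base_bs <= 0 or bs <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * bs / base_bs


# nn-imdb-more.ipynb: lm uses 1e-2*bs/48 and cf uses 2e-2*bs/48, with bs = 128.
lm_lr = scaled_lr(1e-2, base_bs=48, bs=128)
cf_lr = scaled_lr(2e-2, base_bs=48, bs=128)

# ULMFit.ipynb: halving the lm batch size (256 -> 128) halves lr (2e-2 -> 1e-2).
half_lr = scaled_lr(2e-2, base_bs=256, bs=128)

print(f"lm lr={lm_lr:.4f}  cf lr={cf_lr:.4f}  halved lr={half_lr:.4f}")
```

Keeping the reference pair (`base_lr`, `base_bs`) explicit makes it clear which notebook's settings a run was derived from when the batch size has to shrink to fit a smaller GPU.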

Tasks

  • Apply the data augmentation techniques in succession on the IMDB classification task and report performance
  • Use a baseline translation model from OpenNMT for the back-translation task
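
Both tasks can be framed as composing text-to-text functions. The sketch below is an assumption about how that could look, not the repository's implementation: `back_translate`, `in_succession`, and `toy_translator` are hypothetical names, and the dictionary-based translators merely stand in for real forward and backward translation models (for example, ones trained with OpenNMT).

```python
from functools import reduce
from typing import Callable, Dict

Augmenter = Callable[[str], str]


def back_translate(forward: Augmenter, backward: Augmenter) -> Augmenter:
    """Build an augmenter that translates to a pivot language and back."""
    return lambda text: backward(forward(text))


def in_succession(*steps: Augmenter) -> Augmenter:
    """Chain augmenters so each one receives the previous one's output."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


def toy_translator(table: Dict[str, str]) -> Augmenter:
    """Word-by-word lookup standing in for a real translation model."""
    return lambda text: " ".join(table.get(w, w) for w in text.split())


# Toy English -> French -> English tables (assumption). The round trip is
# lossy on purpose: "good" comes back as "fine" and "movie" as "film".
en_fr = toy_translator({"movie": "film", "good": "bon"})
fr_en = toy_translator({"film": "film", "bon": "fine"})

pipeline = in_succession(back_translate(en_fr, fr_en), str.lower)
print(pipeline("A good movie"))  # prints "a fine film"
```

Because every step shares the same `str -> str` shape, swapping the toy translators for real models, or adding another augmentation step, does not change the pipeline code.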

Misc

Contributors

the-asir, tianjianjiang
