
Natural Language Processing

Post with useful links: Transformers are GNNs

Index

Theory
Applications

Resources


Theory

🛠 Pipeline

  1. Preprocess
    • Tokenization: Split the text into sentences and the sentences into words.
    • Lowercasing: Usually done during tokenization.
    • Punctuation removal: Remove symbols like ., ,, :. Usually done during tokenization.
    • Stopwords removal: Remove very frequent words like and, the, him. Common in traditional pipelines, less so in recent ones.
    • Lemmatization: Reduce words to their dictionary form (lemma): organizes, will organize, organizing → organize. More accurate.
    • Stemming: Chop words down to a root form (stem): democratic, democratization → democracy. Faster but cruder.
  2. Extract features
    • Document features
      • Bag of Words (BoW): Counts how many times each word appears in a text (counts can be normalized by text length). See the sketch after this list.
      • TF-IDF: Weighs each word by how relevant it is to a document in the corpus, not just its raw frequency like BoW.
      • N-gram: Probability of N words together.
      • Sentence and document vectors. paper2014, paper2017
    • Word features
      • Word Vectors: Unique representation for every word (independent of its context).
        • Word2Vec: By Google in 2013
        • GloVe: By Stanford
        • FastText: By Facebook
      • Contextualized Word Vectors: Good for polysemous words (their meaning depends on context).
        • CoVe: By Salesforce in 2017
        • ELMo: Done with bidirectional LSTMs. By the Allen Institute in 2018
        • Transformer encoder: Done with self-attention. ⭐
  3. Build model
    • Bag of Embeddings
    • Linear algebra/matrix decomposition
      • Latent Semantic Analysis (LSA) that uses Singular Value Decomposition (SVD).
      • Non-negative Matrix Factorization (NMF)
      • Latent Dirichlet Allocation (LDA): Good for BoW
    • Neural nets
      • Recurrent NNs decoder (LSTM, GRU)
      • Transformer encoder or decoder (BERT, GPT, ...) ⭐
    • Hidden Markov Models
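
Below is a minimal sketch of steps 1 and 2 of the pipeline (preprocess + document features) using NLTK and scikit-learn; the toy corpus and variable names are purely illustrative.

```python
# Sketch: preprocessing with NLTK, BoW / TF-IDF features with scikit-learn.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# nltk.download("punkt"); nltk.download("wordnet")   # may be needed on first run

corpus = ["I like apples.", "I like oranges.", "I do not like broccoli."]

# 1. Preprocess: tokenize (lowercased), then reduce each word to a root form.
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
tokens = [nltk.word_tokenize(doc.lower()) for doc in corpus]
stems  = [[stemmer.stem(t) for t in doc] for doc in tokens]          # fast but crude
lemmas = [[lemmatizer.lemmatize(t) for t in doc] for doc in tokens]  # slower, dictionary form

# 2. Extract document features.
bow   = CountVectorizer().fit_transform(corpus)   # Bag of Words: raw word counts
tfidf = TfidfVectorizer().fit_transform(corpus)   # TF-IDF: counts re-weighted by rarity
print(bow.toarray(), tfidf.toarray(), sep="\n")
```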

Others

  • Regular expressions: (Regex) Find patterns.
  • Parse trees: Syntax of a sentence.

🔤 Tokenization: The input representation

  • Character tokenization
  • Subword tokenization: The best option, used in recent models (see the sketch below). ⭐
  • Word tokenization: Used in traditional NLP.

Figure: BPE tokenization of the word _subwords.
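
A quick way to see subword tokenization in practice is the pretrained GPT-2 BPE tokenizer from 🤗 transformers (this assumes the pretrained vocabulary can be downloaded; the example words are arbitrary).

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
# Rare or long words get split into known subword pieces, e.g. something like ['sub', 'words'].
print(tok.tokenize("subwords"))
print(tok.tokenize("tokenization"))
```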

N-gram

Probability of N words together. Read this.

Example

Toy corpus:

  • <start> I like apples <end>
  • <start> I like oranges <end>
  • <start> I do not like broccoli <end>

Then:

  • P(<start> I like) = P(I | <start>) * P(like | I) = 1 * 0.66 = 0.66
  • P(<start> I like apples) = P(I | <start>) * P(like | I) * P(apples | like) = 1 * 0.66 * 0.33 ≈ 0.22 (like is followed by apples in only one of its three occurrences)
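
These probabilities can be reproduced by simply counting bigrams; a small sketch (variable names are illustrative):

```python
from collections import Counter

corpus = [
    "<start> I like apples <end>",
    "<start> I like oranges <end>",
    "<start> I do not like broccoli <end>",
]
tokens  = [s.split() for s in corpus]
context = Counter(w for sent in tokens for w in sent[:-1])                    # counts of each context word
bigram  = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))  # counts of word pairs

def p(word, given):
    """Maximum-likelihood estimate of P(word | given)."""
    return bigram[(given, word)] / context[given]

print(p("I", "<start>"))                                          # 3/3 = 1.00
print(p("like", "I"))                                             # 2/3 ≈ 0.66
print(p("I", "<start>") * p("like", "I") * p("apples", "like"))   # ≈ 0.22
```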

🔮 Recurrent & Convolutional models

  • RNN: Recurrent Nets. Tokens cannot be processed in parallel ☹️
    • GRU
    • LSTM
      • AWD-LSTM: regular LSTM with tuned dropout hyper-parameters.
  • CNN: Convolutional Nets. Tokens are processed in parallel 🙂

  • Tricks
    • Teacher forcing: Feed the decoder the correct previous word instead of its own predicted previous word (at the beginning of training).
    • Attention: Learns weights to perform a weighted average of the word embeddings (a minimal sketch follows below).
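
A minimal NumPy sketch of attention as a weighted average (here scaled dot-product attention over a set of word embeddings; shapes and names are illustrative):

```python
import numpy as np

def attention(q, K, V):
    """q: (d,), K and V: (n, d). Returns a weighted average of the rows of V."""
    scores = K @ q / np.sqrt(q.shape[0])              # similarity of the query with each word
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
    return weights @ V                                # weighted average of the embeddings

n, d = 5, 8                       # 5 words, embedding size 8
K = V = np.random.randn(n, d)     # keys/values: the word embeddings
q = np.random.randn(d)            # query, e.g. the current decoder state
print(attention(q, K, V).shape)   # (8,)
```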

🔮 Transformers models

|  | Self-Attention (Transformer Encoder) | Masked Self-Attention (Transformer Decoder) |
|---|---|---|
| Advantage | Context on both sides | Auto-regression |
| Pretraining | Bidirectional LM (better) | Unidirectional LM |
| Examples | BERT | GPT, GPT-2 |
| Best one | ALBERT? | T5, Meena? |
| Applications | Classification | Text generation |

Notes

  • Auto-regression means each generated token is fed back as input to predict the next one.
  • The original Transformer combines both an encoder and a decoder; most later models keep only one of the two.
  • Transformer-XL is a recurrent transformer decoder.
  • XLNet has both Context on both sides and Auto-Regression.
  • 🤗 Huggingface transformers is a package with pretrained transformer models (PyTorch & TensorFlow).
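
A hedged usage example of 🤗 transformers (recent versions, PyTorch backend): load a pretrained BERT encoder and get one contextualized vector per token.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("Transformers are great for NLP", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (batch, tokens, hidden_size)
print(hidden.shape)
```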
| Model | Creator | Date | Brief description | 🤗 |
|---|---|---|---|---|
| 1st Transformer | Google | Jun. 2017 | Transformer encoder & decoder | |
| ULMFiT | Fast.ai | Jan. 2018 | Regular LSTM | |
| ELMo | AllenNLP | Feb. 2018 | Bidirectional LSTM | |
| GPT | OpenAI | Jun. 2018 | Transformer decoder on LM | |
| BERT | Google | Oct. 2018 | Transformer encoder on MLM (& NSP) | |
| Transformer-XL | Google | Jan. 2019 | Recurrent transformer decoder | |
| XLM/mBERT | Facebook | Jan. 2019 | Multilingual LM | |
| Transf. ELMo | AllenNLP | Jan. 2019 | | |
| GPT-2 | OpenAI | Feb. 2019 | Good text generation | |
| ERNIE | Baidu | Apr. 2019 | | |
| ERNIE | Tsinghua | May 2019 | Transformer with Knowledge Graph | |
| XLNet | Google | Jun. 2019 | BERT + Transformer-XL | |
| RoBERTa | Facebook | Jul. 2019 | BERT without NSP | |
| DistilBERT | Hug. Face | Aug. 2019 | Compressed BERT | |
| MiniBERT | Google | Aug. 2019 | Compressed BERT | |
| MultiFiT | Fast.ai | Sep. 2019 | Multi-lingual ULMFiT (QRNN) post | |
| CTRL | Salesforce | Sep. 2019 | Controllable text generation | |
| MegatronLM | Nvidia | Sep. 2019 | Big models with parallel training | |
| ALBERT | Google | Sep. 2019 | Reduced BERT params (param sharing) | |
| DistilGPT-2 | Hug. Face | Oct. 2019 | Compressed GPT-2 | |
| T5 | Google | Oct. 2019 | Text-to-Text Transfer Transformer | |
| ELECTRA | ? | Dec. 2019 | An efficient LM pretraining | |
| Reformer | Google | Jan. 2020 | The Efficient Transformer | |
| Meena | Google | Jan. 2020 | A Human-like Open-Domain Chatbot | |
| Model | 2L | 3L | 6L | 12L | 18L | 24L | 36L | 48L | 54L | 72L |
|---|---|---|---|---|---|---|---|---|---|---|
| 1st Transformer | | | yes | | | | | | | |
| ULMFiT | | yes | | | | | | | | |
| ELMo | yes | | | | | | | | | |
| GPT | | | | 110M | | | | | | |
| BERT | | | | 110M | | 340M | | | | |
| Transformer-XL | | | | | 257M | | | | | |
| XLM/mBERT | | | | Yes | | Yes | | | | |
| Transf. ELMo | | | | | | | | | | |
| GPT-2 | | | | 117M | | 345M | 762M | 1542M | | |
| ERNIE | | | | Yes | | | | | | |
| XLNet | | | | 110M | | 340M | | | | |
| RoBERTa | | | | 125M | | 355M | | | | |
| MegatronLM | | | | | | 355M | | | 2500M | 8300M |
| DistilBERT | | | 66M | | | | | | | |
| MiniBERT | | Yes | | | | | | | | |
| ALBERT | | | | | | | | | | |
| CTRL | | | | | | | | 1630M | | |
| DistilGPT-2 | | | 82M | | | | | | | |

URGENT:

Transformer architecture

Transformer input

  1. Tokenizer: Create subword tokens. Methods: BPE...
  2. Embedding: Create vectors for each token. Sum of:
    • Token Embedding
    • Positional Encoding: Information about tokens order (e.g. sinusoidal function).
  3. Dropout
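
A PyTorch sketch of this input stage (token embedding + sinusoidal positional encoding, then dropout); vocabulary size and dimensions are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 30000, 512, 512

def sinusoidal_positions(max_len, d_model):
    """Classic sin/cos positional encoding: one d_model-dim vector per position."""
    pos = torch.arange(max_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

embed = nn.Embedding(vocab_size, d_model)
pos_enc = sinusoidal_positions(max_len, d_model)
dropout = nn.Dropout(0.1)

token_ids = torch.randint(0, vocab_size, (1, 10))              # (batch, seq_len) from the tokenizer
x = dropout(embed(token_ids) + pos_enc[: token_ids.size(1)])   # (1, 10, 512)
```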

Transformer blocks (6, 12, 24,...)

  1. Normalization
  2. Multi-head attention layer (with a left-to-right attention mask)
    • Each attention head uses self attention to process each token input conditioned on the other input tokens.
    • The left-to-right attention mask ensures that each position attends only to the positions that precede it (to its left).
  3. Normalization
  4. Feed forward layers:
    1. Linear H→4H
    2. GeLU activation func
    3. Linear 4H→H
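
A minimal PyTorch sketch of one such block in the pre-norm, decoder style described above (masked multi-head attention, then a feed-forward H→4H→H with GELU). It assumes a recent PyTorch where `nn.MultiheadAttention` supports `batch_first`; sizes are illustrative.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, h=512, heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(h)
        self.attn = nn.MultiheadAttention(h, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(h)
        self.ff = nn.Sequential(nn.Linear(h, 4 * h), nn.GELU(), nn.Linear(4 * h, h))

    def forward(self, x):
        n = x.size(1)
        # Left-to-right mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        a, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=mask)
        x = x + a                      # residual connection around attention
        x = x + self.ff(self.ln2(x))   # residual connection around feed-forward
        return x

print(Block()(torch.randn(1, 10, 512)).shape)   # (1, 10, 512)
```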

Transformer output

  1. Normalization
  2. Output embedding
  3. Softmax
  4. Label smoothing: The ground-truth target becomes 90% for the correct word, with the remaining 10% spread over the other words (see the sketch below).
  • Lowest layers: morphology
  • Middle layers: syntax
  • Highest layers: Task-specific semantics
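
A small sketch of the label-smoothing step above (90% on the correct word, the remaining 10% spread over the rest of the vocabulary); recent PyTorch versions also expose this directly via `nn.CrossEntropyLoss(label_smoothing=0.1)`.

```python
import torch

def smooth_targets(labels, vocab_size, eps=0.1):
    """labels: (batch,) ids of the correct words -> (batch, vocab_size) smoothed targets."""
    t = torch.full((labels.size(0), vocab_size), eps / (vocab_size - 1))
    t.scatter_(1, labels.unsqueeze(1), 1.0 - eps)   # put 1 - eps on the correct word
    return t

targets = smooth_targets(torch.tensor([2, 5]), vocab_size=10)
print(targets.sum(dim=1))   # each row sums to 1
```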

👨🏻‍🏫 Transfer Learning

| Step | Task | Data | Who does this? |
|---|---|---|---|
| 1 | Language Model Pretraining | 📚 Large text corpus (e.g. Wikipedia) | 🏭 Google or Facebook |
| 2 | Language Model Finetuning | 📗 Only your domain text corpus | 💻 You |
| 3 | Your supervised task | 📗🏷️ Your labeled domain text | 💻 You |

📉 Losses

  • Language modeling: project the hidden states onto the word-embedding matrix to get logits, then apply a cross-entropy loss on the portion of the target corresponding to the gold reply.
  • Next-sentence prediction: pass the hidden state of the last token (the end-of-sequence token) through a linear layer to get a score, then apply a cross-entropy loss to correctly classify the gold answer among distractors.
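
A hedged sketch of the language-modeling loss: project the hidden states onto the (tied) word-embedding matrix to get logits, then take the cross-entropy against the gold next tokens. All shapes and tensors here are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

batch, seq_len, hidden, vocab = 2, 10, 512, 30000
hidden_states = torch.randn(batch, seq_len, hidden)   # decoder output
embedding = torch.randn(vocab, hidden)                # word-embedding matrix (tied weights)

logits = hidden_states @ embedding.t()                # (batch, seq_len, vocab)
targets = torch.randint(0, vocab, (batch, seq_len))   # gold reply tokens
loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
print(loss.item())
```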

📏 Metrics

| Score | Description | Interpretation |
|---|---|---|
| Perplexity | For language modeling (LM). | The lower the better. |
| GLUE | An average of different scores for NLU. | The higher the better. |
| BLEU | For translation. Compares generated with reference sentences (N-gram overlap). | The higher the better. |
| RACE | ReAding Comprehension dataset collected from English Examinations. | The higher the better. |
| SQuAD | Stanford Question Answering Dataset. | The higher the better. |

BLEU limitation

"He ate the apple" and "He ate the potato" can receive exactly the same BLEU score against the same reference, even though they mean very different things.

BLEU at your own risk
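
A small illustration of this limitation with NLTK's `sentence_bleu`; the reference sentence here is made up for the example (it is not taken from the post).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["He", "ate", "the", "fruit"]   # hypothetical reference
smooth = SmoothingFunction().method1

print(sentence_bleu([reference], ["He", "ate", "the", "apple"], smoothing_function=smooth))
print(sentence_bleu([reference], ["He", "ate", "the", "potato"], smoothing_function=smooth))
# Both candidates get the same score: BLEU only checks n-gram overlap, not meaning.
```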


Applications

| Application | Description | Type |
|---|---|---|
| 🏷️ Part-of-speech tagging (POS) | Identify nouns, verbs, adjectives, etc. | 🔤 |
| 📍 Named entity recognition (NER) | Identify names, organizations, locations, medical codes, etc. | 🔤 |
| 👦🏻❓ Coreference resolution | Identify mentions of the same person/object, e.g. he, she. | 🔤 |
| 🔍 Text categorization | Identify topics present in a text (sports, politics, etc.). | 🔤 |
| Question answering | Answer questions about a given text (SQuAD, DROP datasets). | 💭 |
| 👍🏼👎🏼 Sentiment analysis | Positive or negative comment/review classification. | 💭 |
| 🔮 Language Modeling (LM) | Predict the next word. Unsupervised. | 💭 |
| 🔮 Masked Language Modeling (MLM) | Predict the omitted words. Unsupervised. | 💭 |
| 📗→📄 Summarization | Create a short version of a text. | 💭 |
| 🈯→🆗 Translation | Translate into a different language. | 💭 |
| 🆓→🆒 Chatbot | Interact in a conversation. | 💭 |
| 💁🏻→🔠 Speech recognition | Speech to text. See AUDIO cheatsheet. | 🗣️ |
| 🔠→💁🏻 Speech generation | Text to speech. See AUDIO cheatsheet. | 🗣️ |
  • 🔤: Natural Language Processing (NLP)
  • 💭: Natural Language Understanding (NLU)
  • 🗣️: Speech and sound (speak and listen)

🈯 Translation

📋 Summarization

🤖 Chatbot

Model backbone: Transformer decoder like GPT or GPT2 (pretrained for LM).

Input data

  1. Persona: One or several personality sentences. (BLUE)
  2. History: The history of the dialog. (PINK)
  3. Reply: The tokens of the current answer. (GREEN)

Embeddings

  • Word embedding: Information about word semantics.
  • Position embedding: Information about word order.
  • Segment embedding: Information about the segment type (persona, history or reply).

Double Heads Model for multi-task loss

  • One head for language modeling loss.
  • Other head for next-sentence classification loss.
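
A hedged sketch of how the two losses can be combined into one multi-task loss; the logits, labels and loss weights below are illustrative placeholders rather than values from a specific implementation.

```python
import torch
import torch.nn.functional as F

vocab, n_candidates = 30000, 2                 # gold reply + 1 distractor
lm_logits = torch.randn(1, 20, vocab)          # language-modeling head: one distribution per token
mc_logits = torch.randn(1, n_candidates)       # classification head: one score per candidate reply

lm_labels = torch.randint(0, vocab, (1, 20))   # gold reply tokens
mc_labels = torch.tensor([0])                  # index of the gold reply among the candidates

lm_loss = F.cross_entropy(lm_logits.view(-1, vocab), lm_labels.view(-1))
mc_loss = F.cross_entropy(mc_logits, mc_labels)
loss = 2.0 * lm_loss + 1.0 * mc_loss           # weighted sum of the two heads (weights are illustrative)
```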

References
