will-thompson-k / deeplearning-nlp-models

A small, interpretable codebase containing the re-implementation of a few "deep" NLP models in PyTorch. Colab notebooks to run with GPUs. Models: word2vec, CNNs, transformer, gpt.

License: MIT License

Topics: word2vec, nlp, deep-learning, embeddings, attention, transformer, pytorch, tutorials, nlp-papers, deeplearning-nlp-models

deeplearning-nlp-models's Introduction



A small, interpretable codebase containing the re-implementation of a few "deep" NLP models in PyTorch.

This is presented as an (incomplete) starting point for those interested in getting into the weeds of DL architectures in NLP. Annotated models are presented along with some notes.

There are links to run these models on Colab with GPUs via notebooks.

Current models: word2vec, CNNs, transformer, GPT. (Work in progress.)

Meta

BERT: Reading. Comprehending.

Note: These are toy versions of each model.

Contents

Models

These NLP models are presented chronologically and, as you might expect, build off each other.

Model Class  | Model                                                    | Year
Embeddings   | 1. Word2Vec Embeddings (Self-Supervised Learning)        | 2013
CNNs         | 2. CNN-based Text Classification (Binary Classification) | 2014
Transformers | 3. The O.G. Transformer (Machine Translation)            | 2017
Transformers | 4. OpenAI's GPT Model (Language Model)                   | 2018, 2019, 2020
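To make the self-supervised setup behind the first model concrete: Word2Vec's skip-gram objective trains each word to predict its neighbors within a context window. A minimal sketch of the pair generation (skipgram_pairs is a hypothetical illustrative helper, not a function from this repo):

```python
# Hypothetical helper (for illustration only): generate (center, context)
# skip-gram training pairs from a token sequence.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Look at neighbors up to `window` positions away, skipping the center itself.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
# Each adjacent pair appears in both directions, e.g. ("the", "cat") and ("cat", "the").
```

These (center, context) pairs are the "labels" the embedding model is trained on; no human annotation is needed, which is what makes the setup self-supervised.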

Features

This repository has the following features:

  • Model overviews: A brief overview of each model's motivation and design is provided in separate README.md files.
  • Jupyter notebooks (easy to run on Colab w/ GPUs): Notebooks showing how to run the models, along with some simple analyses of the results.
  • Self-contained: Tokenizers, dataset loaders, dictionaries, and all the custom utilities required for each problem.

Endgame

After reviewing these models, the world's your oyster in terms of other models to explore:

Char-RNN, BERT, ELMo, XLNet, all the other BERTs, BART, Performer, T5, etc.

Roadmap

Future models to implement:

  • Char-RNN (Karpathy)
  • BERT

Future repo features:

  • Tensorboard plots
  • Validation-set demonstrations
  • Saving/loading model checkpoints
  • BPE (from either openai/gpt-2 or Facebook's fairseq library)

Setup

You can install the repo using pip:

pip install git+https://github.com/will-thompson-k/deeplearning-nlp-models 

Structure

Here is a breakdown of the repository:

  • nlpmodels/models: The model code for each paper.
  • nlpmodels/utils: Contains all the auxiliary classes related to building a model, including datasets, vocabulary, tokenizers, samplers and trainer classes. (Note: Most of the non-model files are thrown into utils. I would advise against that in a larger repo.)
  • tests: Light (and by no means comprehensive) coverage.
  • notebooks: Contains the notebooks and write-ups for each model implementation.

A few useful commands:

  • make test: Run the full suite of tests (you can also use setup.py test and run_tests.sh).
  • make test_light: Run all tests except the regression tests.
  • make lint: If you really like linting code (also can run run_pylint.sh).

Requirements

Python 3.6+

Here are the package requirements (found in requirements.txt):

  • numpy==1.19.1
  • tqdm==4.50.2
  • torch==1.7.0
  • datasets==1.0.2
  • torchtext==0.8.0

Citation

@misc{deeplearning-nlp-models,
  author = {Thompson, Will},
  title = {deeplearning-nlp-models},
  url = {https://github.com/will-thompson-k/deeplearning-nlp-models},
  year = {2020}
}

License

MIT

deeplearning-nlp-models's People

Contributors

will-thompson-k

deeplearning-nlp-models's Issues

Subsampling of frequent words

I was looking through your implementation of subsampling of frequent words in https://github.com/will-thompson-k/deeplearning-nlp-models/blob/master/nlpmodels/utils/elt/skipgram_dataset.py#L68, specifically how you generate your sampling table in get_word_discard_probas(). It looks like your implementation differs slightly from the original paper's: https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L407.

Something like this worked for me if I pass a collections.Counter or dict with the item counts.

import numpy as np

def sampling_probabilities(item_counts, sample=1e-5):
    # item_counts: a collections.Counter or dict mapping item -> count
    counts = np.array(list(item_counts.values()), dtype=np.float64)
    total_count = counts.sum()
    # Keep-probability from the original word2vec C code:
    # p(keep) = (sqrt(f / (sample * total)) + 1) * (sample * total) / f
    probabilities = (np.sqrt(counts / (sample * total_count)) + 1) * (sample * total_count) / counts
    # Only useful if you wish to plot the probability distribution:
    # probabilities = np.minimum(probabilities, 1.0)
    return {k: probabilities[i] for i, k in enumerate(item_counts.keys())}

Using sample=1e-5 on one of my smaller datasets, I get around a 17% chance of keeping the most frequent item. This will of course vary a lot from dataset to dataset. There is a StackOverflow thread discussing the sampling: https://stackoverflow.com/questions/58772768/word2vec-subsampling-implementation.
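To see the behavior of that keep-probability formula, here is a small self-contained demo on toy counts (the counts and the sample value are made up for illustration):

```python
import numpy as np
from collections import Counter

# Toy counts: "the" dominates the corpus, "sat" is rare.
item_counts = Counter({"the": 900, "cat": 90, "sat": 10})
sample = 1e-3

counts = np.array(list(item_counts.values()), dtype=np.float64)
t = sample * counts.sum()
# Same keep-probability expression as in sampling_probabilities:
# p(keep) = (sqrt(f / t) + 1) * t / f
probs = dict(zip(item_counts, (np.sqrt(counts / t) + 1) * t / counts))
# Frequent items get a low keep probability; rare items are nearly always kept.
```

The effect is exactly the subsampling described in the word2vec paper: the most frequent item here is kept only about 3% of the time, while the rare one survives over 40% of the draws.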
