
charlevelcnn's Introduction

charLevelCNN

Pretraining method for character-level CNNs - converts word embeddings into character-level embeddings

Methodology

Word-level embeddings are extremely common, and large organizations have released .vec files trained using Word2Vec/fasttext/GloVe on enormous datasets (like Common Crawl). This makes it easy to pre-seed a word-level model with reasonable word embeddings without needing the time or resources to train on a dataset of that size yourself.

But if instead you want to train a character-level model, pretraining isn't as easy.

This repository attempts to solve that problem by using pretrained word embeddings as input to train a shallow character-level CNN that can then be used as the bottom layers of a deeper model.

The model takes pairs of (word_vector_i, character_representation_j) from the Word2Vec/fastText/GloVe .vec file and predicts the cosine similarity between words i and j.
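
A rough sketch of how those training pairs might be assembled (the helper names and the exact sampling scheme here are assumptions, not the repository's actual code):

# Hypothetical sketch of the pair construction described above.
# Each example pairs the pretrained vector for word i with the character
# representation of word j; the target is cos(vec_i, vec_j).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def makePairs(wordVecs, charEncode, negRatio=4):
    # wordVecs: dict word -> np.array from the .vec file
    # charEncode: function word -> padded character matrix (maxCharLen x alphabetSize)
    vocab = list(wordVecs.keys())
    X_vec, X_char, y = [], [], []
    for w in vocab:
        # the word paired with its own character form (target cosine = 1.0)
        X_vec.append(wordVecs[w])
        X_char.append(charEncode(w))
        y.append(1.0)
        # negRatio randomly sampled other words, with their true cosine as the target
        for w2 in np.random.choice(vocab, size=negRatio, replace=False):
            X_vec.append(wordVecs[w])
            X_char.append(charEncode(w2))
            y.append(cosine(wordVecs[w], wordVecs[w2]))
    return np.array(X_vec), np.array(X_char), np.array(y)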

Results

With a 4-layer CNN architecture, a maximum word length of 20 characters, and a negative sampling ratio of 4:1, the model converged on the validation set after 90 epochs (patience=10).
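
For reference, a minimal Keras sketch of what such a shallow character-level CNN could look like (filter counts, kernel sizes, and the regression head are assumptions, not the exact architecture used here):

# Minimal sketch: 4 Conv1D layers over one-hot characters, pooled into a word
# embedding, then combined with the pretrained word vector to predict cosine similarity.
from keras.models import Model
from keras.layers import Input, Conv1D, GlobalMaxPooling1D, Dense, concatenate

maxCharLen, alphabetSize, embDim = 20, 40, 100

charIn = Input(shape=(maxCharLen, alphabetSize), name="char_input")
x = charIn
for nFilters in (64, 64, 128, 128):
    x = Conv1D(nFilters, kernel_size=3, padding="same", activation="relu")(x)
charEmb = Dense(embDim, name="char_embedding")(GlobalMaxPooling1D()(x))

vecIn = Input(shape=(embDim,), name="word_vector_input")
hidden = Dense(64, activation="relu")(concatenate([charEmb, vecIn]))
out = Dense(1, name="cosine_pred")(hidden)

model = Model(inputs=[charIn, vecIn], outputs=out)
model.compile(optimizer="adam", loss="mse")

After pretraining, the branch from char_input through char_embedding can be lifted out and reused as the bottom layers of a deeper model.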

So without seeing more than just the words in the .vec file, we already have a decent character-level embedder:

# 'concedes' is in the test set, never seen by the model
checkScore("acknowledges", "concedes", letters, words, maxCharLen)
>0.898567

# we also have robustness to spelling errors (this misspelling was not in our data at all)
checkScore("acknowledges", "conceeds", letters, words, maxCharLen)
>0.873896

# and we can compare multi-word character inputs
checkScore("acknowledges", "concedes the point", letters, words, maxCharLen
>0.626232

# but unrelated words remain much lower
checkScore("acknowledges", "cat", letters, words, maxCharLen)
>0.0234105

checkScore("acknowledges", "dog", letters, words, maxCharLen)
>0.0299126

The model is far from perfect, but could provide a great jump-start to char-level models trained on non-web-scale datasets.

Experiments

Tested the performance of a pretrained model on the AG news dataset (4-class headline+snippet categorization), available here: https://github.com/mhjabreel/CharCNN/tree/master/data/ag_news_csv

Caveats:

  • only tried one simple architecture
  • no optimization of the pre-trained bottom layers
  • .vec file used in pretraining was all lower-cased

Results:

  • error rate without pretraining: 13%
  • error rate with pretraining: 17% (worse than without pretraining)
  • without pretraining, a much lower learning rate was needed for the model to converge

(figure: to_train_or_not_to_pretrain_that_is_the_question)

Hypotheses:

  • if the vocabulary of the news dataset is limited, then much of the pretrained information is 'wasted.'
  • without pretraining, the network can use as many layers as it needs to model words, whereas with a pretrained model you start with 4 layers devoted to just single-word embedding
  • possibility of dead neurons

Requirements

The repo requires a little more work to be conveniently repurposable, but it would work with a .vec file saved in data/ and referenced in the readWordVecs() function.
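
A minimal sketch of what readWordVecs() needs to do (the file name and implementation details here are placeholders, not the repo's exact code): parse the .vec file from data/ into a {word: vector} dict.

import numpy as np

def readWordVecs(path="data/wiki.en.vec"):
    # each .vec line is "word v1 v2 ... vd"; many files start with a "count dim" header
    vecs = {}
    with open(path, encoding="utf-8") as f:
        first = f.readline().split()
        if len(first) > 2:
            f.seek(0)  # no header line; re-read from the start
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vecs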

Tested with:

keras=2.0.8
numpy=1.13.3
fastText=0.1.0

(figure: PCA'd embedding of characters)

charlevelcnn's Issues

Try same methodology but with .vec trained on AG data

The initial experiment on the AG News dataset showed that pretraining using a .vec file from a Common Crawl fasttext run didn't improve things.

One hypothesis is that the large .vec from that run contained mostly words not used in the news dataset, and thus using model weights to capture that information is a waste.

To test that, I need to repeat the procedure using word vectors that come from the AG training dataset itself.

Deal with character length while reading in

Current code doesn't consider the maxCharLen input while reading in the vectors, so the word list ends up containing words longer than maxCharLen. At the moment the code just truncates them, but in the future it probably shouldn't train on fractions of words. Using the glove.6B.100d.txt file, token length drops off well before 20 characters, but there are outliers as long as 60+.
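
A hedged sketch of the proposed fix: filter (rather than truncate) while reading, so tokens longer than maxCharLen never make it into the word list.

def filterByLength(vecs, maxCharLen=20):
    # vecs: dict word -> vector; drop any token longer than maxCharLen characters
    kept = {w: v for w, v in vecs.items() if len(w) <= maxCharLen}
    print("dropped {} of {} tokens longer than {} chars".format(
        len(vecs) - len(kept), len(vecs), maxCharLen))
    return kept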

Add new pretraining methods

Currently this repository explores the idea of taking an existing set of word embeddings and pretraining some levels of a character-level CNN.

While that has the (perhaps large) benefit of using web-scale data without needing much processing, it has the drawback of creating features that exist only at a word level.

Assuming that we instead have a corpus on which we'd like to pretrain, I would also like to test the methodology of accepting (string_1, string_2) pairs and predicting whether string_1 comes before, immediately precedes, immediately succeeds, comes after, or is from a different document than string_2. It might also include fake data, noised data, etc.
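
A rough sketch of how such (string_1, string_2) pairs could be generated from a corpus (the label set and sampling scheme below are assumptions, not a settled design):

import random

LABELS = ["before", "immediately_preceding", "immediately_succeeding",
          "after", "different_document"]

def samplePair(docs):
    # docs: list of documents, each a list of strings (e.g. sentences)
    # returns (string_1, string_2, label), where label describes string_1's
    # position relative to string_2
    doc = random.choice(docs)
    i = random.randrange(len(doc))
    label = random.choice(LABELS)
    if label == "different_document":
        other = random.choice([d for d in docs if d is not doc])
        return doc[i], random.choice(other), label
    if label == "immediately_preceding" and i + 1 < len(doc):
        return doc[i], doc[i + 1], label
    if label == "immediately_succeeding" and i > 0:
        return doc[i], doc[i - 1], label
    if label == "before" and i + 1 < len(doc):
        return doc[i], doc[random.randrange(i + 1, len(doc))], label
    if label == "after" and i > 0:
        return doc[i], doc[random.randrange(0, i)], label
    return samplePair(docs)  # sampled position can't satisfy the label; resample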
