
eeap-examples


Introduction

This repository contains some examples applying the Embed, Encode, Attend, Predict (EEAP) recipe for building deep learning NLP pipelines, proposed by Matthew Honnibal, creator of the spaCy library.

I also gave a talk about this at PyData Seattle 2017.

Code is in Python. All models are built using the awesome Keras library. Supporting code uses NLTK and scikit-learn.

The examples use 4 custom attention layers, also provided here as a standalone Python module. The examples themselves are written as Jupyter notebooks.

A good complete implementation of attention can be found here.

Data

Please refer to data/README.md for instructions on how to download the data necessary to run these examples.

Examples

Document Classification Task

The document classification task builds a classification model for documents by treating each document as a sequence of sentences, and each sentence as a sequence of words. We start with a bag-of-words approach, computing each document embedding as the average of its sentence embeddings, and each sentence embedding as the average of its word embeddings. Next we build a hierarchical model: a sentence model builds sentence embeddings using a bidirectional LSTM, and is embedded within a document model that encodes the sentence model's outputs using another bidirectional LSTM. Finally we add attention layers at each level (sentence and document). Our final model is depicted in the figure below:

The models were run against the 20 Newsgroups dataset to classify each document into one of 20 classes. The chart below shows the results of these experiments. The value of interest is the test set accuracy, but training and validation set accuracies are shown as well for completeness.

As you can see, accuracy rises from about 71% for the bag-of-words model to about 82% for the hierarchical model that incorporates the Matrix Vector attention layers.
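The hierarchical model described above can be sketched in Keras roughly as follows. This is a minimal illustration, not the notebooks' exact code: the sequence lengths, vocabulary size, and layer sizes are hypothetical, and the attention layers are omitted for brevity.

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_SENTS, MAX_WORDS = 10, 30      # hypothetical document shape
VOCAB, NUM_CLASSES = 5000, 20      # hypothetical vocabulary; 20 classes

# sentence model: a sequence of word ids -> one sentence embedding
sent_in = layers.Input(shape=(MAX_WORDS,))
x = layers.Embedding(VOCAB, 100)(sent_in)
sent_vec = layers.Bidirectional(layers.LSTM(64))(x)
sent_model = models.Model(sent_in, sent_vec)

# document model: apply the shared sentence model to every sentence,
# then encode the resulting sentence embeddings with another BiLSTM
doc_in = layers.Input(shape=(MAX_SENTS, MAX_WORDS))
y = layers.TimeDistributed(sent_model)(doc_in)
y = layers.Bidirectional(layers.LSTM(64))(y)
doc_out = layers.Dense(NUM_CLASSES, activation="softmax")(y)
doc_model = models.Model(doc_in, doc_out)

preds = doc_model.predict(np.zeros((2, MAX_SENTS, MAX_WORDS)))
```

`TimeDistributed` applies the same sentence model (one set of weights) to every sentence before the document-level LSTM runs, which is what makes the model hierarchical rather than flat.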


Document Similarity Task

The Document Similarity task uses a nested model similar to the document classification task: the sentence model generates a sentence embedding from a sequence of word embeddings, and a document model embeds the sentence model to generate a document embedding. A pair of such networks is set up to produce document vectors from the two documents being compared, and the concatenated vector is fed into a fully connected network to predict a binary (similar / not similar) outcome.
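A minimal Keras sketch of the Siamese pairing might look like the following. For brevity each document is treated here as a flat word sequence rather than the nested sentence/document model described above, and all names and sizes are hypothetical.

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_WORDS, VOCAB = 50, 5000   # hypothetical sizes

# shared encoder: one set of weights serves both documents (Siamese)
enc_in = layers.Input(shape=(MAX_WORDS,))
h = layers.Embedding(VOCAB, 100)(enc_in)
h = layers.Bidirectional(layers.LSTM(64))(h)
encoder = models.Model(enc_in, h)

# apply the same encoder to both inputs, concatenate, and classify
doc_a = layers.Input(shape=(MAX_WORDS,))
doc_b = layers.Input(shape=(MAX_WORDS,))
merged = layers.Concatenate()([encoder(doc_a), encoder(doc_b)])
x = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(x)   # similar / not similar
model = models.Model([doc_a, doc_b], out)

probs = model.predict([np.zeros((3, MAX_WORDS)), np.zeros((3, MAX_WORDS))])
```

Because `encoder` is a single shared model, identical input documents are guaranteed to map to identical vectors, which matters for the symmetry issues discussed later in this page.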

The dataset for this task was constructed from the 20 Newsgroups dataset. TF-IDF vectors were generated for all 10,000 test set documents, and the similarities between all pairs of these vectors were computed. The top 5 percentile of pairs was then selected as the positive set and the bottom 5 percentile as the negative set. Even so, the two sets are not strongly differentiated: similarity values differ by only about 0.2 between them. A 1% sample was then drawn from each set to form the training set for this network.
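The pair-construction procedure above can be sketched with scikit-learn. The documents below are toy stand-ins; in the actual experiment the TF-IDF vectors come from the 10,000 test set documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy stand-in corpus (the real experiment uses 10,000 documents)
docs = ["the cat sat on the mat",
        "a cat sat down on a mat",
        "stock markets fell sharply",
        "markets fell again on monday"]

tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)

# similarities of all distinct pairs (upper triangle, excluding diagonal)
iu = np.triu_indices(len(docs), k=1)
pair_sims = sims[iu]

# top / bottom 5th percentile become positive / negative pairs
pos_thresh = np.percentile(pair_sims, 95)
neg_thresh = np.percentile(pair_sims, 5)
pos_pairs = [(i, j) for i, j, s in zip(*iu, pair_sims) if s >= pos_thresh]
neg_pairs = [(i, j) for i, j, s in zip(*iu, pair_sims) if s <= neg_thresh]
```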

We built two models, one without attention at either the sentence or document layer, and one with attention on the document layer. Results are shown below:


Sentence Similarity Task

The Sentence Similarity task uses the Semantic Textual Similarity (STS) dataset from 2012. The objective is to predict, for a pair of sentences, a similarity score on a continuous scale from 0 to 5, so we build a regression network as shown below. Our loss function is mean squared error and our optimizer is RMSProp. Evaluation is done by computing the RMSE between the labeled similarities and the network's predictions on the test set. In addition, we also compute the Pearson and Spearman (rank) correlations between the labels and predictions of the test set.
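The evaluation described above can be sketched with NumPy and SciPy. The similarity scores below are toy values, not data from the experiment.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

y_true = np.array([0.0, 1.5, 2.5, 4.0, 5.0])   # toy gold similarity scores
y_pred = np.array([0.4, 1.2, 2.9, 3.6, 4.8])   # toy model predictions

# root mean squared error between labels and predictions
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# linear and rank correlation between labels and predictions
pearson, _ = pearsonr(y_true, y_pred)
spearman, _ = spearmanr(y_true, y_pred)
```

Reporting rank correlation alongside RMSE is useful here because a model can order sentence pairs correctly (high Spearman) while still being miscalibrated on the 0-5 scale (high RMSE).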

Our baseline is a hierarchical network that computes an encoding for each sentence in the pair; these encodings, without attention, are used to generate the prediction. We compare the baseline to the Matrix Matrix dot attention proposed by Parikh et al., where the inputs are scaled to [-1, 1] (MM-dot(s)), and then to an unscaled version of it (MM-dot). Finally, we introduce two new attention implementations based on descriptions on the TensorFlow NMT page: an additive attention (MM-add) proposed by Bahdanau et al., and a multiplicative attention (MM-mult) proposed by Luong et al. Both operate on the encoder outputs without scaling via tanh. Results are shown below. As can be seen, MM-add and MM-mult result in lower RMSE and generally higher Pearson and Spearman correlations than the baseline.
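The difference between the additive (Bahdanau-style) and multiplicative (Luong-style) scoring functions can be sketched in plain NumPy. The dimensions and weight matrices here are random placeholders; in the real models they are learned.

```python
import numpy as np

rng = np.random.default_rng(42)
T, d = 6, 8                       # hypothetical: 6 encoder steps, hidden size 8
enc = rng.normal(size=(T, d))     # encoder outputs for one sentence
query = rng.normal(size=(d,))     # encoding of the other sentence in the pair

# multiplicative (Luong-style): score_t = enc_t . W . query
W = rng.normal(size=(d, d))
mult_scores = enc @ W @ query

# additive (Bahdanau-style): score_t = v . tanh(W1 enc_t + W2 query)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=(d,))
add_scores = np.tanh(enc @ W1.T + query @ W2.T) @ v

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# normalize scores into attention weights and pool a context vector
alpha = softmax(add_scores)
context = alpha @ enc
```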


Issues

ng-vocab.tsv file not found

Hi,

I am running this notebook: https://github.com/sujitpal/eeap-examples/blob/master/src/04c-ng-clf-eeap.ipynb

When I run the "Load Vocabulary" cell, it gives this error:

```
<ipython-input-4-e0b7b98e6728> in <module>()
      1 word2id = {"PAD": 0, "UNK": 1}
----> 2 fvocab = open(VOCAB_FILE, "rb")
      3 for i, line in enumerate(fvocab):
      4     word, count = line.strip().split("\t")
      5     if int(count) <= MIN_OCCURS:

FileNotFoundError: [Errno 2] No such file or directory: '../data/ng-vocab.tsv'
```

ng-sim-datagen.py throwing error.

```
Python 3: TypeError: a bytes-like object is required, not 'str'
Python 2: IOError: [Errno socket error] [SSL: UNKNOWN_PROTOCOL] unknown protocol (_ssl.c:590)
```

Can someone put the dataset on drive if they are able to run this script?

Some suggestions on attention and document similarity

Hey,

First, thanks for the kind words in various places :). I came across your posts, which led me here.

I also spent quite some time working on similarity models. I think they're surprisingly difficult to implement correctly in most deep learning toolkits. There are two problems:

  1. It's pretty hard to maintain the symmetry. We'd like to guarantee that the Siamese network always maps a pair of identical sentences into identical vectors. Intuitively, identical inputs should give 1.0 similarity, right? But lots of things can go wrong to prevent this. Here are some of the problems I've had in different implementations:

i. Dropout needs to be synchronised across the two 'halves' of the network. If we redraw the weights for the two sentences, we'll end up with different vectors for the same input. This makes the model converge very slowly.

ii. Batch normalization. I don't remember exact results, but I do remember I ended up not wanting to use batch norm in Siamese networks, because I found it too difficult to reason about.

iii. In one of my models, I assigned random vectors to OOV words, without ensuring the same word always mapped to the same vector.

  2. Most libraries make you pad your inputs with zeros. Most attention layers then do some sort of pooling operation. If you're averaging, you need to normalize by only the input tokens, and exclude the padding tokens. We also want to make sure we're not sending gradients through the padding tokens. I could never get this correct in Keras. You can find my effort to replicate Parikh et al.'s decomposable attention model here: https://github.com/explosion/spaCy/tree/master/examples/keras_parikh_entailment . Other people have worked on the code since, but as far as I know it's still not correct.
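The masked-averaging point above can be sketched in NumPy: a naive mean divides by the padded length, while a masked mean zeroes out the pad positions and divides by the true token count. The arrays here are toy values.

```python
import numpy as np

# toy batch: 2 sequences padded to length 5, embedding dim 3; id 0 = padding
token_ids = np.array([[4, 7, 9, 0, 0],
                      [3, 2, 0, 0, 0]])
# random stand-in for looked-up embeddings (pad rows hold junk, as they
# would if the pad embedding were nonzero)
embeds = np.random.default_rng(0).normal(size=(2, 5, 3))

# naive average: divides by the padded length (5) -- wrong
naive = embeds.mean(axis=1)

# masked average: zero out pad positions, divide by the true length
mask = (token_ids != 0).astype(float)               # (2, 5)
summed = (embeds * mask[:, :, None]).sum(axis=1)    # (2, 3)
masked_avg = summed / mask.sum(axis=1, keepdims=True)
```

Because the mask also zeroes the pad contributions before the sum, no gradient would flow through pad positions in a differentiable version of this pooling.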

Finally, an extra tip that should help your similarity models :). I notice you're using pre-trained embeddings, and are using a fixed-size vocabulary. This means that all words outside your vocabulary will be mapped to the same representation. If you think about it, this is pretty bad: if our input sentences match on some rare word, that's a great feature! I think the best solution is to augment the static vectors with a learned component. Here's an example network that does this: https://github.com/explosion/thinc/blob/master/examples/text-pair/glove_mwe_multipool_siamese.py#L162

The network in that example uses a trickier "Embed" step: the sum of the static vectors and learned vectors from my HashEmbed class, which uses the "hashing trick" that has been popular in sparse linear models. The insight is similar to Bloom filters, so a recent paper has called these "Bloom embeddings". Basically you just mod the key into a fixed-size table, and compute multiple conditionally independent keys per word. This allows the table to map a very large number of vocabulary items to (mostly) distinct representations, with relatively few rows.
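The hashing trick described here can be sketched as follows. The table size, number of keys, and hash function are illustrative choices, not the ones used in Thinc's HashEmbed.

```python
import hashlib
import numpy as np

ROWS, DIM, K = 1024, 50, 3   # small table; K conditionally independent keys

rng = np.random.default_rng(0)
table = rng.normal(size=(ROWS, DIM))   # in a real model these rows are learned

def bucket(seed, word):
    # stable hash: mod the (seed, word) digest into the fixed-size table
    digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).hexdigest()
    return int(digest, 16) % ROWS

def hash_embed(word):
    # sum K rows chosen by K different seeds; a full collision requires all
    # K buckets to coincide, so most words get distinct representations
    return table[[bucket(seed, word) for seed in range(K)]].sum(axis=0)

v1 = hash_embed("chromodynamics")   # even rare / OOV words get a vector
v2 = hash_embed("chromodynamics")
v3 = hash_embed("the")
```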

This hash embedding trick isn't the only solution to the OOV problem. Using a character LSTM to create the OOV word features would probably work well too --- but much more slowly.

Missing data file creating problem with result verification

The ng-vocab.tsv file is missing. I checked this issue. I added this code to generate the file and verify the result; it gave me an accuracy score of about 55%. Then, instead of using this, I used this dataset, and the result improved to 59%. I think the dataset is playing an important role here, so I would request that you post the original file, or at least the process of obtaining or generating it, so that we can run it and get the result you mentioned (71% accuracy). I ran this code.
