Coder Social home page Coder Social logo

document2vec's Introduction

Document2Vec

Finding document vectors from pre-trained word2vec word vectors

Build Status MIT License

How to install

Simply install from the git repo like so:

pip install -e git+git://github.com/cemoody/Document2Vec.git#egg=Package
# on a shared machine without system-python access add --user

How to use

The word2vec file must be a trained gensim Word2Vec file and cannot be Mikolov's pre-trained vectors. This is because training a new document vector requires the syn1 layer which the C version of word2vec throws away.

Initialize Document2Vec with pre-trained word vectors from a pre-existing word2vec training run like so:

from document2vec.document2vec import Document2Vec
from document2vec.corpora import SeriesCorpus
import pandas as pd
# This must be a gensim Word2Vec or Doc2Vec pickle
d2v = Document2Vec("/home/moody/projects/Parachute/data/data-all-02.py2")
sentences = pd.Series(['i love jackets', 'blue is my favorite color'])
corpus = SeriesCorpus(sentences)
doc_vectors = d2v.transform(corpus)

And then semantic similarities can be evaluated directly:

from scipy.spatial.distance import cosine
# vector for 'i love jackets'
v0 = doc_vectors[0, :] 
# vector the word 'jackets'
v1 = d2v['jackets']
similarity = 1 - cosine(v0, v1)
print(similarity) # 0.320
# Of course, the similarity with a word that is literally
# in the sentence is going to be quite high
# What if we try something similar, like coats?
v2 = d2v['coats']
similarity = 1 - cosine(v0, v2)
print(similarity) # 0.265
# And then if we try a very something very dissimilar from the sentece
# like the city of New York we get low similarity:
v3 = d2v['new_york']
similarity = 1 - cosine(v0, v3)
print(similarity) # 0.02

Monitoring training

It can be useful to monitor the training over many iteration to make sure doc2vec is at (least locally) doing what it should be doing:

from scipy.spatial.distance import cosine
import numpy as np
def monitor(model):
    print model.alpha,
    for word in ['jackets', 'jacket', 'coats', 'dog']:
        print word,': ', 1.0 - cosine(model['SENT_0'], model[word]),
    print " "
d2v.monitor = monitor
doc_vectors = d2v.transform(corpus)

Will print something similar to the following:

0.25000 jackets :  0.347975713494 jacket :  0.150385576332 coats : 0.305263268479 dog :  0.121432161320
0.20002 jackets :  0.301431248517 jacket :  0.113824911821 coats : 0.272647329817 dog :  0.125565730551
0.15004 jackets :  0.296385793196 jacket :  0.108801409463 coats : 0.267922727947 dog :  0.126922837909
0.10006 jackets :  0.293973052240 jacket :  0.106190931536 coats : 0.265730524733 dog :  0.126504370045
0.05008 jackets :  0.293425048701 jacket :  0.105495592420 coats : 0.264931351959 dog :  0.125495564005

document2vec's People

Contributors

cemoody avatar

Watchers

Yuanyuan Zhang avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.