
wmd's Introduction

Word Mover's Distance (WMD) from Matthew J Kusner

Source: http://mkusner.github.io/

[Figure 1]

Here is version 1.0 of the Python and Matlab code for the Word Mover's Distance, from the paper "From Word Embeddings to Document Distances" (ICML 2015)

Prerequisites

  • Python 2.7
  • packages: gensim, numpy, scipy

If you download Anaconda Python 2.7, it includes everything.

You'll also need to download the word2vec embeddings trained on the Google News corpus (described briefly under 'Pre-trained word and phrase vectors' on the word2vec project page)

Building

You'll need to build:

  • python-emd-master/: go into the directory and type make (note that swig is required; see the installation issues below)
  • If you want to use Matlab, you'll also need to build emd/: open Matlab, go to the directory, and type build_emd

Getting started

Here's some example code with all_twitter_by_line.txt:

python get_word_vectors.py all_twitter_by_line.txt twitter_vec.pk twitter_vec.mat 
python wmd.py twitter_vec.pk twitter_wmd_d.pk 

Matlab:

>> wmd_mat (changing load_file to 'twitter_vec.mat' and save_file to whatever you like) 

More detailed explanation

get_word_vectors.py: This extracts the word vectors and bag-of-words (BOW) vectors, and is the script you will run first. You call it like this:

python get_word_vectors.py input_file.txt vectors.pk vectors.mat 

The last argument saves a .mat file (currently you technically have to supply it, but I will make this optional soon). The first argument is the text document you want to process. The script assumes the input text file is in the following format:

doc1_label_ID \t word1 word2 word3 word4 
doc2_label_ID \t word1 word2 word3 word4 
... 

Specifically, each document is on one line. The first token on the line (doc1_label_ID) is the label of the document; for example, if you have a set of tweets labeled by their sentiment (e.g. positive, negative, neutral), this describes the label. Look at the file all_twitter_by_line.txt for an example. The label is followed by a tab character (\t), and the words of the document are then separated by spaces (multiple spaces are fine). The words can have punctuation and whatnot; the Python script strips it.

The second argument is the name of the pickle file in which the word vectors are saved; the third is a .mat file with the same results (used later by the Matlab code, if you like).

After you run this script you'll run wmd.py, which computes the distance matrix between all documents in the saved file above. You call it like this:

python wmd.py vectors.pk dist_matrix.pk 

where vectors.pk was generated by the first script.

Use wmd_mat.m if you'd like to use Matlab instead of wmd.py. You will need to change the variable load_file to vectors.mat and save_file to whatever name you like.

KNN

In the paper, we used cross-validation to set k for each dataset, trying the values [1,3,5,7,9,11,13,15,17,19]. We also implemented a KNN function that, given a k (or a list of k's), would only classify a point if a majority of the k nearest neighbors voted for the same class. If not, we would reduce k by 2 and check whether this smaller k produced a majority vote for some class. This continues until either a majority is reached or k=1 (in which case we just use the nearest neighbor's vote). This function is in the file knn_fall_back.m; a Python sketch of the rule follows.

Paper Datasets

Here is a Dropbox link to the datasets used in the paper: https://www.dropbox.com/sh/nf532hddgdt68ix/AABGLUiPRyXv6UL2YAcHmAFqa?dl=0

They're all Matlab .mat files and have the following variables (note the similarity to the demo dataset; a Python loading sketch follows the list):

for bbcsport, twitter, recipe, classic, amazon

  • X [1,n+ne]: each cell corresponds to a document and is a [d,u] matrix, where d is the dimensionality of the word embedding and u is the number of unique words in that document (n is the number of training points, ne the number of test points). Each column is the word2vec vector for a particular word.
  • Y [1,n+ne]: the label of each document
  • BOW_X [1,n+ne]: each cell in the cell array is a vector corresponding to a document. The size of the vector is the number of unique words in the document, and each entry is how often each unique word occurs.
  • words [1,n+ne]: each cell corresponds to a document and is itself a {1,u} cell array, where each entry is the actual word string for the corresponding unique word
  • TR [5,n]: each row corresponds to a random split of the training set, each entry is the index with respect to the full dataset. So for example, to get the BOW of the training set for the third split do: BOW_xtr = BOW_X(TR(3,:))
  • TE [5,ne]: same as TR except for the test set

for ohsumed, reuters (r8), 20news (20ng2_500)

The only difference from the datasets above is that, because there are pre-defined train/test splits, the variables BOW_xtr, BOW_xte, xtr, xte, ytr, and yte are already provided.

Raw Datasets

Here's a folder with all the raw data: https://www.dropbox.com/sh/f44z3nt3i5279yt/AACHBs4qiISGPdBjB_aEgDVMa?dl=0 (it also has some extra datasets we ended up not using)

The main subtleties are:

  • We do not have raw data for the recipe dataset unfortunately, just BOW
  • Reuters is here:
  • We used stop_words_115.txt to remove stop words in all datasets except twitter (which has so few words per document that removing stop words hurt training accuracy)
  • For ohsumed we used the first 10 classes
  • For 20news we additionally removed words that appear fewer than 5 times across all documents, and limited each document to its 500 most common words (i.e., we removed the 501st, 502nd, 503rd, ... most common words in each document, if they existed); a sketch of this step follows the list
  • We used the 5 train/test splits for bbcsport, twitter, classic, amazon as defined in TR and TE in the BOW data .mat files described above
  • For ohsumed, reuters, 20news (20ng) the train/test splits are already defined so we didn't use 5 different splits.

Feedback & Contact

Let me know if you have any questions at mkusner AT wustl DOT edu. Please cite using the following BibTeX entry (instead of Google Scholar):

@inproceedings{kusner2015doc, 
   title={From Word Embeddings To Document Distances}, 
   author={Kusner, M. J. and Sun, Y. and Kolkin, N. I. and Weinberger, K. Q.}, 
   booktitle={ICML}, 
   year={2015}, 
} 

wmd's People

Contributors

mkusner, renaud, taineleau


wmd's Issues

memory leak

in emd.i


%typemap(freearg) signature_t * {
    /* Free the temporary signature built when converting the Python
       argument: drop the reference held for each feature, then free
       the Features and Weights arrays and the struct itself. */
    if ($1 != NULL) {
        PyObject **features_array = (PyObject **) $1->Features;
        int weights_count = (int) $1->n;
        int i;
        for (i = 0; i < weights_count; ++i) {
            Py_XDECREF(features_array[i]);
        }
        free((PyObject **) $1->Features);
        free((float *) $1->Weights);
        free((signature_t *) $1);
    }
}

I want to use WMD on Chinese data; there are some errors, please help!

root@user-virtual-machine:/home/user/WMD# python wmd.py asd.pk asdwmd.pk
[pool :] <multiprocessing.pool.Pool object at 0x7f327f1cc150>
0 out of 3
1 out of 3
emd: Signature size is limited to 100
2 out of 3
emd: Signature size is limited to 100


The stop word file (stop.txt) and the training data are all in Chinese. How can I solve this problem?

install error=> ld: unknown option: -shared

macOS High Sierra 10.13, install error:

Building object file 'emd.o'.
-n
cc -o emd.o -c emd.c -fPIC -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7
In file included from emd.c:20:
./emd.h:22:9: warning: 'INFINITY' macro redefined [-Wmacro-redefined]
#define INFINITY 1e20
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk/usr/include/math.h:68:9: note: previous definition is
here
#define INFINITY HUGE_VALF
^
1 warning generated.

Generating C interface
swig -python emd.i

Building object file 'emd_wrap.o'.
-n
cc -o emd_wrap.o -c emd_wrap.c -fPIC -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7
In file included from emd_wrap.c:3020:
./emd.h:22:9: warning: 'INFINITY' macro redefined [-Wmacro-redefined]
#define INFINITY 1e20
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk/usr/include/math.h:68:9: note: previous definition is
here
#define INFINITY HUGE_VALF
^
1 warning generated.

Linking wrapper library '_emd.so'.
-n
ld -shared -o _emd.so emd.o emd_wrap.o
ld: unknown option: -shared
make: *** [_emd.so] Error 1
rm emd_wrap.o emd.o emd_wrap.c

ImportError: No module named _emd

Why do I keep getting the "ImportError: No module named _emd" error from emd.py? I use Python 2.7.

May I ask what '_emd' is? I assume it's not the same as pyemd?

Thanks in advance for your time!

Current wmd implementation does not match GenSim

This is not really an issue, but a question about compatibility with the GenSim library.

Using the first two texts of the Twitter corpus, i.e.

now all apple has to do is get swype on the iphone and it will be crack iphone that is

and

apple will be adding more carrier support to the iphone 4s just announced,

I get a distance of 0.99 using the GenSim WMD implementation and 2.6625 using this implementation (the original, from the paper's author).

At first sight, I thought it was related to your stop words list. However, debugging your code I see that the first and second texts become:

apple swype iphone iphone crack
apple adding carrier support iphone 4s announced

However, running with the words above, I still get a completely different result: using GenSim and filtering with your stop words (as above), I get a WMD of 0.96.

Is this compatibility discussed anywhere?
Could anybody please confirm whether the different implementations return the same numbers?

This strongly affects the usefulness of the GenSim implementation for finding semantically close texts.

no module named multiarray

I have installed numpy, and when I type "import numpy.core.multiarray" in the shell, it works. I don't know why this problem appears:

luoyj@luoyj-Lenovo-M490:~/wmd-master$ python wmd.py twitter_vec.pk twitter_wmd_d.pk
Traceback (most recent call last):
  File "wmd.py", line 11, in <module>
    [X, BOW_X, y, C, words] = pickle.load(f)
  File "/usr/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named multiarray

How to work with result (.pk) file

Hi,

first of all thank you for the great work and nice implementation!

The tool works fine for me and I will use it for document comparison in the social media context. Can you please give me some advice on how to work with the resulting "...wmd_d.pk" file? At first I thought the result would be a text file with a readable matrix in it, but now I think I need some additional software?

Thank you very much!

technology independent output file

It would be very nice if the output distance matrix file were independent of Python-specific formats, so that it could be used from other languages as well.

Parallel processing

Wow, great paper! Thank you for making the code OSS.

The documentation says that the Python wrapper is not suitable for parallel execution:

The wrapper is not suited for concurrent execution. It uses a global variable for the distance callback function, so calling emd from concurrent threads will result in undefined behavior.

However, the function get_wmd calls emd concurrently. Can you please explain?

Makefile:51: recipe for target 'emd_wrap.c' failed

# git clone https://github.com/mkusner/wmd.git
Cloning into 'wmd'...
remote: Counting objects: 41, done.
remote: Total 41 (delta 0), reused 1 (delta 0), pack-reused 40
Unpacking objects: 100% (41/41), done.
Checking connectivity... done.
# cd wmd/
# pip install gensim numpy scipy
# cd python-emd-master/
# make
>>> Building object file 'emd.o'.
    cc -o emd.o -c emd.c -fPIC -I/usr/include/python2.7 -I/usr/include/x86_64-linux-gnu/python2.7 
In file included from emd.c:20:0:
emd.h:22:0: warning: "INFINITY" redefined
 #define INFINITY       1e20
 ^
In file included from /usr/include/math.h:41:0,
                 from emd.c:18:
/usr/include/x86_64-linux-gnu/bits/inf.h:26:0: note: this is the location of the previous definition
 # define INFINITY (__builtin_inff())
 ^
In file included from emd.c:20:0:
emd.h:32:20: warning: extra tokens at end of #include directive
 #include "Python.h";
                    ^

>>> Generating C interface
swig -python emd.i
make: swig: Command not found
Makefile:51: recipe for target 'emd_wrap.c' failed
make: *** [emd_wrap.c] Error 127
rm emd.o

installation issues

swig is required, but not mentioned.

in emd.h, the #include of Python.h has a trailing ";" that should be removed

the meaning of rows and columns in the distance matrix (WMD_D)

Dear sir,
I'm sorry to trouble you. After I run wmd.py, I get a distance matrix between all documents, but I am puzzled about its rows and columns:
Does every row of the distance matrix represent a document, i.e. a document vector?
Does every column represent the same word in each document?

Thank you so much!

installation issues solved

I had installation issues similar to the before-mentioned ones.

Running

sudo apt-get install python-dev   # for Python 2.x installs

or

sudo apt-get install python3-dev  # for Python 3.x installs

and removing the ";" from #include "Python.h"; in emd.h

solved the problems.

Obtaining flow information through the Python interface

Hello,

Thank you for the great work and nice implementation. It really helps me!
I know that I can obtain distances through emd((X[i], BOW_X[i]), (X[j], BOW_X[j]), distance), but how can I get the flow information (the transportation matrix)? I have no idea how to get it through the Python interface.

Zhe Zhao

Makefile:39: recipe for target '_emd.so' failed

Interesting topic and paper. I tried to compile the Makefile on Ubuntu 15.04 using Python 2.7, with all of the required libraries, but there is an error that I could not solve. Here is the output of running make:

[screenshot: wmd error page]

I would be thankful if you can help me to solve this. Thanks.

Deadlock in Multiprocessing

Thank you for the implementation of your paper.

First, I tried your code and data (all_twitter_by_line.txt). It worked well.

Second, I tried the 20newsgroup data used in your paper. Then I got the error

"emd: Maximum number of iterations has been reached 1013"

because of the MAX_SIG_SIZE 100 limit, so I changed it to exceed the maximum number of unique keywords in the 20newsgroup dataset (= 5284).

Now the program blocks after some steps. I think it's because of the multiprocessing. I checked CPU utilization, and it was at 99% in a multi-CPU, multi-core environment.

Is there any solution for this?

problem with WCD

@mkusner I read your paper and want to use your WCD+RWMD method to calculate document similarity in my document recommendation project. I found the Matlab code for RWMD, but didn't find the code for WCD. Is it the file named distance.m?
