mkusner / wmd Goto Github PK

Word Mover's Distance from Matthew J Kusner's paper "From Word Embeddings to Document Distances"

MATLAB 6.73% C 78.40% Python 11.90% Makefile 2.97%

wmd's Introduction

Word Mover's Distance (WMD) from Matthew J Kusner

Here is version 1.0 of Python and Matlab code for the Word Mover's Distance from the paper "From Word Embeddings to Document Distances"

Prerequisites

Python 2.7
packages:
gensim
numpy
scipy

If you download Anaconda Python 2.7 it has everything.

You'll also need to download word2vec embedding trained on the Google News corpus (described briefly here in 'Pre-trained word and phrase vectors')

Building

You'll need to build:

python-emd-master/: just go into the directory and type make
If you want to use matlab then you'll have to build emd/ . Just open matlab, go to the directory, and type build_emd

Getting started

Here's some example code with all_twitter_by_line.txt:

python get_word_vectors.py all_twitter_by_line.txt twitter_vec.pk twitter_vec.mat 
python wmd.py twitter_vec.pk twitter_wmd_d.pk

Matlab:

>> wmd_mat (changing load_file to 'twitter_vec.mat' and save_file to whatever you like)

More detailed explanation

get_word_vectors.py: This extracts the word vectors and BOW vectors. This is the script you will run first. You call it like this:

python get_word_vectors.py input_file.txt vectors.pk vectors.mat

the last argument saves a .mat file (I think you technically have to now, but I will make this optional soon). The first argument is the text document you want to process, it assumes the input text file is in the following format:

doc1_label_ID \t word1 word2 word3 word4 
doc2_label_ID \t word1 word2 word3 word4 
...

Specifically, each document is on one line. The first thing on the line (doc1_label_ID) signifies the label of the document. For example if you have a set of tweets labeled by their sentiment (e.g. positive, negative, neutral), then this describes the label. Look at the file all_twitter_by_line.txt for an example. This is followed by a tab character: \t. Then each word of the document is separated by a space (it can be multiple spaces, it doesn't matter). The words can have punctuation and whatnot, this gets stripped by the python script.

The second argument is the name of the pickle file that saves the word vectors, and the third is a mat file with the same results (used for matlab code later if you like).

After you run this code then you'll run wmd.py. This computes the distance matrix between all documents in the saved file above. You call it like this:

python wmd.py vectors.pk dist_matrix.pk

where vectors.pk was generated by the first script.

Use wmd_mat.m if you'd like to use Matlab instead of wmd.py. You will need to change the variable load_file to vectors.mat and save_file to whatever name you like.

KNN

In the paper, we used cross-validation to set k for each dataset and tried these k's [1,3,5,7,9,11,13,15,17,19]. We also implemented a KNN function that given a k (or a list of k's) would only classify a point if the majority of the k nearest neighbors voted on the same class. If not, then we would reduce k (by 2) and consider if for this smaller k there was a majority vote for a class. This would continue this way until either a majority was reached or k=1 (in which case we just use the nearest neighbors vote). This function is in the file knn_fall_back.m

Paper Datasets

Here is a Dropbox link to the datasets used in the paper: https://www.dropbox.com/sh/nf532hddgdt68ix/AABGLUiPRyXv6UL2YAcHmAFqa?dl=0

They're all matlab .mat files and have the following variables (note the similarity to the demo dataset):

for bbcsport, twitter, recipe, classic, amazon

X [1,n+ne]: each cell corresponds to a document and is a [d,u] matrix where d is the dimensionality of the word embedding, u is the number of unique words in that document, n is the number of training points, and ne is the number of test points. Each column is the word2vec vector for a particular word.
Y [1,n+ne]: the label of each document
BOW_X [1,n+ne]: each cell in the cell array is a vector corresponding to a document. The size of the vector is the number of unique words in the document, and each entry is how often each unique word occurs.
words [1,n+ne]: each cell corresponds to a document and is itself a {1,u} cell where each entry is the actual word corresponding to each unique word
TR [5,n]: each row corresponds to a random split of the training set, each entry is the index with respect to the full dataset. So for example, to get the BOW of the training set for the third split do: BOW_xtr = BOW_X(TR(3,:))
TE [5,ne]: same as TR except for the test set

for ohsumed, reuters (r8), 20news (20ng2_500)

The only difference with the above datasets is that because there are pre-defined train-test splits, there are already variables BOW_xtr, BOW_xte, xtr, xte, ytr, yte.

Raw Datasets

Here's a folder with all the raw data: https://www.dropbox.com/sh/f44z3nt3i5279yt/AACHBs4qiISGPdBjB_aEgDVMa?dl=0 (it also has some extra datasets we ended up not using)

The main subtleties are:

We do not have raw data for the recipe dataset unfortunately, just BOW
Reuters is here:
- train: https://www.cs.umb.edu/~smimarog/textmining/datasets/r8-train-all-terms.txt
- test: https://www.cs.umb.edu/~smimarog/textmining/datasets/r8-test-all-terms.txt
We used stop_words_115.txt to remove stop words in all datasets except twitter (which has so few words per document that removing stop words hurt training accuracy)
For ohsumed we used the first 10 classes
For 20 news we additionally removed words that appear less than 5 times across all documents, and limited each document to the 500 most common words (i.e., we removed the 501st, 502st, 503rd... most common words in each document, if they existed)
We used the 5 train/test splits for bbcsport, twitter, classic, amazon as defined in TR and TE in the BOW data .mat files described above
For ohsumed, reuters, 20news (20ng) the train/test splits are already defined so we didn't use 5 different splits.

Feedback & Contact

Let me know if you have any questions at mkusner AT wustl DOT edu. Please cite using the following BibTeX entry (instead of Google Scholar):

@inproceedings{kusner2015doc, 
   title={From Word Embeddings To Document Distances}, 
   author={Kusner, M. J. and Sun, Y. and Kolkin, N. I. and Weinberger, K. Q.}, 
   booktitle={ICML}, 
   year={2015}, 
}

wmd's People

Contributors

Stargazers

Watchers

Forkers

mcelvg peimanb dav009 brightbird qingniufly ml-lab liyingrenjie nlp-zsy renaud peratham adrianhust colinferguson pan001 chenglongchen jfhrecoba lizsz darlwen lu839684437 fangzheng354 mmallad wencanluo colourbrain fujihara-laboro meangrape duankai jamesxia4 hitluobin shuaiwanggit compass-wang sachuin23 hitflame 1206lyp wdan zshwuhan theolivenbaum gitforhf dellabear harpribot vyraun nooralahzadeh angelaying wangzhen-nlp columbiadvmm inimah coloratto hi-ylf zhangjiulong asnjudy elitonperin ajaytalati colinsongf rock999 sakshi11 lilanjuntest yuancz fancyerii sharon-yuan xinghudamowang ahmadhany dotrado qgogithub sophiealex taineleau-zz vikingmew gaohuang nyrt changfengfeng novellll v-shinc robets2020 mindis sanjeeku imdark yuanchangtian danieljunior laisun yutingliu gjjg1331jggj amano-ginji frankfqchen xiaojieliu7 ansvver hujunxianligong luyangcat pulin-zou karanr-hexaware aenal-abie ihaeyong posedge yangjiezhou haozj cyzhangathit kfzyqin ghiblifield afcarl zhouyonglong dancing-with-coffee pankajmehar youxuanxue liclone

wmd's Issues

emd: Signature size exceeds 100

how to change the signature size in emd module?

Does this code run in python 3?

Does this code run in python 3.x? Thanks!

memory leak

in emd.i


%typemap(freearg) signature_t * {
    if ($1 != NULL) {
        PyObject **features_array = (PyObject **) $1->Features;
        int weights_count = (int)$1->n;
        int i = 0;
        for (i = 0; i < weights_count; ++i) {
            Py_XDECREF(features_array[i]);
        }
        free((PyObject **) $1->Features);
        free((float *) $1->Weights);
        free((signature_t *) $1);
    }
}

i want to use WMD to train chinese data,there's some errors ,plz help me!

root@user-virtual-machine:/home/user/WMD# python wmd.py asd.pk asdwmd.pk
[pool :] <multiprocessing.pool.Pool object at 0x7f327f1cc150>
0 out of 3
1 out of 3
emd: Signature size is limited to 100
2 out of 3
emd: Signature size is limited to 100

stop.txt and training data all use in chinese. how can i solve this problem???

Error: emd: Signature size is limited to 100

During the wmd distance matrix computation, it occurs "emd: Signature size is limited to 100" several times. What should be done?

install error=> ld: unknown option: -shared

mac os high sierra 10.13
install error

Building object file 'emd.o'.
-n
cc -o emd.o -c emd.c -fPIC -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7
In file included from emd.c:20:
./emd.h:22:9: warning: 'INFINITY' macro redefined [-Wmacro-redefined]
#define INFINITY 1e20
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk/usr/include/math.h:68:9: note: previous definition is
here
#define INFINITY HUGE_VALF
^
1 warning generated.

Generating C interface
swig -python emd.i

Building object file 'emd_wrap.o'.
-n
cc -o emd_wrap.o -c emd_wrap.c -fPIC -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -I/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/include/python2.7
In file included from emd_wrap.c:3020:
./emd.h:22:9: warning: 'INFINITY' macro redefined [-Wmacro-redefined]
#define INFINITY 1e20
^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk/usr/include/math.h:68:9: note: previous definition is
here
#define INFINITY HUGE_VALF
^
1 warning generated.

Linking wrapper library '_emd.so'.
-n
ld -shared -o _emd.so emd.o emd_wrap.o
ld: unknown option: -shared
make: *** [_emd.so] Error 1
rm emd_wrap.o emd.o emd_wrap.c

Quality performance compared to Sørensen–Dice coefficient

Has anyone (at least partially) evaluated the quality of outputs compared to Sørensen–Dice coefficient?

Please share your findings (ideally from production 😉), thanks!

ImportError: No module named _emd

Why is that I keep getting "ImportError: No module named _emd" error from emd.py? I use python 2.7.

May I ask what is '_emd' ? I assume it's not the same as pyemd?

Thanks in advance for your time!

Current wmd implementation does not match GenSim

It is not really an issue, but compatibility with GenSim library.

Using the first twitter corpus texts, i.e.

now all apple has to do is get swype on the iphone and it will be crack iphone that is

and

apple will be adding more carrier support to the iphone 4s just announced,

I get 0.99 distance using GenSim wmd implementation and 2.6625 using this implementation (original and from the paper's author).

At first sight, I thought that it was related to your stop words list. That said, debugging your code I see that the first and second texts become:

apple swype iphone iphone crack
apple adding carrier support iphone 4s announced

However, running with the words above, I still get a completely different result. Using GenSim and filtering your stop words (as above) I get 0.96 wmd.

Is there any place where this compatibility is discussed?
Could anybody please confirm if the same numbers are returned for different implementations?

This highly impacts the effectiveness of using GenSim implementation to find semantically close texts.

no moudle named multiarray

I have installed numpy and in the shell,when I type "import numpy.core.multiarray",it's ok.
I don't know why this problem appear?

luoyj@luoyj-Lenovo-M490:~/wmd-master$ python wmd.py twitter_vec.pk twitter_wmd_d.pk
Traceback (most recent call last):
File "wmd.py", line 11, in
[X, BOW_X, y, C, words] = pickle.load(f)
File "/usr/lib/python2.7/pickle.py", line 1378, in load
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatchkey
File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
klass = self.find_class(module, name)
File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
import(module)
ImportError: No module named multiarray

How to work with result (.pk) file

Hi,

first of all thank you for the great work and nice implementation!

The tool works fine for me and I will use it for document comparison in the socal media context. Can you please give me some advise how to work with the resulting "...wmd_d.pk" file? First I thought the result would be a textfile with a readable matrix in it but now I think I need any additional software?

Thank you very much!

technology independent output file

It would be very nice if the output distance matrix file were independent of python formats. So we use it in another languages as well.

Do i need Matlab to run or change the code ?

I was wondering if matlab is required to run/edit the code ?

Paralleling processing

Wow, great paper! Thank you for making the code OSS.

The documentation says that the Python wrapper is not suitable for parallel execution:

The wrapper is not suited for concurrent execution. It uses a global variable for the distance callback function, so calling emd from concurrent threads will result in undefined behavior.

However, the function get_wmd calls emd concurrently. Can you please explain?

there are lots of NaN's in the distance matrix for the example dataset

When I run the example script inside VMWare with Ubuntu installed as a guest OS, I get a matrix with around 100K NaN entries. Could it be a problem with the EMD solver?

Makefile:51: recipe for target 'emd_wrap.c' failed

# git clone https://github.com/mkusner/wmd.git
Cloning into 'wmd'...
remote: Counting objects: 41, done.
remote: Total 41 (delta 0), reused 1 (delta 0), pack-reused 40
Unpacking objects: 100% (41/41), done.
Checking connectivity... done.
# cd wmd/
# pip install gensim numpy scipy
# cd python-emd-master/
# make
>>> Building object file 'emd.o'.
    cc -o emd.o -c emd.c -fPIC -I/usr/include/python2.7 -I/usr/include/x86_64-linux-gnu/python2.7 
In file included from emd.c:20:0:
emd.h:22:0: warning: "INFINITY" redefined
 #define INFINITY       1e20
 ^
In file included from /usr/include/math.h:41:0,
                 from emd.c:18:
/usr/include/x86_64-linux-gnu/bits/inf.h:26:0: note: this is the location of the previous definition
 # define INFINITY (__builtin_inff())
 ^
In file included from emd.c:20:0:
emd.h:32:20: warning: extra tokens at end of #include directive
 #include "Python.h";
                    ^

>>> Generating C interface
swig -python emd.i
make: swig: Command not found
Makefile:51: recipe for target 'emd_wrap.c' failed
make: *** [emd_wrap.c] Error 127
rm emd.o

installation issues

swig is required, but not mentioned.

in emd.h, the include of Python.h has a ; that should be removed

the meaning of row and column in distance matrix(WMD_D)

Dear sir,
I feel sorry to trouble you :
After I run wmd.py, it may get a distance matrix between all documents.
But I am puzzled about the row and column of this distance matrix:
Is every row of distance matrix representing each document? that is the document vector?
Is every column of distance matrix representing the same word of each document?

Thank you so much!

Makefile:45: recipe for target 'emd.o' failed

Hello, I'm encountering this problem.

Anyone has ideas for this issue? Your help is highly appreciated.

when i run python2 example1.py ,i got this problem:

emd: Unexpected error in findBasicVariables!This typically happens when the EPSILON defined in emd.h is not right for the scale of the problem.

installation issues solved

i had installation issues similar to before-mentioned ones.

running
sudo apt-get install python-dev # for python2.x installs
or
sudo apt-get install python3-dev # for python3.x installs
and
removing ";" from include Python.h; in emd.h

solved the problems

Obtaining flow information through python interfance

Hello,

Thank you for the great work and nice implementation. It really helps me!
I know that i can obtain distances through emd( (X[i], BOW_X[i]), (X[j], BOW_X[j]), distance). But how can I get the flow information (transportation matrix)? I have no idea of getting it through python interface.

Zhe Zhao

Makefile:39: recipe for target '_emd.so' failed

Interesting topic and paper. I tried to compile the makefile on Ubuntu (15.04) using Python 2.7, including all of the required libraries, but there is an error that I could not solve it, here is the output of running make:

I would be thankful if you can help me to solve this. Thanks.

Deadlock in Multiprocessing

Thank you for your implementation of your paper.

First, I tried with your code and data. It worked well. (all_twitter_by_line.txt)

Second, I tried 20newsgroup data which was in your paperwork.

Then, I got

"emd: Maximum number of iterations has been reached 1013"

error because of limitation, MAX_SIG_SIZE 100.

So, I change it to over maximum size of unique keywords in 20newsgroup dataset( =5284).

Now, I have trouble with blocking after some steps.

I think it's because of multiprocessing.

I check CPU availability, it was 99% in multiCPU, multicore environment.

Is there any solution for this?

problem with WCD

@mkusner I read your paper and want to use your WCD+RWMD method to calculate docs similarity in my doc recommendation project. I found the code for RWMD in matlab, but didn't find the code for WCD. Is it the file named distance.m?

Is it possible to use GloVe with your code?

Hello Matthew,
Please let me know how I can use the GloVe 840 billion corpus with your code for embedding purpose.
Kindly, guide how to use it.
Thanks in advance.