
Code & data accompanying the KDD 2017 paper "KATE: K-Competitive Autoencoder for Text"

License: BSD 3-Clause "New" or "Revised" License

Language: Python (100%)

Topics: autoencoder, representation-learning, text-mining, topic-modeling, word-embedding, text-embedding, deep-learning


KATE: K-Competitive Autoencoder for Text

Code & data accompanying the KDD 2017 paper "KATE: K-Competitive Autoencoder for Text"

Prerequisites

This code is written in Python 2.7. To use it you will need:

  • Keras (with the TensorFlow backend)
  • TensorFlow (version 1.2.1 reproduces the results reported in the paper; see the issue below on result differences across TensorFlow versions)

Getting started

To preprocess the corpus, e.g., 20 Newsgroups, just run the following:

    python construct_20news.py -train [train_dir] -test [test_dir] -o [out_dir] -threshold [word_freq_threshold] -topn [top_n_words]

It outputs four JSON files under the [out_dir] directory: train_data, train_label, test_data, and test_label. You can download the preprocessed data we used in our experiments here.
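
For example, on the standard 20news-bydate split the invocation might look like this (the paths and the threshold/top-n values are illustrative, not prescribed):

    python construct_20news.py -train 20news/20news-bydate-train/ -test 20news/20news-bydate-test/ -o 20news/out -threshold 5 -topn 2000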

To train the KATE model, just run the following:

    python train.py -i [train_data] -nd [num_topics] -ne [num_epochs] -bs [batch_size] -nv [num_validation] -ctype kcomp -ck [top_k] -sm [model_file]
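
For instance, a hypothetical run that trains a 128-dimensional model with k = 32 for 100 epochs (all values here are illustrative, not the paper's settings):

    python train.py -i 20news/out/train.corpus -nd 128 -ne 100 -bs 100 -nv 1000 -ctype kcomp -ck 32 -sm 20news/out/model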

To predict on the test set, just run the following:

    python pred.py -i [test_data] -lm [model_file] -o [output_doc_vec_file] -st [output_topics] -sw [output_sample_words] -wc [output_word_clouds]
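
For example (paths are illustrative; one of the issues below shows a similar real invocation):

    python pred.py -i 20news/out/test.corpus -lm 20news/out/model -o 20news/out/doc_vecs -st 20news/out/topics -sw 20news/out/sample_words -wc 20news/out/word_clouds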

To train a simple classifier, just run the following:

    python run_classifier.py [train_doc_codes] [train_doc_labels] [test_doc_codes] [test_doc_labels] -nv [num_validation] -ne [num_epochs] -bs [batch_size]
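
For example, feeding the document codes saved by pred.py into the classifier (the file names are illustrative):

    python run_classifier.py 20news/out/train_doc_codes 20news/out/train_labels 20news/out/test_doc_codes 20news/out/test_labels -nv 1000 -ne 100 -bs 100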

To train baseline methods, e.g., VAE, just run the following:

    python train_vae.py -i [train_data] -nd [num_dimensions] -ne [num_epochs] -bs [batch_size] -nv [num_validation] -sm [model_file]
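
Similarly, a hypothetical VAE baseline run (values are illustrative):

    python train_vae.py -i 20news/out/train.corpus -nd 128 -ne 100 -bs 100 -nv 1000 -sm 20news/out/vae_model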

Notes

  1. In order to apply the KATE model to your own dataset, you will need to preprocess the dataset on your own. Basically, prepare the vocabulary and the Bag-of-Words representation of each document (a rough sketch follows these notes).

  2. The KATE model learns vector representations of words (those in the vocabulary) as well as documents in an unsupervised manner. It can also extract topics from the corpus. Document labels are needed only if you want to, for example, train a document classifier on the learned document vectors.
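
As a rough illustration of that preprocessing step (this is not the repo's code; the file names, the threshold value, and the JSON layout are assumptions), building a vocabulary and per-document bag-of-words counts might look like:

    import json
    from collections import Counter

    # toy corpus; replace with your own corpus reader
    docs = {'doc1': 'the cat sat on the mat', 'doc2': 'the dog ate my homework'}

    word_freq = Counter(w for text in docs.values() for w in text.split())
    threshold = 1                                # hypothetical frequency cutoff
    vocab = {w: i for i, w in enumerate(
        sorted(w for w, c in word_freq.items() if c >= threshold))}

    # bag-of-words: document name -> {word id: count}
    bow = {name: dict(Counter(vocab[w] for w in text.split() if w in vocab))
           for name, text in docs.items()}

    json.dump(vocab, open('vocab.json', 'w'))
    json.dump(bow, open('train_data.json', 'w'))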

FAQ

  1. KeyError when plotting word clouds

Make sure the words belong to the vocabulary. See here.
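
A minimal workaround sketch, assuming `vocab` is the word-to-id dict loaded in pred.py and `queries` is the hardcoded word list passed to word_cloud (both names taken from the tracebacks in the issues below; the example words are hypothetical):

    # keep only the query words that survived preprocessing into the vocabulary
    queries = ['interest', 'cash', 'space', 'hockey']   # hypothetical examples
    queries = [w for w in queries if w in vocab]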

Architecture

Experiment results on 20 Newsgroups

PCA on the 20-D document vectors

[figure: 20news_doc_vec_pca]

TSNE on the 20-D document vectors

[figure: 20news_doc_vec_tsne]

Five nearest neighbors in the word representation space

[figure: 20news_word_vec]

Extracted topics

Text classification results on 20 Newsgroups

Visualization of the normalized topic-word weight matrices of KATE & LDA (KATE learns distinctive patterns)

Reference

If you found this code useful, please cite the following paper:

Yu Chen and Mohammed J. Zaki. "KATE: K-Competitive Autoencoder for Text." In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug 2017.

@inproceedings{chen2017kate,
  author    = {Yu Chen and Mohammed J. Zaki},
  title     = {KATE: K-Competitive Autoencoder for Text},
  booktitle = {Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  doi       = {10.1145/3097983.3098017},
  year      = {2017},
  month     = {Aug}
}

Other research papers that applied the KATE model:

Chen, Yu, Rhaad M. Rabbani, Aparna Gupta, and Mohammed J. Zaki. "Comparative text analytics via topic modeling in banking." In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1-8. IEEE, 2017.

@inproceedings{chen2017comparative,
  title={Comparative text analytics via topic modeling in banking},
  author={Chen, Yu and Rabbani, Rhaad M and Gupta, Aparna and Zaki, Mohammed J},
  booktitle={2017 IEEE Symposium Series on Computational Intelligence (SSCI)},
  pages={1--8},
  year={2017},
  organization={IEEE}
}


Contributors

dependabot[bot] · hugochan · johngiorgi


Issues

visualization using pca and tsne

I hope you have not gotten tired of my questions; I am almost there :)
Could you let me know how I can visualize the topics as well?

I can see the visualize.py script, and these arguments are necessary:

doc_codes, doc_labels, classes_to_visual, save_file

We already have doc_codes and doc_labels, obtained from the predict.py script.

What do you mean by classes here?
I appreciate your help :)

UTF-8 codec can't decode byte

I am trying to run the preprocessing step of 20news on the data downloaded here.

I get the following error:

$ python construct_20news.py -train 20news/20news-bydate-train/ -test 20news/20news-bydate-test/ -o 20news/out
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfe in position 1146: invalid start byte

To avoid the issue I replaced this line with the following:

with codecs.open(filename, 'r', encoding='UTF-8', errors='ignore') as fp:

But then the following steps fail:

$ python pred.py -i 20news/out/test.corpus -lm 20news/out/model -o 20news/out/output_doc_vec_file -st 20news/out/output_topics -sw 20news/out/output_sample_words -wc 20news/out/output_word_clouds
Using TensorFlow backend.
2017-09-11 15:23:01.788147: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 15:23:01.788191: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 15:23:01.788199: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 15:23:01.788204: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 15:23:01.788209: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
/usr/local/lib/python2.7/dist-packages/keras/models.py:251: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
Saved doc codes file to 20news/out/output_doc_vec_file
Saved topics file to 20news/out/output_topics
Traceback (most recent call last):
File "pred.py", line 162, in
main()
File "pred.py", line 159, in main
test(args)
File "pred.py", line 114, in test
word_cloud(weights, vocab, queries, save_file=args.word_clouds)
File "/usr/local/google/home/eliav/virtual_env/home/kate/KATE-master/autoencoder/testing/visualize.py", line 35, in word_cloud
words = [(i, vocab[i]) for i in s]
KeyError: 'cash'

I guess the two steps are connected, and the fact that I added errors='ignore' caused the issue.

Thanks in advance,
Eliav

[Question] on positive and negative neurons

Hi!

I came across your paper on arXiv and it's nice to see the code being open-sourced. I am also interested in autoencoders and I'm applying them to my research on protein function prediction. Nice work and good results; I just have some questions on the K-competitive layer:

  1. I'd like to clarify how the positive and negative neurons are chosen. If I understood correctly, they are assigned as a result of the feedforward step in z. Is this correct?
  2. If we assign a value of k greater than 2 and obtain multiple positive winners, how do we know which winner takes which positive loser? Or do all of them "soak up" the energy?
  3. Was there any previous research on the effects of reallocating energy in a neural network? Was this inspired by RBMs? What is the use of redistributing the energy instead of letting it be (a bit similar to the winner-take-all AE)?

That's all and thank you so much! 😄

[Question] Back-propagation through losers via alpha

Hello,

I have a question about your back-propagation.
In the k-sparse autoencoder, training suffers from the dead-hidden-neurons problem, so the authors solved it by scheduling the sparsity level over epochs.

In KATE, however, the gradients still flow through the losers via the alpha amplification connection.
I know that there is no gradient flowing directly from the output to the losers, but I still can't understand how the weights of the losers (loser-to-input) can be updated.
Or have I misunderstood something?
Could you please explain the procedure in detail? I am very interested in this method!

Thank you very much!
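
Not the repo's Keras code — a minimal NumPy sketch of the forward step this thread describes, as the paper presents it (the function name and the ceil/floor split of k between positive and negative winners are my assumptions):

    import numpy as np

    def k_competitive(z, k, alpha):
        # z: 1-D tanh activations for one example.
        out = np.zeros_like(z)
        pos, neg = np.where(z > 0)[0], np.where(z < 0)[0]
        n_pos = int(np.ceil(k / 2.0))   # positive winners
        n_neg = k - n_pos               # negative winners

        if pos.size:
            order = pos[np.argsort(z[pos])]            # ascending
            win, lose = order[-n_pos:], order[:-n_pos]
            # winners absorb the positive losers' energy, scaled by alpha
            out[win] = z[win] + alpha * z[lose].sum()

        if neg.size:
            order = neg[np.argsort(z[neg])]            # most negative first
            win, lose = order[:n_neg], order[n_neg:]
            # winners absorb the negative losers' energy, scaled by alpha
            out[win] = z[win] + alpha * z[lose].sum()

        return out

The point relevant to the question: each winner's output contains the term alpha * sum(loser activations), so every loser's activation, and therefore its input-side weights, sits on the computational path to the loss. Automatic differentiation in Keras/TensorFlow sends a gradient of alpha times the winners' upstream gradient back through that term, even though the losers' own outputs are zeroed.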

how alpha participates in the back-propagation phase

I hope this will be my last question :octocat:

So, I am clear on the theoretical concept, but looking at the implementation I cannot work out how Keras gets alpha to participate in the back-propagation.
Actually, I know my question is not mainly related to your implementation but to the way Keras works.

From what I know, when Keras back-propagates it updates the weight matrices automatically; I could not find any resource on how it gets neurons that have already been set to zero to participate with the alpha percentage.

I hope that gives some hint of what I mean.

Thanks for taking the time:)

Custom Dataset Help!

Hi,

I am trying to use a custom dataset with the KATE model, but I am unclear on a few things:

  • I have created a BoW from my training data using Gensim, but I am not sure what you mean by vocabulary?
  • I am quite new to using GitHub in general; is there a way I can train the model within Jupyter Lab?
  • If there is a clearer tutorial somewhere, could you point me in the right direction?

I would really appreciate your advice!

Kind regards,

Cellan.

clustering approach

Hello,

Actually, I have a problem understanding your code. Though I expected to see a clustering approach that does not need any labels, I can see that in part of your code you create labels for the text data.

I want to apply your code, with some modifications, to health data for my research. I do not have labels, so I am wondering whether or not this code is helpful for me.
It would be appreciated if you could clarify a little bit.

Thanks:)

[Question] What is the actual sense of using contractive_loss?

Hello! Thank you for making the code available. I am experimenting with your autoencoder on short Russian texts (up to 128 words long, 50k stemmed words in the vocabulary, 1.5M texts in the dataset). I can share further results with you if you are interested.

So my question is about using contractive_loss. I found it in the code, and it seems to be preferable for me over cross-entropy, but I cannot find any mention of it in the paper. What can you say about it? Did you experiment with it? When is it more suitable, in your experience? Thanks.

error unicode data

I am trying to run on the 20news group dataset and I got this error:

Traceback (most recent call last):
File "construct_20news.py", line 30, in
main()
File "construct_20news.py", line 23, in main
train_corpus, test_corpus = construct_train_test_corpus(args.train_path, args.test_path, args.out_dir, threshold=args.threshold, topn=args.topn)
File "/home/saria/Downloads/KATE-master/autoencoder/preprocessing/preprocessing.py", line 167, in construct_train_test_corpus
train_docs, vocab_dict, train_word_freq = construct_corpus(train_path, True, threshold=threshold, topn=topn, recursive=True)
File "/home/saria/Downloads/KATE-master/autoencoder/preprocessing/preprocessing.py", line 120, in construct_corpus
word_freq, doc_word_freq = load_data(corpus_path, recursive)
File "/home/saria/Downloads/KATE-master/autoencoder/preprocessing/preprocessing.py", line 112, in load_data
raise e
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 426: invalid continuation byte

Thanks

VAE

Hi,

Thanks for sharing your interesting project. Any idea why you did not apply the approach to the VAE?

topics or word representations

Hi :)

Where has the word representation been saved?

How can I see the list of words extracted by the KATE model?

(For constructing the BoW, I built it myself and fed it to the model, but I faced several errors, probably because my format was different, so I stuck with the way you build it for 20_newsgroup and applied the same for my case. I suppose there should be no problem, as I have almost the same kind of dataset with only one difference: I have one folder and inside it there are 150K documents, so I suppose it should be fine. I'm sharing this here in case you think there may be a problem; please let me know :) )

Thanks :)

some confusion about the code

Hi, hugo,

Actually, this may not relate explicitly to the issues here, but I could not find an email address to reach you. I want to apply your approach to a very large textual medical dataset (two sets of 500k documents each, about 400 MB in size).
If your method does well, we will apply it to the different medical applications we have.
I hope you don't mind answering a couple of questions.
Also, I only care about the topic modeling part, so I do not have labels to calculate the accuracy.

  1. Why are there several dataset-construction scripts, e.g. 20news-group, reuters, ...?
    I know that you evaluated your approach on several datasets, but why does each dataset get its own construction script? It is confusing to have several of them.
    In your example for running the code you use this line: python construct_20news.py -train [train_dir] -test [test_dir] -o [out_dir] -threshold [word_freq_threshold] -topn [top_n_words].
    So when should we use the others?

  2. When I run my own dataset, I would expect a vocabulary to be generated from my dataset by the method; so why is there an issue of some words, like "interest", not being found? (According to your reply: "We can only plot word clouds for those appearing in the vocabulary. In your case, the word interest hardcoded in the original script is not part of the vocabulary." So why do we need an already-prepared vocabulary?) What should I do if I want to visualize my own data as you did?
    Also, I can see that a vocabulary is defined in run_lda but not in the k-competitive model, so why does running the k-competitive model raise that error?

  3. I want to make the same comparison you made in the paper with several approaches (I greatly appreciate your work; it's awesome that you implemented almost all of them, included them in your code, and made it public 👍), so I need to know the steps to apply those approaches to my datasets.
    I know it takes time to make your readme a bit easier to follow, but it would be greatly appreciated :). For example, you did not explain how to resolve the issue of words not being found in the vocabulary, or how to run the other approaches so we can compare your approach with the others on very different datasets; that would make the strength of your approach even clearer, as it does well on very different datasets.

Many thanks in advance for taking time :)

Dense_tied function

Hi again :)

Do you mind explaining why we need this function here? Why not just use a simple Dense layer?

I would also like to know more about your experience: did you see any improvement using tied weights?
Thanks!
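
For context, a minimal NumPy sketch of what tied weights mean, independent of the repo's Dense_tied layer (the shapes are illustrative): the decoder reuses the transpose of the encoder's weight matrix instead of learning a separate one, roughly halving the parameter count and acting as a regularizer.

    import numpy as np

    np.random.seed(0)
    W = np.random.normal(scale=0.01, size=(2000, 128))  # encoder: vocab_size x hidden
    b_enc, b_dec = np.zeros(128), np.zeros(2000)

    def encode(x):
        return np.tanh(x.dot(W) + b_enc)

    def decode(h):
        # tied weights: reuse W.T instead of learning a separate decoder matrix
        return 1.0 / (1.0 + np.exp(-(h.dot(W.T) + b_dec)))

    x = np.random.rand(2000)          # a toy bag-of-words input
    x_hat = decode(encode(x))         # reconstruction through the tied autoencoder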

PCA projection plot.

I have trained a model on the newsgroup data, and now I am trying to plot the PCA projection. I tried to run plot.py, which calls visualize.py to plot the PCA projection, but it doesn't work. If I understood what exactly you are trying to plot, I could write a script using the results of the training phase, but I couldn't follow the steps in the code. What is the input here? The final document distribution generated from the training phase? How is it a 2-D input for PCA? I also read your paper, but it wasn't clear there either. I would appreciate it if you could clarify how to use this.
For example, there are 1108 news documents for the label "comp.sys.mac.hardware" (after training the model); should I put all of them together as a 2-D array as one input to PCA and do the same for the other labels?
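
A minimal sketch of that plot, assuming doc_codes is a JSON dict mapping a document id to its 20-D code vector and doc_labels maps a document id to its class (these file formats are my assumption, not confirmed by the repo): stack all document vectors into one (n_docs x 20) matrix, fit PCA once on the whole matrix, and color the resulting 2-D points by label.

    import json
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    codes = json.load(open('doc_codes.json'))    # hypothetical output of pred.py
    labels = json.load(open('doc_labels.json'))  # hypothetical label file

    ids = sorted(codes)
    X = [codes[i] for i in ids]                  # one (n_docs, 20) matrix
    pts = PCA(n_components=2).fit_transform(X)

    for c in sorted(set(labels[i] for i in ids)):
        xy = [p for p, i in zip(pts, ids) if labels[i] == c]
        plt.scatter([p[0] for p in xy], [p[1] for p in xy], s=4, label=c)
    plt.legend(fontsize=6)
    plt.savefig('pca_doc_codes.png')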

using sigmoid function

Hello again :)

So I am now looking into the details of the approach.

  1. In the last part of the Effect of parameter tuning section you discussed that you got the best result with the tanh function. As your idea was to divide positive and negative neurons and then find the k competitive neurons among the positive and negative ones, how could you experiment with the sigmoid function?
    Did you consider ranges like below 0.5 and above 0.5?

VAE loss and classification accuracy

Hi :) Sorry to bother you again.
I tried train_vae.py to train a VAE model without the competition layer but got a high loss (60-70)
(the dimension is 128).
Early stopping happened at around epochs 20-50 even though I tried a larger patience.
The loss includes the KLD and the reconstruction error, but is this loss value on a normal scale?
I then used pred_vae.py to predict 20news document embeddings,
but after running run_classifier.py I can only get about 22% classification accuracy.

Do you have any suggestions for improving the VAE model?
thank you

error in prediction (last line)

hi,

When I run the last command I get this error:

Firstly, it would really be helpful if you could add an example to each run; for example, for the last command, what should be substituted for [output_word_clouds]?
I just created some empty files and gave the paths.

this is the error :

Using TensorFlow backend.
run k_comp_tanh
WARNING:tensorflow:From /infodev1/home/m193053/PycharmProjects/KATE-master/autoencoder/utils/keras_utils.py:143: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating: keep_dims is deprecated, use keepdims instead
2018-06-06 10:33:58.571203: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
/home/m193053/anaconda3/envs/conda27/lib/python2.7/site-packages/keras/models.py:282: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '
Saved doc codes file to /home/m193053/Downloads/out/vecfiles
Saved topics file to /home/m193053/Downloads/out/topics
Traceback (most recent call last):
File "pred.py", line 162, in <module>
main()
File "pred.py", line 159, in main
test(args)
File "pred.py", line 114, in test
word_cloud(weights, vocab, queries, save_file=args.word_clouds)
File "/infodev1/home/m193053/PycharmProjects/KATE-master/autoencoder/testing/visualize.py", line 35, in word_cloud
words = [(i, vocab[i]) for i in s]
KeyError: 'interest'

and this is what I run:
python pred.py -i /home/m193053/Downloads/out/test.corpus -lm /home/m193053/Downloads/out/model -o /home/m193053/Downloads/out/vecfiles -st /home/m193053/Downloads/out/topics -sw /home/m193053/Downloads/out/words -wc /home/m193053/Downloads/out/clouds

For vecfiles, topics, words, and clouds I created text files and gave the paths, but I am not sure that is the correct way to do it.

Thanks for your time :)

different results when running different versions of TensorFlow

Hi again,
Finally I found the reason why I could not replicate your result: until now I was using TensorFlow 1.14, and with this version we get 69 percent accuracy on the 20 newsgroups dataset with 512 dimensions.

However, once I changed the TF version back to 1.2.1, it replicated the result reported in the paper.
An example of extracted topics with version 1.14:

line subject organ armenian isra turkish israel armenia arab turkey

line subject organ car bike speed brake truck ride motorcycl

line organ subject clipper escrow chip livesey isc cramer key

line subject organ cramer optilink gay clayton homosexu gun crimin

line subject organ april space san sexual washington cal ron

I would appreciate it if you could help with this part: why is there so much difference from only changing the TensorFlow version?

Thanks.

[Question] output topics representation

Hello!
[screenshot of the output topic representation]

This is a part of my output topic representation, but it looks a little strange because the words do not seem very important (the, is, with, will, or, s, some, ...).
My input training data is 20newsgroup, and I split it into a 70% training set and a 30% testing set.

I think maybe I did something wrong somewhere.
Do you have any suggestions?

thank you very much.

[Question] Possible to apply to different kinds of data?

Hi everyone,

Thank you for answering my question in #1. I hope you won't mind some more questions:

  1. I would like to ask whether this approach is dataset-dependent, meaning, is KATE optimally designed for text data? Is feeding it image data, say MNIST, a possible application of this work?

  2. Are there any future research directions regarding the architecture of KATE? I'm interested in the idea of competition between neurons and will probably explore it more in my research.

Thank you again!

neurons in the middle layer

If we consider the latent space to be, say, 20, and k = 12, then we will have only 12 active neurons in the middle layer.
My question is: when we print our topics in the latent space, does it mean that some of the neurons have already been shut down, so we only have 12 left?

Am I missing something?

Reproduce reported results

Hi @hugochan,

Could you please publish the actual 20 news data that you used to train the model and get the results reported in the paper?

I downloaded the data from http://qwone.com/~jason/20Newsgroups/ (filename 20news-bydate.tar.gz) and used your code, but I cannot reproduce your reported results. (I also used your preprocessed data, but that didn't work either.)

Thanks,

Sharing the trained models?

Hi,

I was wondering if you could share the trained models presented in the paper for each dataset? (similar to how you have shared the preprocessed data).

This would be immensely useful for anyone who wants to benchmark against your method, as we could effectively just provide the model file to the -lm argument of pred.py.

steps to achieve accuracy

Could you explain which scripts need to be run to feed the output of the encoder to the classifier and finally calculate the accuracy?
You have two files, run_clf and run_classifier.

getting NVDM to run

Any idea how to get NVDM working?
I need it for comparison purposes, to report its performance over different parameters than the ones used in the paper.

Thanks~

seeing rare words

Hi,

Do you have any idea how I can change the parameters to see more rare words?
I have changed my dataset such that I put _ between two words, so the result would be word1_word2, a kind of bigram for cases where I needed them as a bigram. When I look at the results, I cannot see any output containing the bigrams. I was thinking maybe they do not occur frequently enough in my dataset.

Do you have any idea how I can see more rare words in the output?
Thanks,
:)

running other models

Hi again:)

I would like to see the difference between the KATE model's output and other models'.

Do you mind letting me know how to run the other models, mostly DocNADE, VAE, and Word2vec?
For DocNADE, I do not know what "train_doc_codes" is; how do I create it? Actually, I think if I first run lib2svm it will produce this corpus format, but I got an error while running it; should I take any other step?

Word2vec: there are two scripts, run_w2vec and run_doc_w2vec. Which one should I run, and what are these parameters:
train flag, docname, path to the trained model (if in the training phase), path to the word2vec model file.

justification

Also, how can you justify using the counts of words as the training input to the model?
What if we fed word2vec vectors instead and then tried to make a better representation using this model?

Thanks :)
