
gregversteeg / corex_topic


Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

License: Apache License 2.0

Python 54.99% Jupyter Notebook 45.01%
python machine-learning unsupervised-learning topic-modeling information-theory

corex_topic's People

Contributors

devanshudesai, gregversteeg, lfierro, ryanjgallagher, turambar, zorzalerrante


corex_topic's Issues

2 Visualization Issues

Hello,

I'm having trouble with two visualization issues in corex_topic, both from the provided examples.

The first comes from the code below, from the README.md file, which returns the error "IndexError: tuple index out of range". All of the code in the README.md above these lines works just fine. Can you advise on how to fix this?

from corextopic import vis_topic as vt
vt.vis_rep(topic_model, column_label=words, prefix='topic-model-example')

The second issue comes from vis_hierarchy in the provided corex_topic_example.ipynb file. It produces a folder with two files, "groups.txt" and "topics.txt"; "groups.txt" is populated but "topics.txt" is not. Do I have to create the DiGraph manually after this step? I do not see it in the folder and I cannot get it to print in the Jupyter Notebook.

vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=words, max_edges=200, prefix='topic-model-example')

I've checked the open and closed issues and noticed similar issues to these but have not yet seen solutions. Thanks in advance for any help.
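(As a stopgap while vis_rep is broken, the topics can at least be printed without the vis_topic module; a minimal sketch, assuming the model was fit with words=words as in the README:)

topics = topic_model.get_topics()
for n, topic in enumerate(topics):
    # each entry is a (word, mutual_information, ...) tuple; keep the word only
    topic_words = [t[0] for t in topic]
    print('{}: {}'.format(n, ', '.join(topic_words)))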

AttributeError: 'DiGraph' object has no attribute 'node'

Hi, I followed the corex_topic_example notebook strictly with my own data (7,611 research paper abstracts) and everything ran fine until the visualization step of the example hierarchical topic model at the end, for which I got the error below.

Any idea where this would come from? Many thanks for your answer.

Regards,

Michel

vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=words, max_edges=200,
prefix='topic-model-example')
weight threshold is 0.000000 for graph with max of 200.000000 edges
Traceback (most recent call last):

File "", line 2, in
prefix='topic-model-example')

File "/home/michel/Documents/CURRENT WORK/Correlation Explanation/corex_topic/corextopic/vis_topic.py", line 68, in vis_hierarchy
g = make_graph(weights, node_weights, l1_labels, max_edges=max_edges)

File "/home/michel/Documents/CURRENT WORK/Correlation Explanation/corex_topic/corextopic/vis_topic.py", line 139, in make_graph
g.node[(layer + 1, j)]['weight'] = 0.3 * node_weights[layer][j] / max_node_weight

AttributeError: 'DiGraph' object has no attribute 'node'
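This is almost certainly a networkx version issue: Graph.node was removed in networkx 2.4 in favour of Graph.nodes. Two possible fixes are to pin networkx below 2.4, or to patch the line from the traceback, for example:

# vis_topic.py line 139, updated for networkx >= 2.4 where Graph.node was removed;
# Graph.nodes exposes the same per-node attribute dict:
g.nodes[(layer + 1, j)]['weight'] = 0.3 * node_weights[layer][j] / max_node_weight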

scipy compatibility problems after anaconda upgrade

Upon importing corex I get the following error:

Traceback (most recent call last):
File "", line 1, in
File "/miniconda3/lib/python3.7/site-packages/corextopic/corextopic.py", line 25, in
from scipy.misc import logsumexp # Tested with 0.13.0
ImportError: cannot import name 'logsumexp' from 'scipy.misc' (/miniconda3/lib/python3.7/site-packages/scipy/misc/__init__.py)

A solution might be here:
https://github.com/cvxgrp/cvxpy/issues/640
(I can import: "from scipy.special import logsumexp")
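For reference, a version-agnostic import shim along the lines of the cvxpy fix (this is an assumption about how you might patch corextopic.py locally, not the upstream code):

# scipy moved logsumexp from scipy.misc to scipy.special and later removed the old alias
try:
    from scipy.special import logsumexp  # newer scipy
except ImportError:
    from scipy.misc import logsumexp     # older scipy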

Corex with large data

I would like to use CorEx to extract topics from a 60-million-word corpus divided into chunks of 100 words. I am wondering how to scale CorEx so that it can cope with a document-term matrix with 600,000 rows.
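(Not an official answer, but the README examples pass a scipy sparse document-term matrix, which is the usual way to keep a 600,000-row matrix tractable for the sparse count data this model targets. A minimal sketch, assuming chunks is a list of raw text strings and a recent scikit-learn:)

import scipy.sparse as ss
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

vectorizer = CountVectorizer(max_features=20000, binary=True)
doc_word = ss.csr_matrix(vectorizer.fit_transform(chunks))   # chunks: list of str
words = list(vectorizer.get_feature_names_out())             # get_feature_names() on older sklearn

topic_model = ct.Corex(n_hidden=50)
topic_model.fit(doc_word, words=words)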

Hierarchical topic model visualization

The graphviz functions for visualizing the topic model are finicky and it's costly to have to update them with respect to both networkx and graphviz.

Two proposed options:
1. We add a function that makes it easy to get the hierarchy edge list, so that others could more easily visualize the hierarchy themselves (see the sketch after this list)
2. We go a step further and rework the visualization code so it just uses networkx
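A rough sketch of what option 1 could look like, assuming each layer exposes the same alpha and mis arrays (indexed [topic, input]) that vis_topic.make_graph already consumes; treat the attribute names as assumptions, not the final API:

import numpy as np
import networkx as nx

def hierarchy_edges(corexes):
    """Edges (parent_layer, parent_topic) -> (child_layer, child_topic) with weights."""
    edges = []
    for layer, corex in enumerate(corexes[1:], start=1):
        # inputs at layer > 0 are the topics of the layer below
        weights = corex.alpha * corex.mis
        parents = np.argmax(weights, axis=0)   # best parent topic for each child input
        for child, parent in enumerate(parents):
            edges.append(((layer, int(parent)), (layer - 1, child), float(weights[parent, child])))
    return edges

g = nx.DiGraph()
for parent, child, w in hierarchy_edges([topic_model, tm_layer2, tm_layer3]):
    g.add_edge(parent, child, weight=w)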

Topic Word Shifts

The total correlation of each topic (and overall) is a function of additive contributions from each word. So, given a trained CorEx topic model and two different sets of documents, the difference in TC (either per topic or overall) can be decomposed into a ranked list of which words contribute most to that difference (see attached document). This yields an interpretable measure of how topical information differs between different sets of documents, and allows us to draw careful qualitative conclusions when comparing documents, particularly out-of-sample documents.

get_topic_word_shift(X1, X2, topic_n=None)
Input
X1, X2: doc-term matrices of shapes n_docs1 x n_words and n_docs2 x n_words, where the columns of each matrix correspond to the same words as the original doc-term matrix used to train the CorEx topic model
topic_n: either None or an integer. If None, returns the ranked list of words contributing to the difference across all topics. If an integer, specifies which topic to compute the TC difference contributions within.
Output
word_contributions: list of (word, normalized contribution) tuples, ranked from high to low by how much each word contributes to the difference in TC between X1 and X2

Most of the machinery for this function is already in calculate_latent and normalize_latent. The einsums likely need to just be changed a bit, but I've been struggling to parse the right way of doing it. Any help would be appreciated.

corex_topic_word_shift.pdf
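A hedged sketch of the proposed interface (names and docstring only; the actual contributions would come from reworking the einsums in calculate_latent / normalize_latent as described above):

def get_topic_word_shift(self, X1, X2, topic_n=None):
    """Rank words by their contribution to the difference in TC between X1 and X2.

    X1, X2: doc-term matrices (n_docs1 x n_words, n_docs2 x n_words) whose columns
        match the doc-term matrix used to train this CorEx topic model.
    topic_n: None to rank contributions across all topics, or an int to restrict
        the TC-difference decomposition to a single topic.

    Returns word_contributions: a list of (word, normalized contribution) tuples,
        sorted from largest to smallest contribution.
    """
    raise NotImplementedError("see corex_topic_word_shift.pdf for the derivation")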

Metrics for Model Selection

Hi,

I'm testing some semi-supervised models, each with 20 topics created through lists of roughly 15 anchor words per topic. The documents within the corpus I'm working with have a large variance in word length (150 - 20,000+). I've broken the documents into smaller batches to help control for document length, and am looking to find the batch size and anchor strength which creates the best model.

I know that total correlation is the measure CorEx maximizes when constructing the topic model, but in my experimenting with anchor strength I've found that TC always increases linearly with anchor strength, even when it's set into the thousands. So far I've been evaluating my models by comparing the anchor words of each topic to the words returned from .get_topics(), and I was wondering if there is a more quantitative way of selecting one model over another. I've looked into using other packages to measure the semantic similarity between the anchor words and the words retrieved by .get_topics(), but wanted to reach out to see if there are any other metrics out there to measure model performance.

Additionally, besides batch size and anchor strength, are there any other parameters I should be aware of when fitting a model? Any help would be greatly appreciated.
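(One quantitative handle that is already exposed is each model's total correlation, with the caveat noted above that TC keeps growing with anchor strength, so it is most meaningful when comparing models with the same anchoring setup. A minimal sketch, assuming a dict of fitted candidate models and the tc / tcs attributes described in the README:)

# candidate_models, e.g. {"batch_200_strength_3": model_a, "batch_500_strength_3": model_b}
for name, model in candidate_models.items():
    print(name, "overall TC:", model.tc)                       # sum of per-topic TCs
    print("  per-topic TC:", [round(float(tc), 3) for tc in model.tcs])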

How to save and restore a corex topic model?

For running a model that takes a long time, it would be useful to save the model to be read in again later for additional hierarchical modeling. Is there a suggested way to save the model? This would need to save the weights and the assignments, I suppose, but how do you restore? Thanks.
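For reference, the model object has a save method and the corextopic module has a load function (both appear in the ResourceWarning issue later in this list), so a minimal save/restore round trip, assuming those signatures, looks like:

from corextopic import corextopic

# Save after the expensive fit; the weights and assignments are pickled with the object.
topic_model.save("corex_model.pkl", ensure_compatibility=False)

# Later: restore and continue with hierarchical layers.
topic_model = corextopic.load("corex_model.pkl")
tm_layer2 = corextopic.Corex(n_hidden=10)
tm_layer2.fit(topic_model.labels)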

How can we test the model on new data?

Hello, thank you for this tutorial. I want to build an anchored model for sentence classification (I have 5 classes), so I trained an anchored model with 5 topics, but how can I test the model on new sentences? There is a "predict" attribute but I get an error.
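(A minimal sketch of scoring new sentences; the vectorizer and anchored_model names are placeholders, and the key point is to reuse the vectorizer fitted on the training data so the column space matches. The predict and transform calls are the ones that appear elsewhere in these issues:)

import scipy.sparse as ss

new_sentences = ["some unseen sentence", "another unseen sentence"]
# transform with the fitted training vectorizer -- not fit_transform on the new data
X_new = ss.csr_matrix(vectorizer.transform(new_sentences))

labels = anchored_model.predict(X_new)                      # binary topic assignments
probs, _ = anchored_model.transform(X_new, details=True)   # soft assignments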

"starred_words" variable not found in notebook

Hey all! I'm looking at the notebook you provided (thank you!), and in one of the code sections (second under Hierarchical Topic Models), there's a call with an undefined variable: starred_words.

vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=starred_words, max_edges=200, prefix='topic-model-example')

Coherence Scores

Hi,

Thank you for the great package.

I noticed in your paper that you measure the coherence scores of corex outputs (https://www.aclweb.org/anthology/Q17-1037.pdf)

However, in the class I do not see a method to output the coherence values. Could you point me in the right direction?

Thanks in advance!

Adam
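(Not part of the corextopic class, but one common workaround is to compute coherence externally, e.g. with gensim's CoherenceModel over the CorEx topic words. A hedged sketch, assuming texts is the list of tokenized training documents and that gensim is installed:)

from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

dictionary = Dictionary(texts)
# top words per CorEx topic (first element of each (word, mi, ...) tuple)
topic_words = [[t[0] for t in topic[:10]] for topic in topic_model.get_topics()]

cm = CoherenceModel(topics=topic_words, texts=texts,
                    dictionary=dictionary, coherence='c_v')
print("mean coherence:", cm.get_coherence())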

Some documents not belonging to any topics

Hello,

I'm running into a problem where some documents are not being assigned to any topic and some documents are being assigned to all the topics. When I check the probabilities using p_y_given_x, I get this (where n_hidden = 5):

[[9.99999e-01 1.00000e-06 1.00000e-06 1.00000e-06 9.99999e-01]
[1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06]
[1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06]
...
[1.00000e-06 9.99999e-01 1.00000e-06 9.99999e-01 1.00000e-06]
[9.99999e-01 9.99999e-01 1.00000e-06 9.99999e-01 9.99999e-01]
[9.99999e-01 9.99999e-01 9.99999e-01 9.99999e-01 9.99999e-01]]

Any idea as to why this is happening?
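(A hedged aside: if every document needs at least one topic, one option is to rank by the soft probabilities rather than the thresholded binary labels. A minimal sketch using the same p_y_given_x-style output via transform, with doc_word standing in for the training doc-term matrix:)

import numpy as np

probs, _ = topic_model.transform(doc_word, details=True)   # n_docs x n_hidden
best_topic = np.argmax(probs, axis=1)          # one topic per document, even if all probs are low
n_topics_per_doc = (probs > 0.5).sum(axis=1)   # illustrative threshold showing how all-or-nothing labels arise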

[Question] Hierarchical Topic Model

From the ReadMe file, we can create a Hierarchical Topic Model from something like this:

# Train the first layer
topic_model = ct.Corex(n_hidden=100)
topic_model.fit(X)

# Train successive layers
tm_layer2 = ct.Corex(n_hidden=10)
tm_layer2.fit(topic_model.labels)

tm_layer3 = ct.Corex(n_hidden=1)
tm_layer3.fit(tm_layer2.labels)

I'm wondering how I can use this hierarchical topic model. Moreover, what methods should I use with it?
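(A hedged sketch of one way to read the hierarchy off these objects: each layer is fit on the labels of the layer below, so every layer keeps one row per document, and the label matrices can be compared directly. Attribute usage is assumed from the README snippet above:)

import numpy as np

# topic_model.labels is n_docs x 100, tm_layer2.labels is n_docs x 10, tm_layer3.labels is n_docs x 1
doc_layer2_topics = [np.where(row)[0] for row in tm_layer2.labels]
print("Document 0 belongs to layer-2 topics:", doc_layer2_topics[0])

# Which fine-grained (layer-1) topics sit under a coarse (layer-2) topic?
# A rough heuristic: co-assignment counts between the two label matrices.
co_assignment = topic_model.labels.T.astype(int) @ tm_layer2.labels.astype(int)   # 100 x 10
print("Layer-1 topics most associated with layer-2 topic 0:",
      np.argsort(-co_assignment[:, 0])[:5])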

Topic assigned to document with 0.99 probability, but no words shared between document and topic

Hello

I have 200k documents and I create 100 topics. I look at the terms and see that the topics are good.
But when I want to look at examples for each topic, I run probs, _ = topic_model.transform(count_matrix, details=True). Then I create a new column for each topic, for example dataframe['topic=0'] = pd.Series(probs[:, 0]). Then I sort the dataframe by decreasing probability and I see that about 1/3 of the documents are relevant to the topic but the others are irrelevant. Moreover, not a single word is shared between those documents and the topic; there is no indication of similarity between them.

I noticed that the last ~10 topics have few words (3-8) in the get_topics result, the words look random, and their probability values of ~0.2-0.3 are above average.

Could you advise me how I can change the model, in particular the recalculation of the document-topic probability estimates? Thank you.

Clearing self.words before saving model

I don't get the point of clearing self.words before saving the CorEx model, considering most people load the model to use .get_topics(), and clearing self.words basically makes that function useless for prediction purposes.

How to do word cloud or frequency distribution on each topic?

First of all, thanks for the wonderful work. It works perfectly; I got my topics with the right anchor words. Everything is working fine, however I want to see the word cloud or frequency distribution of each topic. How can I do that? Thanks in advance.
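(One way to do this, as a hedged sketch: feed each topic's mutual-information weights from get_topics() into the wordcloud package's frequency interface. The wordcloud and matplotlib dependencies are assumptions, not part of this repo:)

from wordcloud import WordCloud
import matplotlib.pyplot as plt

for n, topic in enumerate(topic_model.get_topics()):
    weights = {t[0]: float(t[1]) for t in topic}   # word -> mutual information
    wc = WordCloud(background_color="white").generate_from_frequencies(weights)
    plt.figure()
    plt.title("Topic {}".format(n))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
plt.show()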

Priority is always given to the first anchor from anchor words

I have a dataset that consists of 10 thousand documents. It definitely contains documents for 16 topics. With anchor words, I want to classify a dataset into 16 topics. For each topic, I set anchor words (some anchors have more words, some less, but on average about 50 words per topic).
For each topic anchor words are set in a separate list, then I check for the presence of anchor words in the texts and add them to the general list of lists anchors.

But in the output, one topic always dominates (90-95% of my documents), and it is the topic whose words are listed first in the anchor words (I checked this by changing the order of the anchor words).

For example, I have a desserts theme and an alcoholic drinks theme. If I put the anchor words of the desserts theme first in the list of anchor words, then this theme prevails in the output. If I put the anchor words of the alcoholic beverages theme first, then the alcoholic beverages theme prevails.

By "prevail" I mean that 90% or more of the documents are labeled with the first topic in the anchor word list. The others of the 16 topics also appear in the output, but much less often, and they are also often wrong.

Can you please tell me why this is happening and what am I doing possibly wrong?

Thank you in advance for your help and answer!

[Warning] ResourceWarning: unclosed file

Definition

After saving and loading with pickle, file descriptors are not closed.

How to reproduce

...
from corextopic import corextopic

corex_model = corextopic.Corex(n_hidden=10, verbose=True, max_iter=200)
corex_model.fit(corpus, words=words)

path = "path/to/corex_model.pkl"
corex_model.save(path, ensure_compatibility=False)

> ResourceWarning: unclosed file <_io.BufferedWriter name='path/to/corex_model.pkl'>


loaded_model = corextopic.load(path)
> ResourceWarning: unclosed file <_io.BufferedReader name='path/to/corex_model.pkl'>
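Until the library's save/load wrappers close their file handles, a workaround that avoids the warning is to pickle through a context manager yourself (plain pickle, nothing corextopic-specific; note this skips whatever extra work ensure_compatibility does in the library's own save):

import pickle

path = "path/to/corex_model.pkl"

# Save: the file is closed when the with-block exits, so no ResourceWarning.
with open(path, "wb") as f:
    pickle.dump(corex_model, f)

# Load:
with open(path, "rb") as f:
    loaded_model = pickle.load(f)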

Can't extract topics from simple example, unless I add a constant row

Thank you for this package. I'm keen to explore it further.

I created a simple array to test whether I could extract two topics (unsupervised) where it's clear to the human eye what the two topics should be. My code is below, with 10 documents and 4 words. I find that 9 times out of 10, Corex yields a single topic with all four words, and throws an error because the second topic has 0 words. This is the same if I duplicate the data, or add additional words.

However, if I add a [1, 1, 1, 1] row to the array, interpreted as a document containing every word, things stabilise. Corex will correctly extract the two topics ['tiger', 'bear'] and ['carrot', 'tomato'] every time.

Do you know what might be going on here? From what I can tell, the code initialises with word frequencies, so possibly it struggles with frequencies of 0 in small data sets?

import numpy as np
from corextopic import corextopic as ce

simple_data = np.array(
  [[1, 1, 0, 0],
   [1, 0, 0, 0],
   [0, 1, 0, 0],
   [1, 1, 0, 0],
   [1, 1, 0, 0],
   [0, 0, 1, 1],
   [0, 0, 1, 0],
   [0, 0, 0, 1],
   [0, 0, 1, 1],
   [0, 0, 1, 1]]
)

simple_corex_model = ce.Corex(n_hidden = 2)
simple_corex_model.fit(
  X = simple_data,
  words = ["bear", "tiger", "carrot", "tomato"],
  docs = ["animal", "animal", "animal", "animal", "animal",
          "food", "food", "food", "food", "food"]
)

topics = simple_corex_model.get_topics()
for topic_n, topic in enumerate(topics):
  words,mis = zip(*topic)
  topic_str = str(topic_n+1)+': '+','.join(words)
  print(topic_str)

Can't visualize use vis_hierarchy()

I tried to run the example code and vis_hierarchy().
I downloaded force.html according to issue 19.
But when I open force.html, the page is completely blank, so I inspected the elements.
In the console it says:
Uncaught TypeError: Cannot read property 'push' of undefined at t (d3.v2.min.js?2.9.3:3) at e (d3.v2.min.js?2.9.3:3) at Object.n.start (d3.v2.min.js?2.9.3:3) at force.html:34 at d3.v2.min.js?2.9.3:2 at r (d3.v2.min.js?2.9.3:2) at XMLHttpRequest.r.onreadystatechange (d3.v2.min.js?2.9.3:2)

It seems that the src used in force.html has some problem.
How can I solve this?
Thanks a lot

pip install

Should setup proper files so that this can easily be installed via pip.

[Question] Anchoring multiple times

In the example from the readme file, there are 3 different anchoring strategies. I'm interested in 2 of them: anchoring single sets of words to multiple topics, and anchoring different sets of words to multiple topics. I'm wondering if I should combine two (or more) of the strategies together to get a better result. For example, using the example from the ReadMe file:

Anchor the specific list of words for every individual document

topic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado']], anchor_strength=2)

Anchor general words throughout all of the documents

topic_model.fit(X, words=words, anchors=['protest', 'protest', 'protest', 'riot', 'riot', 'riot'], anchor_strength=2)

Will fitting the model with two different anchor words lists improve the result in general (or change anything at all), or will it decrease the quality of the result?

Also, does repeating a word in the anchors list change how the model views that word (increase its strength)? In the second code snippet, the words 'protest' and 'riot' are each repeated three times.

Dimension mismatch error

Hi, I have trained the corextopic model and am trying to predict the topic for a text document. My code for prediction is given below:

with open('textdata.txt', 'r') as file:
    text_data = file.read().replace('\n', ' ')
doc_word1 = vectorizer.transform([text_data])
doc_word1 = ss.csr_matrix(doc_word1)

anchored_topic_model.predict(doc_word1)

and the snippet from the error log is

--> 517                 raise ValueError('dimension mismatch')
    518 
    519             result = self._mul_multivector(np.asarray(other))

ValueError: dimension mismatch

Can somebody help with this? Thanks.
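(The usual cause of this particular ValueError is a column mismatch: the new document must be vectorized with the exact vectorizer fitted on the training corpus, so the new matrix has the same number of columns the model was trained on. A hedged checking sketch, with doc_word standing in for the training doc-term matrix:)

# The two column counts must be equal; if not, re-vectorize the new text with the
# *fitted* training vectorizer (vectorizer.transform, never a fresh fit_transform).
print(doc_word.shape[1], doc_word1.shape[1])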

Does corex have a predict function?

Hello,

I have trained a CorEx model on a set of documents, but I now have new documents I want to infer topics for using the prior model. Is there a way to do this using CorEx?

CorEx pickle.load() breaks with Unicode words

If 'words' is passed to the CorEx object upon training, and 'words' contains Unicode characters, then you cannot load the CorEx object using pickle.load() after saving it using pickle.dump().

If you are planning on saving and later using the topic model object, then you can either not load words into it and just make the topics yourself using the indices, or you can save the parts of the CorEx topic model that you want for later. Both of these workarounds are a hassle.

Not sure how to fix this.

Allow anchoring parameter to be set more flexibly

Currently, the anchoring parameter must be the same across all words that are anchored. Since the theory allows it, we should let a user pass a list to the anchoring parameter (if they want), where the list can consist of integers (anchor all words in a topic with the same strength), lists (anchor each word in a particular topic with its own strength), or a mix of both (the same strength for all words in some topics, a per-word strength in others). A user should still be allowed to just pass a single integer to the anchoring parameter if they do not want to specify each topic.

ex.

anchors = [['dog', 'cat'], 'apple']
anchor_strengths = [[2, 3], 4]
topic_model.fit(X, words=words, anchors=anchors, anchor_strength=anchor_strengths)

This would anchor "dog" to Topic 1 with anchor_strength=2, "cat" to Topic 1 with anchor_strength=3, and "apple" to Topic 2 with anchor_strength=4.

Opening this as an issue because I keep forgetting to get around to it.

Incremental Modeling

Hi guys, thank you so much for developing and sharing the CorEx model. I've been working on an NLP project and have found the anchored model super helpful. I'm wondering whether it is possible to do batch processing or incremental modeling with CorEx? For example, if I have already built a model but a new batch of documents comes in with new vocabulary, is it possible to update the original model with the new data?

Thank you!

PyPi Package Version

Would it be possible to update the version of Corex on PyPi to match the state of this repo? We would like to use Corex in a research project, but the currently available version we get from pip install (1.0.4) does not support the newest version of SciPy (meaning we have to include the source code manually).

Not getting enough topics

I tried running corex_topic with a training matrix of size approximately 100,000 x 10,000. I ran CorEx with n_hidden=1000, max_iter=1000, but only about 200 of the topics were non-empty. This could be a symptom of my data, of course (and perhaps there ARE only 200 topics), but are there other parameters that could be tuned to generate more? Thanks.

get_topics() returns words that appear to be out of order in terms of MI

Hi (and thank you for this really cool and interesting work)

I've got a situation where topic words appear to be out of order, except for the first topic.

For example, with 2 anchored topics, the first topic words returned are listed with MI sorted in decreasing order, as expected.

However, for the second topic, MI decreases, but then increases again.

...looking at get_topics() it's not clear how this could happen -- the code looks right, and I'm not aware of any strange issues with np.argsort().

Any ideas what I should check next? Is this expected behavior in certain instances?

Index error when using anchors

When using the fit method with anchors I get an index error from this line:

p_y_given_x[:, j] = 0.5 * p_y_given_x[:, j] + 0.5 * X[:, a].mean(axis=1).A1 # Assumes X is a binary matrix

The error is understandable because if X is a 2-D array, then X[:, i] is a 1-D slice and therefore X[:, i].mean(axis=1) is undefined because there is no axis 1.

I've installed version corextopic==1.0.5 from pypi.

I can reproduce this for any arguments passed to anchors
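(One workaround consistent with the "Assumes X is a binary matrix" comment on that line: pass a scipy sparse matrix rather than a dense ndarray, so that slicing stays 2-D and .mean(axis=1) returns a np.matrix with an .A1 attribute. A minimal hedged sketch, with words and anchors as placeholders:)

import scipy.sparse as ss

# Dense ndarray slices lose a dimension and lack .A1; a CSR matrix keeps 2-D slices.
X_sparse = ss.csr_matrix(X)
topic_model.fit(X_sparse, words=words, anchors=anchors, anchor_strength=2)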

error in vis_hierarchy

When I run
vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=words, max_edges=200, prefix='topic-model-example')

I get

NameError                                 Traceback (most recent call last)
<ipython-input-99-9a4b427527bd> in <module>
      1 vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3],
----> 2                  column_label=words, max_edges=200, prefix='topic-model-example')

~/env/py36/lib/python3.6/site-packages/corextopic/vis_topic.py in vis_hierarchy(corexes, column_label, max_edges, prefix, n_anchors)
     58         inds = np.where(alpha[j] >= 1.)[0]
     59         inds = inds[np.argsort(-alpha[j, inds] * mis[j, inds])]
---> 60         group_number = u"red_" + unicode(j) if j < n_anchors else unicode(j)
     61         label = group_number + u':' + u' '.join([annotate(column_label[ind], corexes[0].sign[j,ind]) for ind in inds[:6]])
     62         label = textwrap.fill(label, width=25)

NameError: name 'unicode' is not defined
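This is a Python 2 leftover: the builtin unicode no longer exists on Python 3, where str is already unicode. A minimal local patch to the line shown in the traceback looks like:

# vis_topic.py, vis_hierarchy(): replace the Python 2-only unicode() calls with str()
group_number = u"red_" + str(j) if j < n_anchors else str(j)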

model_ct.predict_proba() explanation

Hi Greg,

I was trying corextopic for supervised topic modeling (more precisely, classification) and was using model.predict_proba(<clean_vectorized_data>). This gives me output similar to (array([[0.999, 0.0022]]), array([[0.198, -0.205]])). Could you please explain what these values are? That would be a great help.

Thanks in advance.

Providing exclusion words for topics

Hi, first of all I wanted to say thank you for the great work! I very recently came across CorEx and found it much better than traditional topic modelling solutions like LDA, especially the anchor words, which make it possible to produce more meaningful and reasonable topics based on domain knowledge. Now I'm thinking of the opposite: say that, based on domain knowledge, I know certain words shouldn't belong to a certain topic. It would be good to be able to provide some so-called "exclusion words" on top of the anchor words for that topic. Do you think this could be a feature added to the model? Thank you!

error saving model

I fitted my model

topic_model = ct.Corex(n_hidden=50, words=words, max_iter=300, verbose=1, seed=1)
topic_model.fit(doc_word, words=words)

Afterwards I want to save it: topic_model.save(filename='./model/corex.dat')
and I get the error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<timed eval> in <module>

~/env/py36/lib/python3.6/site-packages/corextopic/corextopic.py in save(self, filename)
    490         if path.dirname(filename) and not path.exists(path.dirname(filename)):
    491             makedirs(path.dirname(filename))
--> 492         pickle.dump(self, open(filename, 'wb'), protocol=-1)
    493         # Restore words to CorEx object
    494         self.words = temp_words

OSError: [Errno 22] Invalid argument

How can I fix this problem?

Anchoring error: 'numpy.ndarray' object has no attribute 'A1'

topic_model = ct.Corex(n_hidden=33, max_iter=200, verbose=False, seed=2)
topic_model.fit(doc_word, words=features_list, anchors=[['pope', 'king']], anchor_strength=3)

Output:
*** AttributeError: 'numpy.ndarray' object has no attribute 'A1'

The very same model and data work fine without anchoring.

Tradeoffs with using fractional counts

Hi @ryanjgallagher @gregversteeg,
I understand that the model is binary in nature and that you have also provided a feature to use fractional counts, which you mention is experimental. A use case I am tackling might require fractional counts. What, in your view, would be the potential risks of using fractional counts? Also, is there any specific reason why you initially proceeded with binarization rather than the fractional-count approach? I'm just asking to understand the potential risks involved with using the model in this different fashion.
Thanks in advance
