adjidieng / etm

Topic Modeling in Embedding Spaces

License: MIT License

Python 100.00%

etm's Issues

Preprocessing

Could you please upload a preprocessing script that creates all files from data/./? The files are:
bow_tr_counts.mat
bow_tr_tokens.mat
bow_ts_counts.mat
bow_ts_h1_counts.mat
bow_ts_h1_tokens.mat
bow_ts_h2_counts.mat
bow_ts_h2_tokens.mat
bow_ts_tokens.mat
bow_va_counts.mat
bow_va_tokens.mat
vocab.pkl

Thank you in advance!

recon_loss

Hi, I cannot understand the expression "recon_loss = -(preds * bows).sum(1)" in the forward() function of etm.py. Could you explain it? The loss function seems to be different from the equation defined in the paper. Thanks!
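
One possible reading, sketched in plain Python rather than PyTorch: if preds holds the log of the per-word probabilities for each document and bows holds the raw word counts (both are assumptions about the repository code, not confirmed on this page), then the expression is the negative log-likelihood of each bag-of-words document, up to a constant that does not depend on the model. The vocabulary and counts below are hypothetical toy values.

```python
import math

def recon_loss(log_word_probs, bow_counts):
    """Negative log-likelihood of one bag-of-words document.

    log_word_probs[v] is log p(word v | document); bow_counts[v] is how
    often word v occurs. This mirrors -(preds * bows).sum(1) for one row.
    """
    return -sum(lp * c for lp, c in zip(log_word_probs, bow_counts))

# Hypothetical 3-word vocabulary and one document.
probs = [0.5, 0.25, 0.25]
counts = [2, 1, 0]
loss = recon_loss([math.log(p) for p in probs], counts)

# The same number computed directly from the product of word probabilities.
direct = -math.log(0.5 ** 2 * 0.25 ** 1)
```

Multiplying log-probabilities by counts and summing is the same as taking the log of the product of the probabilities of every token in the document, which is why the two values agree.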

Model

Could you tell me the meaning of the parameters in this formula?
kl_theta = -0.5 * torch.sum(1 + logsigma_theta - mu_theta.pow(2) - logsigma_theta.exp(), dim=-1).mean()
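
The formula has the shape of the closed-form KL divergence between a diagonal Gaussian and a standard normal prior. The sketch below assumes that mu_theta is the mean and that logsigma_theta stores the log of the variance (the variable name suggests otherwise, so treat this as an assumption); the function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one document."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar))

def batch_kl(mus, logvars):
    """Mean of the per-document KL terms, like .mean() over the batch."""
    kls = [kl_to_standard_normal(m, lv) for m, lv in zip(mus, logvars)]
    return sum(kls) / len(kls)

kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])     # q equals the prior
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])  # one mean moved by 1
kl_batch = batch_kl([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
```

The sum over dim=-1 runs over the topic dimensions of one document, and the final mean averages over the documents in the batch.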

How to obtain document-topic proportions (the thetas) for each document

Hi,

How do I obtain the document-topic proportions for each document in the corpus?

Thank you
Luke
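
A minimal sketch of the two steps that usually lead to theta, written in plain Python: normalize the bag-of-words counts of a document, then turn the encoder's output scores into proportions with a softmax. The encoder itself is not reproduced here; the mu_theta values are a stand-in, and both function names are hypothetical.

```python
import math

def normalize_bow(counts):
    """Divide word counts by the document length so the entries sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty document: nothing to normalize")
    return [c / total for c in counts]

def softmax(scores):
    """Map unnormalized topic scores to proportions that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical document over a 4-word vocabulary.
normalized = normalize_bow([3, 0, 1, 0])

# Stand-in for the encoder output for this document, with K = 3 topics.
mu_theta = [2.0, 0.5, -1.0]
theta = softmax(mu_theta)
```

With the trained model, another issue on this page reports calling get_theta(normalized_data_batch) to obtain the topic distribution for a batch of documents.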

evaluate

Why is this code never used?
https://github.com/adjidieng/ETM/blob/master/main.py#L245:L251

Although you seem to want to use it in https://github.com/adjidieng/ETM/blob/master/main.py#L333 and https://github.com/adjidieng/ETM/blob/master/main.py#L324, the function evaluate never makes use of the counts and tokens selected in https://github.com/adjidieng/ETM/blob/master/main.py#L245:L251

How can I get the topic distribution of a document?

I would be grateful if anybody could tell me how to do that ^_^

Example Code for Use

I'm having some trouble figuring out the appropriate input and output for the model after it is created. Is there any example you can provide for the use and what I can expect to have returned? As I understand it, it should return the predicted topics, with the embedding of the document being passed, correct?

Test data partition in evaluation()

Hi, absolutely love the repo and the paper. Congratulations on developing this.

Would just like to ask about the evaluation function in main.py

Line 275: preds is derived from normalized_data_batch_1, but data_batch_2 is a different set of documents altogether.

recon_loss = -(preds * data_batch_2).sum(1)

What does recon_loss mean if it matches the reconstruction of one set of documents against a different set of documents?
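
One plausible reading is a document completion setup: the data files listed in another issue on this page are named bow_ts_h1_* and bow_ts_h2_*, which suggests that the two batches are the first and second halves of the same test documents. Under that assumption, theta is inferred from the first half and the second half is scored with it. The sketch below is plain Python with hypothetical names and toy values, not the repository's code.

```python
import math

def completion_perplexity(thetas, beta, second_halves):
    """Perplexity of held-out second halves given topic proportions.

    thetas[d][k]: proportions inferred from the first half of document d.
    beta[k][v]: probability of word v under topic k.
    second_halves[d][v]: count of word v in the second half of document d
    (each second half is assumed to contain at least one word).
    """
    per_doc = []
    for theta, counts in zip(thetas, second_halves):
        n_words = sum(counts)
        nll = 0.0
        for v, c in enumerate(counts):
            if c == 0:
                continue
            p_v = sum(theta[k] * beta[k][v] for k in range(len(beta)))
            nll -= c * math.log(p_v)
        per_doc.append(nll / n_words)  # per-word negative log-likelihood
    return math.exp(sum(per_doc) / len(per_doc))

beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
thetas = [[1.0, 0.0]]           # first half looked like pure topic 0
second_halves = [[1, 1, 0, 0]]  # second half uses the two topic-0 words
ppl = completion_perplexity(thetas, beta, second_halves)
```

Scoring words the model has not seen, but from the same document, measures how well the inferred proportions generalize within a document.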

FileNotFoundError: [Errno 2] No such file or directory: 'data/20ng_embeddings.txt'

thanks a lot

Validation set loss is being calculated on the Test set.

Unfortunately the validation loss used to train the model is currently being calculated on the test set, which means that the test set perplexity performance metric is not a reliable indicator of out-of-sample generalisation (cf. main.py lines 256-282).

The original intention to calculate the validation loss on the validation set is clear from main.py lines 244-251, however the variables defined there are not used subsequently in the "evaluate" function.

Topic coherence calculation

Hi, I've read your code and have some questions about the functions "get_document_frequency" and "get_topic_coherence" in your utils.py.

def get_document_frequency(data, wi, wj=None):
    if wj is None:
        D_wi = 0
        for l in range(len(data)):
            doc = data[l].squeeze(0)
            if len(doc) == 1: 
                continue
            else:
                doc = doc.squeeze()
            if wi in doc:
                D_wi += 1
        return D_wi
    D_wj = 0
    D_wi_wj = 0
    for l in range(len(data)):
        doc = data[l].squeeze(0)
        if len(doc) == 1: 
            doc = [doc.squeeze()]
        else:
            doc = doc.squeeze()
        if wj in doc:
            D_wj += 1
            if wi in doc:
                D_wi_wj += 1
    return D_wj, D_wi_wj

def get_topic_coherence(beta, data, vocab):
    D = len(data) ## number of docs...data is list of documents
    print('D: ', D)
    TC = []
    num_topics = len(beta)
    for k in range(num_topics):
        print('k: {}/{}'.format(k, num_topics))
        top_10 = list(beta[k].argsort()[-11:][::-1])
        top_words = [vocab[a] for a in top_10]
        TC_k = 0
        counter = 0
        for i, word in enumerate(top_10):
            # get D(w_i)
            D_wi = get_document_frequency(data, word)
            j = i + 1
            tmp = 0
            while j < len(top_10) and j > i:
                # get D(w_j) and D(w_i, w_j)
                D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])
                # get f(w_i, w_j)
                if D_wi_wj == 0:
                    f_wi_wj = -1
                else:
                    f_wi_wj = -1 + ( np.log(D_wi) + np.log(D_wj)  - 2.0 * np.log(D) ) / ( np.log(D_wi_wj) - np.log(D) )
                # update tmp: 
                tmp += f_wi_wj
                j += 1
                counter += 1
            # update TC_k
            TC_k += tmp 
        TC.append(TC_k)
    print('counter: ', counter)
    print('num topics: ', len(TC))
    TC = np.mean(TC) / counter
    print('Topic coherence is: {}'.format(TC))

In your code, you calculate "D_wj" using "D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])". It seems more reasonable to use "D_wj = get_document_frequency(data, top_10[j])", as you do for "D_wi", because "D_wi" and "D_wj" should be calculated in the same way. In that case, when the condition "len(doc) == 1" is true, we skip to the next iteration, as in your code:

for l in range(len(data)):
            doc = data[l].squeeze(0)
            if len(doc) == 1: 
                continue

However, with "D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])", the calculation of "D_wj" goes through the second half of the function, where we encounter:

if len(doc) == 1: 
            doc = [doc.squeeze()]

A one-word document is therefore not skipped and reaches this part:

if wj in doc:
            D_wj += 1

I don't think this is a proper method to deal with the calculation of D_wj. Therefore, I suspect this calculation has some problems.
Thanks!
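
A sketch of one way to remove the inconsistency raised here: compute D(w_i), D(w_j) and D(w_i, w_j) in a single pass with the same rule for every document. It assumes each document is a plain iterable of word ids (the repository stores them as arrays) and it counts one-word documents rather than skipping them, which is a design choice, not the authors' stated method.

```python
def document_frequencies(docs, wi, wj):
    """Return (D_wi, D_wj, D_wi_wj) using one rule for all documents."""
    d_wi = d_wj = d_wi_wj = 0
    for doc in docs:
        words = set(doc)  # works for documents of any length, including 1
        has_i = wi in words
        has_j = wj in words
        d_wi += has_i
        d_wj += has_j
        d_wi_wj += has_i and has_j
    return d_wi, d_wj, d_wi_wj

# Four toy documents over word ids 0..3; the second has a single word.
docs = [[0, 1, 2], [1], [0, 1], [2, 3]]
freqs = document_frequencies(docs, 0, 1)
```

Whichever treatment of one-word documents is chosen, applying it to both single-word counts keeps D_wi and D_wj comparable.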

read embedding matrix when not using trained embeddings

Hi all.

I have an issue understanding the read_embedding_matrix used in main.py.

The model_path here is specific to the local path organization from the authors but it is not clear how this file should be produced.

I would be thankful for some support on this.

Run ETM on my own dataset

Hello,
Is it possible to know what this data looks like: "raw/new_york_times_text/nyt_docs.txt"?
I am trying to fit my own dataset but don't know what format I should convert it to ...
I hope somebody can help!

Thanks!

Is it true that a lot of repeated topics appear?

Hi,

Thanks for your interesting paper and this repository!

I tried training ETM on both 20ng and my own dataset with num_topics = 50.

Among the 50 topics I found some repeated topics, like ['writes', 'article', 'good', 'people', 'make', 'read', 'thing', 'time', 'lot'] (repeated 4 times) and ['time', 'good', 'problem', 'work', 'back', 'problems', 'ago', 'thing', 'couple'] (repeated 2 times).

Does anyone observe the same phenomenon?
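
One way to put a number on this kind of repetition is a topic diversity score: the fraction of distinct words among the top words of all topics. Another issue on this page mentions a diversity measure in utils.py; the sketch below is an independent, simplified version with hypothetical names, not that function.

```python
def topic_diversity(top_words_per_topic):
    """Fraction of distinct words among all topics' top words (1.0 = no overlap)."""
    all_words = [w for topic in top_words_per_topic for w in topic]
    if not all_words:
        raise ValueError("no topics given")
    return len(set(all_words)) / len(all_words)

topics = [
    ["writes", "article", "good", "people"],
    ["writes", "article", "good", "people"],  # exact duplicate topic
    ["time", "problem", "work", "back"],
]
diversity = topic_diversity(topics)
```

A score close to 1 means the topics share few top words; duplicated topics pull it down, so comparing the score across runs with different num_topics can show whether fewer topics reduce the repetition.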

Any suggestion on multiple topics for one document?

Thank you for your work on the ETM model. I applied ETM to my documents, and it gave more clear-cut topics than LDA did.

The original LDA can assign multiple topics to a single document. In the paper, you use a softmax for theta. The softmax tends to assign one topic to each document. I am wondering if you can give me some suggestions on how I can use ETM to get multiple topics for a single document.
I am using get_theta(normalized_data_batch) to get the topic distribution.

https://github.com/WalterKung/DataConference2020/blob/master/P2_TOPIC_MODEL/SS_TOPIC_MODEL_Stock_by_news.ipynb
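
Since theta is a distribution over all topics, one simple option is to keep every topic whose weight passes a threshold instead of taking only the largest one. This is a sketch with a hypothetical function name and an arbitrary threshold, not a recommendation from the authors.

```python
def topics_above(theta, threshold=0.1):
    """Return (topic_index, weight) pairs with weight >= threshold, largest first."""
    picked = [(k, w) for k, w in enumerate(theta) if w >= threshold]
    return sorted(picked, key=lambda kw: kw[1], reverse=True)

# Hypothetical topic proportions for one document.
theta = [0.05, 0.55, 0.30, 0.10]
selected = topics_above(theta, threshold=0.1)
```

If theta is almost one-hot for most documents, no threshold will recover several topics, and the sharpness itself is the thing to investigate.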

multi GPU capabilities

Hi, really interesting project.
Are there plans to make it multi-GPU?

getting nan as loss for Short reviews

Is it applicable to short reviews? What is the minimum number of words, excluding stopwords, that a review must contain for the model to run?

I am getting nan as the loss because the output tensor from q_theta is full of nan.
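
One possible cause, offered as an assumption rather than a diagnosis: a review that is empty after stopword removal has a bag-of-words vector that sums to zero, and normalizing by the document length then divides zero by zero. The sketch below filters such documents before training; the function names and the min_tokens parameter are hypothetical.

```python
def drop_short_docs(docs, min_tokens=1):
    """Keep documents with at least min_tokens in-vocabulary tokens."""
    return [doc for doc in docs if len(doc) >= min_tokens]

def normalize_counts(counts):
    """Normalize a count vector; an all-zero vector has no valid normalization."""
    total = sum(counts)
    if total == 0:
        return None  # in tensor code, dividing by this zero would yield nan
    return [c / total for c in counts]

# Each document is a list of word ids; the second one is empty.
docs = [[4, 7], [], [2], [1, 1, 3]]
kept = drop_short_docs(docs, min_tokens=2)
empty_result = normalize_counts([0, 0, 0])
```

Checking the preprocessed data for zero-length documents is a cheap first step before looking at learning rates or other training settings.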

Performs badly on classification task

I applied ETM to a dataset of YouTube titles, where each title belongs to one of 10 classes. There are 10000 titles for each class, and the average title length is 10. I set num_topics = 200.

After training for 1000 epochs, I used theta (the topic distribution for each title, a 200*1 vector) from ETM as the input to an SVM for classification. But the result is bad: the F1 score is 0.62.

However, simply using the count vector as input (a 43020*1 vector) gives an F1 score of around 0.75.
Does anyone know a potential reason for this? Thanks!

Topic Coherence Computation: Division by 45?

Why do they divide by 45 for topic coherence based on normalised PMI? The paper mentions it, but the computation in the code looks different to me.
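
A likely explanation is that 45 is the number of unordered pairs among 10 top words, so dividing by it averages the pairwise scores. The short check below only verifies that count; it does not reproduce the repository's coherence code.

```python
from itertools import combinations
from math import comb

top_words = list(range(10))               # stand-ins for the 10 top word ids
pairs = list(combinations(top_words, 2))  # every unordered pair (i, j) with i < j
num_pairs = len(pairs)                    # same value as comb(10, 2)
pairs_for_11 = comb(11, 2)                # pair count if 11 words were used
```

The get_topic_coherence snippet quoted in another issue on this page slices argsort()[-11:], which selects 11 indices, so the pair counter there would reach 55 rather than 45. That may be the difference noticed here.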

Most important topics interpretation

Maybe this question is dumb, but I don't understand why the average of the weighted document-topic proportions is a metric for the most important topics.

thetaWeightedAvg = sums * theta
thetaWeightedAvg = thetaWeightedAvg  /  num_docs
print('\nThe 10 most used topics are {}'.format(thetaWeightedAvg.argsort()[::-1][:10]))

From my understanding, the product of each document frequency (sums) with the document-topic probabilities theta amplifies or reduces each probability. The average then provides some insight into which topics are important in the whole corpus. Is that right? Also, what would be the difference if we only averaged the document-topic proportions (no weighting)?
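
A small sketch of the difference between the two averages, assuming sums holds the total word count of each document (an assumption about the repository code). With weighting, long documents contribute more; without it, every document counts equally. Names and values are hypothetical.

```python
def weighted_topic_usage(thetas, doc_lengths):
    """Average of theta weighted by document length, divided by the number of docs."""
    num_docs = len(thetas)
    num_topics = len(thetas[0])
    return [
        sum(n * theta[k] for theta, n in zip(thetas, doc_lengths)) / num_docs
        for k in range(num_topics)
    ]

def plain_topic_usage(thetas):
    """Unweighted average of theta over documents."""
    return weighted_topic_usage(thetas, [1] * len(thetas))

thetas = [[0.9, 0.1], [0.2, 0.8], [0.2, 0.8]]
doc_lengths = [1000, 10, 10]  # one long document, two short ones
weighted = weighted_topic_usage(thetas, doc_lengths)
plain = plain_topic_usage(thetas)
```

In this toy corpus the weighted average ranks topic 0 first because it dominates the one long document, while the plain average ranks topic 1 first because it dominates most documents. The weighted version is closer to "share of words in the corpus", the plain one to "share of documents".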

Hard coded queries for nearest neighbor

Hello,

Thanks for an interesting paper and for sharing the code.

I've been trying this method on some non-English datasets and a small stumbling block is the hard coded queries for nearest neighbor in main.py (lines 209 and 374). This might be a problem for some English datasets as well.

I've fixed this in my own experiments by querying ten random words from the vocab, but I thought I'd flag it here in case you want to address this. I'd be happy to submit a PR myself if that's alright.

Negative coherence on short texts

Hi, I saw that one can use DETM on short texts. I tried ETM on short texts (each text contains only one sentence) and it seemed to work. However, the coherence score became negative. How should I interpret it? Does lower coherence always mean worse? Or do scores closer to 0 mean worse?
Whenever I try ETM on normal-length texts (consisting of more than one sentence), the coherence is always positive, so I assume that negative coherence is caused by short length
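
The per-pair formula in the get_topic_coherence snippet quoted in another issue on this page is algebraically the normalized pointwise mutual information (NPMI), which ranges from -1 to 1. Assuming the reported coherence is an average of such values, higher is better, and a negative score means the top words co-occur less often than they would if they were independent. The sketch below uses hypothetical document frequencies.

```python
import math

def npmi(d_wi, d_wj, d_wi_wj, num_docs):
    """Normalized pointwise mutual information from document frequencies."""
    if d_wi_wj == 0:
        return -1.0  # the two words never co-occur
    if d_wi_wj == num_docs:
        return 1.0   # both words are in every document; the ratio below would be 0/0
    p_i, p_j, p_ij = d_wi / num_docs, d_wj / num_docs, d_wi_wj / num_docs
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

always_together = npmi(10, 10, 10, 100)  # close to 1
independent = npmi(10, 10, 1, 100)       # close to 0
rarely_together = npmi(50, 50, 5, 100)   # below 0
never_together = npmi(50, 50, 0, 100)    # exactly -1
```

In one-sentence texts, two top words rarely land in the same document, so many pairs fall into the last two cases and the average drops below zero.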

Add predictive measure to utils.py

Hi,

I've read your great paper. I noticed that you measure log-likelihood on the test set as a predictive metric. But in utils.py, I found only the topic coherence and diversity measures.

Would you consider adding that predictive measure to utils.py?

missing dependency in data_nyt.py

It seems that import os is missing.

Running the code

Hello, while trying to reproduce your code, I get file-not-found errors.

These are the errors: No such file or directory: 'raw/new_york_times_text/nyt_docs.txt'
No such file or directory: 'data/20ng_embeddings.txt'
I hope to get your help! Many thanks.

rising KL_theta values

How to get the topic vector?

I can't seem to find a method in the code that outputs the topic vector. Do I need to modify the code to get it? Does anyone know how to achieve this?

Training model on own dataset

Is there a way I can train the model on my own dataset? I am getting nan for the values returned by recon_loss, kld_theta = model(data_batch, normalized_data_batch) in main.py

Confused about the data loader function

Hi, thanks for your wonderful work. I am confused about the data loader function. Details below:

parser.add_argument('--data_path', type=str, default='data/20ng', help='directory containing data')

I can't find any code that refers to the '--data_path' parameter, so why do we need to pass it in the following command?

python main.py --mode train --dataset 20ng --data_path data/20ng --num_topics 50 --train_embeddings 1 --epochs 1000

What do the two parameters doc_terms_file_name and terms_filename do? I don't understand them, and I can't find 'tf_idf_doc_terms_matrix_time_window_1' anywhere (for example, in the provided dataset directory).

vocab, training_set, valid, test_1, test_2 = data.get_data(doc_terms_file_name="tf_idf_doc_terms_matrix_time_window_1",
                                                           terms_filename="tf_idf_terms_time_window_1")

embedding

Hi,

I tried to run ETM on my own dataset. The embedding part is quite hard to understand. I managed to use skipgram.py to train a word2vec model and get embeddings.txt, but failed to use it through the read_embedding_matrix function. Could anyone help me fix this? It is quite challenging to run this code on a new dataset... Thanks!!!

args.clip

if args.clip > 0:
    torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)

Hi everybody!

I just want to know what role the above block plays. Why is the default for args.clip 0?
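
The block applies gradient norm clipping: torch.nn.utils.clip_grad_norm_ rescales the gradients so that their combined norm does not exceed the given maximum. Because of the guard args.clip > 0, a default of 0 means clipping is switched off unless a positive value is passed. The sketch below shows the idea in plain Python on a flat list of gradient values; it is a simplified illustration with a hypothetical function name, not PyTorch's implementation.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # already small enough, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Gradient with norm 5 clipped down to norm 1; the direction is preserved.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Clipping limits the size of a single update step, which can help when occasional very large gradients make the loss diverge.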

document completion evaluation results in a nan

Hello,

Thank you for your work. I am running into an issue with the code when trying it with different datasets.
With some datasets, the document completion evaluation during training produces a nan when using the evaluation docs, and I am not sure I understand what that means. Does it mean that the model is not good at predicting the second half, or is something wrong with the learning process?

memory requirement

Hi,

I was trying to get this running on my own dataset of 40K documents and a vocabulary of 76K.
But the computer "says no": it needs 100 GB of RAM...
Am I doing something wrong here, or is this really that memory-greedy?

It ran fine with about 1K docs and a 5K vocabulary. For 10K docs, it required about 36 GB of RAM. Are these normal numbers?

br,
Pieter
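
A back-of-the-envelope check, assuming that some step materializes a dense document-term matrix (this page does not confirm where the memory actually goes): the size of such a matrix is documents times vocabulary times bytes per value. The helper name is hypothetical.

```python
def dense_matrix_gb(num_docs, vocab_size, bytes_per_value):
    """Memory in GB (10**9 bytes) for a dense num_docs x vocab_size matrix."""
    return num_docs * vocab_size * bytes_per_value / 1e9

float32_gb = dense_matrix_gb(40_000, 76_000, 4)  # 4-byte values
float64_gb = dense_matrix_gb(40_000, 76_000, 8)  # 8-byte values
```

A single dense copy of a 40K x 76K matrix already takes roughly 12 GB at 4 bytes per value and 24 GB at 8 bytes, so a few temporary copies would be enough to reach the reported order of magnitude. Keeping the data sparse and pruning rare words from the vocabulary are the usual ways to reduce this.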

a bug in test dataset splitting

I noticed that there is a bug in the preprocessing code for 20ng (scripts/data_20ng.py).

ETM/scripts/data_20ng.py

Line 88 in 52b090b

    docs_ts = [[word2id[w] for w in init_docs[idx_d+num_docs_tr].split() if w in word2id] for idx_d in range(tsSize)]

The idx_permute index conversion is missing.

How to modify the code for a number of topics other than 50?

What changes can we make to the code to change the number of topics from 50 to any other number?
For my data I need only 5-6 topics.

I changed the number of topics in the first command, which creates the embedding vectors, and that worked, but I am facing the following issue.

dataset

Hi,

I tried to run ETM on my own dataset. I managed to use data_nyt.py to generate a number of .mat files but failed to use them in main.py. The doc_term_file_name and terms_filename parameters are hard to understand. Could anyone help me fix this? Thanks!