
etm's People

Contributors

adjidieng, ahoho, espoirmur, haseeb33


etm's Issues

Hard coded queries for nearest neighbor

Hello,

Thanks for an interesting paper and for sharing the code.

I've been trying this method on some non-English datasets, and a small stumbling block is the hard-coded queries for nearest neighbors in main.py (lines 209 and 374). This might be a problem for some English datasets as well.

I've fixed this in my own experiments by querying ten random words from the vocab, but I thought I'd flag it here in case you want to address this. I'd be happy to submit a PR myself if that's alright.
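
A minimal sketch of the workaround described above (the nearest_neighbors helper and its signature are assumptions based on utils.py; adjust to whatever main.py actually calls at those lines):

import random

# Replace the hard-coded English query list with ten random vocabulary words.
queries = random.sample(vocab, 10)
for word in queries:
    print('word: {} .. neighbors: {}'.format(word, nearest_neighbors(word, embeddings, vocab)))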

Training model on own dataset

Is there a way I can train the model on my own dataset? I am hitting NaN in my performance metrics at recon_loss, kld_theta = model(data_batch, normalized_data_batch) in main.py.

Running the code

Hello, while reproducing your code I get missing-file errors. These are the errors:
No such file or directory: 'raw/new_york_times_text/nyt_docs.txt'
No such file or directory: 'data/20ng_embeddings.txt'
I would really appreciate your help! Many thanks!

recon_loss

Hi, I cannot understand the expression "recon_loss = -(preds * bows).sum(1)" in the etm.py forward() function. Could you explain it? The loss function seems to be different from the equation defined in the paper. Thanks!
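
A sketch of how that line relates to the paper's reconstruction term (the shapes and the construction of preds are assumptions based on the forward pass in etm.py): preds holds the log of the per-document word distribution, and bows holds the raw word counts, so their product summed over the vocabulary is the document's multinomial log-likelihood.

import torch

# theta: [batch, num_topics]      document-topic proportions
# beta:  [num_topics, vocab_size] topic-word distributions
res = torch.mm(theta, beta)           # p(w | d), the mixture over words
preds = torch.log(res + 1e-6)         # log p(w | d), small constant for numerical stability
recon_loss = -(preds * bows).sum(1)   # -sum_w count(d, w) * log p(w | d):
                                      # the negative multinomial log-likelihood term of the ELBO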

Preprocessing

Could you please upload the preprocessing script that creates all the files under data/? The files are:
bow_tr_counts.mat
bow_tr_tokens.mat
bow_ts_counts.mat
bow_ts_h1_counts.mat
bow_ts_h1_tokens.mat
bow_ts_h2_counts.mat
bow_ts_h2_tokens.mat
bow_ts_tokens.mat
bow_va_counts.mat
bow_va_tokens.mat
vocab.pkl

Thank you in advance!
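
For anyone else hitting this, a minimal sketch of how files in that layout can be produced, modeled loosely on data_nyt.py (mentioned in another issue below). The key names 'tokens' and 'counts', and the variables vocab, train_docs, valid_docs, test_docs, test_docs_h1, test_docs_h2 are assumptions; check data.py for the exact format expected.

import pickle
from collections import Counter
import numpy as np
from scipy.io import savemat

def to_object_array(rows):
    # ragged list of lists -> 1-D numpy object array of arrays
    arr = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        arr[i] = np.array(row)
    return arr

def save_bow(prefix, docs):
    # docs: list of documents, each a list of vocabulary indices
    bows = [Counter(d) for d in docs]
    savemat(prefix + '_tokens.mat', {'tokens': to_object_array([list(b.keys()) for b in bows])})
    savemat(prefix + '_counts.mat', {'counts': to_object_array([list(b.values()) for b in bows])})

# vocab.pkl is the pickled list of vocabulary words
with open('data/my_corpus/vocab.pkl', 'wb') as f:
    pickle.dump(vocab, f)

save_bow('data/my_corpus/bow_tr', train_docs)        # training split
save_bow('data/my_corpus/bow_va', valid_docs)        # validation split
save_bow('data/my_corpus/bow_ts', test_docs)         # full test documents
save_bow('data/my_corpus/bow_ts_h1', test_docs_h1)   # first halves (document completion)
save_bow('data/my_corpus/bow_ts_h2', test_docs_h2)   # second halves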

Getting NaN as loss for short reviews

Is this applicable to short reviews? What is the minimum number of words a review must contain (excluding stopwords) for the model to run?

I am getting NaN as the loss, since the output tensor from q_theta is full of NaNs.

Example Code for Use

I'm having some trouble figuring out the appropriate input and output for the model after it is created. Is there any example you can provide of its use and of what I can expect to have returned? As I understand it, it should return the predicted topics when the embedding of the document is passed in, correct?
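
A rough usage sketch, under the assumption (based on etm.py and main.py) that the model consumes bag-of-words count vectors rather than document embeddings, that get_theta returns the document-topic proportions together with a KL term, and that get_beta returns the topic-word distributions; the exact names and return values may differ in your checkout:

import torch

model.eval()
with torch.no_grad():
    bows = data_batch.float()                            # [batch, vocab_size] raw word counts
    normalized_bows = bows / bows.sum(1, keepdim=True)   # row-normalized counts
    theta, _ = model.get_theta(normalized_bows)          # [batch, num_topics] topic proportions
    beta = model.get_beta()                              # [num_topics, vocab_size]
print(theta.argmax(1))  # strongest topic per document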

How to modify the code for a number of topics other than 50?

What changes do we need to make to the code to change the number of topics from 50 to any other number?
For my data I need only 5-6 topics.

I did change the number of topics in the first command that creates the embedding vectors, but I am facing the issue shown in the screenshot below.

(screenshot of the error not reproduced)
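
For reference, the topic count is a command-line flag to main.py (the same --num_topics flag used in the training command quoted in another issue below), so the training step itself should not require a code change; for example:

python main.py --mode train --dataset 20ng --data_path data/20ng --num_topics 6 --train_embeddings 1 --epochs 1000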

Model

Could you tell me the meaning of the parameters in this formula?
kl_theta = -0.5 * torch.sum(1 + logsigma_theta - mu_theta.pow(2) - logsigma_theta.exp(), dim=-1).mean()
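
For reference, that line is the standard closed-form KL divergence between the variational Gaussian posterior over the topic proportions, with mean mu_theta and log-variance logsigma_theta, and a standard normal prior, averaged over the batch:

\mathrm{KL}\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\right) = -\frac{1}{2} \sum_{k} \left(1 + \log\sigma_k^2 - \mu_k^2 - \sigma_k^2\right)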

Validation set loss is being calculated on the Test set.

Unfortunately the validation loss used to train the model is currently being calculated on the test set, which means that the test set perplexity performance metric is not a reliable indicator of out-of-sample generalisation (cf. main.py lines 256-282).

The original intention to calculate the validation loss on the validation set is clear from main.py lines 244-251; however, the variables defined there are not used subsequently in the "evaluate" function.
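
A sketch of the kind of change being suggested, using hypothetical variable names (the actual tensors defined at main.py lines 244-251 may be named differently): the early-stopping metric should be computed from the validation split rather than the test split.

# Hypothetical names; the point is only that the perplexity used for early
# stopping / model selection should come from the validation split, not the test split.
val_ppl = evaluate(model, 'val', valid_tokens, valid_counts)   # not test_tokens / test_counts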

args.clip

if args.clip > 0:
    # clamp the total gradient norm of all parameters to args.clip before the optimizer step
    torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)

hi everybody!

I just want to know what role the above block plays. And why is the args.clip default 0?

Confusion about the data loader function

Hi, thanks for your wonderful work. But I am confused about the data loader function. Details below:

parser.add_argument('--data_path', type=str, default='data/20ng', help='directory containing data')
  1. I can't find any code that refers to the '--data_path' parameter, so why do we need to pass it in the following command?
python main.py --mode train --dataset 20ng --data_path data/20ng --num_topics 50 --train_embeddings 1 --epochs 1000
  2. What do the two parameters doc_terms_file_name and terms_filename do? I don't understand them, and I can't find 'tf_idf_doc_terms_matrix_time_window_1' anywhere (for example, in the provided dataset directory).
vocab, training_set, valid, test_1, test_2 = data.get_data(doc_terms_file_name="tf_idf_doc_terms_matrix_time_window_1",
                                                           terms_filename="tf_idf_terms_time_window_1")

embedding

Hi,

I tried to run this ETM on my own dataset. The embedding part is quite hard to understand. I managed to use skipgram.py to train a word2vec model and get embeddings.txt, but failed to use it through the read_embedding_matrix function. Could anyone help me fix this? It is quite challenging to run this code on a new dataset... Thanks!!!
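
A generic sketch (not the repo's read_embedding_matrix) of loading a word2vec-style embeddings.txt, where each line is "word v1 v2 ... vD", into a matrix aligned with the model vocabulary; words missing from the file get small random vectors. The vocab list and embedding dimension are assumed inputs.

import numpy as np

def load_embeddings(path, vocab, dim):
    vectors = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) != dim + 1:
                continue  # skip a possible header line or malformed rows
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix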

Performs badly on a classification task

I applied ETM to a dataset of YouTube titles; each title belongs to one of 10 classes. There are 10000 titles per class, and the average title length is 10. I set num_topics = 200.

After 1000 epochs of training, I use theta (the topic distribution of each title, a 200*1 vector) obtained from ETM as the input to an SVM and try to do classification. But the result is poor: the F1 score is 0.62.

However, simply using the count vector as input (a 43020*1 vector) gives an F1 around 0.75.
Does anyone know a potential reason that explains this? Thanks!

dataset

Hi,

I tried to run this ETM on my own dataset. I managed to use data_nyt.py to generate a number of .mat files but failed to use them in main.py. The doc_term_file_name and terms_filename parameters are hard to understand. Would it be possible for anyone to help me fix this? Thanks!

memory requirement

hi,

I was trying to get this running on my own dataset of 40K documents and a vocabulary of 76K.
But the computer "says no": it needs 100 GB of RAM...
Am I doing something wrong here, or is it really this memory-hungry?

It ran fine with about 1K docs and a 5K vocabulary. For 10K docs, it required about 36 GB of RAM. Are these numbers normal?

br,
Pieter
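
As a rough back-of-the-envelope check, assuming the bag-of-words data gets materialized somewhere in the pipeline as dense float tensors over the full vocabulary (an assumption; profiling would confirm where the memory actually goes), numbers of that order are plausible:

# Dense float storage for a documents x vocabulary matrix:
docs, vocab_size = 40_000, 76_000
print(docs * vocab_size * 4 / 1e9)   # ~12.2 GB per float32 copy
print(docs * vocab_size * 8 / 1e9)   # ~24.3 GB per float64 copy
# A handful of copies (raw counts, normalized counts, CPU and GPU tensors)
# would land in the tens-of-GB range reported above.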

Test data partition in evaluation()

Hi, absolutely love the repo and the paper. Congratulations on developing this.

Would just like to ask about the evaluation function in main.py

Line 275: preds is derived from normalized_data_batch_1, but data_batch_2 is a different set of documents altogether.

recon_loss = -(preds * data_batch_2).sum(1) 

What does recon_loss mean if it matches the reconstruction of one set of documents against a different set of documents?

Topic coherence calculation

Hi, I've read your code and have some questions about the functions "get_document_frequency" and "get_topic_coherence" in your utils.py.

def get_document_frequency(data, wi, wj=None):
    if wj is None:
        D_wi = 0
        for l in range(len(data)):
            doc = data[l].squeeze(0)
            if len(doc) == 1: 
                continue
            else:
                doc = doc.squeeze()
            if wi in doc:
                D_wi += 1
        return D_wi
    D_wj = 0
    D_wi_wj = 0
    for l in range(len(data)):
        doc = data[l].squeeze(0)
        if len(doc) == 1: 
            doc = [doc.squeeze()]
        else:
            doc = doc.squeeze()
        if wj in doc:
            D_wj += 1
            if wi in doc:
                D_wi_wj += 1
    return D_wj, D_wi_wj 
def get_topic_coherence(beta, data, vocab):
    D = len(data) ## number of docs...data is list of documents
    print('D: ', D)
    TC = []
    num_topics = len(beta)
    for k in range(num_topics):
        print('k: {}/{}'.format(k, num_topics))
        top_10 = list(beta[k].argsort()[-11:][::-1])
        top_words = [vocab[a] for a in top_10]
        TC_k = 0
        counter = 0
        for i, word in enumerate(top_10):
            # get D(w_i)
            D_wi = get_document_frequency(data, word)
            j = i + 1
            tmp = 0
            while j < len(top_10) and j > i:
                # get D(w_j) and D(w_i, w_j)
                D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])
                # get f(w_i, w_j)
                if D_wi_wj == 0:
                    f_wi_wj = -1
                else:
                    f_wi_wj = -1 + ( np.log(D_wi) + np.log(D_wj)  - 2.0 * np.log(D) ) / ( np.log(D_wi_wj) - np.log(D) )
                # update tmp: 
                tmp += f_wi_wj
                j += 1
                counter += 1
            # update TC_k
            TC_k += tmp 
        TC.append(TC_k)
    print('counter: ', counter)
    print('num topics: ', len(TC))
    TC = np.mean(TC) / counter
    print('Topic coherence is: {}'.format(TC))

In your code, you calculate "D_wj" using "D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])". But using "D_wj = get_document_frequency(data, top_10[j])" to get the value of "D_wj", the same way "D_wi" is obtained, seems more reasonable, because "D_wi" and "D_wj" should be calculated in the same way. In that code path, when the condition "len(doc) == 1" is true, we jump to the next iteration, as written in your code:

for l in range(len(data)):
            doc = data[l].squeeze(0)
            if len(doc) == 1: 
                continue

However, when using "D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])", the calculation of "D_wj" falls into the second half of the function, where we encounter:

if len(doc) == 1: 
            doc = [doc.squeeze()]

Then, for the one-word-document case, we reach:

if wj in doc:
            D_wj += 1

I don't think this is a proper way to calculate D_wj, so I suspect this calculation has a problem.
Thanks!

Any suggestions on multiple topics for one document?

Thank you for your work on the ETM model. I applied ETM to my documents, and it gave clearer-cut topics than LDA did.

The original LDA could assign multiple topics to a single document. In the paper, you use a softmax for theta, the document-topic proportions, and the softmax tends to assign one dominant topic per document. I am wondering if you can give me some suggestions on how to use ETM to get multiple topics from a single document.
I am using get_theta(normalized_data_batch) to get the topic distribution.

https://github.com/WalterKung/DataConference2020/blob/master/P2_TOPIC_MODEL/SS_TOPIC_MODEL_Stock_by_news.ipynb
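
One minimal way to read multiple topics off theta (a sketch, assuming theta is the [batch, num_topics] tensor returned by get_theta) is to keep the top k topics per document, or every topic above a probability threshold, instead of only the argmax:

import torch

# theta: [batch, num_topics] document-topic proportions from get_theta
topk_probs, topk_ids = theta.topk(3, dim=1)              # three strongest topics per document
above_threshold = [torch.nonzero(row > 0.1).squeeze(1)   # or: every topic above 10% probability
                   for row in theta]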

Most important topics interpretation

Maybe this question is dumb, but I don't understand why the average of the weighted document-topic proportions is a metric for the most important topics.

thetaWeightedAvg = sums * theta
thetaWeightedAvg = thetaWeightedAvg  /  num_docs
print('\nThe 10 most used topics are {}'.format(thetaWeightedAvg.argsort()[::-1][:10]))

From my understanding, the product of each document's frequency (sums) with the document-topic probabilities theta amplifies or reduces the probabilities based on the actual frequency, and the average provides some insight into which topics are important across the whole corpus. Is that right? Also, what would be the difference if we only averaged the document-topic proportions (no weighting)?
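
A small numpy sketch of the difference, assuming sums holds per-document token counts as it appears to in main.py: the weighted average lets longer documents pull the corpus-level ranking more, while the unweighted average treats every document equally. (The constant the code divides by only rescales the values, so the argsort ranking is unaffected.)

import numpy as np

theta = np.array([[0.9, 0.1],      # short document, mostly topic 0
                  [0.2, 0.8]])     # long document, mostly topic 1
sums = np.array([[10.], [1000.]])  # tokens per document

unweighted = theta.mean(0)                      # [0.55, 0.45]: topic 0 ranked first
weighted = (sums * theta).sum(0) / sums.sum()   # ~[0.21, 0.79]: topic 1 ranked first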

How to get the topic vector?

I can't seem to find a method in the code that outputs the topic vectors, or perhaps I need to modify the code to get them. Does anyone know how to achieve this?
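
A sketch of two things "topic vector" could refer to here, with attribute names that are assumptions based on etm.py (verify against your checkout): the topic-word distributions come from get_beta(), while the topic embeddings, i.e. the vectors living in the word-embedding space, sit in the model's topic-embedding layer.

# Assumed attribute names; check the model definition in etm.py.
beta = model.get_beta()          # [num_topics, vocab_size] topic-word distributions
topic_emb = model.alphas.weight  # [num_topics, rho_size] topic embeddings, if the
                                 # topic-embedding layer is the linear layer called alphas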

Document completion evaluation results in a NaN

Hello,

Thank you for your work. I am running into an issue with the code when trying it on different datasets.
With some datasets, the document completion evaluation during training produces a NaN on the evaluation docs, and I am not sure I understand what that means. Does it mean that the model is not as good at predicting the second half, or is something wrong with the learning process?

Run ETM on my own dataset

Hello,
Is it possible to know what this data looks like: "raw/new_york_times_text/nyt_docs.txt"?
I am trying to fit my own dataset but don't know what format I should transform it into...
I hope somebody can help!

Thanks!

Is it true that a lot of repeated topics appear?

Hi,

Thanks for your interesting paper and this repository!

I tried training ETM on both 20ng and my own dataset with num_topics = 50.

Among the 50 topics I found some repeated ones, such as ['writes', 'article', 'good', 'people', 'make', 'read', 'thing', 'time', 'lot'] (repeated 4 times) and ['time', 'good', 'problem', 'work', 'back', 'problems', 'ago', 'thing', 'couple'] (repeated twice).

Does anyone observe the same phenomenon?

Add predictive measure to utils.py

Hi,

I've read your great paper. I noticed that you measure the log-likelihood on the test set as a predictive metric, but in utils.py I found only the topic coherence and diversity measures.

Would you consider adding that predictive measure to utils.py?

Negative coherence on short texts

Hi, I saw that one can use DETM on short texts. I tried ETM on short texts (each text contains only one sentence) and it seemed to work. However, the coherence score became negative. How should I interpret it? Does lower coherence always mean worse, or do scores closer to 0 mean worse?
Whenever I try ETM on normal-length texts (consisting of more than one sentence), the coherence is always positive, so I assume the negative coherence is caused by the short length.
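
For context, the per-pair score computed in get_topic_coherence (quoted in the "Topic coherence calculation" issue above) works out to the normalized pointwise mutual information of two top words, which is bounded in [-1, 1]:

\mathrm{NPMI}(w_i, w_j) = \frac{\log P(w_i, w_j) - \log P(w_i) - \log P(w_j)}{-\log P(w_i, w_j)} \in [-1, 1]

where the probabilities are estimated from document frequencies. Negative values mean a topic's top words co-occur less often than chance would predict, and values closer to 1 indicate stronger coherence.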
