adjidieng / ETM
Topic Modeling in Embedding Spaces
License: MIT License
Hello,
Thanks for an interesting paper and for sharing the code.
I've been trying this method on some non-English datasets, and a small stumbling block is the hard-coded nearest-neighbor queries in main.py (lines 209 and 374). This might be a problem for some English datasets as well.
I've fixed this in my own experiments by querying ten random words from the vocab, but I thought I'd flag it here in case you want to address this. I'd be happy to submit a PR myself if that's alright.
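The workaround described above can be sketched as follows. This is only an illustration: sample_query_words and its parameters are hypothetical names and are not part of main.py.

```python
import random

def sample_query_words(vocab, num_queries=10, seed=None):
    """Pick nearest-neighbor query words from the dataset's own vocabulary.

    Hypothetical helper: it replaces a fixed list of English query words
    with a random sample, so it works for any language or domain.
    """
    rng = random.Random(seed)
    k = min(num_queries, len(vocab))  # small vocabularies yield fewer queries
    return rng.sample(list(vocab), k)

# Toy non-English vocabulary, for illustration only
vocab = ["haus", "baum", "wasser", "stadt", "schule"]
queries = sample_query_words(vocab, num_queries=3, seed=0)
print(queries)
```

Passing a fixed seed keeps the query words stable across runs, which makes the printed neighbors easier to compare between epochs.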
Hi,
How do I obtain the document-topic proportions for each document in the corpus?
Thank you
Luke
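A rough sketch of one way to collect the proportions for a whole corpus: normalize each bag-of-words row, pass the rows through the model in batches, and stack the results. Another post on this page mentions get_theta(normalized_data_batch); here a toy stand-in plays that role, and its return value (a plain list of rows) is an assumption, not the real model's interface.

```python
def normalize_rows(bows):
    """Divide each bag-of-words row by its total count (all-zero rows stay zero)."""
    out = []
    for row in bows:
        total = sum(row)
        out.append([c / total if total > 0 else 0.0 for c in row])
    return out

def collect_theta(bows, get_theta, batch_size=2):
    """Run get_theta over the corpus in batches and stack the per-document rows."""
    thetas = []
    for start in range(0, len(bows), batch_size):
        batch = normalize_rows(bows[start:start + batch_size])
        thetas.extend(get_theta(batch))
    return thetas

def toy_get_theta(batch):
    """Hypothetical stand-in for a trained model: two topics, each owning
    half of a four-word vocabulary."""
    return [[sum(row[:2]), sum(row[2:])] for row in batch]

bows = [[2, 0, 0, 2], [0, 1, 3, 0], [1, 1, 1, 1]]
theta = collect_theta(bows, toy_get_theta)
for doc_id, row in enumerate(theta):
    print(doc_id, row)
```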
Hi, really interesting project.
Are there plans to add multi-GPU support?
Is there a way to train the model on my own dataset? I am getting nan for my performance metrics at recon_loss, kld_theta = model(data_batch, normalized_data_batch)
in main.py
it seems that import os is missing.
Hi, I cannot understand the expression "recon_loss = -(preds * bows).sum(1)" in the forward() function of etm.py. Could you explain it? The loss function seems to be different from the equation defined in the paper. Thanks!
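One way to read that expression, assuming preds holds the log-probability of each vocabulary word for each document and bows holds the word counts: multiplying and summing over the vocabulary gives the log-likelihood of the document's words, and the minus sign turns it into a loss. The small check below uses plain Python lists instead of tensors, and the numbers are made up.

```python
import math

def recon_loss(log_probs, bows):
    """Negative log-likelihood of each document's word counts.

    log_probs[d][v] is log p(word v | document d); bows[d][v] is the count
    of word v in document d. Mirrors -(preds * bows).sum(1) row by row.
    """
    return [-sum(lp * c for lp, c in zip(lp_row, bow_row))
            for lp_row, bow_row in zip(log_probs, bows)]

# One document over a three-word vocabulary (illustrative numbers)
word_probs = [[0.5, 0.25, 0.25]]
preds = [[math.log(p) for p in row] for row in word_probs]
bows = [[2, 1, 0]]  # word 0 appears twice, word 1 once

loss = recon_loss(preds, bows)

# The same quantity computed token by token
expected = -(math.log(0.5) + math.log(0.5) + math.log(0.25))
print(loss[0], expected)
```

Under that assumption the expression is a sum over tokens of the negative log-probability of each token, written as a sum over vocabulary entries weighted by counts.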
Could you please upload a preprocessing script that creates all files from data/./? The files are:
bow_tr_counts.mat
bow_tr_tokens.mat
bow_ts_counts.mat
bow_ts_h1_counts.mat
bow_ts_h1_tokens.mat
bow_ts_h2_counts.mat
bow_ts_h2_tokens.mat
bow_ts_tokens.mat
bow_va_counts.mat
bow_va_tokens.mat
vocab.pkl
Thank you in advance!
Is it applicable to short reviews? What is the minimum number of words, excluding stopwords, that a review must contain for the model to run?
I am getting nan as loss since my output tensor from q_theta is a tensor full of nan.
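One possible cause, which is an assumption and not confirmed in this thread: if the input is normalized by document length, a document with no in-vocabulary words has a total of zero, and the division produces nan values that then propagate through q_theta. A minimal sketch of filtering such documents before training (drop_empty_documents is a hypothetical helper):

```python
def drop_empty_documents(bows):
    """Return (kept_indices, kept_rows) for documents with at least one word.

    An all-zero bag-of-words row cannot be normalized by its length,
    so it is removed before batching.
    """
    kept = [(i, row) for i, row in enumerate(bows) if sum(row) > 0]
    indices = [i for i, _ in kept]
    rows = [row for _, row in kept]
    return indices, rows

bows = [[1, 0, 2], [0, 0, 0], [0, 3, 0]]
kept_indices, kept_rows = drop_empty_documents(bows)
print(kept_indices)
```

Keeping the original indices makes it possible to map results back to the unfiltered corpus.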
Why is this code never used?
https://github.com/adjidieng/ETM/blob/master/main.py#L245:L251
Although you seem to want to use it in https://github.com/adjidieng/ETM/blob/master/main.py#L333 and https://github.com/adjidieng/ETM/blob/master/main.py#L324, the function evaluate never makes use of the counts and tokens selected in https://github.com/adjidieng/ETM/blob/master/main.py#L245:L251.
I'm having some trouble figuring out the appropriate input and output for the model after it is created. Is there any example you can provide for the use and what I can expect to have returned? As I understand it, it should return the predicted topics, with the embedding of the document being passed, correct?
Could you tell me the meaning of the parameters in this formula?
kl_theta = -0.5 * torch.sum(1 + logsigma_theta - mu_theta.pow(2) - logsigma_theta.exp(), dim=-1).mean()
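That expression has the shape of the closed-form KL divergence between a diagonal Gaussian and a standard normal, if mu_theta is read as the mean and logsigma_theta as the log of the variance (the term logsigma_theta.exp() then plays the role of the variance). Under that reading, the sum runs over latent dimensions and .mean() averages over the batch. The sketch below checks the closed form against numerical integration in one dimension; the function names are illustrative.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, exp(logvar)) || N(0, 1) ), summed over dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def kl_numeric_1d(mu, logvar, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule integration of q(x) * (log q(x) - log p(x)) in one dimension."""
    var = math.exp(logvar)
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        log_q = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        log_p = -0.5 * (math.log(2 * math.pi) + x * x)
        total += math.exp(log_q) * (log_q - log_p) * dx
    return total

mu, logvar = 1.0, math.log(0.25)  # mean 1, standard deviation 0.5
closed = kl_to_standard_normal([mu], [logvar])
numeric = kl_numeric_1d(mu, logvar)
print(closed, numeric)
```

The divergence is zero only when the mean is 0 and the log-variance is 0, which is why this term pulls the encoder output toward the standard normal prior.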
I would be grateful if anybody could tell me how to do that ^_^
Unfortunately the validation loss used to train the model is currently being calculated on the test set, which means that the test set perplexity performance metric is not a reliable indicator of out-of-sample generalisation (cf. main.py lines 256-282).
The original intention to calculate the validation loss on the validation set is clear from main.py lines 244-251; however, the variables defined there are not used subsequently in the "evaluate" function.
if args.clip > 0:
    torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
Hi everybody!
I just want to know what role the above block plays. Why is the default value of args.clip 0?
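For context, clip_grad_norm_ rescales all gradients together so that their combined L2 norm does not exceed the given maximum, and with the guard shown above a value of 0 simply skips clipping. The sketch below reproduces that rescaling on plain Python lists; clip_by_global_norm is an illustrative name, and the small 1e-6 term in the denominator is included as a guard against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale every gradient by the same factor so the global L2 norm
    is at most max_norm. Gradients already within the limit are unchanged."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total_norm

grads = [[3.0, 4.0], [0.0, 12.0]]  # global norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is limited.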
Hi, thanks for your wonderful work. I am confused about the data loader function. Details below:
parser.add_argument('--data_path', type=str, default='data/20ng', help='directory containing data')
python main.py --mode train --dataset 20ng --data_path data/20ng --num_topics 50 --train_embeddings 1 --epochs 1000
vocab, training_set, valid, test_1, test_2 = data.get_data(doc_terms_file_name="tf_idf_doc_terms_matrix_time_window_1",
terms_filename="tf_idf_terms_time_window_1")
Hi,
I tried to run ETM on my own dataset. The embedding part is quite hard to understand. I managed to use skipgram.py to train a word2vec model and get embeddings.txt, but failed to use it through the read_embedding_matrix function. Could anyone help me fix this? It is quite challenging to run this code on a new dataset... Thanks!!!
I applied ETM to a dataset of YouTube titles, where each title belongs to one of 10 classes. There are 10000 titles per class, and the average title length is 10. I set num_topics = 200.
After 1000 epochs of training, I use theta (the topic distribution for each title, a 200*1 vector) obtained from ETM as the input to an SVM for classification. But the result is bad: the F1 score is 0.62.
However, simply using the count vector (a 43020*1 vector) as input gives an F1 of around 0.75.
Does anyone know a potential reason for this? Thanks!
Hi,
I tried to run ETM on my own dataset. I managed to use data_nyt.py to generate a number of .mat files but failed to use them in main.py. The doc_term_file_name and terms_filename arguments are hard to understand. Could anyone help me fix this? Thanks!
Hi,
I was trying to get this running on my own dataset of 40K documents and a vocabulary of 76K.
But the computer "says no": it needs 100 GB of RAM...
Am I doing something wrong here, or is this really that memory-greedy?
It ran fine with about 1K docs and a 5K vocabulary. For 10K docs, it required about 36 GB of RAM. Are these normal numbers?
br,
Pieter
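As a back-of-the-envelope check, the size of a dense document-by-vocabulary matrix can be estimated directly from the numbers above. Whether the code actually materializes the full matrix densely, and how many copies exist at once, is not confirmed here; the average of 200 unique words per document used for the sparse estimate is a made-up figure.

```python
def dense_matrix_gib(num_docs, vocab_size, bytes_per_value):
    """Memory for a fully dense num_docs x vocab_size matrix, in GiB."""
    return num_docs * vocab_size * bytes_per_value / 2**30

def sparse_matrix_gib(num_docs, avg_unique_words, bytes_per_value=4, bytes_per_index=4):
    """Rough memory for storing only the non-zero counts plus one index each, in GiB."""
    nnz = num_docs * avg_unique_words
    return nnz * (bytes_per_value + bytes_per_index) / 2**30

dense32 = dense_matrix_gib(40_000, 76_000, 4)  # 32-bit values
dense64 = dense_matrix_gib(40_000, 76_000, 8)  # 64-bit values
sparse = sparse_matrix_gib(40_000, 200)        # hypothetical 200 unique words per doc

print(f"dense float32: {dense32:.1f} GiB")
print(f"dense float64: {dense64:.1f} GiB")
print(f"sparse estimate: {sparse:.2f} GiB")
```

A single dense copy already runs to more than ten GiB at this size, so a handful of intermediate copies could plausibly reach the reported figures, while a sparse representation stays far smaller.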
Hi, absolutely love the repo and the paper. Congratulations on developing this.
Would just like to ask about the evaluation function in main.py.
Line 275: preds is derived from normalized_data_batch_1, but data_batch_2 is a different set of documents altogether.
recon_loss = -(preds * data_batch_2).sum(1)
What does recon_loss mean if it matches the reconstruction of one set of documents against a different set of documents?
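One reading, assuming the two batches hold the first and second halves of the same test documents (as the h1 and h2 file names elsewhere on this page suggest), is a document-completion evaluation: topic proportions are inferred from the first half and then scored on how well they predict the words of the second half. A minimal sketch with made-up numbers; completion_perplexity and the per-word normalization are illustrative choices, not necessarily what main.py does.

```python
import math

def completion_perplexity(theta, beta, second_half_bows):
    """Perplexity of the held-out halves under mixtures theta @ beta.

    theta[d][k]: topic proportions inferred from the first half of document d.
    beta[k][v]:  probability of word v under topic k.
    second_half_bows[d][v]: word counts in the second half of document d.
    Assumes every observed word has non-zero predicted probability.
    """
    total_nll, total_words = 0.0, 0
    for th, bow in zip(theta, second_half_bows):
        for v, count in enumerate(bow):
            if count == 0:
                continue
            p = sum(th[k] * beta[k][v] for k in range(len(th)))
            total_nll -= count * math.log(p)
            total_words += count
    return math.exp(total_nll / total_words)

beta = [[0.5, 0.5, 0.0],   # topic 0 splits its mass over words 0 and 1
        [0.0, 0.0, 1.0]]   # topic 1 puts all its mass on word 2
theta = [[1.0, 0.0]]       # inferred from the first half: all topic 0
second_half = [[1, 1, 0]]  # the second half contains words 0 and 1 once each

ppl = completion_perplexity(theta, beta, second_half)
print(ppl)
```

In this toy case each held-out word has predicted probability 0.5, so the model is as uncertain as a fair choice between two words.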
Hi, I've read your code and have some questions about the functions "get_document_frequency" and "get_topic_coherence" in your utils.py.
def get_document_frequency(data, wi, wj=None):
    if wj is None:
        D_wi = 0
        for l in range(len(data)):
            doc = data[l].squeeze(0)
            if len(doc) == 1:
                continue
            else:
                doc = doc.squeeze()
            if wi in doc:
                D_wi += 1
        return D_wi
    D_wj = 0
    D_wi_wj = 0
    for l in range(len(data)):
        doc = data[l].squeeze(0)
        if len(doc) == 1:
            doc = [doc.squeeze()]
        else:
            doc = doc.squeeze()
        if wj in doc:
            D_wj += 1
            if wi in doc:
                D_wi_wj += 1
    return D_wj, D_wi_wj

def get_topic_coherence(beta, data, vocab):
    D = len(data) ## number of docs...data is list of documents
    print('D: ', D)
    TC = []
    num_topics = len(beta)
    for k in range(num_topics):
        print('k: {}/{}'.format(k, num_topics))
        top_10 = list(beta[k].argsort()[-11:][::-1])
        top_words = [vocab[a] for a in top_10]
        TC_k = 0
        counter = 0
        for i, word in enumerate(top_10):
            # get D(w_i)
            D_wi = get_document_frequency(data, word)
            j = i + 1
            tmp = 0
            while j < len(top_10) and j > i:
                # get D(w_j) and D(w_i, w_j)
                D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])
                # get f(w_i, w_j)
                if D_wi_wj == 0:
                    f_wi_wj = -1
                else:
                    f_wi_wj = -1 + ( np.log(D_wi) + np.log(D_wj) - 2.0 * np.log(D) ) / ( np.log(D_wi_wj) - np.log(D) )
                # update tmp:
                tmp += f_wi_wj
                j += 1
                counter += 1
            # update TC_k
            TC_k += tmp
        TC.append(TC_k)
    print('counter: ', counter)
    print('num topics: ', len(TC))
    TC = np.mean(TC) / counter
    print('Topic coherence is: {}'.format(TC))
In your code, you calculate "D_wj" by using "D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])". But using "D_wj = get_document_frequency(data, top_10[j])" to get the value of "D_wj", as you do for "D_wi", seems more reasonable, because "D_wi" and "D_wj" should be calculated in the same way. In that case, when the condition "len(doc) == 1" is true, we skip to the next iteration, as in your code:
for l in range(len(data)):
    doc = data[l].squeeze(0)
    if len(doc) == 1:
        continue
However, when using "D_wj, D_wi_wj = get_document_frequency(data, word, top_10[j])", the calculation of "D_wj" goes through the second half of the function "get_document_frequency", where we encounter:
if len(doc) == 1:
    doc = [doc.squeeze()]
and then, for a one-word document:
if wj in doc:
    D_wj += 1
So one-word documents are counted for D_wj but skipped for D_wi. I don't think this is a proper way to calculate D_wj, so I suspect there is a problem in this calculation.
Thanks!
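To make the concern concrete, here is a small sketch in which both single-word counts and the joint count are computed by the same rule, so a one-word document is treated identically for w_i and w_j. Documents are plain lists of word ids here rather than the arrays used in utils.py, and document_frequencies is a hypothetical helper; whether one-word documents should be counted or skipped is a design choice that the authors have not confirmed.

```python
def document_frequencies(docs, wi, wj):
    """Return (D_wi, D_wj, D_wi_wj) with every document handled the same way.

    docs is a list of documents, each a list of word ids.
    """
    doc_sets = [set(doc) for doc in docs]
    d_wi = sum(1 for s in doc_sets if wi in s)
    d_wj = sum(1 for s in doc_sets if wj in s)
    d_wi_wj = sum(1 for s in doc_sets if wi in s and wj in s)
    return d_wi, d_wj, d_wi_wj

# The second document has a single word; it is counted for word 1 like any other
docs = [[0, 1, 2], [1], [0, 1], [2]]
counts = document_frequencies(docs, wi=0, wj=1)
print(counts)
```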
Thank you for your work on the ETM model. I applied ETM to my documents, and it gave more clear-cut topics than LDA did.
The original LDA can assign multiple topics to a single document. In the paper, you use a softmax for theta, the topic proportions. The softmax tends to assign one topic to one document. I am wondering if you can give me some suggestions on how I can use ETM to get multiple topics for a single document.
I am using get_theta(normalized_data_batch) to get the topic distribution.
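Since a softmax output is a full probability distribution over topics, one simple option is to keep every topic whose proportion exceeds a threshold instead of only the largest one. A minimal sketch, where topics_above and the threshold of 0.1 are arbitrary illustrative choices:

```python
def topics_above(theta_row, threshold=0.1):
    """Return (topic_id, proportion) pairs at or above threshold, largest first."""
    ranked = sorted(enumerate(theta_row), key=lambda kv: kv[1], reverse=True)
    return [(k, p) for k, p in ranked if p >= threshold]

# Made-up topic proportions for one document over four topics
theta_row = [0.05, 0.6, 0.3, 0.05]
selected = topics_above(theta_row, threshold=0.1)
print(selected)
```

If the proportions are very peaked for most documents, lowering the threshold or taking the top few topics per document are alternatives.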
Maybe this question is dumb, but I don't understand why the average of the weighted document-topic proportions is a metric for the most important topics.
thetaWeightedAvg = sums * theta
thetaWeightedAvg = thetaWeightedAvg / num_docs
print('\nThe 10 most used topics are {}'.format(thetaWeightedAvg.argsort()[::-1][:10]))
From my understanding, multiplying each document frequency (sums) by the document-topic probabilities theta amplifies or reduces each probability, and the average then gives some insight into which topics are important in the whole corpus. Is that right? Also, what would be the difference if we only averaged the document-topic proportions (no weighting)?
Thanks a lot
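A small comparison may help, assuming sums holds each document's total word count. With weighting, long documents contribute more, so the result approximates the share of words in the corpus assigned to each topic; without weighting, every document counts equally regardless of length. The numbers below are made up, and the weighted average is divided by the total word count here; dividing by any other constant, such as the number of documents, leaves the topic ranking unchanged.

```python
def average_topic_usage(theta, doc_lengths):
    """Return (unweighted, weighted) average topic proportions.

    unweighted: plain mean over documents.
    weighted:   each document's proportions scaled by its word count,
                then divided by the total number of words.
    """
    num_docs = len(theta)
    num_topics = len(theta[0])
    total_words = sum(doc_lengths)
    unweighted = [sum(row[k] for row in theta) / num_docs
                  for k in range(num_topics)]
    weighted = [sum(n * row[k] for row, n in zip(theta, doc_lengths)) / total_words
                for k in range(num_topics)]
    return unweighted, weighted

theta = [[0.9, 0.1],   # a long document, mostly topic 0
         [0.1, 0.9]]   # a short document, mostly topic 1
doc_lengths = [100, 10]

unweighted, weighted = average_topic_usage(theta, doc_lengths)
print(unweighted)
print(weighted)
```

Here the unweighted average treats both topics as equally used, while the weighted one favors topic 0 because most of the words belong to the long document.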
I noticed that there is a bug in the preprocessing code for 20ng (scripts/data_20ng.py), line 88 in 52b090b: it is missing the idx_permute index conversion.
I can't seem to find where the topic vector is output in the code. Do I need to modify the code to get it? Does anyone know how to achieve this?
Hello,
Thank you for your work. I am running into an issue with the code when trying it with different datasets.
With some datasets, the document completion evaluation during training produces a nan when using the evaluation docs, and I am not sure I understand what that means. Does that mean that the model is not as good at predicting the second half? Or is something wrong with the learning process?
Hello,
Is it possible to know what this data looks like: "raw/new_york_times_text/nyt_docs.txt"?
I am trying to fit my own dataset but don't know which format I should convert it to...
I hope somebody can help!
Thanks!
Hi,
Thanks for your interesting paper and this repository!
I tried training ETM on both 20ng and my own dataset with num_topics = 50.
Among the 50 topics I found some repeated topics, like ['writes', 'article', 'good', 'people', 'make', 'read', 'thing', 'time', 'lot'] (repeated 4 times) and ['time', 'good', 'problem', 'work', 'back', 'problems', 'ago', 'thing', 'couple'] (repeated 2 times).
Has anyone observed the same phenomenon?
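One way to quantify how much topics repeat is a diversity score: the fraction of unique words among the top words of all topics. The sketch below is a minimal version on made-up word lists; topic_diversity is an illustrative name and the calculation is not taken from the repository.

```python
def topic_diversity(top_words_per_topic):
    """Fraction of unique words among all topics' top words.

    1.0 means no word is shared between topics; values near
    1 / num_topics mean the topics are nearly identical.
    """
    all_words = [w for topic in top_words_per_topic for w in topic]
    return len(set(all_words)) / len(all_words)

topics = [
    ["writes", "article", "good"],
    ["writes", "article", "good"],  # exact repeat of the first topic
    ["time", "problem", "work"],
]
diversity = topic_diversity(topics)
print(diversity)
```

Tracking this score while varying num_topics can show whether the repeats come from asking for more topics than the corpus supports.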
Hi,
I've read your great paper. I noticed that you measure log likelihood on the test set as a predictive metric. But in utils.py, I found only the topic coherence and diversity measures.
Would you consider adding that predictive measure to utils.py?
Hi, I saw that one can use DETM on short texts. I tried ETM on short texts (each text contains only one sentence) and it seemed to work. However, the coherence score became negative. How should I interpret it? Does lower coherence always mean worse? Or do scores closer to 0 mean worse?
Whenever I try ETM on normal-length texts (consisting of more than one sentence), the coherence is always positive, so I assume that negative coherence is caused by short length
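For reference, the pairwise score in the get_topic_coherence code quoted earlier on this page is algebraically the normalized pointwise mutual information (NPMI), which ranges from -1 to 1: positive values mean two words co-occur more often than chance, 0 means independence, and negative values mean they co-occur less often than chance. With very short texts, top words rarely share a document, which pushes scores below zero. A sketch with made-up counts (the npmi function is illustrative):

```python
import math

def npmi(d_wi, d_wj, d_wi_wj, num_docs):
    """Normalized pointwise mutual information from document frequencies."""
    if d_wi_wj == 0:
        return -1.0  # the words never co-occur
    if d_wi_wj == num_docs:
        return 0.0   # both words in every document: PMI is 0, the ratio is undefined
    p_i = d_wi / num_docs
    p_j = d_wj / num_docs
    p_ij = d_wi_wj / num_docs
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

always_together = npmi(10, 10, 10, 100)  # the words only ever appear together
independent = npmi(50, 50, 25, 100)      # co-occurrence exactly at chance level
rarely_together = npmi(50, 50, 1, 100)   # co-occurrence far below chance

print(always_together, independent, rarely_together)
```

Higher is better on this scale, so a negative average indicates that the top words of the topics tend to avoid each other in the corpus, not that the sign is inverted.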