
Preprocessing · etm · OPEN (7 comments)

adjidieng commented on July 28, 2024
Preprocessing


Comments (7)

adjidieng commented on July 28, 2024

Hi everyone,

I am posting below an email we wrote in reply to this question.

Thanks.

Thank you very much for your interest in the ETM model! We're glad you're looking into it.

The formatting of the data is as follows. All data files are in bag-of-words format. Their names are bow_XX_YY.mat, where

    XX = {tr, ts, ts_h1, ts_h2, va}  # training, test, test (first half of each doc), test (second half of each doc), validation
    YY = {tokens, counts}            # content of the file: tokens or counts
Each file contains a list of documents. That is, each list is of the form [doc_1, doc_2, ..., doc_N]. Each element doc_i is itself a list with integers. The integers represent either the vocabulary terms (they are 0-indexed) for the "tokens" files, or the word counts for the "counts" files. For example, if doc_1=[0, 14, 17] in the file ending in "tokens.mat" and doc_1=[3, 1, 2] in the file ending in "counts.mat", that means that term 0 occurs 3 times in the document, term 14 appears once, and term 17 appears twice.
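As a quick sanity check, here is a minimal sketch (the ids, counts, and vocab_size are the hypothetical values from the example above) that expands one (tokens, counts) pair into a dense count vector:

    import numpy as np

    doc_tokens = [0, 14, 17]  # term ids from the example above
    doc_counts = [3, 1, 2]    # corresponding counts

    vocab_size = 18           # assumed here; in practice, len(vocab)
    dense = np.zeros(vocab_size, dtype=int)
    dense[doc_tokens] = doc_counts
    # dense[0] == 3, dense[14] == 1, dense[17] == 2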

To be more specific, here is how we created the bow_tr_YY.mat files from bow_tr (which is a scipy sparse matrix in CSR format containing the bag-of-words representation of all documents in the training set):

    from scipy.io import savemat

    def split_bow(bow_in, n_docs):
        # For each document (row of the CSR matrix), collect the column
        # indices (0-indexed vocabulary term ids) and the stored values (counts).
        indices = [[w for w in bow_in[doc, :].indices] for doc in range(n_docs)]
        counts = [[c for c in bow_in[doc, :].data] for doc in range(n_docs)]
        return indices, counts

    bow_tr_tokens, bow_tr_counts = split_bow(bow_tr, n_docs_tr)
    # savemat appends the .mat extension to the filename by default
    savemat('bow_tr_tokens', {'tokens': bow_tr_tokens}, do_compression=True)
    savemat('bow_tr_counts', {'counts': bow_tr_counts}, do_compression=True)
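To read these files back, a minimal sketch (savemat stores each list of lists as an object array, so .squeeze() recovers the per-document arrays):

    from scipy.io import loadmat

    # tokens[i] and counts[i] are the term ids and counts of document i
    tokens = loadmat('bow_tr_tokens.mat')['tokens'].squeeze()
    counts = loadmat('bow_tr_counts.mat')['counts'].squeeze()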

Finally, vocab.pkl is simply a list containing the strings corresponding to the vocabulary terms. We created this file using

    import pickle

    with open('vocab.pkl', 'wb') as f:
        pickle.dump(vocab, f)
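Reading it back and decoding term ids is then straightforward (a minimal sketch; the ids are the hypothetical ones from the example above):

    import pickle

    with open('vocab.pkl', 'rb') as f:
        vocab = pickle.load(f)

    # Map 0-indexed term ids back to their word strings
    words = [vocab[w] for w in [0, 14, 17]]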

We hope that helps!


adjidieng commented on July 28, 2024

Hi There,

We just added scripts to preprocess a dataset to the repo. Please check them out and let us know if you have any other questions.


arnicas commented on July 28, 2024

Hi - I loved your scripts; they help a lot. The only minor bug is that the output filenames need to end in .mat, and at least the 20ng script didn't write them out that way. Minor point!


Aalisha commented on July 28, 2024

Would it be possible to briefly explain how these data files (/data/./) were created from the 20 Newsgroups dataset, and what the data in each of these files represents?

Thank you!


tutubalinaev commented on July 28, 2024

@adjidieng could you please comment on my request?


Mandark27 commented on July 28, 2024

The tokens file contains all the tokens (words) present in a review; you can use vocab.pkl to look up the words.

The counts file gives you the count of each of those tokens in that review.


thousandoaks commented on July 28, 2024

(quoting the email explanation from adjidieng's comment above)

Thanks for this explanation, it really helps!
I am aware that creating the bag-of-words is out of scope for this project, but any reference on how to transform documents into bag-of-words form would really help, as this usually requires highly subjective choices (e.g., tokenization, lemmatization, stop-word removal).

Thanks a lot,
David L.
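For reference, such a bag-of-words matrix is commonly built with scikit-learn's CountVectorizer; a minimal sketch (the toy corpus and the preprocessing choices below are illustrative, not the authors' exact pipeline):

    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical toy corpus; replace with the real documents.
    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets",
    ]

    # Illustrative choices: lowercasing and English stop-word removal;
    # min_df / max_df can additionally prune very rare / very common terms.
    vectorizer = CountVectorizer(lowercase=True, stop_words='english')
    bow = vectorizer.fit_transform(docs)  # scipy CSR matrix, analogous to bow_tr
    vocab = list(vectorizer.get_feature_names_out())

The resulting CSR matrix can then be split with the split_bow function from the email above.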

