
Preprocessing · etm · OPEN (7 comments)

adjidieng commented on July 28, 2024
Preprocessing


Comments (7)

adjidieng commented on July 28, 2024

Hi everyone,

I am posting below an email we wrote in reply to this question.

Thanks.

Thank you very much for your interest in the ETM model! We're glad you're looking into it.

The formatting of the data is as follows. All data files are in bag-of-words format. Their names are bow_XX_YY.mat, where

    XX = {tr, ts, ts_h1, ts_h2, va}  # training, test, test (first half of each doc), test (second half of each doc), validation
    YY = {tokens, counts}            # content of the file: tokens or counts
Each file contains a list of documents. That is, each list is of the form [doc_1, doc_2, ..., doc_N]. Each element doc_i is itself a list with integers. The integers represent either the vocabulary terms (they are 0-indexed) for the "tokens" files, or the word counts for the "counts" files. For example, if doc_1=[0, 14, 17] in the file ending in "tokens.mat" and doc_1=[3, 1, 2] in the file ending in "counts.mat", that means that term 0 occurs 3 times in the document, term 14 appears once, and term 17 appears twice.
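As a quick sanity check, here is a minimal sketch (the ids, counts, and vocab_size are the hypothetical values from the example above) that expands one (tokens, counts) pair into a dense count vector:

    import numpy as np

    doc_tokens = [0, 14, 17]  # term ids from the example above
    doc_counts = [3, 1, 2]    # corresponding counts

    vocab_size = 18           # assumed here; in practice, len(vocab)
    dense = np.zeros(vocab_size, dtype=int)
    dense[doc_tokens] = doc_counts
    # dense[0] == 3, dense[14] == 1, dense[17] == 2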

To be more specific, here is how we created the bow_tr_YY.mat files from bow_tr (which is a scipy sparse matrix in CSR format containing the bag-of-words representation of all documents in the training set):

    from scipy.io import savemat

    def split_bow(bow_in, n_docs):
        # For each document (row of the CSR matrix), collect the column
        # indices (0-indexed vocabulary term ids) and the stored values (counts).
        indices = [[w for w in bow_in[doc, :].indices] for doc in range(n_docs)]
        counts = [[c for c in bow_in[doc, :].data] for doc in range(n_docs)]
        return indices, counts

    bow_tr_tokens, bow_tr_counts = split_bow(bow_tr, n_docs_tr)
    # savemat appends the .mat extension to the filename by default
    savemat('bow_tr_tokens', {'tokens': bow_tr_tokens}, do_compression=True)
    savemat('bow_tr_counts', {'counts': bow_tr_counts}, do_compression=True)
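To read these files back, a minimal sketch (savemat stores each list of lists as an object array, so .squeeze() recovers the per-document arrays):

    from scipy.io import loadmat

    # tokens[i] and counts[i] are the term ids and counts of document i
    tokens = loadmat('bow_tr_tokens.mat')['tokens'].squeeze()
    counts = loadmat('bow_tr_counts.mat')['counts'].squeeze()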

Finally, vocab.pkl is simply a list containing the strings corresponding to the vocabulary terms. We created this file using

    import pickle

    with open('vocab.pkl', 'wb') as f:
        pickle.dump(vocab, f)
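Reading it back and decoding term ids is then straightforward (a minimal sketch; the ids are the hypothetical ones from the example above):

    import pickle

    with open('vocab.pkl', 'rb') as f:
        vocab = pickle.load(f)

    # Map 0-indexed term ids back to their word strings
    words = [vocab[w] for w in [0, 14, 17]]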

We hope that helps!


adjidieng commented on July 28, 2024

Hi There,

We just added scripts to preprocess a dataset to the repo. Please check them out and let us know if you have any other questions.


arnicas commented on July 28, 2024

Hi - I loved your scripts; they help a lot. The only minor bug is that the output filenames need to end in .mat, and at least the 20ng script didn't write them out that way. Minor point!


Aalisha commented on July 28, 2024

Would it be possible to briefly explain how these data files (/data/./) were created from the 20 Newsgroups dataset, and what the data in each of these files represents?

Thank you!


tutubalinaev commented on July 28, 2024

@adjidieng could you please comment on my request?


Mandark27 commented on July 28, 2024

The tokens file contains all the tokens (words) present in a review; you can use vocab.pkl to look up the words.

The counts file gives you the count of each of those tokens in that review.


thousandoaks commented on July 28, 2024

(quoting the email explanation from adjidieng's comment above)

Thanks for this explanation, it really helps!
I am aware that creating the bag-of-words is out of scope for this project, but any reference on how to transform documents into bag-of-words form would really help, as this usually requires highly subjective choices (e.g., tokenization, lemmatization, stop-word removal).

Thanks a lot,
David L.
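For reference, such a bag-of-words matrix is commonly built with scikit-learn's CountVectorizer; a minimal sketch (the toy corpus and the preprocessing choices below are illustrative, not the authors' exact pipeline):

    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical toy corpus; replace with the real documents.
    docs = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets",
    ]

    # Illustrative choices: lowercasing and English stop-word removal;
    # min_df / max_df can additionally prune very rare / very common terms.
    vectorizer = CountVectorizer(lowercase=True, stop_words='english')
    bow = vectorizer.fit_transform(docs)  # scipy CSR matrix, analogous to bow_tr
    vocab = list(vectorizer.get_feature_names_out())

The resulting CSR matrix can then be split with the split_bow function from the email above.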

