Comments (7)
Hi everyone,
I am posting below an email we wrote in reply to this question.
Thanks.
Thank you very much for your interest in the ETM model! We're glad you're looking into it.
The formatting of the data is as follows. All data files are in bag-of-words format. Their names are bow_XX_YY.mat, where
XX = {tr, ts, ts_h1, ts_h2, va} # training, test, test(first half of each doc), test(second half of each doc), validation
YY = {tokens, counts} # content of the file: tokens or counts
Each file contains a list of documents, i.e., a list of the form [doc_1, doc_2, ..., doc_N]. Each element doc_i is itself a list of integers. The integers represent either the vocabulary terms (0-indexed) in the "tokens" files, or the word counts in the "counts" files. For example, if doc_1=[0, 14, 17] in the file ending in "tokens.mat" and doc_1=[3, 1, 2] in the file ending in "counts.mat", then term 0 occurs 3 times in the document, term 14 appears once, and term 17 appears twice.
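As a quick sanity check, here is a minimal sketch of how one might read these files back and recover that correspondence (the file paths and the .squeeze() handling are assumptions on our part, based on the object arrays that savemat produces for ragged lists, not part of the original reply):

from scipy.io import loadmat

# Load the per-document token indices and counts for the training split.
tokens = loadmat('bow_tr_tokens.mat')['tokens'].squeeze()
counts = loadmat('bow_tr_counts.mat')['counts'].squeeze()

# Each entry corresponds to one document; squeeze the per-document
# arrays because savemat stores ragged lists as nested object arrays.
doc_tokens = tokens[0].squeeze()  # e.g. [0, 14, 17]
doc_counts = counts[0].squeeze()  # e.g. [3, 1, 2]
for term, count in zip(doc_tokens, doc_counts):
    print(f'term {term} occurs {count} time(s) in doc_1')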
To be more specific, here is how we created the bow_tr_YY.mat files from bow_tr (a scipy sparse matrix in CSR format containing the bag-of-words representation of all documents in the training set):
from scipy.io import savemat

def split_bow(bow_in, n_docs):
    # Split a CSR bag-of-words matrix into two ragged lists:
    # per-document term indices and the matching per-document counts.
    indices = [[w for w in bow_in[doc, :].indices] for doc in range(n_docs)]
    counts = [[c for c in bow_in[doc, :].data] for doc in range(n_docs)]
    return indices, counts

bow_tr_tokens, bow_tr_counts = split_bow(bow_tr, n_docs_tr)

# savemat appends the .mat extension when it is missing (appendmat=True
# by default), so these write bow_tr_tokens.mat and bow_tr_counts.mat.
savemat('bow_tr_tokens', {'tokens': bow_tr_tokens}, do_compression=True)
savemat('bow_tr_counts', {'counts': bow_tr_counts}, do_compression=True)
Finally, vocab.pkl is simply a list containing the strings corresponding to the vocabulary terms. We created this file using

with open('vocab.pkl', 'wb') as f:
    pickle.dump(vocab, f)
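To read it back (a short sketch; the file name follows the convention above):

import pickle

with open('vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)
# The strings for the terms in the example doc_1 above.
print(vocab[0], vocab[14], vocab[17])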
We hope that helps!
Hi there,
We just added scripts to pre-process a dataset to the repo. Please check them out and let us know if you have other questions.
Hi - loved your scripts, they help a lot. The only minor bug is that the output files need a .mat extension, and at least the 20ng script didn't write them out that way. Minor point!
Would it be possible to explain briefly how these data files (in the data/ directory) were created from the 20 Newsgroups dataset, and what the data in each of these files represents?
Thank you!
@adjidieng could you please comment on my request?
The tokens file contains all the tokens (words) present in a document; you can use vocab.pkl to look up the actual words.
The counts file gives the count of each of those tokens in that document.
Thanks for this explanation, it really helps!
I am aware that creating the bag-of-words is out of scope for this project, but any reference on how to transform documents into a bag-of-words representation would really help, as this usually requires highly subjective choices (e.g., tokenization, lemmatizing, stop-word removal); see the sketch after this comment.
Thanks a lot,
David L.
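For reference, here is a minimal sketch of one way to build such a CSR bag-of-words matrix with scikit-learn; the preprocessing choices shown (stop-word removal, the min_df cutoff) are illustrative assumptions, not the ETM authors' pipeline:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat', 'the dog chased the cat']

# Tokenize and count; stop_words and min_df are the subjective knobs
# David mentions, so tune them to your corpus.
vectorizer = CountVectorizer(stop_words='english', min_df=1)
bow = vectorizer.fit_transform(docs)  # scipy CSR matrix, shape (n_docs, vocab_size)
vocab = list(vectorizer.get_feature_names_out())

# bow and vocab can then be fed to split_bow() and pickled as above.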