Comments (15)
Yeah, in the composition file the topics are in the same order as they appear in the topics file. So if you look at the topics file you have:
0 word word word bla bla bla
1 word bla word bla
2 word bla word bla
etc.
In the topic composition file, it's just the same order, only horizontally, so you'll have
file topic0weight topic1weight topic2weight ...
I usually just manually add a row on top of the composition file with the topic numbers from 0 up to however many I'd decided to have; it makes it a bit easier to keep track.
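For anyone who would rather script the header row than add it by hand, here is a minimal Python sketch. It assumes each row starts with a couple of identifier columns (a document index and a file name, as in the older output) followed by one weight per topic in topic-ID order, and that file names contain no whitespace. `add_header` and `n_id_cols` are made-up names for illustration, not part of MALLET.

```python
def add_header(lines, n_id_cols=2):
    """Return the rows with a header row of topic numbers prepended.

    Assumption: each row has `n_id_cols` identifier columns followed by
    one weight per topic, in topic-ID order.
    """
    # Skip blank lines and any '#'-prefixed header line that may be present.
    rows = [ln.split() for ln in lines if ln.strip() and not ln.startswith("#")]
    if not rows:
        return []
    n_topics = len(rows[0]) - n_id_cols
    # Placeholder names for the identifier columns, then topic numbers 0..n-1.
    header = [f"id{i}" for i in range(n_id_cols)] + [str(t) for t in range(n_topics)]
    return ["\t".join(header)] + ["\t".join(row) for row in rows]
```

Set `n_id_cols` to however many non-weight columns your file actually has; if it only has the file name, that would be 1.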
from mallet.
Seconded. Downloaded the latest Mallet yesterday. I had to make a slight alteration to some of my code that reads the doc topics file. My presumption is that the topics remain in order from 0...n, but the topic ID (#) is no longer output.
from mallet.
Thanks for the second -- however, I'm not sure I get what you mean by saying that the topics are in order 0...n. They didn't use to be in order of topic number but in order of weight in the document. What mallet used to produce was "topic proportion topic proportion", like this:
0 file:/path/ocr_10.2307_1840442.txt 16 0.4329549238 2 0.3217675612 18 0.0649332732
So now I don't know what the numbers are supposed to represent in the new version - what topics do they refer to? I'm using the old version (2.0.7) for the moment.
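To make the old layout concrete, here is a rough Python sketch that reads one line like the example above and spreads the "topic proportion" pairs out into a vector indexed by topic ID. It assumes two leading columns (document index and file name) and whitespace-separated fields; `parse_sparse_row` and `num_topics` are hypothetical names, and `num_topics` has to be supplied by you because the line itself does not say how many topics the model has.

```python
def parse_sparse_row(line, num_topics):
    """Parse one old-style row: doc index, file name, then topic/weight pairs.

    Returns (doc_id, name, weights), where weights[t] is the weight of
    topic t and topics not listed on the line are left at 0.0.
    """
    fields = line.split()
    doc_id, name, rest = fields[0], fields[1], fields[2:]
    weights = [0.0] * num_topics
    # Pairs alternate: topic ID, proportion, topic ID, proportion, ...
    for topic, weight in zip(rest[0::2], rest[1::2]):
        weights[int(topic)] = float(weight)
    return doc_id, name, weights
```

The 0.0 entries are only placeholders for topics that were not listed on the line, not weights the model reported.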
from mallet.
Right. So I had forgotten that they were in sorted order like that. My assumption was that with the removal of the topic ID (and as you've pointed out the sorting order), that the order was now just by topic ID. So: [topic 0 weight, topic 1 weight, topic 2 weight....]
from mallet.
As it stands that output is useless...if it was a conscious change then I think it should be changed back. My version gives "topic" "weight" which is helpful.
I can look into the source code to see what is up...which version are we referring to again?
On Oct 17, 2015, at 9:32 AM, Jonathan Armoza wrote:
Right. So I had forgotten that they were in sorted order like that. My assumption was that with the removal of the topic ID (and as you've pointed out the sorting order), that the order was now just by topic ID. So: [topic 0 weight, topic 1 weight, topic 2 weight....]
from mallet.
2.0.8 (downloaded from mallet website) and 2.0.9 (latest build cloned from github yesterday). Thanks!
from mallet.
Sorry for bumping this, but, does anyone have an answer yet?
Are they just the probabilities of the topics in order? Or?
I just ran MALLET on 800k documents (12 hours). I desperately do not want to re-run the training.
from mallet.
Just ran the latest version from Github and see the same behavior. I actually welcome this change because the earlier format was more complex to process. Now it is much more like a regular, sortable table or a data frame. Assuming, of course, the weights are indeed saved in ascending order of the topics. A row of headers would help! And it would have been nice if this change had been documented or explained somehow, somewhere. Or maybe it has?
[Edit] I rewrote my topic score extraction script and the new structure saves me a few loops and a lot of time!
from mallet.
Add the option --doc-topics-threshold with a value larger than 0.0 and you will get the 'old style' output with the topic ids and percentages in decreasing order.
This output is OK for visualizing the topics and their distribution across text sources, but I agree with @christofs that the new format is better for processing using a development tool/language.
The issue should be closed since there is no reason to go back.
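Since either layout can turn up depending on whether the threshold option was used, a processing script might want to guess which one it has been given before parsing. This is only a heuristic sketch in Python: it assumes two leading identifier columns and that topic IDs are written as plain integers while weights are not. `looks_sparse` is a made-up helper, not something MALLET provides.

```python
def looks_sparse(line, n_id_cols=2):
    """Guess whether a row uses the old 'topic weight topic weight' layout.

    Heuristic: after the identifier columns, the fields come in pairs and
    the first field of every pair is a plain non-negative integer.
    """
    fields = line.split()[n_id_cols:]
    if not fields or len(fields) % 2 != 0:
        return False
    return all(f.isdigit() for f in fields[0::2])
```

A dense row whose weights happen to be printed as bare integers would fool this check, so treat the answer as a hint and confirm against the number of topics you trained with.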
from mallet.
I was just bitten by this and lost hours trying to debug my various processing scripts before finding out that the file format has changed. Turns out the last time I used mallet the format was different.
At the moment I'm not even diving into what the new format is or whether it is sufficient, no time right now. I'll just add the --doc-topics-threshold option.
Please — such breaking changes should be documented. The "notes" make no mention of this change (this is where I looked first), neither does the webpage.
from mallet.
I'm sorry to hear that -- I was very reluctant to make the change to dense output for exactly this reason. I agree that it's not documented well enough.
from mallet.
Yes, but the change really is a boon - especially in teaching topic modeling to novices, it makes it so easy and neat to draw graphs of a topic over time (well, over an ordered set of documents) with basic spreadsheet software. If someone could just write a paragraph - easily visible to anyone downloading the new version - explaining the change clearly, I think everyone would be happy.
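As a sketch of how little code the dense layout needs for that kind of graph, the Python below pulls one topic's weight out of every row, in file order, ready to paste into a spreadsheet or plot. It assumes two leading identifier columns followed by weights in topic-ID order, which is what this thread has been assuming rather than something documented. `topic_series` is a hypothetical helper name.

```python
def topic_series(lines, topic, n_id_cols=2):
    """Return (label, weight) for one topic across all rows, in file order.

    Assumption: each row has `n_id_cols` identifier columns, then one
    weight per topic in topic-ID order.
    """
    series = []
    for ln in lines:
        # Skip blank lines and any '#'-prefixed header line.
        if not ln.strip() or ln.startswith("#"):
            continue
        fields = ln.split()
        label = "\t".join(fields[:n_id_cols])
        series.append((label, float(fields[n_id_cols + topic])))
    return series
```

If the documents were imported in chronological order, the second element of each pair is the "topic over time" series.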
from mallet.
I re-read the discussion and I'm not sure what the new format actually is. Is it topic weights ordered by topic, using a dense representation?
from mallet.
Ok, thanks — in that case, it doesn't affect me much, I'll just have to rewrite the processing code. The change could indeed be for the better (I don't care either way, but others do), so I'd just like to ask that in the future such format changes be documented, at least in the release notes. This will help people avoid wasting time hunting non-existing bugs.
from mallet.