This may not be the right place for this question, but the current version of mallet d

topic composition file looks very strange

mimno commented on August 15, 2024

topic composition file looks very strange

from mallet.

Comments (15)

vhulden commented on August 15, 2024 1

Yeah, in the composition file it's the topics in the same order as they appear in the topics file. So if you look at the topics file you have:

0 word word word bla bla bla
1 word bla word bla
2 word bla word bla
etc.

In the topic composition file, it's just the same order, only horizontally, so you'll have

file topic0weight topic1weight topic2weight ...

I usually just manually add a row on top of the composition file with the topic numbers from 0 to however many I'd decided to have, makes it a bit easier to keep track.
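The manual header row described above can also be added with a few lines of script. This is a minimal sketch, not part of MALLET: it assumes the dense file is tab-separated with two leading columns (document index and document name) before the weights, and the function name `add_header` and the sample rows are made up for illustration.

```python
def add_header(lines, id_columns=("doc", "name")):
    """Return the composition rows with a header row prepended.

    Assumes tab-separated rows whose first len(id_columns) fields identify
    the document and whose remaining fields are weights for topic 0, 1, 2...
    """
    rows = [
        ln.rstrip("\n").split("\t")
        for ln in lines
        if ln.strip() and not ln.startswith("#")  # skip blanks and comment lines
    ]
    if not rows:
        return []
    num_topics = len(rows[0]) - len(id_columns)
    header = list(id_columns) + [f"topic{k}" for k in range(num_topics)]
    return ["\t".join(header)] + ["\t".join(r) for r in rows]


# Hypothetical sample rows standing in for a real composition file.
sample = [
    "0\tfile:/a.txt\t0.7\t0.2\t0.1\n",
    "1\tfile:/b.txt\t0.1\t0.3\t0.6\n",
]
with_header = add_header(sample)
print(with_header[0])
```

If your file has only a file name before the weights, pass `id_columns=("name",)` instead.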

jarmoza commented on August 15, 2024

Seconded. Downloaded the latest Mallet yesterday. I had to make a slight alteration to some of my code that reads the doc topics file. My presumption is that the topics remain in order from 0...n, but the topic ID (#) is no longer output.

vhulden commented on August 15, 2024

Thanks for the second -- however, I'm not sure I get what you mean that the topics are in order 0...n. They didn't use to be in order of topic number but in order of weight in the document. What mallet used to produce was "topic proportion topic proportion", like this:

0 file:/path/ocr_10.2307_1840442.txt 16 0.4329549238 2 0.3217675612 18 0.0649332732

So now I don't know what the numbers are supposed to represent in the new version - what topics do they refer to? I'm using the old version (2.0.7) for the moment.
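For comparison, the old "topic proportion" pairs can be expanded into one weight per topic, ordered by topic number. This is only a sketch: `parse_sparse_line` is a made-up helper, the total of 20 topics is an assumption (the example line only shows that topic 18 exists), and splitting on whitespace would break on file names that contain spaces.

```python
def parse_sparse_line(line, num_topics):
    """Expand an old-style 'topic proportion' line into a dense weight list.

    Topics that are not listed on the line are left at 0.0.
    """
    fields = line.split()
    doc_id, name, pairs = fields[0], fields[1], fields[2:]
    weights = [0.0] * num_topics
    # Fields alternate: topic id, proportion, topic id, proportion, ...
    for topic, proportion in zip(pairs[0::2], pairs[1::2]):
        weights[int(topic)] = float(proportion)
    return doc_id, name, weights


old_line = (
    "0 file:/path/ocr_10.2307_1840442.txt "
    "16 0.4329549238 2 0.3217675612 18 0.0649332732"
)
doc_id, name, weights = parse_sparse_line(old_line, num_topics=20)  # 20 is assumed
print(weights[16], weights[2], weights[18])
```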

jarmoza commented on August 15, 2024

Right. So I had forgotten that they were in sorted order like that. My assumption was that with the removal of the topic ID (and as you've pointed out the sorting order), that the order was now just by topic ID. So: [topic 0 weight, topic 1 weight, topic 2 weight....]

commented on August 15, 2024

As it stands, that output is useless... if it was a conscious change, then I think it should be changed back. My version gives "topic" "weight", which is helpful.

I can look into the source code to see what is up...which version are we referring to again?

On Oct 17, 2015, at 9:32 AM, Jonathan Armoza wrote:

Right. So I had forgotten that they were in sorted order like that. My assumption was that with the removal of the topic ID (and as you've pointed out the sorting order), that the order was now just by topic ID. So: [topic 0 weight, topic 1 weight, topic 2 weight....]

vhulden commented on August 15, 2024

2.0.8 (downloaded from mallet website) and 2.0.9 (latest build cloned from github yesterday). Thanks!

dhawaljoh commented on August 15, 2024

Sorry for bumping this, but, does anyone have an answer yet?

Are they just the probabilities of the topics in order? Or?

I just ran MALLET on 800k documents. (12 Hours). I desperately do not want to re-run the training.

christofs commented on August 15, 2024

Just ran the latest version from Github and see the same behavior. I actually welcome this change because the earlier format was more complex to process. Now it is much more like a regular, sortable table or a data frame. Assuming, of course, the weights are indeed saved in ascending order of the topics. A row of headers would help! And it would have been nice if this change had been documented or explained somehow, somewhere. Or maybe it has?

[Edit] I rewrote my topic score extraction script and the new structure saves me a few loops and a lot of time!
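To illustrate why the table-like layout is easier to process, here is a rough sketch that reads dense rows and recovers the old "sorted by weight" view when it is wanted. It assumes tab-separated rows with a document index and a document name before the weights, and that the weights are in ascending topic order, which this thread has not confirmed. `read_dense` and `top_topics` are illustrative names.

```python
def read_dense(lines):
    """Map each document name to its list of topic weights."""
    table = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any comment/header line
        fields = line.rstrip("\n").split("\t")
        # fields[0] is the document index, fields[1] the document name.
        table[fields[1]] = [float(x) for x in fields[2:]]
    return table


def top_topics(weights, n=3):
    """Return the n heaviest (topic id, weight) pairs, largest first."""
    ranked = sorted(enumerate(weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical rows standing in for a real doc-topics file.
rows = [
    "0\tfile:/a.txt\t0.1\t0.6\t0.3\n",
    "1\tfile:/b.txt\t0.5\t0.25\t0.25\n",
]
table = read_dense(rows)
print(top_topics(table["file:/a.txt"], n=2))
```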

mihaiiancu commented on August 15, 2024

Add the option --doc-topics-threshold with a value larger than 0.0 and you will get the 'old style' output with the topic ids and percentages in decreasing order.
This output is OK for visualizing the topics and their distribution across text sources, but I agree with @christofs that the new format is better for processing using a development tool/language.
The issue should be closed since there is no reason to go back.
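For anyone scripting their runs, the option can be added to the `train-topics` call along these lines. This sketch only builds the command and does not run it. The paths `bin/mallet`, `corpus.mallet` and `composition.txt`, the topic count and the threshold value 0.05 are placeholders, and `train_topics_command` is a made-up helper.

```python
import shlex


def train_topics_command(input_file, num_topics, doc_topics_file, threshold=None):
    """Build a `mallet train-topics` argument list.

    With threshold=None the flag is omitted, which gives the dense output
    discussed in this thread. Per the comment above, a value larger than 0.0
    is what brings back the old-style output.
    """
    cmd = [
        "bin/mallet", "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--output-doc-topics", doc_topics_file,
    ]
    if threshold is not None:
        cmd += ["--doc-topics-threshold", str(threshold)]
    return cmd


cmd = train_topics_command("corpus.mallet", 20, "composition.txt", threshold=0.05)
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute it
```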

vhulden commented on August 15, 2024

I did eventually figure it out and agree the new format is better (way easier for visualization over time, for one). But I would really wish that a little note could be added somewhere on the documentation page, as most tutorials were written when the old style format was in effect so people get incorrect information and this is just plain confusing. A lot of non-techy people experiment with MALLET...

On Mon, Mar 13, 2017 at 4:49 PM, mihaiiancu wrote: Add the option --doc-topics-threshold with a value larger than 0.0 and you will get the 'old style' output with the topic ids and percentages in decreasing order. This output is OK for visualizing the topics and their distribution across text sources, but I agree with @christofs that the new format is better for processing using a development tool/language. The issue should be closed since there is no reason to go back.

jwr commented on August 15, 2024

I was just bitten by this and lost hours trying to debug my various processing scripts before finding out that the file format has changed. Turns out the last time I used mallet the format was different.

At the moment I'm not even diving into what the new format is or whether it is sufficient, no time right now. I'll just add the --doc-topics-threshold option.

Please — such breaking changes should be documented. The "notes" make no mention of this change (this is where I looked first), neither does the webpage.

mimno commented on August 15, 2024

I'm sorry to hear that -- I was very reluctant to make the change to dense output for exactly this reason. I agree that it's not documented well enough.

vhulden commented on August 15, 2024

Yes, but the change really is a boon - especially in teaching topic modeling to novices, it makes it so easy and neat to draw graphs of a topic over time (well, over an ordered set of documents) with basic spreadsheet software. If someone could just write a paragraph - easily visible to anyone downloading the new version - explaining the change clearly, I think everyone would be happy.
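The same "one topic across an ordered set of documents" view can be pulled out in a script as well as in a spreadsheet. A small sketch, assuming tab-separated dense rows with a document index and name before the weights, and that the rows are already in the order you want to plot. `topic_series` and the sample rows are illustrative only.

```python
def topic_series(lines, topic):
    """Return (document name, weight) pairs for one topic, in file order."""
    series = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any comment/header line
        fields = line.rstrip("\n").split("\t")
        # Two identifying columns come first, so topic k is at index 2 + k.
        series.append((fields[1], float(fields[2 + topic])))
    return series


# Hypothetical rows, one per document, ordered by year.
rows = [
    "0\tfile:/1900.txt\t0.7\t0.3\n",
    "1\tfile:/1901.txt\t0.4\t0.6\n",
    "2\tfile:/1902.txt\t0.2\t0.8\n",
]
for name, weight in topic_series(rows, topic=1):
    print(name, weight)
```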

jwr commented on August 15, 2024

I re-read the discussion and I'm not sure what the new format actually is. Is it topic weights ordered by topic, using a dense representation?
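One way to check this on your own output is a quick sanity test. If each row really holds one proportion per topic, every row should have the same number of weights and they should sum to roughly 1. This is a heuristic sketch under the same layout assumption as elsewhere in this thread (tab-separated, document index and name first), not an authoritative description of the format. `looks_dense` is a made-up name.

```python
import math


def looks_dense(lines, tol=1e-3):
    """Heuristic: same number of weights per row, each row summing to ~1."""
    widths = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any comment/header line
        fields = line.rstrip("\n").split("\t")
        weights = [float(x) for x in fields[2:]]
        widths.add(len(weights))
        # Old-style rows mix topic ids with proportions, so their sum is off.
        if not math.isclose(sum(weights), 1.0, abs_tol=tol):
            return False
    return len(widths) == 1


dense_rows = ["0\tfile:/a.txt\t0.7\t0.2\t0.1\n", "1\tfile:/b.txt\t0.1\t0.3\t0.6\n"]
sparse_rows = ["0\tfile:/a.txt\t16\t0.43\t2\t0.32\n"]
print(looks_dense(dense_rows), looks_dense(sparse_rows))
```

The tolerance allows for rounding in the printed values. A row that passes still does not prove the columns are ordered by topic number.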

jwr commented on August 15, 2024

Ok, thanks — in that case, it doesn't affect me much, I'll just have to rewrite the processing code. The change could indeed be for the better (I don't care either way, but others do), so I'd just like to ask that in the future such format changes be documented, at least in the release notes. This will help people avoid wasting time hunting non-existing bugs.
