Comments (15)

vhulden avatar vhulden commented on August 15, 2024

Yeah, in the composition file it's the topics in the same order as they appear in the topics file. So if you look at the topics file you have:

0 word word word bla bla bla
1 word bla word bla
2 word bla word bla
etc.

In the topic composition file, it's just the same order, only horizontally, so you'll have

file topic0weight topic1weight topic2weight ...

I usually just manually add a row on top of the composition file with the topic numbers from 0 to however many I'd decided to have, makes it a bit easier to keep track.
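That manual header row can also be added with a few lines of Python. This is only a sketch, and it assumes each line of the composition file is tab-separated with a document index first, then the file name, then one weight per topic in topic-number order. The column names `doc`, `file` and `topicN` are made up for illustration, and the sample rows are invented.

```python
import csv
import io

def add_header(rows):
    """Prepend a header row naming each topic column 0..n-1."""
    # Skip blank rows and any comment line starting with '#'.
    rows = [r for r in rows if r and not r[0].startswith("#")]
    # Assumption: the first two columns are the doc index and the file name.
    num_topics = len(rows[0]) - 2
    header = ["doc", "file"] + [f"topic{k}" for k in range(num_topics)]
    return [header] + rows

# Invented sample standing in for a real composition file.
sample = "0\tfile:/a.txt\t0.7\t0.2\t0.1\n1\tfile:/b.txt\t0.1\t0.1\t0.8\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
table = add_header(rows)
```

For a real file, replace the `io.StringIO(sample)` part with an opened file handle and write `table` back out with `csv.writer`.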

from mallet.

jarmoza avatar jarmoza commented on August 15, 2024

Seconded. Downloaded the latest Mallet yesterday. I had to make a slight alteration to some of my code that reads the doc topics file. My presumption is that the topics remain in order from 0...n, but the topic ID (#) is no longer output.

vhulden avatar vhulden commented on August 15, 2024

Thanks for the second -- however, I'm not sure I get what you mean by the topics being in order 0...n. They didn't use to be in order of topic number but in order of weight in the document. What mallet used to produce was "topic proportion topic proportion", like this:

0 file:/path/ocr_10.2307_1840442.txt 16 0.4329549238 2 0.3217675612 18 0.0649332732

So now I don't know what the numbers are supposed to represent in the new version - what topics do they refer to? I'm using the old version (2.0.7) for the moment.
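For anyone who has to read the old format, here is a rough sketch of turning one such line into a per-topic list. It assumes whitespace-separated fields (so file names without spaces) and that you already know the total number of topics; the 20 below is a guess for illustration only, since the example line shows topic numbers only up to 18. Topics not listed on the line are left at 0.0.

```python
def sparse_to_dense(line, num_topics):
    """Convert 'doc name topic proportion topic proportion ...' to a dense list."""
    fields = line.split()
    doc_id, name, pairs = fields[0], fields[1], fields[2:]
    weights = [0.0] * num_topics
    # Even positions hold topic numbers, odd positions hold their proportions.
    for topic, proportion in zip(pairs[0::2], pairs[1::2]):
        weights[int(topic)] = float(proportion)
    return doc_id, name, weights

line = "0 file:/path/ocr_10.2307_1840442.txt 16 0.4329549238 2 0.3217675612 18 0.0649332732"
# num_topics=20 is an assumption, not something the output line states.
doc_id, name, weights = sparse_to_dense(line, num_topics=20)
```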

jarmoza avatar jarmoza commented on August 15, 2024

Right. So I had forgotten that they were in sorted order like that. My assumption was that with the removal of the topic ID (and as you've pointed out the sorting order), that the order was now just by topic ID. So: [topic 0 weight, topic 1 weight, topic 2 weight....]

 avatar commented on August 15, 2024

As it stands that output is useless...if it was a conscious change then I think it should be changed back. My version gives "topic" "weight" which is helpful.

I can look into the source code to see what is up...which version are we referring to again?

On Oct 17, 2015, at 9:32 AM, Jonathan Armoza wrote:

Right. So I had forgotten that they were in sorted order like that. My assumption was that with the removal of the topic ID (and as you've pointed out the sorting order), that the order was now just by topic ID. So: [topic 0 weight, topic 1 weight, topic 2 weight....]

vhulden avatar vhulden commented on August 15, 2024

2.0.8 (downloaded from mallet website) and 2.0.9 (latest build cloned from github yesterday). Thanks!

dhawaljoh avatar dhawaljoh commented on August 15, 2024

Sorry for bumping this, but, does anyone have an answer yet?

Are they just the probabilities of the topics in order? Or?

I just ran MALLET on 800k documents. (12 Hours). I desperately do not want to re-run the training.

christofs avatar christofs commented on August 15, 2024

Just ran the latest version from Github and see the same behavior. I actually welcome this change because the earlier format was more complex to process. Now it is much more like a regular, sortable table or a data frame. Assuming, of course, the weights are indeed saved in ascending order of the topics. A row of headers would help! And it would have been nice if this change had been documented or explained somehow, somewhere. Or maybe it has?

[Edit] I rewrote my topic score extraction script and the new structure saves me a few loops and a lot of time!

mihaiiancu avatar mihaiiancu commented on August 15, 2024

Add the option --doc-topics-threshold with a value larger than 0.0 and you will get the 'old style' output with the topic ids and percentages in decreasing order.
This output is OK for visualizing the topics and their distribution across text sources, but I agree with @christofs that the new format is better for processing using a development tool/language.
The issue should be closed since there is no reason to go back.
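Since both layouts can now turn up depending on that option, a processing script may want to check which one it has been given. The heuristic below is only a sketch and not anything mallet itself provides: it treats a row as 'old style' when the values after the document index and file name come in pairs whose first member is a plain integer. It assumes file names contain no spaces and that dense weights are always printed with a decimal point or exponent.

```python
def is_sparse_row(line):
    """Guess whether a doc-topics row uses 'topic proportion' pairs."""
    values = line.split()[2:]  # drop doc index and file name
    if not values or len(values) % 2 != 0:
        return False
    # In the old layout every other value is an integer topic number.
    return all(v.isdigit() for v in values[0::2])

# Invented rows for illustration.
old_style = "0 file:/a.txt 16 0.43 2 0.32"
new_style = "0 file:/a.txt 0.43 0.32 0.2 0.05"
```

Checking only the first data row of a file is probably enough, as the layout should not change within one output file.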

jwr avatar jwr commented on August 15, 2024

I was just bitten by this and lost hours trying to debug my various processing scripts before finding out that the file format has changed. Turns out the last time I used mallet the format was different.

At the moment I'm not even diving into what the new format is or whether it is sufficient, no time right now. I'll just add the --doc-topics-threshold option.

Please — such breaking changes should be documented. The "notes" make no mention of this change (this is where I looked first), neither does the webpage.

mimno avatar mimno commented on August 15, 2024

I'm sorry to hear that -- I was very reluctant to make the change to dense output for exactly this reason. I agree that it's not documented well enough.

vhulden avatar vhulden commented on August 15, 2024

Yes, but the change really is a boon - especially in teaching topic modeling to novices, it makes it so easy and neat to draw graphs of a topic over time (well, over an ordered set of documents) with basic spreadsheet software. If someone could just write a paragraph - easily visible to anyone downloading the new version - explaining the change clearly, I think everyone would be happy.
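The same "one topic over an ordered set of documents" series can be pulled out in Python too. A minimal sketch, again assuming tab-separated rows with the document index and file name in the first two columns and topic weights after that, in topic-number order. The file names below are invented.

```python
def topic_series(lines, topic):
    """Return (file name, weight) pairs for one topic, in file order."""
    series = []
    for line in lines:
        # Skip blank lines and any comment line starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumption: weights start at column 2, one column per topic.
        series.append((fields[1], float(fields[2 + topic])))
    return series

# Invented rows for illustration; a real script would iterate over the file.
lines = ["0\tdoc_1900.txt\t0.7\t0.3", "1\tdoc_1901.txt\t0.4\t0.6"]
points = topic_series(lines, topic=1)
```

The resulting pairs can go straight into a plotting library or back out to a spreadsheet.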

jwr avatar jwr commented on August 15, 2024

I re-read the discussion and I'm not sure what the new format actually is. Is it topic weights ordered by topic, using a dense representation?

jwr avatar jwr commented on August 15, 2024

Ok, thanks — in that case, it doesn't affect me much, I'll just have to rewrite the processing code. The change could indeed be for the better (I don't care either way, but others do), so I'd just like to ask that in the future such format changes be documented, at least in the release notes. This will help people avoid wasting time hunting non-existing bugs.
