senderle / topic-modeling-tool

A point-and-click tool for creating and analyzing topic models produced by MALLET.

Home Page: https://senderle.github.io/topic-modeling-tool/documentation/2017/01/06/quickstart.html

License: Apache License 2.0

Java 99.95% CSS 0.05%
topic-modeling mallet digital-humanities data-science text-analytics

topic-modeling-tool's Introduction

Topic Modeling Tool

An updated GUI for MALLET's implementation of LDA.*

New features:

  • Metadata integration
  • Automatic file segmentation
  • Custom CSV delimiters
  • Alpha/Beta optimization
  • Custom regex tokenization
  • Multicore processor support

Getting Started:

To start using some of these new features right away, consult the quickstart guide. For tinkerers, there's a guide to the tool's optional settings. You may also find useful information in the discussion threads under documentation issues.

Requirements:

The Topic Modeling Tool now has native Windows and Mac apps, and because of Unicode issues, these are currently the best options for installation. Just follow the instructions for your operating system. Do not try to install via [Clone or download] > [Download ZIP]; it won't work.

For Macs:

  • Download TopicModelingTool.dmg.
  • Open it by double-clicking.
  • Drag the app into your Applications folder -- or into any folder at all.
  • Run the app by double-clicking.

For Windows PCs:

  • Download TopicModelingTool.zip.
    • NOTE: The native PC build is out-of-date. Help wanted.
  • Extract the files into any folder and open it.
  • Double-click on the file called TopicModelingTool.exe to run it.

If you want to run the plain .jar file, you'll need to have a fairly recent version of Java; the version that came with your computer may not work, especially if your computer is a Mac. Whatever your operating system, you can install an updated version of Java by following the instructions for your operating system here.

Windows Unicode Support:

Windows and Java don't play well together when it comes to Unicode text. If you are using the .jar build and non-ASCII characters are getting garbled on a Windows machine, there's a quick fix involving environment variables that may make things work.

Again, the best answer may just be to use the native app. It should now work correctly at every stage with UTF-8-encoded text. (If it doesn't, let us know and we will moan and gnash our teeth some more.)
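As a quick sanity check, a small Java program (illustrative, not part of the tool) can report which default encoding the JVM actually picked up:

```java
import java.nio.charset.Charset;

// Prints the JVM's default file encoding. If this shows something other
// than UTF-8 on Windows, the JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8
// fix mentioned above is likely needed for the .jar build.
public class EncodingCheck {
    public static void main(String[] args) {
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
    }
}
```

If the two values disagree with what your text files use, garbled output is the expected symptom.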

Reporting and Replicating Bugs and Other Issues:

If you hadn't already guessed, most testing for this tool happens on a Mac. There are bound to be errors happening on other platforms that have slipped through the cracks. We need you to report them so we can keep improving the tool! But we cannot fix a problem that we don't fully understand, so...

When posting a bug report, please include vast amounts of detail.

Copy and paste everything from the tool's console output if you can, tell us your operating system and version, and let us know the other tools you're using to create and view input and output. It also helps if you verify that the bug still exists in the most recent build of the tool (i.e. the one contained in the .jar, .dmg, or .zip files in the root directory).

We know that there are substantial problems with Windows support for unicode text; if you see problems, please post detailed information under the main issue so that we can start isolating and fixing these bugs.

We love getting new issues because it means the tool is improving! But again, when posting a bug report, please include vast amounts of detail.

Building the Development Version:

If you feel adventurous, you might want to modify the code and compile your own version. To do so, you'll need to install Apache Maven as well as the Java Development Kit. On Macs, Homebrew is the best way to do so: simply install Homebrew as described on the Homebrew site, then type brew install maven at the command line. On Windows PCs -- you're on your own! But we did it, and it wasn't terribly hard; you just need an up-to-date JDK and Maven package, with their bin folders in your PATH.

With maven installed, simply use the terminal to navigate to the TopicModelingTool folder:

$ cd topic-modeling-tool/TopicModelingTool

Then use maven's package command:

$ mvn package

We now have experimental support for compiling the tool as a native app using the javafx plugin for maven. This builds a native package for your operating system and has been tested on both Macs and Windows PCs.

$ mvn jfx:native

Acknowledgements:

This version of the tool was forked from the original version by David Newman and Arun Balagopalan.

Previous work on the GUI for MALLET has been supported by a National Leadership Grant (LG-06-08-0057-08) from the Institute of Museum and Library Services to Yale University, the University of Michigan, and the University of California, Irvine. The Institute of Museum and Library Services is the primary source of federal support for the nation's 123,000 libraries and 17,500 museums. The Institute's mission is to create strong libraries and museums that connect people to information and ideas.

Work on this version of the tool has benefited from the support of Penn Libraries and the University of Pennsylvania's Price Lab for Digital Humanities.

topic-modeling-tool's People

Contributors

senderle, xjli865

topic-modeling-tool's Issues

Only English words in the topics

What steps will reproduce the problem?
1. Russian-language texts with several English words, in UTF-8 .txt format, as input.
2. There are only English words in the topics, without any Russian ones.

What version of the product are you using? On what operating system?
I used the latest version from the site on 32-bit Windows XP.

I've attached the text files in the archive.


Original issue reported on code.google.com by [email protected] on 25 Nov 2011 at 4:59

Attachments:

Reenable individual file input

Individual file input is currently disabled. It seems to me to be a low-priority feature, and it's pretty buggy right now. At some point, we should fix those bugs... or remove this feature entirely! Either way, this is a low-priority issue.

Error with character encoding for UTF-8 files

What steps will reproduce the problem?
1. Run TMT with texts in UTF-8 which have words with accented characters, like "é" or "à" -- for example, texts in French.

What is the expected output? What do you see instead?
- I would expect the topic words to include words that have an accented letter. Instead, the topic words are cut off at the accented characters, so "privé" becomes "priv", "était" becomes "tait", and "prêt" becomes "pr" (without the final "t").

What version of the product are you using? On what operating system?
- I'm using the latest version of TMT on Ubuntu 13.10. 
- Note that the procedure works just fine when I use Mallet directly. 

Original issue reported on code.google.com by [email protected] on 9 Dec 2013 at 4:53

Documents requiring MALLET's token-regex option cannot be read due to lack of a token-regex parameter in TMT

What steps will reproduce the problem?
1. Input documents in a non-English script, e.g. Greek.
2. Run TMT

What is the expected output? What do you see instead?

Mallet doesn't understand where a token starts or stops, so the output is just gibberish. I expect the words to be recognised as they are.

What version of the product are you using? On what operating system?

TMT 1.0 on Mac OS 10.9

Please provide any additional information below.

This is easily fixed by adding a token-regex input field in "Advanced options" which is handed down to MALLET.
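To illustrate what such a field buys you, here is a hedged Java sketch of Unicode-aware tokenization using the kind of `\p{L}`-style pattern one would pass through to MALLET; the specific pattern shown is just one reasonable choice, not the tool's behavior:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// An ASCII-oriented tokenizer drops Greek text entirely; a Unicode-aware
// pattern like [\p{L}]+ keeps it. This mirrors what a token-regex value
// handed down to MALLET would accomplish.
public class TokenRegexDemo {
    static List<String> tokenize(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) tokens.add(m.group());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("καλημέρα κόσμε", "[\\p{L}]+"));
        // [καλημέρα, κόσμε]
    }
}
```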

Original issue reported on code.google.com by [email protected] on 10 Dec 2013 at 11:10

Metadata and files in folders work inconsistently

I'm not certain what problems this will cause, but there is an inconsistency in the way this maps metadata rows to files in the corpus folder when auto-segmentation is turned on. Right now, here's what we do:

  1. Read metadata.
  2. Find filenames in metadata.
  3. For each file that actually exists, segment file and create a new metadata row for each file.
  4. Update settings to reflect location of new file segments and segment metadata.
  5. Read through files in input folder (which is now the file segment folder).
  6. Look for the filenames in the metadata (which is now the segment metadata).
  7. When the filename can't be found, create a new row with empty metadata fields.
  8. Pass the files to MALLET.

Step 2 means that files not listed in the metadata won't get processed. The metadata file is the ultimate source of truth here; other files are ignored. I think that's fine as long as it's consistent.

The problem is that it isn't consistent. Steps 5 and 7 mean that in the second part of the workflow, the files themselves are the ultimate source of truth; the metadata file is modified to list files that are in the folder but can't be found.

That means we could get different results depending on whether we do segmentation or not. If there are files in the folder but not in the metadata, then the segmentation process will skip them, and the end result will not include them. But if we don't do segmentation, the first four steps will not be executed; the files will not be skipped, and the results will include them.

I think that it makes the most sense to resolve this problem by treating the files as the ultimate source of truth in both cases, but that means modifying the segmentation metadata based on a listing of the files in the input directory.
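The "files as the ultimate source of truth" resolution might look roughly like this hypothetical Java sketch (the method and argument names are illustrative, not the tool's actual API):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Walk the input folder; take each file's metadata row if one exists,
// otherwise synthesize a row padded with empty fields. Every file in the
// folder gets processed, whether segmentation ran or not.
public class AlignMetadata {
    static List<List<String>> align(File inputDir,
                                    Map<String, List<String>> metadataByName,
                                    int numColumns) {
        List<List<String>> rows = new ArrayList<>();
        File[] files = inputDir.listFiles();
        if (files == null) return rows;          // missing or unreadable dir
        for (File f : files) {
            if (!f.isFile()) continue;
            List<String> row = metadataByName.get(f.getName());
            if (row == null) {                   // file not listed in metadata:
                row = new ArrayList<>();         // create an empty-field row
                row.add(f.getName());
                while (row.size() < numColumns) row.add("");
            }
            rows.add(row);
        }
        return rows;
    }
}
```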

This would have other benefits as well. In fact, maybe we should always preprocess the input file and copy the texts and metadata over to segments even when we aren't segmenting. That takes up some extra disk space but the payoff is that we have a record of the exact input passed to MALLET as well as the exact output. It's tempting to make that an optional behavior but that adds interface complexity and doesn't solve any actual problem, except the disk space one, which -- well -- you probably can't use this on a 10GB corpus anyway, right?

See below for my new preferred solution.

Document UTF-8 corner cases

If you want to have UTF-8 support on Windows machines (and maybe others), you'll need to use the native app, or you'll have to add -Dfile.encoding=UTF-8 to your JAVA_TOOL_OPTIONS environment variable. This should be documented clearly, probably in the quick start guide, and certainly in the longer documentation... once it exists.

Use most recent version of MALLET

Currently the tool uses a specific version of MALLET, 2.0.7; it would be nice to update pom.xml to specify the most recent version indexed by mvn if possible, or at least the currently most recent version (2.0.8), released in 2016 (vs. 2011 for 2.0.7).

I've seen indications from David Mimno that it's best to just use the version on the github master branch (https://github.com/mimno/Mallet) but I'm not 100% sure how to do that using maven, or if that's even possible.

Problems with UTF-8 support for Windows

I may have missed something, but I have a series of files that are encoded UTF-8, but when I run the tool I get all sorts of ASCII characters in my topics (i.e. - "â", etc). I'm wondering if there's a stage in the processing where files are converted to ASCII and then not re-encoded? I could be way off base with this question.

That said, I have gone through my files, ensured they are UTF-8, and done a find and replace for "â" in all the files. If you've come across this issue before I would love to know how you resolved it. At this point I'm thinking of creating an elaborate stop list that excludes common ASCII characters.

Empty files throw off document IDs, rendering output meaningless

If you include empty files in the input directory, stupid assumptions made by buildNtd in CsvBuilder cause the document IDs to fall out of sync. This wreaks havoc on much of the output, making it quite meaningless! (It appears that the metadata file remains correct, but that's about it, sadly.)
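One obvious guard, sketched here as hypothetical Java (not the actual CsvBuilder code), is to filter zero-length files out of the input list before document IDs are assigned:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Drop zero-length files before document IDs are assigned, so the IDs
// stay in sync with the documents MALLET actually sees.
public class SkipEmpty {
    static List<File> nonEmptyFiles(File dir) {
        List<File> keep = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files == null) return keep;
        for (File f : files) {
            if (f.isFile() && f.length() > 0) keep.add(f);
        }
        return keep;
    }
}
```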

The HTML output numbers topics from 1...

The HTML output numbers topics from 1, but the CSV output numbers them from 0. Pick one.

I'm leaning towards 0 to be consistent with the actual MALLET output.

Interfacing with topic-modeling-tool through the command line

What steps will reproduce the problem?
1.
2.
3.

What is the expected output? What do you see instead?
I cannot see any written output apart from the one on the GUI

What version of the product are you using? On what operating system?
The current version. I just downloaded it now

Please provide any additional information below.
I want to use the GUI from the command line and be able to save the outputs with the same file names as the original file names in the folder.

Original issue reported on code.google.com by [email protected] on 16 Mar 2015 at 9:47

Allow custom delimiters

Right now output and metadata input delimiters are hard-coded as commas. That should be configurable.

IndexOutOfBoundsException, no DocX.html files

What steps will reproduce the problem?
1. Press 'Learn Topics' with file testdata_news_fuel_845docs.txt
2. standard options
3.

What is the expected output? What do you see instead?
The OutputHTML\Docs\DocX.html files are not created, and in the Topicindocs.csv file the value for filename is null-source.

Error msg:
java.lang.IndexOutOfBoundsException: Index: 369, Size: 10
    at java.util.ArrayList.rangeCheck(Unknown Source)
    at java.util.ArrayList.get(Unknown Source)
    at cc.mallet.topics.gui.HtmlBuilder.buildHtml2(HtmlBuilder.java:194)
    at cc.mallet.topics.gui.HtmlBuilder.createHtmlFiles(HtmlBuilder.java:293)
    at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener.outputCsvFiles(TopicModelingTool.java:629)
    at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener.runMallet(TopicModelingTool.java:581)
    at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener$1.run(TopicModelingTool.java:446)
Mallet Output files written in C:\TopicModelingTool ---> C:\TopicModelingTool\output_state.gz , C:\TopicModelingTool\output_topic_keys

What version of the product are you using? On what operating system?
Topic Modeling Tool: release date 3 Oct. Windows 7 Home Premium SP1.

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 22 Nov 2011 at 12:18

Handle default stoplists more effectively

Currently, we do a weird thing with the default MALLET stoplist. You see, once you disable it, you can only bring it back by restarting the JVM. Don't ask me why! (Or figure out why, and tell me how, or what I'm doing wrong.)

The best answer to this problem is probably to have our own default stoplists packaged along with everything else required in the native apps. Someday!

Divide Input option is skipping files

As reported in #65. @shawngraham writes:

I went and tried it again, armed with my new knowledge of how it works. In the results, when I opened the metadata.csv, a number of my documents were no longer present; that is to say, no results recorded for them. I had n set for 1000, so I thought perhaps the missing ones were smaller and somehow got folded into the previous 1000-chunk, but no, the missing ones should have been split into three or four chunks at least. So I'm not sure what's going on there... I can't seem to see the commonality between the documents that get dropped.

Stop word list

We are using this tool and wish to add words to the stop word list. Is there anywhere that we can download the stop word list/file so that we can add some words and use the updated list as our stop word list?


Original issue reported on code.google.com by [email protected] on 3 Mar 2014 at 8:51

Too many files open

At times, I see the following error:

java.nio.file.FileSystemException: /Users/enderlej/Desktop/Topic Modeling/output/output_html/Docs/Doc491.html: Too many open files in system
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
	at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
	at java.nio.file.Files.newOutputStream(Files.java:216)
	at java.nio.file.Files.newBufferedWriter(Files.java:2860)
	at cc.mallet.topics.gui.HtmlBuilder.buildHtml2(HtmlBuilder.java:166)
	at cc.mallet.topics.gui.HtmlBuilder.createHtmlFiles(HtmlBuilder.java:283)
	at cc.mallet.topics.gui.TopicModelingTool.outputCsvFiles(TopicModelingTool.java:1265)
	at cc.mallet.topics.gui.TopicModelingTool.runMallet(TopicModelingTool.java:1181)
	at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener$1.run(TopicModelingTool.java:468)

I've looked through the code, and haven't found an obvious point where files aren't being closed properly. However, the logic of the function that does this work (buildHtml2 in HtmlBuilder) is truly clunky and obviously bug-prone. I'll need to investigate further. This is a hard bug to reproduce since it only happens sometimes, even for models with lots of files.

Improve error reporting for bad CSV metadata input

Right now, when metadata CSVs are ill-formed or otherwise confusing to the tool, it just soldiers on, producing results that are sometimes very weird, without any feedback. That's not great! This might be solved by incorporating a proper CSV library. See also #27.

Documentation for Linux

Is it all supposed to build and run well on Linux systems too? Some initial input would be nice before diving in. Perhaps a comment could then be added to the GitHub readme as well, to make this issue/question superfluous.

Thanks in advance for your comments!

Document memory limitations

Adjusting the amount of memory available to the JVM is challenging, especially now that we're supplying a pre-packaged version with its own JVM. When users run into memory limits, the error messages they'll receive are limited and confusing, and the best way to deal with this problem isn't obvious.

There needs to be better documentation laying out the problem and possible solutions. For intrepid readers, there's some basic, easily digestible information on stackoverflow.
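A tiny Java check (illustrative, not part of the tool) reports the ceiling a given launch is working with; for the plain .jar build, raising it is a matter of passing a larger -Xmx flag when starting Java:

```java
// Reports the JVM's maximum heap, which is the limit users hit on large
// corpora. Launching the .jar with a bigger heap, e.g.
//   java -Xmx4g -jar TopicModelingTool.jar
// raises this ceiling; the pre-packaged native apps bundle their own JVM,
// so this only helps the .jar build.
public class MemoryCheck {
    public static void main(String[] args) {
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("JVM max heap: " + maxMb + " MB");
    }
}
```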

exception


<200> LL/token: -8,42566

Total time: 0 seconds
java.lang.IndexOutOfBoundsException: Index: 302, Size: 10
    at java.util.ArrayList.RangeCheck(Unknown Source)
    at java.util.ArrayList.get(Unknown Source)
    at cc.mallet.topics.gui.HtmlBuilder.buildHtml2(HtmlBuilder.java:194)
    at cc.mallet.topics.gui.HtmlBuilder.createHtmlFiles(HtmlBuilder.java:293)
    at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener.outputCsvFiles(TopicModelingTool.java:629)
    at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener.runMallet(TopicModelingTool.java:581)
    at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener$1.run(TopicModelingTool.java:446)
Mallet Output files written in C:\Users\fr\Desktop ---> C:\Users\fr\Desktop\output_state.gz , C:\Users\fr\Desktop\output_topic_keys

Csv Output files written in C:\Users\fr\Desktop\output_csv
Html Output files written in C:\Users\fr\Desktop\output_html

Original issue reported on code.google.com by [email protected] on 25 Nov 2011 at 4:07

"Divide Input" option needs better documentation.

Just a small question regarding the 'divide input into n-word chunks' option in the advanced setting. When I run that on say a 4-gram, I understand what's going on from the point of view of input - but in terms of the output, the topic keywords say are individual words again? A student was asking me this, expecting that the keywords would also be 4-grams, and so I figured, good question...

Thanks! Really appreciate all the work you've done with this tool.

Odd paths

Right now we use absolute paths, but those are ugly and confusing. We want canonical paths!
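For reference, Java already distinguishes the two: a minimal sketch of what switching to canonical paths would mean, assuming nothing about where the tool builds its paths:

```java
import java.io.File;
import java.io.IOException;

// Canonical paths resolve ".", "..", symlinks, and case differences,
// which is what this issue asks for in place of raw absolute paths.
public class Canon {
    public static void main(String[] args) throws IOException {
        File messy = new File("./output/../output");
        System.out.println(messy.getAbsolutePath());   // still contains ".."
        System.out.println(messy.getCanonicalPath());  // cleaned up
    }
}
```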

java.lang.ArrayIndexOutOfBoundsException

Using the most recent version on a Windows 10 machine, I encounter the following issue when running the tool on three .txt files, total file size about 1.3 MB. I successfully ran it to obtain 10 topics; this occurred when asking for 20 topics. (I also managed 15 topics.)

Data loaded.
Coded LDA: 20 topics, 5 topic bits, 11111 topic mask
max tokens: 42609
total tokens: 107964
WorkerRunnable sampling error: 0.03888995966398971 2.343485639836988 8.986229443976581E-5 0.06042141894627366 0.0
WorkerRunnable sampling error: 0.11620675755975436 2.0607369899223156 8.986229101125395E-5 0.06042142994529243 0.08014427339589261
WorkerRunnable sampling error: 0.26999346979251315 2.857595174390058 8.986228930849394E-5 0.06042143646022559 0.22596239726904083
WorkerRunnable sampling error: 0.01690719312819237 0.14520705659155694 8.986228947375488E-5 0.06042143824293064 0.0
WorkerRunnable sampling error: 2.8025697032725247 2.610079791499949 8.986227491090811E-5 0.06042148710515259 2.7610137339330265
type: 108 new topic: -1
10:84 12:76 7:72 5:68 9:60 19:56 18:56 11:56 8:56 4:56 3:56 6:52 2:47 13:44 16:40 14:40 1:40 0:40 17:36 15:24
already running!
java.lang.ArrayIndexOutOfBoundsException: 20
at cc.mallet.topics.WorkerRunnable.sampleTopicsForOneDoc(WorkerRunnable.java:552)
at cc.mallet.topics.WorkerRunnable.run(WorkerRunnable.java:275)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Metadata alignment

In the current version of the tool, it's possible to combine the document-topic proportion output with a metadata file, but it's assumed that the lines in the file correspond perfectly to the individual files in an input folder. This means that for metadata files with records in an order different from the default order of the files in the input folder, or different from the order of rows in the input file, the result will be incorrect.

I need to implement a more robust version of this that checks for alignment between metadata records and files or input file rows. It's not even obvious how to do matching on input file rows, actually -- that might not be a feature worth having. I'll just have to document that the rows must be aligned. But it should be possible to match rows to individual text files based on filename.

Add column headers

Right now, the metadata file output works decently, but is missing headers. It's annoying to have to go back in and edit them by hand! The tool should auto-fill headers with the original metadata headers where applicable, and with topic headwords for topic columns.
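The auto-fill could look something like this hypothetical sketch (names are illustrative; the tool's actual output code is not shown here):

```java
import java.util.ArrayList;
import java.util.List;

// Build an output header row: keep the original metadata headers, then
// add one column per topic labeled with its topic number and headwords.
public class Headers {
    static List<String> build(List<String> metadataHeaders, List<String> topicHeadwords) {
        List<String> out = new ArrayList<>(metadataHeaders);
        for (int i = 0; i < topicHeadwords.size(); i++) {
            out.add(i + ": " + topicHeadwords.get(i));  // e.g. "0: river bank water"
        }
        return out;
    }
}
```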

Exception error when running topic model on folder containing subfolders

What steps will reproduce the problem?
1. Running "Learn Topics" on a file folder that contains multiple folders, each with its own multiple folders (basically trying to look at data two folders beneath the overall folder)
2.
3.

What is the expected output? What do you see instead?
When I run it on a single folder within the main folder, I get actual results. When I try to run it on the main folder that contains the subfolders and their corresponding subfolders, I get the exception errors that others have gotten (I'm not posting them here because they read basically the same as the others who have posted).

An explanation - I am using the Enron email database that was generated and made publicly available after the Enron scandal. Within the overall "maildir" folder are folders for 150 users, each of those users having multiple folders within their emails (inbox, sent, etc.). Running the program on the folder of a single user (e.g., lay-k for Ken Lay's username) produces results. Running it using "maildir" as the input file produces the error. I would like to generate a list of topics based on the overall database without having to flatten the existing folder structure.

What version of the product are you using? On what operating system?
I can't tell what version it is - I just downloaded it from this site a couple of days ago. I am running Windows 7.

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 6 May 2013 at 4:39

Allow [tab] to be entered as a possible delimiter.

We would like to allow any delimiter, as specified by an escape string ('\t', for example). But it's not clear what the best approach is here. A dropdown could be nice, but would not allow total customization. The current system also doesn't work well with standard CSV double-quote escaping. What a mess!

We might just have to add a dependency. I wish CSV were as simple as it seems like it ought to be.
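The escape-string idea could be as small as this sketch; it handles only the escapes shown, and is an assumption about one workable design, not the tool's current behavior:

```java
// Turn user-typed escape strings like "\t" into the real delimiter
// character. Only \t, \n, and \r are handled in this sketch.
public class DelimiterUnescape {
    static String unescape(String s) {
        return s.replace("\\t", "\t")
                .replace("\\n", "\n")
                .replace("\\r", "\r");
    }

    public static void main(String[] args) {
        // The two-character input "\t" becomes a single tab (char 9).
        System.out.println((int) unescape("\\t").charAt(0)); // 9
    }
}
```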

Windows 10 file paths break the tool

For reasons that are not yet entirely clear to me, file paths that start with C: no longer work in Windows 10. Fixing this will require a workaround of some kind because the code that generates the C:-headed files is in MALLET.

Implement auto-chunking

Now that we have basic metadata support, we have a new problem: users might want to break large documents into smaller chunks for better results, but then the metadata file (assuming it's designed to match the full files) won't be correctly aligned with the chunks. I think the most user-friendly approach is to add auto-chunking.

This will involve adding a new parameter, "Chunk size (number of words)", that, when greater than X (see note 1 below), will automatically split the texts into chunks and duplicate metadata rows for each of the chunks.

(The alternative, to join results from each chunk into a single composite document result, seems somehow "dishonest" in a tool like this. If it makes sense to do that, users can take care of that themselves using a spreadsheet.)

1. X = 0 at a minimum, but should probably be larger, since 1-word chunks won't help anybody!
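The proposed splitting step could be sketched like this (illustrative names; the caller would then duplicate the metadata row once per chunk):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Split a document into n-word chunks on whitespace. The final chunk
// may be shorter than n.
public class Chunker {
    static List<String> chunk(String text, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < words.length; i += n) {
            int end = Math.min(i + n, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("a b c d e", 2)); // [a b, c d, e]
    }
}
```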

test data not available

What steps will reproduce the problem?
1. open page in Chrome.
2. Go to line with "here" that is supposed to link to test data download
3. click "here" - nothing happens

What is the expected output? What do you see instead?
I expect to see a hot link on the word "here"; I see only text.


What version of the product are you using? On what operating system?
tried both Chrome and Firefox, Windows 7


Please provide any additional information below.
I used "view page source" to see if the html was bad, but no link appeared.


Original issue reported on code.google.com by [email protected] on 15 Mar 2014 at 5:32

Make the TMT speak dfr-browser

Andrew Goldstone's dfr-browser produces lovely visualizations, and it appears to require only some .json input. It would be nice if the TMT could generate that input.

How to fix issue with memory size

What steps will reproduce the problem?
1. Using the Yelp dataset with 1+ million documents
2. Heap memory size is

What is the expected output? What do you see instead?
It should write the result to the output file

What version of the product are you using? On what operating system?
Recent one

Please provide any additional information below.

Is there any way to fix this issue with memory size / limit when testing with large data?

Original issue reported on code.google.com by [email protected] on 29 Mar 2015 at 10:49

Attachments:

error running topic model gui


Total time: 0 seconds
java.lang.IndexOutOfBoundsException: Index: 284, Size: 10
at java.util.ArrayList.RangeCheck(Unknown Source)
at java.util.ArrayList.get(Unknown Source)
at cc.mallet.topics.gui.HtmlBuilder.buildHtml2(HtmlBuilder.java:194)
at cc.mallet.topics.gui.HtmlBuilder.createHtmlFiles(HtmlBuilder.java:293)
at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener.outputCsvFiles(TopicModelingTool.java:629)
at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener.runMallet(TopicModelingTool.java:581)
at cc.mallet.topics.gui.TopicModelingTool$TrainButtonListener$1.run(TopicModelingTool.java:446)
Mallet Output files written in C:\results ---> C:\results\output_state.gz , C:\results\output_topic_keys


Csv Output files written in C:\results\output_csv
Html Output files written in C:\results\output_html

Original issue reported on code.google.com by [email protected] on 29 Nov 2011 at 5:11

Stable Release 1.0.0 Fails to Install

Clicking on "TopicModelingTool.dmg"
screen shot 2017-08-21 at 12 05 29 pm

Clicking on "TopicModelingTool.jar"
screen shot 2017-08-21 at 12 05 59 pm

Running chmod 777 ./* didn't help. I did build from source which worked fine.

OS: macOS Sierra 10.12.6

--random-seed

Not a lot of people know about MALLET's --random-seed option. Would this be something simple to add?

Find and make test data available

#10 indicates that there was some test data available for download. I'm not certain which data or where it was on the google code repo. I'll dig it out when I get a chance.

Training Error

What steps will reproduce the problem?
1. input the database expected to detect the latent feature
2. input english directory word
3.

What is the expected output? What do you see instead?

in the attached file

What version of the product are you using? On what operating system?
current version

Please provide any additional information below.



Original issue reported on code.google.com by [email protected] on 7 Apr 2013 at 4:53

Contact original creators about relicensing from the EPL

I think it would be best to relicense this tool under the MIT license. I don't see the point in copylefting a pure front-end like this, and a permissive license might draw other developers to the project. But unless the creators are willing to relicense it, we're stuck with this weird, non-standard Eclipse license.

handle inconsistent CSV line lengths

Right now, having a jagged-edged CSV causes bad topic alignment in the output. Pad out incomplete CSV rows. It might make sense to issue a warning, since the input probably has errors.
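The padding step might look roughly like this sketch (illustrative, in-place, with a warning per short row):

```java
import java.util.List;

// Pad jagged CSV rows to a uniform width so downstream topic/metadata
// alignment stays correct; warn, since short rows usually mean the
// input has errors.
public class PadRows {
    static void pad(List<List<String>> rows, int width) {
        for (List<String> row : rows) {
            if (row.size() < width) {
                System.err.println("warning: padding short row: " + row);
                while (row.size() < width) row.add("");
            }
        }
    }
}
```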

Improve documentation

This is a big issue with lots of potential sub-issues. Ideas:

  • Provide details about the required metadata structure: the first column must contain filenames; the filenames in the metadata must match the actual filenames exactly; and they can be extended paths, as long as the files are all in the same folder and the matches are exact.
  • Give more detailed help for the individual fields and options.
  • Add contextual information to the HTML output.
  • Duplicate and extend the existing (but stale and hard-to-find) documentation on the wiki branch. (And then delete that branch!)

Feel free to add to this list.
