
mimno / mallet


MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

Home Page: https://mimno.github.io/Mallet/

License: Other

Languages: Makefile 0.10%, Shell 0.27%, Java 99.42%, HTML 0.15%, Batchfile 0.08%

mallet's People

Contributors

andrewmccallum, attapol, capdevc, casutton, clairew, csavelief, danring, davidsoergel, dependabot[bot], drdub, drevicko, ferschke, hussain7, jonaschn, juharris, lebiathan, liminyao, michaelxh, mihaiiancu, mimno, mkrnr, mucapaz, mwunderlich, napsternxg, nrockweiler, renaud, robchallen, seansouthern, severinsimmler, steftux


mallet's Issues

Failing tests

There are multiple unit tests that fail. They seem to have different root causes and will therefore presumably require multiple issues, but I'll begin here with an enumeration.

  • multiple in cc.mallet.extract.test: if you use a locale with different number formatting (e.g. a comma instead of a period in decimal numbers), the string comparisons fail
  • various in cc.mallet.pipe.tests.TestSpacePipe
  • various in cc.mallet.fst.tests
  • various in cc.mallet.grmm.test

Can anyone confirm these failures? Apart from the localization issues, there are cases in which actual values clearly differ from expected results.

java.lang.ClassCastException processing Japanese data

While trying to build an LDA model from Japanese documents using the following command line:

 Mallet/bin/mallet train-topics --num-topics 5 --input text.vectors --num-threads 4 --optimize-interval 1 --num-iterations 1500 --output-model mymodel.model --evaluator-filename myevali.eval --diagnostics-file diagnostics.txt --num-top-words 5000 --output-topic-keys TOPIC_KEYS.txt --topic-word-weights-file TOPIC_WORD_WEIGHTS.txt --xml-topic-phrase-report XML_TOPIC_PHRASE_REPORT.txt --output-topic-docs TOPIC_DOCS.txt

I get the stack trace:
Exception in thread "main" java.lang.ClassCastException: java.net.URI cannot be cast to java.lang.String
at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1748)
at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:281)

Thoughts?

--gram-sizes for input-file

There is the option --gram-sizes for the command input-dir, but it does not exist for input-file. Why is that so? How can one include 2- and 3-grams from a file as input without creating a huge directory structure just to read in some data?
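For what it's worth, when importing through the Java API instead of the command line, n-grams can be added with the TokenSequenceNGrams pipe as a possible workaround; a minimal sketch, assuming the usual one-instance-per-line "name label text" format and plain letter tokenization:

import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.regex.Pattern;

import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.types.InstanceList;

public class NGramImportSketch {
    public static void main(String[] args) throws Exception {
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        pipeList.add(new CharSequenceLowercase());
        pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipeList.add(new TokenSequenceNGrams(new int[] { 1, 2, 3 }));  // unigrams, bigrams, trigrams
        pipeList.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipeList));
        // one instance per line: "name label text..."
        instances.addThruPipe(new CsvIterator(new FileReader(new File("input.txt")),
                Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"), 3, 2, 1));
        instances.save(new File("input.mallet"));
    }
}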

Fix Javadoc

Multiple Javadoc errors currently make a release on Maven Central impossible:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:2.10.3:javadoc (default-cli) on project mallet: An error has occurred in JavaDocs report generation:
[ERROR] Exit code: 1 - /home/schnober/git/Mallet/src/cc/mallet/cluster/Clusterer.java:38: warning: no @param for trainingSet
[...]

The full error log can be reproduced with mvn javadoc:javadoc.
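The individual fixes are mostly mechanical. For the warning quoted above, documenting the parameter is enough; the signature below is a stand-in for illustration, not copied from the Mallet source:

// Illustrative only: the "no @param for trainingSet" warning disappears once the
// parameter is documented.
abstract class ClustererJavadocExample {
    /**
     * Builds a clustering from the given training instances.
     *
     * @param trainingSet the instances to cluster
     * @return a cluster label for each instance
     */
    abstract int[] cluster(Object[] trainingSet);
}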

ClassCastException with writing top documents per topics

I am encountering the following ClassCastException when I try to write the top documents per topic to a file.

Exception in thread "main" java.lang.ClassCastException: java.net.URI cannot be cast to java.lang.String
        at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1773)
        at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:281)

Steps to Reproduce
The following command will reproduce the error (ap.zip contains the imported feature vectors from the AP dataset):

mallet train-topics --input ap.vectors --num-topics 20 --output-topic-docs error.txt

Solution
The following patch fixes the problem. I will submit a pull request shortly.

diff --git a/src/cc/mallet/topics/ParallelTopicModel.java b/src/cc/mallet/topics/ParallelTopicModel.java
index 287ead6..f9d1317 100644
--- a/src/cc/mallet/topics/ParallelTopicModel.java
+++ b/src/cc/mallet/topics/ParallelTopicModel.java
@@ -1770,11 +1770,11 @@

                int doc = sorter.getID();
                double proportion = sorter.getWeight();
-               String name = (String) data.get(doc).instance.getName();
+               Object name = data.get(doc).instance.getName();
                if (name == null) {
                    name = "no-name";
                }
-               out.format("%d %d %s %f\n", topic, doc, name, proportion);
+               out.format("%d %d %s %f\n", topic, doc, name.toString(), proportion);

                i++;
            }

NPE in CharSequenceLexer.updateMatchText

java.lang.NullPointerException: null
        at cc.mallet.util.CharSequenceLexer.updateMatchText(CharSequenceLexer.java:127) ~[mallet-2.0.7.jar:na]
        at cc.mallet.util.CharSequenceLexer.hasNext(CharSequenceLexer.java:143) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.CharSequence2TokenSequence.pipe(CharSequence2TokenSequence.java:66) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:294) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:290) ~[mallet-2.0.7.jar:na]
        at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:282) ~[mallet-2.0.7.jar:na]

Note: matcher.group() can return null "if the group failed to match part of the input".
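The underlying regex behavior is easy to reproduce outside Mallet; a minimal, self-contained demonstration (not Mallet code) that a successful find() can still yield a null group:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NullGroupDemo {
    public static void main(String[] args) {
        // Group 1 only participates in the match when the input starts with "a".
        Matcher m = Pattern.compile("(a)?b").matcher("b");
        if (m.find()) {
            String g = m.group(1);
            System.out.println(g == null ? "group 1 is null" : g);
            // Calling g.length() here without a null check would throw the
            // NullPointerException seen in the stack trace above.
        }
    }
}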

ParallelTopicModel ArrayIndexOutOfBounds Exception Nr.2

This is another exception, not the same as the one mentioned in issue #42.

In ParallelTopicModel, an ArrayIndexOutOfBoundsException can occur at line 447, because the array is accessed with an index even after the check has already determined that the index is too large. I therefore added a guard. Note that I have not checked for side effects.

Previous

while (targetCounts[targetIndex] > 0 && currentTopic != topic) {
    targetIndex++;
    if (targetIndex == targetCounts.length) {
        logger.info("overflow in merging on type " + type);
    }
    currentTopic = targetCounts[targetIndex] & topicMask;
}
currentCount = targetCounts[targetIndex] >> topicBits;

targetCounts[targetIndex] =
    ((currentCount + count) << topicBits) + topic;

// Now ensure that the array is still sorted by
//  bubbling this value up.
while (targetIndex > 0 &&
        targetCounts[targetIndex] > targetCounts[targetIndex - 1]) {
    int temp = targetCounts[targetIndex];
    targetCounts[targetIndex] = targetCounts[targetIndex - 1];
    targetCounts[targetIndex - 1] = temp;

    targetIndex--;
}

Fixed

while (targetCounts[targetIndex] > 0 && currentTopic != topic) {
    targetIndex++;
    if (targetIndex == targetCounts.length) {
        logger.info("overflow in merging on type " + type);
        break;
    }
    currentTopic = targetCounts[targetIndex] & topicMask;
}

if (targetIndex < targetCounts.length) {
    currentCount = targetCounts[targetIndex] >> topicBits;

    targetCounts[targetIndex] =
        ((currentCount + count) << topicBits) + topic;

    // Now ensure that the array is still sorted by
    //  bubbling this value up.
    while (targetIndex > 0 &&
            targetCounts[targetIndex] > targetCounts[targetIndex - 1]) {
        int temp = targetCounts[targetIndex];
        targetCounts[targetIndex] = targetCounts[targetIndex - 1];
        targetCounts[targetIndex - 1] = temp;

        targetIndex--;
    }
}

Edit: The same error occurs at lines 381 and following; I fixed it the same way.

topic composition file looks very strange

This may not be the right place for this question, but the current version of Mallet does not produce the "#doc name topic proportion" layout in the topic composition file. Instead it gives something like this:

0 file:/(path)/ocr_10.2307_1840442.txt 0.01234185 0.030153958 0.434804596 0.012262633 7.68E-05

I realize that if this were an actual bug you probably would have noticed it, so it more likely has something to do with my setup, but I don't know where else to ask. Someone has asked about this on Stack Overflow, but there are no answers. Thanks.

mallet split broken?

Running mallet split with the command line:

Mallet/bin/mallet split --input file.csv.new2.vectors --training-portion 50.0 --validation-portion 20.0 --testing-file file.csv.new2.vectors_test.vectors --training-file file.csv.new2.vectors_train.vectors

where file.csv.new2.vectors_train.vectors are imported sequence vectors from CSV text.

yields:

$ ./mallet_split.sh file.csv.new2.vectors
Training portion = 50.0
Validation portion = 20.0
Testing portion = -69.0
Prune info gain = 0
Prune count = 0
Prune df = 0
idf range = 0.0-Infinity
Writing instance list to file.csv.new2.vectors_train.vectors

It only writes one file, with training vectors: no test vectors, no validation vectors. Why is the testing portion -69.0?
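A likely explanation (unconfirmed): the portions are interpreted as fractions of the corpus that should sum to at most 1.0, which would make the testing remainder 1.0 - 50.0 - 20.0 = -69.0. Under that assumption the invocation would be:

Mallet/bin/mallet split --input file.csv.new2.vectors --training-portion 0.5 --validation-portion 0.2 --testing-file file.csv.new2.vectors_test.vectors --training-file file.csv.new2.vectors_train.vectors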

Wrong version in POM: 2.0.8

The version in this repository's POM is still 2.0.8, even though 2.0.8 has already been released and master has changed since the release.

The version should be bumped to 2.0.9-SNAPSHOT or 2.1.0-SNAPSHOT.

Transition weight is changed during calculating gradient values. (2.0.7 developer version)

Hi,

While debugging the linear-chain CRF source code (2.0.7 developer version), I noticed something questionable.
The transition weight used to calculate xi in 'cc/mallet/fst/SumLatticeDefault.java' appears to change while the 'expectation' value for the gradient is being calculated at the first iteration (at the first call of the 'linesearch' function).

My training data looks like this:
환자 F
severe ADJ F
fever dictionary T
로 JO F
응급실 F
내원 F

mild ADJ F
chill dictionary T

report.txt
In the 'report.txt' file, the data consists of three rows per entry:
'label transition pair'
'transition weight'
'default weight for expectation update'

The transition weight for (F,F) is 0.65756... at the beginning; however, it changes to 0.47616... (#89)

Include classes to implement Dynamic and Relational Topic Models?

These two extensions to LDA would be quite helpful to have in Mallet. Is anyone planning to add them?

I'm specifically thinking about Blei and Chang's RTM and Blei's Dynamic Topic Model (or recent extensions thereof).

I am currently developing a Mallet implementation of the Topic and Query Likelihood model (Wei and Croft, 2006), with some additional features, and will likely have something to submit as a pull request later this year to add to the menagerie of topic modelling classes.

HierarchicalLDA printNodes numWordsToDisplay ignored

In the printNodes method of the HierarchicalLDA/NCRPNode class, the numWordsToDisplay parameter is ignored when printing the nodes: in the getTopWords method of the NCRPNode inner class, the for-loop bound is hard-coded to 10 instead of using the numWords input (line 695).

Mallet is not thread safe and shares Alphabets unsafely preventing users from even using in a thread confined manner

The Alphabet class has a readResolve method that tries to return "cached" alphabets instead of creating new ones, presumably so that only one instance of an alphabet exists in memory per JVM no matter how many times a model is deserialized. This probably makes sense when deserializing pipes that share alphabet instances. Unfortunately, Alphabets are not thread safe, so running multiple instances of CRFs on different threads blows up when new feature tokens are observed (the underlying Trove map starts rehashing, array index out of bounds, etc.).

I'm trying to run a sequence tagging task with the same pre-trained model across 32 cores and am hitting this. I'll submit a pull request to at least make this thread safe.
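As a possible user-level mitigation in the meantime (an assumption about the workflow, not a guaranteed fix): if the model is only used for tagging, freezing the shared alphabets before handing the pipe to worker threads should at least stop lookups of unseen features from mutating the underlying maps, since Alphabet exposes stopGrowth(). A minimal sketch:

import cc.mallet.pipe.Pipe;

public class FreezeAlphabets {
    // Stop the data and target alphabets from growing before multi-threaded use.
    public static void freeze(Pipe pipe) {
        if (pipe.getDataAlphabet() != null) {
            pipe.getDataAlphabet().stopGrowth();
        }
        if (pipe.getTargetAlphabet() != null) {
            pipe.getTargetAlphabet().stopGrowth();
        }
    }
}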

HierarchicalLDA can't be serialized

Dear Mallet team,

I would like to ask whether there is a design or functional reason for not making the HierarchicalLDA class serializable.
Just like with ParallelTopicModel, after I estimate the topics I would like to be able to serialize the model to disk for future use.

From what I see, just making HierarchicalLDA and NCRPNode implement java.io.Serializable would suffice.

I'd be very grateful if you could provide some explanation!
wojtuch

edit:
Unfortunately I'm unable to check whether this modification affects the rest of the system, as the tests (when building via Maven) fail even on a freshly checked-out project.
Nevertheless, this modification solves my issue and I can serialize the estimated model.
I can also make a pull request with my changes.
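For reference, the save/load round-trip being asked for is plain Java serialization once the classes implement Serializable; a minimal sketch (generic helper, not Mallet API):

import java.io.*;

public class SerializationSketch {
    // Write any Serializable model (e.g. a HierarchicalLDA instance, once it
    // implements java.io.Serializable as proposed above) to disk.
    public static void save(Serializable model, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    // Read the model back for later use.
    public static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }
}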

Can it analyse Chinese?

I recently learned about Mallet. I am an engineer in China, and my boss asked me to find out whether it can analyse Chinese. I found test cases for English, Japanese, and so on, but none for Chinese, so I want to ask.
If it can, what should I do? Do I have to change the base code or add some plugins?
If someone knows, please tell me.
Thanks!

NegativeArraySizeException with large vocabulary

I tried to compute word embeddings with a vocabulary size of 6105270 and a dimensionality of 300, which results in a NegativeArraySizeException at WordEmbeddings.java:100:

weights = new double[numWords * stride];

This seems to be due to an integer overflow because numWords * stride = numWords * 2 * numColumns = 6105270 * 2 * 300 = 3663162000 > 2^31.

The solution seems pretty easy: change the type of WordEmbeddings.numWords from int to long.
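The overflow is easy to verify in isolation. Note that even with the overflow avoided, a single Java array is capped at roughly 2^31 - 1 elements, so an allocation of 3663162000 doubles would still require splitting the weights across several arrays or pruning the vocabulary.

public class OverflowDemo {
    public static void main(String[] args) {
        int numWords = 6105270;
        int stride = 2 * 300;                          // 2 * numColumns
        System.out.println(numWords * stride);         // int arithmetic overflows to a negative value
        System.out.println((long) numWords * stride);  // 3663162000, the intended size
    }
}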

Blank space in input path produces errors in commands interpretation

Hi, I was running the import command with a path that contains blank spaces. I know this is not the best solution, but it is a quick one. I'm posting it here in case another developer needs to solve this quickly, wants to improve it, or in case you want to include the functions in another Mallet class:

I added the following lines at the beginning of the Csv2Vectors file:

PathCommandProcessor cpp = new PathCommandProcessor();
args = cpp.processPathAsCommandArgument(args, "--input");
args = cpp.processPathAsCommandArgument(args, "--output");

And then, this is the class:

package cc.mallet.util;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PathCommandProcessor{

    public static String[] processPathAsCommandArgument(String[] args, String commandId) {

        //If the command is not present, just return the original array
        if(!Arrays.asList(args).contains(commandId))
            return args;

        String fullPath = "";
        boolean stackPath = false;
        int inputIndex = 0;
        int lastIndex = 0;

        //Traverse the arguments for joining the path, despite the spaces
        for(int i=0; i<args.length; i++){

            if(args[i].equals(commandId)){
                stackPath = true;
                inputIndex = i+1;
            } 
            else if(stackPath == true){
                if(args[i].contains("--")){
                    lastIndex = i-1;
                    break;
                }
                fullPath = fullPath + " " + args[i];
            }
        }
        if (inputIndex != 0 && lastIndex == 0) lastIndex = args.length - 1;
        //Remove the first blank space
        if(fullPath.length() > 0) fullPath = fullPath.substring(1, fullPath.length());

        //Replace the first argument of the command for the full path
        if(inputIndex != 0) args[inputIndex] = fullPath;

        //Remove the remaining elements composing the old path (this is required so they are not taken into account)
        if(inputIndex != 0 && lastIndex != 0 && inputIndex != lastIndex) 
            args = removeElementsFromArgument(args, inputIndex+1, lastIndex);

        return args;
    }
    public static String[] removeElementsFromArgument(String[] input, int startIndex, int lastIndex) {

        List<String> result = new ArrayList<String>();
        for(int i=0; i<input.length; i++){
            if(i < startIndex || i > lastIndex){
                result.add(input[i]);
            }
        }

        return result.toArray(new String[0]);
    }
}

Cheers

No way to output model in PAM?

Is there any way to output a serialized model object in PAM? It looks like this may have been possible from line 194 in PAM4L.java, but then it was commented out.

Saving of the output

I don't know which command would let me save my results somewhere else so I can view them after the run completes, especially when I use or run multiple algorithms.

Thanks.

Kaka, O. A
[email protected]

Minkowski distance class is broken

The Minkowski distance class is severely broken. The distance method only works for SparseVectors that have an equal number of nonzero values, which makes no mathematical sense, and the euclideanDistance method almost always runs into an infinite loop.
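For comparison, a reference implementation over dense vectors (plain Java, not Mallet's SparseVector API) shows what the general-p distance should compute:

public class MinkowskiReference {
    // Minkowski distance of order p between two equal-length dense vectors:
    // (sum_i |a_i - b_i|^p)^(1/p). p = 1 gives Manhattan, p = 2 gives Euclidean.
    public static double distance(double[] a, double[] b, double p) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.pow(Math.abs(a[i] - b[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }
}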

FR: Common interface for TopicModel

I'm trying to get into using ParallelTopicModel, HierarchicalLDA, HierarchicalPAM - but they are all very different. Some output with PrintWriter, some with PrintStream, some save state in a readable format, some dump state to a binary format...

What I could really use is a consistent TopicModel interface. Something where

  • setNumThreads throws a NotImplementedException when it doesn't support threads > 1
  • consistent "show topic model in readable form"
  • consistent "save/load model to/from file" (and alphabet!)
  • consistent "estimate N top topic membership likelihoods of new doc"
  • default "what is a sane pipeline of text absorbtion that works 80% of the time"
pipeList.add(new CharSequenceLowercase());
pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
pipeList.add(new TokenSequenceRemoveStopwords(new File("data/en.txt"), "UTF-8", false, false, false));
pipeList.add(new TokenSequenceNGrams(new int[] { 1, 2 }));
pipeList.add(new TokenSequence2FeatureSequence());

It FEELS like this would be easy, but I don't know the guts of the various algorithms well enough to do the "show topic model in readable form" step. (Even nicer if it output JSON, so both people and apps could consume it.)
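To make the request concrete, here is a sketch of the kind of interface meant above (entirely hypothetical; the names and signatures are invented for illustration and do not exist in Mallet):

import java.io.File;
import java.io.IOException;

public interface CommonTopicModel {
    void setNumThreads(int n);                    // throws UnsupportedOperationException if unsupported
    void estimate();                              // train the model
    String describeTopics(int wordsPerTopic);     // human-readable (or JSON) topic summary
    void save(File file) throws IOException;      // model plus alphabet
    void load(File file) throws IOException;
    double[] inferTopicDistribution(String text, int numIterations);  // top topic likelihoods for a new doc
}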

Cannot run inference in parallel from trained model

I have a trained Mallet model that I'd like to use for inference on raw text documents. I can convert these into the vocabulary of the trained model with the import command. However, that command modifies the original mallet file if a pipe was used. In my case, I'm running the inference on multiple machines that share the same NFS mount point. When I try to import documents on each machine, this causes the original mallet file to be corrupted since each process is trying to update the file with the vocabulary from the new document.

It would be great if there were a flag to prevent the code from modifying the original pipe. I think this makes sense for inference anyway, since any unseen vocabulary won't be used by the trained topic model. The current workaround is to copy the mallet file per machine, which is a bit brittle.

Bug in initializeFromState(jFile) method of ParallelTopicModel class

There seems to be a bug in the initializeFromState() method as used by the RTopicModel class. It appears not to read in the beta and alpha values from the state file. This could be due to an old jar in the mallet R package.

The output below comes from the following testthat test suite (where 0.1 is the symmetric prior that was set and 0.0835 is one of the learned alpha priors):
https://github.com/MansMeg/RMallet/blob/master/mallet/tests/testthat/test_mallet-io.R

* `new.doctopic.after.load.prior` not equal to `old.doctopic.prior`.
192540/192540 mismatches (average diff: 0.0303)
[1]  0.1 -  0.0835 == 0.0165
[2]  0.1 -  0.0835 == 0.0165
[3]  1.1 -  1.0835 == 0.0165
[4]  0.1 -  0.0835 == 0.0165
[5]  0.1 -  0.0835 == 0.0165
[6]  0.1 -  0.0835 == 0.0165
[7]  0.1 -  0

Test file hashed.sv.old.ser missing

Several tests fail, among them cc.mallet.types.tests.TestHashedSparseVector.testPlusEqualsFromSaved:

testNoTokenText(cc.mallet.grmm.test.TestGenericAcrfData2TokenSequence): Alphabets don't match: Instance: [null, 2], InstanceList: [0, 2]
testJtConstant(cc.mallet.grmm.test.TestInference): Error reading line:
testContinousSample(cc.mallet.grmm.test.TestFactorGraph): Error reading line:
testContinousSample2(cc.mallet.grmm.test.TestFactorGraph): Error reading line:
testAllFactorsOf(cc.mallet.grmm.test.TestFactorGraph): Error reading line:
testPlusEqualsFromSaved(cc.mallet.types.tests.TestHashedSparseVector): java.io.FileNotFoundException: test/resources/edu/umass/cs/mallet/base/types/hashed.sv.old.ser (No such file or directory)

The file hashed.sv.old.ser is not in the repository.
This has already been reported in the more general issue #54.

Incomplete output of HierarchicalLDATUI

When printing the state from HierarchicalLDATUI the files are truncated at the end.

The solution is to close the writer so that it flushes.

        if (stateFile.value() != null) {
            PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(stateFile.value())));
            try {
                hlda.printState(writer);
            } finally {
                writer.close();
            }
        }

Topic inference problem

Hey David,

When I do topic inference, I thought I didn't need to specify the number of iterations, so here's the command I used:
bin/mallet infer-topics \
--input data.sequence \
--inferencer model.mallet \
--output-doc-topics data.out

Then it gave me this error:
java.lang.NullPointerException
at java.io.File.<init>(File.java:277)
at cc.mallet.topics.tui.InferTopics.main(InferTopics.java:79)
null

I looked at the code and then tried to add the argument below, it works:
--num-iterations 100

When I ran the command with --help, I found that --num-iterations is said to default to 100.
I'm not sure whether something is wrong with my command, but it works when I specify --num-iterations explicitly and fails whenever I leave it out.

featureselector ( Select by threshold.)

{ // Select by threshold.
    for (int i = 0; i < ranking.singleSize(); i++) {
        if (ranking.getValueAtRank(i) > minThreshold)
            fs.add(ranking.getIndexAtRank(i));
    }
}
Hi, this step is very slow when the number of features is very large; I think the sorting is unnecessary here.
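To illustrate the point: selecting by a fixed threshold only needs one linear pass over the scores, so no ranking or sort is required. A generic sketch (plain Java, not the RankedFeatureVector API):

import java.util.ArrayList;
import java.util.List;

public class ThresholdSelect {
    // Returns the indices whose score exceeds minThreshold in O(n), without sorting.
    public static List<Integer> selectByThreshold(double[] scores, double minThreshold) {
        List<Integer> selected = new ArrayList<Integer>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > minThreshold) {
                selected.add(i);
            }
        }
        return selected;
    }
}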

How can i get the original text of a sentence?

Hello,
I have just started with Mallet and am trying some examples.
I would like to know whether I can get back the real sentence text inside a Pipe, using the Instance object.
I have to create custom code that deals with the real text instead of a TokenSequence.

My problem is that I must avoid reconstructing the text by adding a space between tokens, because for example I can have this text:

"Hello! my email is [email protected]"

I tokenize this in:

  • Hello
  • !
  • my
  • email
  • is
  • john
  • @
  • example
  • .
  • com

(it is only an example, just to explain my needs)
So if I try to reconstruct the text with spaces, it will become "Hello ! my email is john @ example . com".

That is not the original sentence. After the reconstruction I will do some checks and then save features on specific tokens.

Thanks!
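One possible approach, assuming the SaveDataInSource pipe is available in your Mallet version (please verify; it is meant to copy the current data into the instance's source field): put it at the front of the pipeline so the raw CharSequence survives tokenization and can later be read back with instance.getSource(). A sketch:

import java.util.ArrayList;
import java.util.regex.Pattern;

import cc.mallet.pipe.*;
import cc.mallet.types.InstanceList;

public class KeepOriginalText {
    public static InstanceList buildInstances() {
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        // Copy the raw text into the source field before it is tokenized,
        // so a downstream custom Pipe can call instance.getSource().
        pipeList.add(new SaveDataInSource());
        pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\S+")));
        pipeList.add(new TokenSequence2FeatureSequence());
        return new InstanceList(new SerialPipes(pipeList));
    }
}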

Bug in getDocumentTopics() method of RTopicModel class

There seems to be a bug in the getDocumentTopics() method of the RTopicModel class: it returns more tokens than there are tokens in the data. The output below comes from the following testthat test suite:
https://github.com/MansMeg/RMallet/blob/master/mallet/tests/testthat/test_mallet-io.R

Error: Test failed: 'load.mallet'
* `new.doctopic.after.load` not equal to `old.doctopic`.
880/192540 mismatches (average diff: 1.01)
[54]   0 - 1 == -1
[342]  1 - 0 ==  1
[4123] 0 - 1 == -1
[4124] 0 - 1 == -1
[4125] 0 - 1 == -1
[4126] 1 - 2 == -1
[4127] 1 - 0 ==  1
[4128] 1 - 0 ==  1
[4132] 1 - 0 ==  1
...

* 680947 not equal to sum(old.doctopic).
1/1 mismatches
[1] 680947 - 681450 == -503

* 680947 not equal to sum(new.doctopic.before.load).
1/1 mismatches
[1] 680947 - 681450 == -503

* 680947 not equal to sum(new.doctopic.after.load).
1/1 mismatches
[1] 680947 - 681450 == -503

Topic inference on Labeled LDA?

I am trying to run Labeled LDA over a bunch of documents, following this blog post. Each of my documents has exactly one label assigned, in the following format:

{doc_id} {label_name} {document} \n

While doing the inference, I am providing a new document with this format:

{doc_id} {document} \n

However, after the inferencer generates the file (using the flag --output-doc-topics), I am supposed to get a file with the heading "#doc name topic proportion ...", which I do, but the second line contains name = 0, topic = some float value, and then the proportions of all the topics I asked for. I am not getting the topic name for this document, nor am I getting the proper probabilities of the topics. I am running Labeled LDA with this command:

bin/mallet run cc.mallet.topics.LabeledLDA --input dump.seq --output-topic-keys dump.keys --output-model dump.model --inferencer-filename dump.inferencer

And then inferring them by this command:

bin/mallet infer-topics --input file.seq --inferencer dump.inferencer --output-doc-topics test.output

Any thoughts on how to infer labeled lda topics?

2.0.8RC3 Incremental training exception

I am trying to use incremental mallet (2.0.8RC3) training as described in http://comments.gmane.org/gmane.comp.ai.mallet.devel/2153

I tried different input data, but in all cases except the one in which I incrementally train with the data already used for the first training, I get the following exception:

Total time: 40 seconds
Topic Evaluator: 100 topics, 7 topic bits, 1111111 topic mask
 Rewriting extended pipe from /mnt/work/data/docs/t1/test_corpus.mallet
  Instance ID = ca0786a3409cc2fb:aba8caa:152ab48f2dc:-7ffa
max tokens: 669
total tokens: 565287
Data loaded.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 14386
    at cc.mallet.topics.ParallelTopicModel.buildInitialTypeTopicCounts(ParallelTopicModel.java:333)
    at cc.mallet.topics.ParallelTopicModel.addInstances(ParallelTopicModel.java:256)
    at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:218)

I am attaching my training script below, together with the input text corpora. I am using Linux Mint.

#!/bin/sh
# update paths for your mallet bin and data folder (where new folders can be created)
export doc_path='/mnt/work/data/docs';
export mallet_bin='/software/mallet-2.0.8RC3/bin/mallet';
export text_corpus1='/mnt/work/data/docs/textcorpus1.txt';
export text_corpus2='/mnt/work/data/docs/textcorpus2.txt';

mkdir -p ${doc_path}/t1
mkdir -p ${doc_path}/t2

# Preparing data:

${mallet_bin} import-file --preserve-case --keep-sequence --remove-stopwords --token-regex '\S+' --input ${text_corpus1} --output ${doc_path}/t1/test_corpus.mallet

# Training a model:

${mallet_bin} train-topics --input ${doc_path}/t1/test_corpus.mallet --num-topics 100 --optimize-interval 10 --num-threads 11 --output-state ${doc_path}/t1/test_state.mallet.gz --output-doc-topics ${doc_path}/t1/adoctopics.txt --output-topic-keys ${doc_path}/t1/atopickeys.txt --num-iterations 1000 --inferencer-filename ${doc_path}/t1/test_inferencer.mallet --evaluator-filename e.eval --optimize-burn-in 100 --random-seed 1 --output-model ${doc_path}/t1/test_mallet_model



# Preparing data using previous pipe + incremental training using previous corpus.mallet:

${mallet_bin} import-file --preserve-case --keep-sequence --remove-stopwords --token-regex '\S+' --input ${text_corpus2} --output ${doc_path}/t2/test_corpus.mallet --use-pipe-from ${doc_path}/t1/test_corpus.mallet

# This line will raise the exception:
${mallet_bin} train-topics --input ${doc_path}/t2/test_corpus.mallet --num-topics 100 --optimize-interval 10 --num-threads 11 --output-state ${doc_path}/t2/test_state.mallet.gz --output-doc-topics ${doc_path}/t2/adoctopics.txt --output-topic-keys ${doc_path}/t2/atopickeys.txt --num-iterations 1000 --inferencer-filename ${doc_path}/t2/test_inferencer.mallet --evaluator-filename e.eval --optimize-burn-in 100 --random-seed 1 --output-model ${doc_path}/t2/test_mallet_model --input-model ${doc_path}/t1/test_mallet_model

textcorpus1.txt
textcorpus2.txt

Thank you for your help and great library.

mallet.bat -- Bad Input in Windows.

Consider the following section of code, which is effectively run in %MALLET_HOME%\bin\mallet.bat:

set CMD=%1
shift

:getArg

if "%1"=="" goto run
set MALLET_ARGS=%MALLET_ARGS% %1
shift
goto getArg

:run
echo "DONE"

If this is saved as, say, "this.bat" on Windows and run as:

this.bat command1 "C:\Users\me\My Directory With Spaces\someotherfile"

it produces the error:
Directory was unexpected at this time.

This prevents MALLET from running. I am not a batch scripting expert and have not been able to work around the error. I am running Windows Server 2012 Datacenter; I have not tested other environments, but I suspect it would fail there too.

HierarchicalLDA class cast exception

The following exception is thrown when trying to run HierarchicalLDATUI with proper arguments:

Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lcc.mallet.topics.HierarchicalLDA$NCRPNode;
at cc.mallet.topics.HierarchicalLDA.samplePath(HierarchicalLDA.java:230)
at cc.mallet.topics.HierarchicalLDA.estimate(HierarchicalLDA.java:135)
at cc.mallet.topics.tui.HierarchicalLDATUI.main(HierarchicalLDATUI.java:109)

Infinite loop in ParallelTopicModel upon an error when multiple threads are used

When ParallelTopicModel runs with multiple threads and encounters errors during processing, the system enters an infinite loop waiting for WorkerRunnables to finish that never will (because they failed). In comparison, when a single thread is used, the method returns to the caller.

This is because when WorkerRunnable catches an exception during processing, it simply prints a stack trace and does not set the isFinished flag. ParallelTopicModel sits in a loop polling the WorkerRunnable and will wait forever because the isFinished flag was never set, and ParallelTopicModel has no other check to see whether an error occurred.
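A sketch of the general pattern that would avoid the hang (field and method names are hypothetical, not the actual WorkerRunnable code): signal completion in a finally block so the polling loop also observes failed workers.

public class WorkerSketch implements Runnable {
    private volatile boolean isFinished = false;   // polled by the supervising thread

    public boolean isFinished() {
        return isFinished;
    }

    @Override
    public void run() {
        try {
            // ... sampling work ...
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            isFinished = true;   // always set, even after an exception, so the caller stops waiting
        }
    }
}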

Mallet cannot be used on Android

I want to use the Winnow algorithm on Android, but it reports an error that java.rmi.dgc.VMID cannot be found. I then tried to add rt.jar from the JDK to the project, but that produces another error. What can I do if I want to use Mallet on Android?

stop words

I am using Mallet to do topic modeling with German texts. When I run the import command with an extra stopwords list, I get the following output:
feedback

how can I fix it?

compareTo method in IDSorter

The compareTo method in IDSorter should take a parameter of type IDSorter instead of Object, to avoid warnings like

Unchecked invocation sort(List<IDSorter>) of the generic method sort(List<T>) of type Collections

when trying to sort a List of IDSorter.
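An illustrative sketch of the suggested change (stand-in code, not the actual Mallet source): implementing Comparable<IDSorter> lets Collections.sort(List<IDSorter>) compile without the unchecked warning.

import java.util.Collections;
import java.util.List;

public class IDSorterSketch implements Comparable<IDSorterSketch> {
    private final int id;
    private final double weight;

    public IDSorterSketch(int id, double weight) {
        this.id = id;
        this.weight = weight;
    }

    public int getID() { return id; }
    public double getWeight() { return weight; }

    @Override
    public int compareTo(IDSorterSketch other) {
        return Double.compare(other.weight, this.weight);  // descending by weight
    }

    public static void sortDescending(List<IDSorterSketch> sorters) {
        Collections.sort(sorters);  // no unchecked-invocation warning
    }
}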

void TopicalNGrams.printDocumentTopics (PrintWriter pw) has empty body

Hi,
while the other printDocumentTopics methods of class TopicalNGrams are implemented as well as the printDocumentTopics methods of other classes, the body of the method mentioned in the subject is not implemented and this is probably not on purpose, but it would need an implementation.
Regards, Frank

Non-English characters in input ignored when running topic model from the command line

Hi,
I am trying to run topic modeling on some segmented Chinese documents.
When I ran the topic modeling example code from the Mallet developer guide page, I found that a lot of Chinese words are ignored/escaped: after I deleted from a document all the Chinese words the program had designated as keywords, the program would not raise any new keywords in the next run, even though the rest of the document still had plenty of content.
I replaced the regex with a different one in this line:
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
and it resolved the issue (the program keeps producing new keywords after old ones are deleted).

However, when I ran topic modeling from the Windows command line (I used train-topics with the --output-topic-keys option), I found that all the Chinese characters in the input files were ignored; only the English words and abbreviations were selected as keywords. I tried with French and Japanese documents: the same thing happened for the Japanese documents, while the French ones were fine.

Note that I am not doing polylingual topic modeling; it's just that the Chinese documents I am working with occasionally contain English abbreviations.
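A plausible explanation, worth verifying: the token regex shown above, \p{L}[\p{L}\p{P}]+\p{L}, only keeps tokens of at least three letters, which discards most segmented Chinese and Japanese words of one or two characters; French words survive because they are longer. If the command-line importer uses a similarly restrictive default, passing a more permissive regex at import time should help, for example:

bin/mallet import-file --input segmented_docs.txt --keep-sequence --token-regex '[\p{L}\p{M}]+' --output docs.mallet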

ParallelTopicModel Exception java.lang.ArrayIndexOutOfBoundsException: -1

Hi,
I have encountered this issue in my project and am wondering whether a fix is planned.
It would be great to be able to use ParallelTopicModel reliably in production.

Copying the description from juanmirocks/mallet#3 from 2014, where @ilyastam wrote:

Every once in a while I see the following exception thrown:

java.lang.ArrayIndexOutOfBoundsException: -1
at cc.mallet.topics.WorkerRunnable.sampleTopicsForOneDoc(WorkerRunnable.java:489)
at cc.mallet.topics.WorkerRunnable.run(WorkerRunnable.java:275)
at cc.mallet.topics.ParallelTopicModel.estimate(ParallelTopicModel.java:874)

When I went to the location of where exception is being thrown, I saw the following code:

            i = -1;
            while (sample > 0) {
                i++;
                sample -= topicTermScores[i];
            }

            newTopic = currentTypeTopicCounts[i] & topicMask;

It appears that sample can sometimes be zero or negative on entry, so the loop body never runs, i stays at -1, and the JVM legitimately throws java.lang.ArrayIndexOutOfBoundsException when it reaches newTopic = currentTypeTopicCounts[-1] & topicMask;

This seems like a bug to me. For my purposes I am patching it as follows:

            i = -1;
            while (sample > 0 || i < 0) {
                i++;
                sample -= topicTermScores[i];
            }

            newTopic = currentTypeTopicCounts[i] & topicMask;

I am not sure about the impact of this on the results, but it seems to fix the immediate problem with the code. It would be great to see a proper fix for this, though.

Unsupervised learning using K-Means is not usable

Since unsupervised learning does not need labels, I assume that all instances should be contained in a single cluster before being parsed. Unfortunately, calling Clusterings2Clusterings generates the following exception:

$ java -cp dist/*:lib/* cc.mallet.cluster.tui.Clusterings2Clusterings --input text.clusterings --training-proportion 0.5 --output-prefix text.clusterings
number clusterings=1
Exception in thread "main" java.lang.IllegalArgumentException: Number of labels must be strictly positive.
    at cc.mallet.cluster.Clustering.<init>(Clustering.java:41)
    at cc.mallet.cluster.util.ClusterUtils.createSingletonClustering(ClusterUtils.java:107)
    at cc.mallet.cluster.tui.Clusterings2Clusterings.createSmallerClustering(Clusterings2Clusterings.java:141)
    at cc.mallet.cluster.tui.Clusterings2Clusterings.main(Clusterings2Clusterings.java:118)

--use-ngrams gone?

When I try the --use-ngrams option, I receive
Unrecognized option 6: --use-ngrams

Procedure for obtaining the topics of a topic-unknown text using Mallet in a Java application

I currently use Eclipse as my IDE, and I want to use Mallet in my Java application to obtain the topics of a text whose topics are unknown. What is the procedure?

Here are my doubts:

  1. Mallet needs to train the topic model first, so how do I train it, and what should the training data be?
  2. How do I use the trained topic model to mine the topics of a topic-unknown text? (A sketch of both steps follows below.)
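A hedged sketch of the usual train-then-infer workflow with ParallelTopicModel and TopicInferencer (method names as used in the Mallet developer guide; please verify against your version):

import java.io.File;
import java.util.Arrays;

import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class TrainAndInfer {
    public static void main(String[] args) throws Exception {
        // 1. Training data: any corpus imported through a pipe, for example with
        //    bin/mallet import-file ... --keep-sequence --output training.mallet
        InstanceList training = InstanceList.load(new File("training.mallet"));

        ParallelTopicModel model = new ParallelTopicModel(50, 1.0, 0.01);  // 50 topics
        model.addInstances(training);
        model.setNumThreads(2);
        model.setNumIterations(1000);
        model.estimate();

        // 2. Inference: pass the new, topic-unknown text through the SAME pipe so it
        //    shares the training alphabet, then sample a topic distribution for it.
        TopicInferencer inferencer = model.getInferencer();
        InstanceList unseen = new InstanceList(training.getPipe());
        unseen.addThruPipe(new Instance("some new text to analyze", null, "doc1", null));
        double[] topicProbabilities = inferencer.getSampledDistribution(unseen.get(0), 100, 1, 5);
        System.out.println(Arrays.toString(topicProbabilities));
    }
}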

various unit tests failing on master

master is failing unit tests under JDK 8.045 on the Mac. I checked out master from GitHub and ran mvn install; it would not install due to failing tests, partially reproduced below (let me know if you want a full log).

Result:

Failed tests: testToXml(cc.mallet.extract.test.TestDocumentExtraction): expected:<...IMAL>quick brown fox[ leapt ]over the laz...> but was:<...IMAL>quick brown fox[ leapt ]over the laz...>
testToXmlBIO(cc.mallet.extract.test.TestDocumentExtraction): expected:<... quick brown[ fox leapt over ]the lazy dog...> but was:<... quick brown[ fox leapt over ]the lazy dog...>
(..)stNestedToXML(cc.mallet.extract.test.TestDocumentExtraction): expected:<...ck brown fox[ leapt over the lazy ]dog
testNestedXMLTokenizationFilter(cc.mallet.extract.test.TestDocumentExtraction): expected:<...ck brown fox[ leapt ]over the <AD...> but was:<...ck brown fox[ leapt ]over the <AD...>
testPunctuationIgnoringEvaluator(cc.mallet.extract.test.TestPerDocumentF1Evaluator): expected:<...t Pred Target(..)
testCost(cc.mallet.fst.tests.TestCRF): Value should be 35770 but is-335344.0
testCostSerialized(cc.mallet.fst.tests.TestCRF): Value should be 35770 but is-335344.0
testAddOrderNStates(cc.mallet.fst.tests.TestCRF): expected:<-167.2234457483949> but was:<-363.62387563108734>
testTokenAccuracy(cc.mallet.fst.tests.TestCRF): expected:<0.9409> but was:<0.9101497504159733>
testIncrement(cc.mallet.fst.tests.TestFeatureTransducer)
testInitialState(cc.mallet.fst.tests.TestFeatureTransducer)
testViterbi(cc.mallet.fst.tests.TestFeatureTransducer)
testCacheExpanding(cc.mallet.grmm.test.TestFactorGraph)
testTrpTreeList(cc.mallet.grmm.test.TestInference): (..)
testSerialization(cc.mallet.grmm.test.TestListVarSet)
testSparseMultiplyLogSpace(cc.mallet.grmm.test.TestLogTableFactor): Tast failed! Expected: [LogTableFactor : (C VAR48361 VAR48362)] Actual: [LogTableFactor : (C VAR48361 VAR48362)]
testSparseDivideLogSpace(cc.mallet.grmm.test.TestLogTableFactor): Tast failed! Expected: [LogTableFactor : (C VAR48363 VAR48364)] Actual: [LogTableFactor : (C VAR48363 VAR48364)]
testSample(cc.mallet.grmm.test.TestNormalFactor): expected:<1.4142135623730951> but was:<2.0603227425388737>
testSparseMultiply(cc.mallet.grmm.test.TestTableFactor): Tast failed! Expected: [TableFactor : (C VAR58803 VAR58804)] Actual: [TableFactor : (C VAR58803 VAR58804)]
testSparseDivide(cc.mallet.grmm.test.TestTableFactor): Tast failed! Expected: [TableFactor : (C VAR58805 VAR58806)] Actual: [TableFactor : (C VAR58805 VAR58806)]
testSample(cc.mallet.grmm.test.TestUniformFactor): expected:<0.25> but was:<0.2395009029334333>
testSample(cc.mallet.grmm.test.TestUniNormalFactor): expected:<1.4142135623730951> but was:<1.436053481288838>
testPipesAreStupid(cc.mallet.pipe.tests.TestPipeUtils): Test failed: Should have generated exception.
testConcatenateBadPipes(cc.mallet.pipe.tests.TestPipeUtils)
testSpacePipe(cc.mallet.pipe.tests.TestSpacePipe): expected:<name: array:0(..)
testReadResolve(cc.mallet.types.tests.TestLabelAlphabet)
testRandomTrained(cc.mallet.types.tests.TestPagedInstanceList): expected:<0.40852575488454707> but was:<0.3428063943161634>

Tests in error:
testSpaceViewer(cc.mallet.extract.test.TestDocumentViewer): cc.mallet.extract.StringTokenization cannot be cast to java.lang.CharSequence
testSpaceViewer(cc.mallet.extract.test.TestLatticeViewer): cc.mallet.extract.StringTokenization cannot be cast to java.lang.CharSequence
testDualSpaceViewer(cc.mallet.extract.test.TestLatticeViewer): cc.mallet.extract.StringTokenization cannot be cast to java.lang.CharSequence
testSpaceMaximizable(cc.mallet.fst.tests.TestMEMM): sy = 83.87438991729655 > 0
testSpaceSerializable(cc.mallet.fst.tests.TestMEMM): sy = 83.87438991729655 > 0
testContinousSample(cc.mallet.grmm.test.TestFactorGraph): Error reading line:(..)
testContinousSample2(cc.mallet.grmm.test.TestFactorGraph): Error reading line:(..)
testAllFactorsOf(cc.mallet.grmm.test.TestFactorGraph): Error reading line:(..)
testFromSerialization(cc.mallet.grmm.test.TestGenericAcrfData2TokenSequence): Alphabets don't match: Instance: [null, 2], InstanceList: 0, 2
testFixedNumLabels(cc.mallet.grmm.test.TestGenericAcrfData2TokenSequence): Alphabets don't match: Instance: [null, 2], InstanceList: 0, 2
testLabelsAtEnd(cc.mallet.grmm.test.TestGenericAcrfData2TokenSequence): Alphabets don't match: Instance: [null, 2], InstanceList: 0, 2
testNoTokenText(cc.mallet.grmm.test.TestGenericAcrfData2TokenSequence): Alphabets don't match: Instance: [null, 2], InstanceList: 0, 2
testJtConstant(cc.mallet.grmm.test.TestInference): Error reading line:(..)
testTwo(cc.mallet.pipe.tests.TestInstancePipe)
testThree(cc.mallet.pipe.tests.TestRainbowStyle): /Users/mroberts/Documents/seer/Dependencies/Mallet/foo/bar is not a directory.
testPlusEqualsFromSaved(cc.mallet.types.tests.TestHashedSparseVector): java.io.FileNotFoundException: test/resources/edu/umass/cs/mallet/base/types/hashed.sv.old.ser (No such file or directory)

Tests run: 341, Failures: 27, Errors: 16, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:47 min
[INFO] Finished at: 2016-03-02T15:04:46-08:00
[INFO] Final Memory: 21M/505M
[INFO] ------------------------------------------------------------------------
