lucene-lda

Use latent Dirichlet allocation (LDA) in Apache Lucene

AUTHOR

Stephen W. Thomas <[email protected]>

DESCRIPTION

lucene-lda allows users to build indexes and perform queries using latent Dirichlet allocation (LDA), an advanced topic model, within the Lucene framework.

lucene-lda was originally developed as part of a research project that compared the performance of the Vector Space Model (VSM), which is Lucene's default IR model, with the performance of LDA. The context was bug localization, where the goal is to determine the similarity between bug reports and source code files. However, lucene-lda is general enough that other contexts can be considered: as long as there are (a) input documents to be searched and (b) queries to be executed.

lucene-lda can work in two different ways:

You have already executed LDA on the input corpus, and you feed the resultant topics and topic memberships to lucene-lda. In this case, lucene-lda will internalize the topics and topic memberships while building the index and executing the queries. (You can even input multiple LDA executions, for example if you have run LDA with different parameters. Here, you specify at query time which set of parameters you would like to use.) Specifically, you need to specify four files for each LDA parameter combination:
- vocab.dat: a Vx1 list of the terms in the corpus.
- words.dat: a KxV matrix (white-space delimited) that specifies the membership of each word in each topic.
- files.dat: a Dx3 matrix (white-space delimited) that lists the original file names that LDA was executed on. The first and third columns are ignored; the second column should contain the file name.
- theta.dat: a DxK matrix (white-space delimited) that specifies the membership of each file in each topic.

In the above, V is the number of terms; K is the number of topics; and D is the number of documents. The order of the terms in vocab.dat should match the order in words.dat; the same is true for the filenames in files.dat and theta.dat.
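Because the four files must agree on V, K, and D, it can help to check their shapes before indexing. The following is a minimal sketch of such a check, not part of lucene-lda: the class LdaInputCheck and its methods are hypothetical names, and the sketch assumes that file names in files.dat contain no white space.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of lucene-lda) that checks the shapes of the four LDA input files. */
final class LdaInputCheck {

    /** Splits white-space delimited text into rows of tokens, skipping blank lines. */
    static List<String[]> parseRows(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                rows.add(trimmed.split("\\s+"));
            }
        }
        return rows;
    }

    /** Throws if any row does not have the expected number of columns. */
    static void requireColumns(String name, List<String[]> rows, int expected) {
        for (int i = 0; i < rows.size(); i++) {
            if (rows.get(i).length != expected) {
                throw new IllegalArgumentException(name + ": row " + i + " has "
                        + rows.get(i).length + " columns, expected " + expected);
            }
        }
    }

    /** Returns {V, K, D} if the four files are consistent with each other; throws otherwise. */
    static int[] checkShapes(String vocab, String words, String files, String theta) {
        List<String[]> vocabRows = parseRows(vocab);
        List<String[]> wordRows = parseRows(words);
        List<String[]> fileRows = parseRows(files);
        List<String[]> thetaRows = parseRows(theta);

        int v = vocabRows.size(); // one term per line in vocab.dat
        int k = wordRows.size();  // one topic per line in words.dat
        int d = fileRows.size();  // one document per line in files.dat

        requireColumns("vocab.dat", vocabRows, 1);
        requireColumns("words.dat", wordRows, v);
        requireColumns("files.dat", fileRows, 3);
        requireColumns("theta.dat", thetaRows, k);
        if (thetaRows.size() != d) {
            throw new IllegalArgumentException("theta.dat has " + thetaRows.size()
                    + " rows, but files.dat lists " + d + " documents");
        }
        return new int[] {v, k, d};
    }
}
```

The method takes the file contents as strings, so a caller would first read each file (for example with java.nio.file.Files) and then pass the text in. The check only covers dimensions; it cannot detect rows that are in the wrong order.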

You have not yet run LDA on the input corpus, and you feed only the raw documents to lucene-lda. In this case, lucene-lda will first execute LDA on the documents (using MALLET), and then build the index using the resultant topics and topic memberships. (NOTE: this scenario is not yet implemented.)

In either case, you can specify at query time if you want to use the VSM model or LDA model for executing a particular query. lucene-lda will then return a ranked list of documents that best match the given query.

lucene-lda assumes that any complicated preprocessing of the documents or queries has already been performed. See [https://github.com/doofuslarge/lscp] for a nice preprocessor.

DESIGN NOTES

The main design goal was to use LDA, not VSM, to compute the similarity between a query and a document. To understand how I achieved this, a bit of background is required:

By default, Lucene uses a slight variant of the Vector Space Model (VSM) to compute the similarity between a query and each document in the index. (There are some bells and whistles that are available, but this is the general idea.) The basic formulation of the similarity comes from the cosine similarity between two vectors: one for the document, and one for the query. The numbers in the vectors are the term weights of each term in the document and query.
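To make the general idea concrete, here is a minimal sketch of the textbook cosine computation over sparse term-weight vectors. It is not Lucene's scoring code: Lucene's practical scoring adds further factors (the bells and whistles mentioned above), and the class name CosineSketch and the map-based vector representation are assumptions made for illustration.

```java
import java.util.Map;

/** Textbook cosine between two sparse term-weight vectors (illustration only, not Lucene's scorer). */
final class CosineSketch {

    /** Euclidean length of a term-weight vector. */
    static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Cosine of the angle between the two vectors; 0.0 if either vector is empty or all zeros. */
    static double cosine(Map<String, Double> query, Map<String, Double> document) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : query.entrySet()) {
            Double other = document.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other; // only shared terms contribute
            }
        }
        double normQuery = norm(query);
        double normDocument = norm(document);
        if (normQuery == 0.0 || normDocument == 0.0) {
            return 0.0;
        }
        return dot / (normQuery * normDocument);
    }
}
```

A query and a document that share no terms score 0, and two vectors pointing in the same direction score 1, regardless of their lengths.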

LDA works very differently. In the LDA model, similarity is computed using conditional probability, which not only involves the terms of the query and document, but also the topics in the query and documents. Basically, we needed a way to store which topics are in each document in Lucene. To do so, we use Payloads to cleverly encode the topics in each document at index time. Then, at query time, we do the following.

1. Determine which topics are in the query, based on the terms in the query.
2. Create a Payload query based on these topics.
3. Lucene will then find all documents that contain these topics.
4. We ignore the actual relevancy returned by Lucene, and instead use the contents of the Payload to compute the relevancy ourselves, and re-rank the results.
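The steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses one common formulation of the conditional probability, P(query | document) as the product over query terms of the sum over topics of phi[k][w] * theta[d][k], which may differ from the exact formula in lucene-lda. The class LdaScoringSketch, the threshold parameter, and the use of plain arrays in place of Lucene Payloads are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative LDA scoring; phi is K x V (as in words.dat), theta is D x K (as in theta.dat). */
final class LdaScoringSketch {

    /** Step 1: topics whose weight for at least one query term exceeds a (hypothetical) threshold. */
    static List<Integer> topicsInQuery(int[] queryTermIds, double[][] phi, double threshold) {
        List<Integer> topics = new ArrayList<>();
        for (int k = 0; k < phi.length; k++) {
            for (int w : queryTermIds) {
                if (phi[k][w] > threshold) {
                    topics.add(k);
                    break;
                }
            }
        }
        return topics;
    }

    /** log P(query | document): sum over query terms of log(sum over topics of phi[k][w] * theta[k]). */
    static double logLikelihood(int[] queryTermIds, double[][] phi, double[] thetaRow) {
        double logP = 0.0;
        for (int w : queryTermIds) {
            double p = 0.0;
            for (int k = 0; k < thetaRow.length; k++) {
                p += phi[k][w] * thetaRow[k];
            }
            logP += Math.log(p); // negative infinity if the document cannot generate the term
        }
        return logP;
    }

    /** Step 4: re-rank the candidate documents by descending log-likelihood. */
    static List<Integer> rank(List<Integer> candidates, int[] queryTermIds,
                              double[][] phi, double[][] theta) {
        List<Integer> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble(
                (Integer d) -> logLikelihood(queryTermIds, phi, theta[d])).reversed());
        return ranked;
    }
}
```

Working in log space avoids underflow when a query has many terms. In the sketch the candidate list stands in for the documents that Lucene returns in step 3.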

Two notes about similarity:

- In the above process, computing the conditional probability is fast, since we compute it only for those documents that contain some of the topics in the query, as opposed to every document in the index.
- We have created an LDAHelper class that holds necessary values related to LDA, such as the theta and phi matrices returned by LDA. These values are necessary to compute the conditional probability, but are impractical to store along with every document in the index. Currently, these values are written to disk during indexing as a separate "LDA index", and then read into memory again at query time. A potential improvement is to add these matrices to the Lucene index somehow, in a space- and time-efficient manner.
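The write-at-index-time, read-at-query-time round trip can be sketched with standard Java serialization. This is not the actual LDAHelper: the class LdaModelSketch and its two fields are assumptions made for illustration. The type check in load() turns a wrong input file into a readable error message instead of a ClassCastException.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for an LDA helper that is serialized to a separate "LDA index" file. */
final class LdaModelSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    final double[][] theta; // D x K: topic membership of each document
    final double[][] phi;   // K x V: membership of each word in each topic

    LdaModelSketch(double[][] theta, double[][] phi) {
        this.theta = theta;
        this.phi = phi;
    }

    /** Index time: write the whole object to disk. */
    void save(Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(this);
        }
    }

    /** Query time: read the object back, rejecting files that hold some other serialized type. */
    static LdaModelSketch load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            Object obj = in.readObject();
            if (!(obj instanceof LdaModelSketch)) {
                throw new IOException(file + " does not contain a serialized LdaModelSketch");
            }
            return (LdaModelSketch) obj;
        }
    }
}
```

The same path must be given at index time and at query time, which is the clumsiness that the potential improvement above would remove.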

USAGE

Use on the command line:

bin/indexDirectory [--help] <inDir> <outIndexDir> <outLDAIndex> [--fileCodes <fileCodes>] [--ldaConfig ldaConfig1,ldaConfig2,...,ldaConfigN ]

bin/queryWithVSM [--help] <indexDir> <queryDir> <resultsDir> [--weightingCode <weightingCode>] [--scoringCode <scoringCode>] 

bin/queryWithLDA [--help] <indexDir> <LDAIndexDir> <queryDir> <resultsDir> [--K <K>] [--scoringCode <scoringCode>]

The above scripts simply call the corresponding Java classes, after setting the classpath as needed.

BUILD AND INSTALLATION

Simply type:

ant jar
ant test

DEPENDENCIES

lucene-lda depends on Apache Lucene, MALLET, Apache Commons, Apache log4j, JSAP, and JUnit. All are included in the lib/ directory.

COPYRIGHT AND LICENCE

lucene-lda's Issues

Unable to run lucene-lda with Lucene 4.1

It throws the exception "Exception in thread "main" java.lang.NoSuchFieldError: LUCENE_35". I then tried changing the version in the source code to 4.1 (I replaced LUCENE_35 with LUCENE_41), but after that I am unable to build against version 4.1. Can you suggest a proper solution?

How to transform the data from mallet to one that can be used by this tool?

We were able to get lucene-lda to compile by removing the lucene-3.0 jar (and leaving the 3.5 jar) from the lib directory.

However, when we tried to run the indexDirectory command on our documents, we observed that, as per the README and the source code, lucene-lda doesn't run MALLET by itself.

So we ran MALLET on the data first and obtained its output. However, lucene-lda doesn't recognize the MALLET output file when we try to run the queryWithLDA command. Does the output need to be in some specific data format?
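As a starting point for such a conversion, here is a minimal sketch that turns MALLET's per-document topic output into the files.dat and theta.dat layouts described in the README. It rests on assumptions that should be checked against the MALLET version in use: each data line is taken to be "docIndex name p_0 ... p_(K-1)" with the proportions in topic order, and lines starting with "#" are treated as headers. Some MALLET releases instead write sorted (topic, proportion) pairs, which would need reordering first. The class MalletDocTopicsSketch is a hypothetical name, and vocab.dat and words.dat are not covered.

```java
import java.util.List;

/** Hypothetical converter from dense MALLET doc-topics lines to files.dat and theta.dat text. */
final class MalletDocTopicsSketch {

    /** Returns a two-element array: {files.dat text, theta.dat text}. */
    static String[] convert(List<String> docTopicLines) {
        StringBuilder files = new StringBuilder();
        StringBuilder theta = new StringBuilder();
        for (String line : docTopicLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and header comments
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 3) {
                throw new IllegalArgumentException("Too few columns: " + line);
            }
            // files.dat is Dx3; only the second column (the file name) is used,
            // so the first and third columns are filled with placeholders.
            files.append(cols[0]).append(' ').append(cols[1]).append(" 0\n");
            // theta.dat is DxK: the K topic proportions of this document.
            for (int k = 2; k < cols.length; k++) {
                if (k > 2) {
                    theta.append(' ');
                }
                theta.append(cols[k]);
            }
            theta.append('\n');
        }
        return new String[] {files.toString(), theta.toString()};
    }
}
```

Because both outputs are built in the same pass, the rows of files.dat and theta.dat stay in matching order, which the README requires. File names that contain white space would break the column split and need separate handling.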

Lucene version conflict between lucene-lda and Solr

The Lucene version in lucene-lda is lucene-core-3.0.2. However, the Lucene version in Solr is lucene-core-5.3.1.

So I replaced lucene-core-3.0.2 with lucene-core-5.3.1, then built and installed lucene-lda. I then ran bin/indexDirectory test outInde outLDAInde and got the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/document/Fieldable
at ca.queensu.cs.sail.lucenelda.IndexDirectory.main(IndexDirectory.java:159)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.document.Fieldable
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more
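The trace shows that org.apache.lucene.document.Fieldable, an interface from the Lucene 3.x API, is not present in the 5.3.1 jar, so swapping the jar alone is not enough. A quick way to see which API generation is on the classpath is to probe for such classes by name. The sketch below uses only the standard Class.forName call; the class name LuceneApiCheck is hypothetical.

```java
/** Hypothetical classpath probe: reports whether a class can be found by name. */
final class LuceneApiCheck {

    /** Returns true if the named class is visible to this class's loader (without initializing it). */
    static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, LuceneApiCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace above.
        String fieldable = "org.apache.lucene.document.Fieldable";
        System.out.println(fieldable + " present: " + isClassPresent(fieldable));
    }
}
```

Running it with the same classpath that bin/indexDirectory sets up shows whether the code and the jars in lib/ belong to the same Lucene generation.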

Complete Test002

The goal of this test is to be very simple: 3 documents, a couple of easy queries, and very intuitive LDA topics. That way, it will be easy to "verify" the query results by hand.

The 3 documents are already there; just need to run LDA to generate the LDA output.

Updates to field indexing

My students in my USC CSCI 572 Search engines class found the following issue had to be dealt with to get this project to work:

chrismattmann@1f6c8bc

Would you be interested in me pushing this upstream? Also what are the chances that we'll get this integrated without having to run LDA outside of this tool? Thank you!

Handle the case of no input file codes

If the --fileCodes option is not set (and hence no filename->integer mapping is provided by the user), we need to create an identity mapping that can be used in the query results. (I.e., instead of outputting (fileCode, relevancyScore) tuples in the output, we should just output (fileName, relevancyScore) tuples.)
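One possible shape for this fallback is sketched below. The class FileCodeSketch and the method labelFor are hypothetical names, and falling back to the file name when a mapping exists but lacks an entry is a design choice of the sketch, not something the issue specifies.

```java
import java.util.Map;

/** Hypothetical helper that picks the label to print next to a relevancy score. */
final class FileCodeSketch {

    /**
     * Returns the user-supplied code for fileName as a string, or fileName itself
     * (the identity mapping) when no mapping was provided or it has no entry for the file.
     */
    static String labelFor(String fileName, Map<String, Integer> fileCodes) {
        if (fileCodes == null || fileCodes.isEmpty()) {
            return fileName; // no --fileCodes given: identity mapping
        }
        Integer code = fileCodes.get(fileName);
        return code != null ? code.toString() : fileName;
    }
}
```

With this helper the result writer can always emit (label, relevancyScore) tuples and does not need a separate code path for the case without file codes.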

Add LDAHelper Object to Lucene Index

Currently, the LDAHelper class (which encapsulates all the LDA functionality) is serialized and written to disk at index time, and then read back again at query time. This is a little clumsy, as it requires the user to specify a filepath for the serialized object at index time, and then regurgitate the same path at query time. It would be easier (and perhaps cleaner) to add all the information in the LDAHelper class to the Lucene index itself. Is this possible? How can we do this?

Integrate with MALLET for on-the-fly LDA computation

One of the much-needed features in lucene-lda is to compute LDA on the fly, for the cases when LDA has not been precomputed on the corpus.

One easy way to do this is to integrate with MALLET:

http://mallet.cs.umass.edu/

MALLET has API calls to run LDA and collect the output. This could all be done in the IndexDirectoryRunLDA.java class.

This may require some changes to the internals of LDAHelper, such as the representation of the matrices (if MALLET returns something different), but should be worth it in the end.

Unable to compile with Lucene_41

I was trying to get lucene-lda to work with the lucene-core-4.10.5 jar.
The build throws compilation errors with lucene-core-4.10.5-SNAPSHOT.jar.

lucene-core-3.5.0.jar and lucene-analyzers-3.5.0.jar were replaced with the following jars in build.xml.

lucene-analyzers-phonetic-4.10.5-SNAPSHOT.jar
lucene-analyzers-kuromoji-4.10.5-SNAPSHOT.jar
lucene-analyzers-common-4.10.5-SNAPSHOT.jar
lucene-core-4.10.5-SNAPSHOT.jar


jar:
    [javac] Compiling 9 source files to /Users/Balaji/Development/LDA/lucene-lda/build/classes
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/LDASimilarity.java:29: error: cannot find symbol
    [javac] import org.apache.lucene.search.DefaultSimilarity;
    [javac]                                ^
    [javac]   symbol:   class DefaultSimilarity
    [javac]   location: package org.apache.lucene.search
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/LDASimilarity.java:31: error: cannot find symbol
    [javac] public class LDASimilarity extends DefaultSimilarity {
    [javac]                                    ^
    [javac]   symbol: class DefaultSimilarity
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/SimpleIndexer.java:10: warning: [deprecation] Index in Field has been deprecated
    [javac] import org.apache.lucene.document.Field.Index;
    [javac]                                        ^
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/SimpleIndexer.java:12: error: cannot find symbol
    [javac] import org.apache.lucene.document.NumericField;
    [javac]                                  ^
    [javac]   symbol:   class NumericField
    [javac]   location: package org.apache.lucene.document
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMQueryAllInDirectory.java:30: error: package org.apache.lucene.queryParser does not exist
    [javac] import org.apache.lucene.queryParser.MultiFieldQueryParser;
    [javac]                                     ^
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMQueryAllInDirectory.java:31: error: package org.apache.lucene.queryParser does not exist
    [javac] import org.apache.lucene.queryParser.QueryParser;
    [javac]                                     ^
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMQueryAllInDirectory.java:56: error: cannot find symbol
    [javac]     private static QueryParser   parser     = null;
    [javac]                    ^
    [javac]   symbol:   class QueryParser
    [javac]   location: class VSMQueryAllInDirectory
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMSimilarity.java:27: error: cannot find symbol
    [javac] import org.apache.lucene.search.DefaultSimilarity;
    [javac]                                ^
    [javac]   symbol:   class DefaultSimilarity
    [javac]   location: package org.apache.lucene.search
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/VSMSimilarity.java:29: error: cannot find symbol
    [javac] public class VSMSimilarity extends DefaultSimilarity {
    [javac]                                    ^
    [javac]   symbol: class DefaultSimilarity
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/IndexDirectory.java:97: warning: [rawtypes] found raw type: Iterator
    [javac]             for (java.util.Iterator errs = config.getErrorMessageIterator(); errs
    [javac]                           ^
    [javac]   missing type arguments for generic class Iterator<E>
    [javac]   where E is a type-variable:
    [javac]     E extends Object declared in interface Iterator
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/IndexDirectoryRunLDA.java:83: warning: [rawtypes] found raw type: Iterator
    [javac]             for (java.util.Iterator errs = config.getErrorMessageIterator(); errs.hasNext();) {
    [javac]                           ^
    [javac]   missing type arguments for generic class Iterator<E>
    [javac]   where E is a type-variable:
    [javac]     E extends Object declared in interface Iterator
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/LDAQueryAllInDirectory.java:121: warning: [rawtypes] found raw type: Iterator
    [javac]             for (java.util.Iterator errs = config.getErrorMessageIterator(); errs
    [javac]                           ^
    [javac]   missing type arguments for generic class Iterator<E>
    [javac]   where E is a type-variable:
    [javac]     E extends Object declared in interface Iterator
    [javac] /Users/Balaji/Development/LDA/lucene-lda/src/ca/queensu/cs/sail/lucenelda/LDAQueryAllInDirectory.java:166: error: no suitable method found for open(Directory,boolean)
    [javac]         reader   = IndexReader.open(dir, true);
    [javac]                               ^
    [javac]     method IndexReader.open(Directory,int) is not applicable
    [javac]       (argument mismatch; boolean cannot be converted to int)
    [javac]     method IndexReader.open(IndexWriter,boolean) is not applicable
    [javac]       (argument mismatch; Directory cannot be converted to IndexWriter)
    [javac]     method IndexReader.open(IndexCommit,int) is not applicable
    [javac]       (argument mismatch; Directory cannot be converted to IndexCommit)

BUILD FAILED
/Users/Balaji/Development/LDA/lucene-lda/build.xml:45: Compile failed; see the compiler error output for details.

Total time: 1 second

Is there some other change that needs to be made that I'm missing?

Getting classcast exception on running bin/queryWithLDA

I ran MALLET on my input data and converted the output to the 4 files required to run LDA.

However, bin/queryWithLDA requires a Lucene index (which can be created with the bin/indexDirectory command) and an LDA index. I substituted the MALLET file for the LDA index, put the 4 files in the query folder, and ran it.

After this I get:
java.lang.ClassCastException: cc.mallet.types.InstanceList cannot be cast to ca.queensu.cs.sail.lucenelda.LDAHelper
at ca.queensu.cs.sail.lucenelda.LDAQueryAllInDirectory.main(LDAQueryAllInDirectory.java:153)
Exception in thread "main" java.lang.NullPointerException
at ca.queensu.cs.sail.lucenelda.LDAQueryAllInDirectory.main(LDAQueryAllInDirectory.java:160)

Any idea on how to proceed with this?

Better Handling of No LDA

I know the whole purpose of lucene-lda is to run Lucene with LDA. However, to make the tool more general and useful, we need to gracefully accept cases when LDA is not desired, and instead only VSM indices need to be built and queried. This basic functionality works now, but we need to make sure we gracefully exit if the LDAHelper is empty.