# GK-LDA (General Knowledge based LDA)

GK-LDA is an open-source Java package implementing the algorithm proposed in the paper (Chen et al., CIKM 2013), created by Zhiyuan (Brett) Chen. For more details, please refer to this paper.

If you use this package, please cite the paper: Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. Discovering Coherent Topics Using General Knowledge. In Proceedings of CIKM 2013, pages 209-218.

If you have any questions or bug reports, please send them to Zhiyuan (Brett) Chen ([email protected]).

## Quick Start

First, clone the repo: `git clone https://github.com/czyuan/GKLDA.git`

Then choose one of two quick start options:

  1. Import the directory into Eclipse (recommended).

     If you get the exception `java.lang.OutOfMemoryError`, increase the Java heap memory for Eclipse (see http://www.mkyong.com/eclipse/eclipse-java-lang-outofmemoryerror-java-heap-space/).

  2. Use Maven

a. Change the current working directory to Src.

cd GKLDA/Src

b. Build the package.

mvn clean package

c. Increase the Java heap memory for Maven.

export MAVEN_OPTS=-Xmx1024m

d. Run the program.

mvn exec:java -Dexec.mainClass="launch.MainEntry"

## Commandline Arguments

The commandline arguments are defined in the file "global/CmdOption.java". If no argument is provided, the program uses the default values. The arguments you are most likely to change are:

  1. -i: the path of the input domains directory.
  2. -know: the path of the input knowledge file.
  3. -o: the path of the output model directory.
  4. -nthreads: the number of threads to use. The program supports multithreading and processes work in parallel.
  5. -nTopics: the number of topics used in the topic model for each domain.
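
To make the "flag, then value, else default" behaviour concrete, here is a minimal sketch of how such arguments could be collected. It is not the package's own parser (that lives in "global/CmdOption.java"); the class name `ArgSketch` and the `defaults` map passed in by the caller are hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: NOT the parser used by GK-LDA (see global/CmdOption.java). */
final class ArgSketch {

    /**
     * Collects "-flag value" pairs into a map, starting from caller-supplied defaults.
     * The defaults are hypothetical placeholders chosen by the caller, not GK-LDA's real defaults.
     */
    static Map<String, String> parse(String[] args, Map<String, String> defaults) {
        Map<String, String> options = new LinkedHashMap<>(defaults);
        for (int i = 0; i < args.length; i += 2) {
            // Every flag must start with '-' and be followed by exactly one value.
            if (!args[i].startsWith("-") || i + 1 >= args.length) {
                throw new IllegalArgumentException(
                        "Expected a flag followed by a value at position " + i);
            }
            options.put(args[i], args[i + 1]);
        }
        return options;
    }
}
```

When launching through Maven as in the Quick Start, arguments are typically passed through the exec plugin's `exec.args` property, for example `-Dexec.args="-nTopics 20 -nthreads 4"` (the values here are illustrative).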

## Input and Output

### Input

The input directory should contain domain files. For each domain, there should be 2 files (both can be opened in a text editor):

  1. domain.docs: each line (representing a document) contains a list of word ids.
  2. domain.vocab: mapping from word id (starting from 0) to word, separated by ":".
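
The two file formats can be illustrated with a small reader. This is a sketch under assumptions, not code from the package: the class `DomainInputSketch` is hypothetical, and the word ids on a document line are assumed to be separated by whitespace, which the description above does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative reader for the domain input files; not part of the GK-LDA package. */
final class DomainInputSketch {

    /** Parses vocab lines of the form "id:word" into an id-to-word map. */
    static Map<Integer, String> parseVocab(List<String> lines) {
        Map<Integer, String> vocab = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;
            // Split on the first ':' only, so a word that itself contains ':' stays intact.
            int sep = line.indexOf(':');
            if (sep < 0) {
                throw new IllegalArgumentException("Missing ':' in vocab line: " + line);
            }
            int id = Integer.parseInt(line.substring(0, sep).trim());
            vocab.put(id, line.substring(sep + 1).trim());
        }
        return vocab;
    }

    /** Parses one document line into word ids. ASSUMPTION: ids are whitespace-separated. */
    static List<Integer> parseDocument(String line) {
        List<Integer> ids = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) ids.add(Integer.parseInt(token));
        }
        return ids;
    }

    /** Maps a document's word ids back to words; unknown ids map to null. */
    static List<String> decode(List<Integer> ids, Map<Integer, String> vocab) {
        List<String> words = new ArrayList<>();
        for (int id : ids) words.add(vocab.get(id));
        return words;
    }
}
```

The lines themselves could be loaded with `java.nio.file.Files.readAllLines` before being passed to these methods.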

The input directory should also contain a knowledge file, in which each line represents a must-set (i.e., a set of words that should appear together under the same topic).
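
As a rough illustration of the knowledge file, the sketch below reads each line into one must-set. The class `MustSetSketch` is hypothetical, and because the separator between words on a line is not stated here, it is taken as a parameter; check an actual knowledge file before choosing it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative must-set reader; not part of the GK-LDA package. */
final class MustSetSketch {

    /**
     * Turns each non-empty line into one must-set of words.
     * ASSUMPTION: the word separator is supplied by the caller as a regex
     * (for example "\\s+"), since the file's delimiter is not documented above.
     */
    static List<Set<String>> parseMustSets(List<String> lines, String delimiterRegex) {
        List<Set<String>> mustSets = new ArrayList<>();
        for (String line : lines) {
            // LinkedHashSet drops duplicate words while keeping their file order.
            Set<String> words = new LinkedHashSet<>();
            for (String token : line.trim().split(delimiterRegex)) {
                if (!token.isEmpty()) words.add(token);
            }
            if (!words.isEmpty()) mustSets.add(words);
        }
        return mustSets;
    }
}
```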

### Output

The output directory contains topic model results for each learning iteration. LearningIteration 0 is always LDA, i.e., without any knowledge. LearningIteration 1 is GK-LDA with the input knowledge. LDA is run first to construct the word correlation metric used by GK-LDA.

Within each learning iteration folder, the sub-folder "DomainModels" contains one folder per domain, holding the topic model results for that domain. Each domain folder contains 6 files (all can be opened in a text editor):

  1. domain.docs: each line (representing a document) contains a list of word ids.
  2. domain.param: parameter settings.
  3. domain.tassign: topic assignment for each word in each document.
  4. domain.twdist: topic-word distribution.
  5. domain.twords: top words under each topic. The columns are separated by '\t' where each column corresponds to each topic.
  6. domain.vocab: mapping from word id (starting from 0) to word.
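
Because domain.twords stores one topic per column, reading the words of a single topic means transposing the rows. The sketch below shows one way to do that; the class `TopWordsSketch` is hypothetical and not part of the package. It treats every row the same way, so if the file begins with a header row, that header would appear as the first entry of each topic.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative reader for the tab-separated top-words layout; not part of GK-LDA. */
final class TopWordsSketch {

    /** Converts tab-separated rows into one word list per topic (one topic per column). */
    static List<List<String>> columnsToTopics(List<String> rows) {
        List<List<String>> topics = new ArrayList<>();
        for (String row : rows) {
            // Limit -1 keeps empty cells so column positions stay aligned.
            String[] cells = row.split("\t", -1);
            for (int t = 0; t < cells.length; t++) {
                if (cells[t].isEmpty()) continue; // skip blank cells and trailing tabs
                while (topics.size() <= t) topics.add(new ArrayList<>());
                topics.get(t).add(cells[t]);
            }
        }
        return topics;
    }
}
```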

## Contact Information

* Author: Zhiyuan (Brett) Chen
* Affiliation: University of Illinois at Chicago
* Research Area: Text Mining, Machine Learning, Statistical Natural Language Processing, and Data Mining
* Email: [email protected]
* Homepage: http://www.cs.uic.edu/~zchen/
