
This project forked from larsmaaloee/deep-belief-nets-for-topic-modeling


This repository is a proof of concept toolbox for using Deep Belief Nets for Topic Modeling in Python.



Deep Belief Nets for Topic Modeling

Python toolbox using deep belief nets (DBNs) for topic modeling on document data. The method loads bag-of-words (BOW) representations of the documents and produces a strong latent representation that can then be used for a content-based recommender system.

The toolbox was written for an M.Sc. thesis project. For a shorter read, we urge you to read the article Deep Belief Nets for Topic Modeling, accepted at the ICML 2014 workshop Knowledge-Powered Deep Learning for Text Mining (KPDLTM).

The toolbox has been tested on Windows 7, Ubuntu 14.04.1 and OSX 10.8-10. The following prerequisite packages must be installed on your system before running the toolbox: nltk, numpy, scipy, scikit-learn and matplotlib. If you are interested in producing 3D plots of the output space, you will also need to install MENCODER and FFMPEG (only tested on OSX).
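As a quick sanity check before running the toolbox, the prerequisites can be probed from Python. This is a minimal sketch and not part of the toolbox itself; the function name `missing_packages` is made up for illustration. It only checks that the packages can be located, not which versions are installed.

```python
import importlib.util

# Import names for the prerequisites listed above. Note that the
# scikit-learn distribution is imported under the name "sklearn".
REQUIRED = ["nltk", "numpy", "scipy", "sklearn", "matplotlib"]


def missing_packages(names=REQUIRED):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All prerequisite packages were found.")
```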

PCA on the output of 6 categories from the 20 newsgroups dataset run on a 2000-500-250-125-10 (real output) DBN.

Running the Toolbox

In the main.py Python file you will find three examples of how to run the toolbox:

Example 1

In order to run this example you will need to download the 20 Newsgroup dataset 20news-bydate.tar.gz from http://qwone.com/~jason/20Newsgroups/ and place the unpacked dir "20-news-bydate" in the "./input" dir.

The execution order is as follows:

  • Stem the documents in the training and test set.
  • Initialise the data processing module and generate the BOWs for the training and test set.
  • Initialise the DBN (shape: 2000-500-500-128 binary output units) and pretrain followed by finetuning for 50 epochs each.
  • Evaluate the accuracy of the trained network by performing a forward pass of the test set and comparing the nearest neighbors in the output space.
  • Visualise the test set on 6 categories using PCA.
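The BOW step above can be sketched as follows. This is an illustrative stand-in rather than the toolbox's own data processing module: the function names are hypothetical, the documents are assumed to be stemmed and tokenised already, and it is an assumption that the 2000 input units of the DBN correspond to a vocabulary of the 2000 most frequent words.

```python
from collections import Counter


def build_vocabulary(tokenised_docs, size=2000):
    """Pick the `size` most frequent tokens across all documents."""
    counts = Counter(tok for doc in tokenised_docs for tok in doc)
    return [tok for tok, _ in counts.most_common(size)]


def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(doc)
    return [counts[tok] for tok in vocabulary]


# Toy corpus standing in for stemmed newsgroup posts.
docs = [
    ["space", "shuttle", "launch", "space"],
    ["hockey", "game", "goal"],
    ["space", "orbit", "launch"],
]
vocab = build_vocabulary(docs, size=3)
bows = [bag_of_words(doc, vocab) for doc in docs]
```

Each row of `bows` has one count per vocabulary word, so with `size=2000` a row would match the width of the DBN input layer.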

Example 2

In order to run this example you will need to download the 20 Newsgroup dataset 20news-18828.tar.gz from http://qwone.com/~jason/20Newsgroups/ and place the unpacked dir "20news-18828" in the "./input" dir.

The execution order is as follows:

  • Stem the documents in the data set.
  • Initialise the data processing module and generate the BOWs for the training (70% of the docs) and test set (30% of the docs).
  • Initialise the DBN (shape: 2000-500-250-125-10 real output units) and pretrain followed by finetuning for 50 epochs each.
  • Evaluate the accuracy of the trained network by performing a forward pass of the test set and comparing the nearest neighbors in the output space.
  • Visualise the test set on 6 categories using PCA.
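The 70%/30% split used in this example could look roughly like the sketch below. How the toolbox actually selects the documents is not stated here, so the shuffled split, the fixed seed and the name `split_documents` are assumptions made for illustration.

```python
import random


def split_documents(doc_ids, train_fraction=0.7, seed=0):
    """Shuffle the document ids and cut them into a training and a test part."""
    shuffled = list(doc_ids)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train_ids, test_ids = split_documents(range(10))
```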

Example 3

The "./output" dir contains a compressed file "_20news-19997.zip". It holds the output files from running the DBN (shape: 2000-500-250-125-10 real output units) on 20news-19997.tar.gz from http://qwone.com/~jason/20Newsgroups/ with 50 epochs each of pretraining and finetuning. Unzip the compressed chunks by running the shell script "output/_unzip.sh".

The execution order is as follows:

  • Evaluate the accuracy of the trained network by performing a forward pass of the test set and comparing the nearest neighbors in the output space.
  • Visualise the test set on 6 categories using PCA.
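To make the evaluation step concrete, here is a minimal sketch of a nearest-neighbor accuracy measure on output vectors. It assumes that accuracy means the average fraction of a document's k nearest neighbors (Euclidean distance) that share its category; the toolbox's own evaluation code may differ in detail, and `neighbour_accuracy` is a hypothetical name.

```python
import math


def neighbour_accuracy(outputs, labels, k=1):
    """Average fraction of the k nearest neighbours sharing the query's label."""
    total = 0.0
    for i, query in enumerate(outputs):
        # Distances from the query to every other document in the output space.
        others = [(math.dist(query, other), j)
                  for j, other in enumerate(outputs) if j != i]
        nearest = sorted(others)[:k]
        same = sum(1 for _, j in nearest if labels[j] == labels[i])
        total += same / len(nearest)
    return total / len(outputs)


# Toy 2-D output vectors standing in for the DBN output units.
outputs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.2, 0.1)]
labels = ["sci", "sci", "rec", "rec", "rec"]
accuracy = neighbour_accuracy(outputs, labels, k=1)
```

In this toy data the last document lies among the "sci" points, so its nearest neighbor has the wrong category and lowers the score.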

Running the toolbox on other datasets

The toolbox applies to all text datasets as long as the execution order is followed (cf. Examples 1 and 2):

  • Stem documents.
  • Generate BOWs.
  • Initialise the DBN.
  • Pretrain.
  • Finetune.
  • Evaluate and/or visualise.

Please note that many of the learning parameters are hardcoded into the pretraining and finetuning modules. The current setting has proven to work on various datasets.

During execution all data is saved to the hard drive, which slows down execution but eliminates out-of-memory errors. It also lets the analyst resume training from an arbitrary point, even with different parameters.
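The idea of saving state to disk and resuming later can be illustrated with a small checkpointing sketch. It does not reproduce the toolbox's file layout: the pickle format, the file name `dbn_state.pkl` and the placeholder weight update are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if nothing has been saved yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)


def train(path, epochs, learning_rate):
    """Run up to `epochs` epochs, continuing from the checkpoint if present."""
    state = load_checkpoint(path, {"epoch": 0, "weights": [0.0]})
    for epoch in range(state["epoch"], epochs):
        # Placeholder update standing in for one real training epoch.
        state["weights"] = [w + learning_rate for w in state["weights"]]
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state


checkpoint = os.path.join(tempfile.mkdtemp(), "dbn_state.pkl")
first = train(checkpoint, epochs=3, learning_rate=0.1)
# Resume from epoch 3 with a different learning rate.
resumed = train(checkpoint, epochs=5, learning_rate=0.5)
```

Saving after every epoch costs disk I/O, which mirrors the slowdown mentioned above, but an interrupted run loses at most one epoch of work.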

Acknowledgements

(cf. the article or M.Sc. thesis mentioned in the beginning for proper citations to the literature used to realize this toolbox.)

  • Geoffrey Hinton and Ruslan Salakhutdinov's work on DBNs for dimensionality reduction, restricted Boltzmann machines and replicated softmax models.
  • Roland Memisevic's Python interpretation of Carl Edward Rasmussen's conjugate gradient script.

Note from author

Please do not hesitate to get in touch or contribute if you find any errors or have ideas. Enjoy.

Best regards

Lars Maaloee, PhD student, Technical University of Denmark
