This project forked from mrorii/github_lda

GitHub LDA - Collaborative Topic Modeling for Recommending GitHub Repos

Home Page: http://www.cs.cmu.edu/~norii/pub/github-ctr.pdf

Collaborative Topic Modeling for Recommending GitHub Repos

What is GitHub LDA?

GitHub LDA is a library that applies topic modeling to GitHub repositories to improve repository recommendation.

Usage

Download the GitHub Contest dataset

wget https://github.s3.amazonaws.com/data/download.zip
unzip download.zip

Clone the git repositories from GitHub

mkdir repo_dir
github_lda clone -i download/repos.txt -o repo_dir [-p 4]

There are around 120,000 repositories to download, so this step takes a very long time and can consume up to 1 TB of disk space. Use the -p option to set the number of clones to run in parallel. To stay under the limit on the number of subdirectories per directory in *nix filesystems, the repositories are split by default into 13 subdirectories, as follows:

repo_dir
|---0
|   |---1
|   |---2
|   |---...
|   `---9999
|---1
|   |---10000
|   |---10001
|   |---...
.
|   `---119999
`---12
    |---120000
    |---120001
    |---...
    `---123344
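
The layout above suggests that each repository lands in the subdirectory given by its numeric ID divided by 10,000 (integer division). The sketch below illustrates that mapping. The bucket size and the `repo_path` helper are assumptions inferred from the tree, not taken from the github_lda source.

```python
from pathlib import PurePosixPath

# Assumption inferred from the directory tree: 10,000 IDs per subdirectory.
REPOS_PER_DIR = 10_000


def repo_path(root: str, repo_id: int) -> PurePosixPath:
    """Return root/<bucket>/<repo_id>, where bucket = repo_id // REPOS_PER_DIR."""
    if repo_id < 1:
        raise ValueError("repo_id must be a positive integer")
    bucket = repo_id // REPOS_PER_DIR
    return PurePosixPath(root) / str(bucket) / str(repo_id)


print(repo_path("repo_dir", 1))       # repo_dir/0/1
print(repo_path("repo_dir", 10000))   # repo_dir/1/10000
print(repo_path("repo_dir", 123344))  # repo_dir/12/123344
```

With the highest ID in the tree being 123344, this mapping yields buckets 0 through 12, which matches the 13 subdirectories shown.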

Calculate the term frequency for each repository

mkdir term_freq_dir
github_lda calctf -i repo_dir -o term_freq_dir [--stopwords=/path/to/stopwords] [--lang=ruby,javascript] [--process=1]

Use the --lang option to restrict the calculation to repositories in specific languages. By default, term frequencies are calculated for source files in all programming languages. See the project's list of available language options for accepted values. Use the --process option to set the number of processors to run on.
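
To make the idea of a per-repository term frequency concrete, here is a minimal sketch in Python. It is not the calctf implementation: the tokenizer (identifier-like words, lowercased), the stopword handling, and the `term_frequencies` name are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumption: a "term" is an identifier-like token; the real tool may tokenize differently.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def term_frequencies(text, stopwords=frozenset()):
    """Count how often each lowercased token occurs, skipping stopwords."""
    tokens = (match.group(0).lower() for match in TOKEN_RE.finditer(text))
    return Counter(token for token in tokens if token not in stopwords)


source = "def foo(bar): return foo(bar) + bar"
tf = term_frequencies(source, stopwords={"def", "return"})
print(dict(tf))
```

Counting a whole repository would amount to summing these counters over its source files, which `Counter` supports directly with `+` or `update`.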

Preprocess the corpus and convert it into lda-c and ctr format data

Generate mult.dat, user.dat, item.dat, and vocab.dat in the specified directory

mkdir data
github_lda generate --tf term_freq_dir -i download/data.txt -o data
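
For orientation, the lda-c input format stores one document per line as the number of distinct terms followed by `term_id:count` pairs. The sketch below shows how term counts could be turned into such lines. The `build_vocab` and `to_ldac_line` helpers, the sorted vocabulary order, and the 0-based term IDs are assumptions for illustration, not a description of what github_lda generate does internally.

```python
def build_vocab(docs):
    """Assign an integer ID (0-based, an assumption) to each distinct term, in sorted order."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: index for index, term in enumerate(terms)}


def to_ldac_line(doc, vocab):
    """Format one document as '<num_distinct_terms> <id>:<count> <id>:<count> ...'."""
    pairs = sorted((vocab[term], count) for term, count in doc.items() if count > 0)
    return " ".join([str(len(pairs))] + [f"{term_id}:{count}" for term_id, count in pairs])


# Hypothetical term counts for two repositories.
docs = [{"foo": 2, "bar": 3}, {"bar": 1, "baz": 4}]
vocab = build_vocab(docs)
for doc in docs:
    print(to_ldac_line(doc, vocab))
```

In this sketch, a vocabulary file would list the terms one per line in ID order, so that line numbers and term IDs agree.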

Run lda-c-dist

mkdir lda-result
lda est 0.1 100 settings.txt data/mult.dat random lda-result
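
In lda-c's `lda est` usage, the first two arguments are the initial alpha and the number of topics, so the command above fits 100 topics with an initial alpha of 0.1. The resulting final.gamma file is later passed to ctr as the initial topic proportions. As a rough way to inspect it, the sketch below normalizes each row so it sums to 1. It assumes final.gamma has one line per document with one whitespace-separated positive value per topic; the `topic_proportions` helper is hypothetical.

```python
def topic_proportions(gamma_text):
    """Normalize each row of gamma values into proportions that sum to 1."""
    rows = []
    for line in gamma_text.splitlines():
        values = [float(field) for field in line.split()]
        if not values:
            continue  # skip blank lines
        total = sum(values)
        rows.append([value / total for value in values])
    return rows


# Toy input with two documents and two topics (made-up numbers).
sample = "1.0 3.0\n2.0 2.0\n"
for row in topic_proportions(sample):
    print(row)
```

To apply this to real output, the text could be read with `open("lda-result/final.gamma").read()`.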

Run ctr

ctr --user data/user.dat --item data/item.dat --mult data/mult.dat \
  --theta_init lda-result/final.gamma --beta_init lda-result/final.beta

Resources

References

Chong Wang and David M. Blei. 2011. Collaborative Topic Modeling for Recommending Scientific Articles. In Proceedings of KDD '11.
