stLDA-C

Single topic LDA with clustering. Implementation for stLDA-C, developed to model Twitter posts and cluster users. This repository implements the method described in Author Clustering and Topic Estimation for Short Texts by Tierney, Bail, and Volfovsky: https://arxiv.org/abs/2106.09533.

The primary use case for this model is when you have many short texts (posts) from a large number of users and want to cluster both the texts and the users simultaneously. The model extends the basic LDA framework in Blei, Ng, and Jordan (2003). Each post is a multinomial draw over the vocabulary, governed by a single latent topic whose distribution sets the word frequencies. Users post about topics at different rates, modeled by a latent, Dirichlet-distributed random variable. Each user’s latent topic frequencies are drawn from a cluster-specific Dirichlet.
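
The generative story above can be sketched in a few lines. This is a minimal illustration in plain Python, not the repository's R code: the function names, the uniform cluster assignment, and the fixed number of words per post are all assumptions made for the example.

```python
import random

def dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_posts(cluster_alphas, topic_word, n_users, posts_per_user,
                   words_per_post, seed=0):
    """Simulate posts from the generative story (hypothetical helper).

    cluster_alphas: one Dirichlet parameter vector (length = #topics) per cluster.
    topic_word: one word-probability vector (length = vocabulary size) per topic.
    """
    rng = random.Random(seed)
    n_topics = len(topic_word)
    vocab = range(len(topic_word[0]))
    posts = []
    for user in range(n_users):
        # Assumption: users are assigned to clusters uniformly at random.
        cluster = rng.randrange(len(cluster_alphas))
        # The user's topic rates come from the cluster-specific Dirichlet.
        theta = dirichlet(cluster_alphas[cluster], rng)
        for _ in range(posts_per_user):
            # Exactly one latent topic per post.
            topic = rng.choices(range(n_topics), weights=theta)[0]
            # Every word in the post is drawn from that topic's distribution.
            words = rng.choices(vocab, weights=topic_word[topic], k=words_per_post)
            posts.append({"user": user, "cluster": cluster,
                          "topic": topic, "words": words})
    return posts
```

For example, `simulate_posts([[5.0, 0.5], [0.5, 5.0]], [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]], n_users=4, posts_per_user=3, words_per_post=5)` returns twelve posts, each of which uses only the words of its single topic.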

The model makes two intuitive improvements over traditional topic models applied to short text. The first is the single-topic-per-tweet assumption. Because word co-occurrence is sparse in short documents, traditional topic models struggle to estimate meaningful topic distributions. Modeling strong dependence among words in the same tweet dramatically improves topic estimation, because every word in a tweet is used to infer that tweet's latent topic and its distribution over words. The second is unsupervised clustering of users. Learning the topics for users who tweet infrequently is difficult because of small sample sizes. In traditional hierarchical modeling, noisy user-level estimates are shrunk towards a grand mean. With the cluster estimation in our model, noisily estimated parameters are instead shrunk towards the average of the users they are most similar to. If one only observes a few tweets from a user about sports, for example, estimates of his or her topic distribution should be shrunk towards the typical topic selections of other users who talk about sports, rather than towards the average user who talks about a wide range of topics.
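
Both ideas can be illustrated with small calculations. The sketch below is plain Python with hypothetical function names, not the repository's code. It treats the topic-word distributions and the cluster's Dirichlet parameters as known, which the real sampler does not. The first function shows how every word in a post contributes to the posterior over that post's single topic. The second shows the standard Dirichlet-multinomial posterior mean, which pulls a user's topic rates towards the cluster's prior.

```python
import math

def post_topic_posterior(words, theta, topic_word):
    """P(topic = k | words) proportional to theta[k] * prod_w topic_word[k][w].

    Computed in log space; assumes all probabilities involved are positive.
    """
    log_scores = []
    for k, rate in enumerate(theta):
        log_scores.append(math.log(rate)
                          + sum(math.log(topic_word[k][w]) for w in words))
    peak = max(log_scores)
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

def shrunk_topic_rates(topic_counts, cluster_alpha):
    """Posterior mean of a user's topic rates under a Dirichlet(cluster_alpha) prior.

    With few observed posts the estimate stays close to the cluster's mean;
    with many posts the user's own counts dominate.
    """
    total = sum(topic_counts) + sum(cluster_alpha)
    return [(n + a) / total for n, a in zip(topic_counts, cluster_alpha)]
```

With `theta = [0.5, 0.5]` and `topic_word = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]`, a one-word post `[0]` gives topic 0 a posterior probability of 0.875, while the three-word post `[0, 0, 1]` raises it to 0.98. A user with two posts on topic 0 and a cluster prior of `[8, 1, 1]` gets estimated rates of `[10/12, 1/12, 1/12]`.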

This code provides a collapsed Gibbs sampler to estimate each post’s topic, each user’s cluster, and the cluster-specific Dirichlet parameters. demo_code.R loads the scripts, simulates data, and runs the method on that data.
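
To give a sense of what a collapsed update for one post's topic can look like, here is a minimal Python sketch of the standard Dirichlet-multinomial mixture conditional. It is not taken from the repository's R scripts, which may differ in detail. The function name, the symmetric word prior `beta`, and the choice to hold the cluster's `alpha` fixed are assumptions made for illustration.

```python
import math
from collections import Counter

def topic_conditional(post_words, user_topic_counts, topic_word_counts,
                      topic_totals, alpha, beta):
    """Collapsed conditional P(z = k | all other assignments) for one post.

    All counts must exclude the post being resampled.
    user_topic_counts[k]: the user's other posts currently assigned to topic k.
    topic_word_counts[k][w]: occurrences of word w in posts assigned to topic k.
    topic_totals[k]: total words in posts assigned to topic k.
    alpha[k]: Dirichlet parameter of the user's cluster (assumed fixed here).
    beta: symmetric Dirichlet prior on topic-word distributions (assumption).
    """
    vocab_size = len(topic_word_counts[0])
    word_counts = Counter(post_words)
    log_scores = []
    for k in range(len(alpha)):
        # User-level factor: how often this user already posts about topic k.
        score = math.log(user_topic_counts[k] + alpha[k])
        # Word-level factor: every word in the post contributes.
        for w, c in word_counts.items():
            for j in range(c):
                score += math.log(topic_word_counts[k][w] + beta + j)
        for i in range(len(post_words)):
            score -= math.log(topic_totals[k] + vocab_size * beta + i)
        log_scores.append(score)
    # Normalize in a numerically stable way.
    peak = max(log_scores)
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [x / total for x in weights]
```

A Gibbs sweep would draw the post's new topic from these probabilities, add the post's counts back in, and move on to the next post. The updates for user clusters and for the cluster-specific Dirichlet parameters are not shown.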
