Coder Social home page Coder Social logo

topic-modelling-nlp's Introduction

Topic-Modelling-NLP

This repo marks my attempt to learn and consequently practice the implementation of Topic modelling using algos like LDA, LSA and NMF.

Abstract 🚀

Well, put in a 🥜: Topic modeling is like a magic tool that helps computers understand what a bunch of texts or articles are all about. Imagine you have a big box of letters, and you want to sort them into different categories based on their content. That's what topic modeling does with the texts. It reads through all the documents, finds the main themes, or "topics," and groups similar ones together. So basically, if I have a large source of news articles, and I aim at grouping them to see for only those that revolved around some political stresses, topic modelling would help me narrow down my choices.

Getting into details 📑: Topic modeling is a statistical and machine learning technique used in natural language processing to discover hidden thematic structures, known as "topics," within a collection of documents. The main goal of topic modeling is to automatically identify the underlying themes or subjects that frequently appear across the documents in the dataset.

The typical input for topic modeling is a set of textual documents, such as articles, reviews, or social media posts. The output is a set of topics, each represented as a probability distribution over words. Each document is assumed to be a mixture of these topics, where the proportions of each topic in the document indicate the relevance of that topic to the document.

image

🤔Now why would I use it?

That's what occured to me first, but actually, as the corpuses get larger and larger (which indeed they are getting, given the amount of data users produce each day), techniques like what we are discussing become essential for more fruitful and efficient analysis.

  • Information Retrieval: It can help organize and index large document collections, making it easier to retrieve relevant documents based on topics of interest.
  • Content Analysis: Researchers and businesses can use topic modeling to gain insights into the main themes present in a corpus of text.
  • Recommendation Systems: Topic modeling can be used to recommend relevant content to users based on their interests and preferences.
  • Text Summarization: It aids in generating summaries of lengthy documents by identifying the most important topics.
  • Market Research: Businesses can use topic modeling to analyze customer feedback, reviews, and social media data to understand customer preferences and opinions.

🧮💻

  • Well, you may find a ton of algos supporting the purpose, but I went ahead with three, namingly LDA(Latent Dirichlet Analysis), LSA(Latent Semantic Analysis) and NMF (Non-Negative Matrix Factorization).
  • Latent Dirichlet Allocation (LDA): LDA is based on the principles of probabilistic modeling and Bayesian statistics. The model employs the Dirichlet distribution, a multivariate probability distribution defined on the simplex (a multi-dimensional geometric figure that lies between the coordinate axes). In LDA, the Dirichlet distribution is used to model the topic distributions within documents and word distributions within topics. The central idea is to estimate the optimal topic-word assignments by maximizing the likelihood of generating the observed documents

image

  • Latent Semantic Analysis (LSA): LSA is primarily grounded in linear algebra and matrix factorization. It leverages the concept of singular value decomposition (SVD) to reduce the dimensionality of the term-document matrix. SVD is a powerful linear algebra technique that decomposes a matrix into three matrices (U, Σ, and V^T), where U and V^T represent orthogonal matrices, and Σ is a diagonal matrix containing singular values. By truncating the singular values and corresponding matrices, LSA effectively identifies the most important latent semantic structures in the data

image

  • Non-negative Matrix Factorization (NMF): NMF is also rooted in linear algebra and matrix factorization. Unlike traditional matrix factorization methods like SVD, NMF restricts all elements of the factorized matrices to be non-negative. This property makes NMF suitable for data that exhibits additive, non-negative relationships, such as term frequencies in text data. The factorization aims to find non-negative basis vectors and non-negative coefficients that, when multiplied together, approximate the original data matrix

image

image

References 🤓

These are some sources that I used for understanding the math, the intuition and implementation for topic modelling. These will help you dive deeper into these algorithms - help you get their math, code and preprocessing strategies like 'tokenization', 'lemmatizing', 'stemming', 'omitting stopwords' etc.

topic-modelling-nlp's People

Contributors

udayg01 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.