4511-A1

Project code for IEOR4511: Industry Projects in Analytics & Operations

1. BERTopic

BERTopic is a topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

In this project we use BERTopic for topic mining. Basic usage is shown in the BERTopic directory, where we use and extend the original Python package BERTopic: https://maartengr.github.io/BERTopic/index.html.
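
To make the c-TF-IDF weighting mentioned above more concrete, here is a minimal pure-Python sketch. It is not the code in the BERTopic directory and does not use the BERTopic package. It assumes the documents are already tokenized and clustered, and it uses the weighting tf * log(1 + A / f), where tf is a term's frequency within a cluster (normalized by the cluster's word count), A is the average number of words per cluster, and f is the term's frequency across all clusters. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_cluster):
    """Compute class-based TF-IDF weights.

    docs_per_cluster maps a cluster id to a list of tokenized documents.
    Returns a dict: cluster id -> {term: weight}.
    """
    # Treat all documents of a cluster as one bag of words.
    term_counts = {
        cluster: Counter(tok for doc in docs for tok in doc)
        for cluster, docs in docs_per_cluster.items()
    }

    # f: frequency of each term across all clusters.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(term_counts)

    weights = {}
    for cluster, counts in term_counts.items():
        n_words = sum(counts.values())
        weights[cluster] = {
            term: (tf / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return weights


def top_terms(term_weights, k):
    """Return the k highest-weighted terms of one cluster."""
    ranked = sorted(term_weights.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Toy data (hypothetical): two clusters of tokenized documents.
clusters = {
    0: [["stock", "market", "price"], ["market", "rally", "price"]],
    1: [["team", "match", "goal"], ["team", "goal", "price"]],
}
weights = c_tf_idf(clusters)
for cluster_id, term_weights in weights.items():
    print(cluster_id, top_terms(term_weights, 2))
```

Terms that are frequent inside one cluster but rare elsewhere get the highest weights, which is why "price" (present in both toy clusters) ranks lower than cluster-specific words.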

2. LDA

LDA stands for Latent Dirichlet Allocation, a popular topic modeling technique in natural language processing (NLP) and machine learning. It is used to automatically discover abstract topics in a collection of documents. However, it is a traditional model that can be hard to apply in practice because of its poor performance, so we use it as our baseline. We also try a variant called online LDA, which is more efficient at processing time series data.

The relevant code is in the LDA directory.

3. Represent Topic Using LLM

There are a number of representation models implemented in BERTopic that allow for further fine-tuning of the topic representations, from GPT-like models to fast keyword extraction with KeyBERT-like models. Examples are shown in the LLM/BERTopic_with_LLM directory.

We can greatly fine-tune topics to generate labels, summaries, poems of topics, and more. To do so, we first generate a set of keywords and documents that best describe a topic using BERTopic's c-TF-IDF calculation. Then, these candidate keywords and documents are passed to the text generation model, which is asked to generate the output that fits the topic best.

A huge benefit of this is that we can describe a topic with only a few documents, so we do not need to pass all documents to the text generation model. Not only does this speed up the generation of topic labels significantly, it also means we do not need a massive amount of credits when using an external API, such as Cohere or OpenAI.
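
The idea of sending only a few keywords and documents to the model can be sketched as a small prompt builder. This is an illustration, not the code in the notebooks: the template wording, the function name `build_topic_prompt`, and the limits `max_docs` and `max_chars` are hypothetical. The `[KEYWORDS]` and `[DOCUMENTS]` placeholders follow the convention BERTopic uses in its prompts.

```python
# Hypothetical template; the wording is an assumption.
DEFAULT_PROMPT = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]
Based on the information above, give a short label for this topic."""


def build_topic_prompt(keywords, documents, template=DEFAULT_PROMPT,
                       max_docs=3, max_chars=200):
    """Fill the template with a topic's keywords and a few documents.

    Only the first max_docs documents are used, each truncated to
    max_chars characters, so the prompt stays small.
    """
    doc_lines = "\n".join(f"- {doc[:max_chars]}" for doc in documents[:max_docs])
    prompt = template.replace("[DOCUMENTS]", doc_lines)
    return prompt.replace("[KEYWORDS]", ", ".join(keywords))


# Toy topic (hypothetical keywords and representative documents).
keywords = ["market", "price", "stock"]
documents = [
    "Stock prices rallied after the earnings report.",
    "The market closed lower on inflation fears.",
    "Analysts expect price volatility to continue.",
    "Fourth document that is left out of the prompt.",
]
prompt = build_topic_prompt(keywords, documents)
print(prompt)
```

The resulting string is what would be sent to the text generation model, whether it runs locally or behind an API; the prompt length depends on `max_docs` and `max_chars`, not on the size of the corpus.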

In Llama2_hf.ipynb we load a Llama2 model from Hugging Face and run it locally. In GPT_openai.ipynb we call the OpenAI API so that the model runs remotely. Running the model locally can be faster, since it avoids the response delay of calling an external API such as OpenAI's.

4. Visualization with Topic_wizard

The relevant code is in the Topicwizard directory.

Topic_wizard is a powerful tool to visualize topic clustering results for topic models. However, it cannot be applied directly to BERTopic models due to incompatibility. To address the problem, we modified the source code of topic_wizard to make it compatible with our BERTopic model (see _topic_wizard_bertopic.py_).

In addition, we adjusted the input to topic_wizard so that it displays the Llama2 representation as the topic name instead of the topic names originally generated by BERTopic.

5. Evaluation

In the evaluation directory, we try different combinations of parameters and compare metrics such as diversity and coherence scores to choose the best parameters for the BERTopic model.
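
As an example of one such metric, the sketch below computes topic diversity under a common definition: the share of unique words among the top words of all topics. This is an assumption about the metric; the evaluation directory may define it differently. The configuration names and topic words are hypothetical, and coherence would be computed separately.

```python
def topic_diversity(topics, top_k=10):
    """Share of unique words among the top_k words of every topic.

    topics is a list of word lists, one per topic, ordered by importance.
    Returns a value in (0, 1]; 1.0 means no word is shared between
    topics. An empty input returns 0.0.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical results of two parameter settings.
candidates = {
    "min_topic_size=10": [
        ["market", "price", "stock"],
        ["team", "goal", "match"],
    ],
    "min_topic_size=50": [
        ["market", "price", "stock"],
        ["market", "price", "trade"],
    ],
}

scores = {name: topic_diversity(topics) for name, topics in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Diversity alone rewards topics that do not overlap, so in practice it would be weighed against coherence rather than maximized on its own.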

6. Workflow on a Real Dataset

In the new_data directory, we implemented a workflow for data fetching and model updating. Its usage is described in detail in README.md and demo.py.

Contributors

xic18, jiayi-wang0606, chengchengcai006
