yumeng5 / topclus

[WWW 2022] Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations

License: Apache License 2.0

Languages: Shell 1.36%, Python 98.64%
Topics: language-model, pretrained-language-model, topic-modeling, topic-discovery, clustering

topclus's Introduction

TopClus

The source code used for Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations, published in WWW 2022.

Requirements

At least one GPU is required to run the code.

Before running, you first need to install the required packages by typing the following command (using a virtual environment is recommended):

pip3 install -r requirements.txt

You also need to download the following NLTK resources:

import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

Overview

TopClus is an unsupervised topic discovery method that jointly models words, documents and topics in a latent spherical space derived from pretrained language model representations.

Running Topic Discovery

The entry script is src/trainer.py. The meanings of the command line arguments will be displayed when you run

python src/trainer.py -h

The topic discovery results will be written to results_${dataset}.

We provide two example scripts, nyt.sh and yelp.sh, for running topic discovery on the New York Times and Yelp Review corpora used in the paper, respectively. Before running them, you need to extract the text files from the .tar.gz tarballs under datasets/nyt and datasets/yelp.
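If you prefer to do the extraction from Python rather than with tar on the command line, a minimal sketch (not part of the repo; it assumes the tarballs sit directly under datasets/nyt and datasets/yelp) is:

import glob
import os
import tarfile

# Extract every .tar.gz tarball under datasets/ into the directory containing it
for path in glob.glob("datasets/*/*.tar.gz"):
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall(os.path.dirname(path))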

You can expect to obtain results like the following (the topic IDs are arbitrary):

On New York Times:
Topic 20: months,weeks,days,decades,years,hours,decade,seconds,moments,minutes
Topic 28: weapons,missiles,missile,nuclear,grenades,explosions,explosives,launcher,bombs,bombing
Topic 30: healthcare,medical,medicine,physicians,patients,health,hospitals,bandages,medication,physician
Topic 41: economic,commercially,economy,business,industrial,industry,market,consumer,trade,commerce
Topic 46: senate,senator,congressional,legislators,legislatures,ministry,legislature,minister,ministerial,parliament
Topic 72: government,administration,governments,administrations,mayor,gubernatorial,mayoral,mayors,public,governor
Topic 77: aircraft,airline,airplane,airlines,voyage,airplanes,aviation,planes,spacecraft,flights
Topic 88: baseman,outfielder,baseball,innings,pitchers,softball,inning,basketball,shortstop,pitcher
On Yelp Review:
Topic 1: steamed,roasted,fried,shredded,seasoned,sliced,frozen,baked,canned,glazed
Topic 15: nice,cozy,elegant,polite,charming,relaxing,enjoyable,pleasant,helpful,luxurious
Topic 16: spicy,fresh,creamy,stale,bland,salty,fluffy,greasy,moist,cold
Topic 17: flavor,texture,flavors,taste,quality,smells,tastes,flavour,scent,ingredients
Topic 20: japanese,german,australian,moroccan,russian,greece,italian,greek,asian,
Topic 40: drinks,beers,beer,wine,beverages,alcohol,beverage,vodka,champagne,wines
Topic 55: horrible,terrible,shitty,awful,dreadful,worst,worse,disgusting,filthy,rotten
Topic 75: strawberry,berry,onion,peppers,tomato,onions,potatoes,vegetable,mustard,garlic

Running Document Clustering

The latent document embeddings will be saved to results_${dataset}/latent_doc_emb.pt and can be used as features for clustering algorithms such as K-Means, as sketched below.
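For example, a minimal K-Means sketch (the embedding path matches the cluster_eval example below; n_clusters is an illustrative choice, not fixed by TopClus):

import torch
from sklearn.cluster import KMeans

# Load the saved latent document embeddings, shape (num_docs, latent_dim)
doc_emb = torch.load("results_nyt/latent_doc_emb.pt", map_location="cpu")
features = doc_emb.detach().numpy()

# Cluster the documents; the number of clusters is a user choice
kmeans = KMeans(n_clusters=10, random_state=0).fit(features)
print(kmeans.labels_[:10])  # cluster assignments of the first ten documents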

If you have ground truth document labels, you could obtain the document clustering evaluation results by passing the document label file and the saved latent document embedding file to the cluster_eval function in src/utils.py. For example:

from src.utils import TopClusUtils
utils = TopClusUtils()
utils.cluster_eval(label_path="datasets/nyt/label_topic.txt", emb_path="results_nyt/latent_doc_emb.pt")

Running on New Datasets

To execute the code on a new dataset, you need to

  1. Create a directory named your_dataset under datasets.
  2. Prepare a text corpus texts.txt (one document per line) under your_dataset as the target corpus for topic discovery; a sketch of this format follows the list.
  3. Run src/trainer.py with appropriate command line arguments (the default values are usually good starting points).
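For reference, a minimal sketch of preparing such a corpus (the directory name and the two documents are placeholders):

import os

# Write a tiny placeholder corpus: one document per line
os.makedirs("datasets/your_dataset", exist_ok=True)
docs = [
    "First example document about sports and games.",
    "Second example document about politics and elections.",
]
with open("datasets/your_dataset/texts.txt", "w") as f:
    f.write("\n".join(docs) + "\n")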

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2022topic,
  title={Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations},
  author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Zhang, Yu and Han, Jiawei},
  booktitle={The Web Conference},
  year={2022},
}

topclus's People

Contributors

yumeng5


topclus's Issues

Question on AE pretraining part

optimizer = Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=self.args.lr)

Thank you for sharing the code. As far as I understand, this stage only trains the autoencoder parameters, so it seems that

Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=self.args.lr)

should be changed to

Adam(filter(lambda p: p.requires_grad, model.ae.parameters()), lr=self.args.lr)

If that is not the case, please let me know.

Question for NMI measurement

Hi, thank you for sharing the code:)

Since I am a novice here, this may be a trivial question, but it seems that NMI is tested on the training dataset.

Is this a typical setting for measuring the NMI? (measuring on the training set)
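For context, NMI (normalized mutual information) compares predicted cluster assignments against ground-truth labels and is invariant to relabeling of clusters; a generic computation (illustrative arrays, not the authors' evaluation script) looks like:

from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_clusters = [1, 1, 0, 0, 2, 2]
# Prints 1.0: the clustering matches the labels up to a permutation
print(normalized_mutual_info_score(true_labels, pred_clusters))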

Thank you

A request for evaluation script

Hello,
Could you please share the script used for evaluation? The obtained results may differ depending on the evaluation script used.
Thanks a lot!

Best wishes,
Lei

Request for using TopClus on different pretrained language models

Hi,

I've read your paper and I like this approach. Thank you for sharing the code. I have one question regarding the pretrained language models (PLMs) you use to obtain the contextualized word representations. I saw in the source code that the model is fixed to the classic 'bert-base-uncased':

pretrained_lm = 'bert-base-uncased'

Suppose I'm interested in using this method on a corpus of Italian texts. In that case, would it be possible to change this model and use bert-base-multilingual-uncased instead?

If that's possible, can we make pretrained_lm a parameter of the TopClusTrainer?
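A minimal sketch of the requested change, using the Hugging Face transformers API (only the variable name pretrained_lm comes from the source; the rest is illustrative):

from transformers import AutoModel, AutoTokenizer

# Illustrative only: swap the hard-coded checkpoint for a multilingual one
pretrained_lm = 'bert-base-multilingual-uncased'
tokenizer = AutoTokenizer.from_pretrained(pretrained_lm)
model = AutoModel.from_pretrained(pretrained_lm)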

Thank you.

cluster loss is 0 on my own dataset

Dear authors,
Thanks for your excellent work! I'm running the model on my own dataset and trying to understand the code. However, I always get cluster_loss = 0. I am using a Twitter dataset, which is noisy and sparse, and I have changed the BERT sequence length to 128 accordingly in the preprocessing step. I have tried several different batch sizes and learning rates. (It seems to work fine on the given Yelp examples.) Does anyone have the same problem or perhaps know why? Thanks in advance!
