elmo-kmeans's Introduction

ELMo Embeddings for Clustering

Clusters text datasets by topic using ELMo sentence embeddings. Benchmarking NVIDIA RAPIDS cuML/pyGDF for performance is currently in progress.

Requirements:

  • Python 3 (>=3.6, required by AllenNLP)
  • AllenNLP
  • TensorFlow
  • NumPy
  • scikit-learn (for clustering)
  • PyTorch (for AllenNLP)
  • SciPy (for scikit-learn stop words)

Install

To install the necessary dependencies:

apt-get install python3-pip
pip3 install allennlp
pip3 install tensorflow-gpu
pip3 install numpy
pip3 install scikit-learn
pip3 install torch
# pip3 install scipy

Usage

To generate sentence embeddings, format the sentences.txt file with one sentence or transcription per line:

<sentence/transcription>
<sentence/transcription>
# and so on...
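As a rough illustration of the format above, the sketch below reads such a file into a list of strings. The function name `load_sentences` is hypothetical, and skipping blank lines is an assumption; the repository's own loader is not shown here.

```python
from pathlib import Path
from typing import List


def load_sentences(path: str) -> List[str]:
    """Read one sentence per line, dropping surrounding whitespace.

    Assumption: blank lines carry no sentence and are skipped.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Calling `load_sentences("sentences.txt")` would then return the sentences in file order.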

Run python3 main.py with the following options:

  • --mode embed to embed the sentence file
  • --mode sif to enhance sentence embeddings with SIF
  • --mode cluster to cluster embeddings
  • --mode project to reduce dimensionality for visualization
  • --mode metadata to write metadata file
  • --mode tensorboard to create TensorBoard files
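For orientation, a `--mode` switch like the one listed above could be declared with `argparse` roughly as follows. This is only a sketch: `build_parser` and `MODES` are hypothetical names, and whether main.py requires the flag or uses `argparse` at all is an assumption.

```python
import argparse

# Mode names taken from the option list above.
MODES = ["embed", "sif", "cluster", "project", "metadata", "tensorboard"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts exactly one of the documented modes."""
    parser = argparse.ArgumentParser(description="ELMo embeddings for clustering")
    parser.add_argument(
        "--mode",
        choices=MODES,
        required=True,  # assumption: a mode must always be given
        help="pipeline stage to run",
    )
    return parser
```

With this layout, `build_parser().parse_args(["--mode", "embed"]).mode` evaluates to `"embed"`, and an unknown mode is rejected by `argparse` before any work starts.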

Other runtime flags can be adjusted in main.py.
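The sif mode refers to SIF (smooth inverse frequency). One part of that method removes the dominant shared direction from a set of sentence embeddings. The sketch below shows only that common-component removal step with NumPy, assuming the embeddings are already available as a 2-D array with one row per sentence. The function name `remove_first_component` is hypothetical, and this is not necessarily how the repository implements the mode.

```python
import numpy as np


def remove_first_component(embeddings: np.ndarray) -> np.ndarray:
    """Subtract each row's projection onto the first right singular vector.

    Assumption: `embeddings` has shape (n_sentences, dim).
    """
    # vt[0] is the unit-length direction that captures the most energy
    # across all sentence embeddings.
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    first_component = vt[0]
    projections = embeddings @ first_component  # shape: (n_sentences,)
    return embeddings - np.outer(projections, first_component)
```

After this step every row is orthogonal to the removed direction, which tends to reduce the influence of features shared by all sentences.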

Auxiliary script:

  • Run sh clean.sh to convert transcriptions to lower case and remove stop words
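The same kind of cleaning can be sketched in Python. The stop-word set below is a small hypothetical sample rather than the list clean.sh uses, and `clean_sentence` is an illustrative name, not a function from the repository.

```python
from typing import Iterable

# Hypothetical sample; a real run would use a fuller stop-word list.
STOP_WORDS = frozenset({"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"})


def clean_sentence(sentence: str, stop_words: Iterable[str] = STOP_WORDS) -> str:
    """Lowercase a sentence and drop whitespace-separated stop words."""
    stop = set(stop_words)
    tokens = sentence.lower().split()
    return " ".join(token for token in tokens if token not in stop)
```

Because tokens are split on whitespace only, a stop word with attached punctuation (for example "the,") is kept; stripping punctuation first would be a separate design choice.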

Run

To run inside a Docker container:

docker build -t elmo-embeddings .
docker run -it elmo-embeddings /bin/bash
python3.6 main.py

Outputs

An output folder will be created in the current directory containing:

  • embeddings.npy: sentence embeddings stored as a NumPy array in binary .npy format
  • embeddings_sif.npy: a NumPy array of sentence embeddings after SIF
  • embeddings_pc.npy: a NumPy array of sentence embeddings after PCA
  • embeddings_ts.npy: a NumPy array of sentence embeddings after t-SNE
  • km_labels.json: a list of cluster labels generated by KMeans
  • metadata.tsv: metadata of sentence labels for visualization
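A minimal sketch of reading two of these files back for inspection is shown below. The helper name `load_outputs` is hypothetical, the output folder path is passed in because its name is not stated here, and the layout (one embedding row and one label per sentence) is an assumption.

```python
import json
from pathlib import Path

import numpy as np


def load_outputs(output_dir):
    """Load the embeddings array and the KMeans labels from an output folder."""
    out = Path(output_dir)
    embeddings = np.load(out / "embeddings.npy")
    with open(out / "km_labels.json", encoding="utf-8") as f:
        labels = json.load(f)
    return embeddings, labels
```

Comparing `embeddings.shape[0]` with `len(labels)` is a quick way to check that the two files came from the same run.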

Other nested output folders:

  • tensorboard: for TensorBoard output logs
  • kmeans: sentences grouped by KMeans cluster
  • trimmed: a copy of the embeddings and sentences with specified clusters removed
  • hierarchy: sentences grouped by hierarchical KMeans

GPU acceleration:

  • Allow ELMo to use the GPU for embedding (update: speedup of up to 4x)
  • Use NVIDIA RAPIDS cuML to GPU-accelerate clustering (by ~10x)

TO-DO

  • Preprocess transcriptions
  • Embed each sentence with ELMo
  • Enhance embeddings with SIF
  • Cluster using SKLearn KMeans (optional: hierarchically)
  • Find optimal k using elbow method and silhouette scores (optional)
  • Reduce dimensionality for visualization (PCA, t-SNE)
  • Run in TensorBoard
  • Conclusions
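The clustering, k-selection, and projection steps in this list can be sketched with scikit-learn as follows. The function names `choose_k` and `cluster_and_project` are hypothetical, the candidate range and seed are illustrative, and the repository's own settings are not shown here. Inertia values are returned for an elbow plot, while the silhouette score picks the suggested k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def choose_k(embeddings, candidates, seed=0):
    """Fit KMeans for each candidate k; return (best_k, silhouettes, inertias)."""
    silhouettes = {}
    inertias = {}
    for k in candidates:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
        inertias[k] = km.inertia_  # within-cluster sum of squares, for the elbow plot
        silhouettes[k] = silhouette_score(embeddings, km.labels_)
    best_k = max(silhouettes, key=silhouettes.get)
    return best_k, silhouettes, inertias


def cluster_and_project(embeddings, k, seed=0):
    """Cluster with KMeans and reduce the embeddings to 2-D with PCA for plotting."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
    points_2d = PCA(n_components=2).fit_transform(embeddings)
    return labels, points_2d
```

Silhouette scoring requires at least two clusters, so the candidates should start at 2. t-SNE could replace PCA in the projection step when a non-linear layout is preferred.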
