
scbert's Introduction

Sentence Clustering with BERT (SCB)

Sentence Clustering with BERT (SCB) is a project that aims to use state-of-the-art BERT models to compute vectors for sentences. A few tools are also implemented to explore those vectors and how sentences relate to each other in the latent space.

Demonstration

  • Load the example data set:
from SCBert.load_data import DataLoader

cls = DataLoader().load_cls_fr()
data = cls.review
  • Create vectors from raw data:
#How to transform raw French texts into vectors using a BERT model.
from SCBert.SCBert import Vectorizer

vectorizer = Vectorizer("flaubert_small")
#The small version of FlauBERT has only 6 layers. Here we take layers 4 and 5 and mean-pool them to create
#a vector for each word, then mean-pool all word vectors to get a single vector for each text.
text_vectors = vectorizer.vectorize(data, layers=[4,5], word_pooling_method="average", sentence_pooling_method="average")
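The two pooling steps described in the comments above can be illustrated with plain NumPy. This is only a minimal sketch of the idea, not the internals of `Vectorizer`: the function name `pool_text_vector` and the array layout `(num_layers, num_tokens, hidden_size)` are assumptions made for the example, and the hidden states are toy numbers rather than real FlauBERT outputs.

```python
import numpy as np

def pool_text_vector(hidden_states, layers):
    """Mean-pool selected layers per token, then mean-pool tokens.

    hidden_states: array of shape (num_layers, num_tokens, hidden_size)
                   (hypothetical layout chosen for this sketch).
    layers:        list of layer indices to average, e.g. [4, 5].
    Returns one vector of shape (hidden_size,) for the whole text.
    """
    selected = hidden_states[layers]         # (len(layers), num_tokens, hidden_size)
    word_vectors = selected.mean(axis=0)     # average the layers -> one vector per token
    return word_vectors.mean(axis=0)         # average the tokens -> one vector per text

# Toy input: 6 layers, 3 tokens, hidden size 2.
toy_states = np.arange(6 * 3 * 2, dtype=float).reshape(6, 3, 2)
text_vector = pool_text_vector(toy_states, layers=[4, 5])
print(text_vector)
```

Averaging several upper layers instead of using only the last one is a common choice for sentence vectors; which layers work best is an empirical question for your data.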
  • Explore the embedded space:
#How to explore the relations in your data.
from SCBert.SCBert import EmbeddingExplorer

ee = EmbeddingExplorer(data,text_vectors)
labels = ee.cluster(k=3, cluster_algo="quick_k-means")     #Cluster with k-means
ee.extract_keywords(num_top_words=15)                      #Extract 15 keywords with the RAKE algorithm; they are then accessible via ee.keywords
ee.compute_coherence(vectorizer)                           #Compute coherence for the keywords in each cluster
ee.explore_cls("PCA", color_label=cls.code)                #Explore the distribution of clusters in the FULL CLS dataset (argument order assumed: positional before keyword)
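To make the clustering step more concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) on text vectors, written with NumPy only. It is not the `quick_k-means` implementation used by `EmbeddingExplorer`, whose details are not described here; the function `kmeans` and its parameters are hypothetical names for illustration.

```python
import numpy as np

def kmeans(vectors, k, num_iters=50, seed=0):
    """Plain k-means: returns (labels, centroids) for an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(num_iters):
        # Distance from every vector to every centroid: shape (n, k).
        distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:             # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                            # assignments are stable
        centroids = new_centroids
    return labels, centroids

# Toy "text vectors": two obvious groups in 2-D.
toy_vectors = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_labels, toy_centroids = kmeans(toy_vectors, k=2)
print(toy_labels)
```

With real text vectors the result depends on the random initialisation, so it can be worth running the clustering with several seeds and comparing the keywords extracted for each cluster.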

Built-in example

There is a built-in example that you can find here. It comes with its own data, the CLS-fr dataset, which is composed of Amazon reviews from different product categories (DVD, CD, Livres).

Installation

You can either download the zip file or install the PyPI package with the following command:

> pip install SCBert

If you encounter problems during the installation, it may be because of the multi-rake dependency on cld2-cffi. I will try to address this later on. To work around it, run the following commands:

> export CFLAGS="-Wno-narrowing"
> pip install cld2-cffi
> pip install multi-rake
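After running the commands above, a quick way to see whether anything is still missing is to ask Python which modules it can locate. The sketch below uses only the standard library; the helper `missing_modules` is a hypothetical name, and the import names `multi_rake` and `cld2` are assumptions about how the multi-rake and cld2-cffi packages expose themselves, so adjust them if your environment differs.

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# "SCBert" is the import name used in the demonstration above;
# "multi_rake" and "cld2" are assumed import names (see the note above).
print(missing_modules(["SCBert", "multi_rake", "cld2"]))
```

An empty list means all three modules were found; any name that is printed still needs to be installed.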

scbert's People

Contributors

kevinferin


scbert's Issues

Unable to install scBERT due to restriction in torch version

Hello

I am trying to install scBERT using pip, but I am having trouble because of the strict version restriction torch==1.3.1.
I currently have torch 1.12.0; if I try to force the 1.3.1 version, I am not able to install it.

Please help!!!

I have attached the error message
scbert

Thanks
Akila
